Delta-1 variant of SARS-COV-2 acquires spike V1264L and drives the pandemic in Indonesia, Singapore and Malaysia

Since April 2021, d variant of SARS-COV-2 has gradually overtaken all other variants and become a dominant pandemic driver around the world. It has evolved and yielded four subvariants: d 1, d 2, d 3 and d 4. While trying to understand how these subvariants drive the pandemic in Southeast Asia, I noticed that many d 1 genomes from Indonesia encode an extra spike substitution, V1264L. Coincidentally, this confers an acidic dileucine motif because residues 1157-1262 are acidic and residue 1265 is leucine. Such a motif may affect subcellular trafficking of the resulting spike protein. Alarmingly, this V1264L-encoding d 1 subvariant (referred to as d 1L) has become the dominant pandemic driver in Indonesia, Singapore, Malaysia and East Timor. Moreover, it has acquired additional spike substitutions: L1234L in Singapore and D215Y/N in Malaysia. On the average, the resulting sublineages carry 46-48 mutations per genome, making them some of the most mutated variants identified so far. Moreover, a d 1 sublineage from the United Kingdom has acquired V1264L along with spike Y145H and A222V, a signature substitution of a SARS-COV-2 clade that was a major pandemic driver in Europe during the summer of 2020. A222V improves an extensive hydrophobic interaction network at the N-terminal domain of spike protein and may make this sublineage more virulent than d 1 and d 1L. Some d 2 subvariant genomes identified in the United States of America and other countries also encode V1264L. Thus, V1264L is a recurrent spike substitution frequently acquired by d subvariants during convergent evolution. This recurrence also suggests that V1264L is one key mechanism by which d variant adopts to expand its ‘evolutionary cage.’ adapted from PyMol presentation of the spike protein structure 6XR8 from PDB ( E ) Cartoon summarizing evolutionary relationship between d variant and its subvariants herein. Two other studies of d variant genomes subvariants


INTRODUCTION
Coronavirus disease 2019  has caused the pandemic for almost two years, which has not only led to tragic loss of life but also crippled the economy around the world. Although successful development of multiple vaccines has brought great hope, we still do not know very little about where this pandemic is heading and when it will end. Thus, it is important to fully understand severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the culprit virus that causes this contagious disease [1]. Genomic surveillance has revealed that its evolution is key to driving waves of cases around the world. For genomic surveillance, the global initiative on sharing avian influenza data (GISAID) has played a critical role and as of October 10, 2021, over 4.25 million SARS-COV-2 genomes have been deposited into the database [2]. From the genomes, it is clear that the virus has yielded many variants, including four variants of concern as designated by the WHO: a (B.1.1.7) [3], b (B.1.351) [4], g (P.1) [5] and d (B.1.617.2) [6]. Astonishingly, these variants all emerged from complete obscurity and then rapidly rose to the status of major pandemic drivers. This raises two important questions: 1) how they have been evolving and 2) what future potential variants of concern will look like.
Related to these two questions, I wondered whether there are any simple rules that govern evolution of this virus. If so, such rules will help us predict its evolutionary trajectory and identify potential future variants of concern. Obviously, as the genetic material, the ~29,700-nucleotide viral genome is the key determinant. In theory, every nucleotide of this genome could be altered during SARS-COV-2 evolution. In reality, many such changes, however, are non-functional or even detrimental and thus offer no evolutionary advantage. As a result, just like what occurs with cancer mutations, only favorable or gain-of-function ones are selected during SARS-COV-2 evolution. In cancer biology, mutations are grouped into two categories, known as 'driver' and 'facilitator' mutations. A third category is 'passenger' mutations because they are apparently selected due to association with the first two categories.
One thumb of rule is that 'driver' and 'facilitator' mutations are recurrent whereas 'passenger' mutations are often not. I reasoned that such concepts can be used to understand and annotate SARS-COV-2 mutations. I also envisaged that there are genomic elements limiting SARS-COV-2 evolution and such elements form an 'evolutionary cage' demarcating the space within or around which SARS-COV-2 evolves and generates new variants. I believe that the recurrent nature of 'driver' and 'facilitator' mutations will help identify the genomic elements and map out the shape of this invisible cage.
With these considerations as the simple guide, I have tracked genomes in the GISAID database closely. As a result, I have recently found that d variant has evolved actively and yielded four subvariants (d1, d2, d3 and d4) in India [7]. Among them, d1 has emerged as the major pandemic driver and d2 plays a much less important role, whereas d3 and d4 have gradually faded away [7]. This emerging theme about d variant is also true in Europe [7] and the USA [8], with d1 becoming almost the sole pandemic driver in the United Kingdom and Spain. Thus, a relevant question is how the four subvariants have contributed to the pandemic in other continents and countries around the world.
Related to this question, I tried to understand how the subvariants have driven the pandemic in Southeast Asia. I noticed that many d1 genomes from Indonesia encode an extra spike substitution, V1264L, along with an extra NSP3 substitution (T678I). Alarmingly, this d1 sublineage, d1L, has become the predominant pandemic driver in Indonesia, Singapore, Malaysia and East Timor. Mechanistically, V1264L confers an acidic dileucine motif. This is significant as such motifs are known to regulate endocytosis and subcellular trafficking of membrane-associated receptors and viral envelope proteins [9][10][11][12]. Here, I describe how d1L has evolved and driven the pandemic in Southeast Asia, and demonstrate that V1264L has also been acquired by d1 and d2 subvariants identified in the United Kingdom and USA, respectovely.

RESULTS AND DISCUSSION
d1 and d2 subvariants drive the pandemic in Indonesia The epidemiological curve from Indonesia indicates that the country managed relatively well in 2020 (Fig. 1A). The first wave started at the end of 2020. The variant B.1.466.2 accounts for a third of COVID-19 cases from the first wave (Fig. 1A). This variant carries one small deletion (H69V70del) and four spike substitutions, N439K, G614G, P681R and P809S. After April 2021, d variant has gradually become the predominant driver ( Fig. 1A-B). In support for high virulence of d variant, the second wave is much more powerful than the first one. Vaccination was initiated at the beginning of 2021 and reached a full vaccination rate of ~17% in September 2021 (Fig. 1A). Temporal distribution of d variant and its subvariants, d1 and d2, indicated that d1 is the dominant driver whereas d2 has played a less important role (Fig. 1B). This is similar to what has occurred in India, Europe and the USA [7,8]. Then I utilized Coronapp [13,14] for mutation profiling d genomes identified in Indonesia after July 16, 2021. The results showed that a majority of the genomes encode two extra substitutions, spike V1264L (Fig. 1C) and NSP3 T678I (Fig. 1D). Importantly, these two substitutions are associated with each other and belong to a d1 sublineage, which is referred to d1L (with the letter L denoting L1264). Thus, d1 has evolved and yielded a new sublineage encoding two extra substitutions. Consistent with this, the average mutation load in the d-genomes analyzed is 42 per genome ( Fig. 2A), which is 1-2 mutations more than d1 genomes identified in India after August 01, 2021 [7].
As for the origin of d1L, the first such genome from Indonesia was identified on January 19, 2021 (EPI_ISL_3315002, GISAID). It is quite puzzling is that the major d1L wave only took off in the country 4-5 months later (Fig. 1A). In India, the first d1L genome was only reported on April 06, 2021, so it is unlikely that d1 gave rise to d1L in India. Multiple genomes with the collection dates of January 08 and 09, 2021 were deposited into the GISAID database from East Timor and Slovakia in October and September 2021, respectively. Out of 2,257 d genomes from Slovakia, there are only 12 d1L genomes (GISAID, accessed on October 06, 2021). One complication is that some early genomes in the database appear to have incorrect sample collection date information, perhaps due to typos and book-keeping errors. The geological proximity of East Timor to Indonesia suggests that d1 may have yielded d1L in one of these two countries. But this possibility will need to be verified.
d1L is a major pandemic driver and acquires spike L1234I in Singapore During the pandemic, Singapore has taken strict public health measures. Such measures have been highly effective in blocking a [3], b [4], g [5] and many other variants from causing outbreaks in the country from September 2020 through June 2021 (Fig. 1B). In the summer of 2021, the four subvariants (d1, d2, d3 and d4) led to some small outbreaks in the country, but it managed to control them successfully. By contrast, these subvariants have led to major waves in the rest of the world. The situation took a sharp turn in Singapore at the beginning of August 2021, followed by a surging wave that is still ongoing (Fig. 1B). This epidemiological prognosis raises an important question about the SARS-COV-2 variants that have been driving this surging wave. Troublingly, this surging wave occurred when the country was reaching a high level of vaccination (Fig. 2B). Since April 2021, the vaccination rate has increased rapidly to an impressive level of ~80% (Fig. 1B). Other public control measures have been similar during the pandemic in the country. Thus, this raises an intriguing possibility that the variant(s) behind this surging wave may be much more virulent than many other variants that had spread to the country before the summer of 2020. If so, it would be important to identify the variant(s).
As described above (Fig. 1), during the summer of 2021, d1L was a key driver of a surging wave in Indonesia, a neighboring country of Singapore. The geological proximity suggests that d1L may also be one major driver of the surging wave in Singapore. Indeed, as shown in Fig. 2B, d1L was a major variant in July 2021 and became almost the sole pandemic driver one month later. In support of this, d1L genomes were identified in almost all 828 COVID-19 cases analyzed (Fig. 2C). Unexpectedly, almost 40% genomes from September belong to a sublineage (referred to as d1L1) that encodes an extra spike substitution (L1234I) and a silent mutation (P217P, Fig. 2C). Consistent with this, the average mutation load of the genomes from Singapore is 46 per genome, which is 4 mutations more than what was identified in the genomes from Indonesia ( Fig. 2A). Alarmingly, since July 2021, d1L1 genomes have rapidly risen to the current level of almost 40% (Fig. 2C). Thus, spike L1234I may confer evolutionary advantage to d1L1. Related to this, L1234 is in at the border of the transmembrane domain and cytoplasmic tail. Coincidentally, this is close to V1264, so L1234I may synergize with V1264L. Although L1234I is a rather conserved substitution at the physicochemical level, it may finetune spike protein function and regulation.
In Singapore, the first d1L1 case was identified on August 22, 2021. Over one month earlier, a related case (EPI_ISL_4254882, GISAID) was identified in Indonesia on July 03, 2021, suggesting that d1L1 might have arisen in Indonesia and then spread to Singapore. Now there are 884 genomes in the GISAID database, with 881 from Singapore and the remaining three from Indonesia, Australia and Hong Kong. Thus, d1L1 remains largely localized to Singapore. This offers a narrow opportunity to control the spread of this new variant.

d1L drives the pandemic and acquires spike D215Y/N in Malaysia
Having established that d1L is a major pandemic driver in Indonesia and Singapore, I next investigated whether this variant has spread to neighboring countries and driven the pandemic there. To address this, I analyzed the situation in Malaysia. As shown in Fig. 3A, the country has experienced three waves of COVID-19 cases, with the latest one starting in July 2021 and being the most powerful. Epidemiological curve and immunization progress in Malaysia.
Vaccination progress has been reached an impressive level compared to many other countries in Asia (Fig. 3A). Despite this vaccination progress, the third wave is much more powerful than the first two. As occurred in Singapore, this epidemiological prognosis suggests that the variants behind the third wave may be highly virulent. The geological proximity to Indonesia and Singapore suggests that d1L may be one key driver of the third wave in Malaysia. Indeed, d1L has been a major driver since July 2021 (Fig. 3B). It was detected in ~50% cases in August 2021 (Fig. 3B). Thus, d1L has contributed significantly to the latest surging wave of COVID-19 cases in Malaysia.
To understand how d1L has evolved in the country, I performed mutation profiling via Coronapp [13,14]. As shown in Fig. 3C-D, multiple new spike and NSP3 mutations have been detected in 5-20% of the 446 d1L-genomes analyzed, suggesting that d1L have given rise to different sublineages in Malaysia. Consistent with this, the mutation load is 45-46 per genome ( Fig. 4A), which is similar to what observed with the genomes from Singapore ( Fig. 2D) but 3-4 mutations more than that observed with the genomes from Indonesia ( Fig. 2A). This suggests that there are extra mutations in the d1L genomes from Malaysia. Mutation profiling uncovered two notable substitutions are D215Y and D215N, which alters the same residue, D215. It should be noted, however, that due to its low abundance (~10% each, Fig. 2C), these two substitutions do not account fully for the higher mutation load (Fig. 4A) than those genomes identified in Indonesia ( Fig. 2A). These two substitutions could be used to identify two subgroups of genomes. Analysis of these two subgroups revealed average mutations loads of 48 and 46-47 per genomes for the D215Y-and D215N-encoding subgroups, respectively ( Fig. 4B-C). Such mutation loads make both sublineages two of the most mutated SARS-COV-2 variants identified so far. Related to this, the most mutated variant is 49 mutations per genome [7]. While D215Y is associated with spike A845S in some genomes, D215N-encoding genomes carry a small deletion spanning the codons of H69 and V70 ( Fig. 4D-E). Therefore, d1L has yielded two highly mutated sublineages in Malaysia.

V1264L-encoding d genomes identified around the world form different groups
Having established that d1L has been a major pandemic driver in Indonesia, Singapore and Malaysia, I asked how it has spread to other countries. Related to this, d1L appears to be a major driver in East Timor: out of 33 d-genomes identified there, 17 encode d1L (GISAID, accessed on October 05, 2021). Thus, d1L is a key pandemic driver in East Timor.
For the rest of the world, I searched d1L genomes the GISAID database (accessed on October 07, 2021). For this, T678I of NSP3 was also used to identify d1L variant originated from Southeast Asia. This is because in the United Kingdom, there is a d1L variant that encodes A1711V instead of T678I of NSP3 (see below). This research identified 1,554 d1L-T678I genomes from Europe. There are 129 such genomes identified in North America, with 19 in Canada (16 of them from British Columbia). In Asia, 200, 118 and 2 such genomes have been identified in Japan, South Korea and China, respectively. None have been identified in South America and only one has been found in Africa. Therefore, the spread of d1L-T678I around the world is still at the very early stage. But this may change soon, so it will be necessary to watch out the spread of this variant due to its higher virulence than d1 itself.
Then I carried put phylogenetic analysis of d1L genomes identified in the world by May 31, 2021. As shown in Fig. 5, d1L genomes from Indonesia form one large group. Within this group are only one genome identified in Singapore and several from the U.K. Strikingly, the genomes from the U.K. encode A1711V instead of T678I of NSP3, suggesting that they are from a distinct root. A majority of d1 genomes identified in country encode A1711V [7], so d1 subvariant has acquired V1264L in the U.K., independently from d1L identified in Indonesia, Singapore and Malaysia. Thus, there are two distinct groups of d1L genomes, encoding T678I or A1711V of NSP3. Intriguingly, phylogenetic analysis revealed that both d2 and d3 subvariants have also acquired spike V1264L, yielding new d2L and d3L sublineages, respectively (Fig. 5). Thus, V1264L is a recurrent spike substitution that has been acquired by different d subvariants as they evolve to yield new sublineages.
Then I utilized Coronapp [13,14] to analyze the mutation profile of d1L genomes identified in the U.K. Shown in Fig. 6A are spike substitutions encoded by 963 V1264L-encoding d genomes identified in the country during the first weeks of September 2021. It is surprising to notice that in addition to typical d1L substitutions, about 80% of the genomes carry two extra spike substitutions, Y145H and A222V (Fig. 6A). In the country, there is a d1 sublineage, d1V, that encodes these two substitutions, but not V1264L [15]. Thus, it is likely that d1V has acquired V1264L and given rise to a new sublineage, which has been referred to as d1V1 [15].
Moreover, almost all genomes carry A1711V of NSP3 (Fig. 6B), with an average mutation load of 43-44 per genome (Fig. 6C). They do not encode T678I (Fig. 6C) as those d1L genomes identified in Southeast Asia (Fig. 1D). A1711V is a signature substitution of d1 subvariant identified in the U.K. [15], so it is conceivable that d1-NSP3_A1711V has acquired V1264L in the U.K. in a manner completely independent of d1L identified in Southeast Asia. Fig. 6D are the monthly genomes corresponding to d1V1 and the total V1264L-encoding d genomes (i.e. the sum of d1L and d1V1). Since July 2021, the d1V1 genome number has increased exponentially, suggesting that Y145H, A222V and V1264L

Shown in
confer evolutionary advantage to d1V1. A222V is a signature substitution of the SARS-COV-2 GV clade that was a major pandemic driver in Europe during 2020. A222V improves an extensive hydrophobic interaction network [15]. Thus, d1V1 may be even more virulent than d1L.

Analysis of V1264L-encoding d2-genomes
Like d1V1, d2 genomes encode spike A222V [15]. Phylogenetic analysis identified two large groups of d2 genomes encoding V1264L (Fig. 5). One such group contains genomes mainly from the USA, and the other possesses genomes from Europe and Africa. By analogy to d1L, d2 V1264L-encoding d2 subvariant is referred to as d2L. Mutation profiling of 1,047 d2Lgenomes identified in the USA during September 2021 revealed an extra spike substitution N1074S (Fig. 7A). As shown in Fig. 7B, N1074S-encoding d2L genomes carry an average of 40-41 mutations per genome. This is much lower than what is observed with different d1L sublineages (e.g. Figs 2B & 4B-C). As shown in Fig. 7C, N1074S does confer viral fitness to d2L. As described previously, d2 is typically less virulent than d1 [15]. Indeed, the growth of d2L or its N1074S-encoding sublineage is less dramatic than d1L and its sublineages. Thus, d2L or its N1074S-encoding sublineage may be less dangerous than d1L and its sublineages described above.

Mechanistic impact of V1264L and other spike substitutions
As shown in Fig. 8A-B, V1264 is located at the C-terminal cytoplasmic tail, which is important for cytoplasmic trafficking [16,17]. A recent study showed that the cytoplasmic tail of spike protein facilitates its expression on cell surface and promotes syncytia formation between infected cells [18]. Coincidentally, V1264L confers an acidic dileucine motif to spike protein because residues 1157-1262 are acidic and residue 1265 is leucine. Such a motif may affect endocytosis of many membrane-associated receptors of the resulting protein [9,10,12].
Notably, L1234I encoded in d1L is located at the border between the transmembrane domain and the cytoplasmic tail (Fig. 8A). L1234I may synergize with V1264L in finetuning the function and regulation of spike protein. H69 and Y145 are within a super antigenic site in the N-terminal domain (Fig. 8A), so their alterations (H69del and Y145H, respectively; Fig. 4) may confer immune evasions. A222 is part of an extensive hydrophobic network (Fig. 8D) [7], so A222V should improve this network. Spike A845 is in proximity to Q836 (Fig. 8E). Both residues are part of a region C-terminal from the fusion peptide (Fig. 8A). This region is important for spike function in promoting cell-cell fusion [19].
The cartoon in Fig. 8F summarizes evolutionary relationship between d variant and its subvariants described herein. Two other studies of d variant genomes from India, Europe and the U.S.A. supports a model in which this variant yields mainly four subvariants [7,8].
Alarmingly, d1 sublineages such as d1L, d1L1, d1V and d1V1 appear to be even more virulent than d1 itself. Moreover, d1L1 and d1V1 may be more virulent than d1L. This is alarming. Therefore, we will need to track evolution of these sublineages very closely and map their evolutionary trajectory in a precise and timely fashion, which will help us adopt the best public health measures and develop the new generation of vaccines.
Identification of d1L and its sublineages herein also provides support for a continuously branching model on SARS-COV-2 evolution [7,8]. Moreover, according to the 'evolutionary cage' hypothesis, there are genomic elements limiting the evolution and such elements form an 'evolutionary cage' that demarcate the space within which SARS-COV-2 evolves and generates new variants. Thus, the recurrent nature of spike V1264L suggests that V1264 is one such element and not optimal for SARS-COV-2. If so, V1264L is a driver or facilitator mutation that optimizes the function of the cytoplasmic tail of spike protein. Of relevance, spike P681R, P681H and V1176F are also such types of mutations [20,21]. Thus, acquisition of V1264L is an important mechanism by which d1 subvariant expands its evolutionary cage. Because V1264L is expected to finetune the cytoplasmic tail of spike protein and modulate its role in endocytosis, subcellular trafficking and cell-cell fusion, these results highlight the need to characterize other substitutions that also alter the cytoplasmic tail of spike protein.

SARS-COV-2 genome sequences, mutational profiling and phylogenetic analysis
The genomes were downloaded the GISAID database on the dates specified in the figure legends. The CoVsurver (https://www.gisaid.org/epiflu-applications/covsurver-mutationsapp/) was used to analyze mutations on representative SARS-COV-2 genomes. Fasta files containing specific groups of genomes were downloaded from the GISAID database. During downloading, each empty space in the Fasta file headers was replaced by an underscore because such a space makes the files incompatible for subsequent mutational profiling, sequence alignment and phylogenetic analysis, as described with details in another study [7]. The Fasta headers were shortened and modified further [7]. The cleaned Fasta file was used for mutational profiling via Coronapp (http://giorgilab.unibo.it/coronannotator/), a web-based mutation annotation application [13,14]. The cleaned Fasta file was also uploaded onto SnapGene (version 5.3.2) for multisequence alignment via the MAFFT tool. RAxML-NG version 0.9.0 [22] was used for phylogenetic as described [7].

Defining different variant genomes using various mutation markers
Genomes of a, b, g, d and other SARS-COV-2 variants were downloaded from the GISAID database as defined by the server. d subvariant genomes were defined as described [7]. Briefly, the nucleocapsid substitutions G215C and R385K were used as mutation markers for d1 or d3 genomes, respectively. Spike substitutions A222V and K77T were used as markers for d2 or d4 genomes, respectively. In Europe, there are many d1V genomes that also encode spike A222V, so the NSP3 substitution P822L was used together with spike A222V to identify d2 genomes. As discussed previously [7], there are several limitations with these markers. But they should not affect the overall conclusions.

PyMol structural modeling
The PyMol molecular graphics system (version 2.4.2, https://pymol.org/2/) from Schrödinger, Inc. was used for downloading structure files from the PDB database for further analysis and image generation. Structural images were cropped via Adobe Photoshop for further presentation through Illustrator.

Pandemic and vaccination data
Pandemic and vaccination data were downloaded from the Our World in Data website as described [7]. September 27, 2021 for mutation profiling via Coronapp [13,14]. Shown here are spike (C) and NSP3 substitutions (D).   [13,14]. Shown here are spike (C) and nucleocapsid substitutions (D). Note that a small deletion (H69V70del) is only detected as H69del by Coronapp.   profiling via Coronapp [13,14]. Shown are spike (A) and NSP3 (B) substitutions. d1L from Indonesia encodes T678I of NSP3 (Fig. 1D), but d1L originated from the U.K. encodes A1711V of NSP3 (panel B). d1V1 carries spike Y145H, A222V and V1264L [7]. Its parental strain is d1V, encoding spike Y145H and A222V [7], indicating that spike V1264L has been independently acquired by different variants.  Coronapp [13,14]. Shown in (A) and (B) are the mutation loads and spike substitutions, respectively. (C) Monthly distribution of d2L subvariant and its N1074S-encoding sublineage. For the analysis, the GISAID website was accessed on September 28, 2021.  (Fig. 8A). A222 is part of a hydrophobic core [7]. (E) Structural details of spike A845 and a neighboring residue, Q836. The structural models (C-E) were adapted from PyMol presentation of the spike protein structure 6XR8 from the PDB database. (E) Cartoon summarizing evolutionary relationship between d variant and its subvariants described herein. Two other studies of d variant genomes from India, Europe and the USA supports a similar model in which this variant branches out and yields mainly four subvariants [7,8].