Molecular Evolution of SARS-CoV-2 Structural Genes: Evidence of Positive Selection in the Spike Glycoprotein

Background: SARS-CoV-2 has caused a global pandemic since early 2020 and remains a serious public health issue worldwide. Four structural genes, envelope (E), membrane (M), nucleocapsid (N) and spike (S), play a key role in controlling entry into human cells and virion assembly of SARS-CoV-2. The evolution of these genes may determine infectivity of SARS-CoV-2, but thus far, little is known about them. Methods: We analyzed 3090 SARS-CoV-2 isolates from the GenBank database to determine the evolutionary patterns of the four structural genes by employing various molecular evolution algorithms. Results: Phylogenetic analyses showed that global SARS-CoV-2 isolates can be clustered into three to four major clades based upon protein sequence. Although intragenic recombination was not detected among different alleles, purifying selection has affected the evolution of these genes. By analyzing full genomic sequences of these alleles, our result revealed that codon 614 of the S glycoprotein has been subjected to a strong positive selection pressure, and a consistent D614G mutation was identied. Additionally, another potentially positive selection site at codon 5 in the signal sequence of the S protein was also identied with a consistent L5F mutation. The allele containing the D614G mutation has undergone signicant expansion during SARS-CoV-2 transmission, implying a better adaptability of isolates with the mutation. Nevertheless, L5F allele expansion was found to be relatively restricted. The D614G mutation is located at subdomain 2 (SD2) of the C-terminal portion (CTP) of the S1 subunit. Protein structural modeling showed that the D614G mutation may cause the disruption of a salt bridge between S protein monomers and increase their exibility, consequently promoting receptor binding domain (RBD) opening, virus attachment, and ultimately entry into host cells. Located at the signal sequence of S protein, the L5F mutation may facilitate protein folding, assembly, and secretion of the virus. Conclusions: This is the rst reported evidence of positive Darwinian selection in the spike gene of SARS-CoV-2. This nding Our interhuman

between SARS-CoV-2 and bat SARSr-CoV, particularly with E and M proteins. Alternatively, it indicates an extreme conservation of E and M proteins and their functions among coronaviruses [14].
Further analysis revealed that there are 14 single nucleotide polymorphisms (SNPs) of E gene, but only 5 single amino acid polymorphic (SAP) loci on the E protein. A similar result was observed for the M gene and protein, with 37 SNPs and 9 SAPs. In contrast, 126 SNPs and 75 SAPs were detected on N gene and protein, respectively. S protein, the most important component that mediates virus entry by receptor binding and membrane fusion and determines the infectivity of SARS-CoV-2 [15], harbors 155 SNPs and 90 SAPs. Considering the size of nucleotides and amino acid residues, N gene has the maximum sequence variability with 10.02% (126/1257) SNPs and 17.90% (75/419) SAPs, respectively. However, S gene has the most pairwise nucleotide differences among the four structural genes, indicating a relatively greater genetic diversity (Table 1). Distinct phylogenetic patterns of the four structural genes Phylogenetic analysis revealed that all SARS-CoV-2 E proteins form three clusters. Similar to E protein, a phylogenetic tree of SARS-CoV-2 M proteins is formed from three clusters with few branches ( Fig. 1a and  b). Results suggest that both E and M genes may display a relatively high conservation during coronavirus evolution. In contrast, SARS-CoV-2 N and S proteins show distinct phylogenetic patterns compared with E and M proteins. Four and three main phylogenetic clusters with various branches are identi ed in the N and S proteins, respectively ( Fig. 1c and d). Given the crucial roles of N and S proteins in virus transcription, assembly, and entry to host cells, the in uence on infectivity based on whether SARS-CoV-2 isolates harbor different N and S variants (such as those clustered into different clades) remains unknown.
Purifying selection drives evolution at a whole structural gene level of SARS-CoV-2 during human to human transmission Although numerous studies have demonstrated that recombination plays an important role on the emergence of SARS-CoV-2 [16-18], how this virus evolves during its global transmission has not yet been pro led. Therefore, we rst analyzed intragenic recombination events of each structural gene using RDP4.
Results indicate that no recombination events occurred among the alleles of each gene (data not shown).
Potential recombination events were also assessed by reticulate network tree-phi test, using SplitsTree4 software. Although some internal nodes are present in N and S alleles, no clear evidence for recombination can be validated for each gene via phi test (P > 0.05) (Fig. 2). This result indicates a relatively stable state of SARS-CoV-2 during transmission, though a possible genetic interaction of different isolates may have occurred when it became a global pandemic [19,20]. Tajima's D, and Fu and Li's D* and F* statistics were calculated to examine the mutation neutrality hypothesis of the four structural genes of SARS-CoV-2. Results reveal that the evolution of all four genes does not support population neutrality, but rather favors purifying selection (Table 2; Additional le 1: Figure S1). The average of all pairwise dN/dS ratios (ω) among alleles of each structural gene of SARS-CoV-2 is 0.5443 in E, 0.1562 in M, 0.07978 in N, and 0.4980 in S, respectively. In aggregate, these results suggest that at the whole gene level, inconsistent purifying selection is the main evolutionary force at work given the experimental conditions (Table 2; Additional le 1: Figure S1). For Tajima's D, Fu and Li's D* and F* tests, the cutoff was set at 0.05.
The SARS-CoV2 S gene is operated on by positive selection at a de nitive codon located at the C-terminal portion of the S1 subunit, and another potential codon located at the signal sequence Guo et al. (2020) reported that the S gene of SARSr-CoV populations in their natural host, Chinese horseshoe bats, has evolved through positive selection at some codons [13]. As mentioned above, at the whole gene level, purifying selection is the main force driving the evolution of studied genes. Whether positive selection pressure accelerates the diversi cation of the structural genes of SARS-CoV-2 remains unclear. We therefore used codon-substitution models to estimate the ratio of nonsynonymous over synonymous substitutions (dN/dS), also known as "ω". These codon substitution models include M0 (one-ratio, one ω for all sites ), M1a (nearly neutral, two classes of sites, ω0 < 1 and ω1 = 1), M2a (positive selection, allows three site classes including ω0 < 1, ω1 =1, and ω2 > 1), M3 (discrete, allows unconstrained discrete distribution of ω among sites), M7 (β, t to a β distribution for ω among sites) and M8 (β and ω > 1, t to a beta distribution with an extra rate that allows ω > 1) and can be typed into a null model group (M0, M1a, M7), and a positive selection model group (M3, M2a, and M8) [21]. The role of recombination in the polymorphisms of the four analyzed genes is excluded because no intragenic recombination was detected (Fig. 2). By using a Maximum Likelihood (ML) method, we did not nd any codon of either the E or M gene subject to positive selection (data not shown). Although a potential positive selection site in the N gene, 208A, is identi ed using an M3 model, but the LRT P-value is more than 0.05 (M0 Vs. M3) and the result is not validated by other models including the M8 model, suggesting that evidence for N gene positive selection is limited (Additional le 2: Table S1). For the S gene, we found the average ω to be 0.37199, calculated using the one-ratio (M0) model of the codeML software package, suggesting that purifying selection operates S gene evolution during SARS-CoV-2 transmission among humans. In three LRTs, all alternative models (M3, M2a, and M8) are signi cantly better ts (P < 10 -4 ) than relevant null models (M0, M1a, and M7), indicating that some S sites were subjected to strong positive selection (ω = 18.22175-20.61283) ( Table 3). A single positive selection site (614D) is identi ed in the S gene with a posterior probability of 1.000 in all three models [22], clear evidence for this site experiencing positive selection while the virus transmits between human hosts. This result is furthermore validated using internal xed effects likelihood (IFEL) and evolutionary ngerprinting methodology, implemented using the HyPhy software package (Fig. 3) [23][24][25]. To our surprise, the positive selection site is not located at the receptor binding domain (RBD) or receptor binding motif (RBM), as we anticipated, which generally play the most integral role in virus-receptor interaction and virus entry into host cells [26]. This result suggests that relative genetic stability of this motif bene ts virus survival. Interestingly, the site under positive selection pressure always has a D614G (for the S gene is 1841A>G) mutation, implying that such a mutation may enhance virus adaptability in human hosts. Another potential positive selection site at codon 5 was also identi ed, and a L5F mutation (for the S gene is 13C>T) was always found, with posterior probabilities greater than 0.95, 0.93 and 0.92 (critical values) calculated by M3, M2a and M8 models (Table 3), respectively. A similar result was also con rmed using an evolutionary ngerprinting method (Additional le 3: Figure S2).  The evolutionary relationship of S gene alleles with and without D614G and L5F mutations Phylogenetic trees of S gene alleles were derived to test the evolutionary relationship among alleles with and without the D614G mutation. As shown in Fig. 4a, the 173 alleles of the S gene clustered into four clades. Alleles with a D614G mutation could be found in all the 4 clades. A dominant clade contains 79 out of total 85 alleles with this mutation. The remaining 6 mutated S alleles were distributed among the other 3 clades. This result is supported by a parsimony network of S gene alleles created using PopART software (http://popart.otago.ac.nz) [27]. Two central alleles (representative virus isolates are WH01 and GZMU0019) and associated alleles around them form a star-scattering network, suggesting that the S gene may have two potential origins (Fig. 4b). All S alleles with a D614G mutation are closely related with a few point mutations, and comprise a scattered star structure, suggesting the expansion of the SARS-CoV-2 population with a D614G mutation on the S gene. In contrast, alleles of the N gene suggest a single ancestor, as analyzed by parsimony networking, though 3 phylogenetic clades were identi ed (Additional le 4: Figure S3).
A total of 5 alleles with L5F mutations were discovered, and all of them were located within a single clade, accounting for 83.33% of all alleles in the clade (Additional le 5: Figure S4a). Further parsimony network analyses revealed that S alleles with a L5F mutation are not closely related, but rather are distributed in both WH01 and GZMU0019 haplotype groups (Additional le 5: Figure S4b). No scattered star structure could be formed, indicating that the L5F mutation might arise from independent origins, unlike the D614G mutation.

Frequency of S alleles with D614G mutations increases in SARS-CoV-2 isolates during human to human virus transmission
Considering that mutations in a positive selection site should be bene cial to the survival of the individuals carrying the mutation, we postulate that the D614G (1841 A>G) mutation may increase the spread of SARS-CoV-2. Evidence is obtained from the haplotype network analysis aforementioned of S alleles (Fig. 4b). S gene haplotypes (alleles) with a D614G mutation (representative isolate, GZMU0019) have evolved many subtypes and comprise a star structure with GZMU0019 in the center. This starburst pattern with one haplotype in the center and many other haplotypes surrounding the central haplotype suggests a signature of rapid population expansion [28]. To further study whether SARS-CoV-2 isolates with a D614G mutation have a selective advantage in survival during transmission among humans we calculated the frequency of S alleles carrying a D614G mutation in each week from collected SARS-CoV-2 isolates from December 24, 2019 through April 20, 2020 (17 epidemiological weeks). Detailed information on these isolates including collection date, collection region, and accession or biosample number is available in Additional le 6: Table S2, and Additional le 7: Table S3.
In 173 S gene alleles, 85 carried a D614G mutation, accounting for 49.13% of the total. Similarly, 47 of 99 S proteins were found to carry a D614G mutation, accounting for 47.47% of the total. The rst two isolates of our collected data, GWHABKF00000001 and WH01 (isolated in December 24, 2019 and December 26, 2019, respectively), carried 614D mutations in the S protein. The rst SARS-CoV-2 isolate with a D614G mutation was isolated from a patient with COVID-19 on February 5, 2020 (week 7 in our dataset). After that, except for weeks 9 and 10 (possibly due to the small number of samples and sampling deviation), an increasing trend in the proportion of isolates carrying the D614G mutation in the S protein was noteworthy. In week 17, the last week of our dataset, 91.11% of SARS-CoV-2 isolates carried this mutation (Fig 5a; Additional le 6: Table S2). Further analyses revealed that the frequency of the D614G mutation in the S gene steadily increased when data from weeks 6 through 17 were combined (Fig 5b; Additional le 6: Table S2). To exclude the in uence of sample size on the result (in some weeks, only 4-6 isolates were collected) we reorganized the dataset by taking both sample size and sampling time into account. Various panels of 200-300 isolates were studied, with similar results observed ( Fig. 5c and d; Additional le 7: Table S3). Taken together, these results suggested that SARS-CoV-2 isolates with a D614G mutation might increase the virus's ability to transmit and hence rapidly spread around the world.
D614G mutations in the S gene may destabilize the S protein trimer and promote receptor binding and membrane fusion We found that the D614G mutation is located at the subdomain 2 (SD2) at the C-terminus of RBD and close to the two potential cleavage sites between S1 and S2 [29] (Fig. 6a). Considering positive selection is usually bene cial to the survival of the individual carrying the mutation, we speculate that the D614G mutation may facilitate structural conformation change to promote receptor binding or membrane fusion [8,30], and in turn, improving infectivity. From the latest cryo-electron microscopy-determined (cryo-EM) structure of the SARS-CoV-2 S protein, the negatively charged sidechain of D614 points toward the positively charged sidechain of K854 from the neighboring monomer (Fig. 6b) [29] . The distance between the closest atoms of the two residues is 2.6 Å, which is an optimal distance to form a salt bridge (Fig. 6c). From the modelled structure with a D614G mutation, the distance is increased to 5.2 Å (Fig. 6d), which would potentially abolish the salt bridge and destabilize the integrity of the S trimer in wild type. It has been reported that human receptor ACE2 binds to an "open" conformation of the S protein, where RBD moves away from the core structure, and exposes its receptor binding surface. The entire S trimer then undergoes a series of dramatic conformation changes, including cleavages between S1 and S2, disassociation of S1, and post-fusion transformation of S2 [31,32]. Mutations at cleavage sites and adding internal crosslinks to the S trimer would keep the protein in a stable and "closed" conformation, where the receptor binding surface of RBD is inaccessible [29,33]. Therefore, we hypothesize that the highly transmissible D614G mutation, driven by positive evolutionary selection, promotes the accessibility of RBD by eradicating a critical salt bridge between S protein monomers, which subsequently triggers membrane fusion upon ACE2 binding.

Discussion
Many recent studies have demonstrated the continual evolution of SARS-CoV-2 [34][35][36]. Four structural genes of SARS-CoV-2, E, M, N and S, may determine the infectivity or pathogenesis of this persistent virus, but the molecular evolution patterns of these structural genes remain largely unknown. Among the four aforementioned structural genes, E and M show the highest homology, with relatively few SNPs and SAPs, indicating the importance of the conservation of these two genes for virus survival. A key factor for viral transcription and assembly is the N protein [37,38]. High sequence variability was found in the N protein, suggesting a vast adaption of the virus during host transmission. Previous studies shows that elevated genetic variation has been found among bat SARSr-CoVs, particularly in the S gene [13]. High nucleotide diversity (π; Table 1) of the S gene was detected in SARS-CoV-2 isolates, suggesting that it may bene t virus survival in human hosts. Recombination has played an important evolutionary role during the emergence of SARS-CoV-2 [39,40]. It has been reported that SARS-CoV-2 is a recombinant virus of bat and pangolin (Manis javanica) CoVs, suggesting a critical role of recombination [39]. When this zoonotic virus transfers from animal to human and leads to continuous human to human transmission, no clear evidence of recombination is found among the alleles from the four structural genes in our study (Fig. 2), suggesting that the evolution of these genes is not driven by recombination. Li et al. studied the origin of SARS-CoV-2 and showed evidence of strong purifying selection in the S and other genes among bat, pangolin and human coronaviruses, indicating similarly strong evolutionary constraints in different host species [40]. Likewise, our results show that purifying selection drives evolution at the whole structural gene level of SARS-CoV-2 during its transmission between human hosts (Table 2; Additional le 1: Figure S1). This result implies that in general, genetic variation of these structural genes will not confer a signi cant disadvantage to virus survival. Because no recombination events occurred during SARS-CoV-2 evolution, nonsynonymous mutations would be removed at a high rate during virus transmission [41], and positive selection sites with mutations will be xed. The frequency of S alleles with a D614G mutation is increasing during the SARS-CoV-2 spread ( Fig. 3; Fig. 5; Additional le 6: Table S2; Additional le 7: Table S3).
We identi ed another potentially positive selection site at codon 5 of the S gene, with a consistent L5F mutation (Table 3; Additional le 3: Figure S2). Considering that signal sequence (SS) is a short hydrophobic peptide that plays an important role in guiding viral proteins into endoplasmic reticulum (ER) for proper folding and assembly [9], we postulate that L5F mutations may increase hydrophobicity of the SS, thus facilitating the entry of the S protein into ER for folding and assembly, and in turn, secretion of the virus. Our results moreover show that majority of S alleles with a D614G mutation cluster in one clade, and make a distinct star-scattering network (Fig. 4), suggesting a potentially recent common ancestor of these mutants. Nevertheless, sporadic alleles with an L5F mutation identi ed thus far indicate that the L5F mutation might be subject to relatively weaker pressure, still at an early stage of positive selection (Additional le 5: Figure S4).
The positive selected D614G mutation may play an important role in the adaptability of SARS-CoV-2 in both the host and virus populations [42]. Another explanation is that the mutation is driven by speci c interactions between high levels of viral sequence divergence and polymorphic host receptors or interacting proteins [43]. The S protein is the key determinant for tissue tropism, and host range and speci city of coronaviruses such as SARS-CoV-2. The virus infects host cells through the interaction between the S protein and its cellular receptor, ACE2 [11]. In this process, viral entry requires the precursor S protein cleaved by cellular proteases including trypsin, furin, transmembrane serine protease 2 (TMPRSS2), or endosomal cathepsin L, which generate receptor binding subunit S1 and membrane fusion S2 [30,44,45]. From structural studies of both SARS-CoV and SARS-CoV-2, the receptor binding domain (RBD) located at the C-terminal of S1 and the adjacent N-terminal domain (NTD) are relatively exible, the feature required for receptor recognition and subsequent membrane fusion [29,46]. From S protein structural modeling (Fig. 6), we hypothesize that the D614G mutation, driven by positive selection through molecular evolution, promotes accessibility of RBD through the loss of a critical salt bridge between S protein monomers ( Fig. 6c and d), which subsequently triggers membrane fusion upon ACE2 binding. The exact in uence and detailed mechanism of the D614G mutation on SARS-CoV-2 infectivity and expansion require further investigation. It should be noted that a L5F mutation is always found in a potentially positive selection site of the S gene (codon 5). The low frequency (3.82%, 5/131) of S alleles with an L5F mutation does not show a clearly increasing pattern, possibly due to relatively weaker positive selection pressure compared to codon 614. Because the end date of the studied isolates is late April, 2020, the persistent frequency monitoring of L5F in S alleles is required to determine whether it experiences expansion. Potential effects of the L5F mutation to SARS-CoV-2 need to be documented.

Conclusions
We present modern evolutionary virology analyses on a large and comparative set of SARS-CoV-2 structural gene sequences derived from an international collection of SARS-CoV-2 isolates. Distinct phylogenetic patterns of four structural proteins of SARS-CoV-2 are illustrated. Protein sequence comparisons show that E and M genes exhibit a relatively close genetic a nity to bat SARSr-CoV, suggesting the evolutionary conservation of these two genes. In contrast, relatively high genetic variation is observed in N and S proteins among SARS-CoV-2 isolates, implying extensive adaptability. No clear intragenic recombination is detected among the four genes, suggesting that recombination is not the primary evolutionary force driving the evolution of these genes. Our analysis, however, shows that purifying selection pressure may be the main force operating on a whole gene level of SARS-CoV-2 during interhuman transmission.
We identi ed a codon in the S gene that is de nitively experiencing positive selection pressure, and always leads to a D614G mutation in S proteins. S alleles with a D614G mutation have expanded rapidly among SARS-CoV-2 isolates. The D614G mutation signi cantly extends the distance between monomers in the S protein trimer, which may disrupt the salt bridge formed by D614 and K854 between monomers. Such disruption may promote RBD opening and facilitate the entry of the virus into host cells, in turn contribute to the diffusion of these mutated alleles. Codon 5 of the S gene is another potential positive selection site. Although a limited number of alleles with an L5F mutation were identi ed, this mutation may potentially affect the assembly and secretion of SARS-CoV-2. A close examination of the L5F mutation may be required in case another expansion occurs. Because the S protein is a key target for SARS-CoV-2 vaccines, therapeutic antibodies, and diagnostics, the D614G and L5F mutations should be paid close attention to. Owning to the fact that the exact mechanism remains unclear, further research should focus on the function of these mutation sites, particularly how they affect the expansion of these mutated alleles in SARS-CoV-2.

SARS-CoV-2 isolates
Complete full-length genomic sequences of SARS-CoV-2 were downloaded from the 2019 Novel Coronavirus Resource (2019nCoVR), China National Center for Bioinformation (all of which were also uploaded to the NCBI GenBank database). Sequences were manually checked for the structural gene integrity and sequencing accuracy and nally a total of 3090 isolates were selected and veri ed for the present study. These isolates were collected from December 24, 2019 to April 24, 2020 in China, USA, Japan, Pakistan, Australia, Greece, Germany, Peru, Turkey, Kazakhstan, Iran, Serbia, Thailand, Netherlands, Sri Lanka, Czech Republic, Malaysia, India, et al. Detailed information on these isolates, including GenBank accession number or biosample number is summarized in Additional le 8: Table S4.

Sequence analysis of four structural genes and proteins
The E, M, N, and S gene sequences were extracted from the SARS-CoV-2 global isolate collection and aligned using the MEGA X software package using Muscle (codons) parameters [47]. Because some regions of genomic sequences of SARS-CoV-2 couldn't be exactly identi ed (in which nucleic acid bases are shown as degenerate bases [e.g. N, R, Y]), we were sometimes unable to obtain all of the four structural gene sequences from an isolate. Allele type and DNA sequence polymorphism analyses were performed using software DnaSP 6.12.03 [48]. Protein sequences and polymorphism loci of these isolates were also aligned and analyzed using the MEGA X software package [47].

Molecular evolution analysis
An unrooted phylogenetic tree of the four structural proteins was constructed using the MEGA X software package [47], with evolutionary history inferred using the Maximum Likelihood method, based on the JTT matrix-based model for E protein sequences, General Reversible Chloroplast + Freq. model for M protein sequences, JTT matrix-based model for N protein sequences, and the Jones et al. w/freq. model for S protein sequences. Model selection was conducted in MEGA X [43]. Bootstrap values were estimated using 1000 replications. Initial trees for the heuristic search were obtained automatically by applying Neighbor-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using each model mentioned above. Trees were drawn to scale, and FigTree V1.4.4 was utilized to form cladogram branches [49]. The aligned DNA sequences were also screened using RDP4 software to detect intragenic recombination among the alleles of each structural gene [50]. Six methods implemented in RDP4 were utilized. These methods include RDP [50], GENECONV [51], BootScan [52], MaxChi [53], Chimaera [54], and SiScan [55]. Common settings for all methods include considering sequences as linear and setting the statistical signi cance cutoff was at 0.05, with Bonferroni correction for multiple comparisons. Phylogenetic evidence and polishing of breakpoints were also required. Potential recombination events (PREs) were classi ed as those identi ed by at least two methods. A reticulate network tree of the alleles of the four structural genes of SARS-CoV-2 was also generated using Splitstree4 [56]. Phi tests implemented in Splitstree4 were used to de ne probable recombination events. Tajima's D, Fu and Li's D* and F* tests were employed to test the mutation neutrality hypothesis of a whole gene as previously described by our research group [57]. These analyses were carried out using DnaSP 6.12.03 [48]. A statistical signi cance cutoff was set at 0.05 for Tajima's D test, Fu and Li's D* and F* tests. The false discovery rate method and 1000 replications in a coalescent simulation were applied for correcting multiple comparisons. Non-neutrality evolution was determined when identi ed by at least two of three tests. Nonsynonymous and synonymous mutations of the alleles of the four structural genes were calculated using the MEGA X software package [47].

Analysis of positive selection at codon level
The selection pressure operating on the four structural genes of SARS-CoV-2 was investigated using the Maximum Likelihood (ML) method. Analyses were performed using a visual tool from the codeml software program, named EasyCodeML [58]. Three nested models (M3 vs. M0, M2a vs. M1a, and M8 vs. M7) were compared and likelihood ratio tests (LRTs) were applied to assess a best t of codes. Model tting was performed using multiple seed values for dN/dS, assuming the F3x4 model of codon frequencies. Positive selection was inferred when the individual site or codon had a ratio of nonsynonymous to synonymous mutations (dN/dS ratios) of greater than one (ω > 1). When the LRT was signi cant (P < 0.05), Bayes empirical Bayes (BEB) (M8 model) and Naive Empirical Bayes (NEB) methods (M3 and M2a models) were further employed to identify amino acid residues that likely evolved under positive selection based on a posterior probability threshold of 0.95. Results from the M8 model were taken as the standard, in accordance with Yang et al. [17]. An M3 model was used for the frequency distribution of codon class analysis also as recommended by Yang et al. [22]. The HyPhy software package was used to validate the results obtained using the ML method [59].

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication Not applicable
Availability of data and materials All data generated or analyzed during this study are included in this published article and its additional les.
Competing interests The authors have declared no con ict of interests.
Funding This research was supported by National Natural Science Foundation of China (grant number 31870001) to X.Y.Z.
Availability of data and materials All data generated during this study are included in this published article and its Additional les 1 and 2.
Authors' contributions XYZ designed, carried out, and analyzed the data and wrote the manuscript. YZ designed, carried out, and analyzed the structure of SARS-CoV-2 S protein. KH, XZ, YQ, YL and LeY collect the genomic data of the isolates. YH and BH supervised and assisted in research planning and also supervised the manuscript. All authors read and approved nal manuscript.     Evolutionary relationship of S alleles with or without D614G mutation. a Phylogenetic tree of S gene based on nucleotide sequences of 173 alleles. The evolutionary history is inferred using the Maximum Likelihood method and Tamura-Nei model. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. Each clade is highlighted with different color. Alleles are shown with their representative isolate names, and alleles with D614G mutation are highlighted in red. Bootstrap values more than 0.5 are shown. b Parsimony network of SARS-CoV-2 S gene haplotype (allele) diversity obtained from 3090 isolates worldwide. Each oblique line linking between haplotypes (haplotype name is shown as its representative isolate name) represents one mutational difference. Unlabeled nodes (Gray circle) indicate inferred steps have not found in the sampled populations yet. The ancestral haplotype, or root of the network, is labeled with a square, and represent haplotype name is marked green or red. The red nodes indicate haplotypes with D614G mutation, while green or black nodes indicate haplotypes without D614G mutation. Dotted boxes indicate major haplotype groups.