Characteristics of SARS-CoV-2 isolates, structural gene and protein sequences
The 3090 SARS-CoV-2 isolates harbor only 16 alleles of E and 40 alleles of M, but an abundant number of alleles of N and S genes, which contain 131 and 173, respectively. These alleles correspond to 10, 14, 88 and 99 different amino acid sequences of E, M, N, and S proteins, respectively. Protein sequence comparison of the WH01 isolate with a SARSr-CoV isolate, bat-SL-CoVZC45, shows identity of 100% (75/75) in E, 98.65% (219/222) in M, 94.27% (395/419) in N and 80.06% (1171/1273) in S proteins, respectively. These results imply a close homology between SARS-CoV-2 and bat SARSr-CoV, particularly on E and M proteins. On the other hand, it indicates an extreme conservation of E and M proteins and their functions among coronaviruses[10].
Further analysis revealed that there are 14 single nucleotide polymorphisms (SNPs) of E gene, but only 5 single amino acid polymorphic (SAP) loci in the E protein. Similar result was observed on M gene and protein, with 37 SNPs and 9 SAPs. In contrast, 126 SNPs and 75 SAPs are detected on N gene and protein, respectively. S protein, the most important component that mediates virus entry by receptor binding and membrane fusion and determines the infectivity of SARS-CoV-2 [11], harbors 155 SNPs on alleles and 90 SAPs in the protein. Considering the size of nucleotides and amino acid residues, N gene has the maximum sequence variability with 10.02% (126/1257) SNPs and 17.90% (75/419) SAPs, respectively. However, S gene has most pairwise nucleotide differences among the four structural genes, indicating a more genetic diversity of S gene (Table 1).
Table 1. Summary of genetic diversity of the 4 structural genes of the SARS-CoV-2 isolates
Gene
|
Sequence, n*
|
Sequence length
|
h
|
π
|
S
|
θ
|
ƞ
|
E
|
2928
|
228
|
16
|
0.00012
|
14
|
0.00475
|
15
|
M
|
2891
|
669
|
40
|
0.00018
|
37
|
0.00665
|
40
|
N
|
2253
|
1260
|
131
|
0.00056
|
126
|
0.01081
|
130
|
S
|
2339
|
3825
|
173
|
0.00075
|
155
|
0.00753
|
169
|
h,Haplotypes,
π, Nucleotide diversity
S, Polymorphic sites
θ, Theta (per site) from S, population mutation ration
Ƞ, Total number of mutations
* Some bases of SARS-CoV-2 genomic sequences are not exactly identified; thus, the number of gene sequences were less than 3090.
Distinct phylogenetic patterns of the four structural genes
The phylogenetic analysis revealed that all SARS-CoV-2 E proteins form three clusters. Similar to E protein, phylogenetic tree of SARS-CoV-2 M proteins is formed by three clusters with few branches (Fig. 1a and b). The results suggest both E and M genes may display a relatively high conservation during coronavirus evolution. In contrast, SARS-CoV-2 N and S proteins show distinct phylogenetic pattern as compared with that of E and M. Four and three main phylogenetic clusters with various branches are identified in the N and S proteins, respectively (Fig. 1c and d). Given the crucial roles of N and S proteins in virus transcription, assembly, and entry to host cells, whether SARS-CoV-2 isolates harboring different N and S variants (such as those clustered into different clades) may influence their infectivity remains unknown, and requires further study.
Purifying selection drives the evolution at whole structural gene levels of SARS-CoV-2 during its human to human transmission
Although many studies demonstrated that recombination plays an important role on the emergence of SARS-CoV-2 and its contribution to admit SARS-CoV-2 as a human infectious pathogen [12-14], how this virus evolves during its global transmission has not been profiled yet. Therefore, we first analyzed intragenic recombination events of each structural gene using RDP4. The results indicate there were no recombination events occurred among the alleles of each gene (data not shown). Recombination event is also assessed through reticulate network tree by phi test in SplitsTree4. Although some internal nodes are noticed in N and S alleles, no clear evidence for recombination is validated of each gene by Phi test (p>0.05) (Fig. 2). It indicates a relative stable state of SARS-CoV-2 during its transmission though a possible genetic interaction of different isolates might have occurred when it became a global pandemic [15, 16]. In addition, Tajima’s D, Fu and Li’s D* and F* statistics were calculated to examine the mutation neutrality hypothesis of the four structural genes of SARS-CoV-2. The results reveal that the evolution of all four genes does not match the neutral hypothesis, but favor purifying selection (Table 2; Additional file 1: Figure S1). The average of all pairwise dN/dS ratios (ω) among the alleles of each structural gene of SARS-CoV-2 is 0.5443 in E, 0.1562 in M, 0.07978 in N, and 0.4980 in S gene, respectively. All together, these results suggest that at the whole gene level, inconsistent purifying selection is the main evolution force (Table 2; Additional file 1: Figure S1).
Table 2. Summary of neutrality for the four structural genes in SARS-CoV-2 isolates
Gene
|
Tajima’s D
|
Fu and Li’s D* test
|
Fu and Li’s F* test
|
dN
|
dS
|
dN/dS (ω)
|
Selection
|
E
|
-2.29974, P<0.01
|
-3.18477, P<0.02
|
-3.38505, P<0.02
|
0.006836
|
0.1256
|
0.5443
|
Purifying selection
|
M
|
-2.74611, P<0.001
|
-5.64276, P<0.02
|
-5.50855, P<0.02
|
0.001294
|
0.008296
|
0.1562
|
Purifying selection
|
N
|
-2.87598, P<0.001
|
-9.67153, P<0.02
|
-7.95879, P<0.02
|
0.000251
|
0.003146
|
0.07978
|
Purifying selection
|
S
|
-2.87646, P<0.001
|
-11.01171, P<0.02
|
-8.59037, P<0.02
|
0.000609
|
0.001223
|
0.4980
|
Purifying selection
|
SARS-CoV2 S gene is operated by positive selection at a definitive codon located at the C-terminal portion of S1 subunit and a potential codon located at the signal sequence
Guo et al. reported that the S gene of SARSr-CoV populations in their natural host, Chinese horseshoe bat (Rhinolophus sinicus), has evolved through positive selection at some codons[9]. As mentioned above, at the whole gene level, purifying selection is the main force driving the evolution of studied genes. Whether positive selection pressure accelerates the diversification of the structural genes of SARS-CoV-2 remains unclear. Therefore, we used codon-substitution models to estimate the ratio of nonsynonymous over synonymous substitutions (dN/dS), also known as ω. The role of recombination in the polymorphism of four genes is excluded because no intragenic recombination was detected (Fig. 2). By using ML model, we don’t find any codon of E and M gene subjecting to positive selection obviously (data not shown). Although a potential positive selection site 208A in the N gene is identified by using M3 model, it is not validated by any other models especially the M8 model, suggesting the evidence for N gene positive selection is limited (Additional file 2: Table S1). For the S gene, we found the average ω is 0.37199 calculated by M0 model of the codeML package, suggesting that purifying selection operates S gene evolution of during SARS-CoV-2 transmission among human beings. In three LRTs, all alternative models (M3, M2a, M8) are significantly better fit (P<10-4) than relevant null models (M0, M1a, M7), indicating that some sites of S were subjected to strong positive selection (ω=18.22175-20.61283) (Table 3). A single positive selection site (614D) is identified in the S gene with posterior probability of 1.000 in all the three models [17], a clear evidence showing that this site is still experiencing positive selection when the virus is transmitting from human to human. The result is also validated using internal fixed effects likelihood (IFEL) and Evolutionary Fingerprinting methods implemented in HyPhy package (Fig. 3) [18-20]. To our surprise, the positive selection site is not located at the receptor binding domain (RBD) or receptor binding motif (RBM) as we anticipated, which play the most important role in virus-receptor interaction and virus entry into host cells [21]. This result suggests that a relatively genetic stability of this motif would benefit the virus survival. Intriguingly, the site under positive selection pressure always has a D614G (for the S gene is 1841A>G) mutation, implying such mutation may enhance virus adaptability in human hosts. Another potential positive selection site at codon 5 is also identified, and a L5F mutation (for the S gene is 13C>T) is always found, with posterior probabilities greater than 0.95, 0.93 and 0.92 (critical values) calculated by M3, M2a and M8 models (Table 3), respectively. Similar result was also confirmed by Evolutionary Fingerprinting method (Additional file 3: Figure S2).
Table 3. Log-likelihood values and parameter estimates for the SARS-CoV-2 S gene sequences
Model
|
Ln L
|
Estimates of parameters
|
Model compared
|
LRT P-value
|
Positive sites
|
M3 (discrete)
|
-6766.339162
|
p0=0.96797, p1=0.02883, p2=0.00320
ω0=00.26126, ω1= 2.70530, ω2=20.61283
|
M0 vs. M3
|
0.000000001
|
5 L 0.958*,28 Y 0.850,221 S 0.901,614 D 1.000***,677 Q 0.891
|
M0 (one ratio)
|
-6790.072925
|
ω0=0.37199
|
Not Allowed
|
M2a(selection)
|
-6766.432802
|
p0=0.81731, p1=0.17872, p2=0.00397
ω0=0.17504, ω1=1.00000, ω2=18.76936
|
M1a vs. M2a
|
0.000004385
|
5 L 0.9258,28 Y 0.812,221 S 0.832,614 D 1.000***,677 Q 0.828
|
M1a (neutral)
|
-6778.770190
|
p0=0.70461, p1=0.29539
ω0=0.04395, ω1=1.00000
|
Not Allowed
|
M8(beta&ω)
|
-6768.829411
|
p0=0.99578, p=0.40368, q=0.82224
p1= 0.00422, ω= 18.22175
|
M7 vs.M8
|
0.000030400
|
5 L 0.931,28 Y 0.817,221 S 0.831,614 D 1.000***,677 Q 0.828
|
M7(beta)
|
-6779.230494
|
p=0.00857, q=0.02623
|
Not Allowed
|
LnL is the log likelihood; ω is ratio of dN/dS, LRT P-value indicates the value of chi-square test; Parameters indicating positive selection are presented in bold; Positive selection sites were identified by the Bayes empirical Bayes (BEB) methods under M8 model. The posterior probabilities (p)≥0.80 are shown, (p)≥0.95 (p)≥0.99, and (p)=1.000 are indicated by *, ** and ***, respectively. Yang et al. recommended that results from M8 model were preferred to find sites under positive selection pressure.
Evolutionary relationship of S gene alleles with or without D614G and L5F mutation
Phylogenetic tree of S gene alleles was derived to test the evolutionary relationship among the alleles with or without D614G mutation. As shown in Fig. 4a, the 173 alleles of the S gene could be clustered into four clades. Alleles with D614G mutation could be found in all 4 clades, among which a dominant one contains 79 out of 85 alleles with such mutation. The remaining 6 mutated S alleles are distributed in other 3 clades. This result is also supported by the parsimony network of S gene alleles using PopART (http://popart.otago.ac.nz) [22]. Two central alleles (representative virus isolates are WH01 and GZMU0019) and associated alleles around them form a star scattering network, suggesting that the S gene may have two potential origins (Fig. 4b). All S alleles with D614G mutation are closely related (with a few point mutations), and comprise a scattered star structure, suggesting the expansion of SARS-CoV-2 population with D614G mutation on S gene. In contrast, alleles of the N gene show a single ancestor analyzed by parsimony network though 3 phylogenetic clades are identified (Additional file 4: Figure S3).
A total of 5 alleles with L5F mutation are found and all of them are in one clade, accounting for 83.33% of all alleles in the clade (Additional file 5: Figure S4a). Further parsimony network analysis reveals that S alleles with L5F mutation are not closely related, but distribute in both WH01 and GZMU0019 haplotype groups (Additional file 5: Figure S4b). No scattered star structure of these alleles can be formed, indicating L5F mutation might arise from independent origins unlike D614G mutants.
Frequency of S allele with D614G mutation increased in SARS-CoV-2 isolates during human to human transmission
Considering that mutation of a positive selection site should be beneficial to the survival of the individuals carrying the mutation, we postulate that the D614G (1841A>G) mutation may help the spread of SARS-CoV-2. Some evidence is obtained from the haplotype network of S alleles mentioned above (Fig. 4b). S gene haplotypes (alleles) with D614G mutation (representative isolate, GZMU0019) have evolved many subtypes and comprise a star structure with GZMU0019 in the center. This starburst pattern with one haplotype in the center and many other haplotypes surrounding the central haplotype suggests a signature of rapid population expansion [23]. To further study whether SARS-CoV-2 isolates with D614G mutation have advantage in survival during its transmission among human beings, we calculated the frequencies of S alleles carrying D614G mutation in each week from the collected SARS-CoV-2 isolates from December 24, 2019 to April 20, 2020 (17 weeks). Detailed information of these isolates including collection date, collection region and accession or biosample numbers is summarized on Additional file 6: Table S2 and Additional file 7: Table S3.
In 173 S gene alleles, 85 carry D614G mutation, accounting for 49.13% of all. Similarly, 47 out 99 S proteins carry D614G mutation, accounting for 47.47% of all. The first two isolates, GWHABKF00000001 and WH01 in our data collect (isolated in December 24, 2019 and December 26, 2019, respectively), carry 614D in the S protein, while the first SARS-CoV-2 isolate with a D614G mutation is isolated from a patient with COVID-19 on February 5, 2020 (week 7 in our dataset). After that, except for week 9 and week 10 (possibly due to the small number of samples and sampling deviation), a spread trend that more and more proportion of isolates carry the D614G mutation in the S protein stands out. In the week 17, the last week of our dataset, 91.11% of SARS-CoV-2 isolates carry this mutation (Fig 5a; Additional file 6: Table S2). Further analysis reveals that the frequency of D614G mutation in the S gene was steadily increasing when combining data from week 6 to 17 (Fig 5b, Additional file 6: Table S2). To exclude the influence of sample size on the result (in some weeks, only 4-6 isolates were collected in the dataset), we reorganized the dataset by taking both the sample size and sampling time into account. Various panels of 200-300 isolates were studied and similar results were observed (Fig. 5c and d; Additional file 7: Table S3,). Taken together, these results suggest that SARS-CoV-2 isolates with D614G mutation may increase their ability to transmit, and contribute to the rapid spread of this virus to the world.
D614G mutation of S gene may destabilize S protein trimer and promote receptor binding and membrane fusion
We found that the D614G mutation is located at the subdomain 2 (SD2) that at the C-terminus of RBD and close to the two potential cleavage sites between S1 and S2 [24] (Fig. 6a). Considering that positive selection is usually beneficial to the survival of the individual carrying the mutation, we speculate that the D614G mutation may facilitate structural conformation change to promote receptor binding or membrane fusion[5, 25], and in turn improving the infectivity. From the latest cryo-electron microscopy (cryo-EM) structure of SARS-CoV-2 S protein, the negatively charged sidechain of D614 points towards the positively charged sidechain of K854 from the neighboring monomer (Fig. 6b) [24] . The distance between the closest atoms of the two residues is 2.6 Å, which is an optimal distance to form salt bridge (Fig. 6c). From the modelled structure with D614G mutation, the distance is increased to 5.2 Å (Fig. 6d), which would potentially abolish the salt bridge and destabilize the integrity of the S trimer in wild type. It has been reported that human receptor ACE2 binds to an “open” conformation of S protein, where RBD move away from the core structure and expose its receptor binding surface. The entire S trimer then undergoes a serial of dramatic conformation changes, including cleavages between S1 and S2, disassociation of S1 and post-fusion transformation of S2 [26, 27]. Changes including mutations at cleavage sites and adding internal crosslinks in S trimer would keep the protein in a stable and “closed” conformation where the receptor binding surface of RBD is inaccessible [24, 28]. Therefore, we hypothesize that the highly transmissible D614G mutation driven by the positive selection through evolution promotes accessibility of RBD by losing a critical salt bridge between the S protein monomers, which subsequently triggers membrane fusion upon ACE2 binding.