Intra-host dynamic variations in SARS-CoV-2

The mutations make uncertain to SARS-CoV-2 disease control and vaccine development. At population-level, single nucleotide polymorphism (SNPs) have displayed mutations for illustrating epidemiology, transmission, and pathogenesis of COVID-19. These mutations are to be expected by the analysis of intra-host level, which presented as intra-host variations (iSNVs). Here, we performed spatio-temporal analysis on iSNVs in 402 clinical samples from 170 patients, and observed an increase of genetic diversity along the day post symptom onset within individual patient and among subpopulations divided by gender, age, illness severity and viral shedding time, suggested a positive selection at intra-host level. The comparison of iSNVs and SNPs displayed that most of nonsynonymous mutations were not xed suggested a purifying selection. This two-step tness selection enforced iSNVs containing more nonsynonymous mutations, that highlight the potential characters of SARS-CoV-2 for viral infections and global transmissions.

In the present study, we quantitatively assess the SARS-CoV-2 genetic diversity and viral evolution within individual hosts, advancements in a deep viral genome sequencing 16 and an updating empirical analysis pipelines 17 . Spatio-temporal analysis of the genomic data revealed increased viral genetic diversity of SARS-CoV-2, and demonstrated a two-step tness selection played a predominant role in the understanding of evolutionary of SARS-CoV-2.
In each sample, we examined high-depth (≥100x) SARS-CoV-2 genomic sites for iSNVs, and ltered the iSNVs with stringent threshold(≥5%), which could su ciently distinguished the true iSNVs from the sequencing errors (see Methods). These iSNVs with stringent threshold were widely distributed along the genome, and the iSNVs number in each sample was not affected by genomic coverage ratio or sequencing depth (R 2 , 0.074 and 0.006, respectively, Figure S1B, Figure S1C). Totally, we identi ed 7,037 iSNVs in the 374 samples with a median density of 0.53 iSNVs/kb, which is comparable to the iSNVs previously observed in Ebola virus 17 , Yellow fever virus 19 and In uenza A 20 (Table S2). Notably, sequencing data showed that 28 samples did not contain iSNVs compared to the reference genome ( Figure S1D).

Uneven distribution of intra-host variations in SARS-CoV-2 genomes
We examined the locations of iSNVs along the SARS-CoV-2 genome. Compared to the lower density of these iSNVs (0.58 iSNVs/kb), we observed higher iSNVs density in 5'-UTR (1.23 iSNVs/kb) and 3'-UTR (1.07 iSNVs/kb) ( Figure 1E). 6,790 (96.49%) iSNVs were occurred in the coding regions. Most iSNVs (4,625, 68.11%) were identi ed in ORF1ab, following by S gene (903 iSNVs) and N gene (459 iSNVs) ( Figure S1F). However, after we normalized iSNVs with gene length, the highest frequency of iSNVs was obtained in ORF8 (1.02 iSNVs/kb), following by N protein (0.906 iSNVs/kb) ( Figure 2D), which were consistent with previous identi cation at SNP level 21 . The analysis on allele position in codon position showed that 2,329, 2,178 and 2,283 iSNVs were occurred at the 1 st , 2 nd and 3 rd codon position, respectively ( Figure 2E). The Fisher-exact test on the iSNVs in codon position for each ORF displayed that ORF10 and E gene had more iSNVs at the 1 st codon positions ( Figure 2E).
We then examined the distribution of iSNVs among the patients, and found the iSNVs occurred in high frequency SNPs (hfSNPs) in the global SARS-CoV-2 genome database shared in more patients 7 ( Figure  2F). In addition, most (63.08 %) of iSNVs have low values of minor allele frequency ranged from 0.05 to 0.20, which were rarely shared among multiple individuals ( Figure S1E). 81.02% of iSNVs were only occurred in one patient, and 11.24% were found in two individuals. According to the neutral evolution model, we constructed a simple framework to establish the basic expectations of both allele frequencies and the genomic distance between two alleles of iSNVs in a stimulated model ( Figure 2G, Figure 2H). Compared to the neutral evolution model, both allele frequencies and iSNVs distribution in the genome displayed an uneven distribution of these iSNVs, suggesting a purifying and positive selection on these iSNVs ( Figure 2G, Figure 2H).

Increased genetic diversity with the disease progression
To uncover the dynamic change of viral iSNVs in COVID-19 patients, we performed spatio-temporal analysis on the iSNVs along with the epidemic period and the disease progression, as well as that from different specimens. We observed a steady increase of iSNVs along with the epidemic period among the patient population (estimated value from 0.15 to 0.83 iSNVs/kb within 97 days) ( Figure 3A). Meanwhile, iSNVs also accumulated during the infection in individual hosts, from 0.52 iSNVs/kb on the initial day to 0.85 iSNVs/kb on 30-day post symptom onset ( Figure 3B, Table S3). This accumulation of mutations could also be observed in 81.97% (50 out of 61) paired samples from COVID-19 patients ( Figure S2A). No signi cant difference in genomic coverage was observed among different sample types or along the epidemic period ( Figure S2B). The increased genetic diversity could be observed in all three different kinds of specimens ( Figure S2C). Of note, while more iSNVs were occurred in feces than in sputum and pharyngeal swab samples during the early days of symptom onset, the accumulation of iSNVs in digest system is slower than that in respiratory system ( Figure S2C, Table S3).
Next, we measured dynamic changes of the genetic diversity in ORFs with the disease progression, and observed a rapid accumulation of mutation in the S, N, ORF1ab and other genes. The accumulation in nonsynonymous is more rapid in all genes, in comparison to synonymous mutations that respected as the neutrality ( Figure 3C), and S gene displayed the highest accumulation rate ( Figure 3C). To test whether the accumulation of genetic diversity was caused by their tness selections, we computed the iSNVs number of nonsynonymous and synonymous divergency among genes, and found that 5,197 iSNVs displayed as nonsynonymous mutations as and higher than that in synonymous (1,593). The ratio of nonsynonymous to synonymous variants was 3.26. The ratio of nonsynonymous to synonymous of iSNVs in S gene (ratio=5.31) was signi cantly higher than the other part of viral genome ( Figure 3D). The mean values of minor allele frequencies of non-synonymous and synonymous iSNVs were 0.189 and 0.195, respectively ( Figure S2B). The ratio (Ka/Ks) between the number of nonsynonymous substitutions per nonsynonymous site (Ka) and the number per of synonymous substitutions per synonymous site (Ks) in S gene was also increased from 1.01 to 2.46 with the disease progression. Assuming that Ks represent a neutral expectation, Ka/Ks values of S gene over than 1 in the late of disease progression, indicated signatures of positive selection with disease progression at least in S genes ( Figure 3E). In addition, the fraction of nonsynonymous mutations in predict epitope region 22 was signi cantly higher than expected (27.57% vs. 25.74%, p-value=0.016), while nonsynonymous mutations out of epitopes regions was signi cantly lower ( Figure 3F). The further correlation analysis between the fraction of nonsynonymous sites and the time of symptoms onsets, displayed nonsynonymous increased genes along with disease progression ( Figure S2E), especially in S gene ( Figure 3G).

RNA editing in the increase genetic diversity
Host-dependent RNA editing of APOBEC and ADARs has been acknowledged to cause the bias of mutational type in SARS-CoV-2 23 , thus, we measured all mutational types on all iSNVs. The top ve mutation types of iSNVs were U-to-C, followed by C-to-U, G-to-U, A-to-G, and G-to-A ( Figure 4A). Among them, four were the common substitutions caused by APOBECs/ADARs 23 ( Figure 4A). In addition, unlike the A-to-I RNA editing signal in human transcriptome, we did not observe an obvious depletion of G bases in position -1 that presents at A-to-I edited positions ( Figure S3A). We also calculated the correlation between the minor allele frequencies of iSNVs and the time post symptom onset. Previous study has showed an induction of APOBECs triggered by coronavirus infection but ADAR did not 23 . Consistently, the minor allele frequencies of C-to-U and G-to-A mediated by APOBEC-mediated RNA editing slightly increased in vivo. By contrast, the other mutational type including ADAR-mediated RNA editing in vivo did not increased along the infection time of patients ( Figure 4B, Figure S3B). These results suggested RNAediting of APOBECs were also impacted by the infectious of SARS-CoV-2, which turned back to change the mutation rates of C-to-U and A-to-G in SARS-CoV-2.
In uence of patient-derived immune-selection on the genetic diversity To evaluate the in uence of immune-selection on viral mutation, we measured the dynamic changes of the number of iSNVs within patient that strati ed based on gender, age, illness severity and viral shedding time ( Figure S4A, Table 1). Each sample was recalibrated based on symptom onset date ( Figure 5A). The increased genetic diversity of iSNVs could be observed in all groups ( Figure 5A). Female patients, mid-age (15-65y) patients, the patients with mild symptoms and the patients within 4-6 weeks viral shedding duration accumulated iSNVs much faster than other patients in each corresponding group ( Figure 5A). Different slops and initial values were observed in these patient-groups, suggested a different tness selection at the initial infection and the following infection stage along the day post symptom onset.
To further investigate the potential mutations in uenced by patient-derived immune-selection, we compared the proportions population with or without iSNVs that frequently occurred in these patients. Among the 52 iSNVs sites that were highly occurred frequency in the population (shared by more than six patients), we identi ed 4, 5, 4 and 8 iSNVs signi cantly existed in server, old, long viral shedding time and male patients, compared to mild/moderate, young, short viral shedding time(<14 days) and female patients( Figure 5B). These iSNVs were distributed in ORF1ab, S, N, ORF6 and ORF8. Intriguingly, among the 52 high frequent iSNVs sites, we identi ed 27 iSNVs preferred occurred in severe patients; and 12 of them were hfSNPs sites. In contrast, the 25 iSNVs that showed preference in moderate patient did not overlapped with hfSNPs ( Figure 5C, p-value < 0.001). This enrichment of hfSNPs were not observe in any other categories that strati ed based on gender, age, and viral shedding time ( Figure S4B), indicating that the propagation of these hfSNPs might under patient-derived immune-selection. Taken together, the accumulation and the different distribution of iSNVs in each group, hinted the possibility of tness selections along the symptom onset.

Uneven purifying selection process from iSNVs to SNPs
To identify the mutations puri ed from iSNVs to SNPs, we compared the genomic site of iSNVs and SNPs in public database 6,7 . Among 7037 iSNVs, 15.59% of iSNVs had been identi ed as SNPs before our observation period (May, 2020), and 11.28% of iSNVs was xed from May, 2020 to December, 2020, and the rest iSNVs (73.13%) were still not xed as SNPs ( Figure 6A). Nonsynonymous iSNVs displayed a lower xation rate than synonymous iSNVs (20.92% vs. 41.12%, p-value <0.001; Figure 6B), which is supported the model that nonsynonymous iSNVs rise to high frequency with an individual due to positive selection, but are less likely to become xed in the population due to purifying selection. Then, we performed Fisher's exact test to compare the proportion of xed mutations in each gene, and S and ORF1ab had a lower xation rates (21.04% and 20.56%, Figure 6C), either in nonsynonymous or synonymous sites ( Figure S5A). Consistently, the nonsynonymous-to-synonymous ratio of iSNVs in ORF1ab, N and S (excluding D614G) were over than that estimated in identi ed SNPs, presenting as an uneven purifying selection in these genes. As disease progress, iSNVs xation rates in nonsynonymous and synonymous sites were stable in the population ( Figure 6D, Figure S5B), indicating a similar purifying selection with disease progress. Interestingly, we also observed that mutation in iSNVs might be earlier detected before they xed in SNPs. For example, accumulative frequencies on C7051T alleles had been observed in our studies before May, whereas the rst C7051T SNP was reported until June, 2020 ( Figure  6F). All these data indicate iSNVs provide prospective and complement genetic information to illustrate the SARS-CoV-2 evolution.
Molecular function on variations in S protein before purifying selection S protein drives the cellular entry binding on receptors and acts as a major determinant of host range, cell, tissue tropism, and pathogenesis of coronaviruses 24 . Therefore, we further analyzed 21 of total 606 iSNVs sites identi ed in the coding region of S protein that caused 20 amino acids changes. 1) Nine were detected outside of RBD more than six individuals; including three linked iSNVs (A22298G, G22299A, and G22302A) with the substitutions of two amino acids were located ahead of RBD of protein (R246E and S247N, Figure 7A). 2) Eleven resided within the receptor-binding domain (RBD) or S1/S2 cleavage sites more than two individuals, including seven iSNVs were located in the receptor-binding motif (RBM). We compared the mutation sites of seven iSNVs in RBM in SARS-CoV-2 to the consensus sequencing in other animal (Bat and Pangolin) SARS-CoV-2-like coronavirus, and found that all these sites were heterogeneous ( Figure 7A). The patients with these mutations have no contact history except for two patients with G485V (Table S4); no iSNVs emerged in the rst time point of genome sequencing in patients with multiple time point ( Figure S6A, Table S4); and no evidence supported these iSNVs were linked in the genome. Taken together, these data indicated that iSNVs in RBD seem to be generated by independently viral evolution.
To elucidate the effects of these mutations at the molecular level, we rst used the SARS-COV-2 pseudovirus infection assay to assess the viral entry e cacy of 20 of the 22 S-protein mutants except S50L and M731V. Compared to reference strain, 18 of 20 tested mutants displayed a decreased (foldchange <0. 25) or comparable e cacy (fold-change <4 or fold-change >0.25) of viral entry; only mutants R685G and D614G exhibited a similar level of increased viral entry e cacy ( Figure 7B, fold-change >4). Since the residue L518V is far away from the binding interface between S protein and hACE2/CB6 and four mutants (N422K, E471K, G485V and Y505C) displayed a signi cant decrease (fold-change < 0.25) of viral entry e cacy, the other ve mutants in RBD (V407L, L452Q, V483F, Q493H and Q498H) were test for the sensitivity to CB6 (Table S5). Wide type and D614G was included as control. Modest differences between these RBD variants and reference strain (within 4-fold) were observed in their susceptibility to CB6 (Table S5). It is worth mentioning that some variants including V483F, Q493H and Q498H were even more sensitive to CB6 compared with reference strain, which illustrate that CB6 antibody could still block the entry of virus with the existence of RBD mutation.
Then, we focused on the mutations within RBM and simulated the bonding of the corresponding mutants bound to human Angiotensin-converting enzyme 2, hACE2 25 and neutralizing antibodies CB6 26 using molecular dynamics simulations ( Figure 7C). The simulation of the mutation L518V were not conducted as previously described. For comparison, we also simulated and inspected the binding of WT RBD to hACE2 and CB6, respectively. Cα root-mean-square deviation (Cα RMSD) of the complex of different mutant RBD bound to hACE2 varied within the similar ranges of the corresponding complex of WT RBD, suggesting that the 9 mutations may not induce dramatic conformational changes ( Figure S6B). Then we looked into more structural details and found that different mutations could affect the binding to hACE2 in varied ways. For example, the residue 422 was mutated from the uncharged side chain N to the positive charged K in mutant N422K. Therefore, a much stronger hydrogen bond was formed between K422 and E406. In addition, strong repulsion between K422 and R403 and attraction between K422 and Y453 induced the broken of the hydrogen bonding between Y505 and E37 and Y453 and H34. As a result, both the numbers of hydrogen bond and contact area formed between the mutant RBD and hACE2 were reduced apparently ( Figure 7D). As for the mutation G485V, since V has a relatively bulky side chain comparing to G, in order to reposition the bulky side chain, G485V mutation led to the escaping of F486 from the hydrophobic pocket formed by the residues including F28, L79, M82 and Y83 of hACE2. As a result, not only the contact area but also the numbers of hydrogen bond formed between mutant RBD and hACE2 was reduced. Especially, the hydrogen bond formed between Y505 of S protein and E37 of hACE2 which is important in binding was lost ( Figure 7E). As a possible consequence, the binding of mutants N422K/G485V to hACE2 would be weakened. The binding free energy calculations using MM/GBSA method also indicated that the weakened binding of N422K/G485 and hACE2 ( Figure 7C). The hydrogen bond and contact area changes of the other mutants were also inspected and discussed ( Figure S6B). These data illustrated most of the mutants with observed iSNVs in study decreased the viral entry ability by the reduce of hydrogen bond and/or contact area formed between mutant RBD and hACE2. We also analyzed simulation results of the complex with the mutant RBD and the antibody CB6 ( Figure S6C). Most of the mutants reduced the contact area a lot compared to WT, such that, the decreased binding a nities were observed which was manifested by the reduced number of hydrogen bond and higher binding free energy obtained in MM/GBSA calculations. Therefore, both the observation of mutations in pseudoviral infection assay and computation of their interaction displayed a weaken trends of viral infection in most of iSNVs identi ed in S protein.

Discussion
The ongoing pandemic of SARS-CoV-2 has raised worldwide alert. Mutations of SARS-CoV-2 arise naturally as the virus replicates, and presented as SNPs under the selection and transmission. Within one year after the rst case of COVID-19 was con rmed, thousands of mutations on SNPs have already been identi ed, among which only a very small minority caused the changes in SARS-CoV-2 infectivity and immune evasion. The mutations that cropped up in SARS-CoV-2 genome can be used to unearth the signs of selections that accumulated during viral evolution 27,28 . In this study, we illustrated intra-host variations of SARS-CoV-2 and demonstrated the two-step tness selections. The rst step of selection occurs after randomized mutations were generated, and a positive selection (might be patient-derived tness selection) mediates an accumulation of iSNVs that indicated by increased genetic diversity along with COVID-19 disease progression. The second step selection is purifying selection, which accounts for the reduction of nonsynonymous mutations from iSNVs to SNPs. This two-step tness model highlights the SARS-CoV-2 evolution at within-host scale to better understand the substrate for global pandemic.
The positive selection resulted two characters of iSNVs: i) the increase of genetic diversity along with COVID-19 progress, and ii) an uneven distribution of iSNVs among patients and genome. Under the selection, the former forces virus presents as more mutants that changed amino acids, which might affect key features of the virus including the infectivity, virulence and immunogenicity. Studies of Ebola virus 29 , Lassa virus 30 and in uenza virus 11 have also compared the proportions of nonsynonymous to synonymous, also displayed intra-host positive selection. High rates of mutation accumulation over short time periods in SARS-CoV-2 have been reported previously in studies of immunode cient or immunosuppressed patients who are chronically infected with SARS-CoV-2 [31][32][33] . The latter result of positive selection displayed the different pressure of positive selection among patients and genomic regions. The spatio-temporal analysis on the viral mutations within hosts to monitor dynamics of genomic changes along with the effects of age, gender as well as viral shedding time and illness severity for better understanding viral nature history and the selective pressure in different patients, to trace viral spreading, and guide the design of vaccines. Several recent publications around the SARS-CoV-2 genome have found signals of positive selection [34][35][36] and conservation within the gene encoding spike glycoprotein. Because the detection of viral mutations in most studies still relied on consensus genomic sequences, which was after the purifying process, this tness selection along the day post symptom onset has not been observed at SNP level. Instead of relying solely on the dominant viral sequences in the patients, we included low frequency viral mutations within hosts, the iSNVs, which might be better to re ect the dynamic changes of selective pressure for more e cient viral spreading or immune evasion.
As nonsynonymous mutation accumulated in individuals, the observation of differences between withinand inter-patients suggested a strong purifying selection from iSNVs to SNPs, especially in S and ORF1ab. This purifying selection process is also observed in Lassa virus 30 , Ebola virus 29 and dengue virus 37 . In addition, we assessed the viral entry e cacy of high frequencies mutations iSNVs suggested a weaken trend of viral evolution. However, we remain unclear whether these weaken virus could persists in patients with long COVID-19 38 . However, it should be note that all these raised mutations observed in population are expected under the current situation. Once circumstance was changed, such as antibody therapy used, the selection pressures on each mutation should be varied. For example, the selection arising from antibody might be different and strong. These strong selections might rapidly remodel multiple virus genetics through direct selection or genetic drift. Therefore, more potential mutations of SARS-CoV-2 that presented as iSNVs should be noticed, as well as the mutation R685G in cleavage sites increasing the viral entry e ciencies, at least in some cells and might direct results of SARS-CoV-2 persisted in multiple organs 38 , although it was never occurred as SNPs in any patient in public database.
The big concern about SARS-CoV-2 evolution is whether serious mutants that changed the viral infections or virulence, and whether these mutants would spread worldwide in future. Recently, the emergence and spread of mutants D614G 8 and the strain from a distinct phylogenetic cluster lineage B.1.1.7 9 has been considered as a pandemic threaten to public health 39,40 . These mutants were identi ed bene ted by the global genomic monitor program. Consistent with our results, these mutations have been detected as iSNVs in previous samples 41 . Taken considering that iSNVs emerged more nonsynonymous mutations by positive selection, genomic monitor at within-host scale might be an important complement of current genomic monitors. Controlling or eliminating infectious agents at their sources of transmission should also consider to control the viral mutations to reduce their further adaption in future.
In the urgent time of vaccine development and the treatment strategy selection, we might not have enough time to wait for the mutations to x in the population to prove their functions 42 . Early knowledge of potential evolution would bene t vaccine design 5,43 . The associations between these emerged mutations and the illness severity and treatments should be carefully considered. More genomic data at intra-host level should be explored to illustrate the genetic selections in whole genome scale, and their potential effects on illness severity, clinical outcomes as well as the susceptibility to different populations.

Patients and clinical cohort
Our study included all COVID-19 con rmed and admitted patients between January 20 and April 30, 2020 in Beijing Ditan Hospital, where received and con rmed the rst case in Beijing, China. Totally, we enrolled 204 cases, which occupied 34.4% con rmed patients in Beijing. All patients were con rmed by the RT-PCR test in pharyngeal swabs and administrated in Beijing Ditan Hospital. All patients are treated and managed in the ward after being diagnosed. Standardized electronic medical records were applied to collect basic demographics and epidemiological features, medical histories, and clinical information.  16 . Brie y, this approach uses direct tagmentation of RNA/DNA hybrids using Tn5 transposase to greatly simplify the sequencing library construction process, allowed us to conduct rapid library preparation using low volume input RNA templates (5.4 ul) within 4 hours. The meta-transcriptome libraries further underwent the enrichment process using biotinylated RNA probes targeting the whole viral genome (iGeneTech, Beijing, China). The nal viral-enriched libraries were sequenced on an Illumina NextSeq500 in 2x75bp pair-end mode.

Sequencing analysis of the viral genome
Quality control and error correction were implemented as previous reported 45 . To avoid nucleotide-speci c substitution errors in each read, we removed the low-quality bases at the end of reads with a threshold of Q20 and a minimum read length requirement of 50 bp. Reads without their corresponding paired reads were disregarded. The remaining paired reads were used as clean reads.
High quality viral genomic data were selected for iSNVs analysis. We rstly mapped the clean read data to the reference genome Wuhan-Hu-1 (GenBank accession no. NC_045512.2), and obtained 4.11 Mb (QRI: 0.44-22.96 Mb) high-quality viral reads per sample. After removing the low-quality genomic data, we obtained 402 samples with enough data, and meet the following criteria: i) with sequencing depth >=1 and reference coverage >=50%, ii) and depth >=100 and reference coverage >=10%.
iSNVs calling to avoid sequencing error and contaminants Calling of iSNV was performed as previously described but with different parameters 17 . Brie y: (1) Sequencing reads were pair-ended aligned to the reference genome sequence (GenBank accession no. A series of criteria were used to ensure high quality iSNVs with Q30 reads support and: (1) minor allele frequency of ≥5% (a conservative cutoff based on an error rate estimation); (2) depth of the minor allele of ≥5; and (3) strand bias of the minor allele less than tenfold or the sher-exact test for the major allele and minor allele, and (4) to avoid the errors by reads mapping, the serial adjacent iSNVs (distance <50bp) that contains >5 iSNVs were further ltered.
PCR, sequencing errors and the potential contaminants induced in metagenomic sequencing always the big concern of next generation sequencing. To better detect these contaminants and sequencing error, 1) we rstly enrolled negative control and pharyngeal swab from the healthy to con rm whether we have false-positive of virus in the study.
2) The iSNVs patterns from each round were carefully examined, and we did not observe similar iSNVs pattern in each round. 3) We also examined the nucleotide stat in each site, we only detected 12/7037 iSNVs containing three polymorphic state. Compared to the substitution, the sequencing error might be randomized to occur nucleotide acid. 3) Sanger sequencing technology was used to validate the iSNVs result. We selected the major mutations occurred in one cluster to perform PCR-based Sanger sequencing, and displayed with a consistent result. 4) Considering the previous research 45 by PCR amplicon that the identi cation of iSNV with the threshold of MuAF >0.007 could be a <0.001 false discovery rate (FDR), we elevated the threshold of MuAF to 0.05, and the result should e cient reduce FDR, and avoid all sequencing errors and potential contaminant in the study.

Normalized iSNVs and mutation rate estimation
The iSNV number for each sample were normalized to iSNVs/kb, where only the available region to identify iSNVs were calculated. The iSNVs sites were calculated along the genome by removing the duplicated iSNVs in each patient, where iSNVs in different samples of the same patients were only calculated once. We also applied linear regression to evaluate the correlations between iSNVs and the time of outbreak and infections in each patient.
We The simulation of iSNV mutation position distribution was permutated with runif() function in R package, following the uniform distribution with iSNV number as parameters. Then we ranked the real iSNV and simulated position and calculated the genomic distance between two neighbor mutants. The genomic distances between two simulated iSNVs followed Poisson distribution, while that estimated iSNV biased from the simulation.
iSNV functional and epidemiological annotation The iSNVs were annotated by Perl scripts and compared with the genomic annotation le of reference genome Wuhan-Hu-1(NC_045512.2) from NCBI. The public SNP les was downloaded from public database 2019nCoVR 6,7 (https://bigd.big.ac.cn/ncov/variation/annotation ) on November 19, which collected the whole genome sequences from CNGBdb, GenBank, GISAID, GISAID, GWH and NMDC. To better understand the frequencies of SNPs that occurred in population, each iSNVs were compared to the levels of SNPs according to the frequency in 2019nCoVR at the same period of the sampling. Level I to III was according to previous de nition in public database, whereas level I represents the frequency was more than 0.05; level II represents the frequency was between 0.01 and 0.05; level III represents the frequency was less than 0.01. The iSNVs never appeared in public database, were named as level IV. Ka Epitope" and Organism set to "SARS-CoV-2" and Host set to "Humans".
High correlated iSNVs in genome by phasing analysis Pairwise phasing analysis was performed for adjacent iSNVs (distance <50bp) as previous reported 19 .
For a given pairwise iSNVs, reads harboring both positions were extracted from the alignment (SAM le). The reads with both sites mutated are referred to as phased reads, and those with only one site mutated are referred to as non-phased reads. The ratio of phased to non-phased reads that more than 0.9 was selected as phased iSNVs. Moreover, the phased alternative iSNVs allele frequency should >0.05. For the phased iSNVs in a same protein codon, we re-annotate the iSNVs with phased alleles.
In spite of short distance (<50 bp) in the genome, the correlation of long-distance (distance >50 bp) pairwise iSNVs were also calculated by linear adjust R-square of variation. High correlated and potential linked iSNVs should be identi ed in at least 3 samples, and with a high correlation of MuAFs (>0.6).
S protein structure analysis and molecular dynamics simulation All Molecular Dynamistic (MD) simulations were performed using AMBER 20 package. The crystal structure of binary complexes between hACE2 and RBD (PDB ID codes 6LZG) 25 was used as the initial structure in MD simulation. Protein was modeled with AMBER FF14SB 53 all-atom protein force eld and solvated by a truncated octahedron TIP3P 54 water box in which the boundary is at least 11 Å from any protein atoms. The solvated protein was neutralized and lled with a concentration of 0.13 M of KCl salt.
In these simulations, the SHAKE 55 algorithm with a relative geometric tolerance of 10 -5 was used to constrain all chemical bonds. Mass repartitioning was also applied, which means the mass of the heavy atom that the hydrogen is attached to is adjusted so that the total mass remains the same. Thus, all dynamics utilized a 4 fs time step. Long-range electrostatics was treated by the particle-mesh Ewald (PME) 56 method with default settings and a 9 Å direct space nonbonded cutoff was used in all simulations. The system was rst subjected to 10,000 steps of minimization and then gradually heated to 300 K under constant volume conditions in 1 ns. After another 5 ns of simulations using the constant isothermal-isobaric ensemble at 1 atm and 300 K, each system was equilibrated for an additional 10 ns. The Monte Carlo baristas and a weak-coupling thermostat were used. MD simulations were extended for 300 ns with coordinates recorded every 10 ps. The same procedure was followed for simulations of the antibody CB6 26 and RBD complex (PDB ID code 7C01 and 7BWJ). Nine RBD mutants: V407L, N422K, L452Q, E471K, V483F, G485V, Q493H, Q498H, Y505C bound to either hACE2 or CB6 were also simulated. The simulations of these systems were also followed the same procedure. In the analysis of the simulation trajectories, hydrogen-bond is de ned as a geometry with a cutoff length of 3.5 Å between the two heavy atoms of the hydrogen-bond donor and acceptor and an X-H· · ·Y (X and Y stand for heavy atoms) angle cutoff of 135°. A hydrogen-bond is counted when the distance between X and Y is less than 3.5 Å and the X-H· · ·Y angle is greater than 135°.

Package of pseudoparticles bearing spike protein from SARS-CoV-2 and variants
The codon-optimized S gene of SARS-CoV-2 (NC_045512.2) were constructed into pSecTag2/Hygro A plasmid and using as a template to generate S mutants following site-directed mutagenesis PCR. The various spike protein pseudovirus bearing luciferase reporter gene were packaged as reported previously 57 . In brie y, 24 h prior to transfection, 6×10 5 293T cells were plated per well in 6 well plates. All Using the HIV-1 p24 antigen quanti cation, we normalized the pseudotyped virus particles to the same amount (10 ng of p24   Figure 1 The selection process from iSNV to SNPs. (A) The mutations were initialized with random mutations. Considering the sequencing errors, the low allele frequencies of iSNVs (commonly <5%) were ignored in iSNVs analysis; and the high allele frequencies of iSNVs were considered as SNPs (>95%). (B) Two steps of mutation tness selection: from low frequency random mutations to iSNVs, which occurs within host and affected by RNA-editing and host immunity; and from iSNVs to SNPs, which occurs both intra-and inter-host process, including the tness selection and purifying selection. (C) The observed distribution of genomic distances between two random mutations should follow Poisson distribution, whereas that of iSNVs and SNPs are away from the expected distribution. The number of observed mutations was also decreased. (D) The diagram of observed mutations at random mutation level, iSNV level and SNP level.

Figure 2
The spatio-temporal analysis on the genomic sequencing and iSNVs locus identi ed in this study.  The histogram in the left shows the frequencies of iSNVs related to illness severity, and the iSNVs in red font are those related to severe outcomes. The iSNVs marked with star were those signi cantly differential distributed between server and non-server population. The mutations were distinguished by color (red: more severe patients observed, and grey: more non-severe patients observed). The dash line represents the average proportion of the severe patients. The plot on the right represents the number of cases with these iSNVs. The level of the frequencies of these iSNVs in public database were marked with different color of line (level I to IV: red, green, blue and purple).  Below are the mutations of amino acid residue in SARS-CoV-2, PCoV-GD, Bat-RaTG13, PCOV-GX and