Genome-wide mutation landscape of SARS-CoV-2

Background: SARS-CoV-2 has become a pandemic and researchers have built phylogenetic trees to trace the spread of the virus. However, the accumulation rate of variations and mutational hotspots remain largely unclear. Results: We collected more than 3,100 SARS-CoV-2 genome sequences from GISAID and profiled the landscape of whole genome variations. We detected 2,096 single nucleic variants (SNVs) and seven short deletions. 1,224 of them (58.4%) are missenses variation, altering the corresponding residues. We found the accumulation rate of SNVs in the current spreading situation is 6.36e-2/day. We found 15 missenses SNVs are extremely high frequent (existing in more than 100 genome sequences, p < 1e-5), effecting ORF1ab , S , ORF3a , M , ORF8 , and N . Moreover, one frequent substitution at locus 23,403 changes the 614 th amino acid of spike glycoprotein from D to G, potentially effecting the functions of this key protein. Conclusion: Our study provided the specific mutational patterns missenses high frequent SNVs effecting 6 genes of the virus, may promoting the adaption of the virus during evolving.

(807) one is C to T transitions and the second most frequent (327) is G > T transversion ( Supplementary Fig. S2). We annotated 2,096 SNVs and found 1,224 are missense variant , which is significant lower than random background constructing by shuffling all SNVs in the genome 500 times (p < 7.6e-6, Supplementary Fig. S3), suggesting SARS-CoV-2 is under selection pressure during COVID-19 pandemic. To investigate the accumulation rate of SNVs, we calculated average number of SNVs for genome sequences collecting at the same date, and used a linear regression to fit the mean SNV numbers with date.
We found a clear positive correlation and estimated the coefficient of date is 6.36e-2 (p = 3.12e-24, F-test) (Fig. 1), indicating a new SNV occurs in SARS-CoV-2 population per 16 days in current outbreak. All of these above evidences indicate the current reference genome of SARS-CoV-2 cannot reflect the genetic information of whole virus population, e.g. representing only one version of each locus and ignoring major variations in the population, which may complicate downstream analysis, such as metagenome analysis. It is necessary to construct a new SARS-CoV-2 reference genome incorporating known variants. Graph genome has been shown a good solution to address such problems [15][16][17].
Here, we constructed a graph genome of SARS-CoV-2 by variation graph toolkit [15] based on current reference genome and the frequent mutations (https://github.com/xjtu-omics/SARS-CoV-2_graph_genome) to facilitate future analysis in this pandemic.

SNV hotspots in SARS-CoV-2
We compared the SNV frequency with the random profiles, and detected 244 significantly high frequent SNVs (p < 1e-5, Supplementary Table S1). We found the high frequent SNVs are significantly enriched in last two 1kb bins (p < 1e-5 for bin 28,000-29,000 and 29000-29,303, Supplementary Table S2), which contain gene ORF8, N, ORF10, and 3'UTR of the virus. Of these high frequent SNVs, we further detected 22 SNVs occurred in more than 100 genome sequences (Fig. 2, Supplementary Table S3). All of these 22 hotspots related with six genes, including 13 hotspots locate at region of ORF1ab, three hotspots locate at region of N, two hotspots locate at region of ORF3a, one hotspot locate at region of M and one locate at region of ORF8 ( Fig. 2A). We annotated 22 hotspots SNVs and found five synonymous, fifteen missense, one at 5'UTR and regulatory region of SARS-CoV-2 genome (locus 241, C > T, in 1,649 genome sequences) and one stop lost SNV (locus 20,268, A > G, in 117 genome sequences), leading to the elongation of ORF1ab protein ( Fig. 2A).
To further explore the relations between SNV hotspots and geographic locations, we did an enrichment analysis and found different continents had specific enrichment patterns on SNV hotspots (fold change (FC) > 1, p < 0.01,  Feb. 20 th , 2020, this variation gradually dominates (Fig. 3A).
Furthermore, this missense variant alters the 614 th amino acid from Asp (D) to Gly (G) of spike glycoprotein (coded by S gene), the so-called S protein initiating virus entering human cells [10]. We analyzed the mutation sensitivity of each amnio acid in spike glycoprotein by Phyre2 [18], and found the mutation sensitivity score of the 614 th amino acid is 4 (Fig. 3B, Supplementary Table S6), indicating potential functional alternation of the spike glycoprotein. Whether this particular mutant strain has evolved to alter infection efficiency remains unclear.

Discussion
We analyzed more than 3,100 SARS-CoV-2 genome sequences, and detected SNV and indel mutations across whole genome. The graph genome of SARS-CoV-2 incorporates high frequent mutations, facilitating downstream analysis.
We found 22 SNV hotspots existing in more than 100 genome sequences, and 15 of them are missense variants. The SNV at locus 23,403, observed in 1,656 genome sequences (52.4%), alters the 614 th amino acid of spike glycoprotein from D to G, might altering its function. However, this mutation does not appear at the interface between virus S protein and human ACE2 but in the flexible hinge region of S protein, requiring further functional investigation. Besides spike glycoprotein, proteins coded by ORF1ab, N, M, ORF3a, ORF8 are also affected by SNV hotspots (Fig. 2A, Supplementary Table S3) and their implications on infection and mortality rates require further investigation. Although 3' to 5' exoribonuclease of ExoN in coronavirus [19] in principle renders lower mutation rate than other RNA virus, the observed accumulation of continent specific mutational hotspots sounds an alarm that immunity gained from either vaccines or previous infection might not be sufficient to prevent future outbreaks of COVID-19.

Conclusions
In this study, we analyzed the genome-wide mutations of SARS-CoV-2 and detected 2,096 SNVs and seven short deletions. Based on the date of mutation occurring, we calculated the accumulation rate of SNVs in current spreading situation is 6.36e-2/day. Of the mutations, we found 22 SNV hotspots, affecting amino acid sequence of 6 genes, including ORF1ab, N, M, ORF3a, ORF8, and S.
The SNV hotspots show a clear continent specific patterns, may resulting in different pandemic situations in different continent. The SNV hotspot at locus of spike glycoprotein from D to G, may resulting the structure and functional alternation and promoting the adaption of the virus during evolving.

Mutation calling
We aligned 3,160 complete sequences to reference genome by BWA [20] and called mutations with SAMtools [21] and BCFtools. We used snpEff [22] to annotate the mutations.

Graph genome construction
We removed the singleton mutations, which occurred only in one strain of SARS-CoV-2 and built its graph genome with variation graph toolkit [15] based on the rest of mutations.

SNV frequency enrichment analysis
We constructed 500 random mutation profiles for each genome sequence with keeping the same mutation numbers. We calculated the mean i and standard

Genomic bins enrichment of high frequent SNVs
We shuffled the high frequent SNVs across whole genome 500 times to construct the background SNV distributions. We split whole genome into 30 nonoverlapped 1kb bins, and counted SNV numbers in each bin for both real and background SNV distributions. For the i-th bin, we calculated the p value by

SNV hotspots location enrichment analysis
For each highly frequent SNV, we calculated the FC of each continent, and did the fish exact test to calculate the p value by fisher.test function in R. The FC was calculated as following.

Statistical analysis
The statistical analysis were performed in R environment (https://www.rproject.org/).
The authors declare they have no competing interests.   The genomic annotation plot is downloaded from UCSC genome browser. B.
Continent specific enrichment of SNV hotspots. The red block means significant enrichment with p < 0.01, fisher-exact test. The column label is "locus, SNV type", e.g. 28,144, T > C denotes T to C substitution at genome locus 28,144.    proportion of genome sequences with this substitution. To avoid the bias, we removed the date with smaller than 10 genome sequences. B. Mutation sensitivity colored spike glycoprotein structure. The red region in the structure is the 614 th amino acid, which is change from D to G since this SNV. The mutation sensitivity score of the 614 th amino acid is 4.