Comparison Of SARS-CoV-2 Virus Variant Genomes Detected In China and USA

During the worldwide spread of SARS-CoV-2, the virus that causes COVID-19, the viral genome has been observed to mutate, and these mutations generate new SARS-CoV-2 variants. In this study we investigated the variant genomes detected in China and the USA. Publicly available SARS-CoV-2 genomes detected in human hosts were multi-aligned and the results are reported. The data set contains 87 genomes for the China variants and 200 for the USA variants. The analyses were carried out for each domain of the genome. The results show that the variant genomes in the two groups share some mutation characteristics but also have features that differ between the groups. The nucleotide mutations at positions 8782 (C>T) and 28144 (T>C) are common to both variant groups. However, additional mutation positions were detected in the USA variant group. The percentage of missense mutations detected in the USA variants is higher than the percentage of synonymous mutations, whereas for the variants detected in China the percentage of synonymous mutations is higher than that of missense mutations. Additionally, the domains in which the most mutations are detected are regions that affect the interaction of the virus with the host.


Background
A new coronavirus, called SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), was discovered in December 2019 ([1]). This name was chosen because the virus is genetically related to the coronavirus responsible for the SARS outbreak of 2003 ([2]). Although related, the two viruses differ from each other. The disease caused by the virus is named COVID-19 (Coronavirus Disease-19). The virus was first discovered in the city of Wuhan, China. Because the virus is highly infectious, it spread first in China and then to many countries globally. The first confirmed case of COVID-19 infection in the USA was reported on January 20, 2020 ([3]). As of May 28, 2020, the virus had affected 213 countries and territories around the world and two international conveyances, with 5,814,886 cases including 357,984 deaths ([4]).
Coronaviruses are RNA viruses with a positive-stranded genome of approximately 30 kb. Two-thirds of the genome consists of a large open reading frame (ORF) called ORF1ab ([10]). This large ORF consists of two large replicase sub-ORFs, ORF1a and ORF1b. The 3' end of the coronavirus genome includes several structural and accessory protein genes. These domains are a spike (S) glycoprotein gene, an envelope (E) protein gene, a membrane (M) glycoprotein gene, a nucleocapsid (N) phosphoprotein gene and several ORFs that encode putative non-structural proteins (NSP) ([11]). Figure 1 shows the gene structure of the SARS-CoV-2 genome. The ORF1ab region contains 16 non-structural proteins, and S contains two spike protein sub-genes, S1 and S2. The genome positions and sizes of the domains are listed in Table 1 with short descriptions ([12]).

Domain Name: NSP1 (Non-structural protein 1)
Position: 266-805
Size: 540
Short Description: Inhibits host translation by interacting with the 40S ribosomal subunit. The NSP1-40S ribosome complex further induces an endonucleolytic cleavage near the 5' UTR of host mRNAs, targeting them for degradation. Viral mRNAs are not susceptible to NSP1-mediated endonucleolytic RNA cleavage thanks to the presence of a 5'-end leader sequence and are therefore protected from degradation. By suppressing host gene expression, NSP1 facilitates efficient viral gene expression in infected cells and evasion from host immune response.

Domain Name: NSP2 (Non-structural protein 2)
Position: 806-2719
Size: 1914
Short Description: May play a role in the modulation of host cell survival signaling pathway by interacting with host PHB and PHB2. Indeed, these two proteins play a role in maintaining the functional integrity of the mitochondria and protecting cells from various stresses.

Domain Name: NSP3 (Non-structural protein 3)
Position: 2720-8554
Size: 5835
Short Description: Responsible for the cleavages located at the N-terminus of the replicase polyprotein.

Some mutations are synonymous, meaning that the nucleotide change does not alter the encoded amino acid. Others are missense, meaning that the nucleotide change alters the encoded amino acid. Although other mutation types exist as well ([8], [9]), in this study we concentrated mostly on these two types in the coding domains of the variant genomes.
Some early studies analyzed SARS-CoV-2 variants detected in a specific location, especially China ([6], [10]). In this paper we compared the variation characteristics of two variant genome groups gathered from two distinct locations, China and the USA. The variant genomes in both groups were multi-aligned, and statistical characteristics were obtained for each domain of the SARS-CoV-2 genome. The analysis results are given in Sect. 2. The discussion and conclusions are given in Sect. 3 and 4, respectively. In Sect. 5, the material and methods are described. Finally, the abbreviations are given in Sect. 6.

Results
We first investigated the non-coding parts, also called untranslated regions (UTR), of the SARS-CoV-2 genome for all the variant genomes. The first coding domain of the genome is NSP1 (Fig. 1), and its starting nucleotide position is 266. Therefore, the first 265 nucleotides are taken as the non-coding region (5' UTR). The most common nucleotide mutation types and the corresponding mutation locations are shown in Fig. 2, which distinguishes one "No Mutation" case, 12 nucleotide substitution cases and four deletion cases. The percentage value for the "No Mutation" type is the percentage of variants that have the same nucleotide as the reference genome in the investigated genome region. The percentage values within the brackets indicate the share of the corresponding mutation type among all the most common mutations in the investigated genome region. For example, 22% of the most commonly detected mutations in the 5' UTR are C > T mutations. This graphical display method is used for expressing mutation percentages in the rest of the paper.
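As an illustration of how such per-position results could be derived, the following Python sketch scans aligned sequences and reports the most common change at each position. It is not the MATLAB code used in this study. The function name `most_common_mutations`, the toy sequences, and the choice to express each percentage relative to the number of variants are assumptions made for the example.

```python
from collections import Counter


def most_common_mutations(reference, variants):
    """Return (position, mutation, percent) for every mutated position.

    Assumes all sequences are already aligned to the same length and
    that the reference row contains no gaps. Positions are 1-based.
    """
    results = []
    for i, ref_base in enumerate(reference):
        # Count each variant base that differs from the reference here;
        # a '-' in a variant is recorded as a deletion, e.g. "A>-".
        changes = Counter(
            f"{ref_base}>{v[i]}" for v in variants if v[i] != ref_base
        )
        if changes:
            mutation, count = changes.most_common(1)[0]
            results.append((i + 1, mutation, 100.0 * count / len(variants)))
    return results


# Toy data, invented for illustration only.
reference = "ACGTA"
variants = ["ACGTA", "ATGTA", "ATGT-", "ACGTA"]
for position, mutation, percent in most_common_mutations(reference, variants):
    print(position, mutation, f"{percent:.1f}%")
```

Positions where every variant matches the reference are simply omitted from the result, which corresponds to the "No Mutation" case.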
The last domain (ORF10) of the SARS-CoV-2 genome ends at nucleotide 29674. The rest of the genome sequence from this point is also taken as a non-coding region (3' UTR). The most common nucleotide mutation types and the corresponding mutation locations are shown in Fig. 3. For this analysis we used the same reference and variant genome data as in the 5' UTR analysis.
The mutations occur mostly in the last 40 nucleotides of this non-coding region. The most frequent mutation types are T > C (35.61%), A > - deletion (32.87%) and C > T (19.17%), as in the 5' UTR. The mutation percentages at other nucleotide positions are below 10%.

Analysis of Mutations in SARS-CoV-2 China Variants
After investigating the non-coding regions of the SARS-CoV-2 genome, we investigated the coding regions. As the first step, we considered only the variants gathered from China. The total number of variants in the China group is 87. The most common nucleotide mutation percentages, the mutation types, the corresponding mutation locations and the domain boundaries are shown in Fig. 4. The dashed green vertical lines show the boundaries of the reference SARS-CoV-2 genome domains. The names of the genomic domains are shown at the top of the graph. The percentage value next to each domain name is the ratio of the number of mutations in that domain to the total number of mutations in all variant genomes, calculated by using the formula shown in (1).
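The domain percentage described above (mutations in a domain divided by all mutations, multiplied by 100) can be sketched in Python as follows. This is a minimal illustration, not the code used in this study. The domain boundaries are the first three position ranges of Table 1, the mutation positions are invented, and counting only mutations that fall inside the listed domains is an assumption of the sketch.

```python
def domain_mutation_percentages(mutation_positions, domains):
    """Percentage of all counted mutations that fall inside each domain.

    `domains` maps a domain name to its inclusive (start, end) range.
    """
    counts = {name: 0 for name in domains}
    for pos in mutation_positions:
        for name, (start, end) in domains.items():
            if start <= pos <= end:
                counts[name] += 1
                break
    total = sum(counts.values())
    return {
        name: (100.0 * count / total if total else 0.0)
        for name, count in counts.items()
    }


domains = {"NSP1": (266, 805), "NSP2": (806, 2719), "NSP3": (2720, 8554)}
positions = [300, 1000, 1500, 2000, 8000]  # hypothetical mutation positions
print(domain_mutation_percentages(positions, domains))
```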
Figure 4 shows that mutations at two nucleotide positions dominate the results: 8782 (C > T) and 28144 (T > C). This result is consistent with previous studies ([6], [7], [10]). All other position-specific mutations have percentage values below 10%. The most common nucleotide mutations are T > C (30.39%) and C > T (28.68%). When the nucleotide positions of the mutations are considered, the total number of mutations is very low for some genome domains. The domains with mutation percentages below 2% are NSP7, NSP9-11, S, ORF3a, E, M, ORF6, ORF7b, ORF8 and ORF10. The domain mutation percentages calculated by the formula given in (1) are shown in Fig. 5.
The domains differ in their number of nucleotides. Therefore, the mutation density of each domain can also be considered. In this step the nucleotide mutation density of each domain is calculated by using formula (2), and the results are shown in Fig. 6. For the variants detected in China, the densest domain is NSP16 and the least dense domain is NSP11.
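A minimal Python sketch of a per-domain mutation density is given below. It assumes that formula (2) divides the number of mutations in a domain by the number of nucleotides in that domain and multiplies the result by 100; this reading, the function name and the mutation counts are assumptions made for illustration, not values from this study.

```python
def domain_mutation_density(mutation_counts, domains):
    """Mutations per 100 nucleotides for each domain.

    `domains` maps a domain name to its inclusive (start, end) range,
    so the domain length is end - start + 1.
    """
    density = {}
    for name, (start, end) in domains.items():
        length = end - start + 1
        density[name] = 100.0 * mutation_counts.get(name, 0) / length
    return density


domains = {"NSP1": (266, 805), "NSP2": (806, 2719)}
counts = {"NSP1": 27, "NSP2": 957}  # invented counts, for illustration only
print(domain_mutation_density(counts, domains))
```

Normalizing by length prevents a long domain from appearing mutation-rich only because it contains more nucleotides.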
Not all nucleotide mutations cause an amino acid change in the translation step of gene expression. In synonymous mutations, the nucleotide changes but the altered codon encodes the same amino acid as the original codon. On the other hand, some nucleotide mutations change the encoded amino acid. Mutations of this type, called missense mutations, may change the structure of the resulting protein. For this reason, missense mutations are more critical. Figure 7 shows the most common mutation types in the coding region of the SARS-CoV-2 genome variants detected in China. The red dots indicate the most common mutation types at the specific nucleotide positions. The red dots on the "No Mutation" line are the nucleotide locations where no mutation was detected among all variants. The percentage value next to the "No Mutation" label is the ratio of non-mutated nucleotides to the total number of nucleotides in the entire genome. Likewise, the red dots on the "Missense" and "Synonymous" lines show the nucleotide positions where missense and synonymous mutations are detected, respectively. The percentage values next to the "Missense" and "Synonymous" labels are the ratios of missense and synonymous mutations, respectively, to the total number of mutations in the entire genome.
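The distinction between synonymous and missense mutations can be illustrated with a short Python sketch that translates the affected codon before and after a substitution using the standard genetic code. The function name `classify_substitution` and the six-nucleotide toy sequence are assumptions made for the example. Changes to or from a stop codon are not treated separately here, which is a simplification.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(coding_seq, offset, new_base):
    """Classify a substitution at 0-based `offset` of an in-frame coding sequence.

    Returns ("synonymous" or "missense", "old_aa>new_aa").
    """
    pos_in_codon = offset % 3
    codon_start = offset - pos_in_codon
    old_codon = coding_seq[codon_start:codon_start + 3]
    new_codon = old_codon[:pos_in_codon] + new_base + old_codon[pos_in_codon + 1:]
    old_aa, new_aa = CODON_TABLE[old_codon], CODON_TABLE[new_codon]
    kind = "synonymous" if old_aa == new_aa else "missense"
    return kind, f"{old_aa}>{new_aa}"


# Toy coding sequence ATG TTA (Met, Leu), invented for illustration.
print(classify_substitution("ATGTTA", 4, "C"))  # TTA -> TCA
print(classify_substitution("ATGTTA", 5, "G"))  # TTA -> TTG
```

In the toy sequence, a T > C change at the second codon position turns leucine into serine (missense), whereas an A > G change at the third position leaves leucine unchanged (synonymous).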
However, the domains differ in their number of nucleotides. Therefore, the mutation density of each domain can also be considered by using the formulas shown in (5) and (6). The mutations occur most densely in the NSP8 domain (Fig. 9).
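\[ \text{synonymous mutation density of a domain} = \frac{\text{number of synonymous mutations in the domain}}{\text{number of nucleotides in the domain}} \cdot 100 \tag{5} \]

\[ \text{missense mutation density of a domain} = \frac{\text{number of missense mutations in the domain}}{\text{number of nucleotides in the domain}} \cdot 100 \tag{6} \]

As noted above, nucleotide mutations may affect the resulting amino acid sequence. To analyze this, the amino acid sequence translated from the reference genome was compared with the amino acid sequences of all the variants, and the most common amino acid changes were detected. The resulting graph is shown in Fig. 10. In other words, Fig. 10 shows the percentages of amino acid changes that are due to missense mutations.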
Amino acids can be categorized into four classes according to the polarity of their R groups: amino acids with non-polar R groups (A, V, L, I, P, F, W, M), amino acids with polar R groups (G, S, T, Y, C, N, Q), amino acids with negatively charged R groups (D, E) and amino acids with positively charged R groups (K, R, H). A nucleotide mutation may change an amino acid in one category into an amino acid in a different category. The results of the category change analysis are shown in Fig. 11. The most frequently detected amino acid change type is (non-polar R group) > (non-polar R group). This kind of change is mostly detected in the NSP12 domain, which supports the analysis result shown in Fig. 8.
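The four categories listed above translate directly into a lookup table. The following Python sketch maps each amino acid to its category and tallies category changes. The list of amino acid changes is invented for illustration and the function name `category_change` is an assumption of the example.

```python
from collections import Counter

# Categories and members exactly as listed in the text.
CATEGORIES = {
    "non-polar": "AVLIPFWM",
    "polar": "GSTYCNQ",
    "negatively charged": "DE",
    "positively charged": "KRH",
}
AA_TO_CATEGORY = {
    aa: name for name, members in CATEGORIES.items() for aa in members
}


def category_change(old_aa, new_aa):
    """Describe an amino acid change as a change between R-group categories."""
    return f"({AA_TO_CATEGORY[old_aa]} R group) > ({AA_TO_CATEGORY[new_aa]} R group)"


# Hypothetical missense changes (old amino acid, new amino acid).
changes = [("L", "S"), ("P", "L"), ("A", "V"), ("D", "G")]
tally = Counter(category_change(old, new) for old, new in changes)
for label, count in tally.most_common():
    print(label, count)
```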
To show the total amino acid change counts in each domain for each amino acid change type, we used the color-coded display shown in Fig. 12, together with a color map bar. This display gives information about the relative quantities of the resulting mutation types. For example, the synonymous change in the NSP2 domain is the most frequent of all the changes. According to the color map on the right side, the exact synonymous change count in the NSP2 domain is 74; that is, 74 synonymous changes are detected in the NSP2 domain. The count for the (non-polar R group) > (non-polar R group) change type in the NSP12 domain is higher than all the other missense mutation counts in all domains, which is consistent with the results shown in Fig. 11.

Analysis of Mutations in SARS-CoV-2 USA Variants
After examining the variants detected in China, as the second step we investigated the SARS-CoV-2 variants detected in the USA. The total number of variant genomes obtained for the USA is 200. The most common nucleotide mutation types, the corresponding mutation locations and the domain boundaries are shown in Fig. 13.
As in the China case (Fig. 4), high nucleotide mutation percentages are detected at positions 8782 (C > T) and 28144 (T > C) for the variants detected in the USA. This result is consistent with previous studies ([6], [7], [10]). However, there are 8 additional mutation positions with mutation percentages greater than 10%. These nucleotide positions and the corresponding most common nucleotide mutation types are listed in Table 2. The most common nucleotide mutation type is C > T (40.95%). This result differs from that of the China variants, for which the most common nucleotide mutation type was T > C (30.39%). The domain-based mutation percentages and densities for the USA variants are shown in Fig. 14 and Fig. 15. Comparing the China and USA cases (Fig. 5 vs Fig. 14 and Fig. 6 vs Fig. 15) shows that the domains containing the most common nucleotide mutation positions are not consistent between the two variant groups.
The percentages of synonymous and missense mutations in each domain are shown in Fig. 17. The graphs are sorted in descending order of the missense mutations. Comparing Fig. 17 with Fig. 8 shows that NSP2, NSP3 and NSP12 are among the first four domains for both variant groups. Finally, the mutation density graphs for synonymous and missense mutations are shown in Fig. 18. For the variants detected in the USA, the missense mutations occur most densely in the ORF8 domain. In the China case, ORF8 was the fourth densest domain (Fig. 9).
The amino acid sequence translated from the reference genome was compared with the amino acid sequences of all the variants, and the most common amino acid changes were detected for the USA variants. The resulting graph is shown in Fig. 19. As in the China case (Fig. 10), the USA variants have an amino acid change with a percentage above 10% in the ORF8 domain. This change is again from the "L" amino acid to the "S" amino acid. However, there are 5 additional amino acid changes with percentages above 10%. These missense mutation results are listed in Table 3.
The amino acid category change results for the USA variants are shown in Fig. 20. As in the China case, the most frequently detected amino acid change type is (non-polar R group) > (non-polar R group).
The total amino acid change counts in each domain for each amino acid change type are shown in Fig. 21 with a color-coded display. As in the China case (Fig. 12), the count for the (non-polar R group) > (non-polar R group) change type in the NSP12 domain is again higher than all the other missense mutation counts in all domains. However, for the USA variants there are also other prominent amino acid change counts, such as the (non-polar R group) > (non-polar R group) change type in the NSP3 domain and in the N domain.

Methods
The links to the genomic data of the reference genome and the variants are publicly available on a COVID-19 dedicated web page of China's National Genomics Data Center (NGDC) ([13]). As the first step we gathered the genomes detected in China and the genomes detected in the USA. The numbers of variant genomes in the China and USA groups are 87 and 200, respectively. Because most of the analyses are percentage based, the difference between the group sizes does not affect the results dramatically.
As the second step, we multi-aligned the variants by using the Bioinformatics Toolbox of MATLAB. The gap penalty and the gap extension penalty were set to 10 and 0.5, respectively. In the third step we used NC_045512 ([12]) as the reference genome and compared the multi-aligned variant genomes with it. The main reason for using this genome as the reference is to make the results comparable with previous studies ([6], [7], [10]), which also use it as the reference genome.
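When aligned variants are compared with a reference, alignment columns need to be expressed in reference coordinates (for example, position 8782 or 28144). If the reference row of the alignment contains gaps, column indices and reference positions no longer coincide. The following Python sketch shows one way to build such a mapping. Whether the reference was part of the multiple alignment in this study is not stated, so the scenario, the function name and the toy sequence are assumptions.

```python
def column_to_reference_position(aligned_reference):
    """Map each alignment column to a 1-based reference position.

    Columns where the reference row has a gap ('-') are insertions
    relative to the reference and are mapped to None.
    """
    mapping = []
    position = 0
    for base in aligned_reference:
        if base == "-":
            mapping.append(None)
        else:
            position += 1
            mapping.append(position)
    return mapping


# Toy aligned reference row with one insertion column.
print(column_to_reference_position("AC-GT"))
```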

Discussion
In this study we showed some statistical characteristics of two SARS-CoV-2 variant groups detected in two different locations, China and the USA. The nucleotide mutations at positions 8782 (C > T) and 28144 (T > C) are common to both variant groups. However, additional mutation positions were detected in the USA variant group. The domains in which the most common nucleotide mutations are detected are not consistent between the two groups. Nevertheless, the most frequent amino acid change type is (non-polar R group) > (non-polar R group) for both the China and the USA variants. The variant group detected in China shows a higher percentage of synonymous mutations than missense mutations, whereas the opposite is true for the variant group detected in the USA. This causes the number of missense mutations to increase in the USA variant group. It is expected that, as long as the virus continues to spread, other mutation sites and missense mutations may also appear. Because COVID-19 had spread more widely in the USA than in China as of May 2020 ([14]), the results obtained for the USA variant group support this expectation.
If we consider the analysis results for all variants (both China and USA) and focus on missense mutations, there are three domains in which mutations leading to amino acid changes are most prominent: ORF8, S and ORF3a (Fig. 10 and Fig. 19). As shown in Table 1, ORF8 may play a role in host-virus interaction; the S glycoprotein attaches the virion to the cell membrane by interacting with the host receptor and has a role in initiating the infection; and ORF3a forms homotetrameric potassium-sensitive ion channels (viroporin) and may modulate virus release. The common feature of these domains is that they directly modulate the interaction of the SARS-CoV-2 virus with the host. Phylogenetic analysis suggested that the corresponding virus family probably originated in bats and spread to people ([15]), and the genome of the virus continues to mutate. This result indicates that the domains in which the most mutations are detected are regions that affect the interaction of the virus with the host, and that the mutations in these domains probably also play a role in cross-host evolution.

Conclusions
As of May 2020, almost all the countries of the world are fighting against the SARS-CoV-2 virus and the corresponding disease, COVID-19. It seems that this will be a long fight until the required vaccines or other effective treatment methods are developed. However, we believe that investigating the variants of the virus and sharing knowledge about them will be effective in the fight against the virus. In this study we tried to show the differences between two variant groups detected in only two locations, China and the USA. This type of analysis must continue for more variants in order to better understand the virus and prevail in this fight.

Figure 2
To make this analysis we compared the reference sequence with 287 variant genomes (87 for China, 200 for the USA) of the SARS-CoV-2 virus. The solid line shows the percentage of the most common mutation at each nucleotide position of the genome. The nucleotide positions are shown on the x-axis, and the mutation percentage values on the left side of the figure. The points show the most common nucleotide mutation type at each nucleotide position; the mutation types are shown on the right side of the figure. There are 17 different cases.

Figure 3
Figure 5
Figure 7
Figure 8
Figure 9
Figure 10
Figure 14
Figure 15

Figure 20

As the virus is easily transmitted from person to person via droplets ([5]), more patients are infected and, as a result, the virus is expected to accumulate more variants. So, while the virus continues to spread globally, the genome of the virus mutates, and this mutation process generates new SARS-CoV-2 variants ([6]).

Table 1 (continued)

Domain Name: NSP3 (continued)
Short Description: In addition, PL-PRO possesses a deubiquitinating/deISGylating activity and processes both 'Lys-48'- and 'Lys-63'-linked polyubiquitin chains from cellular substrates. Participates together with NSP4 in the assembly of virally-induced cytoplasmic double-membrane vesicles necessary for viral replication. Antagonizes innate immune induction of type I interferon by blocking the phosphorylation, dimerization and subsequent nuclear translocation of host IRF3. Prevents also host NF-kappa-

Domain Name: NSP8
Size: 594
Short Description: Forms a hexadecamer with NSP7 (8 subunits of each) that may participate in viral replication by acting as a primase. Alternatively, may synthesize substantially longer products than oligonucleotide primers.

Domain Name: E
Short Description: ... virus morphogenesis and assembly. Acts as a viroporin and self-assembles in host membranes forming pentameric protein-lipid pores that allow ion transport. Also plays a role in the induction of apoptosis.

Domain Name: ORF6
Short Description: Could be a determinant of virus virulence, since, when expressed in an otherwise attenuated JHM strain of murine coronavirus, it can dramatically increase the lethality of the latter. Seems to stimulate cellular DNA synthesis in vitro.

Domain Name: ORF8
Size: 366
Short Description: May play a role in host-virus interaction.

Domain Name: N
Short Description: Packages the positive strand viral genome RNA into a helical ribonucleocapsid (RNP) and plays a fundamental role during virion assembly through its interactions with the viral genome and membrane protein M. Plays an important role in enhancing the efficiency of subgenomic viral RNA transcription as well as viral replication.

Table 2
Nucleotide Positions of the Most Common Nucleotide Mutations in the Variant Genomes Detected in the USA.

Table 3
The Most Frequent Amino Acid Changes Caused by Missense Mutations in USA Variants.