We first investigated the non-coding parts of the SARS-CoV-2 genome, also called untranslated regions (UTRs), across all variant genomes. The first coding domain of the genome is NSP1 (Fig. 1), which starts at nucleotide position 266. The first 265 nucleotides are therefore taken as a non-coding region (5’ UTR). The most common nucleotide mutation types and the corresponding mutation locations are shown in Fig. 2. For this analysis we compared the reference sequence with 287 variant genomes of SARS-CoV-2 (87 detected in China and 200 detected in the USA).
The solid line shows the percentage of the most common mutation at each nucleotide position of the genome. The nucleotide positions are shown on the x-axis of the graph, and the mutation percentage values are shown on the left side of the figure.
The points, on the other hand, show the most common nucleotide mutation type at each nucleotide position. The nucleotide mutation types are shown on the right side of the figure. There are 17 cases: one “No Mutation” case, 12 substitution cases between nucleotide types and four deletion cases. The percentage value for the “No Mutation” type is the percentage of variants that have the same nucleotide as the reference genome in the investigated genome region. The percentage values in brackets indicate the share of the corresponding mutation type among all the most common mutations in the investigated genome region. For example, 22% of the most commonly detected mutations in the 5’ UTR are C > T mutations. This graphical display method is used to express mutation percentages in the rest of the paper.
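The per-position summary described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the variant sequences are already aligned to the reference and have the same length, that deletions are encoded as `-`, and that the percentage is taken over all variants. The function name and the toy sequences are hypothetical.

```python
from collections import Counter

def most_common_mutations(reference, variants):
    """For each position, return (label, percent) of the most common
    non-reference state among the variants.

    Assumes every variant is aligned to the reference (same length) and
    that a deletion is written as '-'. Positions where all variants match
    the reference are reported as ("No Mutation", 0.0).
    """
    results = []
    for i, ref_base in enumerate(reference):
        # Count only the variants that differ from the reference here.
        counts = Counter(v[i] for v in variants if v[i] != ref_base)
        if not counts:
            results.append(("No Mutation", 0.0))
            continue
        alt, n = counts.most_common(1)[0]
        results.append((f"{ref_base} > {alt}", 100.0 * n / len(variants)))
    return results

# Toy data (hypothetical, not taken from the study).
reference = "ACGT"
variants = ["ACGT", "ATGT", "ATG-", "ACGT"]
profile = most_common_mutations(reference, variants)
for position, (label, percent) in enumerate(profile, start=1):
    print(position, label, percent)
```

In this sketch the label plays the role of the points in the figure and the percentage plays the role of the solid line.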
The last domain of the SARS-CoV-2 genome (ORF10) ends at nucleotide 29674. The rest of the genome sequence beyond this point is also taken as a non-coding region (3’ UTR). The most common nucleotide mutation types and the corresponding mutation locations are shown in Fig. 3. For this analysis we used the same reference and variant genomes as in the 5’ UTR analysis. The mutations occur mostly in the last 40 nucleotides of this non-coding region. The most frequent mutation types are T > C mutation (35.61%), A > - deletion (32.87%) and C > T mutation (19.17%), similar to the 5’ UTR. The mutation percentages at the other nucleotide positions are below 10%.
2.1 Analysis of Mutations in SARS-CoV-2 China Variants
After investigating the non-coding regions of the SARS-CoV-2 genome, we investigated its coding regions. As the first step, we considered only the variants gathered from China. The total number of variants in the China group is 87. The most common nucleotide mutation percentages, the mutation types, the corresponding mutation locations and the domain boundaries are shown in Fig. 4. The dashed green vertical lines show the domain boundaries of the reference SARS-CoV-2 genome. The names of the genomic domains are shown at the top of the graph. The percentage value next to each domain name is the ratio of the number of mutations in that domain to the total number of mutations in all variant genomes. The percentage values are calculated by using the formula shown in ( 1 ).
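The domain percentage described in words above can be sketched as follows. This is only an illustration of the verbal definition (mutations in a domain divided by all mutations, times 100). The function name, the domain names and the coordinates are hypothetical placeholders and are not the genome annotation used in the study.

```python
def domain_mutation_percentages(mutation_positions, domains):
    """Share of all detected mutations that fall inside each domain.

    mutation_positions: list of 1-based nucleotide positions, one entry
        per detected mutation (a position may appear more than once).
    domains: dict mapping a domain name to an inclusive (start, end) pair.
    """
    total = len(mutation_positions)
    percentages = {}
    for name, (start, end) in domains.items():
        in_domain = sum(1 for p in mutation_positions if start <= p <= end)
        percentages[name] = 100.0 * in_domain / total if total else 0.0
    return percentages

# Toy coordinates and positions (hypothetical, for illustration only).
toy_domains = {"domain_A": (1, 10), "domain_B": (11, 30)}
toy_mutations = [3, 12, 20, 25]
shares = domain_mutation_percentages(toy_mutations, toy_domains)
print(shares)
```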
Fig. 4 shows that the mutations at two nucleotide positions dominate the results. These positions are 8782 (C > T mutation) and 28144 (T > C mutation). This result is consistent with previous studies ([6], [7], [10]). All other mutations at specific nucleotide positions have percentage values below 10%. The most common nucleotide mutations are T > C (30.39%) and C > T (28.68%). When the nucleotide positions of the mutations are considered, the total number of mutations is very low for some genome domains. The domains with mutation percentages below 2% are NSP7, NSP9-11, S, ORF3a, E, M, ORF6, ORF7b, ORF8 and ORF10. The domain mutation percentages calculated with the formula given in ( 1 ) are shown in Fig. 5.
The domains differ in length. Therefore, the mutation density of each domain can also be considered. In this step the nucleotide mutation density of each domain is calculated by using formula ( 2 ), and the results are shown in Fig. 6. For the variants detected in China, the densest domain is NSP16 and the least dense domain is NSP11.
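As a rough illustration of why density can rank domains differently from raw percentages, the sketch below divides the mutation count of each domain by its length in nucleotides. This normalization is an assumption made for the example and may differ from the paper's formula ( 2 ). The function name, domain names and coordinates are hypothetical.

```python
def domain_mutation_density(mutation_positions, domains):
    """Mutations per nucleotide for each domain.

    Assumes density = (mutations inside the domain) / (domain length),
    with domains given as inclusive 1-based (start, end) pairs.
    """
    densities = {}
    for name, (start, end) in domains.items():
        length = end - start + 1
        count = sum(1 for p in mutation_positions if start <= p <= end)
        densities[name] = count / length
    return densities

# Toy data (hypothetical): the short domain holds fewer mutations in
# total but is denser than the long one.
toy_domains = {"short_domain": (1, 10), "long_domain": (11, 110)}
toy_mutations = [2, 5, 20, 40, 60, 80]
density = domain_mutation_density(toy_mutations, toy_domains)
print(density)
```

Here the long domain contains four of the six mutations, yet the short domain has the higher density, which is the effect the density analysis is meant to expose.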
Not every nucleotide mutation causes an amino acid change during the translation step of gene expression. In a synonymous mutation, the nucleotide changes but the altered codon encodes the same amino acid as the original codon. Other nucleotide mutations change the encoded amino acid. These mutations, called missense mutations, may change the structure of the resulting protein and are therefore more critical. Figure 7 shows the most common mutation types in the coding region of the SARS-CoV-2 genome variants detected in China. The red dots indicate the most common mutation type at each nucleotide position. The red dots on the “No Mutation” line mark the nucleotide positions where no mutation was detected in any variant. The percentage value next to the “No Mutation” label is the ratio of non-mutated nucleotides to the total number of nucleotides in the entire genome. Likewise, the red dots on the “Missense” and “Synonymous” lines mark the nucleotide positions where missense and synonymous mutations were detected, respectively. The percentage values next to the “Missense” and “Synonymous” labels are the ratios of missense and synonymous mutations, respectively, to the total number of mutations in the entire genome.
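The distinction between synonymous and missense mutations can be illustrated with a short sketch that translates the reference codon and the mutated codon with the standard genetic code and compares the results. It assumes the coding sequence starts in frame and that positions are 0-based within that sequence. The function name and the toy sequence are hypothetical. A change to a stop codon is reported separately as "Nonsense", a class the text above does not discuss.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(coding_seq, pos, alt_base):
    """Classify a single-nucleotide substitution in an in-frame coding sequence.

    pos is a 0-based index into coding_seq. Returns a tuple
    (class, reference amino acid, mutated amino acid).
    """
    codon_start = pos - pos % 3
    ref_codon = coding_seq[codon_start:codon_start + 3]
    offset = pos - codon_start
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "Synonymous", ref_aa, alt_aa
    if alt_aa == "*":
        return "Nonsense", ref_aa, alt_aa  # stop gained; not covered in the text
    return "Missense", ref_aa, alt_aa

# Toy coding sequence (hypothetical): codons TTA (L) and GAT (D).
toy_cds = "TTAGAT"
print(classify_substitution(toy_cds, 1, "C"))  # TTA -> TCA
print(classify_substitution(toy_cds, 2, "G"))  # TTA -> TTG
print(classify_substitution(toy_cds, 4, "G"))  # GAT -> GGT
```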
However, the domains differ in length. Therefore, the mutation density of each domain can also be considered by using the formulas shown in ( 5 ) and ( 6 ). The mutations occur most densely in the NSP8 domain (Fig. 9).
As stated above, nucleotide mutations may affect the resulting amino acid sequence. To analyze this effect, the amino acid sequence translated from the reference genome was compared with the amino acid sequences translated from all the variants, and the most common amino acid changes were detected. The result is shown in Fig. 10. In other words, Fig. 10 shows the percentages of the amino acid changes caused by missense mutations.
Amino acids can be categorized into four classes according to their polarity characteristics: amino acids with non-polar R groups (A, V, L, I, P, F, W, M), amino acids with polar R groups (G, S, T, Y, C, N, Q), amino acids with negatively charged R groups (D, E) and amino acids with positively charged R groups (K, R, H). A nucleotide mutation may replace an amino acid from one category with an amino acid from a different category. The results of the category change analysis are shown in Fig. 11. The most frequent amino acid change type is (non-polar R group) > (non-polar R group). This kind of change is detected mostly in the NSP12 domain, which supports the analysis result shown in Fig. 8.
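The category assignment above can be expressed as a small lookup. The sketch below uses exactly the four groups listed in the text. The short category labels, the dictionary names and the function name are hypothetical conveniences for the example.

```python
# Amino acid categories as listed in the text (one-letter codes).
CATEGORIES = {
    "non-polar": "AVLIPFWM",
    "polar": "GSTYCNQ",
    "negative": "DE",
    "positive": "KRH",
}
AA_TO_CATEGORY = {
    aa: name for name, members in CATEGORIES.items() for aa in members
}

def category_change(ref_aa, alt_aa):
    """Return the category transition label for an amino acid change."""
    return f"({AA_TO_CATEGORY[ref_aa]}) > ({AA_TO_CATEGORY[alt_aa]})"

print(category_change("L", "S"))
print(category_change("D", "G"))
print(category_change("A", "V"))
```

With these groups, a change between two amino acids of the same class, such as A > V, is still a missense mutation but keeps the polarity category unchanged.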
To show the total amino acid change count in each domain for each amino acid change type, we used the color-coded display shown in Fig. 12, which includes a color map bar. This display gives information about the relative quantities of the resulting mutation types. For example, the synonymous change in the NSP2 domain has the highest count among all amino acid change types in all domains. The color map on the right side shows that the exact count of synonymous changes in the NSP2 domain is 74. The count for the (non-polar R group) > (non-polar R group) change type in the NSP12 domain is higher than all the other missense mutation counts in all domains, which is consistent with the results shown in Fig. 11.
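The counts behind such a color-coded display can be sketched as a domain-by-change-type matrix. This is a minimal illustration under the assumption that each detected change is available as a (domain, change type) pair. The function name, the placeholder domain names and the toy records are hypothetical and do not reproduce the study's counts.

```python
from collections import Counter

def change_count_matrix(records):
    """Tally (domain, change_type) pairs into a matrix.

    Returns (domains, change_types, matrix), where matrix[i][j] is the
    number of changes of change_types[j] observed in domains[i].
    Rows and columns are sorted alphabetically.
    """
    counts = Counter(records)
    domains = sorted({d for d, _ in counts})
    change_types = sorted({c for _, c in counts})
    matrix = [[counts[(d, c)] for c in change_types] for d in domains]
    return domains, change_types, matrix

# Toy records (hypothetical).
records = (
    [("domain_A", "synonymous")] * 3
    + [("domain_A", "(non-polar) > (polar)")]
    + [("domain_B", "(non-polar) > (non-polar)")] * 2
)
domains, change_types, matrix = change_count_matrix(records)
for name, row in zip(domains, matrix):
    print(name, dict(zip(change_types, row)))
```

A matrix in this form can be passed to any heat map routine, with the color scale encoding the counts.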
2.2 Analysis of Mutations in SARS-CoV-2 USA Variants
After examining the variants detected in China, as the second step we investigated the SARS-CoV-2 variants detected in the USA. The total number of variant genomes obtained for the USA is 200. The most common nucleotide mutation types, the corresponding mutation locations and the domain boundaries are shown in Fig. 13.
As in the China case (Fig. 4), high mutation percentages are detected at nucleotide positions 8782 (C > T mutation) and 28144 (T > C mutation) for the variants detected in the USA. This result is consistent with previous studies ([6], [7], [10]). However, there are 8 additional positions with mutation percentages greater than 10%. These nucleotide positions and the corresponding most common nucleotide mutation types are listed in Table 2. The mutations at nucleotide positions 3037, 14408, 23403 and 25563 have higher percentage values than the mutations at positions 8782 and 28144.
Table 2
The Nucleotide Positions of Most Common Nucleotide Mutations in the Variant Genomes Detected in the USA.
Nucleotide Position | Mutation | Domain |
1059 | C > T | NSP2 |
3037 | C > T | NSP3 |
8782 | C > T | NSP4 |
14408 | C > T | NSP12 |
17747 | C > T | NSP13 |
17858 | A > G | NSP13 |
18060 | C > T | NSP13 |
23403 | A > G | S (S1) |
25563 | G > T | ORF3a |
28144 | T > C | ORF8 |
The most common nucleotide mutation type is C > T (40.95%). This result differs from that of the China variants, where the most common nucleotide mutation type was T > C (30.39%). The domain-based mutation percentages and densities for the USA variants are shown in Fig. 14 and Fig. 15. A comparison of the China and USA cases (Fig. 5 vs Fig. 14 and Fig. 6 vs Fig. 15) indicates that the domains containing the most common nucleotide mutation positions are not consistent between the two variant groups.
The percentages of synonymous and missense mutations in each domain are shown in Fig. 17. The domains are ordered by descending missense mutation percentage. Comparing Fig. 17 with Fig. 8 shows that NSP2, NSP3 and NSP12 are among the first four domains for both variant groups. Finally, the mutation density graphs for synonymous and missense mutations are shown in Fig. 18. For the variants detected in the USA, missense mutations occur most densely in the ORF8 domain. In the China case, ORF8 was the fourth densest domain (Fig. 9).
The amino acid sequence translated from the reference genome was compared with the amino acid sequences translated from all the USA variants, and the most common amino acid changes were detected. The result is shown in Fig. 19. As in the China case (Fig. 10), the USA variants have an amino acid change with a percentage above 10% in the ORF8 domain. This change is again from amino acid “L” to amino acid “S”. However, there are 5 additional amino acid changes with percentages above 10%. These missense mutation results are listed in Table 3.
Table 3
The Most Common Amino Acid Changes Caused by Missense Mutations in USA Variants.
Domain | Amino Acid Change |
NSP2 | M > P |
NSP13 | M > P |
NSP13 | M > P |
S (S1) | D > G |
ORF3a | Q > H |
ORF8 | L > S |
The amino acid category change result for the USA variants is shown in Fig. 20. As in the China case, the most frequent amino acid change type is (non-polar R group) > (non-polar R group).
The total amino acid change counts in each domain for each amino acid change type are shown in Fig. 21 with the color-coded display method. As in the China case (Fig. 12), the count for the (non-polar R group) > (non-polar R group) change type in the NSP12 domain is again higher than all the other missense mutation counts in all domains. However, the USA variants also show other notable amino acid change counts, such as the (non-polar R group) > (non-polar R group) change type in the NSP3 domain and in the N domain.