Phylogenetic analysis of human coronavirus genomes (Figure 1A) revealed that the newly identified coronavirus SARS-CoV-2 Wuhan-Hu-1 was closest to SARS-CoV Tor2 and SARS-CoV Urbani, and relatively distant from the two alphacoronaviruses (HCoV-229E, HCoV-NL63).
Nucleotide composition analysis (Supplementary Figure 1A) revealed that SARS-CoV-2 Wuhan-Hu-1 had the highest content of U (32.2%), followed by A (29.9%), with similar contents of G (19.6%) and C (18.3%). U was also the most frequent nucleotide at the third codon position; thus, most codons of SARS-CoV-2 Wuhan-Hu-1 tended to be U-ending. Moreover, the mean GC and AU contents (Supplementary Figure 1B) were 37.9% and 62.1% (SARS-CoV-2 Wuhan-Hu-1), 41.0% and 59.0% (SARS-CoV Tor2), 40.8% and 59.2% (SARS-CoV Urbani), 41.5% and 58.5% (MERS-CoV HCoV-EMC), 36.8% and 63.2% (HCoV-OC43), 32.0% and 68.0% (HCoV-HKU1), 38.0% and 62.0% (HCoV-229E), and 34.4% and 65.6% (HCoV-NL63), respectively, indicating that SARS-CoV-2 Wuhan-Hu-1 and the other human coronaviruses in this study were all AU-rich, consistent with recent reports [11-15].
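The composition values reported above can be illustrated with a minimal sketch. It assumes a plain RNA or DNA string as input and counts every third codon position; the function names are hypothetical and this is not necessarily the software used in this study. Note that GC3S, as used in ENC plots, is usually restricted to synonymous third positions, which this sketch does not do.

```python
from collections import Counter


def base_composition(seq):
    """Return the percentage of A, C, G and U in a sequence (T is read as U)."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(b for b in seq if b in "ACGU")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/U bases")
    return {b: 100.0 * counts[b] / total for b in "ACGU"}


def third_position_composition(cds):
    """Return the base composition at the third position of each complete codon."""
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return base_composition(cds[2:usable:3])


def gc_content(seq):
    """Return the G+C percentage; the A+U percentage is 100 minus this value."""
    comp = base_composition(seq)
    return comp["G"] + comp["C"]
```

Applied to a coding sequence, `third_position_composition` gives the A3, U3, G3 and C3 percentages, and `gc_content` gives the overall GC percentage compared across viruses above.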
RSCU analysis of the complete coding sequences of SARS-CoV-2 Wuhan-Hu-1 revealed that all over-represented codons (RSCU > 1.6) ended with A/U, whereas most under-represented codons (RSCU < 0.6) ended with C/G (Supplementary Table 2). The highest RSCU value was observed for AGA, encoding arginine (R; 2.67), and the lowest for UCG, encoding serine (S; 0.11). The heatmap analysis (Figure 1B) further revealed that all human coronaviruses analyzed in this study shared seven over-represented codons (UAA, GGU, GCU, UCU, GUU, CCU, ACU) with an average RSCU value > 2.0, whereas UCA was over-represented only in SARS-CoV-2 and SARS-CoV.
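For readers unfamiliar with RSCU, the following is a minimal sketch of the standard definition: the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. It assumes the standard genetic code and treats the three stop codons as one synonymous family, since UAA is reported among the over-represented codons. The names `CODON_TABLE` and `rscu` are illustrative and do not describe the software used in this study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ('*' marks stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for every synonymous family observed in the CDS."""
    cds = cds.upper().replace("T", "U")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons by the amino acid (or stop signal) they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        mean = total / len(family)
        for c in family:
            values[c] = counts[c] / mean
    return values
```

With the thresholds used in this study, a codon would be called over-represented when its value exceeds 1.6 and under-represented when it falls below 0.6. Single-codon families (Met, Trp) always have an RSCU of 1.0 and carry no information about bias.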
The codon usage patterns of individual genes of human coronaviruses were further analyzed (Figure 1C and Figure 2). For the spike (S) gene, all human coronaviruses analyzed in this study shared five over-represented codons (UCU, GCU, CUU, GUU, ACU), all ending with U, whereas two codons (CCA, ACA) were over-represented only in SARS-CoV-2. In addition, SARS-CoV-2 used neither CGA for arginine nor CCG for proline. For the envelope (E) gene, two codons (UAC, GCG) were over-represented only in SARS-CoV-2 and SARS-CoV. None of the human coronaviruses analyzed in this study used CGC or CGG for arginine, CCG for proline, or UGA as a stop codon. SARS-CoV-2 and SARS-CoV used neither CAA for glutamine nor UAU for tyrosine, whereas they used GCG for alanine, AUC for isoleucine, and UCG and AGC for serine. For the membrane (M) gene, three codons (GUA, GAA, GGA) were over-represented only in SARS-CoV-2. For the nucleocapsid (N) gene, all human coronaviruses analyzed in this study shared three over-represented codons (GCU, ACU, CUU), all ending with U. The average RSCU values of GCU in the complete coding sequences, S gene, E gene, M gene and N gene of all human coronaviruses analyzed in this study were 2.22, 2.12, 1.79, 2.13 and 2.16, respectively. GCU for alanine was thus identified as a highly preferred codon among the human coronaviruses.
The genetic code is degenerate: every amino acid except methionine (Met, M) and tryptophan (Trp, W) is encoded by more than one synonymous codon, and the number of synonymous codons differs among amino acids. The overall amino acid usage of the human coronaviruses is shown in Supplementary Figure 2. Leucine and valine were the two most frequently used amino acids in all human coronaviruses analyzed in this study, with CUU and GUU as the preferred codons for leucine and valine, respectively (Figure 3), whereas tryptophan, histidine and methionine were the three least used, consistent with a recent report [14].
To further estimate the degree of codon usage bias, the intrinsic codon bias index (ICDI), codon bias index (CBI) and effective number of codons (ENC) were calculated (Table 1). The ICDI (0.144), CBI (0.306) and ENC (45.38) values all indicated a relatively low codon usage bias in SARS-CoV-2, similar to SARS-CoV Tor2, SARS-CoV Urbani, MERS-CoV HCoV-EMC, HCoV-OC43 and HCoV-229E, but different from HCoV-HKU1 (ICDI 0.372; CBI 0.532; ENC 35.617) and HCoV-NL63 (ICDI 0.307; CBI 0.476; ENC 37.275), which exhibited moderate codon usage bias.
We next attempted to determine the forces influencing the codon usage bias. Accumulating evidence suggests that codon usage bias is shaped by many factors, of which the two generally accepted major forces are mutation pressure and natural selection [16]. Other influential factors include gene expression level, gene length, GC content, GC content at the third codon position (GC3), RNA stability, hydrophilicity and hydrophobicity. A markedly high or low proportion of G or C at the third codon position indicates the involvement of mutational pressure [17]. Supplementary Figure 1 showed that both G3 and C3 were lower than A3 and U3, suggesting that mutational force acts on the codon usage pattern. Moreover, all preferred codons were A/U-ending (Figure 1B, 1C and Figure 2), which further suggested that mutational force contributed to shaping codon usage in this virus. Furthermore, to better understand the relationship between gene composition and codon usage bias, an ENC-GC3S plot (ENC plotted against G+C content at the synonymous third codon position) was constructed. When the codon usage pattern is affected only by GC3 resulting from mutation pressure, the ENC values are expected to lie on the solid curve. As shown in Supplementary Figure 3, all points clustered below the expected ENC curve, indicating that other factors, such as natural selection, might also play a role in the codon usage bias of human coronaviruses.
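The expected curve in an ENC-GC3S plot can be sketched as follows, assuming the commonly used null-model formula of Wright, ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC3S fraction. The text does not state which formula was used to draw the curve, so this is an assumption, and the function names are hypothetical.

```python
def expected_enc(gc3s):
    """Expected ENC when codon usage is shaped only by GC3S (fraction in [0, 1])."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def lies_below_curve(observed_enc, gc3s):
    """True when an observed ENC falls below the expected curve at this GC3S."""
    return observed_enc < expected_enc(gc3s)


# The curve peaks at GC3S = 0.5, where the expected ENC is 60.5.
print(expected_enc(0.5))
```

A point for which `lies_below_curve` returns True shows stronger bias than GC3S alone predicts, which is the pattern interpreted above as evidence for additional factors such as natural selection.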
Apart from humans, many animal species can also be infected by different types of coronaviruses. Previous studies suggest that some animals, such as bats, represent the original reservoir of several human-infecting coronaviruses [1]. To better understand the evolution of SARS-CoV-2, we further compared the codon usage patterns of SARS-CoV-2 and non-human coronaviruses (Supplementary Table 1).
Phylogenetic analysis (Figure 4A) showed that SARS-CoV-2 was most closely related to the recently reported bat coronavirus RaTG13 [18]. Nucleotide composition analysis (Supplementary Figure 4) revealed that, similar to SARS-CoV-2 Wuhan-Hu-1, all non-human coronaviruses analyzed in this study had the highest content of U, and U occurred most frequently at the third codon position. The heatmap analysis (Figure 4B) revealed that SARS-CoV-2 and all non-human coronaviruses analyzed in this study shared three over-represented codons (GGU, UCU, CCU), all ending with U, and seven under-represented codons (UCG, GGG, GCG, CCG, CGG, ACG, CGA), all ending with G except CGA. The codon usage pattern of SARS-CoV-2 was generally highly similar to that of betacoronaviruses, except for bat coronavirus HKU4-1 and bat coronavirus HKU5-1 (Figure 4C, Supplementary Figures 5-8). Moreover, the codon usage patterns of individual genes of SARS-CoV-2 and non-human coronaviruses were further analyzed, as shown in Figure 5 and Supplementary Figures 9-12. We found similar codon usage patterns among SARS-CoV-2 and its phylogenetic relatives such as RaTG13, Bat-SL-CoVZC45, Bat-SL-CoVZXC21, PCoV_GX-P1E and PCoV_GX-P4L, which may reflect the evolutionary relationship between SARS-CoV-2 and these non-human coronaviruses. These results are in accordance with the full-genome phylogenetic analysis (Figure 4A). The overall amino acid usage of the non-human coronaviruses is shown in Supplementary Figure 13. Similar to SARS-CoV-2, leucine and valine were the two most frequently used amino acids in all non-human coronaviruses analyzed in this study, with CUU and GUU as the preferred codons for leucine and valine, respectively.
Furthermore, similar to SARS-CoV-2, all non-human coronaviruses analyzed in this study exhibited relatively low codon usage bias according to the intrinsic codon bias index (ICDI), codon bias index (CBI) and effective number of codons (ENC) values, as shown in Supplementary Figure 14. Nucleotide composition analysis (Supplementary Figure 4) and the ENC-GC3S plot (Supplementary Figure 15) revealed that both mutational force and natural selection contributed to shaping codon usage in non-human coronaviruses.
Overall, in the present study we attempted to take a snapshot of the characteristics of the codon usage pattern of the novel coronavirus SARS-CoV-2. We found that all over-represented codons ended with A/U and that this novel coronavirus had a relatively low codon usage bias. Both mutation pressure and natural selection contributed to the bias. Additionally, the overall codon usage pattern of SARS-CoV-2 was generally similar to that of its phylogenetic relatives among non-human coronaviruses, such as RaTG13. Our findings are consistent with recent observations [11-15] and provide new insights into the characteristics of codon usage patterns in coronaviruses. These results also have two main implications for future work.
Firstly, information on the genome-wide codon usage pattern of SARS-CoV-2 may provide new insights into the evolution of this new virus. As more SARS-CoV-2 genome data become available, the codon usage pattern of SARS-CoV-2 can be reevaluated more comprehensively to track evolutionary changes among isolates. To this end, we analyzed genome-wide codon usage patterns in 100 complete genome sequences of SARS-CoV-2 isolates from different geo-locations, including SARS-CoV-2 Wuhan-Hu-1. Information on all isolates used can be found in Supplementary Table 3. The heatmap analysis (Supplementary Figure 16) revealed 12 preferred codons ending with A/U (including GGU, GCU, UAA, GUU, UCU, CCU and ACU) among all 100 isolates, and the average RSCU values of these over-represented codons ranged from 1.63 to 2.67 (Supplementary Figure 17). The highest RSCU value was observed for AGA, encoding arginine (R; 2.67), and the lowest for UCG, encoding serine (S; 0.11). The overall codon usage pattern appeared to be similar among the 100 SARS-CoV-2 isolates from different geo-locations, reflecting minimal evolutionary changes among them.
Additionally, in a comparison with other human coronaviruses and with non-human coronaviruses from different hosts, we found that the overall codon usage pattern of SARS-CoV-2 was generally similar to that of its phylogenetic relatives among non-human betacoronaviruses, such as RaTG13 (Figure 5), which may reflect the evolutionary relationship between SARS-CoV-2 and these non-human coronaviruses.
Secondly, information on the genome-wide codon usage pattern of SARS-CoV-2 may have potential value for developing coronavirus vaccines to combat this pandemic disease. Information on codon usage by SARS-CoV-2 may pave the way for strategies such as codon deoptimization [19-21], in which the least preferred codons are used to recode the SARS-CoV-2 genome and reduce virulence, for the development of a safe and effective vaccine. This strategy has several advantages. Deoptimized viruses can express an identical antigenic repertoire of T- and B-cell epitopes because they retain the intact wild-type amino acid sequence. Moreover, deoptimized viruses can replicate efficiently in vitro while being highly attenuated in vivo, which is important for vaccine production and safe implementation.
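The principle of codon deoptimization can be illustrated with a minimal sketch that replaces each sense codon with the synonymous codon of lowest RSCU, leaving the encoded protein unchanged. It assumes the standard genetic code and a user-supplied RSCU table; the function names and the example RSCU values are hypothetical. Practical designs typically recode only selected regions and weigh further constraints, none of which are modeled here.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ('*' marks stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS[_aa].append(_codon)


def translate(cds):
    """Translate an RNA coding sequence; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def deoptimize(cds, rscu_values):
    """Swap each sense codon for its least preferred synonym (lowest RSCU).

    Codons missing from rscu_values are treated as unused (RSCU 0.0).
    Stop codons are left unchanged.
    """
    cds = cds.upper().replace("T", "U")
    recoded = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE[codon]
        if aa == "*":
            recoded.append(codon)
            continue
        recoded.append(min(SYNONYMS[aa], key=lambda c: rscu_values.get(c, 0.0)))
    return "".join(recoded)


# Hypothetical RSCU values for the alanine family, for illustration only.
example_rscu = {"GCU": 2.0, "GCC": 0.5, "GCA": 1.2, "GCG": 0.3}
original = "AUGGCUUAA"
recoded = deoptimize(original, example_rscu)
assert translate(recoded) == translate(original)  # protein sequence is preserved
```

The final assertion reflects the property emphasized above: the recoded sequence encodes the same amino acid sequence, so the antigenic repertoire is unchanged while codon usage shifts toward rarely used codons.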