Evolution of codon usage in 2019-new coronavirus causing human infection

Xiaoting Yao (yaoxiaoting@nwafu.edu.cn), Northwest A&F University: Northwest Agriculture and Forestry University, https://orcid.org/0000-0002-2853-6240
Yanfei Xie, Northwest A&F University: Northwest Agriculture and Forestry University
Xiong Guan, Northwest A&F University: Northwest Agriculture and Forestry University
Silu Ni, Northwest A&F University: Northwest Agriculture and Forestry University
Chenxiang Zuo, Northwest A&F University: Northwest Agriculture and Forestry University
Siddiq Ur Rahman, Khushal Khan Khattak University Karak
Dekun Chen, Northwest A&F University: Northwest Agriculture and Forestry University
Wentao Ma, Northwest A&F University: Northwest Agriculture and Forestry University

Table: Nucleotide composition analysis of SARS-Cov-2 complete sequences (%). GC12 represents the G + C content at the first and second positions of codons. GC3 represents the G + C content at the third positions of codons. AU3 represents the A + U content at the third positions of codons.
Columns: Accession ID, A, U, G, C, AU, GC, A3, U3, G3, C3, AU3, GC3, GC1, GC2, GC12. First listed accession: MN996528. [...]

Codon usage patterns of SARS-Cov-2 and its hosts
[...] (Table 2, Fig. S2). Patterns of gene-specific over-represented codons were also observed in the SARS-Cov-2 isolates: 11 of the 18 preferred codons were over-represented in the E gene, 10 of 18 in the M gene, 6 of 18 in the N gene and 9 of 18 in the S gene. The gene-specific RSCU patterns indicated the independent evolutionary dynamics of the SARS-Cov-2 isolates. In addition, to estimate the potential effects of the host and vector on the viral codon usage pattern, the RSCU patterns were compared with those of potential hosts such as human and bat (Table 2, Fig. S2). Among these 18 preferred codons, the ratio of common to uncommon preferred codons was 6:12 between SARS-Cov-2 and human and 13:5 between SARS-Cov-2 and bat (Table 2, Fig. S2).

Table legend: AA represents amino acid; the "RSCU" value represents relative synonymous codon usage; orange marks the codons favored by SARS-Cov-2 and its hosts (RSCU > 1); over-represented (RSCU > 1.6) and under-represented (RSCU < 0.6) codons are marked in bold with red and green, respectively; the ideal codons for SARS-Cov-2 are underlined.
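The RSCU thresholds used above (favored > 1, over-represented > 1.6, under-represented < 0.6) can be illustrated with a minimal Python sketch. It assumes the standard genetic code and the usual definition of RSCU (observed codon count divided by the count expected if all synonymous codons were used equally). The function names `rscu` and `classify` are hypothetical, and this is not the software used in the study.

```python
from collections import Counter

# Standard genetic code in RNA letters; "*" marks termination codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds):
    """Return {codon: RSCU} for one coding sequence (DNA or RNA letters)."""
    seq = cds.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa in set(CODON_TABLE.values()):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        if aa == "*" or len(family) == 1:
            continue  # skip termination codons, AUG (Met) and UGG (Trp)
        total = sum(counts[c] for c in family)
        for c in family:
            # observed count / count expected under equal synonymous usage
            values[c] = counts[c] * len(family) / total if total else 0.0
    return values


def classify(value):
    """Label an RSCU value with the thresholds quoted in the table legend."""
    if value > 1.6:
        return "over-represented"
    if value < 0.6:
        return "under-represented"
    return "favored" if value > 1 else "unbiased"
```

For a toy sequence such as `"GCUGCUGCUGCC"`, alanine is encoded three times by GCU and once by GCC, so `rscu` gives GCU a value of 3.0 (over-represented) and GCC a value of 1.0.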
Measuring the similarity influences between the overall codon usage of SARS-Cov-2 and that of hosts
Spearman's correlational distance analysis was used to further estimate the similarity of codon usage patterns and to investigate how the overall codon usage patterns of the hosts and SARS-Cov-2 participated in the evolutionary process. Such RSCU-dependent analyses have been applied routinely to viral hosts, but remain limited to codon usage patterns and similarities [28][29][30][31]. Here, we applied this method through hierarchical clustering analysis of the virus and its hosts, and estimated their overall codon usage similarities. Two main groups were noted in this analysis: one cluster included the virus and the vector (bat), and the other cluster included only the host (human) (Fig. 2). Statistical tests on the distances of RSCU values (each of which was compared with a synonymous shuffling null model) indicated that a significant signature of codon usage patterns existed for the vector and SARS-Cov-2 (P < 0.01), but not for human and SARS-Cov-2 (P > 0.05). This suggested that possible viral transmission in humans may depend on the vector (bat).
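As a rough illustration of the distance used here (Spearman correlational distance = 1 − Spearman's rho), the following Python sketch computes the distance between two RSCU vectors listed in the same codon order. The permutation test is a simplified stand-in for the synonymous shuffling null model, which this sketch does not reproduce; the function names are hypothetical.

```python
import random


def ranks(values):
    """Average 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_distance(x, y):
    """Spearman correlational distance, 1 - rho."""
    return 1 - pearson(ranks(x), ranks(y))


def permutation_p(x, y, n_perm=1000, seed=0):
    """Share of random permutations of y that are at least as close to x.

    Assumption: a plain permutation test, not the synonymous shuffling
    null model described in the text.
    """
    rng = random.Random(seed)
    observed = spearman_distance(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman_distance(x, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A distance near 0 means the two codon usage profiles rank codons in almost the same order, while a distance near 2 means the rankings are reversed. Pairwise distances of this kind are the usual input to hierarchical clustering.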

Codon usage adaptation in SARS-Cov-2
Codon adaptation index (CAI) analysis was performed to investigate the relationship between the codon usage patterns and the expression levels of SARS-Cov-2 coding sequences, which reflects the adaptation of the virus to the host cellular machinery. CAI values range from 0 to 1, and higher CAI values are considered to indicate higher levels of codon usage bias [32]. CAI values were obtained for each gene of SARS-Cov-2 in relation to human and bat, respectively (Fig. 2 and Table S1). For the E genes of the SARS-Cov-2 isolates, the mean CAI value was 0.617 ± 0.001 in relation to human and 0.602 ± 0.001 in relation to bat. For the M genes, the mean CAI value was 0.674 ± 0.001 in relation to human and 0.670 ± 0.001 in relation to bat. For the N genes, the mean CAI value was 0.731 ± 0.001 in relation to human and 0.710 ± 0.001 in relation to bat. For the S genes, the mean CAI value was 0.710 ± 0.001 in relation to human and 0.755 ± 0.001 in relation to bat.
Student's t-test was applied to assess significance, and it suggested that there were significant differences in the CAI values (Fig. 3 and Table S1).
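For reference, a two-sample Student's t statistic with pooled variance can be sketched as follows. The text does not state whether the test was paired or unpaired, so the unpaired, equal-variance form is an assumption of this sketch, and `student_t` is a hypothetical helper name.

```python
from statistics import mean, variance


def student_t(a, b):
    """Return (t statistic, degrees of freedom) for two independent samples.

    Assumes equal variances (classic pooled-variance Student's t-test).
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled sample variance, weighted by each sample's degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5
    return t, df
```

Here `a` and `b` would hold the per-isolate CAI values of one gene computed against the human and bat references. The resulting t value is compared with the t distribution on `df` degrees of freedom to obtain a P-value.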

Evolutionary rates of various genes in SARS-Cov-2
To investigate why the CAI values of the S genes differ from those of the other three genes in relation to human and bat, we estimated the evolutionary rates of SARS-Cov-2 strains with known collection dates, using a Bayesian coalescent approach applied to the sequences of the various protein-coding genes. Using the best-fit model, the Bayesian estimates of the mean substitution rates for these genes were between 2.35 × 10⁻⁴ and 4.21 × 10⁻³ substitutions per site per year (Table 3). Among the genes encoding structural proteins, the E gene had the fastest evolutionary rate (4.21 × 10⁻³), with a 95% highest posterior density (HPD) interval of 5.40 × 10⁻⁸ to 1.3 × 10⁻². The S gene evolved at the slowest rate, 2.35 × 10⁻⁴, with a 95% HPD interval of 1.29 × 10⁻⁸ to 7.21 × 10⁻⁴.

Discussion
In the present study, an evolutionary analysis was performed using 41 whole genome sequences of SARS-Cov-2 obtained from different geographic locations.
Our results indicated that the SARS-Cov-2 isolates from various regions mainly formed two groups. SARS-Cov-2 spread to many countries and regions during 2019-2020 and led to a severe global outbreak [33]. The phylogenetic tree also indicates that geographical location plays an important role in SARS-Cov-2 evolution, and such findings may help to trace the viral root of emerging strains in the future. In addition, the current results suggest that some affected countries have more than one genetic lineage.
This survey of the SARS-Cov-2 whole genomes suggests a preference for A and U nucleotides over G and C, indicating that this preference influences the codon usage patterns for the viral translation process. This result is similar to our previous study on Crimean-Congo hemorrhagic fever virus (CCHFV), which is also enriched in A/U nucleotides [17]. However, the biological significance of this phenomenon is still unclear, making it crucial to investigate the causes of the significantly increased A/U nucleotide content in viral genomes [34]. Some previous reports revealed that amino acid content was also an important factor determining the nucleotide composition at the 1st and 2nd codon positions of viral genomes, while protein alteration was driven by functional selection. Meanwhile, 69% of the variation at the 3rd codon position implied synonymous or silent mutations, without the influence of functional selection [34].
RSCU analysis can provide some insight into the influence of natural hosts on SARS-Cov-2, although it may require further validation through experimental research in animal models. Earlier reports indicated that the codon usage patterns of Ebola virus (EBOV) were not similar to those of its hosts [35,36]. Our findings are consistent with previous studies suggesting that A/U-ended codons are more highly enriched in viral genomes than in host genomes [37,38].
Additionally, some earlier studies also suggested that coincident codon usage patterns between viruses and their hosts may improve the translation efficiency of the corresponding amino acids, while opposing codon usage patterns may ensure the correct folding of viral proteins [14,36,39,40].
These findings also indicated that similar codon usage patterns between SARS-Cov-2 and its hosts could improve the ability of the viral genome to participate in the translational process. In particular, the codon usage patterns of SARS-Cov-2 genomes could be largely shaped by the selection pressure of their common hosts, which may promote adaptation to the cellular environment of the hosts and efficient replication [29,41]. However, the effect of selection pressure from the host (human) on SARS-Cov-2 codon usage patterns is not similar to that from the vector (bat). Previous reports on EBOV, CCHFV and Flaviviridae viruses showed that codon usage patterns differ between their hosts and vectors [35,36,42]. Based on RSCU values, Spearman's correlational distance analysis was also performed to estimate the general codon usage similarities between SARS-Cov-2 and its hosts. This indicates that possible SARS-Cov-2 transmission in humans is based on its vector (bat). Models of infection in animals have indicated substantial constraints on the evolution of arboviruses [43][44][45][46]. Consistent with the earlier reports, our results showed that the overall SARS-Cov-2 codon usage patterns correlate with those of bat and not with those of human. On the basis of this finding, we suggest that translational selection pressure plays an important role in shaping the codon usage patterns of SARS-Cov-2.
Interestingly, the codon usage patterns of different coding sequences differed, even within a single isolate. The codon usage patterns of the S and N genes were closer to those of humans, while those of the E and M genes were closer to those of bats (Table 2 and Fig. S2). This suggested that the evolution of codon usage patterns of individual SARS-Cov-2 coding sequences is possibly related to the function of the different genes in viral pathogenesis. To further confirm the effect of natural selection, CAI analysis was performed. CAI is frequently employed to measure gene expression and, here, the adaptation of SARS-Cov-2 genes to their hosts, which indicates the effect of natural selection. Highly expressed genes show a stronger bias for specific codons than genes that are less frequently expressed. Therefore, if the CAI value is high, the codon usage bias is high and the effect of natural selection is prevalent, and vice versa [47]. Based on the CAI values for the SARS-Cov-2 coding sequences, different levels of adaptation of SARS-Cov-2 to its host and vector were observed. In our results, the S genes of SARS-Cov-2 showed the greatest adaptation to bat, closely followed by human, indicating that the replication of S genes may be more efficient in vector cells than in host cells. However, CAI values for the other three genes (E, M and N) tended to be higher for human, which may be attributable to a possibly higher efficiency of protein synthesis within the host.
To further investigate the inconsistent adaptation of different genes, we estimated the evolutionary rates of SARS-Cov-2 strains. In most cases, the collection date of isolated sequences is considered less important than their location. However, for fast-evolving organisms such as RNA viruses, differences in isolation dates can be used to evaluate the time since the viruses last shared a common ancestor. We therefore required the collection dates of the SARS-Cov-2 isolates and assumed a constant evolutionary rate of divergence between sequences. Under the assumption of substitution rate constancy, the differences in isolation dates provide information about the rates of molecular evolution. In other words, the amount of evolutionary change that has accumulated can be measured against the collection dates [48]. A Bayesian coalescent approach suggested that the mean substitution rates were between 2.35 × 10⁻⁴ and 4.21 × 10⁻³ in different genes, which is comparable with reports for hepatitis A and B viruses and Newcastle disease virus [49][50][51]. This suggests that the lowest substitution rate, observed for the SARS-Cov-2 S genes, may be related to their more important function in viral synthesis and replication; these genes may be subject to weaker natural selection pressure and accumulate fewer substitutions.

Conclusion
In conclusion, this study indicated that the codon usage patterns of SARS-Cov-2 were mainly shaped by translational selection. Importantly, there were similarities in codon usage patterns between SARS-Cov-2 and its natural hosts. It also demonstrated that different genes show different degrees of adaptation to their hosts. We further suggest that SARS-Cov-2 isolates are evolving at a rapid substitution rate under the translational selection pressure of their hosts.
The present study provides a basis for characterizing SARS-Cov-2 adaptation in different hosts, which will contribute to the understanding and control of SARS-Cov-2 infection and transmission.

Data collection
The 41 newly sequenced SARS-Cov-2 complete genomes were downloaded from the GenBank database (http://www.ncbi.nlm.nih.gov/genbank/). Comprehensive information on the SARS-Cov-2 strains is provided in Supplementary Table S1, including the accession number, collection date, viral host and geographic origin of isolation. From the whole genome sequences, separate datasets containing the E, M, N and S structural genes were extracted and used in further analysis. The sequences were aligned using a local installation of MAFFT v7.450 [52] and manually adjusted using BioEdit v7.2.5 [53].

Phylogenetic analysis of SARS-Cov-2
The phylogenetic tree was reconstructed using the maximum-likelihood method in PAML v4.9, and branch support was estimated by bootstrap analysis with 1000 replicates. The tree was rendered with the online tool Interactive Tree Of Life v2 [54]. A total of 41 SARS-Cov-2 strains were used in our research.

Nucleotide contents analysis
The nucleotide contents of the 41 SARS-Cov-2 whole genomes were calculated using CodonW software. The total nucleotide content of A, U, G and C, the nucleotide content at the 3rd codon position (A3, U3, G3 and C3), and the GC content at the 1st (GC1), 2nd (GC2) and 3rd (GC3) codon positions were measured individually. Additionally, the average frequency of GC at the 1st and 2nd positions (GC12) and the overall AU and GC compositions were calculated. The codons without synonymous alternatives (UGG and AUG) and the termination codons (UAA, UAG and UGA) were excluded from this study.
[...] [59]. The resulting dendrogram was drawn with ggplot2 in R [60].
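The position-specific nucleotide measures described in this section (GC1, GC2, GC3, GC12, AU3 and the third-position base contents) can be sketched in Python as below. This is an illustration under simple assumptions (an in-frame coding sequence containing only A, C, G and U/T), not a reimplementation of CodonW, and `composition` is a hypothetical function name.

```python
def composition(cds):
    """Return position-specific nucleotide contents (%) for a coding sequence.

    Assumes an in-frame sequence of A, C, G, U/T with at least one
    informative codon. AUG, UGG and the three termination codons are
    excluded, as described in the text.
    """
    seq = cds.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    excluded = {"AUG", "UGG", "UAA", "UAG", "UGA"}
    codons = [c for c in codons if c not in excluded]
    n = len(codons)

    def share(position, letters):
        # Percentage of codons whose base at `position` is one of `letters`.
        return 100 * sum(c[position] in letters for c in codons) / n

    result = {base + "3": share(2, base) for base in "AUGC"}
    result["GC1"] = share(0, "GC")
    result["GC2"] = share(1, "GC")
    result["GC3"] = share(2, "GC")
    result["AU3"] = share(2, "AU")
    result["GC12"] = (result["GC1"] + result["GC2"]) / 2
    return result
```

For the toy sequence `"AUGGCUGCCUUAUGG"`, only GCU, GCC and UUA are counted, so GC1 and GC2 are both about 66.7% while GC3 is about 33.3%.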

Synonymous Codon Usage Analysis
Additionally, the observed probability was reported as the P-value to indicate significance.

Codon adaptation index
As a quantitative measure, the codon adaptation index (CAI) was used to estimate the gene expression level on the basis of the coding sequence; it ranges from 0 to 1. More frequent codons indicate higher relative adaptation to the host, and genes with higher CAI values were considered to be better fitted than those with lower CAI values [61]. CAI analysis of the SARS-Cov-2 coding sequences was performed with the CAIcal server, which implements an improved method of CAI calculation [47]. The codon usage patterns of Vespertilio murinus and Homo sapiens were used as references. In addition to the three termination codons, the codons without synonymous alternatives (UGG and AUG) were also excluded in our study.
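A minimal Python sketch of the general CAI idea follows: each codon receives a relative adaptiveness (its reference count divided by the count of the most used synonymous codon in the reference host), and the CAI of a gene is the geometric mean of these values over its codons. This is not the CAIcal implementation. Replacing zero reference counts with a pseudocount of 0.5 is an assumption of this sketch, and the function names are hypothetical.

```python
import math

# Standard genetic code in RNA letters; "*" marks termination codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def adaptiveness(ref_counts, pseudocount=0.5):
    """Relative adaptiveness w for each codon from reference codon counts.

    Codons missing from ref_counts, or with a zero count, are given the
    pseudocount (an assumption of this sketch) so that log(w) is defined.
    """
    w = {}
    for aa in set(CODON_TABLE.values()):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        if aa == "*" or len(family) == 1:
            continue  # termination codons, AUG and UGG are excluded
        counts = {c: ref_counts.get(c, 0) or pseudocount for c in family}
        top = max(counts.values())
        for c in family:
            w[c] = counts[c] / top
    return w


def cai(cds, w):
    """Geometric mean of w over the informative codons of one gene."""
    seq = cds.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    return math.exp(sum(logs) / len(logs))
```

In use, `ref_counts` would be a codon count table for the reference organism (for example Homo sapiens or Vespertilio murinus), and `cai` would be evaluated once per gene and reference. A gene made only of each amino acid's most used reference codon scores 1.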

Evolution substitution rates
We analyzed 41 representative whole genome sequences of SARS-Cov-2 isolates collected during 2019 to 2020 with the MCMC program in BEAST v2.4.8 to measure the evolutionary rate of different genes. Under a model that assumed a constant substitution rate, this program provided a maximum-likelihood method to estimate the rate. jModelTest v2.0.1 was used to choose the best-fitted model based on the Akaike information criterion (AIC) [62]. For model comparison, Bayes factors were used to evaluate each model, with marginal likelihoods estimated by the method of Newton & Raftery [63]. As implemented in the BEAST package, the best-fitted model was GTR (general time reversible) + Γ4 (gamma distributed rate [...]

Figure legend: Similarity distance analysis of codon usage between SARS-Cov-2 and its hosts (Spearman correlational distance = 1 − Spearman's rho).
Figure legend: CAI of SARS-Cov-2 coding sequences in relation to its hosts. In the plot, the x-axis represents the different genes of SARS-Cov-2 and the y-axis represents the CAI value.
