Host Adaptation of Codon Usage in SARS-CoV-2 From Mammals Indicate Natural Selection

The outbreak of COVID-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections, spread across hosts from humans to animals, transmitting particularly effectively in mink. How SARS-CoV-2 selects and evolves in the host, and the differences in the evolution of different animals are still unclear. To analysis the mutation and codon usage bias of SARS-CoV-2 in infected humans and animals. The SARS-CoV-2 sequence in mink (Mink-SARS2) and binding energy with receptor were calculated compared with human. The relative synonymous codon usage of viral encoded gene was analyzed to characterize the differences and the evolutionary characteristics. A synonymous codon usage analysis showed that SARS-CoV-2 is optimized to adapt in the animals in which it is currently reported, and all of the animals showed decreased adaptability relative to that of humans, except for mink. The neutrality plot showed that the effect of natural selection on different SARS-CoV-2 sequences is stronger than mutation pressure. A binding anity analysis indicated that the spike protein of the SARS-CoV-2 variant in mink showed a greater preference for binding with the mink receptor ACE2 than with the human receptor, especially as the mutation Y453F and N501T in Mink-SARS2 lead to improvement of binding anity for mink receptor. In summary, mutations Y453F and N501T in Mink-SARS2 lead to improvement of binding anity with mink receptor, indicating possible natural selection and current host adaptation. Monitoring the variation and codon bias of SARS-CoV-2 provides a theoretical basis for tracing the epidemic, evolution and cross-species spread of SARS-CoV-2.

Introduction SARS-CoV-2 is a β-coronavirus that emerged in 2019 and spread worldwide, leading to an ongoing global pandemic [1,2]. As of November 29th 2021, the number of infected cases reached 261 million, and more than 5.2 million deaths have occurred (Johns Hopkins University statistics; https://coronavirus.jhu.edu/map.html). SARS-CoV-2 has a single-stranded positive-sense RNA genome containing 29,903 nucleotides and consisting of 11 open reading frames (ORFs) encoding 27 proteins [3]. The S glycoprotein is a fusion viral protein that functions in recognition of the host receptor ACE2 [4].
There is a broad host spectrum because SARS-CoV-2 binds a receptor common to humans and animals [5]. To date, the following animals have been reported to be susceptible to infection: cats, dogs, tigers, lions, ferrets, and mink [6][7][8][9][10]. SARS-CoV-2 infection of pets, including cats and dogs [8,10], was the earliest reported animal infections in the epidemic. Later, in a report on SARS-CoV-2 infection in tigers, lions, and human keepers in a New York zoo, epidemiologic and genomic data indicated human-to-animal transmission [11]. Other animals, including snow leopards and gorillas, tested positive for SARS-CoV-2 after showing signs of illness [12,13]. It is noteworthy that a study from The Netherlands reported the spread of SARS-CoV-2 from humans to mink and from mink back to humans in mink farms [14]. Eightyeight mink and 18 staff members from sixteen mink farms were con rmed to be infected with SARS-CoV-2 as determined by high throughput sequence analysis. The adaptation of SARS-CoV-2 to bind the mink receptor and the viral evolution in the mink host are worthy of further study.
Codon usage bias refers to differences in the frequency of occurrence of synonymous codons during protein translation, which differs between hosts [15]. Codon usage bias is dominated by selection pressure in some bacteria, while it was driven by mutation in virus, such as Ebola virus [16,17]. Viruses differ markedly in their speci city toward host organisms, and the analysis of the viral genome structure and composition contributes to the partial understanding of virus evolution and adaptation in the host [14,18]. Further exploration of the codon usage pattern of SARS-CoV-2 in different hosts, especially the codon architecture of the Spike gene, indicate host adaptation related to cross-species transmission.
Surveillance of the substitution and selection of the SARS-CoV-2 genome is important for the study of viral evolution and for tracking viral transmission. In particular, study of the Spike gene helps to evaluate the immunization effect of vaccinations and to adjust the vaccine design in a timely manner. This study focuses on the divergence of theSARS-CoV-2 genome composition and codon usage in human and animal hosts to investigate the natural selection that might play a role in virus evolution, adaptability, and transmission.

SARS-COV-2 sequences and data collection
A total 209 SARS-COV-2 genome sequences from humans, cats, dogs, tigers, lions, hamster and minks were used for the genetic analysis (The strains information was recorded in the Supplementary Table S1).
All the genomic sequences selected by the hosts were obtained from the GIAISD database (https://www.gisaid.org/). Isolate Wuhan/WIV04 was used as the reference strain.
Evolutionary analysis 209 SARS-COV-2 genomes were used for phylogenetic analysis. The evolutionary history was inferred by using the Maximum Likelihood method and Tamura-Nei model [19]. The tree with the highest log likelihood (-42442.50) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.0500)). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. There was a total of 29903 positions in the nal dataset. Evolutionary analyses were conducted in MEGA-X [20].

Identi cation of mutations
The sequences were aligned using MEGA-X, and the single nucleotide polymorphisms were analyzed using the SNiPlay pipeline by uploading aligned Fasta format le (https://sniplay.southgreen.fr/cgibin/analysis_v3.cgi) [21]. All the sequences including coding regions, 5'UTR and 3'UTR were used for the analysis.

Estimation of nonsynonymous and synonymous substitution rates
The number of nonsynonymous substitutions per synonymous site (dN) and the number of synonymous substitutions per nonsynonymous site (dS) for each coding site were calculated using the Nei-Gojobori method (Jukes-Cantor) in MEGA-X. The Datamonkey adaptive evolution server (http://www.datamonkey.org) was used to identify sites where only some of the branches have undergone selective pressure. The mixed-effects model of evolution (MEME) and xed effects likelihood (FEL) approaches were used to infer the nonsynonymous and synonymous substitution rates.

Neutral evolution analysis
Neutrality plot analysis was constructed to determine the codon bias in uenced by natural selection. It re ected the neutrality effect of directional mutation pressure equilibrium in shaping the codon usage bias [25]. Mostly the GC3 position has an equal probability of appearing A/T and G/C nucleotides. There will be variation between GC12 against GC3 regression values due to the directional mutational pressure.
Here GC12 against GC3 were plotted with a regression line.

Spike protein sequence and structure reconstruction
The crystal structure of the SARS-CoV-2 receptor-binding domain (RBD) in complex with human ACE2 (PBD ID: 6M0J) was used for structural analysis. Structures of ACE2 and the viral spike from mink were constructed by the SWISS model server (https://swissmodel.expasy.org/). Comparisons of the predicted protein structures and pairwise comparisons were analyzed using PyMOL software.

Molecular dynamics
For the binding free energy ( ), we simulated the minimized annealing energy through molecular dynamics (MD) simulation in YASARA [26]. We performed three iterations of energy minimization for the set of wild-type residues in the viral spike protein bound with human ACE2 and mutant residues in the Mink-SARS2 spike with mink ACE2. The relative binding energy (∆ ) are reported as the mean and standard deviation values across three replicates.

Selective coe cient index
The selection coe cient index (S) of all SARS-CoV-2 codons was estimated by the FMutSel0 model in the program CODEML (PAML package) [27]. The tness parameter of the most common residues at each location is xed to 0, while the other tness parameters are limited to −20 < F < 20.

Plasmids construction and cell culture
The gene fragment of RBD of SARS-COV-2 (NCBI ID: MN996528.1) inserted into pCAGGS plasmid was donated by Prof. Jianguo Wu, RBD fragments of SARS-COV-2 and Mink-ACE2 (NCBI ID: MW269526.1) were synthesized by Genscript Inc. The mutants of RBD genes were constructed using the Mut Express II Fast Mutagenesis Kit (Vazyme, C214). The wild type and mutant genes were cloned into pCAGGS vector with a 6 ´ Histidine tags using the EcoRI and XhoI restriction sites. Mink-ACE2 fragment was inserted into vector pcDNA3.1-eGFP (named pcDNA3.1-mACE2-eGFP). The products were subsequently transformed into competent E.coli Top10 strain. Recombinant plasmids were veri ed by Next-generation sequencing. Plasmid pcDNA3.1-hACE2-eGFP expressing hACE2 fused with eGFP was ordered from Fubio Ltd (MC_0101086). HEK293T and BHK-21 cells were cultured by high-sugar DMEM medium in 5% CO2 atmosphere of 37℃ incubator.

Protein expression and puri cation
The recombinant RBDs of SARS-COV-2 mutants were expressed in CHO cells. Recombinant pCAGGS plasmids were transfected into CHO cells in 245 mm dishes according to the manufacturer's recommendations; the supernatant was collected and centrifuged after 5 days. The soluble proteins were puri ed by using HisPur™ Ni-NTA Resin (Thermo Scienti c, 88221). Puri ed proteins were eluted in buffer consisting of 20 mM Tris-HCl (pH 8.0) and 150 mM NaCl.

Flow cytometry
The plasmids pcDNA3.1-hACE2-GFP and pcDNA3.1-mACE2-eGFP were transfected into BHK-21 cells using Liposome 6000 (Beyotime, C0526) according to the manufacturer's instructions. The cells expressing hACE2-GFP and mACE2-GFP were collected after transfection for 24 h and suspended in the PBS buffer. Then, the cells were incubated with the puri ed His-tagged RBDs at a nal concentration of 30 μg/mL at 37ºC for 30 min. After washed by PBS twice and incubation with anti-His/APC antibodies (1:5000), the cells were determined using a BeckMan CytoFLEX and the data were analyzed using FlowJo V10 software.

Statistical analysis and mapping
Statistical analyses were performed using ANOVA followed by Turkey's post hoc test (Fig 2C&2F) or Student's t-test (Fig 3E), and the data were considered signi cantly different if the p-value was less than 0.05. The SPSS 20.0 software was used for regression curve t. ***p<0.001, **p<0.01, *p<0.05, ns. means no signi cant. The gures were mapped by the software PRISM GraphPad 5.0.

Sequence and analysis of SARS-CoV-2 isolated from animals
As of Jun 20th 2021, more than 2.53 million SARS-CoV-2 genome sequences had been uploaded to the GISAID database. It is important to study the mutation rates and selective pressures on the SARS-CoV-2 genome during the spread of the epidemic. The evolutionary entropy increased at speci c sites in the whole genome of SARS-CoV-2. In addition to humans, SARS-CoV-2 infects other animals (Fig. 1A) and evolves in these animals. A phylogenetic tree was reconstructed based on animal-derived whole genome consensus sequences compared with the SARS-CoV-2 human isolate WIV04 (Fig. 1B). Most SARS-CoV-2 clade isolates from the same animal clustered together, and the same clade contained sequences from all the mink regardless of their geographic region.
The cluster of SARS-CoV-2 from mink (Mink-SARS2) has more substitutions compared to the reference sequence WIV04 (Supplementary Table S2), and the substitutions of cytidine in Mink-SARS2 account for nearly 50% of the substitutions, while in other animals, cytidine accounts for only 30% of the substitutions (Fig. 1C). The substitution of adenine in SARS-CoV-2 in other animals is threefold higher than that in Mink-SARS2. To track how the substitutions occurred in the Mink-SARS2 genome, we recorded all the mutations in the Mink-SARS2 genome in reference to the WIV04 genome. The results in Fig. 1D & 1F show that the cytidine-to-uracil transition occurred more than 40% of the time and was eightfold higher than the uracil-to-cytidine substitution. Notably, the substitutions of guanine and adenine were more than threefold higher in nonsynonymous mutations than in synonymous mutations (Fig. 1E).

Mutational spectra of Spike protein in human and animal samples
The variation in the spike gene was evident when all the included sequences isolated from humans and animals were recorded in our study, which led to the identi cation of a number of highly variable residues, including the relatively high-frequency amino acid variation sites. C-to-U substitutions were scattered throughout the SARS-CoV-2 genome and accounted for 24.06% of the substitutions in the spike gene in of all epidemic strains analyzed as of February 2, 2021 ( Fig. 2A). Because of the widespread transmission of D614G (GAT>GGT), A-to-G substitutions accounted for 56.12% of all the monitored strains. The result of dN-dS indicates the natural selection for mutations in these speci c sites in the spike gene (dN-dS>0 indicates positive selection, and dN-dS<0 indicates puri cation selection). Fig. 2B shows that sites 222, 262, 439 and 614 were exposed to strong positive selection pressure, while positions 294, 413, 1018 and 1100 were subjected to purifying selection during evolution.
CAI was used to quantify the codon usage similarities between different coding sequences based on a reference set of highly expressed genes [28]. To clarify the optimization of SARS-CoV-2 in different hosts, we calculated the average CAI of the SARS-CoV-2 whole genome (Fig. 2C). Interestingly, SARS-CoV-2 in bat hosts has a higher value of CAI relative to humans, while dogs had an obviously decreased CAI value compared to humans (Fig. 2C). The ENC-plot analysis helps us to further understand the in uence factors of SARS-CoV-2 codon usage bias. The result (Fig. 2D) showed most sources of SARS-CoV-2 located slightly below the standard curve (R 2 =0.7563, P=0.005), indicating that the codon usage bias is not only affected by mutation factors, but also affected by evolutionary pressure. This is consistent with Rupam's report [25]. The neutrality plot showed the linear regression coe cient of SARS-CoV-2 sequences was -0.1161 (Fig. 2E), indicating that the pressure for codons at rst and second position are different from those at third position in the evolutionary process. The rst and second position of codons are mainly affected by mutation and third position is mainly affected by selection. Considering codon usage in the spike gene in different hosts, Fig. 2F shows that pangolins, cats, dogs, tigers, and lions all had a lower CAI value than humans. The bias of codon usage in the spike mutants are shown in Supplementary Table S3. These results indicated that SARS-CoV-2 optimized codon usage to adapt to the animals in which infection has been reported, but all of them showed a downward trend in adaptability relative to humans except for mink.

SARS-CoV-2 spike mutant shows greater preference for binding with mink receptor
The spike interacts with human and mink ACE2 through the amino acids H34Y, Y41, L79H, M82T and G354R (Fig. 3A&3B), which form electric charge attraction and hydrophobic interactions with residues Asn439, Tyr453, Phe486 and Asn501 on spike. To distinguish the differentiation of receptor sequences between different animals and humans, the ACE2 amino acid sequences in humans, mink, ferrets, tigers, cats, and dogs were aligned (Fig. 3C). The results showed that the critical mutations H34Y, L79H and G354R appear in mink and ferret ACE2 (Fig. 3B). On the other hand, viral variation is another important factor that should also be considered when analyzing infection differences between animals and humans. Corresponding to the contact residues on the receptors, alignment of the viral sequence contacts of ACE2 on spike indicated that residues binding receptor are conserved (Fig. 3C), while residues at site 453, which interact with those at position 34 in ACE2 (Fig. 3D), showed a higher binding a nity for F453-Y34 in mink than for Y453-H34 in humans (Fig. 3E). The interaction between T501 and R354 showed increased binding energy with mink receptor than humans (Fig. 3E). The ow cytometry results showed that Y453F, F486L and N501T mutants lead to the stronger overt uorescent shift of cells with mink ACE2 than wild type RBD, while only F486L and double mutant Y453F&F486L have a similar function with binding to human receptor (Fig. 3F). These variations indicate that SARS-CoV-2 spike shows a greater preference for binding with the mink receptor than humans ACE2 after this mutation occurs.

Codon usage and tness analysis of each amino acid in SARS-COV-2 encoded proteins
Amino acid substitutions within the SARS-CoV-2 Spike RBM may have contributed to host adaption and cross-species transmission. N439K, S477N and N501Y were the most abundant variations throughout the RBM regions (Fig. 4A). N439 does not bind directly with ACE2 but functions in the stabilization of the 498-505 loop [21], but the N439K substitution is absent in animal CoVs (Fig. 3C). Previous computational analysis combined with entropy analysis of the spike showed that S477N may have decreased stability compared with the wild type [29]. Since human SARS-CoV-2 and Mink-SARS2 do not show very different codon usage bias (Fig. 2C) and because viral codon bias depends on the host, we compared the codon usage frequency of SARS-CoV-2 and SARS-CoV (Fig. 4B), for which ferrets are common hosts. Because substitutions N501T (AAU>ACU) in mink and N501Y (AAU>UAU) in humans occurred nonsynonymously in the rst and second positions and since these substitutions had a lower frequency than other noted substitutions (Fig. 1F & 2A), further study on the relationship of these substitutions is needed. The results of selective coe cient index in Fig. 4C show the differences of relative tness in the SARS-CoV-2 codons, CGA and CGG have the high tness score in all codon-speci c estimates, and the tness of T (ACU) was close to Y (UAU). In addition, it was observed that among the 12 ORFs encoded by SARS-CoV-2, the codons encoding different ORFs of SARS-CoV-2 possess diversity RSCU values (Supplementary Table S4). It is worth noting that UCA encoding Ser have excessive bias in ORF7b (Fig. 4D), and AGG encoding Arg overbiased in ORF6 (Supplementary Table S4).

Discussion
Tracking animal variants arising from human contact or produced from animal bodies is an interesting topic and allows for better understanding of the evolutionary mechanism and selection tness of SARS-CoV-2 in the host. Regardless of the probability of contact between different animals and SARS-CoV-2, the transmission of the virus between animals is inseparable from susceptibility and host adaptability. Some viruses with codon bias observed in the process of adaptation are related to their hosts, including Rotavirus, HPV, and Marburg virus et al. [30][31][32]. Mink were the rst extensively farmed species to be affected by the COVID-19 epidemic, indicating that mustelids, including mink and ferrets, are more sensitive to SARS-CoV-2 than other animals [33]. Several mink farms in The Netherlands, Denmark, USA, and Spain all reported infection cases [34][35][36][37][38], indicating mink-to-mink and mink-to-human (Netherlands, Denmark) transmission. Other animals, including tigers and lions, are also susceptible to SARS-CoV-2 infection [39]. Hence, comparison of the susceptibility and the natural evolutionary pressure in different hosts for SARS-CoV-2 is meaningful and helpful for clarifying the host adaptation mechanisms and monitoring the epidemic. Viral genes and genomes exhibit varying numbers of synonymous codons depending on the host [40]; hence, the codon usage bias of the virus has a strong relationship with its host. Studying the preferred synonymous codon usage and base substitutions helps to provide an understanding of the codon patterns of the virus in relation to their hosts and in relation to viral genome evolution. The convergence effect of virus codon preference on the host is widely recognized and is also one of the main natural selection forces for the coevolution of viruses and hosts [41]. In this study, we compared the codon bias of SARS-CoV-2 in mink with that of SARS-CoV in ferrets. Residues threonine (T) and tyrosine (Y) had similar codon biases in SARS-CoV-2 and SARS-CoV (Fig. 4B), which both have the capability to infect mink and ferrets. The N501T variation mostly appeared in mink, while the N501Y mutation present only in humans cannot be explained from the perspective of codon bias and indicates that these two variations belong to two separate lineages.
The WebLogo diagram in Fig. 4B shows that SARS coronaviruses preferentially have U-or A-ending codons. This is consistent with a previous report [25], and the G or C nucleotides in the third position of the preferred SARS-CoV-2 codons are not well represented. This feature may lead to an imbalance in the tRNA pool in infected cells, resulting in reduced host protein synthesis. The substitution rate of C-to-U was the highest in most of the reported sequences in animal species (Fig. 1C). This may be because the surrounding context of cytidine in the sequence strongly in uences the possibility of its mutation to U [42]. In the mink sequences, we observed an 8-fold increase in C-to-U substitution compared with the U-to-C substitution, which was higher than the reported 3.5-fold increase in mink [36][36][36] [37][38], suggesting host adaptation of SARS-CoV-2 in mink over time and the ongoing outbreaks in multiple mink farms. In mink, the variations in G and A with nonsynonymous substitutions were higher than those with synonymous substitutions, which needs to be further analyzed. In addition, the sequences of other animal-CoVs are limited, such as those in the dogs and lions in the GISAID database, which is a limiting factor for comparison of base substitutions.
CAI was used to measure the synonymous codon similarities between the virus and host coding sequences. For each animal source of the SARS-CoV-2 sequence, we calculated the average genome and spike gene values in the CAI (Fig. 2C & 2D). Bat-CoV (RaTG13) and SARS-CoV-2 (from humans) had higher CAI values, which indicates that the viruses adapt to their hosts (bat and human) with optimized or preferred chosen codons, while the dog source of SARS-CoV-2 had lower CAI values, suggesting that SARS-CoV-2 adapts to dogs with random codons. This nding was consistent with the conclusion that, compared to dogs, humans are favored hosts for adaptation [43]. The whole genome or spike sequence in Mink-SARS2 had a similar substitution level to human SARS-CoV-2, pointing to the ongoing adaptation of SARS-CoV-2 to the new host and using the preferred chosen codons.
The spike protein is critical for virus infection and host adaptation. We observed that three nonsynonymous mutations in the RBM domain, Y453F, F486L and N501T, independently emerged but were rarely observed in human lineages; these residues are directly involved in contact with the surface of the S-ACE2 complex and therefore are relevant to new-host adaptation. Other mutations within the RBM domain should also be monitored to prevent viral transmission and to prevent the virus escaping from host antibody immunity. Such as the B.1.351 SARS-CoV-2 variant carrying the E484K and N501Y mutations, which can bind to K31 and Y41 on ACE2 to promote virus entry into the cell [44], had the ability to re-infect the recovered patients or people who have been vaccinated [45].
The current data show that the preventive effect of vaccines on Delta variation was reduced [46], but it can still prevent the occurrence of severe cases. The codon bias and frequency of mutants should be further evaluated to optimize the vaccine and block transmission.   The binding free energy of the wild-type RBD and three mutants with the human receptor and mink receptor, respectively. (F) Characterization of the binding between human ACE2 (hACE2, upper) and mink ACE2 (mACE2, lower) with spike RBD mutants by FACS. His-tagged wild type RBD, RBD mutants and NTD proteins were incubated with CHO cells expressing with eGFP-fused ACE2s. Cells stained with the wild type, mutant RBD and NTD proteins are shown in bright green, red and orange, respectively. The spike NTD was set as the negative control.