Acquisition of virus genome reference sequences
The accession IDs of all the virus genome reference sequences (RefSeq) and their corresponding host range label (under label ‘Host’) are acquired from the ‘Viral genome browser’ of National Center for Biotechnology Information (NCBI)54. The accession IDs was used to download coding sequences later through Biopython toolkit. Viruses with limited sample count in host ranges are ignored in later studies except ‘Human’, ‘Vertebrates’, ‘Invertebrates’, ‘Land plants’, and ‘Bacteria’, but they are remained in the dataset as negative samples throughout the study. The incomplete viral genome sequences (labelled as ‘Incomplete’ in ‘RefSeq type’) were discarded throughout the study. The multi-partite virus which has multiple NCBI accession IDs for multiple genome segments are summarised as the same virus, which all the genes in different genome segments are considered in the same virus. Total 10820 samples were retrieved with 488 Human samples, 1758 Vertebrates samples, 1851 Invertebrates samples, 1763 Land plants samples, and 4041 Bacteria samples.
Acquisition of other virus genome sequences
For testing the trained RF model, complete virus genome sequence data of MERS-CoV, Zaire Ebolavirus, West Nile virus, Zika virus, Orthohantavirus, Influenza A virus, Henipavirus, Lyssavirus Rabies, and SARS-CoV-2 were acquired from NCBI database. The accession IDs of all those virus genomes used by this study are acquired from the NCBI Virus database55, which other information such as ‘Host’, ‘Pangolin’ et al were also downloaded there (incomplete genomes were discarded), and only the viruses has ‘Host’ label were downloaded. The accession IDs was used to download coding sequences through Biopython toolkit, and the viruses are separated into two groups based on whether they are labelled as ‘Homo Sapiens’ in ‘Host’. Total 639 genome sequences of MERS-CoV (256 human-sourced, 383 non-human-sourced), 563 genome sequences of Zaire Ebolavirus (435 human-sourced, 128 non-human-sourced), 1823 genome sequences of West Nile virus (137 human-sourced, 1686 non-human-sourced), 240 genome sequences of Zika virus (208 human-sourced, 32 non-human-sourced), 826 genome sequences of Orthohantavirus (142 human-sourced, 684 non-human-sourced), 614 genome sequences of Influenza A virus (131 human-sourced, 483 non-human-sourced), 55 genome sequences of Henipavirus (35 human-sourced, 20 non-human-sourced), 1862 genome sequences of Lyssavirus Rabies (30 human-sourced, 1832 non-human-sourced), and 755151 genome sequences of SARS-CoV-2 (all human-sourced) were retrieved. The WHO Name information related to SARS-CoV-2 was acquired from ‘cov-lineages.org’ database, which is assigned to the samples according to their pangolin labels56.
Calculation for RSCU
The Relative Synonymous Codon Usages (RSCU) of the virus genome, as readouts of codon usage biases, are calculated based on codon counts and amino acid counts of coding sequences according to their definition proposed in previous publication31,32. All the coding sequences, or CDS, in a virus genomes (either mono-partite or multi-partite) were converted into counts of each codon and counts of each amino acid. The counts of the same codons from all CDS in a virus genome were summed to represent codon counts for the whole genome, which same method was applied to the counts of amino acids. RSCU of each codon were generated based on each codon count and respective amino acid count. The codons for 1-box amino acids, UGG (Try) and AUG (Met) are ignored due to unchanged values (= 1). The stop codons UAA, UAG and UGA are also ignored because they were not relevant to translation efficiency. Thus, the RSCU dataset (or DR) consists of total 59 codon features. As the start and stop codons are discarded, our analysis was purely focused on the codon usage biases of the coding sequences in translation efficiency.
Dimensionality reduction of the RSCU data matrix
The raw RSCU data matrix, which virus genomes as samples and codons as features, was first normalised through the Z-score Normalisation method. The normalised RSCU data matrix was later compressed by Principal Component Analysis (PCA) method, where the cut-off threshold was set as 1.0 (no loss in explained variance). This method can compress the normalised data without loss of the variances by removing the redundant features. Dimensional reduction analysis was performed on the normalised and compressed data using the Uniform Manifold Approximation and Projection (UMAP) algorithm45. The raw RSCU data matrix was then transposed to study the viral codon behaviours, which codon labels become samples and all the virus genomes become features. The transposed RSCU data matrix was applied with the same dimensional reduction pipeline as above.
Independent T-test
The independent T-test was performed through the python package Scipy. The RSCU of each codon was analysed by comparing host and non-host virus genomes (i.e. human virus v.s. not-human virus), which results with p-value smaller than 0.05 will be considered as significantly different.
Random forest machine learning
The train datasets were resampled by the Synthetic Minority Oversampling Technique (SMOTE)46 to overcome sample imbalance. The Random Forest models were trained with SMOTE-resampled train datasets with Scikit-learn57, when Balance accuracy, F1 score, Recall scores and ROC-AUC scores were used as the standard of the model performance due to sample imbalances. The open-source OPTUNA framework was used for hyper-parameters tunning with specific trials for different scenarios (20 or 50 trials)58. The predicted probabilities from the models’ predictions were considered as the readout of virus codon fitness (VCF) in a specific host, which is computed from embedded function of Scikit-learn. The feature importance metrics are also computed from embedded function of Scikit-learn.
Selection of additional features
Besides RSCUs, other feature datasets were used to achieve better machine learning prediction. Datasets ‘Codon%’ (DCodon%) and ‘AminoAcid%’ (DAminoAcid%) were simply calculated with percentages of different codons or amino acids in total codon count of amino acid count of virus genome. Stop codons were ignored in both DCodon% and DAminoAcid% (thus 61 features in DCodon% and 20 features in DAminoAcid%, stop codons were not included). Dataset ‘ATGC%’ (DATGC%) was simply calculated with frequency of each nucleotide (A%, U%, G%, C%, AU%, GC%), which AU% and GC% were calculated by summing A%/U% and G%/C%. Dataset ‘Start-Stop Codon%’ (DStartStopCodon%) is the calculated with frequency of the start codon (AUG%) in all the CDS of virus genomes, and the frequency of each stop codons (UAA%, UAG%, UGA%) in all the CDS of virus genomes. Dataset ‘CDS Length’ (DCDS Length) consists of features genome length, Concatenated CDS length, CDS count, and the mean and standard deviation of CDS length, which Concatenated CDS length is the sum of all CDS length. Dataset ‘HumanCorr’ (DHumanCorr) is correlation coefficients calculated between virus RSCU and Human reference RSCU, which is Human reference RSCU acquired from the CoCoPUTs database59. Dataset ‘HumanCorr(AA)’ (DHumanCorrAA) dataset is correlation coefficients calculated among between virus RSCU and Human standard RSCU from each Amino Acids. Met (M) and Trp (W) as well as Stop codons were ignored. Dataset ‘Partite’ (DPartite) consist of the virus classifications of either mono-partite or multi-partite. Dataset ‘Taxonomy’ (DTaxonomy) consist of tertiarily encoded data based on taxonomy information acquired from the NCBI Taxonomy database with Realm (or Clade), Kingdom, Phylum, Class, Order. The positive samples were labelled as ‘1’ and negative samples were labelled as ‘-1’, which unknown samples were labelled as ‘0’.
For searching additional features beneficial to model performance, above datasets were individually or combinatively added to RSCU dataset before training models in different conditions (i.e. different train-test ratio, different hyperparameter tuning strategies et al), which the trained models were evaluated and compared with Balanced accuracy after parameters optimisation (Supplemental Fig. 2). 50 trials were set in OPTUNA hyper-parameters tuning. As consequences, Taxonomy dataset and CDS Length dataset additional to RSCU dataset increases Balanced accuracy of the trained models, and three datasets were generated: DR (DRSCU), DRT (combination of DRSCU and DTaxonomy), and DRTC (combination of DRSCU, DTaxonomy, and DCDS Length).
Leave-One-Out machine learning
To further confirm reliability of using RSCU and other features in predicting HVCF scores of different host-ranged viruses, the Leave-One-Out (LOO) method was carried out, which all other samples were used to train a RF model in predicting one test sample. The same process was carried out separately to all the samples, and only 5 trials were used in OPTUNA hyper-parameters tunning. Model performances such as Balanced accuracy and Recall score were generated with summary of total 10820 predictions (correct/wrong predictions for 10820 samples or models). The models with wrong predictions were later re-trained with 50 trials setting in OPTUNA hyper-parameters tunning to see whether it will have correct predictions (Supplemental Fig. 3).
Simulation for SARS-CoV-2 evolution path
The predicted probability from the DRTC-trained Recall-optmised RF model trained with all 10820 samples was considered as the readout of human virus codon fitness score (HVCF score) because it has best prediction performance. To predict an codon fitness evolution path between two viruses, the start-point virus and the end-point virus were determined between SARS-CoV-2 and a target Betacoronavirus for either the Forward or Backward mutation simulation. The Forward mutation is from the target Betacoronavirus to SARS-CoV-2, while the Backward mutation is from SARS-CoV-2 to the target Betacoronavirus. At each mutation step in the evolutionary process simulation, every possible mutation was applied to the codon count compositions of the virus, including all possible substitution (i.e. AAA→AAG), addition (i.e. +AAA), or deletion (i.e. -AAA) of codons. The DRSCU and DCDS Length of the updated virus codon compositions were re-calculated except for the DTaxonomy (remained as Coronaviridae). The new HVCF score was then predicted with the updated DRTC. Among new HVCF scores derived from all the possible mutations, the simulated mutation generating the lowest/highest HVCF score (depending on simulation direction, Forward or Backward) was selected. When multiple mutations have the same lowest/highest HVCF score, the additional analysis of correlation coefficient was calculated between updated DRSCU and DRSCU of the end-point virus (SARS-CoV-2 in the Forward path or the target Betacoronavirus in the Backward path). The simulated mutation generating the best correlation coefficient was selected.
Similar to the gradient descent, this process was step-by-step repeated until reaching the HVCF score of the end-point virus. In some cases, the simulation may reach a stagnation because of possibility of mutually contradictory mutations (i.e., AAA→AAG then AAG→AAA). To avoid such meaningless loop stagnation, mutually contradictory mutations were forbidden in the simulation process. For instance, if an ongoing evolutionary path has AAA→AAG, then mutation AAG→AAA must be excluded in the subsequent simulation.