Evolutionary trajectory and origin of SARS-CoV-2 variant

: Alignment-based quality attributes have identified several variants of SARS-CoV-2(severe acute respiratory syndrome coronavirus 2) that causes COVID-19 pandemic, but they face challenges to provide dynamic trajectory and origin of these variants. This study employs an alignment-free approach combining Fréchet distance (Fr) and artificial recurrent neural network to reveal evolutionary trajectory and origin of SARS-CoV-2 variant. Fr generates a distance matrix of 84 genome features and more than one million of genome sequences. Recurrent neural networks use this Fr matrix to quantitatively identify variants and reveal the evolutionary trajectory and origin of SARS-CoV-2. Total 34 SARS-CoV-2 variants have been identified. All these variants dynamically delete their genome during evolution, but their trajectory and deletion degree varies with individual variants, which can be classified into 3 groups, slight mutation group (13 members), middle level deletion (17 members), and high deletion (4 members). The slight deletion group works like wild type and its trajectory waves only slightly and temporarily, which has very low infection capacity. The high deletion group fluctuates with a rough trajectory characterized as a large loss and it also infects humans lightly. The middle deletion group gradually deletes their genome with a certain rhythm trajectory, corresponding to the pandemic peaks. This group causes most of the global COVID-19 cases. At least 3 mink coronavirus variants pose 56 genome features similar to SARS-CoV-2 and they are predicted to be able to infect human, and thus mink is the most likely origin of SARS-CoV-2, and the origin path follows this order: mink, cat, tiger, mouse, hamster, dog, lion, gorilla, leopard, bat, and pangolin. Therefore, this mink-origin SARS-CoV-2 evolves with a gradual deletion rhythm to infect humans artificial recurrent neural

Current studies on the origin and evolutionary trajectory depend on the traditional alignment-based methods. These studies have identified variants on the basis of mutation characteristics and have provided useful information to understand the static states of virus mutations 4,[15][16][17][18][19][20] . However, SARS-CoV-2 have accumulated more than 50,000 mutations 21 and a variant (e.g alpha) usually carries many mutations. It is biased to discriminate variants based on one or two mutation signatures although much effort has been paid to select variant signatures 22,23 . More importantly, virus variants keep mutating during their dynamical evolution. A certain number of signatures might disappear at a time point but come back again later for a variant. Therefore, it is challenging for these static state approaches to provide a clear picture for SARS-CoV-2 origin and evolution trajectory.
Recently I created a quantitative system using Fréchet distance 24 (Fr) to compute distance between a variant and reference genome decomposing into 84 genome features (4 single nucleotides, 16 dinucleotides and 64 codons) 25 . Combining Fr and artificial neural network has provided a clear picture for SARS-CoV-2 genome evolution 25 . Here I expanded this system to investigate SARS-CoV-2 variant evolution trajectory and origin.

Variant identification
This study identified variant clusters(referred as variants) by two steps, UMAP 26 pre-classification and confirmation by long short-term memory 27 (LSTM, an artificial recurrent neural network architecture) ( Figure 1A). All genome sequences used here were the same as that of my recent study 25 , including 1,128,954 genome sequences filtered out from 2,212,864 that contained 12 officially defined variants 18,19 . Most of sequences were identified as alpha (973995 samples) and delta (116104 samples) deposited on July 4, 2021 in GISAID database. The study also employed the quantitative matrix precreated for studying genome evolution 25 , which contains Frs for 1,128,954 genome sequences and 84 genome features, including 4 single nucleotides, 16 dinucleotides and 64 codons (materials and methods). This present study split these 1,128,954 samples into weekly chunks following the timeseries order from December, 2019 to July 4, 2021. These weekly chunk data were used to identify variants.
For explaining the algorithm to identify variants, I described below the detail in identifying variant 0 as an example. A variant is defined here as that it must have at least 50 members in the pre-classification step. To have enough members in a variant, I used first 8 week data to pre-classify variant 0 to variant 4 by UMAP ( Figure 1B, materials and methods). These pre-classified members of variant 0 were subjected to further confirmation via using LSTM, in which a LSTM model was built with 4 layers (materials and methods) and members in variant 0 were used as train-set and members of the rest of variants (variant 1-4) as test-set. The mean absolute error(MAE) was calculated for the training set (variant 0 in this case). MAE measures distance between two groups (e.g. actual value and prediction here) and it can infer outliers, which have large MAE. In this study, MAE threshold was set as >mean + 1.5 standard deviation. MAE distribution for variant 0 was plotted ( Figure 1C) and all samples within the threshold were finally classified as variant 0 members and a few samples outside the threshold were treated as outliers, which were put back to the sample pool waiting for the next cycle UMAP( Figure   1D) . This model trained by variant 0 was used to search its full members from the rest data pool (from week #9 to July 4, 2021. ) week by week by using weekly chunk data.
Similarly, variant 1-4 members were identified. The 84 feature Frs of these 5 variants showed that variant 0 was closed to reference sequence with less Frs changes ( Figure 1E), but variant 2-4 had undergone a series of mutations. Variant 4 had already mutated most of its 84 features( Figure 1E) within 8 week's data. It was unbelievably fast. This indicated that SARS-CoV-2 actually rapidly mutated, and also suggested that SARS-CoV-2 actually stayed around the human community for a long time before emerging.
After these variant 0-4 identification, searching for other groups of variants was performed by moving the week window to the next and finally all 34 variants were identified ( Figure 1F). From variant 3, all variants identified here contained mix members of existing official variants( Figure 1F). For example, variant 4 already contained members of alpha, gamma, delta and others( Figure 1F). This indicates that this variant 4 already has a mix of multiple features (mutations) of current variants. To understand the evolution across all variants, I need to find a way to compare their similarity. MAE is a metric because MAE actually calculates the distance between samples. A variant has a larger MAE, it has longer distance from wild type and it is less similar to wild type. I treated variant 0 as a wild type and used its Fr matrix as train-set and the Fr matrix of other 33 individual variants as test-set to compute the MAE of samples.

Evolutionary trajectories of variants
Plotting the median of MAE and prediction cases of each variant showed that variant 1 was closed to wild type as expected and variant 23 was the one with most mutations ( Figure 4E). From the Loess regression curve derived from prediction cases, variants(e.g. 13) with middle level of mutation have more capacity to infect humans. Variants carrying low and high mutations had low infection capacity.
Variant 24 also had the highest prediction cases, but it already over mutated so its infection capacity might go down soon if it continued to mutate. However, its genome was forecasted as stable in the near future (purple line in figure 2C), its infection capacity would be still high in the near future. The

SARS-CoV-2 origin path
To understand the origin path of SARS-CoV-2, I also calculated the MAE between human wild type and animal coronavirus samples from GISAID, but human samples were reversed in time-series order, from 2021 to 2019. As done before, the Fr matrix of variant 0 was used as train-set to fit the pre-built LSTM model and animal samples were treated as a test to calculate MAE for samples. Animal samples with lowest MAE were closed to human wild types. Ranking minimum MAE for all animals revealed that mink was very close to human wild type (MAE near 0, Figure 4A), followed in order by cat, tiger, mouse, hamster, dog, lion, gorilla, leopard, bat, and pangolin. This indicated that mink coronavirus had the ability to infect humans directly and it was the most likely origin of SARS-CoV-2. In contrast, it is unlikely for a coronavirus from bat or pangolin to infect humans.
To understand how mink coronavirus was so close to SARS-CoV-2, I plotted the MAE between human and mink ( Figure 4B) and found that several of the mink samples had mutated to be similar to human wild type. Moreover, these mink mutants had 56 consensus features (defined as ones with the same sign of positive and negative Fr, Figure 4C) and they had 25 features(out of 56) different from normal mink viruses( Figure 4D). These mink viruses actually shared 57% (32 out of 56) features with humans ( Figure 4E). Except for 0, only 16% (9/56) was different between mink and human. Therefore, mink is predicted as the SARS-CoV-2 origin.

Discussion
This study employed LSTM and Fr to quantitatively identify 34 SARS-CoV-2 variants. Traditionally, variants have been identified on the basis of quality attributes, especially point mutation, leading to a bottleneck for understanding their dynamic evolution trajectories. In contrast, quantity attributes can be used for downstream modeling and pattern recognition and appreciating their evolutionary trajectories.
Evolutionary trajectory of SARS-CoV-2 have been intensely reviewed and discussed 5,6 , but they have not been revealed. This study uncovered its trajectory, in which it overall continues to shorten its genomes during evolution. Individual variants have their own trajectories, but those with middle level of deletion have more infection capacity and these variants (e.g. variant 13 and 24) are the real source causing COVID-19 pandemic. The wild type SARS-CoV-2 and highly mutated variants have low infection capacity.
The wild type SARS-CoV-2 has less capacity to infect humans and the early samples already carry many mutations. It is unlikely for these viruses to mutate suddenly, suggesting that SARS-CoV-2 or its similar virus already stay inside the human community for a long time, like years before 2019 emerging.
Recent researches have found that mink coronavirus can trans-infect human 7 . Consistently, this study also predicted mink coronavirus as the origin of SARS-CoV-2. Furthermore, this study further uncovered the origin path of SARS-CoV-2, a mystery hanging on for years, which follows this order: mink, cat, tiger, mouse, hamster, dog, lion, gorilla, leopard, bat, and pangolin. Therefore SARS-CoV-2 comes from our neighbors like mink, cat and mouse, instead of bat and pangolin as thought.

Materials and methods
Except the method for identifying variant, all other materials and methods were referred to my paralleled study 25 Briefly, this study directly used the matrix (1,128,954*84) containing Fréchet distance of 84 genome features (4 single nucleotides, 16 dinucleotides and 64 codons) for 1,128,954 genome sequences filtered from total 2,212,864 sequences, which was downloaded on July 4, 2021.
All computations were done under Linux with python 3.8, TensorFlow 2.4.0 and Scikit-learn 0.24.0.

Discrete Frechet distance (Fr)
This study computed the coupling Fr 24 for each feature (e.g. "A") for a given individual virus genome (e.g. EPI_ISL_601443, alpha variant) against the reference genome(NC_045512).
Given P and Q as a feature (e.g. "A") trajectory for the reference and a variant respectively.

Fr between Q and P is defined as
Fr(Q,P) = min{ max distance(Q[b i ], P[a i ]) for all possible couplings between P and Q} Frs for all 84 features of more than one million virus were deposited in the project website

Variant identification
The UMAP 26 was used to pre-cluster variants and the variants were confirmed by using a LSTM model. This LSTM model contained 4 hidden layers, the activation was set to 'relu', and adam was used as the optimizer. MAE was used for the loss. 150 and 20 was respectively set as epoches and batch size.
Validation split was set to 0.1.

Evolutionary trajectory and infection case prediction and forecast
This part used the same LSTM model as my paralleled study. Two independent machine-learning sets were prepared in this study, including test (5%) and training (95%) following the order of time series.
Batch size and epochs were set to 64 and 50 for all machine learning. dropout (0.2) was set for two model layers for all models in this study. 4 hidden layers were set to a typical running. Adam was used for model optimizer and mean_squared_error was set for the loss.
The forecasts were set 30 days after July 4, 2021.

Final graphing
Several final summary were drawn by using ggplot2 in R. Otherwise, they were completed by python.

All detailed figures and data was deposited in the project website
https://combai.org/ai/covidvariant/

Funding
No funding resource associated with for this project

Acknowledgment
Thank GISAID (https://www.gisaid.org) for providing the full data.       TCA  AGT  GTC  GCC  ATT  AGC  TGT  GTT  GTA  CGC  TA  GCG  AT  GC  TAC  ATA  GT  TT  GG  GCA  TGG  TGC  CT  TAA  GAG  GGC  AG  ACG  CA  ACT  AGA  AAT  AC  CTT  C  TCT  T  ACA  AAG  CCC  TAT  TTA  CCG  TGA  GAA  CTA  CAT  G  GA  CG  TTG  CTG  CAC  CGG  GGT  TG  GCT  TTT  GAC  CGA  TCC  CC  ACC  CCT  CTC  GGA  ATG  CAG  TC  AA  CAA  TAG  CCA  GTG  TTC  A  TCG  CGT  GGG  AAA  AAC  AGG GAC  AGG  GGT  ATT  TAC  AAA  TAA  AAT  GG  TA  AT  TG  AA  G  T  A  CA  TTG  ACA  ACT  AGA  CTA  TGC  GCT  TTC  CTG  TAG  GCA  CTC  GGC  TCC  GCC  CCC  TCG  CGC  CGA  CGG  CCG  GGA  CCA  GC  TCT  CAG  AGC  GAT  ATC  GCG  AGT  TGT  AAC  GAA  CAC  CG  CGT   log2 fr   top1  top2  top3 Top3 mink consensus  CAC  CG  AAC  GAA  TGT  GC  AGC  AGA  CA  GAT  ACA  CAG  TCT  GGA  TAG  CTA  TTC  CTC  TCC  CCA  GCC  CCC  TCG  TGC  GCT  CCG  GGC  GCA  TTG  CGC  CGG  CTG  GCG  ACT  T  G  A  ATC  AGT  TG  AT  AAT  AA  CGA  TA  GG  TAC  TAA  GAC  ATT  AAA  CAT TAC  CGT  GG  TA  AA  CG  AAC  AAA  TTC  G  A  TGT  CAG  GC  GCA  TGC  AGC  ATC  GCG  CCC  CCG  CGA  CGC  CGG  CTG  GCC  GCT  GGC  TCG  TTG  GGA  ACT  AGA  CTA  CTC  TAG  TCC  CCA  GAT  TCT  AGT  TAA  GAA  CAC  AGG  GAC  CA  TG  T  AT  ACA  ATT  CAT  GGG  AAT   log2 fr   top3mink_median  human_median Top3mink vs top5 human