Evolutionary trajectory of SARS-CoV-2 genome

DOI: https://doi.org/10.21203/rs.3.rs-1009010/v1

Abstract

Traditional alignment-based phylogenetics struggles to uncover the evolutionary trajectory of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2). This study develops a novel alignment-free system and uses it to reveal the evolutionary trajectory of SARS-CoV-2 from more than one million genome sequences. The system combines the Fréchet distance (Fr) with an artificial recurrent neural network. Fr measures the distance between a variant genome and the reference genome and is decomposed into 84 features (4 single nucleotides, 16 dinucleotides and 64 codons). The recurrent neural network predicts and forecasts the time-series Fr trajectory, from which the evolutionary trajectory of SARS-CoV-2 is inferred. Overall, the SARS-CoV-2 genome has mutated rapidly via deletion during the COVID-19 pandemic. Among single nucleotides, C mutates quickly whereas T changes slowly. C-prefixed dinucleotides (e.g. CG and CT) are also lost dramatically during evolution. Similarly, the virus genome deletes several C-prefixed codons (e.g. CCT) but gains several T-prefixed and A-prefixed codons (e.g. TTA and ATT). Interestingly, the codon CCT and the dinucleotide CT centrally control the entire SARS-CoV-2 genome, and their evolutionary trajectories fit the spikes in COVID-19 cases. Therefore, the trajectories of C-prefixed features mark SARS-CoV-2 evolution.
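To make the 84-feature decomposition and the distance measure concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how a genome is turned into a curve, so several choices here are assumptions made for illustration only: features are counted with an overlapping sliding window, each feature is represented as a running count along the genome (the hypothetical helper `cumulative_curve`), and the distance is the standard discrete Fréchet distance computed by dynamic programming. All function names are illustrative.

```python
from itertools import product

BASES = "ACGT"


def feature_names():
    """Return the 84 feature labels: 4 mono-, 16 di- and 64 trinucleotides."""
    names = []
    for k in (1, 2, 3):
        names.extend("".join(p) for p in product(BASES, repeat=k))
    return names


def count_features(seq):
    """Count every feature with an overlapping sliding window (an assumption).

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(feature_names(), 0)
    for k in (1, 2, 3):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


def cumulative_curve(seq, kmer):
    """Hypothetical curve: running count of `kmer` along the sequence."""
    seq = seq.upper()
    curve, total = [], 0
    for i in range(len(seq) - len(kmer) + 1):
        if seq[i:i + len(kmer)] == kmer:
            total += 1
        curve.append(total)
    return curve


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty 1-D curves."""
    if not p or not q:
        raise ValueError("both curves must be non-empty")
    n, m = len(p), len(q)
    # ca[i][j] is the coupling distance for the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(p[i] - q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


if __name__ == "__main__":
    reference = "CCTACCT"  # toy sequences, not real genome data
    variant = "CCTA"       # the second CCT has been deleted
    print(len(feature_names()))
    print(count_features(variant)["CCT"])
    print(discrete_frechet(cumulative_curve(reference, "CCT"),
                           cumulative_curve(variant, "CCT")))
```

Under these assumptions, computing one such distance per feature for every dated variant would give 84 time series, which is the kind of input a recurrent neural network could be trained on to forecast each trajectory.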

Full Text

This preprint is available for download as a PDF.