SARS-CoV-2 Origins and Evolution: Insights from Coronaviruses Recombination and Phylogenetic Analysis


 Background: It is imperative in the midst of a global epidemic to investigate the origins of the infectious agent especially when it has reached parts of the world with either ailing economies or pre-existing political turmoil consistent with non-functional health systems. Methods: To explore the possibility of cross species infection, genomic recombination and the emergence of novel coronaviruses in the near future we carried out recombination and phylogenetic analysis to determine the spatio-temporal evolution and origins of the current SARS-CoV-2 virus. Results: Our findings prove using two robust recombination tools, RDPv4.100 and SimPlot3.5.1 analysis that SARS-CoV-2 is a recombinant of pangolin and bat RaTG13 sequences as been previously shown elsewhere. We also report one novel recombination event between two SARS-CoV-2 sequences (SARS-CoV-2 sequence, MT188341, SARS-CoV-2 sequence, MT293183). Bearing in mind that the prerequisite for recombination is the occurrence two viral sequences in the same reservoir, biological niche or host at the same time we postulate either co-infection with the two viral sequences, or superinfection, both scenarios have not been reported elsewhere. Conclusion: The possibility of recombination between the SARS-CoV-2 sequences poses the likelihood of the emergence of new and maybe more or less virulent “strains” of the virus. We believe that the future of science lies in our ability to be able to use computational based methods to predict the genetic sequences of infectious agents of the next epidemics. The addition of more SARS-CoV-2 sequences has a bearing on the understanding of the origin, evolution and clinical outcome prediction of given viral genomes. More SARS-CoV-2 sequences are needed to elucidate our understanding of this family of viruses.


Introduction
Coronaviruses (CoVs) are of the order Nidovirales, sub-order Cornidovirineae, family Coronaviridae and of the subfamily Orthocoronavirinae according to the recent ICTV taxonomic classi cation (https://talk.ictvonline.org/taxonomy/). The subfamily is divided into four genera namely Alphacoronaviruses, Betacoronaviruses, Gammacoronaviruses and Deltacoronaviruses (1). Alpha-CoVs and Beta CoVs have been mainly associated with mammals while Gamma-CoVs and Delta-CoVs have been associated with birds (2,3). The CoV genome is a single stranded positive sense RNA virus with a genome ranging from 27kbp to 31kbp, making it the largest known RNA genome to date (4). All CoVs share similarities in their genomic organisation, which comprises 16 non-structural proteins from nsp1 to nsp16 encoded by the ORF 1a/b at the 5' end (5). The genome also encodes four structural proteins: the spike protein (S), envelope protein (E), membrane protein (M) and the nucleocapsid protein (N) (6,7).
CoVs history dates back to the 1960s, where a variety of animal pathological conditions ranging from transmissible gastroenteritis to infectious bronchitis and to encephalitis in mice, were described; canine respiratory CoVs (8,9), mouse hepatitis virus (MHV) (10,11), bovine CoV (12), feline CoV (9) and turkey CoV (13) were among the rst animal and bird diseases to be documented.
Up until late 2019 only six CoVs were known to infect humans (14); these are HCoV229E, HCoV-OC43, SARS-CoV, HCoV-NL63, HCoV-HKU1 and MERS (15). The 229e and OC43 are the most widely studied HCoVs and they account for about 15-29% of all human common cold cases (16). SARS-CoV was the aetiological agent behind the 2003 outbreak in China (17), MERS-CoV was the pathogen behind the 2012 outbreak in the Middle East (18) and SARS-CoV-2 is responsible for the current and ongoing outbreak that started in Wuhan, China in November 2019 (4).
The origin of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has always been a question of many studies since November 2019 and even before the focus has been on the origins of SARS-CoV from the 2003 outbreak (1). It is imperative in the midst of a global epidemic to wonder why the origins of the infectious agent matter, especially when it has reached parts of the world with either ailing economies or political turmoil consistent with non-functional health systems. CoVs are normally hosted in, and their evolution moulded by bats (19,20). It has been hypothesized that a large portion of the coronaviruses in humans are of bat origin (21). Obviously, a few research groups have as of late a rmed the hereditary comparability between SARS-CoV-2 and a bat betacoronavirus (22)(23)(24)(25)(26)(27). The entire genome sequence of SARS-CoV-2 has 96.2% identity to a bat SARS-related coronavirus (SARSr-CoV; RaTG13) collected from Yunnan territory, China (28,29), yet it is not fundamentally the same as the genomes of SARS-CoV (about 79%) or MERS-CoV (about half) (30,31). It has additionally been a rmed that the SARS-CoV-2 uses a similar receptor, the angiotensin converting enzyme II (ACE2), as the SARS-CoV (25). In spite of the fact that the particular course of transmission from characteristic animal reservoirs to humans stays unclear (20,27), a few studies have demonstrated that pangolins may have provided a partial spike gene to SARS-CoV-2; the basic functional sites in the spike protein of SAR-CoV-2 are about indistinguishable from one isolated in an infection from a pangolin (32)(33)(34). These ndings indicate that SARS-CoV likely evolved in bats or pangolins over long periods of time. In general there are three theoretically plausible scenarios to explain the origins of SARS-CoV-2 (i) natural selection in animal host before zoonotic transfer (ii) natural selection in humans after zoonotic transfer and (iii) selection during laboratory passaging of experimental viruses (35). In any of the 3 scenarios the selection or change in the genome is due to either recombination or mutation (19).
Despite these recent discoveries, several aspects of the evolutionary tenets and driving forces behind the outbreak of SARS-CoV-2 remain unexplored (36). Here we further interrogated and investigated the likelihood of the most popular theory: of the origin of SARS-CoV-2 from bat and pangolin CoVs, by using recombination analysis of 60 reference CoV whole genomes and 413 SARS-CoV-2 sequences to explore the possibility of recombination as a driving force. This work provides new views into the factors driving the evolution of SARS-CoV-2 and its pattern of spread through the human population as shown in the phylogenetic analysis of 690 SARS-CoV-2 whole genome sequences from different geographical areas.

Methods
Design This is an exploratory study to investigate the origin of CoVs as previously postulated in literature, CoVs can emerge from recombination events between human CoVs and other species CoVs (5). We rst analysed the possibility of SARS-CoV-2 being a recombinant by aligning 60 CoV sequences with the SARS-CoV-2 reference sequence, then to detect the possibility of recombination within SARS-CoV-2 viruses a second recombination analysis was done using 413 SARS-CoV-2 sequences. In both recombination analyses tests were carried out to validate the recombination events. Phylogenetic analysis of SAR-CoV-2 sequences with previously known CoV reference and representative sequences was also done. Lastly we did a phylogenetic analysis of 690 SARS-CoV-2 sequences from the current outbreak to get a clearer picture of the velocity of the evolution of these sequences, from different geographical locations.

Source of sequence data
Four sequence data sets were used 1) 60 representative and reference CoV sequences 2) 413 SARS-CoV-2 complete genome sequences from the current outbreak 3) 473 CoV sequences, which were a combination of datasets 1 and 2 and 4) All the 690 SARS-CoV-2 complete genome sequences that were available from Genbank database at the time of the analyses. All the sequences including the reference SARS-CoV-2 sequence (NC_045512.2) were downloaded from Genbank (https://www.ncbi.nlm.nih.gov/genbank/) and the NCBI Virus resource (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/), see also supplementary material 1a, b, c and d for the Fasta format of the sequences).

Alignments and phylogenetic analysis
All alignments as listed in Table 1 below were done using MAFFET version 7.45 (37), (https://mafft.cbrc.jp/alignment/server/large.html) (38). All phylogenetic trees were generated using the Neighbour-Joining method with 1000 bootstraps, and the Newick outputs were viewed and modi ed in ITOL (https://itol.embl.de/tree/). For the 60 CoV reference sequences (59 previously known CoVs and the reference SARS-CoV-2 sequence, NC_045512.2) the MAFFET alignment CLUSTAL output was used for the recombination analysis (R1), for the 473 Coronavirus Sequences (413 SARS-CoV-2 sequences from Genbank plus 60 Coronavirus reference sequences) the Newick format of the tree was uploaded, viewed and modi ed in iTOL as mentioned above, and the CLUSTAL alignment used for the recombination analysis (R2). And lastly for the 690 SARS-CoV-2 sequences the Newick format of the tree was also uploaded, viewed and modi ed in iTOL.
Recombination analysis 60 representative CoV sequences and the 473 sequence data sets as listed in Table 1 were included in the recombination analysis. We constructed alignments of these sequences using MAFFET and the CLUSTAL outputs were used for the recombination analyses. This alignment was analysed using RDP v4.100 (39) (with default settings) which implements analysis of recombination using a suite of 9 recombination detection methods or algorithms: Recombination detection program (RDP) (39), BOOTSCAN (40), CHIMAERA (41), GENECONV (42), MAXIMUM X 2 (43), PhylPro (44), VisRD (45), LARD (46) and SISCAN (47). Only recombination events that were identi ed by at least four methods in the RDP v4.95 software were considered.

Phylogenetic Incongruence testing
The Shimodaira-Hasegawa (SH) test (48) using W-IQ-TREE v1.6.10 (49) was used to interrogate and corroborate the RDPv4.100 results. CLUSTAL alignments (50)(51)(52) of all the 60 CoV whole-genome reference sequences and the 473 sequences dataset, were used to compute the log-likelihoods of phylogenetic trees in W-IQ-TREE (http://iqtree.cibiv.univie.ac.at) (49). The tool tests tree topology, estimates model parameters such as substitution rates and optimizes tree branch lengths to lessen computational usage. Default settings of the W-IQ-Tree were used, including best t model (53) and ultrafast bootstrap analysis (1000 alignments) (54) to run tree topology analysis including the Kishino-Hasegawa (KH) test (55), Shimodaira-Hasegawa (SH) test (48) and approximately unbiased (AU) test (56) to test if there is a difference in evolutionary patterns amongst trees generated after removing the recombinant regions from the original sequences.
The two sets of trees constructed were denoted A1 and A2, A1 representing the trees generated from the original CoV reference sequences alignment and A2 denoting the trees generated after removing the recombinant regions from the original reference sequences, after the recombination analyses. The alignments without recombinant regions were generated automatically from RDP v4.95 (39) after the recombination analysis.

Pairwise percentage identity and Similarity plots
Using SARS-CoV-2_NC_045512.2 as the query sequence, we performed a similarity plot analysis using Simplot implemented in DAMBE6 (57) to estimate the pairwise genetic relatedness of the newly isolated SARS-CoV-2 with other reference corona viruses isolated from different species. We used default sliding window size of 500 nucleotides and step size of 50 under the TN93 distance (58) model as parameter settings for this analysis. The graph that is generated is a set of lines (or optionally strings of points) that re ect the similarity (or distance) of each reference sequence to the query sequence. In order to generate this plot, a sliding window is passed across the alignment in small steps (the window size and step size are selectable).  Table 3. The pangolin CoVs also cluster closely with SARS-CoV with about 78.12% pairwise percentage identity. All of the recombination events were in the non-structural protein region of the genome.  Table 3. Hence it is reasonable to hypothesize that natural recombination and/or mutational events on a SARS-CoV backbone sequence might have led to the emergence of SARS-CoV-2, which may also provide basis for not to ruling out the possibility of laboratory manipulation of the SARS-CoV. Only event 147 was in the structural protein encoding region of the genome, speci cally the nucleocapsid protein region of the genome. *Approximate genomic region was estimated from nucleotide positions based on the SARS-CoV-2 sequence, NC_045512.2 using the Genbank annotated format of the sequence.

Pairwise genetic distance Simplot of SARS-CoV-2 and other related CoV
To corroborate the ndings from the RDP4 software in detecting recombination, we used the Similarity Plot. The Similarity Plot allows identi cation of one query sequence, generally the one suspected to be the recombinant or the mosaic, and the rest of the sequences are considered as reference sequences. The analysis indicated that the newly isolated SARS-CoV-2 represented by sequence SARS-CoV-2_NC_045512.2 in our analysis is more closely related to bat Coronavirus (RaTG13) further supporting preliminary reports that suggested a possible species jump from bats to humans to seed the current pandemic virus associated with COVID19 (see Figure 3). One pangolin CoV MT040336 also shows close identity to SARS-CoV-2 (85.1%) similarity as also shown in Table 3. Table 3. Pairwise percentage identity of SARS-CoV-2 and other related CoV Phylogenetic Incongruence testing To con rm the ndings from the recombination analysis (RDP4) and sequence similarity (Simplot), we used a more conclusive test, the SH test (48). The null hypothesis of the SH test states that the difference between trees (branch length, topology or likelihoods) is zero. In Table 4 the observed differences between the 60A1 and 60A2 and/or 473A1 and 473A2 were signi cantly greater than zero and the null hypothesis was rejected, thus declaring that these trees are signi cantly different i.e. incongruent (p< 0.05) as shown by the p-values (p-SH) indicating that there is substantial phylogenetic incongruence between the 60A1 and 60A2 and/or 473A1 and 473A2 trees. The incongruence between these trees allude to the fact that recombination plays a role in CoV evolution and in SARS CoV-2 in particular. After demonstrating that recombination events play a major role in CoV evolution we sought to determine how the new 413 SARS-CoV-2 sequences would cluster together and also with previously known CoV sequences by constructing 3 phylogenetic trees as described in Figure 4, 5 and 6. Figure 4 shows 60 representative CoV sequences and how they are classi ed into the different genera. It also shows SARS-CoV-2 reference sequence being closely related to bat RaTG13 sequence in the sub-genus Sarbecovirus. Pangolin CoVs and other bat CoVs also cluster close to the SARS-CoV-2 virus. Figure 5 is an extension of Figure 4, with the addition of a collapsed branch of 413 SARS-CoV-2 sequences from the current outbreak. Figure 6 is an analysis of 690 SARS-CoV-2 complete sequences, with the tree rooted against the SARS-CoV-2 reference sequence (NC_045512.2).

Recombination analysis
Our results show as been shown previously the likelihood of SARS-CoV-2 being a recombinant of pangolin and bat RaTG13 sequences (29,35,66,67). Besides the events of interest reported in this paper, we also show in our analyses different other recombination events among CoV sequences, 28 and 37 signi cant events in R1 and R2 respectively (provided in supplementary material 2a and 2b). This is consistent with studies done elsewhere on the role played by recombination as a driving force in the evolution of CoVs (20,23,68). We hypothesize based on the different recombination events between SARS-CoV-2 and bat CoVs other than bat RaTG13 and pangolin CoVs other than pangolin CoV MT040336, that this multiplicity of events is not a recent occasion, but a natural event that happened over a longer period of time.
We report one event between two SARS-CoV-2 sequences (SARS-CoV-2 sequence, MT188341, SARS-CoV-2 sequence, MT293183). Bearing in mind that the prerequisite for recombination is that two viral sequences have to be in the same reservoir or host at the same space and time we postulate either co-infection with the two viral sequences, or superinfection after initial infection with one of the viral sequences, both scenarios have not been reported elsewhere. The possibility of recombination between the SARS-CoV-2 sequences poses the likelihood of the emergence of new and more virulent "strains "of the virus. Evidence of genetic diversity and rapid evolution of this novel CoV has been presented elsewhere (69), using an analysis of just 86 genomes.
The RDP4 analysis, well supported NJ trees, SH test and the SimPlot results all corroborate to the fact that recombination plays a role in CoV evolution. A1 trees generated from the original sequence alignments and A2 trees generated after removing the recombinant regions from the original sequences, were incongruent for the 60 CoV sequences and the 473 CoV sequences (i.e the trees showed different phylogenies), hence supporting recombination as an essential driving force in CoV evolution. The SimPlot as also shown by the pairwise percentage identity table shows the close relation between SARS-CoV-2, bat RaTG13 and pangolin MT040336 CoVs as previously alluded to.
In all the recombination events we also show that most events occurred in the non-structural protein encoding regions especially in the ORF1ab, which have different roles in replication of CoVs (4). Only one event, 147 in R2 occurred in part of the genome that encodes the nucleocapsid protein which is an antagonist of interferon (IFN) and viral encoded repressor of RNA interference (63).

Phylogenetic analysis
We also show the phylogenetic relatedness of the different genera of the CoVs in relation to SARS-CoV-2 reference sequence and in relation to 413 SARS-CoV-2 sequences. These results are in agreement with the recombination analyses mentioned above.
Phylogenetic analysis of 690 SARS-CoV-2 sequences available at the time of the study shows a distinct geographical distribution and evolution of the virus from the Chinese epicentre to two American epicentres past Middle east to Europe and then to Africa.
We hypothesize migration of SARS CoV-2 sequences in a clockwise fashion when referring to Figure 6, from the Chinese/USA sequences in the early days of the epidemic, a migration into the Middle East (No UAE sequences yet) in purple then the rise of an epicentre in the USA, represented by the rst green cluster, then followed by a spillage of these sequences into Europe and Africa (only sequences from South Africa were available), probably before the travel bans. The second green cluster is also exclusively USA sequences, which may represent another geographical epidemic epicentre in the same country given the size of the USA. The basis of this theory is that travel bans were not put into effect for about 3 months into the epidemic, which allowed mingling of sequences from distant parts of the world, as represented by the bulk of sequences in frequently visited countries and also in popular transit airports in countries such as Turkey, Istanbul. It will be interesting to have sequences from Dubai or any UAE states to qualify this hypothesis. It was estimated that the overall risk of SARS-CoV-2/COVID-19 importation/introduction to Africa was lower than that to Europe (1% vs 11%, respectively) (70).

Caveats and limitations to understanding CoV evolution
Our current knowledge of CoVs is limited and mostly focused on 7 human, medically important and human closely related CoVs that have been associated with cross species infection, while the rest of the other plethora of CoVs biology is largely understudied and unknown. According to the virus pathogen database (https://www.viprbrc.org/brc/home.spg?decorator=vipr) there are about 4000 CoV sequences from different animal and avian species. Hence assumptions made from studying a limited number of CoVs cannot be necessarily generalised and applied to all CoVs. The same logic applies to just SARS-CoV-2 sequences from the current outbreak. Many SARS-CoV-2 sequences are still being deposited into Genbank and other repositories, and until a threshold number of representative sequences are attained, the SARS-CoV-2 community of researchers will remain underpowered to make assumptions closest to the reality of what happened in the emergence and evolution of this group of viruses.

Conclusion
The possibility of recombination between the SARS-CoV-2 sequences poses the likelihood of the emergence of new and more or less virulent "strains" of the virus. We believe that the future of science lies in our ability to be able to use computational based methods to predict the genetic sequences of infectious agents of the next epidemics. The addition of more SARS-CoV-2 sequences is essential in achieving this, as it has a bearing on the understanding of the origin, evolution and clinical outcome prediction of given viral genomes. More SARS-  NC_045512.2 as the recombinant sequence and in green the Pangolin coronavirus, MT040336 as the major parent and the SARS-CoV, NC_004718 as the minor parent, as also shown in Table 2.

Figure 2
Unrooted Neighbour Joining tree of event 125 of R1 analysis showing in red SARS-CoV, NC_004718 as the recombinant sequence and in green the Pangolin coronavirus, MT040336 as the major parent and the Bat coronavirus RaTG13, MN996532.1 as the minor parent, as also shown in Table 2.  60 CoV reference sequences. Phylogenetic analysis of 60 CoV whole genome sequences. The sequences were aligned using MAFFET ). This unrooted tree was generated using The Neighbor-Joining (NJ) method, a very popular method due to its good balance between accuracy and e ciency (68). The Newick format of the tree was uploaded and modi ed in iTOL https://itol.embl.de/tree/. It can be seen from the tree that the SARS-CoV-2 reference sequence is a Betacoronavirus more closely related to Bat Coronavirus (RaTG13), to Pangolin CoVs, and also toSARS CoV sequence from the 2003 outbreak in Asia (NC004718).

Figure 6
Phylogenetic analysis of 690 SARS-CoV-2 complete genome sequences obtained from the NCBI Virus resource. The sequences were aligned using MAFFET (38). This tree was generated using the Neighbor-Joining (NJ) method (68). The Newick format of the tree was uploaded and modi ed in iTOL https://itol.embl.de/tree/. For clarity of node labels see Supplementary material 3c, which is a Newick Supplementary Files