The emergence of SARS-CoV-2 by an unusual genome reconstitution

SARS-CoV-2 has been spreading remarkedly fast around the world since its emergence while the origin of the virus remains ambiguous. Here, we constructed all of the original prototype genome sequences of SARS-CoV-2 by selecting the common nucleotide among the different virus strains with species. Phylogenetic analysis on the prototype sequences showed that SARS-CoV-2 was a direct descendant of Bat-CoV and was closely related to Pan-CoV, Bat-SL-CoV, and SARS-CoV. The pairwise comparison of SARS-CoV-2 with Bat-CoV showed an unusual replacement of the motif consisting of 7 amino acids within the spike protein of SARS-CoV-2. Database searches showed that the motif originated from a surface protein of Plasmodium malariae, suggesting that the SARS-CoV-2 was emerged after acquiring the motif of the malaria surface protein.


Main Text
Coronaviruses (CoVs) are prevalent causes of the common cold widely present in nature with a broad spectrum of hosts. Although they frequently infect humans, the natural host of CoVs are animals and thereby all of the CoVs for the common colds, HCoV-229E, HCoV-NL63, HCoV-OC43, and HCoV-HKU1, have zoonotic origins (1). The genomes of CoVs consist of a single-stranded, positive RNA of 26,000 to 32,000 base pairs and a variable number (from 6 to 11) of open reading frames (2) The genome sequence of SARS-CoV-2 (3,4) suggested that SARS-CoV-2 stemmed from coronaviruses of bat or pangolin (5,6). Because of the ambiguity of the origin of SARS-CoV-2, we adopted a different approach to investigate the origin of SARS-CoV-2. First, we pooled out all of the available genome sequences of SARS-CoV-2 as well as its related viral species by the BLAST sequence similarity from all of the publicly available genome databases (Table S1). Since genetic mutation or recombination occurs on an individual basis and these events cannot occur at the same position repeatedly, the prototype sequence of the genome within a species is always the most prevalent base among the members. Therefore, we constructed the original prototype aa sequences of SARS-CoV-2 and its related species, Bat-CoV, Pan-CoV, Bat-SL-CoV, and SARS-CoV ( Fig. S1 & Table S2). The prototype sequences for each viral species were determined by choosing the most prevalent nucleotide bases in each position within the same species from pairwise comparison. The prototype sequences of the ve viral species were directly used for phylogenetic analysis by the neighbor-joining method using Unipro UGENE bioinformatics toolkits (7). As shown in Figure-1A, SARS-CoV-2 was a direct descendant of Bat-CoV and was closely related to Pan-CoV, Bat-SL-CoV, and SARS-CoV. The sequence alignment of the prototype amino acid sequences by using multiple sequence alignments (MSA) using MAFFT program (8) showed that the amino acid (aa) sequence of SARS-CoV-2 shared 98.67% sequence similarity with Bat-CoV while 94.51%, 94.35% and 82.93% of the sequences of SARS-CoV-2 were identical with that of Pan-CoV, Bat-SL-CoV, and SARS-CoV respectively ( Table 1). The aa sequence similarities were well-matched with their phylogenetic distances.
The zoonotic origin of SARS-CoV-2 is under debate to be either Bat-CoV or Pan-CoV. The current argument on the origin of SARS-CoV-2 would not be surprising since phylogenetic analyses can generate different results if individual genome sequences were used. Considering the sequence variability of RNA genomes as in the case of SARS-CoV-2, we believe that the prototype sequence analysis approach would generate a more precise result than conventional approaches in phylogenetic analysis for species consisting of RNA genome.
To investigate how SARS-CoV-2 originated, we compared pairwise differences between the prototype aa sequences of SARS-CoV-2 and Bat-CoV. The pairwise aa sequence analysis showed that both of the two virus genomes encode 14 genes (Fig. 1B). The aa sequence mismatches were randomly distributed except for the consecutive 7 aa alteration (439 NNLDSKV 445) in the spike protein. Since genetic mutation is a random process, the mutated points are expected to be distributed randomly throughout the genome. Considering this nature of mutations, the consecutive 7-aa sequence difference in the spike protein was peculiar, indicating that an unusual event happened during the emergence of SARS-CoV-2 from Bat-CoV. Hence, 3D models of the spike proteins were made to analyze the consequence of alteration to the structure of the spike protein during the emergence of SARS-CoV-2. The spike protein of coronavirus (CoV) is the surface protein that binds to a receptor on the host cell surface. The spike protein consists of three large domains: a large ectodomain, a single-pass transmembrane anchor, and a short intracellular tail (9). The ectodomain is further divided into a receptor-binding subunit S1 and a membrane-fusion subunit S2. Coronavirus rst binds to a receptor on the host cell surface through its S1 subunit and then fuses the viral and the host membranes through its S2 subunit, meaning that the S1 domain plays the most critical role for the virus's invasion into its host (10,11). As shown in Figure-1C, altering the consecutive 7-aa on the outer layer of the spike protein's S1 domain did not affect the overall structural integrity of the spike protein. The 3D modeling also showed that the consecutive 7-aa alteration resulted in a new motif to occupy more space in the S1 domain than its original structure by transforming a partial αhelical structure to a random coil. Since the S1 domain acts as the binding domain for the entry of the virus, enlargement of its space as well as adaptation of different conformation in the outermost layer surface of the S1 domain would endow SARS-CoV-2 to expand its host range.
The unusual pattern of genetic alteration motivated us to investigate the origin of the 7-aa motif (439 NNLDSKV 445). To trace down its origin, we overlappingly defragmented the aa sequence of the spike protein of SARS-CoV-2 into 9 aa units. The fragmented aa sequence matched with all of the available protein sequences. All of the aa sequences matched only with the spike proteins of coronaviruses except for the 7-aa motif. Very surprisingly, the NNLDSKV motif perfectly matched only with a motif of a surface protein (GenBank Accession : XP_028861348.1) of Plasmodium malariae which causes malaria to humans. The NNLDSKV motif was located at aa 449-455 of a membrane-bound surface protein of P. malariae (Fig. 2).
The clinical presentation of COVID-19 very differs from that of other coronaviruses (12). The acquisition of the P. malariae gene in the spike protein seems to explain why. Plasmodium species, which is responsible for malaria, infect liver cells and erythrocytes to promote blood clotting and damages in the heart, liver, or kidney. Unlike other coronaviral infections, it is very interesting to note that symptoms similar to the case of the P. malariae infection were reported in COVID-19. Red cell distribution width (RDW) of erythrocytes has been reported to be enlarged after the malarial invasion because the growth of the malaria parasite causes the cells to be enlarged (13,14). Foy et al., recently reported the observation of the same phenomenon in COVID-19 that elevated RDW is correlated with increased mortality risk in COVID-19 (15). Not only is the enlargement of erythrocytes in COVID-19 correlated with malaria infection, but also clinical presentations such as blood clotting and damages in the heart, liver, or kidney were also observed in COVID-19 (16)(17)(18). Furthermore, the acquisition of the P. malariae gene seems to explain why hydroxychloroquine, an anti-malaria drug for P. malariae infection, shows some e cacy in COVID-19 (18)(19)(20). Considering these facts, the investigation of different kinds of anti-malaria drugs for COVID-19 treatment needs to be pursued.
According to the current outbreak statistic of COVID-19 (https://www.worldometers. info/coronavirus/), a racial background affects the morbidity and mortality of COVID-19. It seems to be that the morbidity and mortality by COVID-19 are much more serious in the Western world than in Eastern countries. Multisystem in ammatory syndrome in children never reported and the morbidity and mortality by COVID-19 are much milder in Eastern countries. It would be very interesting to investigate the role of the NNLDSKV motif among different racial backgrounds including ACE-2 since the genetic polymorphism of ACE-2 is known to differ among different ethnic and racial groups (21). Competing interests: The authors declare no competing interesting.
Data availability: All data are available in the main manuscript and supplementary material.