In the three months since the declaration of the SARS-CoV-2 pandemic (1), the scientific community has struggled to understand the complexity of this novel coronavirus, whose zoonotic origin is still debated (2), along with its clinical symptoms (3, 4), risk factors, and potential treatments, in an urgent effort to contain the infection, predict potentially serious disease outcomes, and find a cure. Although the circulation of SARS-CoV-2 before the pandemic declaration has been ascertained, little is known about its worldwide spread and the genetic changes that gave rise to different viral strains.
The first case of COVID-19 was reported in Wuhan, China, in December 2019. Despite the relatively low mortality (2% on average) and the high percentage of asymptomatic or paucisymptomatic subjects (over 80%), the outbreak has caused a dramatic collapse of the health care system in the hardest-hit areas, such as Northern Italy.
In the urgent race to find effective drugs and reduce complications, a deeper understanding of the genomic diversity of this virus is crucial. Indeed, the existence of different strains and their temporal and geographical distribution can provide relevant information on how the virus spread across the world, the possible acquisition of selective advantages, and the most conserved sequences suitable for vaccine design.
Since the release of the first complete genomic sequence of SARS-CoV-2 on January 5th 2020 (1, 2), thousands of additional sequences have been deposited. Different virus strains have been reported (5) but, to the best of our knowledge, this manuscript describes the first analysis of all the genomic sequences reported so far. The variant call format (VCF) file, containing 21296 sequences of SARS-CoV-2 at the date of manuscript preparation, was downloaded from http://covseq.baidu.com/ (6).
The geographical distribution of the available sequences is reported in Figure S1. The samples were collected from December 2019 to April 2020 at different times across different countries (Figure S2). The first sequence was reported in Asia (Wuhan, December 2019), where the virus outbreak originated. It is worth noting that a few sequences were obtained early (January 2020) in the United States (Supplementary Table S1).
Using the first reported sequence of SARS-CoV-2 (RefSeq NC_045512.2) as reference, we described the geographical distribution of the variants reported in the merged VCF file of all available sequences.
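As an illustration of how variants can be read per sequence from a merged VCF, the following is a minimal sketch and not the pipeline used in this study. It assumes one haploid genotype column per sequence, with "0" meaning the reference allele; the sample names, positions and alleles in the toy records are invented for the example.

```python
# Toy merged VCF: hypothetical records, NOT real data.
# Assumption: one haploid genotype column per sequence ("0" = reference allele).
TOY_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tseqA\tseqB\tseqC
NC_045512.2\t100\t.\tC\tT\t.\t.\t.\tGT\t1\t0\t1
NC_045512.2\t200\t.\tC\tT\t.\t.\t.\tGT\t1\t0\t1
NC_045512.2\t300\t.\tA\tG\t.\t.\t.\tGT\t1\t0\t0
"""


def variants_per_sample(vcf_text):
    """Return {sample name: list of variants carried} from VCF text."""
    samples = []
    carried = {}
    for line in vcf_text.splitlines():
        # Skip meta-information lines and blank lines.
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the tenth column of the header line.
            samples = fields[9:]
            carried = {sample: [] for sample in samples}
            continue
        pos, ref, alt = fields[1], fields[3], fields[4]
        for sample, genotype in zip(samples, fields[9:]):
            allele = genotype.split(":")[0]
            # Anything other than reference ("0") or missing (".") counts as a variant.
            if allele not in ("0", "."):
                carried[sample].append(f"{pos}{ref}>{alt}")
    return carried


per_sample = variants_per_sample(TOY_VCF)
# Sequences with an empty list are identical to the reference at all listed sites.
unmutated = [name for name, variants in per_sample.items() if not variants]
```

Grouping the resulting per-sequence lists by country of collection would then give the geographical distribution of variants and the share of sequences identical to the reference.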
China is the country with the highest percentage of unmutated viral genomes (13.8%), followed by the USA (2.3%) (Table 1). In addition, the dates of the first collected samples with a viral sequence identical to the reference indicate that the USA, Northern Europe, Australia and South Africa were involved in the spread of the pandemic immediately after China (Figure 1).
To further explore the geographical occurrence and timing of virus circulation and to track its mutational evolution, we analyzed the minimum number of variants per sequence in each country (Figure S3). This approach showed that some countries severely hit by COVID-19, such as Italy, Spain and Brazil, are characterized by the occurrence and spread of already mutated forms of the virus. The number of variants per sequence ranges mainly from 0 to 12; the majority of sequences carried 4 or 5 variants, and only a few (63) had more than 12 (Figure S4a). A plausible hypothesis is that, whereas a few variants might have provided favorable features, such as improved infectivity, a higher number of variants did not result in a selective advantage.
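The two summaries described above (the minimum number of variants per sequence in each country and the overall distribution of variants per sequence) can be sketched as follows. This is an illustrative example under stated assumptions: the country labels and counts are invented, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (country label, variants in one sequence) pairs; NOT real data.
records = [
    ("A", 0), ("A", 2), ("B", 0), ("B", 5),
    ("C", 3), ("C", 4), ("D", 4), ("D", 13),
]


def min_variants_by_country(rows):
    """Smallest number of variants seen in any sequence from each country."""
    best = {}
    for country, n_variants in rows:
        best[country] = min(n_variants, best.get(country, n_variants))
    return best


def variant_count_histogram(rows):
    """How many sequences carry 0, 1, 2, ... variants."""
    return Counter(n_variants for _, n_variants in rows)


minimums = min_variants_by_country(records)
histogram = variant_count_histogram(records)

# Countries where no sequence is identical to the reference (minimum > 0).
already_mutated = sorted(c for c, m in minimums.items() if m > 0)
# Number of sequences carrying more than 12 variants.
above_12 = sum(count for n, count in histogram.items() if n > 12)
```

A country whose minimum is above zero has no sampled sequence identical to the reference, which is the pattern interpreted here as the arrival of already mutated forms of the virus.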
A total of 7197 variants were identified, the majority of which are missense (60%) or synonymous (32%) (Figure S4b). As expected, putative loss-of-function mutations (nonsense, deletion or insertion) were not found among the most frequent variants. Moreover, we observed 7733 different haplotypes, classified the 20 most frequent according to the number of cases (Table 3), and evaluated their frequencies, geographical distribution, temporal occurrence, and potential connections (Figures S6-S9). Of these, 5161 haplotypes are unique, whereas the remaining ones occurred in more than one patient.
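The haplotype tally can be illustrated with a short sketch in which a haplotype is taken to be the exact set of variants carried by one sequence. This definition, the variant labels and the toy data are assumptions made for the example, not a description of the procedure used here; in particular, treating reference-identical sequences as a haplotype of their own is an arbitrary choice.

```python
from collections import Counter

# Hypothetical variant sets, one per sequence; NOT real data.
sequences = {
    "seq1": {"v1", "v2", "v3"},
    "seq2": {"v1", "v2", "v3"},
    "seq3": {"v1", "v2", "v3", "v4"},
    "seq4": {"v5"},
    "seq5": set(),  # identical to the reference (assumed to count as a haplotype)
}


def haplotype_counts(variant_sets):
    """Count sequences per haplotype, where a haplotype is the exact variant set."""
    # frozenset is hashable, so identical variant sets collapse to one key.
    return Counter(frozenset(variants) for variants in variant_sets.values())


counts = haplotype_counts(sequences)
n_haplotypes = len(counts)
# "Unique" haplotypes are those observed in exactly one sequence.
n_unique = sum(1 for n in counts.values() if n == 1)
# Most frequent haplotypes first; most_common(20) would give a top-20 ranking.
top = counts.most_common(1)
```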
Tracing mutations in the viral genome is crucial both to evaluate potential functional consequences and to obtain an effective vaccine. Indeed, the most evolutionarily conserved regions should be the preferred targets for vaccine production. Our analysis shows that, despite the high number of variants, the regions coding for the polyprotein ORF1ab, the spike (S) and the membrane (M) proteins are the most conserved (Figure S4c). This is not surprising, since the most evolutionarily conserved regions usually encode essential proteins. However, we identified the missense variant p.Asp614Gly in the gene encoding the spike protein as the most frequent (found in 13451 sequences, 63% of the total) (Table 2). This variant was previously reported in smaller cohorts of samples (7, 8), and preliminary studies suggest that it might improve the binding affinity of the S protein to the human ACE2 receptor, reported as the main entry site of the virus into human cells (9), through the cleavage of the S1/S2 domain (10). Indeed, it was previously shown that SARS-CoV infection can be enhanced by exogenous proteases (11). In addition, a very recent report (12) indicates a positive correlation between the frequency of this variant and a higher case-fatality rate, although further studies are needed to demonstrate a causal role of this variant in more severe disease outcomes. Moreover, it is noteworthy that we found the p.Asp614Gly variant as a single mutation in only two patients, suggesting that this variant alone might not provide a selective advantage to the virus. Indeed, in the large majority of cases p.Asp614Gly co-occurs with other variants, in particular with two variants located in the ORF1ab gene, the missense mutation p.Pro4715Leu and c.-25C>T in the 5’ untranslated region, all of which have similar frequencies.
These three variants are indeed in strong linkage and constitute the most common haplotype, identified in 1523 sequences (haplotype 1, Table 3). This haplotype, first reported in Northern Italy (Lombardia) in February and representing 40% of Italian cases, has mainly spread to the rest of Europe (1059 cases) and to North America (213 cases). Moreover, this triplet of variants represents the “common core” of 4400 different haplotypes, affecting 12949 patients (61% of the total), the most frequent of which are reported in Table 3. Interestingly, 90% of the sequences reported in Italy (73 out of 81) carry these three variants (data not shown).
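The "common core" tally can be expressed as a subset test: a haplotype shares the core if it contains all three variants, whatever else it carries. The sketch below illustrates the idea under stated assumptions; the label format for the variants, the additional variant names and all counts are hypothetical.

```python
# The three core variants named in the text; the "gene:change" label format
# is an assumption made for this example.
CORE = frozenset({"S:p.Asp614Gly", "ORF1ab:p.Pro4715Leu", "ORF1ab:c.-25C>T"})

# Hypothetical haplotype -> number of sequences; NOT real data.
haplotype_counts = {
    CORE: 5,
    CORE | {"hypothetical_variant_A"}: 3,
    CORE | {"hypothetical_variant_B", "hypothetical_variant_C"}: 1,
    frozenset({"S:p.Asp614Gly"}): 2,  # spike variant alone, without the core
    frozenset({"hypothetical_variant_D"}): 4,
}


def core_summary(counts, core):
    """Return (haplotypes sharing the core, sequences covered, fraction of all sequences)."""
    # `core <= haplotype` is True when every core variant is in the haplotype.
    sharing = {hap: n for hap, n in counts.items() if core <= hap}
    total = sum(counts.values())
    covered = sum(sharing.values())
    return len(sharing), covered, covered / total


n_haplotypes, n_sequences, fraction = core_summary(haplotype_counts, CORE)
```

In this toy input, three of the five haplotypes contain the core and together they cover 9 of 15 sequences.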
A deeper evaluation of the haplotypes sharing the same variants and of their geographical occurrence revealed the existence of specific clusters, likely reflecting a temporal and spatial spread of virus strains. Whereas haplotype 1 gave rise to different haplotypes more common in North America (haplotypes 2, 8, 13) or in Europe (haplotype 4) (Figure S5), others seem to be specific to particular areas, such as haplotype 3, which was first reported in the USA, is mainly present in North America, and is characterized by 4 completely different variants (Table 3).
It would be extremely interesting and relevant to associate the different haplotypes with specific disease outcomes, and to understand whether the acquired mutations have functional consequences in terms of infectivity, clinical severity, and potential responsiveness to specific treatments.