Worldwide SARS-COV-2 haplotype distribution


Abstract
The world is experiencing one of the most severe viral outbreaks of recent years: the pandemic caused by SARS-COV-2, the causative agent of COVID-19. The virus has reached over 120 countries, with a total of 6.5 million infections and 320000 deaths. A deeper understanding of its genomic diversity is essential.
We analyzed 21296 reported SARS-COV-2 sequences, defining the existence of recurrent haplotypes and their specific geographical distribution.

Main Text
In the three months since the declaration of the SARS-COV-2 pandemic (1), the scientific community has been struggling to understand the complexity of this novel coronavirus of still-debated zoonotic origin (2), its clinical symptoms (3,4), the risk factors and the potential treatments, in an urgent effort to contain the infection, predict potentially serious disease outcomes, and find a cure. Although the circulation of SARS-COV-2 before the pandemic declaration has been ascertained, little is known about its worldwide spread and the genetic changes that gave rise to different viral strains.
The first case of COVID-19 was reported in Wuhan, China, in December 2019. Despite the relatively low mortality (2% on average) and the high percentage of asymptomatic or pauci-symptomatic subjects (over 80%), the outbreak has caused a dramatic collapse of the health care system in the hardest-hit areas, such as Northern Italy.
In the urgent race to find efficient drugs and reduce complications, a deeper understanding of the genomic diversity of this virus is crucial. Indeed, the existence of different strains and their temporal and geographical distribution can provide relevant information on how the virus spread all over the world, on the possible acquisition of selective advantages, and on the most conserved sequences suitable for vaccine design.
Since the release of the first complete genomic sequence of SARS-COV-2 on January 5th, 2020 (1, 2), thousands of additional sequences have been deposited. Different virus strains have been reported (5) but, to the best of our knowledge, this manuscript describes the first analysis of all the genomic sequences reported so far. The variant call format (VCF) files, containing 21296 sequences of SARS-COV-2 at the date of manuscript preparation, were downloaded from http://covseq.baidu.com/ (6).
The geographical distribution of the available sequences is reported in Figure S1. The samples were collected from December 2019 to April 2020, at different times across different countries (Figure S2). The first sequence was reported in Asia (Wuhan, December 2019), where the outbreak originated. It is worth noting that a few sequences were obtained early (January 2020) in the United States (Supplementary Table S1).
Using the first reported sequence of SARS-COV-2 (RefSeq NC_045512.2) as reference, we described the geographical distribution of the variants reported in the merged VCF file of all available sequences.
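As an illustration of this step, the sketch below shows one way to turn a merged, multi-sample VCF into the set of variants carried by each sequence. It is a minimal example under stated assumptions, not the pipeline used for this analysis: the function name `variants_per_sample`, the sample names and the toy records are hypothetical, and genotypes are assumed to be haploid calls ("0" for the reference allele, "." for missing data, k for the k-th alternative allele).

```python
def variants_per_sample(vcf_text):
    """Map each sample name to the set of (POS, REF, ALT) variants it carries.

    Assumes haploid GT calls; any other genotype string is ignored.
    """
    samples = []
    carried = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information lines and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            carried = {name: set() for name in samples}
            continue
        pos, ref, alts = int(fields[1]), fields[3], fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        for name, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            if gt.isdigit() and int(gt) > 0:
                carried[name].add((pos, ref, alts[int(gt) - 1]))
    return carried


# Toy merged VCF, invented for illustration only (positions are arbitrary).
TOY_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "seqA", "seqB", "seqC"]),
    "\t".join(["NC_045512.2", "100", ".", "C", "T", ".", "PASS", ".",
               "GT", "0", "1", "1"]),
    "\t".join(["NC_045512.2", "200", ".", "A", "G", ".", "PASS", ".",
               "GT", "0", "1", "."]),
])

result = variants_per_sample(TOY_VCF)
```

In this toy input, `seqA` carries no variant and would therefore count as identical to the reference, while `seqB` carries both variants.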
China is the country with the highest percentage of unmutated viral genomes (13.8%), followed by the USA (2.3%) (Table 1). In addition, from the date of the first collected sample with a viral sequence identical to the reference, we note that the USA, Northern Europe, Australia and South Africa were involved in the spread of the pandemic immediately after China (Figure 1).
To further explore the geographical occurrence and the timing of virus circulation, and to track its mutational evolution, we analyzed the minimum number of variants per sequence in each country (Figure S3). This approach allowed us to define how some countries severely hit by COVID-19, such as Italy, Spain and Brazil, are characterized by the occurrence and spread of already mutated forms of the virus. The number of variants per sequence ranges mainly from 0 to 12; the majority of sequences carried 4 to 5 variants, and only a few (63) had more than 12 (Figure S4a). A plausible hypothesis is that, whereas a few variants might have provided favorable features, such as improved infectivity, a higher number of variants did not result in a selective advantage.
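The two summaries used here, the distribution of the number of variants per sequence and the minimum number of variants observed in each country, can be sketched as follows. This is only an illustrative example: the function name `variant_count_summary`, the variant labels and the toy records are hypothetical and do not reproduce the data of this study.

```python
from collections import Counter


def variant_count_summary(sequences):
    """Summarise variant counts for (country, set_of_variants) records.

    Returns a histogram {number of variants: number of sequences} and the
    minimum number of variants observed in each country.
    """
    histogram = Counter()
    min_per_country = {}
    for country, variants in sequences:
        n = len(variants)
        histogram[n] += 1
        # Keep the smallest count seen so far for this country.
        min_per_country[country] = min(n, min_per_country.get(country, n))
    return histogram, min_per_country


# Toy records, invented for illustration only.
toy = [
    ("China", set()),
    ("China", {"v1"}),
    ("Italy", {"v1", "v2", "v3"}),
    ("Italy", {"v1", "v2", "v3", "v4"}),
]

histogram, min_per_country = variant_count_summary(toy)
```

A minimum of 0 for a country means that at least one sequence identical to the reference was collected there, whereas a minimum above 0, as for "Italy" in the toy data, indicates that only already mutated forms were sampled.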
A total of 7197 variants were identified, the majority of which are missense (60%) or synonymous (32%) (Figure S4b). As expected, putative loss-of-function mutations (nonsense, deletion or insertion) were not found among the most frequent variants. Moreover, we observed 7733 different haplotypes, classified the 20 most frequent according to the number of cases (Table 3), and evaluated their frequencies, geographical distribution, temporal occurrence and potential connections (Figures S6-S9). Of these haplotypes, 5161 are unique, whereas the remaining ones occurred in more than one patient.
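A haplotype can be treated as the exact set of variants carried by one sequence, so counting haplotypes amounts to counting identical sets. The sketch below illustrates this under that assumption; the function name `haplotype_table` and the toy variant labels are hypothetical, and whether the unmutated reference is counted as a haplotype is a choice made here for simplicity. With the figures reported above, 7733 - 5161 = 2572 haplotypes occurred more than once.

```python
from collections import Counter


def haplotype_table(variant_sets):
    """Count how many sequences share each exact combination of variants.

    Returns the haplotype counts, the number of haplotypes seen once, and
    the number of haplotypes seen in more than one sequence.
    """
    counts = Counter(frozenset(variants) for variants in variant_sets)
    n_unique = sum(1 for n in counts.values() if n == 1)
    n_recurrent = len(counts) - n_unique
    return counts, n_unique, n_recurrent


# Toy data, invented for illustration only; the empty set stands for a
# sequence identical to the reference.
toy_sets = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"e"},
    set(),
]

counts, n_unique, n_recurrent = haplotype_table(toy_sets)
top_haplotype, top_cases = counts.most_common(1)[0]
```

Calling `counts.most_common(20)` would give a ranking analogous to the 20 most frequent haplotypes, ordered by number of cases.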
Tracing mutations in the viral genome is crucial to evaluate their potential functional consequences and to develop an efficient vaccine. Indeed, the more evolutionarily conserved regions should be the preferential targets for vaccine production. Our analysis shows that, despite the high number of variants, the regions coding for the polyprotein ORF1ab, the spike (S) protein and the membrane (M) protein are the most conserved (Figure S4c). This is not surprising, since the most evolutionarily conserved regions usually encode essential proteins.
However, we identified the missense variant p.Asp614Gly in the gene encoding the spike protein as the most frequent variant (found in 13451 sequences, 63% of the total) (Table 2). This variant was previously reported in smaller cohorts of samples (7,8), and preliminary studies suggest that it might improve the binding affinity of the S protein for the human ACE2 receptor, reported as the main entry site of the virus into human cells (9), through the cleavage of the S1/S2 domain (10). Indeed, it was previously shown that SARS-CoV infection can be enhanced by exogenous proteases (11). In addition, a very recent report (12) indicates a positive correlation between the frequency of this variant and a higher case-fatality rate, although further studies are needed to demonstrate a causal role of this variant in more severe disease outcomes. Moreover, it is worth noting that we found p.Asp614Gly as a single mutation in only two patients, suggesting that this variant alone might not provide a selective advantage to the virus. Indeed, in the large majority of cases p.Asp614Gly occurs together with other variants, in particular with two variants located in the ORF1ab gene, the missense mutation p.Pro4715Leu and c.-25C>T in the 5' untranslated region, all characterized by a similar frequency.
These three variants are indeed in strong linkage and constitute the most common haplotype, identified in 1523 sequences (haplotype 1, Table 3). This haplotype, first reported in Northern Italy (Lombardia) in February and representing 40% of Italian cases, has mainly spread to the rest of the European countries (1059 cases) and to North America (213 cases). Moreover, this triplet of variants represents the "common core" of 4400 different haplotypes, affecting 12949 patients (61% of the total), the most frequent of which are reported in Table 3. Interestingly, 90% of the sequences reported in Italy (73 out of 81) carry these three variants (data not shown).
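The "common core" count can be read as a subset test: a haplotype belongs to the group if it contains all three variants, possibly together with others. The sketch below illustrates this reading and should be taken as an assumption about how such a count could be made, not as the procedure actually used; the function name `core_carriers`, the variant labels and the toy counts are hypothetical. Here a haplotype made of the three variants alone is also counted as containing the core.

```python
from collections import Counter

# The three linked variants named in the text, written as illustrative labels.
CORE = frozenset({"S:p.Asp614Gly", "ORF1ab:p.Pro4715Leu", "ORF1ab:c.-25C>T"})


def core_carriers(haplotype_counts, core):
    """Count haplotypes containing every core variant, and their sequences."""
    core = frozenset(core)
    n_haplotypes = 0
    n_sequences = 0
    for haplotype, n in haplotype_counts.items():
        if core <= haplotype:  # subset test: all core variants are present
            n_haplotypes += 1
            n_sequences += n
    return n_haplotypes, n_sequences


# Toy haplotype counts, invented for illustration only.
toy_counts = Counter({
    CORE: 5,                              # the three core variants alone
    CORE | {"x"}: 2,                      # core plus one extra variant
    frozenset({"S:p.Asp614Gly"}): 1,      # one core variant only
    frozenset({"y"}): 3,                  # unrelated haplotype
})

n_haplotypes, n_sequences = core_carriers(toy_counts, CORE)
share = 100.0 * n_sequences / sum(toy_counts.values())
```

In the toy data two haplotypes contain the core and together account for 7 of 11 sequences; the sequence carrying p.Asp614Gly alone is excluded, mirroring the distinction made above between the single variant and the triplet.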
A deeper evaluation of the haplotypes sharing the same variants and their geographical occurrence revealed the existence of specific clusters, likely reflecting a temporal and spatial spread of virus strains.
Whereas haplotype 1 gave rise to different haplotypes that are more common in North America (haplotypes 2, 8, 13) or in Europe (haplotype 4) (Figure S5), others seem to be peculiar to specific areas, such as haplotype 3, first reported in the USA, mainly present in North America, and characterized by 4 completely different variants (Table 3).
It would be extremely interesting and relevant to associate the different haplotypes with specific disease outcomes, and to understand whether the acquired mutations have functional consequences in terms of infectivity, clinical severity, and potential responsiveness to specific treatments.

Figure 1. Geographical distribution of the reference sequence based on the date of the first reported unmutated sequence. Note: The designations employed and the presentation of the material on this map do not imply the expression of any opinion whatsoever on the part of Research Square concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. This map has been provided by the authors.

Supplementary Files
This is a list of supplementary files associated with this preprint. Supplementaryinfo.docx