DOI: https://doi.org/10.21203/rs.3.rs-30406/v1
Background. A novel, human-infecting coronavirus causing COVID-19 was first identified in Wuhan, China in late December 2019. Within a short span of time the virus has accounted for more than 1 million confirmed cases worldwide. This study is designed to address the overall evolutionary process of the complete genomes of the novel coronavirus. To handle the complexity and large population size, network-based approaches are used to map samples to their reported locations.
Results. A total of 473 complete human-coronavirus genomes representing 20 different countries are studied, including samples from 17 states of the United States and samples collected from the Diamond Princess cruise ship. The global-scale phylodynamic network is classified into five clusters, including two clusters of USA samples, U1 and U2. Cluster B is a shared cluster of China and the USA, while clusters A and C are diverse. We found that Chinese samples aggregated in clusters A and B, which aided in retaining a homogeneous viral genomic pool. In contrast, samples from the USA and Spain were split into distinct clusters, indicating multiple ports of entry and possibly implying a delay in quarantine measures. Among the samples from the USA, we found that sequences reported from Washington and Virginia are scattered across clusters, indicating evolutionary diversity.
Conclusion. This report provides insight into the transmission pattern of CoV2, which is difficult to evaluate through conventional surveillance alone. Our data not only identify the transmission network but also suggest that the severity of the disease is linked to the spatial diversity of infection.
A novel, human-infecting coronavirus called SARS-CoV2, which causes COVID-19, was first identified with the use of next-generation sequencing in Wuhan, China in late December 2019 [1]. Contagion in medical workers and family clusters was also reported, confirming human-to-human transmission [2]. Patients with COVID-19 exhibit high fever, sore throat and dyspnoea, with invasive lesions present in both lungs as revealed by chest radiography [2, 3]. Within a period of 4 months the virus has spread to more than 210 countries, becoming an international emergency in which the European Region, Region of the Americas, Western Pacific Region and Eastern Mediterranean Region are the worst affected. As of April 13, 2020, more than 1,773,084 confirmed cases have been reported around the world, with 111,640 fatalities (www.cdc.gov). SARS-CoV2 is an RNA virus and therefore has a high mutation rate, which in turn allows the underlying genealogy connecting sampled viruses to be estimated [4]. SARS-CoV2 shares 96.3% genetic similarity with the bat coronavirus RaTG13, which was obtained from bats in Yunnan in 2013 and is used as an outgroup in recent studies [5]. Identifying the origin and transmission pattern of such a pathogen is imperative to block the means of further spread [6].
Several approaches are being employed to combat the pandemic. Treatments with antiviral drugs, chloroquine, corticosteroids and convalescent plasma transfusion are being tested with limited success [7-12]. Development of a potential vaccine is a time-consuming process, and until then conventional public health procedures, such as isolation, quarantine, community distancing and social containment, can be used to stop the spread of this viral disease [13]. To apply these measures successfully, phylogenetic methods can be employed in clinical studies to investigate pathogen spread within individuals and in communities. Moreover, understanding the global transmission and phylodynamic pattern of CoV2 can assist in tracking undocumented COVID-19 infection sources and tracing the route of infection transmission. New cases are being reported every day, and with them sequencing data are also becoming readily accessible. In our study we included sequence entries from 20 different countries, analyzed and mapped 473 complete CoV2 genomes, and connected them through network-based distances retrieved from whole-genome sequencing.
To understand the spread and evolving dynamics of CoV2, we mapped all the genomes available in the NCBI Virus database (www.ncbi.nlm.nih.gov/labs/virus). A total of 473 complete CoV2 genomes comprising sequence entries from 20 different countries were selected for analysis. Based on available reports, the Bat-CoV genome was used as an outgroup [5]. Our analyses are consistent with other reports, which show that samples from Wuhan (MT291831) and Shenzhen/Hong Kong (MN975262) are closest to the source. The former sample spreads out into two clusters, A and B. Three samples (MN997409-Arizona, MT106054-Texas and MN938384-Hong Kong/Shenzhen) connect it with cluster B and one sample (MT304489-Texas) connects it with cluster A, sharing one and four mutations each (Figure 1). For clarity, we have classified the whole network into five clusters, where the distant clusters U1 and U2 are rich in samples from the USA. Cluster B is mainly a shared cluster of China and the USA, while clusters A and C are diverse.
The center of cluster A is shared by samples from the USA, China and Taiwan, while the Chinese source shares ancestry (two mutations each) with the Colombian (MT256924) and Indian (MT050493) samples. The sample from Taiwan (MN985325) provides the sole outgroup to cluster U1, which densely contains sequences from Washington (WA), USA. Cluster B is heavily centered on the USA and China and provides direct descendants to Vietnam, Israel, India, Pakistan, Italy, Nepal, Australia, Sweden and Korea, sharing one to four mutations. Interestingly, the Swedish sample connects through the Australian node rather than a Chinese one. The second USA cluster, U2, is connected to cluster B by a rather small cluster C that contains European and South American samples from Spain, France and Peru. The French sample of cluster C provides an outgroup to the U2 cluster, which contains sequences from different states of the USA.
Collectively, our global-scale CoV2 spreading dynamics indicate that countries with multiple or different source entries are assisting viral evolution at a rapid pace.
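The mutation counts quoted for links between samples (for example, "one to four mutations") are differences between aligned genomes. As a minimal sketch of how such a count could be obtained, assuming two sequences taken from the same alignment, the function below counts positions where both carry different unambiguous bases. The function name and the toy fragments are illustrative assumptions, not part of the original analysis, and gaps and ambiguity codes are simply skipped here.

```python
def count_mutations(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both sequences carry different unambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment (equal length)")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        # Skip gaps ('-') and ambiguity codes such as 'N' (an assumption of this sketch).
        if a in valid and b in valid and a != b
    )


# Toy aligned fragments (hypothetical, not real CoV2 data).
reference = "ATGGCTACGT-A"
sample = "ATGACTACGTNA"
print(count_mutations(reference, sample))
```

Skipping gaps and ambiguous bases keeps sequencing uncertainty from being counted as a mutation. Whether a gap should count as a difference is a design choice that depends on the network software in use.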
Phylodynamics of the USA
Until April 13, 2020 there were more than 400 sequences from the USA. Here, we have analyzed 355 complete genome samples from the USA reported from seventeen different states, including 24 samples from the cruise ship Diamond Princess, which had 3771 passengers on board, of whom more than 700 were confirmed cases of CoV2 [14]. Since the cruise ship was carrying CoV2-positive patients from Hong Kong, we used the Bat-CoV genome as an outgroup. Interestingly, the cruise samples grouped next to the ancestor; here we call this group the Cruise-cluster. Along with the Cruise-cluster, one sample each from Oregon (OR, MT304487) and Texas (TX, MT276331) stayed close to the ancestor (Figure 2). The OR sample provides a base for one sample each from California (CA) and Georgia (GA) and five from Washington (WA). The central base of the Cruise-cluster is shared with the Arizona sample directly infected from China (discussed above). Overall, the Cruise-cluster shares similarity with the majority of the samples from CA and bifurcates further. The left-side group of WA samples is the group we previously referred to as U1 and is connected to the Cruise-cluster by an arbitrary ancestor, suggesting that the cruise samples are not the direct source for U1. The only valid source left is therefore the sample from Taiwan. A similar case can be observed in the right cluster, where the Cruise-cluster does not provide an actual ancestral link.
Phylodynamics has previously been described as the combination of immunodynamics, epidemiology and evolutionary biology used to understand how infectious diseases are transmitted and evolve [15]. A variety of evolutionary models assume a tree to facilitate the testing and discussion of hypotheses. However, as population size increases, more complex evolutionary scenarios are poorly described by such models [16]. Such limitations have led to the development of a number of different types of phylogenetic networks. To estimate the evolutionary frequency of the available human CoV2 genomes and map them onto geographical locations, we present our analyses through a median-joining network.
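To give a feel for how a network can be built from pairwise sequence differences, the sketch below constructs a minimum spanning tree over Hamming distances with Prim's algorithm. This is deliberately a simpler relative of the median-joining method, not the method itself: median joining additionally infers unsampled intermediate (median) sequences, which this sketch does not attempt. All names and the toy sequences are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs: dict) -> list:
    """Prim's algorithm; returns edges as (name_in_tree, name_added, distance)."""
    names = list(seqs)
    if not names:
        return []
    in_tree = {names[0]}  # start from the first sequence, e.g. the outgroup
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge joining a node inside the tree to one outside it;
        # ties are broken by name, which keeps the result deterministic.
        dist, u, v = min(
            (hamming(seqs[u], seqs[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


# Toy aligned sequences (hypothetical, not real CoV2 data).
toy = {"root": "AAAAAA", "s1": "AAAAAT", "s2": "AAAATT", "s3": "CAAAAA"}
for u, v, dist in minimum_spanning_tree(toy):
    print(f"{u} -> {v}: {dist} mutation(s)")
```

In this toy case every sample attaches to its nearest neighbour with a single mutation. With real genomes, ties between equally distant candidates are common, which is one reason network methods that keep alternative links are preferred over a single tree.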
Analyzing the global-scale evolution and spread of human CoV2, we noticed the presence of Chinese samples only in clusters A and B, highlighting the tight quarantine practices applied to Chinese citizens, which proved efficient in retaining a homogeneous viral genomic pool. On the other hand, samples from the USA were split into distinct clusters, indicating multiple ports of entry for the virus and implying a delay in quarantine measures. Although the USA had restrictions in place on all traffic coming from China, such measures were not applied to traffic coming from the rest of the world; hence the virus was not contained as efficiently as it was in China. A similar phenomenon was observed in the Spanish samples, which are located in three different clusters (A, B and C) and share ancestors from Taiwan, China, the USA and Israel separately. In contrast, the genomes reported from the USA population indicate that the passengers from the cruise ship Diamond Princess were efficiently quarantined and treated and are not the major source of the spread of infection in the USA. The clustering of the cruise samples near the ancestral node is explained by two main reasons. Firstly, the passengers were carrying the virus from the epicenter, China, and secondly, they remained isolated on board, which restricted viral evolution. Notably, sequences from WA and VA show diversity and are scattered across almost every cluster. Overall, our data emphasize that the CoV2 spread is higher in the USA than in the rest of the affected countries because of heterogeneity in the viral pool. In addition, the US government needs to take strict measures to keep the viral spread limited to the source by restricting the free movement of citizens.
All the sequences used in this study were retrieved from the NIH NCBI Virus database (http://www.ncbi.nlm.nih.gov/labs/virus). Entries with incomplete genomes were removed, leaving a final dataset of 473 sequences, containing 355-USA, 71-China, 18-Spain, 04-Korea, 03-Taiwan, 02 sequences each from India, Pakistan, Vietnam, Nepal, Israel and Iran, and 01 sequence each from Australia, Finland, France, Peru, Brazil, Japan, Sweden and Colombia. Prior to performing an alignment with the MAFFT online server [17], non-genomic characters were removed. Including the bat coronavirus sequence, alignment was performed with a strategy mode that kept the 1PAM/k=2 substitution matrix and a gap-penalty score of 1.53. The alignment file was manually adjusted by removing the first 30-40 nucleotides at the 5′ end and the poly-A sequences at the 3′ end. The aligned dataset was transferred to MEGA for generation of a time tree, for which the neighbor-joining approach was used [18]. The DnaSP6 package was used for data format conversion [19]. PopART version 1.7 was used to convert all the time trees into median-joining networks using an epsilon value of “0”, and the final networks were drawn with an iteration value of 5000 [20, 21]. For graphical manipulations, the paint.net package was used. To reproduce these data, the alignment file of all 473 genomes can be accessed from the supplementary section.
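The two clean-up steps described above (removing non-genomic characters before alignment, then trimming the 5′ and 3′ ends of the alignment) were done through the online server and by hand. The sketch below shows one way they could be scripted; it is an illustrative assumption rather than the procedure actually used. The function names, the default of 40 leading columns, and the rule of cutting trailing columns that contain only 'A' or '-' are all hypothetical choices.

```python
import re

# Anything that is not an IUPAC nucleotide code is treated as non-genomic.
NON_NUCLEOTIDE = re.compile(r"[^ACGTURYSWKMBDHVN]")


def clean_sequence(raw: str) -> str:
    """Upper-case a raw (unaligned) sequence and drop non-nucleotide characters."""
    return NON_NUCLEOTIDE.sub("", raw.upper())


def trim_alignment(aligned: dict, five_prime: int = 40) -> dict:
    """Cut `five_prime` leading columns, then trailing columns holding only 'A' or '-'."""
    lengths = {len(s) for s in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have equal length")
    end = lengths.pop()
    # Walk back from the 3' end while the column looks like poly-A tail or padding.
    while end > five_prime and all(s[end - 1] in "A-" for s in aligned.values()):
        end -= 1
    return {name: s[five_prime:end] for name, s in aligned.items()}


# Toy data (hypothetical, not real CoV2 sequences).
print(clean_sequence("acg t\n12N-x"))
print(trim_alignment({"a": "GGACGTAA", "b": "GGACCT-A"}, five_prime=2))
```

Trimming whole columns rather than individual sequences keeps every sequence the same length, which downstream tree and network tools require. Note that the column rule would also remove genuine terminal adenines shared by all sequences, so a manual check of the trimmed ends remains advisable.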
Acknowledgments
The authors acknowledge all the researchers who are working hard on the collection, sequencing and data deposition of CoV2 samples, and thank all the medical professionals who are working on the front line fighting COVID-19.
Funding
No internal or external funding sources supported this work.
Conflict of interest
The authors declare no conflict of interest.
Author Contribution
All authors have read and approved the manuscript. AA and SM: design, analysis and writing. FN: writing and proofreading.
Availability of data and materials
All the data is submitted.