Evolutionary Frequency of the Available Human Coronavirus Genomes

DOI: https://doi.org/10.21203/rs.3.rs-30406/v1

Abstract

Background. A novel, human-infecting coronavirus causing COVID-19 was first identified in Wuhan, China in late December, 2019. Within a short span of time the virus has been linked to more than 1 million deaths worldwide. This study is designed to address the overall evolutionary process of the novel coronavirus using complete genomes. To address the complexity and large population size, network-based approaches are used to map samples to their reported locations.

Results. A total of 473 complete human coronavirus genomes representing 20 different countries were studied, including samples from 17 states of the United States and samples collected from the Diamond Princess cruise ship. The global-scale phylodynamic network is classified into five clusters, including two clusters, U1 and U2, of USA samples. Cluster B is a shared cluster of China and the USA, while A and C are of a diverse nature. We found that Chinese samples aggregated in clusters A and B, which aided in retaining a homogeneous viral genomic pool. In contrast, samples from the USA and Spain were split into distinct clusters, indicating multiple ports of entry and possibly implying a delay in quarantine measures. Among the samples from the USA, we found that sequences reported from Washington and Virginia are scattered, indicating evolutionary diversity.

Conclusion. This report provides insight into the transmission pattern of CoV2, which is difficult to evaluate exclusively through conventional surveillance means. Our data not only identify the transmission network but also suggest that the severity of the disease is linked to the spatial diversity of infection.

Background

A novel, human-infecting coronavirus called SARS-CoV2, which causes COVID-19, was first identified with the use of next-generation sequencing in Wuhan, China in late December, 2019 [1]. Contagion among medical workers and within family clusters was also reported, confirming human-to-human transmission [2]. Patients with COVID-19 exhibit high fever, sore throat and dyspnoea, with invasive lesions present in both lungs as revealed by chest radiography [2, 3]. Within a period of 4 months the virus has spread to more than 210 countries and become an international emergency, with the European Region, Region of the Americas, Western Pacific Region and Eastern Mediterranean Region being the worst affected. As of April 13, 2020, more than 1,773,084 confirmed cases have been reported around the world, with 111,640 fatalities (www.cdc.gov). SARS-CoV2 is an RNA virus and therefore has a high mutation rate, which in turn allows the underlying genealogy connecting sampled viruses to be estimated [4]. SARS-CoV2 shares 96.3% genetic similarity with the bat coronavirus RaTG13, which was obtained from bats in Yunnan in 2013 and is used as an outgroup in recent studies [5]. Identifying the origin and transmission pattern of such a pathogen is imperative to block the means of further spread [6].

Several approaches are being employed to combat the pandemic. Treatments with antiviral drugs, chloroquine, corticosteroids and convalescent plasma transfusion are being tested with limited success [7-12]. Development of a potential vaccine is a time-consuming process, and until then conventional public health procedures, such as isolation, quarantine, community distancing and social containment, can be used to stop the spread of this viral disease [13]. To employ this tactic successfully, phylogenetic methods can be used in clinical studies to investigate pathogen spread within individuals and in communities. Moreover, understanding the global transmission and phylodynamic pattern of SARS-CoV2 can assist in tracking undocumented COVID-19 infection sources and tracing the route of infection transmission. New cases are being reported every day, and with them sequencing data is becoming readily accessible. In our study we included sequence entries from 20 different countries, analyzed and mapped 473 complete CoV2 genomes, and connected them through network-based distances retrieved from whole-genome sequencing.

Results

To understand the spread and evolving dynamics of CoV2, we mapped all the genomes available in the NCBI Virus database (www.ncbi.nlm.nih.gov/labs/virus). A total of 473 complete CoV2 genomes comprising sequence entries from 20 different countries were selected for analysis. Based on available reports, the Bat-CoV genome was used as an outgroup [5]. Our analyses are consistent with other reports, which show that samples from Wuhan (MT291831) and Shenzhen/Hong Kong (MN975262) are closest to the source. The former sample spread out into two clusters, A and B, engaging three samples (MN997409-Arizona, MT106054-Texas and MN938384-Hong Kong/Shenzhen) to connect with cluster B and one sample, MT304489-Texas, to connect with cluster A, sharing one and four mutations each (Figure 1). For clarity, we have classified the whole network into five clusters, where the distant U1 and U2 are rich in samples from the USA. Cluster B is mainly a shared cluster of China and the USA, while A and C are diverse.
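The mutation counts quoted for links between samples can be illustrated with a minimal sketch, assuming that a "mutation" is an aligned position at which two sequences carry different unambiguous bases. The function name and the toy fragments are hypothetical and are not taken from the dataset; network software may count differences in a slightly different way.

```python
def count_mutations(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both sequences have A/C/G/T and differ.

    Gaps ('-') and ambiguous bases such as 'N' are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    bases = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in bases and b in bases and a != b
    )


# Toy aligned fragments (hypothetical, not real genome data)
reference = "ATGGCTAACG-TTAGC"
sample = "ATGGTTAACGATTNGC"
print(count_mutations(reference, sample))
```

Here the gap and the ambiguous base are ignored, so only the single C/T substitution is counted.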

The center of cluster A is shared by samples from the USA, China and Taiwan, while the Chinese source shares ancestry (two mutations each) with the Colombian (MT256924) and Indian (MT050493) samples. The sample from Taiwan (MN985325) provides the sole outgroup to cluster U1, which densely contains sequences from Washington (WA), USA. Cluster B is heavily centered on the USA and China and provides direct descendants to Vietnam, Israel, India, Pakistan, Italy, Nepal, Australia, Sweden and Korea, sharing one to four mutations. Interestingly, the Swedish sample connects through the Australian node rather than a Chinese one. The second cluster of the USA, U2, is connected to cluster B by a rather small cluster C that contains European and South American samples from Spain, France and Peru. The French sample of cluster C provides an outgroup to the U2 cluster, which contains sequences from different states of the USA.

Collectively, our global-scale analysis of CoV2 spreading dynamics indicates that countries with multiple or different source entries are assisting viral evolution at a rapid pace.

Phylodynamics of the USA

As of April 13, 2020, more than 400 sequences from the USA were available. Here, we analyzed 355 complete genome samples from the USA reported from seventeen different states, including 24 samples from the cruise ship Diamond Princess, which had 3771 passengers on board, of whom more than 700 were confirmed cases of CoV2 [14]. Since the cruise ship was carrying CoV2-positive patients from Hong Kong, we used the Bat-CoV genome as an outgroup. Interestingly, the cruise samples grouped next to the ancestor; here we call this group the Cruise-cluster. Along with the Cruise-cluster, one sample each from Oregon (OR, MT304487) and Texas (TX, MT276331) stayed closer to the ancestor (Figure 2). The OR sample provides a base for one sample each from California (CA) and Georgia (GA) and five from Washington (WA). The central base of the Cruise-cluster is shared with the Arizona sample directly infected from China (discussed above). Overall, the Cruise-cluster shares similarity with the majority of the samples from CA and bifurcates further. The left-side group of WA samples is the same group we previously referred to as U1 and is connected to the Cruise-cluster by an arbitrary ancestor, suggesting that the cruise samples are not the direct source for U1. Ultimately, the only valid source left is Taiwan. A similar case can be observed in the right cluster, where the Cruise-cluster does not provide an actual ancestral link.

Discussion

The term phylodynamics has previously been used to describe the combination of immunodynamics, epidemiology and evolutionary biology to understand how infectious diseases are transmitted and evolve [15]. A variety of evolutionary models assume a tree to facilitate the testing and discussion of hypotheses. However, as population size increases, more complex evolutionary scenarios are poorly described by such models [16]. Such limitations have led to the development of a number of different types of phylogenetic networks. To estimate the evolutionary frequency of the available human CoV2 genomes and map them onto geographical locations, we present our analyses through a median-joining network.
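As an illustration of what frequency means in a haplotype network, the following minimal sketch collapses identical aligned sequences into haplotypes and counts how many samples share each one. It assumes that identical sequences form one node, which is how haplotype network tools generally treat them; the function name and toy data are hypothetical.

```python
from collections import defaultdict


def group_haplotypes(aligned: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct aligned sequence (haplotype) to the sample IDs sharing it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for sample_id, seq in aligned.items():
        groups[seq.upper()].append(sample_id)
    return dict(groups)


# Toy alignment (hypothetical sample IDs and sequences)
toy = {"s1": "ACGT", "s2": "ACGT", "s3": "ACTT", "s4": "acgt"}
haplotypes = group_haplotypes(toy)
for seq, members in haplotypes.items():
    print(seq, len(members), members)
```

The number of members per haplotype is the quantity that would typically be drawn as node size in a network figure.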

In analyzing the global-scale evolution and spread of human CoV2, we noticed the presence of Chinese samples only in clusters A and B, highlighting the tight quarantine practices applied to Chinese citizens, which proved efficient in retaining a homogeneous viral genomic pool. On the other hand, samples from the USA were split into distinct clusters, indicating multiple ports of entry of the virus and implying a delay in quarantine measures. Although the USA had restrictions in place on all traffic coming from China, such measures were not applied to traffic coming from the rest of the world; hence the virus was not contained as efficiently as it was in China. A similar phenomenon was observed in the Spanish samples, which are located in three different clusters (A, B and C) and share ancestors from Taiwan, China, the USA and Israel separately. In contrast, genomes reported from the USA population indicate that the passengers from the cruise ship Diamond Princess were efficiently quarantined and treated and are not the major source for the spread of infection in the USA. The clustering of the cruise samples near the ancestral node is explained by two main reasons. Firstly, passengers were carrying the virus from the epicenter, China, and secondly, they remained isolated on board the ship, which restricted viral evolution. Specifically, sequences from WA and VA have shown diversity and are scattered across almost every cluster. Overall, our data emphasize that the CoV2 spread is higher in the USA due to heterogeneity in the viral pool when compared to the rest of the affected countries. In addition, the US government needs to take strict measures to keep the viral spread limited to the source by restricting the free movement of citizens.

Methods

All the sequences used in this study were retrieved from the NIH NCBI Virus database (http://www.ncbi.nlm.nih.gov/labs/virus). Entries with incomplete genomes were removed, leaving a final dataset of 473 sequences comprising 355 sequences from the USA, 71 from China, 18 from Spain, 4 from Korea, 3 from Taiwan, 2 each from India, Pakistan, Vietnam, Nepal, Israel and Iran, and 1 each from Australia, Finland, France, Peru, Brazil, Japan, Sweden and Colombia. Prior to alignment with the MAFFT online server [17], non-nucleotide characters were removed. The alignment, which included the bat coronavirus sequence, was performed with a strategy mode using the 1PAM/κ=2 substitution matrix and a gap penalty score of 1.53. The alignment file was manually adjusted by removing the first 30-40 nucleotides at the 5′ end and the poly-A sequences at the 3′ end. The aligned dataset was imported into MEGA to generate a time tree using the neighbor-joining approach [18]. The DnaSP 6 package was used for data format conversion [19]. PopART version 1.7 was used to convert all the time trees into median-joining networks using an epsilon value of 0, and the final networks were drawn with an iteration value of 5000 [20, 21]. For graphical manipulations, the paint.net image editor was used. To reproduce these data, the alignment file of all 473 genomes can be accessed from the supplementary section.
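The cleaning and end-trimming steps can be sketched as follows. This is only an approximation under stated assumptions: the helper names are hypothetical, the trimming here is applied to a single unaligned sequence rather than to the alignment file that was adjusted manually, and a fixed cut of 30 nucleotides stands in for the 30-40 nucleotide range.

```python
import re

# IUPAC nucleotide codes; every other character is treated as non-genomic
NON_NUCLEOTIDE = re.compile(r"[^ACGTURYSWKMBDHVN]")


def clean_sequence(raw: str) -> str:
    """Uppercase the sequence and drop digits, whitespace and other stray characters."""
    return NON_NUCLEOTIDE.sub("", raw.upper())


def trim_ends(seq: str, five_prime: int = 30) -> str:
    """Remove the first `five_prime` bases and any run of 'A' at the 3' end.

    Note: stripping every trailing 'A' may also remove genuine terminal bases.
    """
    return seq[five_prime:].rstrip("A")


# Hypothetical toy input, not real genome data
raw = "ggacgtc 12\nAAAA"
print(trim_ends(clean_sequence(raw), five_prime=2))
```

For real data the cut points would need to be chosen by inspecting the alignment, as was done manually in this study.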

Declarations

Acknowledgments

The authors acknowledge all the researchers who are working hard on the collection, sequencing and data deposition of SARS-CoV2 samples and thank all the medical professionals who are working on the front line fighting COVID-19.

 

Funding

No internal or external funding sources supported this work.

Conflict of interests

The authors declare no conflict of interest.

Author Contribution

All authors have read and approved the manuscript. AA and SM: design, analysis and writing. FN: writing and proofreading.

Availability of data and materials

All the data is submitted.

References

  1. Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, Zhao X, Huang B, Shi W, Lu R et al: A Novel Coronavirus from Patients with Pneumonia in China, 2019. N Engl J Med 2020, 382(8):727-733.
  2. Chan JF, Yuan S, Kok KH, To KK, Chu H, Yang J, Xing F, Liu J, Yip CC, Poon RW et al: A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. Lancet 2020, 395(10223):514-523.
  3. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, Zhang L, Fan G, Xu J, Gu X et al: Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020, 395(10223):497-506.
  4. Drake JW: Rates of spontaneous mutation among RNA viruses. Proc Natl Acad Sci U S A 1993, 90(9):4171-4175.
  5. Forster P, Forster L, Renfrew C, Forster M: Phylogenetic network analysis of SARS-CoV-2 genomes. Proc Natl Acad Sci U S A 2020.
  6. Xiao C, Li X, Liu S, Sang Y, Gao SJ, Gao F: HIV-1 did not contribute to the 2019-nCoV genome. Emerg Microbes Infect 2020, 9(1):378-381.
  7. Chan KS, Lai ST, Chu CM, Tsui E, Tam CY, Wong MM, Tse MW, Que TL, Peiris JS, Sung J et al: Treatment of severe acute respiratory syndrome with lopinavir/ritonavir: a multicentre retrospective matched cohort study. Hong Kong Med J 2003, 9(6):399-406.
  8. Holshue ML, DeBolt C, Lindquist S, Lofy KH, Wiesman J, Bruce H, Spitters C, Ericson K, Wilkerson S, Tural A et al: First Case of 2019 Novel Coronavirus in the United States. N Engl J Med 2020, 382(10):929-936.
  9. Wang Z, Chen X, Lu Y, Chen F, Zhang W: Clinical characteristics and therapeutic procedure for four cases with 2019 novel coronavirus pneumonia receiving combined Chinese and Western medicine treatment. Biosci Trends 2020, 14(1):64-68.
  10. Savarino A, Di Trani L, Donatelli I, Cauda R, Cassone A: New insights into the antiviral effects of chloroquine. Lancet Infect Dis 2006, 6(2):67-69.
  11. Multicenter Collaboration Group of Department of Science and Technology of Guangdong Province and Health Commission of Guangdong Province for Chloroquine in the Treatment of Novel Coronavirus Pneumonia: [Expert consensus on chloroquine phosphate for the treatment of novel coronavirus pneumonia]. Zhonghua Jie He He Hu Xi Za Zhi 2020, 43(3):185-188.
  12. Gao J, Tian Z, Yang X: Breakthrough: Chloroquine phosphate has shown apparent efficacy in treatment of COVID-19 associated pneumonia in clinical studies. Biosci Trends 2020, 14(1):72-73.
  13. Wilder-Smith A, Freedman DO: Isolation, quarantine, social distancing and community containment: pivotal role for old-style public health measures in the novel coronavirus (2019-nCoV) outbreak. J Travel Med 2020, 27(2).
  14. Mizumoto K, Kagaya K, Zarebski A, Chowell G: Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan, 2020. Euro Surveill 2020, 25(10).
  15. Frost SD, Pybus OG, Gog JR, Viboud C, Bonhoeffer S, Bedford T: Eight challenges in phylodynamic inference. Epidemics 2015, 10:88-92.
  16. Huson DH, Bryant D: Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 2006, 23(2):254-267.
  17. Katoh K, Rozewicki J, Yamada KD: MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform 2019, 20(4):1160-1166.
  18. Kumar S, Stecher G, Li M, Knyaz C, Tamura K: MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Mol Biol Evol 2018, 35(6):1547-1549.
  19. Rozas J, Ferrer-Mata A, Sanchez-DelBarrio JC, Guirao-Rico S, Librado P, Ramos-Onsins SE, Sanchez-Gracia A: DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets. Mol Biol Evol 2017, 34(12):3299-3302.
  20. Bandelt HJ, Forster P, Rohl A: Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol 1999, 16(1):37-48.
  21. Leigh JW, Bryant D: popart: full-feature software for haplotype network construction. Methods in Ecology and Evolution 2015, 6(9):1110-1116.