Comparative genomics and diversity of SARS-CoV-2 suggest potential regional virulence

The global pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) forced immediate lockdowns in the affected territories. However, a few provinces were able to limit infection severity and kept death rates low. Comparative genomics of the available SARS-CoV-2 genome sequences revealed three types of genomic features, arising from insertions/deletions of orf3a, orf6, orf7a and orf7b. Whole-genome phylogeny (n = 75 genomes) revealed large diversity within SARS-CoV-2, with isolates distributed in six clusters, namely China, Diamond Princess, Asian, European, USA and Beijing. This study points to genomic diversity associated with a high mutation rate and the migration of carriers across the world. Here, we describe the polymorphic loci of the spike glycoprotein and its putative mechanisms of pathogenicity, which include GPI anchor amidation, PPI hotspot, O-linked glycosylation, catalytic site, iron binding site, signal cleavage, disulphide linkage, sulfation, transmembrane region and C-terminal signal sites. Mutational changes in the spike glycoprotein of samples from South Korea, India, Greece, Spain, Australia, Sweden and Yunnan suggest the prevalence of mutated strains with either low or high virulence. Regions of the spike glycoprotein also have a high binding capacity for angiotensin-converting enzyme 2 (ACE2), suggesting a key link for explaining damage to multiple organs including the lungs, kidney and heart. The factors influencing mutations in the spike glycoprotein region need to be investigated to understand and neutralize the upsurge of this alarming pandemic and to control the global spread of the disease.


Introduction
In the history of global infections, COVID-19 (coronavirus disease 2019) has left the footprint of a dangerous, uncontrollable outbreak. Towards the end of 2019, people in Wuhan, the capital city of Hubei province in China, developed a strange pneumonia-like infection of unknown aetiology. The causative agent was later recognized as a member of the coronavirus family 1. The pandemic has spread to over 210 countries, with over 3,248,685 confirmed cases and 229,399 deaths worldwide to date (01 May 2020) (https://www.worldometers.info/coronavirus). As of the date of manuscript preparation, the US has almost one third of all COVID-19 cases worldwide. With a high mortality rate of about 3-6% across the world, the havoc created by COVID-19 has been massive. Transmission has been extensive, and understanding genome variation has become a priority. To date, no specific drugs or vaccines are available to control the infection, and symptomatic treatments to block viral replication are in early trials. Treating COVID-19 as a major public health emergency, several countries have suspended trade and called off social events to prevent community transmission. Furthermore, countries worldwide have resorted to self-quarantine and social isolation as containment strategies. Medical supplies, protective agents and hand hygiene are the only means of limiting the transmission of this deadly disease. Ian M. Jones suggested that SARS-CoV-2 mutates rapidly in the respiratory tract 2. Data sharing among collaborators and investigators has made the analysis easier and more accurate. Moreover, certain strains prevailing in a few provinces have shown low mortality rates, whereas countries such as Italy, Spain, the USA and a few others have had high mortality rates, possibly indicating the presence of an evolved, more virulent strain compared with the original strain from Wuhan.
Understanding the genetically distinct variants in a phenotype is very important in analysing the pathogenic mechanism of infected hosts.
As RNA viruses tend to evolve rapidly in large populations with short generation times 3, monitoring evolutionary patterns in "real time" is important. Because this rapid evolution leads to the emergence of many novel strains of SARS-CoV-2, differentiating between the virulent strains has become an important task in the current scenario. The virus was found to be most closely related to the betacoronaviruses of bats (RaTG13) and pangolins (Manis javanica), with 96.2% and 91.02% homology, respectively. The spike gene of SARS-CoV-2 shows slight variation, with a polybasic cleavage site (PCS) 4,5. The PCS of the spike protein is cleaved by furin, enabling infection of different organs of the host [4][5][6]. Whole-genome sequence (WGS) data could reveal the evolution of the virus and the reasons for its mutation rates 1,7. Laboratories from many countries have deposited over 2400 genome sequences of SARS-CoV-2 in the NCBI, GISAID and Nextstrain databases (https://nextstrain.org/ncov), which allowed us to analyze this novel virus.
Acknowledging the importance of the spread and the evolution of the virulent pathogens, the NEXTstrain database provided the necessary information related to the phylodynamics, genomes and the surveillance data 8 .
It has been hypothesized that the pathogenesis of the disease is possibly due to alveolar damage followed by spleen atrophy, enlarged liver, kidney injury and neuronal dysfunction in patients 9,10. The ability of SARS-CoV-2 to interact with the kidneys of the host, together with the shedding of viral particles through the faeces and urine of patients, suggests multiple organ damage with increased severity 11. Although the reported target organ is the lung, owing to the specific binding of SARS-CoV-2 to ACE2 receptors, the presence of other sites responsible for effective binding of the spike protein in other organs remains an enigma. Hence, the present study focused on variations in the SARS-CoV-2 genomes distributed worldwide.
The pathogenesis of the mutated strains will be investigated further in biopsy samples of different human organs.

Results And Discussion
Genome diversity and comparative genomics among SARS-CoV-2
In the present investigation, SARS-CoV-2 genome sequences were retrieved from the NCBI and GISAID databases (until 25th April 2020). Among the 520 complete SARS-CoV-2 genomes, those that showed variation in size and geographic region were targeted. A total of 75 complete sequences of SARS-CoV-2 prevalent in countries such as China, USA, France, Australia, Spain, Italy, India, Nepal, Taiwan, South Korea, South Africa, Greece, Sweden, Pakistan, Peru, Brazil, Iraq, Turkey and Israel were collected (Table S1). Information on the patients' ethnicity and racial background was not readily available for all the samples. The genome analysis revealed that SARS-CoV-2 has a genome of about 30 kbp with 10 to 12 genes (Table 1). The largest genome, 29945 bp, was found in a patient from Shanghai, China (SH01; accession number MT121215; reported on 2nd February 2020), while the smallest genome, 29852 bp, was detected in a patient from the USA (CA6; accession number MT044258; isolated on 27th January 2020). These two samples were compared with Wuhan-Hu-1 (MN908947 or NC045512), which has a genome size of 29,903 bp and serves as the reference sample (Fig. 1a). Comparative genomics of these three isolates revealed that the SH01 sample had a deletion of ORF3a, ORF6, ORF7a and ORF7b, while the CA6 isolate had a deletion of ORF7b. Nevertheless, the genome of SH01 was larger than those of the other two samples. Although the function of these genes is not known, their absence reveals diversity at the strain level (Table 1, Fig. 1b).
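The size comparison above reduces to simple arithmetic on the three reported genome lengths. The following minimal Python sketch tabulates each isolate's length relative to the reference; the dictionary labels and the function name `length_differences` are chosen here for illustration and are not part of the original analysis, which was carried out in Geneious Prime.

```python
# Genome lengths (bp) exactly as reported in the text above.
# The dictionary keys are labels chosen for this sketch.
REFERENCE_ID = "Wuhan-Hu-1 (MN908947)"
genome_lengths = {
    "Wuhan-Hu-1 (MN908947)": 29903,
    "SH01 (MT121215)": 29945,
    "CA6 (MT044258)": 29852,
}


def length_differences(lengths, reference_id):
    """Return each genome's length minus the reference length, in bp."""
    ref_len = lengths[reference_id]
    return {
        name: n - ref_len
        for name, n in lengths.items()
        if name != reference_id
    }


differences = length_differences(genome_lengths, REFERENCE_ID)
for name, delta in differences.items():
    print(f"{name}: {delta:+d} bp relative to the reference")
```

A net length difference of this kind says nothing about which loci were gained or lost; that requires the alignment-based comparison described in the text.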
Most coronaviruses (CoVs) of the Coronaviridae family possess two overlapping polypeptides, ORF1a and ORF1b, and structural proteins such as spike (S), envelope (E), membrane (M) and nucleocapsid (N) 12. Among the samples analysed, ORF7b was present in only 8 samples: ncov-FIN, Yunnan-01, WH09/CHN, WIV02, WIV04, WIV05, WIV06 and WIV07. In addition, ORF3a, ORF7a, ORF7b and ORF8 were found to be deleted in the HU/DP/Kng/19-20, SH01/CHN, WHU01 and WHU02 samples. However, the severity of infection associated with these variants in the corresponding patients is not yet known, as case history details are not available. These mutations could affect immunogenicity and render a strain either less or more virulent than the wild-type strain. The prevalence of more virulent strains may increase the severity of the outbreak.
However, extensive research is needed to correlate the nature of the mutations with outbreak severity. [...] isolates from Wuhan, France, Australia, Singapore, USA, Hong Kong, Taiwan, Japan and other countries. It will be of interest to study the transmission of these mutations and their pathogenesis in any area where they are prevalent.
Notable changes at Spike glycoprotein
Despite critical advances in sequencing technologies, which have enabled the discovery of thousands of novel zoonotic viruses, methods for downstream evaluation of these novel sequences are deficient. Hence, a more applicable approach to functional viromics is needed to understand host-protein interactions. The spike (S) protein mediates entry of the virus into host cells by binding to angiotensin-converting enzyme 2 (ACE2). The motif finder programme identified 9 Pfam motifs in the S protein of Wuhan-Hu-1 (Fig. S1a) (https://www.genome.jp/tools/motif/). The S proteins of CoVs isolated from bats and from infected humans had >98% homology, with few mutations. The S protein had a receptor-binding domain (RBD) and an O-linked glycan residue domain with a polybasic cleavage site (RRARS), which were analysed by multiple alignment using the Geneious Prime programme 15. In the current investigation, we find that the RBD is highly conserved across all 75 samples, with 9 amino acid variations compared with Bat-RaTG13 (Fig. S1b). Similarly, the O-linked glycan residue domain had an insertion of four amino acids, PRRA (Fig. S1b). Subsequently, samples showing mutations in the spike glycoprotein were retrieved from the Nextstrain database. Among the 358 samples analysed (data not shown), over 33 showed variations suggestive of strain variation (Table 2). Samples from Peru (1), Israel (1), Greece (3), Spain (2), France (1) and India (10) shared a common mutation, D614G. However, these samples also had variations in other ORFs of their genomes, suggesting strain diversity. Most of the strains possessed a unique mutation pattern, indicating strain specificity. However, analysis of the immunological aspects of the various strains is still lacking and needs to be investigated.
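Substitutions such as D614G are named by comparing a sample with the reference at each aligned position. The sketch below shows one way this could be done for two pre-aligned protein sequences. The function name `call_substitutions`, its `first_position` parameter and the short fragments are assumptions introduced for illustration; the fragments are written by hand and are not sequences taken from the study, whose variants were identified with alignment and database tools.

```python
def call_substitutions(ref_aligned, sample_aligned, first_position=1):
    """List substitutions (e.g. 'D614G') between two aligned protein sequences.

    Positions are numbered along the reference, starting at first_position,
    so gap columns in the reference ('-') do not advance the counter.
    Insertions and deletions are skipped; only substitutions are reported.
    """
    if len(ref_aligned) != len(sample_aligned):
        raise ValueError("aligned sequences must have equal length")
    substitutions = []
    ref_pos = first_position - 1
    for ref_res, sample_res in zip(ref_aligned, sample_aligned):
        if ref_res == "-":
            continue  # insertion relative to the reference
        ref_pos += 1
        if sample_res == "-" or sample_res == ref_res:
            continue  # deletion in the sample, or identical residue
        substitutions.append(f"{ref_res}{ref_pos}{sample_res}")
    return substitutions


# Illustrative fragment, assumed here to span spike residues 612-618.
reference_fragment = "YQDVNCT"
sample_fragment = "YQGVNCT"
print(call_substitutions(reference_fragment, sample_fragment, first_position=612))
```

Numbering along the reference rather than along the alignment keeps the names comparable between samples that carry insertions or deletions elsewhere in the protein.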
In the entire study, the structural genes of SARS-CoV-2 were mutated more rapidly than the non-structural genes.
It has been reported that the human angiotensin-converting enzyme 2 (ACE2) receptor is the binding site for most SARS-CoVs 11. This was supported by another study, which asserted that the novel SARS-CoV-2 uses ACE2 to bind to and enter the host cell 4. ACE2 expression in organs such as the kidney and heart has been reported, providing a mechanism for the multi-organ dysfunction that can be seen with SARS-CoV-2 infection 16,17. Interspecies diversity was found among the different bat species harbouring coronaviruses 18. In the same study, surveillance of bat CoVs revealed the presence of SARS-like coronaviruses, unclassified betacoronaviruses and new betacoronavirus species. Co-infection with these CoVs in mineshaft bat species showed potential for infection in the host. Further, the RBDs of pangolin CoVs are identical to that of SARS-CoV-2 at 6 of the 6 key amino acids examined previously 18,19. This observation shows that entry of a CoV into a host with human-like ACE2 could select for an RBD with high affinity 15. Whether ACE2 expression in these organs affects SARS-CoV-2 infectivity remains unclear. The majority of scientific reports state that acute kidney injury (AKI), abdominal discomfort and cardiac damage are the most commonly reported symptoms of COVID-19 20,21, suggesting that SARS-CoV-2 may have a tropism for these organs. Such recombination and transmission could likewise select for the insertion of the polybasic cleavage site (PCS), which is absent in pangolin and bat coronaviruses 12. These PCSs are highly conserved within a particular strain and indicate high pathogenicity, leading to a possible pandemic outbreak with a high mortality or morbidity rate 22, as observed for the H5N1 virus. A putative recognition motif, PRRARSV, is present in all the samples analysed and is the active site for furin recognition 14,23.
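Checking whether a spike sequence carries the PRRARSV recognition motif is a plain substring search. The sketch below illustrates this on two short fragments written by hand for this example, one with and one without the PRRA insertion; they are illustrative inputs, not records from the study. The `R..R` pattern encodes the commonly cited minimal furin consensus R-X-X-R and is included here as an assumption, since the text itself names only the PRRARSV motif.

```python
import re

# Motif string taken from the text above.
RECOGNITION_MOTIF = "PRRARSV"
# Assumption for this sketch: minimal furin consensus R-X-X-R.
MINIMAL_FURIN_PATTERN = re.compile(r"R..R")


def find_motif(sequence, motif):
    """Return the 1-based start positions of every occurrence of motif."""
    # A lookahead is used so that overlapping occurrences are also found.
    pattern = f"(?={re.escape(motif)})"
    return [match.start() + 1 for match in re.finditer(pattern, sequence)]


# Short illustrative fragments (hand-written, hypothetical inputs).
with_insert = "QTQTNSPRRARSVASQSI"
without_insert = "QTQTNSRSVASQSI"

hits_with = find_motif(with_insert, RECOGNITION_MOTIF)
hits_without = find_motif(without_insert, RECOGNITION_MOTIF)
print(hits_with, hits_without)
print(bool(MINIMAL_FURIN_PATTERN.search(with_insert)))
```

Positions returned by `find_motif` are relative to the fragment supplied, so they correspond to spike residue numbers only when the full-length protein is used as input.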
Natural selection of a virus with the cleavage site would probably have taken place if a virus similar to the existing SARS-CoV had undergone several passages in in vitro cell-line models. It is improbable that the O-linked glycan site would have arisen without immune pressure, which is not present in cell lines. The insertion of the PRRA amino acids makes SARS-CoV-2 novel and more pathogenic than SARS and MERS. [...] (Table 3). Deletions of the amino acids tyrosine (Y), lysine (K) and glycine (G) at positions 144, 528 and 107 were noticed in subjects from India (MT012098), Spain (MT233521) and France (MT320538), respectively, who had a travel history from Wuhan, China (https://www.covid19india.org/, www.nextstrain.org). Although the spike protein showed no variation at the receptor-binding site, the mutations noticed in these 42 samples could either increase or decrease the severity of the outbreak. However, further analysis is required to establish the severity associated with these samples. The prevalence of these strains in different geographical regions has to be assessed, as they might serve as biomarkers for understanding antigenic and immunogenic changes. Analysis of the cases in these provinces showed that the death rate was low in South Korea, Greece, Brazil, Israel, Peru, Turkey, South Africa and Australia, and that the COVID-19 case curve has declined there. However, these states also adopted many measures to control the outbreak, such as early lockdowns, self-isolation, social distancing and hygienic practices, as instructed by their governments. In Sweden and India, however, COVID-19 cases are still being analysed and the curve is rising owing to an increase in confirmed cases (https://www.worldometers.info/coronavirus/). The death toll in these areas is comparatively low compared with other areas such as Wuhan, Italy, Spain, France, the United States of America and Germany.
This might indicate that the prevalence of mutated strains, which might have emerged during co-infection within the provinces, has either reduced or increased severity. Furthermore, pathogenicity was assessed probabilistically in the putatively neutral variants. The MutPred-Indel software could assess the sites responsible for virulence [...].
Table note: "-" indicates that no mechanism of pathogenicity was detected.
Petit et al. 24 suggested that palmitoylation aids anchoring during cell fusion and receptor binding in SARS-CoV. The presence of this mechanism in the COVID-19 samples suggests conformational changes during palmitoylation, leading to signal transduction at both intra- and extracellular domains. Sulfation is involved in protein-protein interaction and has been found to play a role in extracellular extension for high-affinity binding, leading to the activation of receptors and the stabilization of proteins through correct protein folding 25. Hence, mutational changes in the spike glycoprotein may induce conformational changes, which are most likely to drive evolving antigenicity 26. Studies on the localization of amino acids associated with this protein among different variants of SARS-CoV-2 are not readily available. A recent study on protein-protein interactions (PPI) by Gordon et al. 27 suggested that the spike protein can interact with the GOLGA7-ZDHHC5 acyl transferase complex, which can be a therapeutic target. GPI anchor sites are also found to target the host innate defense system and function in trafficking, cell adhesion and metabolism. It has been reported that bone marrow stromal antigen 2 (BST2), also called CD317 or tetherin, can inhibit the release of enveloped viruses; hence, such sites can be targeted for therapeutics 28. It will be important to explore these mutational changes.
Along these lines, reinforcing SARS-CoV-2 surveillance across different geographical regions can provide scientific evidence for its increased pathogenicity and support preventive and control measures against transmission of the disease.
ACE2 expression in human organs targeted in kidney
SARS-CoV-2 infection starts with binding of the viral surface spike protein to the human angiotensin-converting enzyme 2 (ACE2) receptor, following modification of the spike protein by transmembrane protease serine 2 (TMPRSS2) 29. ACE2 is expressed in the lung (principally in Type II alveolar cells 7), which seems to be the predominant portal of entry. With regard to SARS entry into target human cells, ACE2 protein is significantly expressed in the epithelial cells of the lung alveoli and small intestine and in the endothelial cells of organs including the spleen, kidney, liver, lymph nodes and brain 30,31. Burgeoning data confirm an association of COVID-19 infection with increased morbidity and mortality from kidney disease. It is important to investigate whether SARS-CoV-2 replication occurs in these organs and contributes to dissemination of the virus throughout the body.
High expression of ACE2 was noticed in proximal tubular cells and, to a lesser extent, in podocytes; however, kidney glomerular endothelial and mesangial cells were not affected 17. Only 6% of patients infected with SARS-CoV experienced acute kidney injury (AKI) during the 2003 SARS outbreak 32. Nevertheless, AKI was identified as a serious complication of SARS, with a mortality of 92% in patients 32. In post-mortem examinations of SARS patients, SARS-CoV viral particles were noticed in renal specimens, suggesting that AKI was caused by active replication of SARS-CoV in tubular cells 32. They suggested that renal impairment was likely associated with multi-organ failure, as SARS-CoV was not demonstrable in any of the examined patients. Further, AKI (including in patients with cytokine release syndrome and SARS) might be a specific pathogenic condition and might not be due to active replication of the virus in the kidneys 32,33. Increased viral infection of alveolar cells leads to the production of large amounts of cytokines, causing multiple-organ failure. A previous study reported that the release of interferon-gamma-related cytokines increased the severity of organ damage in SARS patients 34. Recently, a study described the human kidney as a specific target for SARS-CoV-2 infection 35. The higher renal tropism of SARS-CoV-2 compared with SARS-CoV can be explained by the increased affinity of SARS-CoV-2 for ACE2, which contributes to pronounced infection of the kidney, leading to a viral reservoir 36.
In addition, a small survey of COVID-19 patients revealed that proteinuria and haematuria are common features, noticed in 40% of patients after hospitalization 37. A reduced density of inflammation and edema was observed in CT scan reports of kidneys infected with SARS-CoV-2 38. Furthermore, patients infected with SARS-CoV-2 seem to be affected by AKI more frequently than those infected with SARS-CoV 37. A very recent study by Yao et al. 39 confirms that SARS-CoV-2 infection damages vessels, the kidney and other organs, in addition to the lungs. Hyaline thrombi are found in small vessels in different organs. It is of utmost importance to investigate pathological changes in autopsy material. Before organ donation is considered in future, it will be important to investigate whether SARS-CoV-2 has infected the kidneys; the risk of such organ grafts has not been reported as yet. In any case, it has been indicated that SARS-CoV-2 has a high tropism for the kidney, where it has been shown to replicate in almost 30% of COVID-19 patients 40. Consequently, screening kidney donors for COVID-19 is probably all the more important, and donors who have symptoms or a travel history to high-risk regions need to be quarantined for 14-28 days 31. A research study demonstrated that more than 66% of the patients who died of COVID-19 infection had diabetes or cardiovascular disease 41. As a first-line treatment, angiotensin-receptor blockers (ARBs) were given to COVID-19 patients. Certain reports revealed that ARBs increased ACE2 expression by nearly 2- to 5-fold in kidney and heart samples [42][43][44][45]. Since SARS-CoV-2 has a high tropism for the kidney 35

Materials And Methods
Information related to daily cases of COVID-19 and SARS-CoV-2 genome data
The genome data of SARS-CoV-2 were retrieved from public repositories such as NCBI, and global information on COVID-19 cases was obtained from the worldometers (https://www.worldometers.info/coronavirus/) and NEXTstrain (https://nextstrain.org/ncov) websites.
A total of 75 genomes were selected based on variation in genome size, country and divergence.
The samples used in the current study are listed in Table S1. However, the ethnicity and racial background of the patients are not available for all the samples.

Comparative genome analysis
Two genomes, MT121215 and MT044528, which had the largest and smallest genome sizes, respectively, were compared with the reference sample (MN908947). The comparative genome analysis was performed using Geneious Prime software, version 2019.2.1 13.

Phylogenetic evolution
To further analyze the evolution of the isolates, a phylogenetic tree (n = 75) was constructed from the complete genome data of SARS-CoV-2 using MEGA-X (Molecular Evolutionary Genetics Analysis) software 46. The evolutionary history was inferred using the Neighbor-Joining method with 500 bootstrap replicates 47. The evolutionary distances were computed using the Maximum Composite Likelihood method. The analysis involved 75 nucleotide sequences. There was a total of 39547 positions in the final dataset.
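Neighbor-Joining starts from a matrix of pairwise distances, and bootstrap support is obtained by rebuilding the tree from alignments whose columns have been resampled with replacement. The sketch below illustrates these two inputs only. It uses the simple p-distance (proportion of differing sites) as a stand-in for the Maximum Composite Likelihood distance computed by MEGA-X, and the function names and the three toy sequences are assumptions introduced for this example.

```python
import random


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Pairwise p-distances for a dict mapping sequence names to aligned sequences."""
    names = list(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x in names
        for y in names
    }


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Toy aligned nucleotide sequences (hypothetical, for illustration only).
toy_alignment = {
    "isolate_A": "ATGCATGCAT",
    "isolate_B": "ATGCATGCAA",
    "isolate_C": "ATGAATG-AA",
}
matrix = distance_matrix(toy_alignment)
replicate = bootstrap_alignment(toy_alignment, random.Random(0))
print(matrix[("isolate_A", "isolate_B")])
```

In a full analysis, a distance matrix would be computed for each of the 500 replicates and a tree built from each one; the fraction of replicate trees that contain a given cluster is its bootstrap support.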
Bioinformatics tools used in the analysis of spike protein
Multiple sequence alignment of the spike protein among the 75 isolates was performed using the Clustal W programme 48. The VIPR database was used to analyse single nucleotide polymorphisms (SNPs) in the spike glycoprotein, as described by Pickett et al. 49. Genome Detective Virus Tools was also used for mutational analysis (https://www.genomedetective.com/app/typingtool/virus/). The Pfam motifs were analysed using the genome motifs database (https://www.genome.jp/tools/motif/). Further, the receptor-binding region and the polybasic cleavage site were determined as described by Andersen et al. 15.

Pathogenicity prediction in the phenotypes
To further assess the pathogenicity of the (putatively neutral) variants, a machine learning-based software package was applied to the spike protein phenotypes. The MutPred-Indel software probabilistically assesses the pathogenicity of neutral variants and suggests the features affecting the phenotypes 50.

Declarations

Authors' Contribution
SMD performed the bioinformatics analysis on the collected data, planned the work and wrote the manuscript, AP had collected the data and worked on kidney biopsy samples of COVID-19, BK edited the manuscript, and KS monitored the analysis and edited the manuscript.

Figure 2
Phylogenetic emergence of COVID-19 among 75 prevalent samples globally. In this study, each sample was given an ID; however, details of the ethnicity and geographical location of most of the patients were not available.