Phylogeographic analysis of SARS-CoV-2 variations from Indian isolates

The aggressive multiplication of coronavirus in the Indian population is probably due to the faster mutation rate. At the time of commencement of this work, India was not present in the list of Top 10 worst-affected countries. However, upon completion of this manuscript, India is ranked No. 3 and during publication of this manuscript it may even elevate to the top two positions due to the pandemic. In this study, SARS-CoV-2 isolates of Indian origin were compared with the Wuhan reference sequence. Phylogenomic, proteomic, and phylogeographic analyses were performed. The genome comparisons indicated that majority of the sequence variations are associated with protein-coding regions, especially Orf1a and spike glycoproteins, while Orf7a had consistent variations, whereas Orfs 6a, 8 and 10 had negligible variations. The terminals of the genomes compared had high sequence entropy. However, the polyadenylation signal was invariant in the analysed dataset. The codon usage frequency indicated that UGU (code for cysteine) is the most frequent codon, while the least frequent was GCG (code for alanine). The amino acid frequency showed that the most abundant was leucine (12.5%), and the least was histidine (2.45%). The phylogeographical patterns were mapped for all the representative states of India, and were supplemented with few representative countries. The unique differences in the sequence of the Kerala isolate (EPI_ISL_413522) were resolved to be sequence errors rather than mutations. Based on the phylogeographic analysis, the high probability of mutations likely to be of Indian origin is attributed to the Gujarat cluster.


Introduction
The ongoing pandemic: coronavirus disease (COVID- 19), is a respiratory illness caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), also known as the novel coronavirus (2019-nCoV).This virus surfaced in late 2019, in the Hubei province of Wuhan city, China and is the cause for the ensuing worldwide biological disaster (F.Wu et al., 2020;Zhou et al., 2020).The SARS-CoV-2 belongs to the member of the same coronavirus family that caused the SARS outbreak in 2003 and the Middle Eastern respiratory syndrome (MERS) in 2012.This virus family belongs to an order of non-segmented, Nidovirales.Coronavirus belongs to the Coronavirinae sub-family of Coronaviridae, which primarily affects birds and mammals (MacLachlan & Dubovi, 2017).
The virus derives its name due to the glycoprotein projection from its envelope, which appears like a corona (Virology: Coronaviruses, 1968; Sturman & Holmes, 1983).It was found that these spike proteins assist in the entry of the virus through the angiotensin-converting enzyme II (ACE2) receptor.The virus enters only those cells that express ACE2, as was found for SARS-CoV-1 and in the initial studies of the SARS-CoV-2 (Zhou et al., 2020).The host-dependent translation of the proteins for this virus, like other coronaviruses, follows canonical methods like ribosome frameshifting (-1 frameshift) as summarised by Maranon et al. (2020) and Rodnina et al. (2020).
The genome of Coronaviruses usually varies in its GC%, ranging from 32 to 43%.There are conserved genes -start is the ORFa and ORFb, (ORFab) which codes for replicase gene.This is followed by the spike (S), envelope (E), membrane (M) and the subgroup.(Woo, Huang, Lau, & Yuen, 2010).Extensive phylogenetic analysis has revealed variations in the genome such as L (70%) and S (30%) clusters as well as A, B, and C clusters (Tang et al., 2020;Forster et al., 2020).Although there are several studies on comparison of whole-genome as well as spike proteins of SARS-CoV-2 at global level, there are relatively very few studies published on SARS-CoV-2 sequences from India.
Since the population in India is highly diverse, it was proposed to analyse the genomic variations of SARS-CoV-2 isolates sequenced from different states of India.These variations were compared with the Wuhan reference sequence (NC_045512) of March 2020.While documenting our results, there were few published papers from India emphasising on mutations of available sequences (Anindita et al., 2020) and variations in sequences from the state of Gujarat (Hassan et al., 2020).However, the mapping of conservations and variations at the strain level, inter-state differences, codon frequency, phylogeographic, phylogenomic and proteomic analyses were not performed exclusively for representative Indian states and union territories.Hence, a complete genomic and proteomic analyses of SARS-CoV-2 isolates of India along with few global representatives are performed and presented here.

Sequence retrieval
The SARS-CoV-2 genome sequences were retrieved from the GenBank (Sayers et al., 2020) and EpiCoV database of GISAID initiative (https://www.epicov.org/).The reference genome (NC_045512) and its corresponding protein sequences (YP_009725318.1) were retrieved from the GenBank database.The genome sequences were collected from different states of India.Partial sequences were ignored, and at least one full sequence from each state of India was collected from the database as on 21 st May 2020 (Fig. 1).Along with this dataset a representative sequence from the neighboring countries were (available as on 3 rd April 2020) also included for comparison.

Sequence alignment
The 33 representative genomes were stored in a FASTA formatted file, to which the SARS-CoV-1 genomic sequence, as well as the Wuhan reference sequence (NC_045512), were appended.This brings the total count to 35.After a draft alignment, the sequences were trimmed to a uniform size to ensure better alignment and also to avoid internal gaps that are introduced due to length variations.Multiple sequence alignment was performed using standalone ClustalX 2.1, with default parameters (Larkin et al., 2007).

Shannon Entropy
Shannon entropy was analysed using DAMBE7 (Xia, 2018) itesequence analysis menu was used.Coding regions of the genome were isolated and analysed.Amino acid grouping method was used.A standard text file was generated from which the tables were made.Heatmaps were used to represent data in the form of a matrix with colour gradients to facilitate easier identification of smallest and largest values, which in turn forms the basis of clustering.(Metsalu & Vilo, 2015).These 2D matrices with colour gradients make it easier to spot errors (Yachdav et al., 2014).

Codon usage, frequency analysis and Heat maps
DAMBE7 (Xia, 2018) was used to generate data on codon usage (standard) and relative synonymous codon usage, nucleotide and amino acid frequencies.Tab-separated value data sets were created that served as input for ClustVis (https://biit.cs.ut.ee/clustvis/), an online tool with which heatmaps were created (Metsalu & Vilo, 2015).The default parameters were set which used the unit variance method of row scaling and the singular value decomposition method of imputation.

Proteome analysis
Using YP_009725318.1 as a reference, the corresponding proteins for each sequence were obtained by translating the file containing 34 sequences (using standard codon) in MEGA-X.The same has been described in the results.The sequences of 12 different proteins were aligned using ClustalX 2.1

Phylogenetic tree construction
The sequence alignment file (*.ALN) obtained from 35 sequences was utilised to construct phylogenetic trees using DAMBE7 (Xia, 2018).
A phylogenetic tree was made using the distance-based method.The Fast-ME tree was constructed using SARS-CoV-1 genome sequence (NC_004718.3)as the outgroup.Bootstrapping was enabled and set to 100.

Phylogeographical variations
The first recorded COVID-19 case in India is Kerala (EPI_ISL_413522), which originated in Wuhan, China.Hence, the genomic sequence of the first case was compared with the Wuhan reference sequence using Molecular Evolutionary Genetics Analysis (MEGA X) for Windows (Kumar et al., 2018).Moreover, the mutations in sequences of other Indian states were also compared with the reference mentioned above to determine whether any sequence-specific variations arose in the Indian Subcontinent

Shannon Entropy
The sequence variation of the entire genome is depicted in Fig 2 .A large number of variations from the base value were observed for Orf1a and Spike glycoprotein, while Orf7a had consistent variations, whereas Orfs 6a, 8 and 10 had negligible variations.This is in accordance with the results obtained by (Hassan et al., 2020).(Peng et al., 2016), of which only four sequences Wuhan Reference, Hyderabad (Telangana), South 24 Parganas, and Howrah (West Bengal) (viz., NC_045512, EPI_ISL_431102, 437537, 437537) had a poly-A tail after 57 bases.The remaining sequences were terminated between 53-58 bases downstream, without a poly-A tail.There were no variations in the polyadenylation signal except ambiguous base calls for the two sequences of South 24 Parganas and Howrah (EPI_ISL_437536 and 437537).
The terminals were highly varying due to length differences and probably due to errors in sequencing.Moreover, there were ambiguous bases which are challenging to deal with, especially while inferring mutations throughout the sequences.The first available SARS-CoV-2 isolate from Kerala (EPI_ISL_413522) is devoid of polymight signify a different origin of the Indian isolate when compared to the Chinese isolate, or it may denote an indel (Chatterjee, 2020) as it is well known that the poly-A tail plays a significant role during infection.Truncation of the poly-A tail in Kerala isolate is probably due to sequencing errors or may be due to tail length regulation in coronavirus as the length is not static (Peng et al., 2016).However, it cannot signify a Chatterjee (2020).
There are also length variations, especially in the poly-A tail of other isolates, which is vital for the viral translation.It has been shown in vivo that the transcriptome with the longer poly-A tail is significantly enhanced over the shorter ones (Wu et al., 2013).The reason behind the poly-A tail length variation is unknown, and it is missing in many isolates except (Wuhan Reference, Hyderabad, South 24 Parganas, and Howrah) This may be due to lack of sequence data.This length variation may lead to misinterpretations in understanding the regulatory mechanism of coronavirus during infection (Wu & Brian, 2010).However, it was observed that the polyadenylation signal hexamer AAUUAA is invariant in the analysed dataset (Wu et al., 2013).This is crucial not only for the stability and translation efficiency of mRNA but also for the export of mature mRNAs as demonstrated by (Preiss et al., 1998 1. The codon usage pattern for the entire dataset is shown in Fig. 3.The GC% was calculated to be 38.1208% while the GT% turned out to be 51.7542%.This is due to a higher frequency of Thymine bases.It is also evident from the heat map (Fig. 4a) that T/U, followed by A is most frequent.The codon usage frequency indicated that the most frequent codon was UGU (343.6)code for cysteine, while the least frequent was GCG (23.6) code for alanine.The representation of this in Fig. 4b makes it clear that the UUU, UGU and UAA come under the high-frequency bracket, being marked with dark red.There are quite a few codons that come under low-frequency bracket, indicated by dark green.A relatively leucine-rich pattern is shown in Fig. 4b, followed by threonine and serine.A high-frequency tag of red colour confirms the presence of a large number of the ambiguous bases in the South 24 Parganas sequence (EPI_ISL_437536).Relative synonymous codon usage is shown in Fig. 4c.However, this is not related to amino acid usage (Xu et al. 2008).The amino acid frequency is showed in Fig. 4d.The most abundant amino acid was leucine (12.5%), and the least was histidine (2.45%).
In all the heat maps (Fig. 4), it is observed that there are three distinct clusters.The first cluster comprises West Bengal isolates of Kolkata, Darjeeling and Tehatta along with Kerala which consistently formed a cluster.The second cluster comprises of all the other sequences analysed except Karnataka.The Karnataka isolate is a single sequence that can be considered as a separate cluster if more Karnataka isolates are included.The reason for this is elaborated under the next section (3.3).

Proteome analysis
From the analysed dataset, it was found that few sequences differ in their reading frames for protein translation (open reading frames) in comparison with others.These sequences fall into 3 main groups, as discussed in section 3.2.me +1 for the group containing the West Bengal sequences and Kerala.The same region appears in reading frame +3 for the lone isolate of Karnataka, and it appears in reading frame +2 for all the remaining sequences (including Wuhan reference).

This concept of -1 frameshift has been discussed previously by Maranon et al. (2020) and Rodnina et al. (2020).
It was observed that the majority of the variations occur in nsp6 region and S protein region, and a little lesser but a significant number in N protein.A leucine zipper motif exists in the Orf1ab, Orf3a and Orf7b proteins.A cysteine-rich (5329-5354) and glutamine-rich (929-977) regions were also found in the Orf1ab.The spike glycoprotein of Kerala isolate also contained such cysteine-rich region from residues 1234-1253.In contrast, it was from 1235 to 1254 for the rest of the isolates due to codon deletion present in the Kerala sequence.A serine-rich region was observed in the nucleocapsid protein from the residues 176-206.The Orf8 and 7a contain a protein of unknown function (Entry pfam_fs: DUF308 and DUF1443, respectively).Besides, N-myristoylation site, protein kinase C phosphorylation site, N-glycosylation site and casein kinase II phosphorylation site were found abundantly scattered throughout the proteome.There are a few variable regions in the proteome depicted in Fig. 5a-f, and the Orfs 6a and 10 are highly conserved (Fig. 5g&h).

Phylogenomic analysis
The phylogenetic tree based on the genome sequences of Indian isolates is depicted in Fig 6a .Two significant clusters are branching out at the intermediary node (labelled with the bootstrap value of 99).The cluster A contains geographically diverse strains of the North, South (Andhra Pradesh, and Tamil Nadu), East (e.g.Assam) and West (e.g.Rajasthan) along with Kerala and Wuhan reference among few others.The cluster B mainly contains strains from Gujarat (West) and West Bengal (East) with two sequences of the South (Telangana and Karnataka).This pattern repeats itself (i.e.Telangana and Karnataka clusters with Gujarat and West Bengal sequences, Tamil Nadu and Andhra Pradesh appeared together based on mutation analysis (section 3.5).This tree data is in agreement with Saha et al. (2020) for some of the common sequences viz., ISL-430464, 65, 67, and 435056 were positioned in the same cluster as in cluster B (Fig. 6), and devoid of the cluster containing the reference sequence (cluster A).This also emphasises that ISL-430465 (Darjeeling WB) and 430464 (Kolkata WB) are closely related to each other and more distantly related to ISL-430467(E.Medinipur WB).These clusters are within the country; such instances were known since the first case of Wuhan in December 2019.
The phylogenetic tree of Indian isolates was also appended with global SARS-CoV-2 isolates from geographically distant locations and the clustering trend is depicted in Fig 6b .There are three distinct clusters A1, B1, and C1.The cluster A1 mainly contains the Indian isolates of North and North East India.However, it is interesting to know that the pairs of Karnataka-Telangana and Andhra-Tamil Nadu is clustered together.The cluster B1 consists sequences primarily from East (West Bengal) and the West (Gujarat) and can be referred to as the East-West cluster.This cluster also includes Spain and Peru isolates, and it matches with the results of variation analysis.The Gujarat cluster seems to be a subcluster of this East-West cluster.The cluster C1, is the international cluster (except the Kerala isolate).Since only the Kerala isolated is clustered with the international isolates, this could be be a reference point of divergence, from which Indian strains diverge and progressively become less similar to international strains.Based on the phylogenomic clustering (Table S1), it is evident that the Indian isolates are quite distinct from the international strains.This is in agreement with the global variations described with three central strains A, B and C, where A being the primary ancestral strain.The B strain is confined to Eastern Asia, whereas A and C are found in Europe and the West.Hence, people outside East Asia are immune to the primary B strain.This selection pressure forced the formation of derived B strains (Forster et al., 2020).

Phylogeographical variations
The phylogeographical patterns of variations are mapped for all the representative states analysed (Fig. 1).These variations were specific only to the coding regions rather than the non-coding regions.However, these variations were observed neither in the Wuhan reference genome nor in Kerala.The Indian isolates are organised into regional groups (Table 2) and depicted in Fig. 1.These patterns are similar to the clusters A and B obtained from phylogenomic analysis (Fig. 6).Excluding the Wuhan and the Kerala sequence, the North-type I cluster is almost identical to the genomically derived cluster A (Fig. 6).Also, the pattern of majority Gujarat (West) and West Bengal (East) along with two specific sequences of the South, which forms -(Fig.6).The percentage variations that are geographically specific to each region is showed in Fig. 7. Based on high sequence identity and variations, the following union territories such as Delhi and Ladakh were grouped into Uttar Pradesh (EPI_ISL_435099) and Jammu-Kashmir, respectively.These groupings are not to be considered as an act of disputing the current borders or divisions of states and union territories.They are merely clustered for easy comparison.
Based on the type of mutations (Table 3), three predominant clusters are as follows.
Gujarat cluster: Point mutation (Substitution) of either C or G with T in Orf3a, Orf6 and Orf10 restricted only to the genomes of Gujarat.North-type I: Point mutations in nsp3, nsp6, S & N proteins, predominantly in the Northern states.East-West cluster: Mutations in nsp3, S protein and RNA dependent RNA polymerase genes, which were specific to the Western and Eastern states.
It was observed that the single point mutations were found to be geographically unique.There were five substitutions and three deletions found exclusively in the Kerala sequence (EPI_ISL_413522), which are not observed in the Wuhan reference or any other Indian states.However, these variations (especially the deletions) are not considered to be giving rise to a unique strain, as these variations are probably due to sequencing errors (Balaji et al., 2017).Since those deletions are not present in any of the analysed sequences, including the reference, they are dismissed as sequencing errors, in contrast to Chatterjee (2020), who documented these as mutations.However, to confirm that these deletion patterns in Kerala isolate probably sequencing errors and are not due to a unique strain of distinct geographical origin, the Kerala isolate was compared thoroughly against Wuhan reference sequence.There were two direct identical repeats of 13 bases each (GGTGTTTATTACC) at base positions 21665-21677 and 21986-21998, respectively.308 bases separated these direct repeats.This sequence pattern is exclusive to the Wuhan reference sequence, whereas, in the Kerala isolate, only one repeat is found at base positions 21652-21664.The other repeat has three ---.
-terminus of Kerala isolate.It is evident that the missing bases are probably due to errors in sequencing or contig assembly.The missing bases were misinterpreted as either mutation or error in alignment techniques as proposed by Chatterjee (2020).This may further mislead researchers if the same misinterpretation is published as it is.Moreover, there is no mention of the direct repeats.The sequence comparison of direct rep repeat clarify that the Kerala isolate probably have sequence errors that might have originated during sequencing/assembly by the interference of tandem repeats (Tørresen et al., 2019) rather than mutations.Some of the sequence variations may be laboratory-specific, which might have come from sample preparation, contamination, base call errors, recurrent sequencing errors, the sequencing technology used and consensus calling approaches (Balaji et al., 2017; Maio et al.,2020).

Origin of new strains
There are research gaps put forth by Saha et al. (2020), to determine whether is there any mutations originated in India?It is of interest to identify those strains.To address this gap, we have classified the variations with the possibility of Indian origin in the following manner.
An international travel ban to and from China, Iran and EU, EU FTA, Turkey, UK was imposed on 17 th March 2020 by the Government of India (Ministry of home affairs).This was followed by a complete international travel ban from 22 nd March 2020 onwards (https://boi.gov.in/).Hence, it is to be noted that no passenger would have entered India from other countries after 22 nd March 2020.Moreover, the median time for the appearance of symptoms is 5-6 days and a maximum of 14 days, according to WHO (https://www.who.int).Thus presuming the last person who was exposed to the virus abroad, enters on 22 nd March 2020, the person may develop symptoms within 5 days or between 5 and 14 days.Accordingly, the sequences have been divided based on their collection dates (Table S2), i.e., (i) before 22 nd March 2020, (ii) after 5 days (27 th March 2020 onwards), and (iii) after 14 days (5 th April 2020 onwards).
Based on these classes, mutations in sequences belonging to the first category would be of foreign origin as International travel was allowed and India was in its nascent stage of the pandemic, with very few states affected.In the second category, there is a probability of mutation being in Indian origin, after ruling out that the same mutation does not exist in the first category sequences.The third category of sequences has a very high probability of containing mutations of Indian origin as 14 days after 22 nd March 2020 is the last day a virus from another country could show symptoms in India (Table 4).The original strain might have undergone multiple transmissions throughout India, which is the primary reason for such mutation, as suggested by Shu (2020).One interesting point to note is that the Gujarat cluster, which is placed in the high probability of Indian origin contains variations that are not present in any of the international sequences (collected as of 3 rd April).

Conclusions
The question of identifying mutations in SARS-CoV-2 isolates of Indian origin was addressed based on phylogenomic, proteomic, and phylogeographic analysis (reflecting inter-state differences).The Indian isolates indicated that the majority of sequence variations are associated with protein-coding regions than the non-coding regions.The genomes analysed had a high entropy due to length differences as well as due to the absence of poly-A tail.Although the polyadenylation signal hexamer AAUUAA is invariant in the analysed dataset, the absence of poly-A tail in the dataset hinders the understanding of viral regulation during infection.The comparison of direct repeats in Wuhan reference sequence with Kerala isolate clarified that the interference of tandem repeats.The ORF predictions indicated that the theoretical protein translation for is +3, and for the rest, it is +2.Further analysis of the ribosomal frameshifting and transcriptional slippage events are crucial in understanding the strain-specific variations.The phylogeographical patterns of variations based on mutations of Indian origin was partially solved.However, the fact that international isolates deposited till the early April did not contain the same variations as of the Gujarat clusters, this supports the theory of Indian origin.However, the study is limited only based on the identified variations with only one representative for each state and does not include all states and union territories of India.Hence, further analysis of the nature and patterns of mutations can be done with a larger dataset having fewer sequence errors.The reasons for certain mutations being restricted to particular regions should requires further probing, and more light needs to be shed on the geographic driving forces, primarily in those sequences with confirmed mutations of Indian origin.Intra-state variations can also be studied, keeping the Gujarat cluster in mind and looking for explanations for variations within the same geographical region.It is to be noted that specific mutations in SARS-CoV-2 strains may be responsible for higher risks that may cause fatalities and mortalities.Hence, misinterpretation of mutations along with sequence errors is quite challenging, it is difficult to ascertain such errors that arise due to handling of large sample volumes, high throughput processing and sequencing in the ongoing pandemic.However, a quality check for sequences is crucial not only for the understanding mutations on virulence, pathogenicity, antigenicity and resistance but also for the development of vaccines.

Accession numbers
Indian map depicting the locations from which the samples were collected.The map illustration was created using mapchart.net .Note: The designations employed and the presentation of the material on this map do not imply the expression of any opinion whatsoever on the part of Research Square concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries.This map has been provided by the authors.

Fig 1 .
Fig 1. Indian map depicting the locations from which the samples were collected.The map illustration was created using mapchart.net

Fig 7 .
Fig 7. Percentage variations that define the patterns of the different clusters of (a) North type I cluster, (b) North-type II cluster, and (c) East-West cluster

Figure 2 Entropy plot Figure 3 Codon usage pattern for the analyzed dataset Figure 4 5 SARSFigure 6 (
Figure 2

Table 3 :
Sequence data organized into regional groups based on the type of mutations

Table 4 :
Probable origin of variations in India based on collection dates (TableS1)