Genetic diversity and structural characterization of spike glycoprotein of newly emerged SARS-CoV-2

A new betacoronavirus (SARS-CoV-2) infection was first identified in Wuhan City, China, in December 2019. It then spread rapidly across the globe, and the WHO subsequently declared it a pandemic. SARS-CoV-2 has thus become a global threat to human civilization. Recent studies showed that the proteomic data of SARS-CoV-2 are closely related to those of other betacoronaviruses. A phylogenetic tree constructed in MEGA 7 with the neighbor-joining algorithm revealed the closeness of the recently reported SARS-CoV-2 to SARS-CoV. The spike glycoprotein plays the most important role during the onset of infection, and several mutations in the S protein have been reported across the globe. In this research, molecular docking between the SARS-CoV-2 spike glycoprotein and the ACE2 protein was carried out on the PatchDock web server. WEBnm@ was used for Normal Mode Analysis (NMA) of the docked complex, and the mode with the lowest deformation energy, which signifies domain motions, was selected. Multiple sequence analysis also revealed variations within the spike protein sequences reported globally. Three-dimensional structures of the protein molecules were built using homology modeling, and the structures were validated through the QMEAN score and the Ramachandran plot. All of the modeled sequences had around 91% of their amino acids in the favored region of the Ramachandran plot. To check the difference in binding affinity between the mutated and non-mutated strains, the generated models were docked with human ACE2 molecules. The non-mutated strains gave similar Atomic Contact Energy (ACE) values, whereas the ACE values of the mutated strains varied. This observation provides evidence of phylogenetic diversity and evolution.

Since December 2019, the world has been ravaged by a new strain of coronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus was first reported from Wuhan in the Hubei province of China (Tian et al. 2020), and it drew intense attention not only within China but across the globe. Coronaviruses belong to the categories of viruses that have been reported to cause severe acute respiratory syndrome. On 7 January 2020, Chinese researchers in Wuhan isolated a novel coronavirus (CoV) from patients. SARS-CoV-2 was found to be the seventh coronavirus strain that can infect humans. On 30 January 2020, the World Health Organization declared the SARS-CoV-2 outbreak a public health emergency of international concern. SARS-CoV-2 is a positive-strand RNA virus that shares about 80% identity with SARS-CoV and is about 96% identical to the bat coronavirus BatCoV RaTG13 isolate (Yan et al. 2020).
The surface glycoprotein of coronaviruses, also known as the spike (S) protein, facilitates viral entry into human cells. For viral entry and attachment to the target cell, the S1 subunit of the S protein binds to a cellular receptor. Entry requires S protein priming by cellular proteases, wherein the S protein is cleaved at the S1/S2 and S2′ sites, which allows fusion of the viral and cellular membranes. The spike protein uses the ACE2 molecule as its entry receptor and the cellular serine protease TMPRSS2 for its priming (Hoffmann et al. 2020). After studying the SARS-S/ACE2 interface at the atomic level, researchers found that ACE2 is a key molecule for transmission of the virus. SARS-S and SARS-2-S share around 76% amino acid identity.
A specific region on the S glycoprotein, termed the receptor binding domain (RBD), mediates receptor recognition and membrane fusion on the virion surface (Yan et al. 2020). The RBD is the most variable part of the coronavirus genome (Zhang et al. 2020). Around six RBD amino acids are responsible for binding to the ACE2 receptor and for determining the host range. ACE2 is a type I membrane protein expressed in the lungs, heart, kidneys and intestine; its primary physiological role is in the maturation of angiotensin (Ang), a peptide hormone that controls vasoconstriction and blood pressure. Structural studies have shown that SARS-CoV-2 has an RBD that can bind to the ACE2 molecule of various organisms, viz. humans, ferrets, cats and other species. SARS-CoV-2 may bind human ACE2 with high affinity; however, Anderson and co-workers predicted through computational analyses that, because the RBD sequence of SARS-CoV-2 differs from that of SARS-CoV, the interaction between the RBD and ACE2 is not ideal (Zhang et al. 2020). Another noticeable feature of SARS-CoV-2 is the presence of a polybasic site (RRAR) at the S1-S2 junction (Zhang et al. 2020). This site helps cleavage by furin and other proteases, and it also plays an important role in determining host range and viral infectivity.
Phylogenetic tree analyses have largely contributed to a better understanding of the emergence, spread and evolution of many RNA viruses, such as avian influenza virus (Lam et al. 2008, Cattoli et al. 2009) and Ebola virus (Gire et al. 2014). Since the outbreak of SARS-CoV-2, there have been different reports on the evolution of the virus from the SARS-CoV reported from bats, and studies are ongoing to check the variation in the spike protein of SARS-CoV-2. The aim of the present study is therefore to relate the genetic diversity of SARS-CoV-2 to its functional efficacy in binding the target cell, with special emphasis on the spike protein, by using the proteomics tool of molecular docking. The stability of the docked complex was also verified through NMA dynamic simulation focusing on the mobility of Cα atoms.

GIS mapping
In the present study, QGIS v2.18.26 was used to map and visualize the cases of SARS-CoV-2 infection reported globally up to 16.4.2020, as retrieved from the NCBI database. Exact coordinates were not used for plotting because the exact GPS coordinates of the cases that contributed to whole genome sequencing were not available to us. Each case is tagged with a different legend to show the wide spread of the Covid-19 outbreak, and an NCBI accession number is assigned to each legend (Fig. 1).

Phylogenetic analysis of spike protein of SARS-CoV-2
To study the evolutionary pathway of SARS-CoV-2, three phylogenetic trees were constructed. For the first phylogenetic analysis, the spike protein (surface glycoprotein) amino acid sequences of various SARS viruses were retrieved from the NCBI GenBank database (Table 1). The sequences were aligned using the Clustal W alignment program, and the tree was constructed using the neighbor-joining method. Bootstrapping was performed with 1000 replications in the MEGA 7 program (Kumar et al. 2016). The Tilapia lake virus hypothetical protein with GenBank accession number MN094791 was used as the outgroup.
The second phylogenetic tree was constructed to determine the evolutionary relationship of the most frequently reported human coronaviruses, viz. MERS, SARS-CoV and SARS-CoV-2. The S protein amino acid sequences were retrieved from the NCBI database and aligned. The evolutionary tree was generated using the neighbor-joining program of MEGA 7 (Kumar et al. 2016), and bootstrapping was performed with 1000 replications. The Tilapia lake virus hypothetical protein was used as the outgroup.
Since the WHO has declared SARS-CoV-2 infection a pandemic (https://www.who.int/emergencies/diseases/novelcoronavirus-2019), the third phylogenetic tree was constructed to check whether there is any variation among the S proteins reported globally. For the construction of this evolutionary tree, almost every representative spike protein sequence reported up to 16.4.2020 from various geographical locations, including Asia, Africa, North America, South America and Europe, was retrieved from the new NCBI Virus database (Table 2), which provides inclusive information on SARS-CoV-2 (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202,%20taxid:2697049&SeqType_s=Nucleotide).
The sequences were aligned, and the tree was constructed using the neighbor-joining program of MEGA 7 (Kumar et al. 2016) with bootstrapping at 1000 replications. The Tilapia lake virus hypothetical protein was used as the outgroup.
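The neighbor-joining method described above starts from a matrix of pairwise distances between aligned sequences. The following is a minimal sketch of that first step only, assuming the simple p-distance (the proportion of differing sites); the text does not state which distance model was chosen in MEGA 7, so the model, the function names `p_distance` and `distance_matrix`, and the toy sequences in the usage note are all illustrative assumptions rather than the authors' procedure.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    Assumption: p-distance is used only as the simplest illustration.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # ignore gap columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(aligned):
    """Symmetric distance matrix (dict of dicts) for a name -> sequence mapping."""
    names = list(aligned)
    matrix = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = p_distance(aligned[x], aligned[y])
        matrix[x][y] = d
        matrix[y][x] = d
    return matrix
```

For example, `distance_matrix({"a": "ACDE", "b": "ACDF"})` compares two hypothetical four-residue fragments; a tree-building program would then cluster taxa from such a matrix and repeat the procedure on resampled alignment columns to obtain bootstrap support.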
Molecular Docking between SARS-CoV-2 spike glycoprotein and ACE2 receptor
The protein structure of the SARS-CoV-2 spike glycoprotein (PDB ID: 6VXX) was retrieved from the PDB database. Molecular docking between the SARS-CoV-2 spike glycoprotein (PDB ID: 6VXX) and ACE2 (PDB ID: 108A) is a vital step in characterizing the molecular interaction and bonding pattern of the two proteins, and it is essential for assessing the binding affinity between the outer spike glycoprotein and the human ACE2 receptor. Molecular docking also helps to determine the cellular functions of proteins (Lavi et al. 2013). In the present study, the PatchDock docking server was used for the molecular docking analysis (Duhovny et al. 2002). The server uses a geometric complementarity based algorithm to perform the docking (Yadav et al. 2017). From this server we also obtained the interface area, geometric score and Atomic Contact Energy (ACE) of the docking complexes, along with the generated PDB files for analysis of the molecular interactions. RasMol v.2.7.4.2 was used to visualize the generated PDB file of the docking complex (Chen et al. 2009).

Molecular Dynamic Simulation
Normal mode analysis (NMA) is a powerful method for predicting the possible large-scale movements of a specified biomacromolecule (Prabhakar et al. 2016). This analysis has been used in a broad field of structural biology, for example to study conformational changes of proteins upon ligand binding, the opening and closing and structural stability of membrane channel proteins, the potential motions of the ribosome, and viral capsid maturation. The iMODS server was employed to represent the dynamic motion of the docking complex with the help of NMA (López-Blanco et al. 2014). This server reports several parameters of molecular structural dynamics, such as the deformability plot, B-factor, eigenvalue and covariance matrix. The deformability plot reveals the flexible (coiled) regions of the protein, while the B-factor shows atomic deformation. The eigenvalue measures the rigidity of the molecular motion. The covariance matrix indicates correlated atomic pairs by means of a specific color code.
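The relationship between eigenvalues, rigidity and B-factors mentioned above can be written out explicitly. The following is a sketch in standard NMA notation, not taken from the server documentation: the symbols are assumptions introduced here, with $\mathbf{H}$ the Hessian of the potential energy in unit-mass (or mass-weighted) coordinates, $\lambda_k$ and $\mathbf{u}_k$ the eigenvalue and eigenvector of the $k$-th non-trivial mode, and $\mathbf{u}_{k,i}$ the three components of $\mathbf{u}_k$ belonging to atom $i$.

```latex
\mathbf{H}\,\mathbf{u}_k = \lambda_k\,\mathbf{u}_k, \qquad
\langle \Delta r_i^2 \rangle = k_B T \sum_{k} \frac{\lVert \mathbf{u}_{k,i} \rVert^2}{\lambda_k}, \qquad
B_i = \frac{8\pi^2}{3}\,\langle \Delta r_i^2 \rangle
```

The sum runs over the non-trivial modes only, excluding the six zero-eigenvalue rigid-body modes. A small $\lambda_k$ therefore corresponds to a soft, large-amplitude motion, and a large $\lambda_k$ to a stiff one, which is why the eigenvalue can be read as a measure of rigidity.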

Homology modeling and molecular docking of Spike protein
To generate the 3D structure of the spike protein, homology modeling was carried out on the SWISS-MODEL server (Waterhouse et al. 2018). Amino acid sequences with the accession numbers MT049951, MT12098, MT007544, NC045512 and MT019532 were retrieved from the NCBI GenBank server. The spike protein structure with PDB ID 6VXX was used as the template for model generation. The generated models were validated through the QMEAN score (Benkert et al. 2011) and Ramachandran plot analysis.
To characterize the molecular interaction and bonding pattern of the modeled proteins, the generated 3D protein structures were docked with the human ACE2 receptor protein (PDB ID: 108A). The PatchDock docking server was used for the molecular docking analysis (Duhovny et al. 2002). The server also evaluates the interface area, geometric score and Atomic Contact Energy (ACE) of the docking complexes.

Results
Phylogenetic analysis of spike protein of SARS-CoV-2
The first phylogenetic tree represents the relationship of SARS-CoV-2 with other members of the coronavirus family. From the evolutionary relationship tree, it was found that the spike protein of SARS-CoV-2 is evolutionarily close to that of SARS-CoV, sharing the same node with the highest bootstrap value of 100 (Fig. 2). However, MERS was found to be evolutionarily closer to these two strains of coronavirus.
The second phylogenetic tree exhibits the evolutionary correlation between the most frequently reported human coronaviruses, viz. MERS, SARS-CoV and SARS-CoV-2. The tree shows that the spike proteins of MERS reported from various organisms form a single cluster and share some similarity with the SARS-CoV reported from bats. However, the recently reported SARS-CoV-2 from various infected humans was found to share the same node with the SARS-CoV reported from bats, with the highest bootstrap value (Fig. 3).
The third phylogenetic tree was constructed to check whether there is any variation in the spike protein of SARS-CoV-2 isolated from different patients globally. The tree shows that there is not much variation in the spike protein sequences and that they share the same node (Fig. 4). However, the multiple sequence alignment file (Supplementary, S1) shows a substitution of a single amino acid, asparagine (N) in place of tyrosine (Y), at position number 28.
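Substitutions such as the one noted in the alignment can be located by comparing each aligned sequence with a reference, column by column. The sketch below is a minimal illustration under stated assumptions: the function name `find_substitutions` is hypothetical, positions are 1-based alignment columns rather than residue numbers in any particular structure, and the short sequences in the usage note are toy strings, not data from this study.

```python
def find_substitutions(reference, query):
    """List amino acid substitutions between two aligned sequences.

    Returns (column, reference_residue, query_residue) tuples, where
    column is the 1-based position in the alignment. Columns containing
    a gap ('-') in either sequence are treated as insertions/deletions
    and are not reported as substitutions.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for column, (ref, alt) in enumerate(zip(reference, query), start=1):
        if ref == "-" or alt == "-":
            continue  # indel column, not a substitution
        if ref != alt:
            changes.append((column, ref, alt))
    return changes
```

With the toy inputs `find_substitutions("MFVYLL", "MFVNLL")`, the function reports a single Y to N change at alignment column 4. Running it over every sequence in a real alignment against one chosen reference would give a table of variant positions per isolate.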

Molecular Docking
The molecular interaction between the SARS-CoV-2 spike glycoprotein and the ACE2 protein was analyzed with the PatchDock web server. The top 20 docked complexes were obtained in PDB format with a clustering root-mean-square deviation (RMSD) value of 4.0. The output models were visualized with the previously described software, and the most acceptable model was selected on the basis of its Atomic Contact Energy (ACE) value, geometric shape complementarity score and complex interface area.
The docking complex with the lowest (most negative) ACE value was selected, since this indicates the most spontaneous interaction. The ACE value of the selected complex was -422.22, its geometric shape complementarity score was 15390 and its complex interface area was 2804.00. The selected docking complex is shown in Fig. 5.
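The selection rule used above (take the docked complex with the lowest ACE value) can be sketched in a few lines. This is an illustration only: the `DockingSolution` record and `select_lowest_ace` function are hypothetical names, this is not a parser for the server's actual output format, and the three candidate rows are placeholder numbers invented for the example, not results from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockingSolution:
    rank: int     # position in the server's result list
    score: int    # geometric shape complementarity score
    area: float   # interface area of the complex
    ace: float    # atomic contact energy


def select_lowest_ace(solutions):
    """Return the solution with the lowest (most negative) ACE value."""
    if not solutions:
        raise ValueError("no docking solutions supplied")
    return min(solutions, key=lambda s: s.ace)


if __name__ == "__main__":
    # Placeholder values for illustration only (hypothetical, not study data).
    candidates = [
        DockingSolution(rank=1, score=16000, area=2500.0, ace=120.5),
        DockingSolution(rank=2, score=15500, area=2700.0, ace=-300.0),
        DockingSolution(rank=3, score=15000, area=2100.0, ace=-35.0),
    ]
    print(select_lowest_ace(candidates))
```

Ranking by ACE alone follows the criterion stated in the text; a stricter workflow might also require a minimum shape complementarity score or interface area before accepting a model.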

Molecular Dynamic Simulation
Normal mode analysis (NMA) is a promising method for investigating the slowest motions in macromolecules, and it is useful for studying the stability of large docked complexes such as that of the SARS-CoV-2 spike protein and human ACE2. Deformation energies and eigenvalues indicate the energy associated with each normal mode and are inversely related to the amplitude of the dynamic motion. WEBnm@ calculated 14 normal mode indexes along with their deformation energies. We selected the mode with the lowest deformation energy, which indicates a mode with large rigid regions and therefore a good chance of describing domain motions (Fig. 6A). The deformation energy of the selected mode of the present complex was 1236.54. In the correlation matrix (Fig. 6B), anticorrelated, uncorrelated and correlated movements of the Cα atoms are represented by blue, white and orange colour gradients, respectively. The square of the fluctuation of each Cα atom, calculated for the lowest non-trivial modes, is presented as the eigenvalue plot (Fig. 6C). The fluctuations of the atomic displacements in the selected mode were inversely related to their corresponding eigenvalues. The normalized squared atomic displacement plot is shown in Fig. 6D.

Structure prediction by homology modeling
The 3D structures of the SARS-CoV-2 spike protein were predicted through homology modeling using the SARS-CoV-2 spike protein structure with PDB ID 6VXX as the template. The identified template was an electron microscopy structure of the spike protein with a resolution of 2.8 Å. The generated models are shown in Fig. 7A-E.

Structural quality assessment
The protein structures generated through homology modeling were validated with different protein structure validation tools.
The quality of the generated models was evaluated on the basis of their QMEAN scores and Ramachandran plots (Fig. 8A-E). The results of both analyses are shown in Table 3.

Molecular Docking
The molecular interaction between the modeled 3D structures of the SARS-CoV-2 spike glycoprotein and the ACE2 protein (PDB ID: 108A) was analyzed using the PatchDock web server. The top 20 docked complexes were obtained in PDB format with a clustering root-mean-square deviation (RMSD) value of 4.0. The output models were visualized with the previously described software, and the most acceptable model was selected on the basis of its Atomic Contact Energy (ACE) value, geometric shape complementarity score and complex interface area. The results of the molecular docking are shown in Table 4.

Discussion
Recent studies have also shown 79% similarity between the spike protein of SARS-CoV-2 and the spike protein of SARS-CoV reported from bats (Rhinolophus sinicus) (Phan 2020; Zhou et al. 2020). Ceraolo and Giorgi have also reported 96% homology between the R. affinis SARS-CoV spike protein and the SARS-CoV-2 spike protein (Ceraolo and Giorgi 2020). The third phylogenetic tree revealed that the globally reported spike proteins do not have much genetic variation and share the same branch.
However, multiple sequence analysis showed the deletion of a single nucleotide in the SARS-CoV-2 spike protein sequence reported from Kerala, India. Amino acid substitutions were also observed in the sequences submitted from Australia, Peru, Yunnan (China), Greece, Spain and South Africa. Phan has also reported three deletions in nucleotide sequences reported from Australia, the USA and Japan, as well as around 93 mutations in the SARS-CoV-2 genome (Phan 2020).
According to Robson (2020), a variation in a single amino acid may change the characteristics of a coronavirus strain, leading to the generation of a new strain. In the present study, five protein structures were generated using homology modeling.
Two structures were generated from the sequences from Wuhan and the USA, which had no variation, whereas three protein structures were generated from the sequences reported from Yunnan, Australia and India. These three sequences differed from the rest of the sequences. After docking with the human ACE2 receptor, it was clear that a variation in a single amino acid leads to a change in the binding affinity of the protein. The models generated from the Wuhan and USA sequences had the same Atomic Contact Energy, score and area, whereas the ACE value, area and score were different for the docked proteins reported from India, Yunnan and Australia. Phan also predicted that mutations in the spike glycoprotein can induce major conformational changes and referred to this protein as a major protein of interest. However, because the amino acid sequences were unavailable, he was not able to identify the changes (Phan 2020). The infection rate is increasing daily, and mutations are becoming a barrier to the development of therapeutics. Much more sequencing data are therefore needed to identify the different types of mutations and strains. The mapping carried out in our study will also be helpful for estimating the number of cases using geo-location and geospatial techniques. However, because the exact GPS coordinates were unavailable, we were not able to show the route of transmission from one place to another. In future, GPS coordinates will be crucial for studying the transmission route of this disease globally. Our study provides insight into the variation in the SARS-CoV-2 spike protein, and the findings will be helpful for future researchers in combating this pathogen.

Declarations
Conflict of Interest
The authors have declared no conflict of interest.

Figure: Phylogenetic relationship of the spike protein of globally reported SARS-CoV-2 responsible for human infection, inferred using the neighbor-joining method in MEGA 7 with 1000 bootstrap replications. The percentages indicate percentage identity. The TiLV hypothetical protein was used as the outgroup.

Figure, panels A-E: Molecular docking structures of the spike protein reported from various countries. The generated 3D structures were docked with the human ACE2 receptor molecule.