Structural modeling and conserved epitopes prediction against SARS-COV-2 structural proteins for vaccine development

Background : Coronavirus disease 2019 (COVID-19) caused by Severe Acute Respiratory Syndrome Corona (SARS-COV-2) was first Little is known about this new virus and it has the potential to cause severe illness and pneumonia in some people, therefore the development of an effective vaccine is highly desired. Methods : Immunoinformatics and statistical approaches were used in this study to forecast B- and T-cell epitopes for the SARS-COV-2 structural proteins (Surface glycoprotein, Envelope protein, and Membrane glycoprotein) that may play a key role in eliciting immune response against COVID-19. Different types of B cell epitopes (linear as well as discontinuous) and T cell (MHC class I and MHC class II) were determined. Moreover, their antigenicity and allergenicity were also estimated. Results : The antigenic B-cell epitopes exposed to the outer surface were screened out and 23 linear B cell epitopes were selected. “ SPTKLNDLCFTNVY ” had the highest antigenicity score among B cell epitopes. The T-cell epitopes bound to multiple alleles, antigenic, non-allergen, non-toxic, and conserved in the protein sequence were shortlisted. In total, 16 epitopes (9 from MHC class I and 7 from MHC class II) were selected. Among the T-cell epitopes, MHC class I ( IPFAMQMAYRFN ) and MHC class II ( VTLACFVLAAVYRIN ) were classified as strongly antigenic. Digestion analysis verified the safety and stability of the peptides predicted during this study. Furthermore, docking analyses of predicted peptides showed significant interactions with the HLA-B7 allele. Conclusion : The putative antigen epitopes identified in this study may serve as vaccine candidates and can help to eliminate/control growing health threat of COVID-19.


Background
Viruses have the potential to become dangerous life threat and cause irreparable loss to human beings. Hardly the world learns to cope with one strain of virus when another emerges and poses a threat to the future of humanity. A similar situation has emerged when a new strain of novel coronavirus (CoV) that has not been previously identified in humans is reported last year late December [1]. Coronaviruses are the largest among RNA viruses belonging to Coronaviridae, Roniviridae and Arteriviridae families. Coronaviridae are unsegmented, 3′ polyadenylated and 5′ capped positive sense single-stranded RNA viruses cause various respiratory diseases in humans [2].
CoV is classified into 4 classes; alpha, beta, delta, and gamma. Amongst them, beta and alpha CoVs have been reported for infecting humans [3]. CoV continuously changes its strains making it a herculean task to develop a vaccine. Recent CoV strain has received tremendous attention from researchers, as it causes severe respiratory complications similar to its other family members. [4].
According to the WHO reports, China was intimated about the cases of pneumonia having unknown etiology in Wuhan on December 31, 2019 [1,5]. Later, SARS-COV-2 was identified as the causative agent of this outbreak by the CCDCP (Chinese Center for Disease Control and Prevention ) [6].
Currently, there is a paucity of research regarding its transmission among people. Initial diagnostic procedures indicate that the SARS-COV-2 is primarily spread through respiratory droplets from sneezing/coughing, body contact and to some extent through fecal contact [7]. The SARS-COV-2 can exhibit symptoms in less than 2 days or up to 14 days after exposure. Symptoms of patients infected with COVID-19 include fever, runny nose, cough, and dyspnea [8]. Although the entire genome sequence of the virus has been published, the origin and proliferation mechanism of the new coronavirus is still ambiguous as stated by the WHO [9]. The study of genome sequences has cast a shadow that SARS-CoV-2 is closely related to the SARS-CoV which is the causative agent of the SARS outbreak in humans in 2002/03 [10].
The SARS-COV-2 outbreak initially was related to the Huanan seafood market in the South China region. However, major gaps in knowledge of the origin, human transmission period, clinical spectrum, and epidemiology of COVID-19 need fulfillment by future studies. Some patients diagnosed with disease revealed that they had searched or wandered the market [11]. From shellfish to snakes, birds and other small mammals including marmots, the market is said to be the selling point of a variety of animals that are purported to be the cause of dissemination of the virus. WHO reported that environmental samples of the market were positive for the SARS-CoV-2. However, no specific animal relationships were sorted out yet [12]. Initial reports claimed that snakes could be a possible source due to codon usage [13], but the claim is under debate and needs substantial research to prove it [6].
Researchers are currently working to sort out the SARS-COV-2 source, including possible intermediate animal vectors.
The samples taken from a respiratory system-throat swab or lung fluid-are helpful in diagnosing its infection in patients. Unfortunately, there is no vaccine or therapy available to cure this disease [14].
Therefore, an effective remedy to cope with this disease is direly needed to contain this rapidly spreading menace. For this purpose, immunoinformatics approaches can be applied to examine viral antigens, prediction of its epitopes and assessment of its immunogenicity [15]. Moreover, this approach could be both time and cost-effective [2,16,17]. Excessive respiratory infection can also resolve with T cell reactions and antibodies [18]. Furthermore, rapid identification, isolation, disease prevention, and control measures are required to hinder its spread of SARS-COV-2 at homes, communities and healthcare units [19,20]. In various studies, therapeutic approaches against the Ebola virus, Zika virus and MERS-CoV were developed using in silico peptide prediction [2,17,21].
The purpose of this study was to pinpoint the potential T-cell and B-cell epitopes from SARS-COV-2 structural proteins which would help in the future to develop its vaccine.

Methods
Flow chart of methodology used in present study is graphically represented in Additional file 1: Figure   S1. Complete details about software/resources and their availability URLs enlisted in Additional file 2. Table S1.

Data retrieval
Three proteins, Surface glycoprotein (S), Envelope protein (E) and Membrane glycoprotein (M) of SARS-COV-2 were taken as targets. Their amino acid sequences were collected in fasta format from Genbank [22].

Structural analysis
The Expasy ProtParam tool was used to evaluate the physical and chemical properties of target proteins [23]. Protein's secondary structure was analyzed through PSIPRED. The transmembrane topology of the proteins was checked through the TMHMM tool. DIANNA v1.1, an online tool, checked the presence of disulphide-bonds in proteins [24]. It makes predictions based on the neural trained system. Allergenicity and antigenicity of proteins were checked through AllerTOP v2.0 and VaxiJen v2.0 respectively [25,26].

3D structure prediction and validation
3D structures of SARS-COV-2 proteins (S, E, and M) were predicted using homology modeling approach. PDB database was used to find appropriate templates. PDB homologous proteins were scanned by PSI-blast [27]. The initial alignment between template and target was created using the ALIGN2D tool. The 3D structures were built in MODELLER v.9.12 using a restrained-based approach [28]. These models were then visualized by Chimera [29]. Galaxy refines server and ModRefiner was used to refine the retrieved models [30,31]. Besides, the refined structure needs to be validated based on experimentally validated 3D structure of proteins. Refined structures were therefore applied in the PROSA web providing a quality score for a given structure [32]. The quality score beyond the usual range of native proteins indicates a possible error in protein structure. Ramachandran plot was created by rampage server where the principle of PROCHECK is applied to validate the protein structure [33].

Prediction of B cell epitopes
The epitopes of B cells help to detect viral infections in the immune system. ABCpred was used to forecast 14-mer B cell epitopes for target proteins at 0.51 threshold [34]. Epitopes evident on the outer surface were picked, and other intracellular epitopes were removed. The vaxijen server tested the antigenicity of the selected epitopes at a threshold of 0.5. B cell epitope identification was based upon antigenicity, flexibility, linear epitope predictions, hydrophilicity, and surface accessibility [35].
Parker hydrophilicity prediction algorithms, Emini surface accessibility prediction method, Kolaskar and Tongaonkar antigenicity scale, and Karplus and Schulz flexibility prediction tool were used to perform hydrophilicity, accessibility of surface, antigenicity and flexibility analysis respectively [36].
As discontinuous epitopes become more evident and have higher dominant properties than linear epitopes, DiscoTop 2.0 server was used to forecast discontinuous epitopes from 3D structures of surface glycoprotein, membrane protein and envelope protein [37]. The position of epitopes on 3D structures of proteins was visualized by Pymol [38].

Prediction of T cell epitopes
In vaccine designation, T cell epitopes play a crucial role. More specifically, it reduces the cost and time compared with laboratory experiments [39]. IEDB consensus method was used to predict 12 mer MHC class I and 15 mer MHC class II epitopes. The results of this method are very important due to a large number of HLA alleles used in the calculation. The sequence was given in a FASTA format and all the alleles were selected for prediction. Epitopes with less than 2 consensus score believed to be good binders and chosen for further research.

Epitopes conservation analysis
The degree of conservation of predicted T cell and B cell epitopes within the protein sequence was analyzed by IEDB conservancy analysis tool. Epitopes having 100% conservancy were shortlisted [40].

Population coverage analysis
The expression and distribution of HLA alleles vary depending on the world's ethnicities and regions, thereby impacting the effective production of an epitope-based vaccine [41]. The population coverage was calculated using the IEDB population coverage tool and for this purpose selected MHC class I and MHC class II epitopes and corresponding HLA-binding alleles were considered. This tool estimates population coverage for each epitope for various regions of the world based on the distribution of human alleles that bind to MHC [42]. However, this study will highlight areas of particular importance concerning our pathogen.

Properties evaluation of peptides
Antigenicity and allergenicity of the peptides were checked by Vaxijen v 2.0 and Allergen Fp 1.0 respectively [43]. The threshold of VaxiJen was set at 0.5. Physiochemical properties of the peptides like Theoretical pI, the molecular weight was evaluated by Expassy Protparam tool. Protein digesting enzyme server was used to predict peptide digester enzymes. ToxinPred is used for Non-toxic/toxic peptide prediction. Non-toxic peptides were selected for further analysis [44].

Modeling of peptides and docking with HLA allele
The 3D structures of nine MHC class I epitopes were predicted using PEPFOLD. Five models for each peptide were created and the best model for each peptide was picked [45]. The crystallographic X-ray structure of the commonly found HLA allele in the human population i-e HLA-B7 allele was obtained from PDB (PDB ID: 3VCL). Molecular docking was performed using similar protocol of our already published studies [1,2,16,46,47].  Table S2). Physiochemical properties evaluated by protparam showed that all proteins are stable. S protein was found to be acidic and hydrophilic, while E and M protein was found to be basic and hydrophobic (Table 1). TMHMM carried out a prediction of the transmembrane topology of the target proteins displaying one transmembrane helix in S and E protein while three transmembrane helices in M protein. Residues of the S protein from 1-1213 were on the surface while residues from 1214-1236 were within the transmembrane region and residues from 1237-1273 were buried within the core region of S protein. Similarly, in E protein residues from 1-11 were exposed on the surface and residues from 12-34 were inside the transmembrane region and residues from 35-75 were buried inside the core region of E protein. In M protein residues from 1-19 and 74-77 were exposed on the surface, while residues from 20-39, 51-73, and 78-100 were inside the transmembrane region and residues from 40-50 and 101-222 were buried inside the core region.

Sequence retrieval and structural analysis of the target proteins
Secondary structure details of the target proteins predicted by PSIPRED are mentioned in Additional File 4: Table S3.

Homology modeling and validation
To predict structure-based epitopes of SARS-COV-2 proteins (S, E, and M), homology modeling of the proteins was performed. Chain A, Spike glycoprotein of SARS-COV (PDB ID: 6ACC) and Chain A, Envelope small membrane protein of SARS-COV (PDB ID: 5X29) were found to be the best template based on E-value, percent identity, and query coverage for S and E protein respectively. S protein was found to have a 75% identity with a spike protein of SARS-COV (6ACC). E protein had 88.71% identity with Chain A, Envelope small membrane protein of SARS-COV. No suitable template was found for M protein, so its structure was predicted by Raptor X [48]. The amino acid sequence was provided as an input that yielded to have two domains and the best template used was model 5yckA.
The P-value for the modeled structure was calculated at 9.14× 10−5. All 222 (100%) amino acids were modeled and 27 (12%) positions were predicted as disordered in the model. Visualization of the models was done by Chimera (Additional file 1: Figure S2). Structures were refined by Galaxy refine server and ModRefiner. The quality factor (z-score) and Ramachandran plot values of refined models are mentioned in Additional File 5: Table S4 (Additional file 1: Figure S3-S5).

B cell epitopes prediction
ABCpred was used to predict the linear B cell epitopes of the target proteins and the TMHMM server was used to check the surface availability. Antigenicity was checked by vaxijen. Epitopes that were exposed on the surface, antigenic, and 100% conserved in the protein were selected from all the predicted epitopes. Total 23 epitopes (S-19, E-1, and M-3) were selected based on these criteria.
Among the chosen epitopes, 'SPTKLNDLCFTNVY' of S protein showed the highest antigenicity (1.69) and predicted score ( Table 2). The position of epitopes on their respective protein structure was visualized by Pymol ( Figure 1).
Besides, potential B cell surface accessibility needs to be examined. By analyzing the physiochemical properties of amino acids and their concentration in previously identified B cell epitopes, the Kolaskar and Tongaonkar antigenicity tools evaluated target proteins for predicting B cell epitopes. The estimation threshold was set at 1.045, while the window size was maintained 7. It predicted the protein antigenic propensity value of S protein as 1.041 (Average), 0.866 (Minimum) and 1.261 (Maximum) (Additional file 1: Figure S6 Figure S8-A). Chou and Fasman beta-turn analyzing algorithm was used to predict beta-turn in target proteins since beta-turn in nature is exposed to the surface and hydrophilic and plays a vital role in beginning the defensive response. The threshold value was adjusted at 1.009, it calculated values of 0.097 (average), 0.541 (minimum) and 1.484 (maximum) in S protein (Additional file 1: Figure S6-B), 0.883 (average), 0.554 (minimum) and 1.264 (maximum) in E protein (Additional file 1: Figure S7-B), and 0.915 (average), 0.600 (minimum) and 1.384 (maximum) in M protein (Additional file 1: Figure S8 To further improve the specificity and variety of B-cell epitopes, Discotop 2.0 server was used to calculate surface abundance concerning residual contact number and use the novel amino acid score to forecast discontinuous epitopes. 3D structures of the target proteins were used to predict discontinuous epitopes; 90% specificity, − 3.700 thresholds and 22.000 Angstroms propensity score radius. 138 discontinuous epitopes of S protein, 1 epitope of the E protein and 22 epitopes of M protein were calculated ( Table 3). The position of epitopes on their respective protein structure was visualized by Pymol ( Figure 2).

T cell epitopes prediction and properties evaluation
T cell (MHC class I & MHC class II) epitopes of target proteins were predicted by the IEDB consensus method. Peptides that can bind to multiple alleles are considered the most appropriate peptides due to their strong defensive ability. Their antigenicity and allergenicity were checked by vaxijen and Allergen FP 1.0. Their conservation in protein sequences was analyzed by the IEDB conservancy analysis tool. Epitopes that are bound to multiple alleles, highly antigenic, non-allergenic and 100%  (Table 4).
Protein digesting enzyme server was used to estimate peptides digesting enzymes. Enzymes that do not digest peptide into fragments is considered non-digestive enzyme. Peptides digestible with many enzymes are not stable. Less enzyme digested peptides, on the other hand, are very stable and favored vaccine candidates (Table 5).

World population coverage
Population coverage was calculated with finally selected nine MHC class I and seven MHC class II epitopes and related HLA alleles. Selected MHC class I and MHC class II epitopes represented 88.23% and 66.66% of the world's population. Highest coverage of MHC class I epitopes found in the population of two countries Finland (96.16%) and Italy (96.38%). Similarly, the highest coverage of MHC class II epitopes was found within France (79.44%). In China where SARS-COV-2 was first identified, MHC class I epitopes showed 63.26% population coverage and MHC class II showed 41.17% coverage. Also, the population coverage was higher in regions where SARS-COV-2 cases have been reported ( Figure 3).

Molecular docking analysis
The 3D structures of selected nine MHC class I epitopes were predicted using PEPFOLD (Additional file 1: Figure S9). Molecular docking is important to understand the patterns of protein-peptides interaction. To analyze the interaction of HLA allele with selected MHC class I epitopes, molecular docking was performed. Docking results revealed all the selected peptides binds inside the receptor binding site of HLA allele with strong interactions (Figure 4).

Discussion
CoVs have long been considered as insignificant pathogens causing "colds" in humans. Here, we explored the development of epitope-based vaccines targeting the structural proteins of the SARS-COV-2. T-and B-cell epitopes of the target proteins were predicted to support the host's immune response. Research was performed at primary, secondary and tertiary structural levels of proteins. IEDB analysis resource and ABCPred predicted B-cell conserved epitopes. Certain IEDB techniques were utilized to evaluate antigenicity, solvent accessibility, disulphide bonds, and flexibility. The 'SPTKLNDLCFTNVY' produced a higher immunogenicity score (1.6982) and could represent a potential B cell epitope and candidate for the vaccine. In addition, antigenic T cell determinants with potential to bind MHC class I and MHC class II have been predicted by the IEDB consensus approach. MHC-I (IPFAMQMAYRFN) and MHC-II (VTLACFVLAAVYRIN) epitopes associate with many HLA alleles and are strongly antigenic. The position of epitopes on 3D structures of proteins was visualized by Pymol. DiscoTop server was used to predict discontinuous epitopes. To further improve specificity and selectivity, allergenicity, toxicity and physiochemical properties of predicted epitopes were checked. Digestion analysis verified that the peptides predicted during the analysis were stable and safe to use. For future studies the predicted epitopes should be tested for therapeutic potential.
The rapid development of structural and genomic databases combined with computational tools, helps in the design and discovery of new vaccine candidates. COVID-19 infection is a severe problem of morbidity and mortality worldwide. Unfortunately, the unavailability of the vaccinations against COVID-19 has impacted several precious lives, in different regions of the world. To successfully eradicate the disease, researchers have been trying to collect data associated with COVs to understand its transmission, pathophysiology and biology. Our analysis will help to develop potential peptide vaccines to eradicate the COVID-19 and help to combat against SARS-CoV-2.

Conclusions
A reverse vaccinology technique for classifying surface-exposed antigens was used in this research, rather than focusing on the whole pathogen, a less successful approach. T and B cell epitopes were identified through sequence, structure and conservational analysis. Due to their prediction methods, yields, speeds, and low costs, B and T cell epitopes have become the focal point of immunoinformatics studies. A preliminary sequence of epitopes for future SARS-CoV-2 vaccine development could be seen in this study, which can help manage this increased health hazard. The

Competing interests
All authors have no competing interests.