Epitope-based peptide vaccine design against spike protein (S) of novel coronavirus (2019-nCoV): an immunoinformatics approach

Recently, the global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has generated a significant need to identify drugs or vaccines that prevent or reduce clinical infection with coronavirus disease 2019 (COVID-19). In this study, immuno-informatics tools were used to design a potential multi-epitope vaccine against the SARS-CoV-2 spike S protein. Structural analysis of the SARS-CoV-2 spike S protein was also conducted. SARS-CoV-2 spike S protein sequences were retrieved from the GenBank database of the National Center for Biotechnology Information (NCBI). Immune Epitope Database (IEDB) tools were used to predict B and T cell epitopes, to evaluate their allergenicity, toxicity and cross-reactivity, and to calculate population coverage. The ProtParam server was used to characterize the spike protein and the predicted epitopes. Molecular docking of the proposed MHCI epitopes was also performed against Toll-like receptor 8 (TLR8) and the HLA-B7 allele.


Conclusion
The multi-epitope vaccine was predicted using bioinformatics tools, which may provide reliable results in a shorter time and at a lower cost. However, further in vivo and in vitro studies are required to validate its effectiveness.

Background
Recently, the World Health Organization announced the emergence of a new virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), as a major threat to human health because it causes a global pandemic of lower respiratory disease, initially known as Novel Coronavirus Pneumonia (NCP) by the Chinese government [1]. Situation report 96, published on 23 April 2020, reported more than 2 million confirmed cases of SARS-CoV-2 infection worldwide (2,544,792), of which 144,450 were in the Eastern Mediterranean Region, which includes Sudan [2]. The first cases of unexplained pneumonia were reported by the Health Commission of Hubei Province, China, in December 2019; later, on 9 January 2020, SARS-CoV-2 was officially identified as the cause of the COVID-19 outbreak in Wuhan, China [3,4].
Coronaviruses (CoVs) are members of the family Coronaviridae, enveloped viruses that possess extraordinarily large single-stranded RNA genomes ranging from 26 to 32 kilobases in length. SARS-CoV belongs to the betacoronaviruses, which infect mammals [4,5]. SARS-CoV-2 causes flu-like symptoms, such as persistent coughing, fever, shortness of breath, and difficulty breathing, which are similar to those of severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) [6].
Structurally, coronaviruses have two types of proteins: non-structural proteins, such as the proteases (nsp3 and nsp5) and RdRp (nsp12), and structural proteins, namely nucleocapsid (N), membrane glycoprotein (M), envelope (E), and spike (S). The spike protein is the part of the virus that binds to the cell receptor and facilitates viral entry, and it is the main target of neutralizing antibodies. Moreover, it is a trimeric protein present on the outer surface of the virus. The molecular weight of the spike protein is 180 kDa, and it contains two subunits, S1 and S2; priming requires cleavage into S1 and S2 by a cellular protease. These two subunits facilitate virus attachment and membrane fusion [7,8]. The spike S protein binds to a specific cell receptor, angiotensin-converting enzyme 2 (ACE2), and uses the cellular serine protease TMPRSS2 for S protein priming [9,10].
In the last decade, many vaccines have been proposed for SARS-CoV, including DNA vaccines, synthetic peptides and even in silico predicted peptides. However, the DNA and synthetic peptide vaccines elicited a positive humoral response but poor T cell immunogenicity, so they need an adjuvant [11][12][13].
No specific antiviral drugs or vaccines are available against the lethal disease caused by SARS-CoV-2. It has been reported that more than 85% of SARS-CoV-2 patients in China have received Traditional Chinese Medicine (TCM) treatment, with clinical evidence showing a beneficial effect of TCM in the treatment of these patients [4]. However, no approved vaccine has been designed for SARS-CoV-2, under circumstances in which protection against the virus is crucial, especially in African countries, which have poor economies, weak health systems, poor health-seeking behaviors and different cultural practices that will delay the detection of cases and of virus transmission [14,15].
In this study, the spike S protein of SARS-CoV-2 was used to predict peptides that can stimulate humoral and cellular immunity using various immunoinformatics tools, alongside structural analysis of the spike protein.

Protein Sequence Retrieval
Spike S protein sequences of SARS-CoV-2 virulent strains were retrieved in FASTA format from the GenBank database of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/protein/) in April 2020.

Multiple Sequence Alignment and Epitope Conservancy Assessment
The conserved regions across the spike S protein were identified using ClustalW in BioEdit software version 7.2.5 [16]. The epitope conservancy analysis tool of the Immune Epitope Database (IEDB) was used to assess the conservancy of potential epitopes (http://tools.iedb.org/conservancy/) [17].
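The two notions used in this step, conserved alignment columns and epitope conservancy, can be illustrated with a short sketch. It is a simplified stand-in and not the ClustalW, BioEdit or IEDB implementation: it only handles exact (100% identity) matches, and the function names and toy sequences are hypothetical.

```python
def conservancy(epitope, sequences):
    """Fraction of sequences that contain the epitope as an exact substring."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    hits = sum(1 for seq in sequences if epitope in seq)
    return hits / len(sequences)


def conserved_columns(aligned):
    """0-based alignment columns where all sequences share one residue (gaps excluded)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [
        i for i in range(length)
        if aligned[0][i] != "-" and all(seq[i] == aligned[0][i] for seq in aligned)
    ]


# Toy aligned fragments (hypothetical, not the retrieved spike sequences).
toy_alignment = ["MFVFLVLLPL", "MFVFLVLLPL", "MFVFLALLPL"]

print(conservancy("FVFLVLLPL", toy_alignment))
print(conserved_columns(toy_alignment))
```

In this toy case the third fragment carries a substitution, so the peptide is found in two of three sequences and the substituted column is reported as not conserved.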

Phylogeny Analysis:
The retrieved sequences were analyzed in MEGA7.0.26 (7170509) software using the maximum likelihood method to determine the evolutionary relationships between them [18].

Protein Structural Analysis
The reference sequence of the SARS-CoV-2 spike S protein was submitted to the ProtParam server to predict its physicochemical properties. The predicted characteristics included molecular weight, theoretical isoelectric point (pI), amino acid composition, total number of positive and negative residues, extinction coefficient, instability index, aliphatic index and grand average of hydropathicity (GRAVY) [19].
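Several of the quantities listed above follow from simple per-residue sums. The sketch below is a minimal illustration and not ProtParam itself: it assumes the Kyte-Doolittle hydropathy scale for GRAVY, standard average residue masses for the molecular weight and Ikai's formula for the aliphatic index, and the function names are hypothetical. The example peptide, QSAPH, is one of the linear epitopes reported in this study.

```python
# Kyte-Doolittle hydropathy scale (assumed basis of the GRAVY score).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average residue masses in Da (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free termini


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


def molecular_weight(seq):
    """Average molecular weight in Da of a linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def charged_residues(seq):
    """Return (negative, positive) counts, i.e. (Asp + Glu, Arg + Lys)."""
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


def aliphatic_index(seq):
    """Ikai aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)
    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


peptide = "QSAPH"
print(f"GRAVY: {gravy(peptide):.3f}")
print(f"Molecular weight: {molecular_weight(peptide):.2f} Da")
print(f"Charged residues (neg, pos): {charged_residues(peptide)}")
print(f"Aliphatic index: {aliphatic_index(peptide):.2f}")
```

For QSAPH the sketch gives a GRAVY of -1.460 and a mass of about 538.56 Da. The isoelectric point and instability index need additional parameter tables and are left out.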

Prediction of B and T Cell Epitopes
B and T cell epitopes were predicted from reference sequences of the spike S protein using the Immune Epitope Database (IEDB) (http://tools.iedb.org/mhci/) [29]. Prediction of B-cell antigenic epitopes is important in designing vaccine components and immuno-diagnostic reagents. Generally, B-cell antigenic epitopes are classified as either continuous or discontinuous. The majority of available epitope prediction methods focus on continuous epitopes, although discontinuous epitopes dominate most antigenic epitope families [30].
To predict the continuous epitopes, the BepiPred linear B-cell epitope prediction method was used [31]. The predicted peptides were then subjected to the Emini surface accessibility prediction tool and the Kolaskar and Tongaonkar antigenicity method to determine which epitopes are located on the surface and to score their antigenicity, respectively [32, 33].
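Linear prediction tools of this kind typically assign a score to each residue, and a candidate epitope is a contiguous run of residues scoring at or above a threshold. The sketch below shows only that run-extraction step; the scores, the threshold and the minimum length are hypothetical values chosen for illustration, not settings taken from BepiPred or from this study.

```python
def runs_above_threshold(scores, threshold, min_len=1):
    """Return (start, end) 1-based inclusive positions of runs scoring >= threshold."""
    runs = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = position  # a new run opens here
        elif start is not None:
            if position - start >= min_len:
                runs.append((start, position - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_len:
        runs.append((start, len(scores)))
    return runs


# Hypothetical per-residue scores for a 7-residue stretch.
example_scores = [0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.85]
print(runs_above_threshold(example_scores, threshold=0.5, min_len=2))
```

The same function can be reused for each filter in turn, for example keeping only the runs that also pass a surface accessibility or antigenicity cutoff.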
The prediction of discontinuous epitopes was carried out using the DiscoTope server [34]. The threshold was set at ≥ 0.5, which corresponds to 90% specificity and 23% sensitivity. This method is based on surface accessibility and amino acid statistics in a compiled dataset of discontinuous epitopes determined by X-ray crystallography of antigen-antibody protein complexes. The positions of the predicted epitope clusters on the 3D structure of the S protein were identified using Chimera [35].
The T cell epitopes were predicted for different alleles of major histocompatibility complex class I (MHCI) and class II (MHCII). Artificial neural network and NN-align methods were used to predict the binding of the proposed peptides to different MHC I and MHC II alleles, with a binding affinity (IC50) of less than or equal to 300 and 1000 for MHC I and II, respectively [36,37].
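The binding cutoffs described above amount to a simple filter over prediction records. The sketch below assumes the IC50 values are expressed in nM and uses a hypothetical record layout; the peptide names, allele names and IC50 values are placeholders invented for illustration, not results from this study.

```python
from dataclasses import dataclass

# Cutoffs stated in the text; the nM unit is an assumption.
IC50_CUTOFF_NM = {"I": 300.0, "II": 1000.0}


@dataclass(frozen=True)
class Prediction:
    peptide: str
    allele: str
    mhc_class: str  # "I" or "II"
    ic50_nm: float


def binders(predictions):
    """Keep predictions whose IC50 is at or below the cutoff for their MHC class."""
    return [p for p in predictions if p.ic50_nm <= IC50_CUTOFF_NM[p.mhc_class]]


def alleles_per_peptide(predictions):
    """Map each binding peptide to the set of alleles it is predicted to bind."""
    result = {}
    for p in binders(predictions):
        result.setdefault(p.peptide, set()).add(p.allele)
    return result


# Placeholder records (hypothetical values).
records = [
    Prediction("pep1", "allele-A", "I", 120.0),
    Prediction("pep1", "allele-B", "I", 450.0),   # above the MHC I cutoff
    Prediction("pep2", "allele-C", "II", 900.0),
    Prediction("pep2", "allele-D", "II", 1000.0),  # equal to the cutoff, kept
]

print(alleles_per_peptide(records))
```

Counting the alleles in each set gives the kind of "number of interacting alleles" summary used later to rank epitopes.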

Prediction of Antigenicity, Allergenicity and Toxicity for Proposed Epitopes:
The proposed epitopes were also submitted to the VaxiJen v2.0 server to determine their antigenicity [38]. The AllerTop server was used to assess allergenicity, while the ToxinPred server was used to estimate the safety of the selected epitopes [39,40].

Analysis for the Sequence Similarity with the Human Self-Epitopes:
To assess the possibility that epitopes derived from the spike S protein could trigger autoimmune disease, the selected epitopes were searched against the non-redundant human protein sequences [taxid: 9606] using the NCBI BLASTp suite with default parameters (http://www.ncbi.nlm.nih.gov/BLAST/).

Population Coverage:
The Immune Epitope Database (IEDB) was also used to calculate the population coverage of the proposed MHCI and MHCII epitopes against the whole world population [41].
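The idea behind population coverage can be sketched with a deliberately simplified model. This is not the IEDB algorithm: it assumes Hardy-Weinberg proportions within each HLA locus and independence between loci, and the allele frequencies below are hypothetical. Under those assumptions, an individual is covered if at least one of their alleles at any locus is bound by at least one epitope.

```python
def locus_coverage(covered_allele_freqs):
    """Fraction of individuals carrying at least one covered allele at a locus.

    With F the summed frequency of covered alleles, a diploid individual
    carries none of them with probability (1 - F)^2.
    """
    total = sum(covered_allele_freqs)
    if total < 0.0 or total > 1.0 + 1e-9:
        raise ValueError("summed allele frequency must lie in [0, 1]")
    total = min(total, 1.0)
    return 1.0 - (1.0 - total) ** 2


def population_coverage(loci):
    """Combine per-locus coverage, assuming the loci are independent."""
    uncovered = 1.0
    for freqs in loci.values():
        uncovered *= 1.0 - locus_coverage(freqs)
    return 1.0 - uncovered


# Hypothetical frequencies of epitope-binding alleles at two loci.
example_loci = {"locus-A": [0.2, 0.1], "locus-B": [0.5]}
print(f"{100 * population_coverage(example_loci):.2f}%")
```

Adding epitopes that bind further alleles raises the summed frequency at a locus, which is why epitopes interacting with many alleles contribute most to coverage.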

Retrieved Sequence Information:
Eight spike S protein sequences were retrieved from NCBI; their accession numbers, areas and dates of collection are shown in Table 1. All sequences are from China.

Multiple Sequence alignment and Epitopes Conservancy Assessment:
Multiple sequence alignment of the retrieved sequences, performed using ClustalW in BioEdit software, showed high conservancy between the aligned sequences. The conserved regions were identified by the identity and similarity of the amino acid sequences (Fig. 1).

Phylogeny Analysis
Evolutionary analyses were conducted in MEGA7.0.26 (7170509) software using the maximum likelihood method. The analysis involved 8 amino acid sequences. All positions containing gaps and missing data were eliminated. There were a total of 1273 positions in the final dataset [18,54].

Structural Analysis:
The physicochemical properties of the spike S protein calculated by the ProtParam server revealed that it contains 1273 amino acids (aa) with a molecular weight of 141178.47 Da (about 141.18 kDa), which reflects a good antigenic nature. The theoretical isoelectric point (pI) was 6.24, which indicates that the protein is negatively charged: a pI below 7 denotes a negatively charged protein, and consistently the total number of negatively charged residues (Asp + Glu) was 110, against 103 positively charged residues (Arg + Lys). ProtParam computed an instability index (II) of 33.01, which categorizes the spike S protein as stable. The aliphatic index was 84.67, which reflects the relative volume occupied by aliphatic side chains, and the GRAVY value for the protein sequence is 0.012 (grand average of hydropathicity, GRAVY: -0.079).
The half-life of a protein, described as the total time taken for it to disappear after it has been synthesized in the cell, was computed as 30 hours (h) for mammalian reticulocytes, > 20 h for yeast and > 10 h for Escherichia coli. The secondary structure composition predicted by the GOR IV server (https://npsa-prabi.ibcp.fr/cgibin/secpred_sopma.pl) comprised alpha helix (28.59%), beta turn (3.38%) and random coil (44.78%), as shown in Fig. 3. The ubiquitination sites of the spike S protein were predicted via the UbPred server (Fig. 4), which showed six low-confidence sites (grey color) at positions 182, 776, 811, 947, 1255 and 1266. Moreover, the average hydrophobicity predicted by the SOSUI server was -0.079183. This server also predicted two transmembrane regions, as shown in Table 2.
The DiANNA1.1 tool predicted 20 disulphide bond (S-S) positions and assigned each a score; its predictions are based on a trained neural network (see Additional file 1: Table S1).
The Pfam server predicted 19 conserved domains (E-value cut-off of 1.0) in the spike S protein. The closest homologue obtained from the BLASTP results was severe acute respiratory syndrome-related coronavirus (75.96%), with an E-value of 0.00, followed by Bat coronavirus BM48-31/BGR/2008 (71.96%); see Table 3 and Figs. 5 and 6.

Proposed B Cell Epitopes:
In B cell prediction, thirty-two conserved epitopes were predicted using the BepiPred linear epitope prediction method. Among them, only five epitopes passed the Emini surface accessibility prediction tool and the Kolaskar and Tongaonkar antigenicity method. These epitopes were 110 LDSK 113, 634 RVYST 638, 1054 QSAPH 1058, 1086 KAHFP 1090 and 1137 VYDPLQPELDSF 1148. Among these, only one epitope, 1054 QSAPH 1058, was found to be non-toxic and non-allergenic when investigated with the AllerTop and ToxinPred servers (Table 4).
Unfortunately, when the promising B cell epitope was submitted to the ProtParam server to determine its physicochemical properties, it was found to be unstable. Its molecular weight is 538.56 Da and its GRAVY value is -1.460.
The DiscoTope 2.0 server, which calculates surface accessibility in terms of residue contact numbers together with a novel amino acid propensity score, was used to predict the discontinuous epitopes. The 3D structure of the S protein (PDB ID: 6VSB) [57] was used for discontinuous epitope prediction, with 90% specificity, a threshold of −3.700 and a propensity score radius of 22.000 Angstroms. In total, 45 discontinuous epitopes were identified at different exposed surface areas (Table 5). The position of each predicted epitope on the surface of the 3D structure of the S protein, shown in Fig. 7, was visualized using the Chimera tool [35].
In MHCII prediction, many core sequences were predicted to interact with large numbers of alleles and to have high antigenicity scores. The core 898 FAMQMAYRF 906, which was also predicted by the MHCI methods, interacted with 101 MHCII alleles. The epitopes 888 FGAGAALQI 896 and 342 FNATRFASV 350 also interacted with 83 and 65 MHCII alleles, respectively (Additional file 2: Table S2).

Antigenicity, Allergenicity, Toxicity of MHCI and MHCII Epitopes:
The predicted MHCI and MHCII epitopes were submitted to the VaxiJen v2.0, AllerTop v2.0 and ToxinPred servers to predict their antigenicity, allergenicity and toxicity, respectively. All predicted MHCI and II epitopes were antigenic, with the epitopes 1060 VVFLHVTYV 1068 and 898 FAMQMAYRF 906 displaying the highest scores (1.5122 and 1.0278, respectively). The epitopes were also free of allergenicity and toxicity (see Table 6 and Additional file 2: Table S2).

Cross Reactivity with Human Epitopes:
Among all selected epitopes, only one, 1209 YIKWPWYIW 1217, which is shared between MHC I and MHC II, was found to have a putative conserved domain identical to a human peptide. Therefore, it was removed from the epitope pool to avoid triggering an autoimmune response.

Predicted Physicochemical Properties
The proposed epitopes for both MHCI and II were further submitted to the ProtParam server to determine their physicochemical properties. All predicted epitopes were stable except 718 FTISVTTEI 726 (see Tables 7 and 8 and Figs. 8 and 9).

Population Coverage:
The proposed MHCI epitopes showed 95.74% coverage of the whole world population, while the proposed MHCII epitopes showed only 78.09% coverage (Table 9).

Discussion
Recently, the World Health Organization announced the emergence of the new SARS-CoV-2 virus as a major risk to human health because it causes a global pandemic of lower respiratory disease, initially known as Novel Coronavirus Pneumonia (NCP) by the Chinese government [1]. The recent global pandemic has placed a high priority on identifying drugs or vaccines to prevent or lessen clinical infection with coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1,58]. This study therefore focused on the in silico design and development of a potential multi-epitope vaccine against the SARS-CoV-2 spike protein.
In the present study, calculation of the physicochemical properties of the spike S protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) using ProtParam revealed that the protein has good antigenic properties, is negatively charged and is stable [19].
The identification of B cell epitopes is important in immuno-detection and immunotherapy applications, since an epitope is the minimal immune unit that is strong enough to stimulate a strong humoral immune response with no harmful side effects to the human body [34].
In B cell prediction, five conserved epitopes were identified as linear, surface-exposed and antigenic based, sequentially, on the BepiPred linear prediction method and the Emini and Kolaskar and Tongaonkar tools. Only one epitope (1054 QSAPH 1058) was identified as non-allergenic using AllerTop v2.0 and non-toxic using ToxinPred. It was also free from provoking an autoimmune response; however, it was found to be unstable when analyzed by the ProtParam server.
Discontinuous epitopes are more specific and have more dominant attributes than linear epitopes [59]. The 3D structure of the S protein was used for discontinuous epitope prediction with the DiscoTope 2.0 server, which uses a combination of amino acid statistics, spatial information and surface exposure [60]. In this study, a total of forty-five conserved discontinuous epitopes were identified at different exposed surface areas. These epitopes may have a principal role in humoral immunity; indeed, it has been estimated that > 90% of B-cell epitopes are discontinuous, i.e., consist of segments that are distantly separated in the pathogen protein sequence and brought into proximity by the folding of the protein [60].
The MHCII prediction methods predicted many epitopes, such as 898 FAMQMAYRF 906, 888 FGAGAALQI 896 and 342 FNATRFASV 350, which interacted with large numbers of HLA alleles and showed high antigenicity and safety. Notably, the 898 FAMQMAYRF 906 epitope predicted for MHCI was also expected to interact with a large number of MHCII alleles. In a similar in silico study, five CTL epitopes, three sequential B cell epitopes and five discontinuous B cell epitopes were predicted from the viral surface glycoprotein of SARS-CoV-2 [61].
Analysis of the physicochemical properties of the MHCI and II epitopes using the ProtParam server indicated that all epitopes were predicted to be stable except 718 FTISVTTEI 726. According to the server threshold, an instability index below 40 is indicative of protein stability, and a lower value indicates a more stable protein [62].
The molecular weights of the epitopes differed slightly, ranging from 846.98 to 1164. GRAVY values also differed. GRAVY is a measure of the hydrophobicity or hydrophilicity of a structure. The GRAVY value was positive for all structures, representing their slightly hydrophobic nature, except for 202 KIYSKHTPI 210, which showed a negative GRAVY (hydrophilic) [62]. The theoretical pI values of the epitopes also varied, ranging from 4.00 to 9.70. In a vaccine designed for injection, a pI close to the pH of normal blood and body fluids, or to neutral pH, is preferred [63].
The secondary structure predicted by the GOR IV server indicated that the spike protein consisted of alpha helix (28.59%), beta turn (3.38%) and random coil (44.78%). The UbPred server predicted six low-confidence ubiquitination sites at positions 182, 776, 811, 947, 1255 and 1266. Moreover, the SOSUI server predicted two transmembrane regions, while the DiANNA1.1 tool predicted 20 disulphide bond (S-S) positions in the SARS-CoV-2 spike protein.
Corona-S2 and Spike_rec_bind were identified as the main motifs in the spike S protein. They were also detected by a Conserved Domain Database (CDD) BLAST search [55]. To evaluate the potential immune interaction between TLR8 and the 3D structures of the predicted MHCI peptides, a protein-ligand docking analysis was performed. The 1060 VVFLHVTYV 1068 epitope interacted strongly with TLR8, as indicated by the lowest global energy (−84.58), followed by 2 FVFLVLLPL 10 (global energy −64.23) (Table 10 and Fig. 10). In addition, docking with HLA-B7 showed a strong association for all epitopes (see Table 11 and Fig. 11). Here, 2 FVFLVLLPL 10 produced the lowest global energy (−78.81), which indicates the strongest binding affinity in comparison with the other epitopes, followed by 1060 VVFLHVTYV 1068 (global energy −63.20).
Furthermore, the proposed MHCI epitopes showed high coverage (95.74%) of the whole world population, whereas the MHCII epitopes showed only 78.09% coverage.

Conclusion
This study used various immuno-informatics tools to design a potential multi-epitope vaccine comprising B-cell and T-cell (HTL and CTL) epitopes.
Immuno-informatics analyses of the spike S protein generated a candidate vaccine that contains a number of high-affinity MHCI and II epitopes and linear and conformational B-cell epitopes that lack allergenicity, toxicity and autoimmune properties, which supports their potential as vaccine candidates. The effectiveness of the designed vaccine should be further confirmed in wet-lab experiments.