Data Retrieval
The genome sequence of SARS-CoV-2 isolate Wuhan-Hu-1 is retrieved from the NCBI database with accession number MN90894734. The protein sequences are retrieved according to their translation. Especially, the spike protein (protein ID: QHD43416.1) has a length of 1273 amino acids (aa), and the receptor binding domain (RBD) is from 347 to 520aa19. The following experiments are mainly focused on the spike protein area.
DeepVacPred Vaccine Subunits Prediction
All the overlapping protein fragments with a length of 30aa are generated out of the 1273aa SARs-CoV-2 spike protein sequence. Our DeepVacPred first tests these 1244 30aa protein sequences and predicts 130 potential vaccine subunits (See Table 4). Our DeepVacPred further predicts the T-cell epitopes at these locations and discards the subunits which have less than 8 T-cell epitopes35. After this prediction, our DeepVacPred provides us with 26 potential vaccine subunits for further evaluation and construction (See Table 5). These subunits are very likely to contain B-cell epitopes and multiple T-cell epitopes. They are also very likely to have high antigenicity and low allergenicity. We start the following in-silico vaccine design process directly from the predicted 26 vaccine subunits, which is very efficient.
Linear B-cell Epitopes Prediction
B-cell epitopes are portions of antigens binding to immunoglobulin or antibody to trigger the B-cell to provide immune response36. Linear B-cell epitopes are predicted on the 26 vaccine subunits. Linear B-cell epitopes with different lengths are predicted by different online servers including BepiPred24, SVMtrip37, ABCPred38 and BCPreds39. B-cell epitopes must be located in the solvent-exposed region of the antigens to be possible to combine with the B-cell36, thus it is essential to predict the surface availability of the structural protein sequence. The surface availability is predicted by Emini tool40 on the whole CoV spike protein sequence. After the predictions, we select out 14 vaccine subunits (see Table 6). Each of them contains at least 1 exposed B-cell epitope which is predicted by BepiPred and one other server (SVMtrip, ABCPred and LBtope) simultaneously.
Cytotoxic T Lymphocytes (CTL) Epitopes Prediction
Cytotoxic T Lymphocytes (CTL) recognize the defected cells by using the MHC class I molecules to bind with certain CTL epitopes25. We use NetMHCpan 4.1 server41 to predict potential CTL epitopes. All the overlapping 9aa peptide sequences in the 14 vaccine subunits are tested with the 12 most common human-leukocyte-antigen (HLA) Class I alleles including HLA-A1, HLA-A2, HLA-A3, HLA-A24, HLA-A26, HLA-B7, HLA-B8, HLA-B27, HLA-B39, HLA-B44, HLA-B58 and HLA-B62 to
evaluate their binding affinities and predict potential CTL epitopes. The total HLA score is calculated for each vaccine subunits. The results are shown in Table 7.
Helper T Lymphocytes (HTL) Epitopes Prediction
Helper T Lymphocytes (HTL) help the activity of other immune cells and they recognize the infection by using MHC class II molecules to bind with certain HTL epitopes42. We use NetMHCIIpan 4.0 server43 to predict potential HTL epitopes. All the overlapping 15aa peptide sequences in the 14 vaccine subunits are tested with the 13 most common HLA Class II alleles from HLA-DRB1-0101 to HLA-DRB1-1601 to evaluate their binding affinities and predict the potential HTL epitopes. The total HLA score is calculated for each vaccine subunits. The results are shown in Table 8.
Worldwide Human Population Coverage Analysis
The vaccine we design should have wide human population coverage. We use the IEDB population coverage analysis tool44 to evaluate the worldwide human population coverage of the 14 vaccine subunits. The 25 HLA alleles we used to predict the T-cell epitopes can cover 98.39% human population. The human population coverage of each vaccine subunit is shown in Table 9. The results suggest that our 14 vaccine subunits can cover a very wide range of human population.
Multi-epitope Vaccine Construction
We discard Subunit 9, 15 and 26 for their poor performance in the CTL and HTL epitopes predictions. We use the rest 11 vaccine subunits to construct a final multi-epitope vaccine (See Figure 3). The final vaccine contains an adjuvant, 50S ribosomal protein L245,46 (accession no. AXI95322.1), to improve the immune response47, linked with the amino (N) terminum of the multi-subunit sequence through an EAAAK linker48. The multi-subunit sequence has a CTL multi-epitope peptides region followed by an HTL multi-epitope peptides region. The CTL region is constructed by 6 subunits which have better performance in the CTL epitopes prediction. AAY linkers48 are used in this region to fuse the subunits. The HTL region is constructed by 6 subunits which have better performance in the HTL epitopes prediction. GPGPG linkers48 are used in this region to fuse the subunits. The two regions are linked through a GPGPG linker. In addition, a 6xHis tag is added at the C-terminal to help purify and identify the protein49. The final vaccine consists of 694 amino acid residues. It contains 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes.
Antigenicity and Allergenicity Evaluation
The antigenicity of the final multi-epitope vaccine sequence is evaluated by the Vaxijen 2.0 online server33,50. We also evaluate the antigenicity of each vaccine subunit, including the adjuvant (See Table 10). The Vaxijen score for the whole final vaccine is 0.5705 with a virus model at a threshold of 0.4, suggesting a high antigenicity of our final vaccine. The AllergenFP 1.0 online server predicts the final vaccine to be non-allergenic. We also use the AllergenFP 1.0 online server to check every subunit in the final vaccine and each of them is predicted to be non-allergenic.
Toxicity and Physicochemical Properties Analysis
The vaccine must not have toxicity potential and the physicochemical properties are also important to evaluate how the vaccine interacts with the environments51. We use the ToxinPred server52 to predict the toxicity. Other physicochemical properties, including hydropathicity, charge, half-life, instability index, pI (Theoretical isoelectric point value) and molecule wheight, are predicted by ExPASy ProtParam Tool53. The predictions are done on the whole final vaccine sequence. The final vaccine is predicted to be non-toxic. The hydropathicity value is predicted to be -0.521, this negative value suggests that our final vaccine is hydrophilic in nature and can interact with the water molecules easily54. The charge is 37.00, this value will decrease in alkaline environment so usually it is better if the charge values are positive. The half-life of the final vaccine is predicted to be 30 hours in in Vitro and >20 hours in Vivo. An Instability Index of 34.01 is predicted, this less than 40 value suggests that our final vaccine is stable. The pI of the final vaccine is calculated to be 9.75, which is an alkaline value, indicating its highly basic existence in nature. The molecule weight of the final vaccine is calculated to be 76 kDa. We also check the toxocity and physicochemical properties of every subunit and the results are shown in Table 11.
Secondary Structure Prediction
We use PSIPRED55 to generate the secondary structure of our final vaccine. Graphical representation of the secondary structure features are shown in Figure 4. The predicted secondary stucture indicates that the final vaccine constitutes 10.8% alpha helix, 24.6% beta strand, and 64.6% coil. The solvent accessibility (ACC), and disorder regions (DISO) are predicted by RaptorX
Peoperty server56,57 (See Figure 5). Among the 694 amino acid residues in our final vaccine, 44% are predicted to be exposed, 27% medium exposed, and 27% are predicted to be buried. A total of 60 residues (8%) are predicted to be located in disordered regions.
Vaccine 3D Structure Modeling
We use the Swiss Model58 to build the 3D structure models of our final vaccine. 5myj.1.6, 3j3v.1.G, 3j9w.1.8 and 5nd9.1.Y are predicted by Swiss Model to be the best four templates with high Global Model Quality Estimation (GMQE) score58. Based on these four templates, we use the Swiss Model to construct the 3D structure models of our final vaccine (See Figure 6). The QMEAN Z-score, Clash score and Ramachandran favoured score of these 4 models are shown in Table 2. The QMEAN Z-score59 provides an estimation of the structure and QMEAN Z-scores around zero indicate good agreement between the model structure and experimental structures. We select Model 3, which has the best QMEAN Z-score value, for further refinement.
Vaccine 3D Structure Refinement
We use GalaxyRefine server60 to refine the 3D structure model of our final vaccine. Among the 5 refined models predicted by GalaxyRefine, we choose the model shown in Figure 7 as our final vaccine model based on its model quality scores. This model has a GDT-HA of 0.9774, an RMSD of 0.344, a molProbity of 1.431, a clash score of 7.9, a poor rotamers of 0 and a Ramachandran plot score of 98.2%, showing great overall quality.
Vaccine 3D Structure Validation
We use ProSA-web61 to validate the overall model quality of the refined final vaccine model. ProSA predicts a Z-score of -5.73 (See Figure 8), which is lying inside the score range of the comparable sized native proteins, indicating good overall model quality. ProSA also checks the local model quality and the residue scores are plotted in Figure 8. Negative values suggest no erroneous parts of the model structure. We also use RAMPAGE server to do the Ramachandran plot analysis and it reveals a Ramachandran plot score of 98.2%, which is consistent with the results of GalaxyRefine.
Codon Optimization and In-silico Cloning
We analyze the cloning and expression efficiency and optimize the codon usage of vaccine construct in E. coli (strain K12) by Java Codon Adaptation Tool62. The length of the optimized codon sequence is 2082 nucleotides. Its Codon Adaptation Index (CAI) is 0.997, and the average GC content is 50.73%, indicating a great potential of good expression of the final vaccine in the E. coli host. After the optimization, we use the SnapGene tool to insert the codon sequences in pET28a(+) vector for cloning (See Figure 9).
RNA Mutations
As the CoV spreads all over the world, its RNA sequence is going through mutations, translating out different virus proteins. Such mutations can have influences on the epitope based vaccines, since a single amino acid difference can change the epitope prediction results. Therefore it is important to prove our final multi-epitope vaccine can tackle the mutations. With our DeepVacPred, we are also able to quickly examine the mutated protein sequence to search for new potential vaccine subunits.
The RNA sequence we use to translate the spike protein and design the vaccines is from Wuhan, which is the original virus34. The RNA mutations result in three most frequent changes in the spike protein area of the CoV and each of the changes contains one amino acid change63. Table 3 shows the mutation details.
The mutation at the 614aa in spike protein from D to G is the most frequent mutations with 116 known isolates63. This mutation is very common in many cities in North America. In Europe and South America the D614G mutation occurs in less than 10 isolates. This change has no influence on the final multi-epitope vaccine since it does not contain the 614aa of the spike protein. With DeepVacPred, we are also able to quickly check and identify whether the mutation can create new potential vaccine subunits. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.
At 476aa in spike protein there is a frequent mutation from G to S, which occurs in 3 isolates from Washington DC63. This mutation has no influence on the final multi-epitope vaccine since it does not contain the 476aa of the spike protein. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.
At 483aa in spike protein there is a frequent mutation from V to A, which occurs in 6 isolates from Washington DC63. This mutation has no influence on the final multi-epitope vaccine since it does not contain the 483aa of the spike protein. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.
In conclusion, our designed multi-epitope vaccine can tackle the current RNA mutations of the coronavirus. The current RNA mutations of the coronavirus create no new potential vaccine subunits.