An In-silico Deep Learning Approach to Multi-epitope Vaccine Design: A SARS-CoV-2 Case Study

doi:10.21203/rs.3.rs-36528/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


The rampant worldwide spread of COVID-19, an infectious disease caused by SARS-CoV-2, has led to over 6.5 million cases and more than 380,000 deaths, and has devastated social, financial and political entities around the world. Without an existing effective medical therapy, vaccines are urgently needed to stop the spread of this disease. In this study, we propose an in-silico deep learning approach for the prediction and design of a multi-epitope vaccine (DeepVacPred). By combining in-silico immunotherapeutic and deep neural network strategies, the DeepVacPred computational framework directly predicts 26 potential vaccine subunits from the available SARS-CoV-2 spike protein sequence. We further use in-silico methods to investigate the linear B-cell epitopes, Cytotoxic T Lymphocytes (CTL) epitopes and Helper T Lymphocytes (HTL) epitopes in the 26 subunit candidates and identify the best 11 of them to construct a multi-epitope vaccine for the SARS-CoV-2 virus. The human population coverage, antigenicity, allergenicity, toxicity, physicochemical properties and secondary structure of the designed vaccine are evaluated via state-of-the-art bioinformatic approaches, showing good quality of the designed vaccine. The 3D structure of the designed vaccine is predicted, refined and validated by in-silico tools. Finally, we optimize the codon sequence and insert it into a plasmid to ensure cloning and expression efficiency. In conclusion, the proposed artificial intelligence vaccine discovery framework accelerates the vaccine design process and constructs a 694aa multi-epitope vaccine containing 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes, which is promising for fighting the SARS-CoV-2 viral infection and can be further evaluated in clinical studies. Moreover, we trace the RNA mutations of the CoV and verify that our designed vaccine can tackle the recent RNA mutations of the virus.

Computational Biology

Artificial Intelligence and Machine Learning

Bioinformatics

Vaccine Development

COVID-19

Multi-epitope Vaccine

Deep Learning

In-silico Vaccine Design Framework

RNA Mutations

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)¹,². First detected in December 2019 in Wuhan, the virus has spread globally, resulting in over 6.5 million infected cases, more than 380,000 deaths³, and unprecedented financial, social and political impacts all over the world⁴. Efficacious vaccines are therefore desperately needed⁵. The main clinical features of COVID-19 are fever, cough and myalgia or fatigue⁶; the virus has caused clusters of severe respiratory illness similar to severe acute respiratory syndrome coronavirus and is associated with ICU admission and high mortality⁷.

Currently, without a specific antiviral therapy for CoV, the control measures for COVID-19 are early diagnosis, reporting, isolation, supportive treatment and timely publication of epidemic information, which have only limited impact on the coronavirus⁸,⁹. Researchers have proposed several approaches to develop vaccines for the CoV¹⁰. The traditional vaccine design process is based on growing pathogens and involves the very time-consuming steps of isolating, inactivating and injecting the virus that causes the disease¹¹,¹². Such a process usually takes more than a year to yield efficacious vaccines and hence contributes very little to halting the current spread of the disease¹³,¹⁴. Recently, researchers have been constructing multi-epitope vaccines by in-silico methods based on immunoinformatics, which accelerate the vaccine design process because no pathogens need to be grown¹⁵,¹⁶. Multi-epitope vaccines are constructed from multiple virus protein fragments rich in overlapping epitopes. They contain the vital parts of the virus needed to elicit either a cellular or a humoral immune response, and they reduce unwanted components that can trigger adverse effects¹⁷. Multi-epitope vaccines can be powerful against viral infections, providing excellent vaccine candidates for clinical trials. The genome sequencing of SARS-CoV-2 is complete⁸ and researchers have studied the SARS-CoV-2 proteins in detail¹⁸. The coronavirus is studded on its exterior with spike proteins, which are the key component used to infect and attack human cells¹⁹. The spike protein of SARS-CoV-2 can latch onto cells and force the virus through the cell membrane, which enables virus entry. Previous studies reveal that the spike protein of the CoV plays a decisive role during infection and that proteolytic activation of the spike protein by host cell proteases is also a critical determinant²⁰. It is therefore promising to combat COVID-19 by inducing B-cells and T-cells that can mount an immune response against the CoV spike protein. Hence, in this study, we choose the spike protein sequence of the CoV as the main subject for the design of our multi-epitope vaccine.

Although in-silico vaccine design approaches are regarded as fairly efficient, they may not be fast enough to keep pace with the emergence of new pandemics. Figure 1A shows the schematic diagram of a traditional in-silico vaccine design process. Researchers usually use numerous in-silico tools to predict the B-cell, CTL and HTL epitopes on the whole virus proteins²¹,²². The antigenicity and other physicochemical properties of the overlapping fragments must also be evaluated²³. To select the best virus protein regions for an efficacious vaccine, all the predicted results have to be evaluated carefully and comprehensively, which creates a large overhead and can be very time-consuming. Currently, each in-silico vaccine design tool achieves only a single prediction goal. For example, BepiPred²⁴ is a very popular B-cell epitope prediction tool and many researchers use it to predict B-cell epitopes. However, BepiPred addresses only the B-cell epitope prediction step, and T-cell epitope prediction requires a different tool such as NetMHCpan²⁵. No current tool can conduct multiple predictions, analyze the results comprehensively and directly return the best vaccine subunits for further construction and evaluation.

To overcome these shortcomings of in-silico vaccine design, we propose DeepVacPred, a novel in-silico multi-epitope vaccine design framework. We replace the multiple necessary predictions and the comprehensive evaluations with a deep neural network (DNN) architecture. Given one peptide sequence as input, the DNN judges whether that sequence can be a potential vaccine subunit. In the DeepVacPred framework, the number of potential vaccine subunits is first reduced to around 30; further evaluation and vaccine construction are then carried out on the predicted subunits with reliable and popular in-silico methods to construct the final vaccine. Our approach aims to make in-silico vaccine design much more efficient.

With DeepVacPred, this study designs a multi-epitope vaccine in a novel in-silico way. We first use the DNN architecture to narrow the CoV spike protein down to 26 fragments as vaccine subunit candidates. Then we predict the linear B-cell epitopes, CTL epitopes and HTL epitopes to select the subunits and construct our final vaccine. We further analyze the human population coverage, antigenicity, allergenicity, toxicity and other physicochemical properties to validate its quality. We also predict the secondary structure and a 3D structure model, which is then refined and validated. Finally, codon optimization and in-silico cloning are performed to check the vaccine genome and protein constructions and ensure effective expression. In addition, DeepVacPred allows us to quickly check for newly emerging threats caused by RNA mutations of the CoV. We show that our vaccine can tackle the virus RNA mutations.

DeepVacPred

Background

An in-silico vaccine design process can be seen as selecting good fragments of the virus proteins and then assembling them into a final vaccine²³. A fragment with multiple merits can be selected as a subunit of the final vaccine. For example, an ideal subunit should contain multiple B-cell epitopes and T-cell epitopes, and it should have high antigenicity to trigger protective reactions in humans²¹,²². These merits can be predicted by in-silico approaches, and numerous in-silico vaccine design tools currently exist. However, these tools are designed to address only one prediction at a time. Consequently, researchers face the time-consuming task of analyzing each individual prediction result from different tools while keeping a comprehensive view of the vaccine design. No current tool can take all the necessary merits into consideration and directly predict the vaccine subunit candidates from the virus proteins.

There are two drawbacks to the current situation: (a) We usually need only the best 10-20 subunits to construct the final vaccine, while each prediction tool may return hundreds or even thousands of potential locations. Selecting the required subunits from them creates a large overhead, and no current tool performs both the prediction and the selection. (b) Nearly 90% of the prediction results are eventually discarded because they have only some of the merits, resulting in much unnecessary analysis and wasted computing resources.

To improve the efficiency and reliability of the vaccine design process, we propose DeepVacPred, a DNN-based in-silico vaccine design process that addresses the aforementioned concerns. DeepVacPred directly predicts the best vaccine subunit candidates (no more than about 30) from the virus protein sequences within a second by replacing the prediction and selection steps with a deep neural network architecture, promising much higher efficiency in the vaccine design and test process.

Data Collection and Dataset Design

Reliable data is essential for the performance of supervised learning²⁶ and thus plays a crucial role in the outcome of the vaccine design process. We collect 5000 of the latest known B-cell epitopes (B) and 2000 known T-cell epitopes containing both MHC (major histocompatibility complex)-1 and MHC-2 binders²⁷ (T) from the IEDB database and combine them with the same number of proteins that are not T-cell or B-cell epitopes, forming a dataset of epitopes and non-epitopes. We select 100 of the latest known viral protective antigens from the IEDB database and randomly select the same number of proteins without protective functions; combined with the 400 antigens used in previous work²⁸, they form a dataset of 600 antigens.

DeepVacPred is built by supervised learning on a carefully designed dataset. To directly predict the vaccine subunit candidates, the protein sequences in the positive dataset must contain at least one T-cell epitope and one B-cell epitope and must be protective antigens. The Cartesian product²⁹ of two sets is the set of all ordered pairs formed from them. Thus, the two Cartesian products T × B and B × T, formed between the collected B-cell epitope dataset and the T-cell epitope dataset, cover all possible combinations of the known B-cell and T-cell epitopes. We use the 600 antigens to train a neural network that can identify protective antigens, and apply this network to the Cartesian products to sieve out 706,970 peptide sequences that are predicted to be protective antigens. These 706,970 peptides contain both B-cell epitopes and T-cell epitopes and are predicted protective antigens; we refer to them as the positive vaccine dataset. The same number of peptides, randomly bridged from negative T-cell and B-cell epitopes, form our negative vaccine dataset. The dataset thus addresses the three most important predictions in the vaccine design process: B-cell epitopes, T-cell epitopes and antigenicity.
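The bridging step can be illustrated with a minimal Python sketch. It assumes that a pair is turned into a peptide by direct concatenation; the function name `bridge_epitopes` and the toy peptides are hypothetical and are not taken from the DeepVacPred code.

```python
from itertools import product

def bridge_epitopes(b_epitopes, t_epitopes):
    """Return all ordered concatenations from B x T and T x B."""
    b_then_t = [b + t for b, t in product(b_epitopes, t_epitopes)]
    t_then_b = [t + b for t, b in product(t_epitopes, b_epitopes)]
    return b_then_t + t_then_b

# Toy placeholder peptides, not real epitopes.
B = ["AAAA", "CCCC"]
T = ["GGG", "HHH", "KKK"]
candidates = bridge_epitopes(B, T)
print(len(candidates))  # 2 * len(B) * len(T) = 12
```

Under the same assumption, 5000 B-cell and 2000 T-cell epitopes would give 2 × 5000 × 2000 = 20 million ordered pairs before the protective antigen filter is applied.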

Network Training

A multi-layer convolutional neural network (CNN) and a four-layer linear neural network are connected to form a deep neural network (DNN) with a two-class output. The sequences in the positive and negative datasets are encoded with Z-descriptors³⁰ and then converted to vectors of uniform length 45 by auto cross covariance (ACC) transformation³¹. Trained on the transformed dataset, the DNN classifies whether an input is a protective antigen containing both B-cell and T-cell epitopes, and can thus directly judge whether a sequence can be a potential vaccine subunit. This DNN is the core of the rapid vaccine design process in our DeepVacPred framework, and we name it DNN-V. In addition, we train another DNN with the same structure on the T-cell epitope dataset; it judges whether an input sequence can be a T-cell epitope, and we name it DNN-T.
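The following sketch shows how an ACC transformation can map sequences of different lengths to a vector of fixed length. It assumes three descriptors per residue and a maximum lag of 5, which gives 3 × 3 × 5 = 45 values; these settings are an assumption chosen to match the stated length of 45, not values given in the text. The descriptor table `TOY_Z` holds placeholder numbers, not published z-scale values.

```python
def acc_transform(descriptors, max_lag=5):
    """Auto cross covariance of a per-residue descriptor matrix.

    descriptors: list of per-residue tuples (one value per descriptor).
    Returns d * d * max_lag values, independent of sequence length.
    """
    n = len(descriptors)
    d = len(descriptors[0])
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                # Average product of descriptor j at position i
                # and descriptor k at position i + lag.
                total = sum(
                    descriptors[i][j] * descriptors[i + lag][k]
                    for i in range(n - lag)
                )
                features.append(total / (n - lag))
    return features

# Placeholder descriptor values for illustration only (assumption).
TOY_Z = {"A": (1.0, 0.0, -1.0), "G": (0.5, -0.5, 0.0), "L": (-1.0, 1.0, 0.5)}

def encode(sequence, table=TOY_Z, max_lag=5):
    """Look up descriptors for each residue and apply the ACC transform."""
    return acc_transform([table[aa] for aa in sequence], max_lag)

vec_short = encode("AGLAGLAG")
vec_long = encode("AGLAGLAGLAGLAGLAGLAGLAGLAGLAGL")
print(len(vec_short), len(vec_long))  # 45 45
```

Because the output length depends only on the number of descriptors and the maximum lag, peptides of any length above the lag can be fed to the same network input.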

Validation

ROC Curves

A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied³². DNN-V is a novel approach that needs to be validated, and we use ROC curves to evaluate it. We test the trained DNN-V with two datasets, the train set and the test set, each of which contains 200 protein sequences. The train set contains 200 proteins randomly selected from the dataset used to train DNN-V, with 100 positive and 100 negative protein sequences. For the test set, we select known B-cell epitopes and T-cell epitopes that are not in our collected data and apply the steps above, again obtaining 100 positive and 100 negative protein sequences. The ROC curves are shown in Figure 2 and the validation data appear in Table 1. Thresholds range from 0 to 1. The accuracy reported in Table 1 is the greatest value among all thresholds, and the sensitivity and specificity values are reported at that threshold. The AUC (area under the ROC curve) value of 0.9703 for the test set indicates that DNN-V classifies potential vaccine subunits with high accuracy.
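The reported metrics can be computed along the following lines. This is a minimal sketch, assuming the classifier returns a score in [0, 1] for the positive class; the function names and the toy labels and scores are illustrative and are not the data behind Table 1.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(labels, scores, steps=1000):
    """Sweep thresholds in [0, 1]; return (accuracy, sensitivity, specificity, threshold)
    at the first threshold that reaches the highest accuracy."""
    best = None
    for i in range(steps + 1):
        t = i / steps
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        fn = labels.count(1) - tp
        fp = labels.count(0) - tn
        acc = (tp + tn) / len(labels)
        if best is None or acc > best[0]:
            best = (acc, tp / (tp + fn), tn / (tn + fp), t)
    return best

# Toy example: three positives and three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(roc_auc(labels, scores))
print(best_threshold(labels, scores))
```

Reporting sensitivity and specificity at the accuracy-maximizing threshold, as done for Table 1, corresponds to the tuple returned by `best_threshold`.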

Vaccine Design Test

The false positive rate (FPR) will fall to 0 if we set the threshold to a very low value, e.g., 0.0003, since we only care about discarding all the non-candidates. We apply DNN-V in our DeepVacPred framework to the 1273aa spike protein sequence of the CoV, and 130 vaccine candidates are predicted. We use BepiPred²⁴, NetMHCpan²⁵ and Vaxijen³³ to examine each candidate. All of the candidates contain both T-cell and B-cell epitopes and only 14 of them are predicted by Vaxijen to be non-protective antigens.

DeepVacPred Framework

Figure 1B provides the schematic diagram of the vaccine design process using the DeepVacPred framework. DeepVacPred first uses DNN-V to predict a very small number of potential vaccine subunits directly from the virus protein sequences. It then uses DNN-T to examine all the overlapping sequences in these subunits and selects the subunit candidates that have multiple T-cell epitopes. These two prediction rounds take less than a second and reduce the number of potential vaccine subunits to around 30.

The subsequent steps in the DeepVacPred framework select the best subunits from only about 30 candidates and construct the final vaccine based on evaluations by various reliable in-silico tools. These steps include linear B-cell epitope prediction, CTL and HTL epitope prediction, population coverage analysis, vaccine construction, evaluation of antigenicity, allergenicity, toxicity and other physicochemical properties, structure prediction, 3D modeling and in-silico cloning. Compared to the usual computational process, these evaluations are done on a much smaller amount of data, which improves efficiency.

Data Retrieval

The genome sequence of SARS-CoV-2 isolate Wuhan-Hu-1 is retrieved from the NCBI database with accession number MN908947³⁴. The protein sequences are retrieved according to their translation. In particular, the spike protein (protein ID: QHD43416.1) has a length of 1273 amino acids (aa), and the receptor binding domain (RBD) spans 347 to 520aa¹⁹. The following experiments focus mainly on the spike protein.

DeepVacPred Vaccine Subunits Prediction

All overlapping protein fragments with a length of 30aa are generated from the 1273aa SARS-CoV-2 spike protein sequence. DeepVacPred first tests these 1244 30aa protein sequences and predicts 130 potential vaccine subunits (see Table 4). It then predicts the T-cell epitopes at these locations and discards the subunits that have fewer than 8 T-cell epitopes³⁵. After this step, DeepVacPred provides 26 potential vaccine subunits for further evaluation and construction (see Table 5). These subunits are very likely to contain B-cell epitopes and multiple T-cell epitopes, and to have high antigenicity and low allergenicity. Starting the subsequent in-silico vaccine design process directly from the 26 predicted subunits is very efficient.
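The window count follows from 1273 - 30 + 1 = 1244. A minimal sketch of the window generation and the two-stage screen is given below; `is_candidate` and `count_t_epitopes` are hypothetical stand-ins for DNN-V and DNN-T, and the dummy sequence is not the real spike protein.

```python
def sliding_windows(sequence, width=30):
    """Yield (start, fragment) for every overlapping window; start is 1-based."""
    for i in range(len(sequence) - width + 1):
        yield i + 1, sequence[i:i + width]

def two_stage_screen(sequence, is_candidate, count_t_epitopes,
                     width=30, min_t_epitopes=8):
    """Keep windows accepted by stage 1 that carry enough predicted T-cell epitopes."""
    kept = []
    for start, frag in sliding_windows(sequence, width):
        if is_candidate(frag) and count_t_epitopes(frag) >= min_t_epitopes:
            kept.append((start, frag))
    return kept

# A dummy 1273-residue string stands in for the spike protein sequence.
dummy_spike = ("ACDEFGHIKLMNPQRSTVWY" * 64)[:1273]
print(len(list(sliding_windows(dummy_spike))))  # 1244
```

Passing the classifiers in as callables keeps the screening logic independent of how the two networks are implemented.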

Linear B-cell Epitopes Prediction

B-cell epitopes are the portions of antigens that bind to immunoglobulins or antibodies and trigger the B-cell to mount an immune response³⁶. Linear B-cell epitopes are predicted on the 26 vaccine subunits. Epitopes of different lengths are predicted by different online servers, including BepiPred²⁴, SVMtrip³⁷, ABCPred³⁸ and BCPreds³⁹. B-cell epitopes must be located in the solvent-exposed region of the antigen to be able to bind to the B-cell³⁶, so it is essential to predict the surface accessibility of the structural protein sequence. Surface accessibility is predicted with the Emini tool⁴⁰ on the whole CoV spike protein sequence. After these predictions, we select 14 vaccine subunits (see Table 6). Each of them contains at least one exposed B-cell epitope that is predicted simultaneously by BepiPred and by one other server (SVMtrip, ABCPred and LBtope).

Cytotoxic T Lymphocytes (CTL) Epitopes Prediction

Cytotoxic T lymphocytes (CTL) recognize defective cells by using MHC class I molecules to bind certain CTL epitopes²⁵. We use the NetMHCpan 4.1 server⁴¹ to predict potential CTL epitopes. All the overlapping 9aa peptide sequences in the 14 vaccine subunits are tested against the 12 most common human leukocyte antigen (HLA) class I alleles, namely HLA-A1, HLA-A2, HLA-A3, HLA-A24, HLA-A26, HLA-B7, HLA-B8, HLA-B27, HLA-B39, HLA-B44, HLA-B58 and HLA-B62, to evaluate their binding affinities and predict potential CTL epitopes. The total HLA score is calculated for each vaccine subunit. The results are shown in Table 7.

Helper T Lymphocytes (HTL) Epitopes Prediction

Helper T lymphocytes (HTL) support the activity of other immune cells and recognize infection by using MHC class II molecules to bind certain HTL epitopes⁴². We use the NetMHCIIpan 4.0 server⁴³ to predict potential HTL epitopes. All the overlapping 15aa peptide sequences in the 14 vaccine subunits are tested against the 13 most common HLA class II alleles, from HLA-DRB1-0101 to HLA-DRB1-1601, to evaluate their binding affinities and predict potential HTL epitopes. The total HLA score is calculated for each vaccine subunit. The results are shown in Table 8.

Worldwide Human Population Coverage Analysis

The vaccine we design should have wide human population coverage. We use the IEDB population coverage analysis tool⁴⁴ to evaluate the worldwide human population coverage of the 14 vaccine subunits. The 25 HLA alleles used to predict the T-cell epitopes cover 98.39% of the human population. The human population coverage of each vaccine subunit is shown in Table 9. The results suggest that our 14 vaccine subunits cover a very wide range of the human population.

Multi-epitope Vaccine Construction

We discard Subunits 9, 15 and 26 because of their poor performance in the CTL and HTL epitope predictions, and use the remaining 11 vaccine subunits to construct the final multi-epitope vaccine (see Figure 3). The final vaccine contains an adjuvant, 50S ribosomal protein L2⁴⁵,⁴⁶ (accession no. AXI95322.1), to improve the immune response⁴⁷; it is linked to the amino (N) terminus of the multi-subunit sequence through an EAAAK linker⁴⁸. The multi-subunit sequence consists of a CTL multi-epitope peptide region followed by an HTL multi-epitope peptide region. The CTL region is built from the 6 subunits that perform better in the CTL epitope prediction, fused by AAY linkers⁴⁸. The HTL region is built from the 6 subunits that perform better in the HTL epitope prediction, fused by GPGPG linkers⁴⁸. The two regions are joined by a GPGPG linker. In addition, a 6xHis tag is added at the C-terminus to help purify and identify the protein⁴⁹. The final vaccine consists of 694 amino acid residues and contains 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes.
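The assembly order described above can be written as a short sketch. The linker strings and the part order come from the text; the function name `assemble_vaccine` and the short placeholder sequences are hypothetical and do not represent the real adjuvant or subunits.

```python
def assemble_vaccine(adjuvant, ctl_subunits, htl_subunits):
    """Join the parts in the described order: adjuvant, CTL region, HTL region, His tag."""
    ctl_region = "AAY".join(ctl_subunits)      # CTL subunits fused by AAY linkers
    htl_region = "GPGPG".join(htl_subunits)    # HTL subunits fused by GPGPG linkers
    return adjuvant + "EAAAK" + ctl_region + "GPGPG" + htl_region + "HHHHHH"

# Placeholder sequences only (assumption); the real parts are given in the paper's tables.
construct = assemble_vaccine("MKV", ["SSS", "TTT"], ["LLL", "VVV"])
print(construct)
print(len(construct))
```

With the real adjuvant and subunit sequences, the length of the returned string could be compared against the reported 694 residues as a consistency check.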

Antigenicity and Allergenicity Evaluation

The antigenicity of the final multi-epitope vaccine sequence is evaluated by the Vaxijen 2.0 online server³³,⁵⁰. We also evaluate the antigenicity of each vaccine subunit, including the adjuvant (see Table 10). The Vaxijen score for the whole final vaccine is 0.5705 with a virus model at a threshold of 0.4, suggesting high antigenicity of our final vaccine. The AllergenFP 1.0 online server predicts the final vaccine to be non-allergenic. We also use the AllergenFP 1.0 online server to check every subunit in the final vaccine, and each of them is predicted to be non-allergenic.

Toxicity and Physicochemical Properties Analysis

The vaccine must have no toxic potential, and its physicochemical properties are important for evaluating how it interacts with its environment⁵¹. We use the ToxinPred server⁵² to predict toxicity. Other physicochemical properties, including hydropathicity, charge, half-life, instability index, pI (theoretical isoelectric point) and molecular weight, are predicted with the ExPASy ProtParam tool⁵³. The predictions are made on the whole final vaccine sequence. The final vaccine is predicted to be non-toxic. The predicted hydropathicity is -0.521; this negative value suggests that the final vaccine is hydrophilic and can interact easily with water molecules⁵⁴. The charge is 37.00; because this value decreases in an alkaline environment, a positive charge is usually preferable. The half-life of the final vaccine is predicted to be 30 hours in vitro and >20 hours in vivo. The predicted instability index is 34.01; a value below 40 suggests that the final vaccine is stable. The pI of the final vaccine is calculated to be 9.75, indicating that the protein is highly basic. The molecular weight of the final vaccine is calculated to be 76 kDa. We also check the toxicity and physicochemical properties of every subunit; the results are shown in Table 11.
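As an illustration of one of these properties, the hydropathicity value reported by ProtParam is a grand average of hydropathy (GRAVY): the mean Kyte-Doolittle hydropathy over all residues. The sketch below is a local reimplementation for illustration, not the server's code, and the test peptides are placeholders.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Mean hydropathy of a protein sequence; negative values indicate hydrophilicity."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

print(round(gravy("EAAAK"), 3))   # the EAAAK linker on its own
print(round(gravy("HHHHHH"), 3))  # the 6xHis tag on its own
```

A negative result, such as the -0.521 reported for the final vaccine, means that hydrophilic residues dominate on average.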

Secondary Structure Prediction

We use PSIPRED⁵⁵ to generate the secondary structure of our final vaccine. A graphical representation of the secondary structure features is shown in Figure 4. The predicted secondary structure indicates that the final vaccine consists of 10.8% alpha helix, 24.6% beta strand and 64.6% coil. The solvent accessibility (ACC) and disordered regions (DISO) are predicted by the RaptorX Property server⁵⁶,⁵⁷ (see Figure 5). Among the 694 amino acid residues in our final vaccine, 44% are predicted to be exposed, 27% medium exposed and 27% buried. A total of 60 residues (8%) are predicted to lie in disordered regions.

Vaccine 3D Structure Modeling

We use the Swiss Model⁵⁸ to build the 3D structure models of our final vaccine. 5myj.1.6, 3j3v.1.G, 3j9w.1.8 and 5nd9.1.Y are predicted by Swiss Model to be the best four templates, with high Global Model Quality Estimation (GMQE) scores⁵⁸. Based on these four templates, we use the Swiss Model to construct the 3D structure models of our final vaccine (see Figure 6). The QMEAN Z-score, clash score and Ramachandran favoured score of these 4 models are shown in Table 2. The QMEAN Z-score⁵⁹ provides an estimate of the structure quality, and QMEAN Z-scores around zero indicate good agreement between the model structure and experimental structures. We select Model 3, which has the best QMEAN Z-score, for further refinement.

Vaccine 3D Structure Refinement

We use the GalaxyRefine server⁶⁰ to refine the 3D structure model of our final vaccine. Among the 5 refined models produced by GalaxyRefine, we choose the model shown in Figure 7 as our final vaccine model based on its model quality scores. This model has a GDT-HA of 0.9774, an RMSD of 0.344, a MolProbity score of 1.431, a clash score of 7.9, a poor rotamers score of 0 and a Ramachandran plot score of 98.2%, showing good overall quality.

Vaccine 3D Structure Validation

We use ProSA-web⁶¹ to validate the overall model quality of the refined final vaccine model. ProSA predicts a Z-score of -5.73 (see Figure 8), which lies inside the score range of native proteins of comparable size, indicating good overall model quality. ProSA also checks the local model quality, and the residue scores are plotted in Figure 8. Negative values suggest that no part of the model structure is erroneous. We also use the RAMPAGE server for Ramachandran plot analysis; it reveals a Ramachandran plot score of 98.2%, which is consistent with the results of GalaxyRefine.

Codon Optimization and In-silico Cloning

We analyze the cloning and expression efficiency and optimize the codon usage of the vaccine construct in E. coli (strain K12) with the Java Codon Adaptation Tool⁶². The length of the optimized codon sequence is 2082 nucleotides. Its Codon Adaptation Index (CAI) is 0.997 and the average GC content is 50.73%, indicating great potential for good expression of the final vaccine in the E. coli host. After the optimization, we use the SnapGene tool to insert the codon sequence into the pET28a(+) vector for cloning (see Figure 9).
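Two of these figures are easy to check by hand: 694 residues at three nucleotides per codon give 694 × 3 = 2082 nucleotides, and GC content is the fraction of G and C bases. The helper names below are hypothetical, and the short DNA strings are toy inputs, not the optimized sequence.

```python
def coding_length(n_residues, include_stop=False):
    """Number of nucleotides needed to encode n_residues amino acids."""
    return 3 * (n_residues + (1 if include_stop else 0))

def gc_content(dna):
    """Percentage of G and C bases in a DNA sequence."""
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)

print(coding_length(694))  # 2082
print(gc_content("ATGC"))  # 50.0
```

The reported length of 2082 nucleotides therefore corresponds to the 694 residues without an added stop codon.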

RNA Mutations

As the CoV spreads all over the world, its RNA sequence is going through mutations, translating out different virus proteins. Such mutations can have influences on the epitope based vaccines, since a single amino acid difference can change the epitope prediction results. Therefore it is important to prove our final multi-epitope vaccine can tackle the mutations. With our DeepVacPred, we are also able to quickly examine the mutated protein sequence to search for new potential vaccine subunits.

The RNA sequence we use to translate the spike protein and design the vaccine is from Wuhan and represents the original virus³⁴. RNA mutations result in the three most frequent changes in the spike protein region of the CoV, each of which is a single amino acid change⁶³. Table 3 shows the mutation details.

The mutation at 614aa in the spike protein from D to G is the most frequent mutation, with 116 known isolates⁶³. This mutation is very common in many cities in North America. In Europe and South America the D614G mutation occurs in fewer than 10 isolates. This change has no influence on the final multi-epitope vaccine since the vaccine does not contain the 614aa position of the spike protein. With DeepVacPred, we are also able to quickly check whether the mutation creates new potential vaccine subunits. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as for the original virus.

At 476aa in spike protein there is a frequent mutation from G to S, which occurs in 3 isolates from Washington DC⁶³. This mutation has no influence on the final multi-epitope vaccine since it does not contain the 476aa of the spike protein. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.

At 483aa in spike protein there is a frequent mutation from V to A, which occurs in 6 isolates from Washington DC⁶³. This mutation has no influence on the final multi-epitope vaccine since it does not contain the 483aa of the spike protein. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.

In conclusion, our designed multi-epitope vaccine can tackle the current RNA mutations of the coronavirus. The current RNA mutations of the coronavirus create no new potential vaccine subunits.
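The position check used in this section can be sketched as follows: a mutation can only affect the vaccine if its position falls inside one of the spike protein ranges covered by a selected subunit. The ranges in `example_ranges` are hypothetical coordinates for illustration; the real subunit locations are listed in the paper's tables.

```python
def subunits_hit_by_mutations(subunit_ranges, mutated_positions):
    """Map each mutated position to the subunit ranges (1-based, inclusive) that contain it."""
    return {
        pos: [(start, end) for start, end in subunit_ranges if start <= pos <= end]
        for pos in mutated_positions
    }

# Hypothetical subunit coordinates (assumption), not the published ones.
example_ranges = [(21, 50), (301, 330), (1101, 1130)]
hits = subunits_hit_by_mutations(example_ranges, [614, 476, 483])
print(hits)  # every list is empty when no subunit spans a mutated position
```

An empty list for every position corresponds to the situation described above, where none of the three mutations falls inside the vaccine.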

In-silico vaccine design is highly efficient and places strong emphasis on the multiple epitopes in the vaccine peptides. In this study, we develop DeepVacPred, an efficient vaccine subunit sieving framework that uses a DNN to rapidly select 26 potential vaccine subunit candidates, introducing a way to achieve much higher speed and efficiency in in-silico vaccine design. The goal is to directly predict the potential vaccine subunit sequences without running a large number of different predictions and then evaluating and selecting the predicted results manually. With this artificial intelligence framework, we are able to skip at least 95% of unnecessary predictions and let the computer analyze and select the best vaccine subunits for us. DeepVacPred predicts the 26 vaccine subunits within a second, which enables us to skip the most time-consuming part of in-silico vaccine design. With DeepVacPred, a researcher can construct a multi-epitope vaccine for a new virus and validate its quality within an hour.

This approach can be further developed by increasing the complexity and coverage of the dataset. In this study we select a subset of the known epitopes and protective antigens to form the dataset used to train the DNN, and we use simple bridging of one B-cell epitope and one T-cell epitope. With a more comprehensive dataset and more possible epitope combinations, we will be able to develop a better rapid vaccine design tool. In practice, however, the current approach can already deal with most situations.

The application of DNNs to protein sequence classification shows great potential. Most online tools rely on the SVM learning model. In the very popular protective antigen prediction tool Vaxijen³³, the AUC of the ROC curve reaches only 0.743, which limits the accuracy of its predictions. The dataset used to train Vaxijen contains only 200 proteins, and relying on the SVM model becomes more time-consuming and challenging as the number of discovered protective antigens increases. By contrast, DeepVacPred shows that a DNN can make very accurate predictions with over 700,000 different proteins in the dataset.

This study results in a novel multi-epitope vaccine with a length of 694aa against the CoV. It contains an adjuvant and 11 subunits with 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes. It shows good antigenicity, population coverage, physicochemical properties and structure, providing great potential for the next step of COVID-19 vaccine design with actual experiments and clinical studies.

Furthermore, we trace the RNA mutations of the CoV. Each of the RNA mutations examined results in a single amino acid change in the spike protein. The proposed vaccine design framework can also tackle the three most frequently observed mutations. The investigation of the RNA mutations also demonstrates the high efficiency of DeepVacPred.

DNN Design and Training in DeepVacPred Framework

Each input to the DNN is a vector of length 45, converted from the protein sequence by Z-descriptors³⁰ and ACC transformation³¹. A convolutional neural network (CNN) performs well at identifying and processing such vectors, while a multi-layer linear neural network is connected to the output layer of the CNN, forming a complex DNN that enhances the classification ability. Our DNN is therefore constructed from the following layers, with the parameters of each layer chosen by random search to obtain high accuracy while maintaining good computing speed:

i. CNN, in channels = 1, out channels = 16, kernel size = 3, stride = 2, padding = 1, Tanh function;

ii. CNN, in channels = 16, out channels = 16, kernel size = 3, stride = 2, padding = 1, Tanh function;

iii. CNN, in channels = 16, out channels = 1, kernel size = 3, stride = 2, padding = 1, Tanh function, Average Pooling;

iv. Linear, in features = 32, out features = 64, Tanh function;

v. Linear, in features = 64, out features = 32, Tanh function;

vi. Linear, in features = 32, out features = 16, Tanh function;

vii. Linear, in features = 16, out features = 2, Sigmoid function.
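
To make the effect of the stride and padding settings concrete, the following sketch applies the standard output-length formula for a convolution without dilation, floor((L + 2p - k) / s) + 1, to the three convolutional layers. It assumes the layers are one-dimensional convolutions over a single-channel input of length 45; that layout is our assumption. The text does not state the pooling window or how the convolutional output is mapped to the 32 input features of layer iv, so the sketch stops after layer iii.

```python
def conv_output_length(length, kernel_size, stride, padding):
    """Output length of a 1D convolution without dilation."""
    return (length + 2 * padding - kernel_size) // stride + 1


# Layers i-iii as listed above; only the shape-related settings are needed.
CONV_LAYERS = [
    {"in_channels": 1, "out_channels": 16, "kernel_size": 3, "stride": 2, "padding": 1},
    {"in_channels": 16, "out_channels": 16, "kernel_size": 3, "stride": 2, "padding": 1},
    {"in_channels": 16, "out_channels": 1, "kernel_size": 3, "stride": 2, "padding": 1},
]

ASSUMED_INPUT_LENGTH = 45  # assumption: one channel, 45 positions

lengths = [ASSUMED_INPUT_LENGTH]
for layer in CONV_LAYERS:
    lengths.append(conv_output_length(lengths[-1], layer["kernel_size"],
                                      layer["stride"], layer["padding"]))

for index, layer in enumerate(CONV_LAYERS):
    print(f"layer {index + 1}: {layer['in_channels']}x{lengths[index]} "
          f"-> {layer['out_channels']}x{lengths[index + 1]}")
```

Under these assumptions each layer roughly halves the length, so the stack reduces 45 positions to 23, 12 and then 6.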

The hyper-parameters of the DNN training are listed below. The selected values in bold are obtained using a random search.

i. Learning rate: [0.0001, 0.0005, 0.001, 0.0015, 0.002];

ii. Optimizer: [SGD, RMSProp, Adam];

iii. Epochs: [2000, 4000, 6000, 8000, 10000];

iv. Batch size: [1024, 2048, 4096, 8192].
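
A random search over the four lists above can be sketched as follows. The search space is copied from the text; the `evaluate` callback, the number of trials and the toy scoring function are hypothetical placeholders, since the actual training loop and selection criterion are not reproduced here.

```python
import random

# Candidate values as listed above.
SEARCH_SPACE = {
    "learning_rate": [0.0001, 0.0005, 0.001, 0.0015, 0.002],
    "optimizer": ["SGD", "RMSProp", "Adam"],
    "epochs": [2000, 4000, 6000, 8000, 10000],
    "batch_size": [1024, 2048, 4096, 8192],
}


def sample_config(rng):
    """Draw one value per hyper-parameter uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials, seed=0):
    """Keep the sampled configuration with the highest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Placeholder score; a real run would train the DNN and return, e.g., validation AUC."""
    return -abs(config["learning_rate"] - 0.001)


best_config, best_score = random_search(toy_evaluate, n_trials=20)
print(best_config, best_score)
```

The full grid contains 5 x 3 x 5 x 4 = 300 combinations, so a random search with a few dozen trials explores only part of it, which is the usual trade-off against training cost.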

Linear B-cell Epitopes Prediction

We use four popular servers to predict the linear B-cell epitopes on each vaccine subunit candidate. (1) BepiPred-2.0 web server (http://www.cbs.dtu.dk/services/BepiPred/). BepiPred is a reliable machine-learning tool trained with a random forest algorithm, and its training dataset covers a large number of known linear B-cell epitopes from the IEDB database²⁴. (2) ABCpred (http://www.imtech.res.in/raghava/abcpred/). ABCpred applies a recurrent neural network to the classification of epitopes and non-epitopes to improve accuracy³⁸. (3) SVMTriP (http://sysbio.unl.edu/SVMTriP/). SVMTriP uses a support vector machine to predict antigenic epitopes, and its AUC reaches 0.702³⁷. (4) BCPreds (http://ailab.ist.psu.edu/bcpred/). BCPreds is also based on an SVM model, with an AUC of 0.758, and its prediction relies on kernel methods³⁹.

Cytotoxic T Lymphocytes (CTL) Epitopes Prediction

We use the NetMHCpan 4.1 server (http://www.cbs.dtu.dk/services/NetMHCpan/) to predict the CTL epitopes on each vaccine subunit candidate. We predict CTL epitopes with a length of 9 aa. All parameters are left at their default values. NetMHCpan predicts peptide binding to any MHC class I molecule of known sequence using artificial neural networks (ANNs) trained on a combination of more than 850,000 quantitative Binding Affinity (BA) and Mass-Spectrometry Eluted Ligand (EL) peptides, providing reliable prediction results⁴¹.

Helper T Lymphocytes (HTL) Epitopes Prediction

We use the NetMHCIIpan 4.0 server (http://www.cbs.dtu.dk/services/NetMHCIIpan/) to predict the HTL epitopes on each vaccine subunit candidate. We predict HTL epitopes with a length of 15 aa. All parameters are left at their default values. NetMHCIIpan predicts peptide binding to any MHC class II molecule of known sequence using artificial neural networks (ANNs) trained on an extensive dataset of over 500,000 measurements of Binding Affinity (BA) and Eluted Ligand mass spectrometry (EL), covering the three human MHC class II isotypes HLA-DR, HLA-DQ and HLA-DP, providing reliable prediction results⁴³.
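
The candidate peptides scored for a subunit are its overlapping windows of fixed length: 9 aa for CTL epitopes and 15 aa for HTL epitopes. The servers enumerate these windows themselves; the sketch below only illustrates how many candidates a 30 aa subunit yields, using the Subunit 1 sequence and start position from Table 5. The function name and the `start` argument are ours.

```python
def sliding_peptides(sequence, length, start=1):
    """Return (position, peptide) pairs for every window of the given length.

    `start` is the residue number of the first amino acid of `sequence`
    in the parent protein.
    """
    return [(start + i, sequence[i:i + length])
            for i in range(len(sequence) - length + 1)]


SUBUNIT_1 = "TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL"  # spike residues 19-48 (Table 5)

ctl_candidates = sliding_peptides(SUBUNIT_1, 9, start=19)
htl_candidates = sliding_peptides(SUBUNIT_1, 15, start=19)

print(len(ctl_candidates), "9-mers, first:", ctl_candidates[0])
print(len(htl_candidates), "15-mers, last:", htl_candidates[-1])
```

A 30 aa subunit therefore gives 22 candidate 9-mers and 16 candidate 15-mers, each of which is then scored against the selected HLA alleles.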

Worldwide Human Population Coverage Analysis

The worldwide human population coverage of each subunit is evaluated with the IEDB population coverage analysis tool (http://tools.iedb.org/population/).
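
As a rough intuition for what a coverage percentage means, the sketch below uses a simplified model: at each HLA locus an individual carries two alleles drawn independently, and is counted as covered if at least one of them binds an epitope in the subunit. This is a textbook-style approximation and not the IEDB tool's actual algorithm, and the allele frequencies shown are hypothetical values chosen only for illustration.

```python
def locus_coverage(covered_allele_frequencies):
    """Probability of carrying at least one covered allele at a single locus."""
    total = sum(covered_allele_frequencies)
    if not 0.0 <= total <= 1.0:
        raise ValueError("allele frequencies at one locus must sum to at most 1")
    return 1.0 - (1.0 - total) ** 2


def combined_coverage(loci):
    """Coverage across loci, assuming the loci are independent."""
    uncovered = 1.0
    for frequencies in loci.values():
        uncovered *= 1.0 - locus_coverage(frequencies)
    return 1.0 - uncovered


# Hypothetical frequencies of the alleles bound by some subunit.
HYPOTHETICAL_FREQUENCIES = {
    "HLA-A": [0.20, 0.10],
    "HLA-B": [0.15, 0.05],
}

coverage = combined_coverage(HYPOTHETICAL_FREQUENCIES)
print(f"{100 * coverage:.2f}%")
```

The model shows why subunits that bind many common alleles at several loci reach high coverage: each additional covered locus multiplies down the uncovered fraction.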

Antigenicity and Allergenicity Evaluation

The antigenicity of the final vaccine and of each of its subunits is predicted by the VaxiJen 2.0 server (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html). VaxiJen is based on auto cross covariance (ACC) transformation of protein sequences into uniform vectors of principal amino acid properties³³. The allergenicity of the final vaccine and of each of its subunits is checked by the AllergenFP 1.0 server (http://ddg-pharmfac.net/AllergenFP/). AllergenFP is a binary classifier that separates allergens from non-allergens. Its dataset is described by five E-descriptors, and the strings are transformed into uniform vectors by auto cross covariance (ACC) transformation⁶⁴.
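
The ACC transformation turns sequences of different lengths into vectors of equal length by averaging products of descriptor values at residues a fixed lag apart. The sketch below is a minimal illustration of that idea. The descriptor values are toy placeholders generated from character codes, not the published Z- or E-descriptors, and the choice of three descriptors with lags 1 to 5 is our assumption, picked because it yields 3 x 3 x 5 = 45 features, the input length stated for the DNN.

```python
def acc_transform(descriptors, max_lag):
    """Auto cross covariance over all descriptor pairs and lags 1..max_lag.

    `descriptors` is a list with one tuple of descriptor values per residue.
    For descriptors j, k and a given lag, the feature is the mean over i of
    descriptors[i][j] * descriptors[i + lag][k].
    """
    n = len(descriptors)
    d = len(descriptors[0])
    if n <= max_lag:
        raise ValueError("sequence must be longer than the maximum lag")
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(descriptors[i][j] * descriptors[i + lag][k]
                            for i in range(n - lag))
                features.append(total / (n - lag))
    return features


def toy_descriptors(residue):
    """Placeholder values only; real use would look up published descriptors."""
    code = ord(residue) - ord("A")
    return (code / 25.0, (code % 5) / 4.0, (code % 3) / 2.0)


SUBUNIT_1 = "TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL"  # from Table 5
features = acc_transform([toy_descriptors(r) for r in SUBUNIT_1], max_lag=5)
print(len(features))
```

Because the output length depends only on the number of descriptors and the maximum lag, a 30 aa subunit and a full-length protein map to vectors of the same size.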

Toxicity and Physicochemical Properties Analysis

The toxicity of the final vaccine and of each of its subunits is predicted by the ToxinPred server (http://crdd.osdd.net/raghava/toxinpred/). ToxinPred uses an SVM model to classify peptides as toxic or non-toxic. The dataset used in its method consists of 1805 toxic peptides (<=35 residues)⁵². The physicochemical properties of the final vaccine and of each of its subunits are predicted by the ExPASy ProtParam server (https://web.expasy.org/protparam/). The physicochemical properties include hydropathicity, charge, half-life, instability index, pI (theoretical isoelectric point) and molecular weight⁵³.
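
Of these properties, hydropathicity is the easiest to reproduce by hand: ProtParam reports it as the grand average of hydropathy (GRAVY), the mean Kyte-Doolittle hydropathy value over all residues. The sketch below computes it for two sequences that appear in the tables; the function name is ours.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean hydropathy value per residue."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


HIS_TAG = "HHHHHH"
SUBUNIT_1 = "TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL"

print(f"6xHis tag: {gravy(HIS_TAG):.3f}")
print(f"Subunit 1: {gravy(SUBUNIT_1):.3f}")
```

The two values, -3.200 and -0.510, agree with the hydropathicity column of Table 11. Negative values indicate a hydrophilic sequence on average.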

Secondary Structure Prediction

PSIPRED is used to predict the secondary structure of the final vaccine (http://bioinf.cs.ucl.ac.uk/psipred/). PSIPRED incorporates two feed-forward neural networks that analyze the output of PSI-BLAST (Position-Specific Iterated BLAST). It achieves an average Q3 score of 81.6%, indicating accurate secondary structure prediction⁵⁵. We also use the RaptorX Property web server (http://raptorx.uchicago.edu/StructurePropertyPred/predict/) to predict solvent accessibility (ACC) and disordered regions (DISO). RaptorX Property employs a machine learning model called DeepCNF (Deep Convolutional Neural Fields) to predict secondary structure (SS), solvent accessibility (ACC) and disordered regions (DISO) simultaneously⁵⁷.
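
The Q3 score quoted for PSIPRED is the fraction of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. A minimal sketch of the metric follows; the two label strings are made-up examples, not predictions for the vaccine.

```python
def q3_score(observed, predicted):
    """Fraction of positions where the predicted state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must not be empty")
    if set(observed + predicted) - set("HEC"):
        raise ValueError("labels must be H, E or C")
    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    return matches / len(observed)


# Made-up example: eight residues, one of them mispredicted.
observed_states = "HHHEECCC"
predicted_states = "HHHEECCH"
print(f"Q3 = {100 * q3_score(observed_states, predicted_states):.1f}%")
```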

Vaccine 3D Structure Modeling

The 3D model of the final vaccine is constructed by Swiss Model (https://swissmodel.expasy.org/interactive). Swiss Model first searches for the best templates. We then select the best template, one with a high GMQE score, to build the 3D model⁵⁸.

Vaccine 3D Structure Refinement

The 3D model built by Swiss Model is refined by GalaxyRefine (http://galaxy.seoklab.org/cgi-bin/submit.cgi?type=REFINE). GalaxyRefine first rebuilds the side chains and performs side-chain repacking, followed by overall structure relaxation through molecular dynamics simulation. According to the CASP10 assessment, the GalaxyRefine server method performed best in improving local structure quality⁶⁰.

Vaccine 3D Structure Validation

The refined 3D model of the final vaccine is validated by the ProSA-web server (https://prosa.services.came.sbg.ac.at/prosa.php). ProSA calculates an overall quality score for a given input structure. If this score falls outside the range characteristic of native proteins, the structure probably contains errors. A plot of local quality scores points to problematic parts of the model, which are also highlighted in a 3D molecule viewer to facilitate their detection⁶¹.

Codon Optimization and In-silico Cloning

The Java Codon Adaptation Tool (JCat) server is used for codon optimization (http://www.prodoric.de/JCat). JCat adapts the codon usage to most sequenced prokaryotic organisms and selected eukaryotic organisms⁶². The optimized codon sequence is inserted into the pET28a(+) vector using SnapGene software.

Data Availability

We obtained the genome sequence and the spike protein sequence of SARS-CoV-2 from the NCBI database (https://www.ncbi.nlm.nih.gov) under accession number MN908947 and protein ID QHD43416.1. The protein data we collected and processed to train DeepVacPred are available on GitHub at https://github.com/zikunyang/DCVST.

Code Availability

The code used for data generation and/or analysis in the study is available on GitHub at https://github.com/zikunyang/DCVST.

Acknowledgements

The authors gratefully acknowledge the support by the National Science Foundation under the Career Award CPS/CNS-1453860, the NSF award under Grant numbers CCF-1837131, MCB-1936775, and CNS-1932620, the U.S. Army Research Office (ARO) under Grant No. W911NF-17-1-0076 and the DARPA Young Faculty Award and DARPA Director Award, under grant number N66001-17-1-4044, and a Northrop Grumman grant. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied by the Defense Advanced Research Projects Agency, the Department of Defense or the National Science Foundation.

Author Contributions

Z.Y., P.B. and S.N. conceived the problem formulation and discussed the computational approach as well as the experimental methodology. Z.Y. designed, implemented, improved and analyzed the experimental results. All authors analyzed the results and improved the manuscript.

Competing Interests

The authors declare no competing interests.

Additional information

Correspondence and requests for materials should be addressed to P.B.

References

1. Wu, J., Leung, K. & Leung, G. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet 395, 689–697, DOI: https://doi.org/10.1016/S0140-6736(20)30260-9 (2020).
2. Zhou, P., Yang, X., Wang, X. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273, DOI: https://doi.org/10.1038/s41586-020-2012-7 (2020).
3. Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet 20, 533–534, DOI: https://doi.org/10.1016/S1473-3099(20)30120-1 (2020).
4. Coronavirus: the first three months as it happened. Nature DOI: https://doi.org/10.1038/d41586-020-00154-w (2020).
5. Shang, W., Yang, Y., Rao, Y. et al. The outbreak of SARS-CoV-2 pneumonia calls for viral vaccines. npj Vaccines 5, DOI: https://doi.org/10.1038/s41541-020-0170-0 (2020).
6. Tay, M. Z., Poh, C. M., Rénia, L. et al. The trinity of COVID-19: immunity, inflammation and intervention. Nat Rev Immunol DOI: https://doi.org/10.1038/s41577-020-0311-8 (2020).
7. Huang, C. et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet 395, 497–506, DOI: https://doi.org/10.1016/S0140-6736(20)30183-5 (2020).
8. Chen, Y., Liu, Q. & Guo, D. Emerging coronaviruses: Genome structure, replication, and pathogenesis. Med. Virol. 92, 418–423, DOI: https://doi.org/10.1002/jmv.25681 (2020).
9. Gewin, V. On the front lines of the coronavirus-vaccine battle. Nature DOI: https://doi.org/10.1038/d41586-020-01116-y (2020).
10. Callaway, E. The race for coronavirus vaccines: a graphical guide. Nature 580, 576–577, DOI: https://doi.org/10.1038/d41586-020-01221-y (2020).
11. Graham, B. Advances in Antiviral vaccine development. Rev. 255, 230–242, DOI: https://doi.org/10.1111/imr.12098 (2013).
12. Gandon, S., Mackinnon, M., Nee, S. & Read, F. Imperfect vaccines and the evolution of pathogen virulence. Nature 414, 751–756, DOI: https://doi.org/10.1038/414751a (2001).
13. Gao, Q. et al. Rapid development of an inactivated vaccine for SARS-CoV-2. bioRxiv DOI: https://doi.org/10.1101/2020.04.17.046375 (2020).
14. Kim, Y. C., Dema, B. & Reyes-Sandoval, A. COVID-19 vaccines: breaking record times to first-in-human trials. npj Vaccines 5, DOI: https://doi.org/10.1038/s41541-020-0188-3 (2020).
15. Oany, A., Emran, A. & Jyoti, T. Design of an epitope-based peptide vaccine against spike protein of human coronavirus: an in silico approach. Drug Des Devel Ther 1139–1149, DOI: https://doi.org/10.2147/DDDT.S67861 (2014).
16. Feng, Y. et al. Multi-epitope vaccine design using an immunoinformatics approach for 2019 novel coronavirus in China (SARS-CoV-2). bioRxiv DOI: https://doi.org/10.1101/2020.03.03.962332 (2020).
17. Zhang, L. Multi-epitope vaccines: a promising strategy against tumors and viral infections. Cell Mol Immunol 15, 182–184, DOI: https://doi.org/10.1038/cmi.2017.92 (2017).
18. Lan, J., Ge, J., Yu, J. et al. Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature DOI: https://doi.org/10.1038/s41586-020-2180-5 (2020).
19. Lu, R. et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet 395, 565–574, DOI: https://doi.org/10.1016/S0140-6736(20)30251-8 (2020).
20. Shokeen, K., Pandey, S., Shah, M. & Kumar, S. Insight towards the effect of the multibasic cleavage site of SARS-CoV-2 spike protein on cellular proteases. bioRxiv DOI: https://doi.org/10.1101/2020.04.25.061507 (2020).
21. Purcell, A., McCluskey, J. & Rossjohn, J. More than one reason to rethink the use of peptides in vaccine design. Rev. Drug Discov. 6, 404–414, DOI: https://doi.org/10.1038/nrd2224 (2007).
22. Callaway, E. Scores of coronavirus vaccines are in competition — how will scientists choose the best? Nature DOI: https://doi.org/10.1038/d41586-020-01247-2 (2020).
23. Mascola, J. R. & Fauci, A. S. Novel vaccine technologies for the 21st century. Nat Rev Immunol 20, 87–88, DOI: https://doi.org/10.1038/s41577-019-0243-3 (2020).
24. Jespersen, M., Peters, B., Nielsen, M. & Marcatili, P. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 45, W24–W29, DOI: https://doi.org/10.1093/nar/gkx346 (2017).
25. Nielsen, M. et al. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PLoS One 2, article e796, DOI: https://doi.org/10.1371/journal.pone.0000796 (2007).
26. Zhu, X. & Goldberg, A. Introduction to Semi-Supervised Learning. Morgan Claypool Publ. DOI: https://doi.org/10.2200/S00196ED1V01Y200906AIM006 (2009).
27. Ahmad, T., Eweida, A. & El-Sayed, L. T-cell epitope mapping for the design of powerful vaccines. Anal Chim Acta 6, 13–22, DOI: https://doi.org/10.1016/j.vacrep.2016.07.002 (2016).
28. Heinson, A. et al. Enhancing the Biological Relevance of Machine Learning Classifiers for Reverse Vaccinology. J. Mol. Sci. 18, 312, DOI: https://doi.org/10.3390/ijms18020312 (2017).
29. Agesen, O. The Cartesian Product Algorithm. 9th Eur. Conf. DOI: https://doi.org/10.1007/3-540-49538-X_2 (1995).
30. Hellberg, S., Sjoestroem, M., Skagerberg, B. & Wold, S. Peptide quantitative structure-activity relationships, a multivariate approach. Chem. Soc. 30, 1126–1135, DOI: https://doi.org/10.1021/jm00390a003 (1987).
31. Wold, S., Jonsson, J., Sjöström, M., Sandberg, M. & Rännar, S. DNA and peptide sequences and chemical processes multivariately modeled by principal component analysis and partial least squares projections to latent structures. Anal Chim Acta 277, 239–253, DOI: https://doi.org/10.1016/0003-2670(93)80437-P (1993).
32. Calders, T. & Jaroszewicz, S. Efficient AUC Optimization for Classification. Discov. Databases 4702, DOI: https://doi.org/10.1007/978-3-540-74976-9_8 (2007).
33. Doytchinova, I. A. & Flower, D. R. VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinforma. 8, DOI: https://doi.org/10.1186/1471-2105-8-4 (2007).
34. Wu, F. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269, DOI: https://doi.org/10.1038/s41586-020-2008-3 (2020).
35. Patronov, A. & Doytchinova, I. T-cell epitope vaccine design by immunoinformatics. Open Biol. 3, DOI: https://doi.org/10.1098/rsob.120139 (2013).
36. Sanchez-Trincado, J., Gomez-Perosanz, M. & Reche, P. Fundamentals and Methods for T- and B-Cell Epitope Prediction. Immunol. Res. DOI: https://doi.org/10.1155/2017/2680160 (2017).
37. Yao, B., Zhang, L., Liang, S. & Zhang, C. SVMTriP: A Method to Predict Antigenic Epitopes Using Support Vector Machine to Integrate Tri-Peptide Similarity and Propensity. PLoS One 7, e45152, DOI: https://doi.org/10.1371/journal.pone.0045152 (2012).
38. Saha, S. & Raghava, G. P. S. Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65, 40–48, DOI: https://doi.org/10.1002/prot.21078 (2006).
39. El-Manzalawy, Y., Dobbs, D. & Honavar, V. Predicting linear B-cell epitopes using string kernels. J Mol Recognit 21, 243–255, DOI: https://doi.org/10.1002/jmr.893 (2008).
40. Almofti, Y., Abd-elrahman, K., Gassmallah, S. & Salih, M. Multi Epitopes Vaccine Prediction against Severe Acute Respiratory Syndrome (SARS) Coronavirus Using Immunoinformatics Approaches. J. Microbiol. Res. 6, 94–114, DOI: https://doi.org/10.12691/ajmr-6-3-5 (2018).
41. Jurtz, V. et al. NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. J Immunol 199, 3360–3368, DOI: https://doi.org/10.4049/jimmunol.1700893 (2017).
42. Nielsen, M. et al. Quantitative predictions of peptide binding to any HLA-DR molecule of known sequence: NetMHCIIpan. PLoS Comput. Biol. 4, article e1000107, DOI: https://doi.org/10.1371/journal.pcbi.1000107 (2008).
43. Reynisson, B. et al. Improved Prediction of MHC II Antigen Presentation through Integration and Motif Deconvolution of Mass Spectrometry MHC Eluted Ligand Data. Proteome Res. Article ASAP, DOI: https://doi.org/10.1021/acs.jproteome.9b00874 (2020).
44. Bui, H. H. et al. Predicting population coverage of T-cell epitope-based diagnostics and vaccines. BMC bioinformatics 7, DOI: https://doi.org/10.1186/1471-2105-7-153 (2006).
45. Man, L., Jiang, Y., Gong, T., Zhang, Z. & Sun, X. Intranasal Vaccination against HIV-1 with Adenoviral Vector-Based Nanocomplex Using Synthetic TLR-4 Agonist Peptide as Adjuvant. Pharm. 13, 885–894, DOI: https://doi.org/10.1021/acs.molpharmaceut.5b00802 (2016).
46. Diedrich, G. et al. Ribosomal protein L2 is involved in the association of the ribosomal subunits, tRNA binding to A and P sites and peptidyl transfer. EMBO J. 19, 5241–5250, DOI: https://doi.org/10.1093/emboj/19.19.5241 (2000).
47. Singh, M. & O’Hagan, D. Advances in vaccine adjuvants. Nat Biotechnol 17, 1075–1081, DOI: https://doi.org/10.1038/15058 (1999).
48. Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. Des. Sel. 14, 529–532, DOI: https://doi.org/10.1093/protein/14.8.529 (2001).
49. Crowe, J., Masone, B. S. & Ribbe, J. One-step purification of recombinant proteins with the 6xHis tag and Ni-NTA resin. Mol Biotechnol 4, 247–258, DOI: https://doi.org/10.1007/BF02779018 (1995).
50. Ong, E. et al. Vaxign-ML: supervised machine learning reverse vaccinology model for improved prediction of bacterial protective antigens. Bioinformatics DOI: https://doi.org/10.1093/bioinformatics/btaa119 (2020).
51. Iwasaki, A. & Yang, Y. The potential danger of suboptimal antibody responses in COVID-19. Nat Rev Immunol DOI: https://doi.org/10.1038/s41577-020-0321-6 (2020).
52. Gupta, S. et al. In Silico Approach for Predicting Toxicity of Peptides and Proteins. PLoS ONE 8, e73597, DOI: https://doi.org/10.1371/journal.pone.0073957 (2013).
53. Gasteiger, E. et al. John M. Walker: Protein Identification and Analysis Tools on the ExPASy Server. The Proteomics Protoc. Handb. 571–607, DOI: https://doi.org/10.1385/1592598900 (2005).
54. Pandey, A. M. et al. Exploring dengue genome to construct a multi-epitope based subunit vaccine by utilizing immunoinformatics approach to battle against dengue infection. Sci Rep 7, DOI: https://doi.org/10.1038/s41598-017-09199-w (2017).
55. McGuffin, L. J., Bryson, K. & Jones, D. T. The PSIPRED protein structure prediction server. Bioinformatics 16, 1511–1522, DOI: https://doi.org/10.1093/bioinformatics/16.4.404 (2000).
56. Källberg, M. et al. Template-based protein structure modeling using the RaptorX web server. Protoc. 7, 1511–1522, DOI: https://doi.org/10.1038/nprot.2012.085 (2012).
57. Wang, S., Li, W., Liu, S. & Xu, J. RaptorX-Property: a web server for protein structure property prediction. Nucleic Acids Res. 44, W430–W435, DOI: https://doi.org/10.1093/nar/gkw306 (2016).
58. Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5, 725–738, DOI: https://doi.org/10.1038/nprot.2010.5 (2010).
59. Benkert, P., Biasini, M. & Schwede, T. Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics 27, 343–350, DOI: https://doi.org/10.1093/bioinformatics/btq662 (2011).
60. Heo, L., Park, H. & Seok, C. GalaxyRefine: protein structure refinement driven by side-chain repacking. Nucleic Acids Res 41, W384–W388, DOI: https://doi.org/10.1093/nar/gkt458 (2013).
61. Wiederstein, M. & Sippl, M. J. ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res 35, W407–W410, DOI: https://doi.org/10.1093/nar/gkm290 (2007).
62. Grote, A. et al. JCat: A Novel Tool to Adapt Codon Usage of a Target Gene to Its Potential Expression Host. Nucleic Acids Res 33, W526–31, DOI: https://doi.org/10.1093/nar/gki376 (2005).
63. Banerjee, A. K., Begum, F. & Ray, U. Mutation Hot Spots in Spike Protein of COVID-19. Preprints 2020, 2020040281, DOI: https://doi.org/10.20944/preprints202004.0281.v1 (2020).
64. Ivan, D. et al. AllergenFP: allergenicity prediction by descriptor fingerprints. Bioinformatics 6, 846–851, DOI: https://doi.org/10.1093/bioinformatics/btt619 (2014).

Table 1. DeepVacPred Validation

Validation	AUC	Threshold	Accuracy	Sensitivity	Specificity
Train set	0.9999	0.32	0.995	0.99	0.99
Test set	0.9703	0.5	0.95	0.95	0.95
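
The accuracy, sensitivity and specificity columns of Table 1 depend on the decision threshold applied to the network's output score. The sketch below shows how these three quantities follow from a set of scores once a threshold is fixed. The labels and scores are made-up toy values, and treating a score greater than or equal to the threshold as a positive prediction is our assumption.

```python
def confusion_counts(labels, scores, threshold):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(labels, scores, threshold):
    """Accuracy, sensitivity and specificity as fractions between 0 and 1."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy data: 1 = positive example, 0 = negative example.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]

for threshold in (0.32, 0.5):
    print(threshold, classification_metrics(toy_labels, toy_scores, threshold))
```

Lowering the threshold trades specificity for sensitivity, which is why the threshold is reported alongside the other metrics.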

Table 2. Vaccine 3D Structure Swiss Model Prediction Results

Models	Templates	QMEAN Z-score	Clash Score	Ramachandran Favoured %
Model 1	5myj.1.6	-5.64	5.4	78.94
Model 2	3j3v.1.G	-6.40	3.89	80.07
Model 3	3j9w.1.8	-0.73	2.08	97.08
Model 4	5nd9.1.Y	-2.77	1.16	89.42

Table 3. Spike Protein Mutations. Occurrence is the number of isolates that showed the mutation. Region is the origin of the isolates.

Mutations	Occurrence	Regions
G476S	3	Washington
V483A	6	Washington
D614G	116	Washington, Los Angeles, New York, South America, Europe

Table 4. DeepVacPred first round prediction results. Here we show the number of predicted vaccine subunits for each location.

Location	Proteins	Start	End	Number of Vaccine Subunits
Location 1	spike	6	36	2
Location 2	spike	53	104	3
Location 3	spike	105	167	8
Location 4	spike	206	322	22
Location 5	spike	352	585	30
Location 6	spike	601	741	19
Location 7	spike	751	862	17
Location 8	spike	878	981	16
Location 9	spike	1034	1063	1
Location 10	spike	1057	1186	12
Location 11	spike	1188	1218	2

Table 5. DeepVacPred second round prediction results. Here we get 26 vaccine subunits for further evaluation and construction.

Vaccine Subunits	Protein	Start	End	Peptide Sequence
Subunit 1	Spike	19	48	TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL
Subunit 2	Spike	34	63	RGVYYPDKVFRSSVLHSTQDLFLPFFSNVT
Subunit 3	Spike	71	100	SGTNGTKRFDNPVLPFNDGVYFASTEKSNI
Subunit 4	Spike	141	170	LGVYYHKNNKSWMESEFRVYSSANNCTFEY
Subunit 5	Spike	191	220	FVFKNIDGYFKIYSKHTPINLVRDLPQGFS
Subunit 6	Spike	209	238	PINLVRDLPQGFSALEPLVDLPIGINITRF
Subunit 7	Spike	306	335	FTVEKGIYQTSNFRVQPTESIVRFPNITNL
Subunit 8	Spike	359	388	SNCVADYSVLYNSASFSTFKCYGVSPTKLN
Subunit 9	Spike	402	431	IRGDEVRQIAPGQTGKIADYNYKLPDDFTG
Subunit 10	Spike	439	468	NNLDSKVGGNYNYLYRLFRKSNLKPFERDI
Subunit 11	Spike	480	509	CNGVEGFNCYFPLQSYGFQPTNGVGYQPYR
Subunit 12	Spike	510	539	VVVLSFELLHAPATVCGPKKSTNLVKNKCV
Subunit 13	Spike	584	613	ILDITPCSFGGVSVITPGTNTSNQVAVLYQ
Subunit 14	Spike	626	655	ADQLTPTWRVYSTGSNVFQTRAGCLIGAEH
Subunit 15	Spike	655	684	HVNNSYECDIPIGAGICASYQTQTNSPRRA
Subunit 16	Spike	697	726	MSLGAENSVAYSNNSIAIPTNFTISVTTEI
Subunit 17	Spike	709	738	NNSIAIPTNFTISVTTEILPVSMTKTSVDC
Subunit 18	Spike	773	802	EQDKNTQEVFAQVKQIYKTPPIKDFGGFNF
Subunit 19	Spike	805	834	LPDPSKPSKRSFIEDLLFNKVTLADAGFIK
Subunit 20	Spike	866	895	TDEMIAQYTSALLAGTITSGWTFGAGAALQ
Subunit 21	Spike	946	975	GKLQDVVNQNAQALNTLVKQLSSNFGAISS
Subunit 22	Spike	1017	1046	EIRASANLAATKMSECVLGQSKRVDFCGKG
Subunit 23	Spike	1034	1063	LGQSKRVDFCGKGYHLMSFPQSAPHGVVFL
Subunit 24	Spike	1094	1123	VFVSNGTHWFVTQRNFYEPQIITTDNTFVS
Subunit 25	Spike	1156	1185	FKNHTSPDVDLGDISGINASVVNIQKEIDR
Subunit 26	Spike	1179	1208	IQKEIDRLNEVAKNLNESLIDLQELGKYEQ

Table 6. Linear B-cell Epitopes Prediction Results. Here we show the selected 14 vaccine subunits, the B-cell epitopes they contained and their Emini scores.

Vaccine Subunits	Protein	Start	End	Peptide Sequence	B-cell Epitopes	Emini Score
Subunit 1	Spike	19	48	TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL	TTRTQLPPAYTNSF	1.937
Subunit 3	Spike	71	100	SGTNGTKRFDNPVLPFNDGVYFASTEKSNI	NGTKRFD	2.678
					KSNI	1.395
Subunit 4	Spike	141	170	LGVYYHKNNKSWMESEFRVYSSANNCTFEY	YYHKNNKS	3.544
Subunit 5	Spike	191	220	FVFKNIDGYFKIYSKHTPINLVRDLPQGFS	HTPIN	1.207
Subunit 9	Spike	402	431	IRGDEVRQIAPGQTGKIADYNYKLPDDFTG	EVRQIAPGQTGKIADYNYK	1.775
Subunit 10	Spike	439	468	NNLDSKVGGNYNYLYRLFRKSNLKPFERDI	NNLDSKV	1.508
					LFRKSN	2.403
Subunit 13	Spike	584	613	ILDITPCSFGGVSVITPGTNTSNQVAVLYQ	GTNTSN	1.888
Subunit 15	Spike	655	684	HVNNSYECDIPIGAGICASYQTQTNSPRRA	HVNNSY	1.460
					YQTQTNSPRRAR	3.849
Subunit 18	Spike	773	802	EQDKNTQEVFAQVKQIYKTPPIKDFGGFNF	QDKNTQ	4.752
					KQIYKTPPI	2.243
Subunit 19	Spike	805	834	LPDPSKPSKRSFIEDLLFNKVTLADAGFIK	LPDPSKPSKR	3.136
Subunit 23	Spike	1034	1063	LGQSKRVDFCGKGYHLMSFPQSAPHGVVFL	GQSKRVDFC	1.098
					FPQSAPH	1.001
Subunit 24	Spike	1094	1123	VFVSNGTHWFVTQRNFYEPQIITTDNTFVS	FYEPQIITTD	1.627
Subunit 25	Spike	1156	1185	FKNHTSPDVDLGDISGINASVVNIQKEIDR	DKYFKNHTSPDVDLGDIS	1.833
					IQKEIDR	1.666
Subunit 26	Spike	1179	1208	IQKEIDRLNEVAKNLNESLIDLQELGKYEQ	IQKEIDR	1.666
					ELGKY	2.802

Table 7. CTL Epitopes Prediction Results.

Subunits	Peptide Sequence	CTL Epitopes	HLA Class I alleles and supertypes	HLA Score
Subunit 1	TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL	9	A1, A2, A24, A26, B7, B8, B27, B39, B58, B62	4.652
Subunit 3	SGTNGTKRFDNPVLPFNDGVYFASTEKSNI	6	A1, A3, A24, B7, B27, B39, B62	2.492
Subunit 4	LGVYYHKNNKSWMESEFRVYSSANNCTFEY	9	A1, A3, A24, A26, B39, B40, B58, B62	6.124
Subunit 5	FVFKNIDGYFKIYSKHTPINLVRDLPQGFS	9	A1, A2, A24, A26, B7, B8, B27, B39, B58, B62	7.131
Subunit 9	IRGDEVRQIAPGQTGKIADYNYKLPDDFTG	6	A2, A3, B7, B27, B62	3.092
Subunit 10	NNLDSKVGGNYNYLYRLFRKSNLKPFERDI	9	A1, A3, A24, B8, B27, B39, B62	4.326
Subunit 13	ILDITPCSFGGVSVITPGTNTSNQVAVLYQ	5	A1, A3, A24, B8, B27, B39, B62	5.837
Subunit 15	HVNNSYECDIPIGAGICASYQTQTNSPRRA	3	A1, B7, B40, B62	0.211
Subunit 18	EQDKNTQEVFAQVKQIYKTPPIKDFGGFNF	7	A1, A2, A3, A24, A26, B8, B39, B40, B62	4.282
Subunit 19	LPDPSKPSKRSFIEDLLFNKVTLADAGFIK	8	A1, A2, A3, A24, B7, B8, B27, B39, B40, B58, B62	5.763
Subunit 23	LGQSKRVDFCGKGYHLMSFPQSAPHGVVFL	8	A1, A2, A3, A24, A26, B7, B8, B39, B58, B62	6.167
Subunit 24	VFVSNGTHWFVTQRNFYEPQIITTDNTFVS	8	A2, A3, A24, A26, B27, B39, B58, B62	5.66
Subunit 25	FKNHTSPDVDLGDISGINASVVNIQKEIDR	4	A2, A26, B39	1.341
Subunit 26	IQKEIDRLNEVAKNLNESLIDLQELGKYEQ	5	A1, A2, B7, B8, B40, B62	3.26

Table 8. HTL Epitopes Prediction Results.

Subunits	Peptide Sequence	HTL Epitopes	HLA Class II (HLA-DRB1*:01) alleles	HLA Score
Subunit 1	TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL	9	01, 03, 04, 07, 08, 09, 10, 11, 13, 15, 16	18.031
Subunit 3	SGTNGTKRFDNPVLPFNDGVYFASTEKSNI	10	01, 04, 07, 08, 09, 10, 12, 13, 14, 15	9.07
Subunit 4	LGVYYHKNNKSWMESEFRVYSSANNCTFEY	9	04, 08, 10, 11, 13, 15, 16	7.38
Subunit 5	FVFKNIDGYFKIYSKHTPINLVRDLPQGFS	14	01, 03, 04, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16	26.785
Subunit 9	IRGDEVRQIAPGQTGKIADYNYKLPDDFTG	7	01, 07, 09, 10, 14	4.932
Subunit 10	NNLDSKVGGNYNYLYRLFRKSNLKPFERDI	8	07, 08, 11, 13, 14, 16	12.14
Subunit 13	ILDITPCSFGGVSVITPGTNTSNQVAVLYQ	2	10	0.618
Subunit 15	HVNNSYECDIPIGAGICASYQTQTNSPRRA	4	01, 03, 04, 09, 10, 16	3.986
Subunit 18	EQDKNTQEVFAQVKQIYKTPPIKDFGGFNF	9	03, 04, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16	21.858
Subunit 19	LPDPSKPSKRSFIEDLLFNKVTLADAGFIK	8	03, 04, 08, 09, 10, 11, 14	5.479
Subunit 23	LGQSKRVDFCGKGYHLMSFPQSAPHGVVFL	4	01, 04, 08, 10, 11	2.996
Subunit 24	VFVSNGTHWFVTQRNFYEPQIITTDNTFVS	8	03, 04, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16	11.56
Subunit 25	FKNHTSPDVDLGDISGINASVVNIQKEIDR	8	01, 04, 07, 08, 09, 10, 11, 12, 13, 14, 15	11.925
Subunit 26	IQKEIDRLNEVAKNLNESLIDLQELGKYEQ	6	08, 11, 12, 14	3.489

Table 9. Worldwide Human Population Coverage Analysis Results.

Vaccine Subunits	Protein	Start	End	Peptide Sequence	Population Coverage (Worldwide) %
Subunit 1	Spike	19	48	TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL	96.95
Subunit 3	Spike	71	100	SGTNGTKRFDNPVLPFNDGVYFASTEKSNI	83.02
Subunit 4	Spike	141	170	LGVYYHKNNKSWMESEFRVYSSANNCTFEY	81.74
Subunit 5	Spike	191	220	FVFKNIDGYFKIYSKHTPINLVRDLPQGFS	97.04
Subunit 9	Spike	402	431	IRGDEVRQIAPGQTGKIADYNYKLPDDFTG	77.19
Subunit 10	Spike	439	468	NNLDSKVGGNYNYLYRLFRKSNLKPFERDI	78.51
Subunit 13	Spike	584	613	ILDITPCSFGGVSVITPGTNTSNQVAVLYQ	61.44
Subunit 15	Spike	655	684	HVNNSYECDIPIGAGICASYQTQTNSPRRA	68.94
Subunit 18	Spike	773	802	EQDKNTQEVFAQVKQIYKTPPIKDFGGFNF	90.19
Subunit 19	Spike	805	834	LPDPSKPSKRSFIEDLLFNKVTLADAGFIK	76.12
Subunit 23	Spike	1034	1063	LGQSKRVDFCGKGYHLMSFPQSAPHGVVFL	68.38
Subunit 24	Spike	1094	1123	VFVSNGTHWFVTQRNFYEPQIITTDNTFVS	94.90
Subunit 25	Spike	1156	1185	FKNHTSPDVDLGDISGINASVVNIQKEIDR	87.47
Subunit 26	Spike	1179	1208	IQKEIDRLNEVAKNLNESLIDLQELGKYEQ	76.72

Table 10. Antigenicity Evaluation Results. The Vaxijen Score for the whole final vaccine is 0.5705 with a virus model at a threshold of 0.4.

Vaccine Subunits	Protein	Start	End	Peptide Sequence	Vaxijen Score (Virus Model)
Adjuvant					0.7447
Subunit 1	Spike	19	48	TTRTQLPPAYTNSFTRGVYYPDKVFRSSVL	0.2486
Subunit 3	Spike	71	100	SGTNGTKRFDNPVLPFNDGVYFASTEKSNI	0.4791
Subunit 4	Spike	141	170	LGVYYHKNNKSWMESEFRVYSSANNCTFEY	0.3891
Subunit 5	Spike	191	220	FVFKNIDGYFKIYSKHTPINLVRDLPQGFS	0.4757
Subunit 10	Spike	439	468	NNLDSKVGGNYNYLYRLFRKSNLKPFERDI	0.3615
Subunit 13	Spike	584	613	ILDITPCSFGGVSVITPGTNTSNQVAVLYQ	0.8318
Subunit 18	Spike	773	802	EQDKNTQEVFAQVKQIYKTPPIKDFGGFNF	0.2449
Subunit 19	Spike	805	834	LPDPSKPSKRSFIEDLLFNKVTLADAGFIK	0.3605
Subunit 23	Spike	1034	1063	LGQSKRVDFCGKGYHLMSFPQSAPHGVVFL	0.6713
Subunit 24	Spike	1094	1123	VFVSNGTHWFVTQRNFYEPQIITTDNTFVS	0.4012
Subunit 25	Spike	1156	1185	FKNHTSPDVDLGDISGINASVVNIQKEIDR	0.6035
6xHis Tag				HHHHHH	0.6280

Table 11. Toxicity and Physicochemical Properties Prediction Results

	Toxicity	Hydropathicity	Charge	Half-life (in vitro)	Half-life (in vivo)	Instability Index	Stability	pI	Mol. Weight (Da)
Final Vaccine	NT	-0.521	37.00	30h	>20h	34.01	Yes	9.76	76428.68
Adjuvant	NT	-0.679	28.00	30h	>20h	38.94	Yes	10.30	30396.93
Subunit 1	NT	-0.510	3.00	7.2h	>20h	34.35	Yes	9.99	3465.91
Subunit 3	NT	-0.670	0.00	1.9h	>20h	45.82	Yes	5.84	3277.00
Subunit 4	NT	-0.880	0.50	5.5h	3min	69.83	No	6.75	3668.46
Subunit 5	NT	-0.170	2.50	1.1h	3min	18.96	Yes	9.40	3545.56
Subunit 10	NT	-1.053	3.00	1.4h	3min	7.15	Yes	9.71	3635.55
Subunit 13	NT	-0.010	-1.0	20h	30min	1.99	Yes	3.80	3095.51
Subunit 18	NT	-0.897	0.00	1h	30min	25.35	Yes	6.31	3518.40
Subunit 19	NT	-0.183	1.00	5.5h	3min	67.50	No	8.43	3348.34
Subunit 23	NT	-0.050	3.00	5.5h	3min	38.38	Yes	9.20	3307.31
Subunit 24	NT	-0.150	-0.50	100h	>20h	17.10	Yes	5.33	3548.92
Subunit 25	NT	-0.450	-1.50	1.1h	3min	24.99	Yes	7.75	3283.07
6xHis Tag	NT	-3.20	0.00	3.5h	10min	8.33	Yes	7.21	840.86
