Protein-coding features of the SARS-CoV-2 genome
A map of the predicted ORFs, based on the genome sequence of the Wuhan-Hu-1 isolate (NCBI Reference Sequence: NC_045512.2), is shown in Supplementary Figure 1. The genomic organization of SARS-CoV-2 shares features with other coronaviruses, including SARS-CoV, MERS-CoV, and HCoV-NL63. All of these coronaviruses contain recognizable ORFs encoding the replicase (orf1ab polyprotein), surface glycoprotein (spike protein), envelope protein, membrane glycoprotein, nucleocapsid protein, and several non-structural proteins (NSPs). The conserved domains of the proteins encoded by the SARS-CoV-2 genome are summarized in Supplementary Table 1. The spike protein mediates the specific binding of the virion to its receptor on the host cell membrane. Most of the spike protein structure lies outside the virus particle [14], which makes it an ideal target for B-cell epitope screening. Compared with the spike protein, the nucleocapsid protein is more conserved among the selected coronaviruses (Figure 2). Though unable to induce humoral immunity, the nucleocapsid protein of SARS-CoV and MERS-CoV has been experimentally shown to be a robust immunogen that induces a cytotoxic T-lymphocyte (CTL)-mediated response [23, 24], which suggests that the nucleocapsid protein of SARS-CoV-2 could be a good candidate for T-cell epitope prediction.
Sequence analysis of spike protein and nucleocapsid protein in selected coronaviruses
To better understand the characteristics of SARS-CoV-2, we compared its protein sequences with those of other selected coronaviruses. All protein sequences were downloaded from the NCBI database; the accession IDs are listed in Figure 2A. Overall sequence identities and the phylogenetic trees are presented in Figure 2B and 2C. Consistent with a recently published study [25], we found that both the spike and nucleocapsid proteins of SARS-CoV-2 are most closely related to those of SARS-CoV among the selected coronaviruses. The protein domains of the spike and nucleocapsid proteins (Figure 2D) were drawn based on previous studies of SARS-CoV [26, 27] and on the protein alignments generated in the current study (supplementary files: N align.clustal_num; S align.clustal_num). The amino acid sequence identities confirmed a high similarity between SARS-CoV-2 and SARS-CoV. As anticipated, the nucleocapsid protein is more conserved among the selected coronaviruses than the spike protein.
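The identity comparison described above can be illustrated with a minimal sketch that computes percent identity between two rows of an existing alignment. It assumes identity is counted over columns where neither sequence has a gap; the alignment tool used in this study may apply a different denominator. The function name and the short aligned fragments are hypothetical and are not taken from the supplementary alignment files.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of one alignment.

    Assumption: only columns where neither row has a gap are compared.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy aligned fragments (hypothetical, for illustration only).
    row_a = "MSDN-GPQ"
    row_b = "MSDNAGPR"
    print(f"{percent_identity(row_a, row_b):.1f}% identity")
```

Counting only ungapped columns keeps the value independent of terminal overhangs, at the cost of ignoring insertions and deletions; dividing by the full alignment length is an equally common convention.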
B-cell epitope recognition
The full-length sequence of the spike protein was scanned for putative sequential B-cell epitopes with two bioinformatics programs. A total of 28 non-overlapping peptides were identified by the ABCpred server with the threshold set at 0.85 (Supplementary Table 2). For sequential B-cell epitope prediction on the BepiPred-2.0 server, a threshold value of 0.5 was applied and 35 peptides were predicted (Supplementary Table 3). Antigenicity was calculated with the VaxiJen 2.0 server, and the peptides with the highest antigenicity scores were selected (Tables 1 and 2). The structure of the SARS-CoV-2 spike protein was recently resolved by cryo-electron microscopy (cryo-EM) [14], which could greatly facilitate the process of vaccine development. The predicted epitopes in Tables 1 and 2 were highlighted as spheres in the monomer structure of the spike protein viewed with PyMOL (Supplementary Figure 2). While most predicted epitopes were exposed on the surface of the spike monomer, only the epitopes Spike315-324(TSNFRVQPTE), Spike333-338(TNLCPF), Spike648-663(GCLIGAEHVNNSYECD), and Spike1064-1079(HVTYVPAQEKNFTTAP) displayed good surface accessibility in the spike trimer (Figure 3 and Supplementary File: B-cell-epitope-animation.ppt), the form more likely to exist in nature. Conformation-based B-cell epitopes were computed on the DiscoTope 2.0 server [16]. A threshold value of -1.0 was chosen for the computation, which corresponds to a specificity of 85% and a sensitivity of 30%. The contact number, propensity score, and DiscoTope score for each amino acid that passed the threshold are presented in Table 3. The positions of these residues were viewed with PyMOL and highlighted as spheres (Figure 4). Combining B-cell epitope scanning with peptide analysis yielded 4 potent linear epitopes and 10 residues involved in discontinuous epitope formation.
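The thresholding step used for sequential epitope prediction can be sketched as grouping consecutive residues whose score reaches a cutoff into candidate peptides. This is only an illustration of the idea, not the internal procedure of ABCpred or BepiPred-2.0: the function name, the minimum-length filter, the use of "greater than or equal to" at the cutoff, and the per-residue scores below are all assumptions.

```python
from typing import List, Sequence, Tuple


def contiguous_epitopes(
    sequence: str,
    scores: Sequence[float],
    threshold: float = 0.5,
    min_len: int = 1,
) -> List[Tuple[int, int, str]]:
    """Group consecutive residues scoring >= threshold into segments.

    Returns (start, end, peptide) tuples with 1-based, inclusive positions.
    """
    if len(sequence) != len(scores):
        raise ValueError("need exactly one score per residue")

    segments = []
    start = None  # 0-based index where the current run began
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                segments.append((start + 1, i, sequence[start:i]))
            start = None

    # Close a run that extends to the end of the sequence.
    if start is not None and len(sequence) - start >= min_len:
        segments.append((start + 1, len(sequence), sequence[start:]))
    return segments


if __name__ == "__main__":
    # Hypothetical scores for a short fragment; not server output.
    fragment = "TSNFRVQPTE"
    toy_scores = [0.6, 0.7, 0.2, 0.55, 0.51, 0.5, 0.1, 0.9, 0.9, 0.9]
    for start, end, peptide in contiguous_epitopes(fragment, toy_scores, 0.5, 3):
        print(start, end, peptide)
```

The positions returned are relative to the input string, so an offset would have to be added to recover numbering in the full-length protein.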
T-cell epitope recognition
In this study, the IEDB server was used with its recommended prediction methods (a combination of ANN, SMM, CombLib, and NetMHCpan EL for HLA-1 binding prediction, and a combination of NN-align, SMM-align, CombLib, Sturniolo, and NetMHCIIpan for HLA-2 binding prediction).
For HLA-1 binding peptide prediction, the top 50% scoring peptides were retained for further analysis. A total of 81 non-redundant peptides with an ANN IC50 value of 500 nM or lower, indicative of medium or stronger binding affinity, were identified (Supplementary Table 4). The six peptides with the highest antigenicity scores from VaxiJen 2.0 were chosen for the next step, in which we screened for all HLA-1 molecules able to bind these peptides (Table 4). A similar strategy was applied to HLA-2 binding peptide prediction on the IEDB server, and 64 peptides were identified as HLA-2 binding sequences (Supplementary Table 5). The six peptides with the highest antigenicity scores were selected for HLA-2 molecule screening, and the results are presented in Table 5. Within the pool of peptides selected for HLA binding, Nucleocapsid104-112(LSPRWYFYY) was predicted to bind both HLA class I and class II molecules. In addition, this peptide was predicted to bind a large number of HLA molecules, as shown in Tables 4 and 5. The CTL epitope Nucleocapsid66-74(FPRGQGVPI) and the helper T-lymphocyte (Th) epitope Nucleocapsid67-75(PRGQGVPIN) overlap at residues 67-74, which suggests that a sequence containing both epitopes may initiate both CD4+ and CD8+ dependent immune responses.
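The affinity filter and the overlap check described above can be sketched as follows. This is a minimal illustration that assumes the prediction output has already been parsed into (peptide, allele, IC50) records; the record type, function names, and the IC50 values in the example are hypothetical and do not reproduce IEDB output or the supplementary tables.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Prediction(NamedTuple):
    peptide: str
    allele: str
    ic50: float  # predicted IC50; assumed to be expressed in nM


def binders_at_or_below(preds: Iterable[Prediction], cutoff: float = 500.0) -> List[str]:
    """Unique peptides with at least one allele at or below the cutoff,
    in order of first appearance."""
    kept: List[str] = []
    for pred in preds:
        if pred.ic50 <= cutoff and pred.peptide not in kept:
            kept.append(pred.peptide)
    return kept


def shared_positions(
    start_a: int, end_a: int, start_b: int, end_b: int
) -> Optional[Tuple[int, int]]:
    """Overlap of two inclusive residue ranges, or None if they are disjoint."""
    low, high = max(start_a, start_b), min(end_a, end_b)
    return (low, high) if low <= high else None


if __name__ == "__main__":
    # Hypothetical IC50 values, for illustration only.
    toy = [
        Prediction("LSPRWYFYY", "HLA-A*01:01", 20.0),
        Prediction("LSPRWYFYY", "HLA-A*30:02", 45.0),
        Prediction("FPRGQGVPI", "HLA-B*07:02", 12.0),
        Prediction("AAAAAAAAA", "HLA-A*02:01", 9000.0),
    ]
    print(binders_at_or_below(toy))
    # Nucleocapsid66-74 versus Nucleocapsid67-75
    print(shared_positions(66, 74, 67, 75))
```

Keeping the first-appearance order makes the filtered list easy to compare against the original ranking before antigenicity scoring is applied.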
Feature profiling and evaluation of selected T-cell epitopes
Peptide stability, mutation analysis, toxicity, allergenicity, and hydro- and physicochemical features were calculated, and the results are presented in Supplementary Table 6. While none of the listed peptides was predicted to be toxic, a majority of them were predicted to be potentially allergenic. To estimate the probability that the predicted HLA-1 binding peptides induce an immune response, the Class-I Immunogenicity test was performed, and the scores are presented in Table 6. A higher score indicates a higher potential for inducing an immune response.
Peptide selection and multi-epitope-based vaccine design
To induce humoral and cellular immune responses simultaneously, five peptides that together contain four linear B-cell epitopes and three T-cell epitopes (Figure 5A) were selected for vaccine development. To facilitate processing of the T-cell epitopes, the selected peptides from the nucleocapsid protein were extended by 5 amino acid residues at both the N- and C-termini relative to the predicted epitopes. These peptides were searched against the IEDB database. We found that five sequences present in the vaccine construct were identical to experimentally verified epitopes of SARS-CoV. These experimentally determined peptides displayed strong or medium binding affinity to a series of MHC molecules (Supplementary Table 7). These epitopes are therefore likely to confer cross-protective effects against SARS-CoV-2 as well. As shown in Figure 5B, the peptides selected in this study were joined using GPGPG and (GGGGS)2 linkers. The Pan DR epitope (PADRE), a universal Th epitope that activates CD4+ cells [28], was introduced at the N-terminus of the vaccine construct to enhance helper T cell activity. The vaccine construct can be generated as previously reported [29, 30].
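The two construction steps described above, flanking an epitope by 5 residues on each side and joining segments with linkers, can be sketched as below. This is a sketch under stated assumptions rather than the construct of Figure 5B: the function names are hypothetical, the order of segments and the placement of each linker in the example are illustrative only, and the PADRE sequence shown is a commonly cited form that is not given in this text and should be verified before use.

```python
from typing import List, Sequence


def extend_epitope(protein: str, start: int, end: int, flank: int = 5) -> str:
    """Return the epitope plus up to `flank` residues on each side.

    Positions are 1-based and inclusive; flanks are clipped at the termini.
    """
    if not 1 <= start <= end <= len(protein):
        raise ValueError("epitope range lies outside the protein")
    left = max(0, start - 1 - flank)
    right = min(len(protein), end + flank)
    return protein[left:right]


def assemble_construct(segments: Sequence[str], linkers: Sequence[str]) -> str:
    """Join segments in order, inserting linkers[i] between segment i and i+1."""
    if len(linkers) != len(segments) - 1:
        raise ValueError("need exactly one linker between each pair of segments")
    parts: List[str] = [segments[0]]
    for linker, segment in zip(linkers, segments[1:]):
        parts.append(linker)
        parts.append(segment)
    return "".join(parts)


if __name__ == "__main__":
    PADRE = "AKFVAAWTLKAAA"      # assumption: commonly cited PADRE sequence
    GPGPG = "GPGPG"
    GGGGS_X2 = "GGGGS" * 2       # the (GGGGS)2 linker

    # Illustrative order and linker placement only; see Figure 5B for the
    # actual arrangement.
    segments = [PADRE, "TSNFRVQPTE", "TNLCPF"]
    linkers = [GGGGS_X2, GPGPG]
    print(assemble_construct(segments, linkers))

    # Toy parent sequence (hypothetical) to show the 5-residue flanking.
    print(extend_epitope("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 10, 12))
```

Passing the linkers as an explicit list, rather than hard-coding one linker type, keeps the sketch neutral about which linker separates which pair of epitopes.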