Prediction of SARS-CoV2 spike protein epitopes reveals HLA-associated susceptibility

SARS-CoV2 infection is causing a pandemic disease that is reflected in important public health problems worldwide. HLA-based epitope prediction and association with susceptibility provides an important base for treatment design. Hence the aim of this study is to predict the best antigen-presenting HLA-class I and II alleles in top affected populations and determine probable susceptibility associations. A bioinformatic prediction of T cell epitopes and their restricted HLA class I and II alleles was performed to predict immunogenic epitopes and HLA alleles from the spike protein of the SARS-CoV-2 virus, together with molecular modeling analysis and a correlation with the cumulative incidence of COVID-19 infection in 14 countries. Here, we describe a set of ten highly immunogenic epitopes, together with different HLA alleles that can efficiently present these epitopes to T cells. Most of these epitopes are located within the S1 subunit of the spike protein, suggesting that this area is highly immunogenic. A statistical correlation was found between the frequency of HLA-A*02:03 and HLA-A*31:01 and a low cumulative incidence in the selected countries.


Introduction
The coronavirus disease (COVID-19) has been declared a pandemic by the World Health Organization (WHO) and the number of cases and deaths associated with this disease has increased significantly.1 It is estimated that there are over 932,000 confirmed cases and 47,000 deaths, representing a severe global emergency on March 31st, 2020. To date, the US, Italy, and Spain are the most affected countries with the highest total number of confirmed cases and number of deaths.2 Even though the incidence among the total population would be a better estimate to evaluate the pandemic burden in each country.
COVID-19 is a disease generated by the novel severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2) that during its initial outbreak in China in 2020 presented with a wide range of clinical manifestations, like fever (88·7%), cough (67·8%), fatigue (38·1%), and acute respiratory distress syndrome (ARDS) in severe cases. 3 Interestingly, the molecular and clinical manifestations of the disease vary between asymptomatic, mildsymptomatic, and severe patients, requiring in some cases hospitalization to prevent fatal outcomes.4 Currently, the SARS-CoV-2 genome has been characterized as a new Betacoronavius, which shares around 87% of genomic identity to bat-SL-CoVZC45 and bat-SL-CoVZXC21 viruses.5 A more recent analysis by Zhou et al. reported that there is a 96·2% identity with CoV BatCoVRaTG13 and a 79·5% identity with SARS-CoV.3, 6 The genomic characterization of the virus not only provides information about its taxonomy and probable origin but also offers opportunities to perform deeper analysis using bioinformatic tools.
The angiotensin-converting enzyme-2 (ACE-2) receptor and the transmembrane serine protease 2 (TMPRSS2) are essential components of the human host for the virus entry into the upper respiratory epithelial cells. The virus recognizes ACE-2 through the viral spike glycoprotein (S), event that leads to the virus-cell membrane fusion.7 The S glycoprotein is found as a homotrimer of three identical monomers, each one of which is divided into two subunits: S1 and S2. The first subunit folds in four domains: A, B, C, and D. The B domain possesses a receptor-binding domain (RBD) that recognizes ACE2, hence its importance for the viral entry. 8 The S2 subunit sequence has two tandem domains, named HR1 and HR2, that play an essential role in the viral fusion to the membrane.9 Furthermore, analysis of the spike protein showed that it is conserved among SARS-CoV and SARS-CoV-2 with 76·3% of identity and 87·3% of similarity.10 Several viral diseases studies have shown that disease severity is closely associated with some individual factors, such as genetic background and immune response. The human leukocyte antigen HLA is responsible for the antigen presentation to T cells and, therefore, a key component for adaptive immune response initiation. The HLA genes are the most polymorphic genes in the human genome, as these polymorphisms influence the ability to present different sets of epitopes to T cells. Some HLA molecules are more efficient than others presenting certain antigens, what may lead to a better induction of immune responses. This fact has already been proven for some viral diseases like A H1N1 influenza 11 and HIV.12 It has been reported an association between infection and HLA-B*07:03,13 HLA-Cw*08:01,14 HLA-B*46:01, and HLA-B*54:01 for SARS-CoV. Specifically, it has been reported that the individuals who are HLA-B*46:01 positive have a higher risk of severe infection,15 whereas the frequency of HLA-DRB2*03:01 is lower among SARS-CoV patients. 13 The relationship between viral infection, HLA and disease susceptibility has led to vaccine development and molecular epidemiology research that can contribute to novel therapies.
So far, the control of the COVID-19 pandemic remains a challenge, resulting in thousands of new cases and deaths reported daily. It is necessary to find prophylaxis and specific treatment to control the unbridled infection and to diminish the global morbidity and mortality. The generation of a vaccine that targets this virus remains the primary solution.16 However, the lack of knowledge regarding the immune response, such as the HLA-virus interactions, makes it a challenging task.17 Additionally, the genetic variations among different populations and their possible link with SARS-CoV-2 viral responses remain unknown. In this report, we analyzed which epitopes of the spike protein of SARS-CoV2 are highly immunogenic and able to be presented by HLA class I and II in different populations using bioinformatic tools. We also demonstrate an epidemiological correlation between HLA allele frequency and cumulative incidence of COVID-19 among countries.

Study design
A bioinformatic epitope prediction of the spike glycoprotein was performed. This gave information about the most immunogenic peptide-HLA matches and the HLA alleles that more likely present these epitopes in an efficient way. An ecological study was performed for the evaluation of HLA allele types and the cumulative incidences among countries from the day on which the first case was reported to March 31st, 2020.

Variables definition
Cumulative incidence was calculated as the total of confirmed cases from day one per one million inhabitants using the total population of each country in 2020 reported by the United Nations Population Division.

Bioinformatic epitope prediction
Bioinformatic analyses were performed to predict HLA class I and II epitopes using the sequence of the SARS-CoV-2 Spike protein. The sequence for SARS-CoV-2 Spike Glycoprotein was obtained from the GenBank with the accession number QHR63290·2 in FASTA format. This sequence was then submitted to the TepiTool server from the IEDB Analysis Resource database (http://tools.iedb.org/tepitool/).18 The epitope prediction was performed for the 27 most frequent HLA-A and B alleles that cover for most populations (table 1).19 Once the total epitope list was obtained, it was submitted to the T cell class I pMHC immunogenicity predictor server (http://tools.iedb.org/immunogenicity/) to get the immunogenicity score, which is predicted according to the aminoacid residues of the peptide.20 The peptide-HLA pairs with a positive immunogenicity score and a predicted IC50 level lower than the established cut-off from the complete list were selected (table 2).21 Considering that the lower the IC50 value is, the higher the binding affinity. The ten more immunogenic pMHC combinations from this list were selected.
The epitopes for HLA class II molecules were also predicted. To do so, the same sequence as before was used and submitted to the IEDB MHC class II epitope prediction tool (http://tools.iedb.org/mhcii/) using the IEDB recommended 2·2 algorithm and the most common HLA-DP, DQ and DR alleles (table 1).22 The predicted epitopes that had an SMM-predicted IC50 value higher than 50 were excluded and the sequences were ordered by the percentile rank.23 The MHC-II prediction tools use a core of nine amino acids to predict the best peptide binding affinity, even when the class II molecules bind peptides with 15 amino acid length, so the ten SMM cores with the minor percentile rank (what means the highest affinity binding) were selected.

Structural modeling
The Cryo-EM 3D model of the SARS-CoV2 spike glycoprotein crystal was downloaded from the RSSB Protein Data Bank (PDB) (https://www.rcsb.org/pdb/home/sitemap.do) (ID: 6VYB).24 The 3D structure was obtained and analyzed using PyMOL® software (Schrödinger LLC. Molecular Graphics System (PyMOL). Version 1·80 LLC, New York, NY. 2015). The basic local alignment search tool online (https://blast.ncbi.nlm.nih.gov/Blast.cgi) was used to assess the position of the predicted peptides in the glycoprotein and the protein sequence was adjusted manually using the PyMOL tools.

Modeling structural analysis of SARS-CoV-2 epitopes and HLA
Modeling of protein-peptide interactions between an HLA class I and two SARS-CoV-2 spike protein epitopes was performed using the CABS-dock server (http://biocomp.chem.uw.edu.pl/CABSdock) to predict whether the most possible binding site would correspond to the HLA groove. To perform this analysis, the 3D structure of HLA-A*02:01 (accession number 3HLA) was used, since it was the only class I HLA molecule with a full structure available that did not have any pre-loaded peptides in the RSCB PDB database (RSSB Protein Data Bank) (https://www.rcsb.org/pdb/home/sitemap.do). The YLQPRTFLL and FIAGLIAIV peptides were selected from the epitope prediction results, as well as A2-ALW (ALWGPDDPAA), which is a well-known weak binding epitope for the HLA-A*02:01 that has been related to autoreactive CD8+ cells in Diabetes Type 1 and constitutes a good control element for the analysis.25 The last epitopes were re-analyzed using the NetMHC 4·0 server (http://www.cbs.dtu.dk/services/NetMHC/index.php) to obtain a prediction of the properties for each peptide-MHC-class I complex. The HLA-A*02:01 flexibility was determined by the PDBFlex server (http://pdbflex.org/) to select the best level of flexibility to introduce into the CABS-dock tool. Two regions were also removed for docking analysis, such as the ß2-microglobulin and a fragment of the A chain that was outside of the pocket. This was done by selecting the specific rank of amino acids that constitute them and selecting 1:B-99: B to remove ß2-microglobulin and 182:A-270: A for A chain. Finally, the analysis was run using ten simulation cycles per assay and using the full peptide flexibility.

HLA alleles frequency and cumulative incidence analysis.
The top 14 countries with the highest number of confirmed cases by March 31st, 2020 were selected to correlate the HLA allele frequency and the cumulative incidence rates. The database consulted provided information since January 22th, 2020 2 and did not have data from China. Hence, China was excluded. Japan was included despite having performed a relatively low number of COVID tests compared with other countries (the proportion of positive cases among tests performed was lower (7%) than in the US and Italy (21%)26,27 because it has a large population.
HLA allele frequencies were searched from our epitope prediction results in the open database Allele Frequencies in Worldwide populations and the samples were restricted to studies of blood donors and bone marrow registries.28 The population that better represented the country (minorities or subpopulations were not included) and had information on HLA alleles was the one considered. For each country, day one was determined as the day when the first confirmed case was reported. To March 31st, 2020, the day of analysis, Netherlands was on day 34, Austria on day 36, Belgium on day 57, Spain on day 60, and the rest of the countries on day 61 and higher. The cumulative incidence on days 34, 36, 57, and 60 was calculated.

Statistical analysis
The correlation between each HLA allele frequency and the cumulative incidence among countries was evaluated with the Spearman's rank correlation. Statistical significance was considered when the p-value was <0·10. The analysis was performed in the software GraphPad Prism ® (GraphPad Software version 6·0, La Jolla California USA, www.graphpad.com).

Results
In order to assess the best Spike protein epitope-HLA matches, the sequence was analyzed looking for epitope predictions in the most frequent HLA-A and HLA-B alleles. The ten most immunogenic peptides with a higher affinity binding to its restricted HLA are shown in table 3.
Although the most immunogenic peptide from this list is GTHWFVTQR, the match with the highest affinity was between the peptide FIAGLIAIV and HLA-A*02:03. Of note, here we analyzed the most frequent class I A and B alleles, so this analysis not only reveals epitopes that can be used for vaccine development but also the HLA alleles that best present epitopes of this particular protein.
The best epitopes and HLA class II alleles were also predicted, as shown in To track down the specific location of the peptides in the SARS-CoV2 spike glycoprotein, the corresponding 3D model was obtained. In this model, different predicted epitopes were searched for (tables 2 and 3) in the protein structure considering its subunits and domains ( figure 1). Notably, HLA class I peptides WTAGAAAYY, SANNCTFEY, and YLQPRTFLL (7, 8, and 9) are located in the A domain, which is highly conserved among other coronavirus species 24, suggesting that these could be also epitopes for other viruses. On the other hand, it was found that the class II epitopes FELLHAPAT, VVVLSFELL, FLVLLPLVS, VLSFELLHA, and FTISVTTEI (a, b, c, d, and h) and the HLA class I EVFNATRFA (4) are preferentially founded in the B domain.
The most probable junctions among the HLA-A*02:01 (table 3) against two of the most immunogenic SARS-CoV-2 epitopes (peptides FIAGLIAIV and YLQPRTFLL) were predicted. They were also compared with one well-known weak binding epitope for the HLA-A*02:01, the A2-ALW (ALWGPDDPAA), as a control.25 In order to perform a docking analysis, the last epitopes were re-analyzed using the NetMHC 4·0 server to obtain a prediction of the properties of each peptide-MHC-class I complex.34 The YLQPRTFLL epitope showed the strongest affinity and the highest binding percentage level for HLA-A*02:01, with values of 5·36 nM and 0·04% respectively. Conversely, the FIAGLIAIV epitope had less affinity and percentage of binding level (10·29 nM and 0·12% respectively), as well as the control peptide A2-ALW (117·15 nM and 1·20%).
Despite the differences in affinities between the two SARS-CoV-2 epitopes, both showed to fit the groove of HLA-A*02:01 maintaining their unfolded conformations. A contact map of the interface between the amino acids of HLA-A*02:01 and the epitopes was made, finding that each amino acid of FIAGLIAIV peptide was interacting at least with one amino acid of the HLA-A*02:01 groove (figure 2). Our data complemented the previous analysis of the epitopes and demonstrated that all of them have a great chance of being successfully presented by their respective HLA-restricted allele into a pMHC-TCR complex and trigger an immunological response against SARS-CoV-2. As the other HLA class I and II alleles structures are not found free of a pre-loaded peptide, they were not submitted to docking analysis.

HLA allele analysis and correlation with the country cumulative incidence
The population behavior of the HLA alleles that resulted in our prediction analysis was looked for considering the cumulative incidence of COVID infection in the most affected countries by March 31st, 2020. Table 5 shows the cumulative incidence per one million inhabitants on days 36, 57, 60, and 65 after the report of the first case in each country. Austria had the highest cumulative incidence on day 34 and 36 (976 and 1130 per one million inhabitants, respectively).
There was no association between any HLA allele frequencies and the cumulative incidence on days 34 and 36 (data not shown). On day 57, HLA-A*02:03 frequency was significantly associated with a lower cumulative incidence per one million inhabitants (R = -0·75, p value=0·08) and remained the same on day 60.

Discussion
Determining HLA interactions with epitopes for optimal presentation is crucial for the understanding of immunological response to the SARS-CoV-2. Here we present a group of epitopes of the S protein that can be efficiently presented to CD8 and CD4 T cells and are probably related to the immune-mediated elimination of the virus. These peptides can be either used for peptide-based design of vaccines or for further analysis of the immunogenicity and structure of this relevant protein.
Remarkably, our structural analysis of the protein shows a higher abundance of epitopes in the A and B domains of the S1 subunit of the virus, indicating that, in the case of this part of the protein being processed by the host cells, it could represent a highly immunogenic region. In this analysis, we did not look for B cell epitopes in the structure of the protein, and we cannot confirm that the specific target of the presented epitopes could interfere with its viral function, as would be the case of neutralizing antibodies.
HLA peptide groove sequence determines which epitopes from an antigen are presented to the immune system to elicit an effective response. The high rate of polymorphisms in the HLA locus can indicate a different ability to respond to certain antigens by different individuals, and some HLA alleles can be more efficient presenting certain antigens, thus protecting from certain infections 12. Our analysis from the most representative HLA alleles revealed those that present more effectively spike protein antigens of SARS-CoV-2, hence, one can hypothesize that their presence in an individual might confer an enhanced ability to defend against the virus.
To assess this, we analyzed the frequency of these alleles and their relation to the disease dynamics in some of the countries that are currently most affected by the COVID-19 epidemic. We found a negative correlation between HLA-A*02:03 and A*31:01 frequencies and the cumulative incidence per 1 million inhabitants, confirming the hypothesis that these alleles that were predicted to be good antigen presenters are associated to a better immunity against the infection. On the other hand, a positive correlation was found with HLA-A*03:02, suggesting that this allele can be considered as a risk factor.
Our results differ from other studies exploring the association of HLA alleles and different outcomes of infection with SARS-CoV. Often, the studies compare HLA alleles within specific populations: HLA-B*07:03 in China13, or HLA-B*46:0115 and HLA-Cw*08 in Taiwan,29 therefore, the variability of HLA allele frequencies might not be as high as between populations.30 We did not find any association of the studied alleles on day 34 and day 36, suggesting that the possible effect of the different allele frequencies in the cumulative incidence might be evident in the later exponential stages of the epidemic so far. It would be interesting to explore this association once the pandemic is over and we have a better understanding of the epidemic curve.
We acknowledge the limitations of the ecological association and its risk of bias. First, the association is at a population level and we cannot apply it at an individual level. The frequency of the HLA alleles is determined for the resident population of each country, and it could be subject to migration bias due to foreign people with COVID 19 infection that might have different HLA alleles frequencies compared to the native populations. Second, several factors influence the disease transmission rate, such as the actions implemented to contain the epidemic and its right execution. Although much of the information will not be available until the pandemic is over. Third, the new cases reported depend on the number of tests performed, so we might be underestimating the cumulative incidence in some countries.   Bold refers to the highest cumulative incidence for the given day. Figure 1 Localization analysis of immunogenic peptides of SARS-CoV-2 Spike Glycoprotein by 3D modeling. (A) Structure of the three full SARS-CoV-2 spike glycoprotein monomers with subunits S1-S2. The S1 domains consist of A, B, C, and D. The S2 subunit consists of the fusion peptides and domains HR1 and HR2. (B) The predicted epitopes for HLA class I are shown in green (C) and the suggested peptides for HLA-class II in blue. (D) Merge of class I and II peptides within the protein structure. The peptides are marked individually listed from 1-10 for class I and a-j for class II, corresponding to the immunogenicity tables 3 and 4.