Predicting HLA-I peptide immunogenicity with deep learning and molecular dynamics

Understanding how T cells discriminate self from non-self is a fundamental question with important implications for immunology, immunotherapy, and vaccine development. Presentation of peptides by human leukocyte antigen I (HLA-I) molecules is necessary but not sucient for T cell recognition, and peptide features that dictate immunogenicity are obscure. Here, we develop a convolutional neural network that learns features governing peptide immunogenicity, integrating molecular dynamics and sequence representations of humans, pathogen, and tumor peptides presented by HLA-I. Our model identied structural and dynamical properties correlated with immunogenicity and yielded a highly accurate classication of peptides from pathogens versus humans. Furthermore, we applied our model to classify more challenging cancer neoantigens, and it successfully predicted immunogenic neoepitopes from patients with melanomas. These data demonstrate the utility of deep learning models built on molecular dynamics and reveal underlying properties that govern HLA-I peptide immunogenicity. an immunogenicity prediction model that combines peptide sequence and HLA-peptide molecular dynamics simulation data within an integrated deep convolutional neural network. show that our neural network yields highly accurate classication of cancer, and human self and successfully identies immunogenic cancer neoantigens in two individual

melanoma. Analysis of the extensive molecular dynamics ensemble reveals the biophysical underpinnings of peptide immunogenicity and illuminates corresponding peptide features that complicate the classi cation of non-immunogenic cancer neoantigens. In particular, we nd that, when presented by HLA-A*02:01, the ensemble of non-immunogenic cancer neoantigens considered in this work look far more like immunogenic pathogen peptides than non-immunogenic self peptides.

Results
To explore determinants of HLA-I peptide immunogenicity, we rst assembled a dataset of 1,038 HLA-A02-restricted immunogenic 9mer epitopes from pathogens (29), as well as 1,845 non-immunogenic self 9mer peptides identi ed via mass spectrometry (30) (Fig. 1a). We used these data to develop a convolutional neural network that discriminates between immunogenic and non-immunogenic peptides ( Fig. 1). We rst applied a one-hot sequence representation (Fig. 1b) with a residue-similarity ordering, hypothesizing that both amino acid identity and arrangement within the peptide may affect immunogenicity. This input representation allows the neural network to learn aspects of amino acid similarity from a spatial encoding of proximity. By applying a 2 x 2 sliding convolutional lter, we integrated both amino acid sequence and similarity into the network (Fig. 1c). Applying a ve-fold cross validation approach to our data, our current sequence convolutional neural network achieved a mean test AUC of 0.85 on the 2,883 peptide dataset. The AUCs achieved in a recent work focused on classifying immunogenic pathogen peptides range between 0.60 and 0.70 (28), suggesting enhancements in both data curation and architecture can improve prediction results (Supplementary Table 1

and Supplementary
Text).
The general molecular topology of HLA-I bound to peptide antigens is widely conserved, but HLA-I/peptide complexes each interact in a speci c manner with the surfaces of their conjugate TCRs. Therefore, we hypothesized that differences in peptide solvent exposure at the residue level could account for differences in immunogenicity, since the features of the presented peptide side chains represent the nexus of the immune recognition interface. In support of this hypothesis and consistent with previous analyses (16,17), Riley et al. (28) reported that hydrophobic solvent-accessible surface area of presented peptides is correlated with heightened immunogenicity, particularly at positions 4, 5, and 8.
To interrogate the structural dynamics of the peptide-HLA complex and its effects on immunogenicity, we conducted extensive atomistic molecular dynamics (MD) simulations of a representative subset of our peptide collection (Supplementary Figure 1). Notably, it has been widely established that protein dynamics simulations in explicit solvent properly capture the dynamic hydrophobic exposure and/or burial of side chains in peptide conformational ensembles, which are essential for proper function (31). MD simulations provide the added bene t of generating time-ordered conformations, facilitating the creation of temporal features that are useful in training classi ers.
In our study, peptides were selected to ensure sequence diversity within the MD subset, particularly in reference to positions 1 and 8. In total, 110 immunogenic pathogen and 123 non-immunogenic selfnonamers were loaded onto the HLA-A*02:01 peptide-binding groove (Fig. 2a) based on a common PDB structure (PDB: 5NMH) through structural alignment. Loaded peptides were then placed in an explicit solvent box, subjected to an annealing procedure to minimize clashes, equilibrated for 100 ns at 310 K and 1 atm, and then simulated at production for 500 ns at the same temperature and pressure (32)(33)(34)(35)(36)(37).
This procedure was also repeated for 17 immunogenic tumor neoantigens (29). The resulting simulation ensemble (comprised of 250 peptides, and 125 μs of aggregate MD simulation time) showed that hydrophobic exposure (de ned by exposure of residues G, A, V, L, I, F, W, Y, and P) is enhanced at positions 4, 5, and 8 in immunogenic epitopes (Fig. 2b), with particular emphasis on positions 5 and 8. Another important observation derived from our MD simulations concerns non-hydrophobic exposure in nonimmunogenic peptides: non-hydrophobic residues are considerably more exposed at peptide positions 4, 5, and 8, suggesting that immunogenicity is strongly anti-correlated with polar or charged residues at those positions. This nding is signi cant, despite the fact that enhanced hydrophobic exposure occurs in immunogenic epitopes at the same positions. Such anti-correlations with non-hydrophobic exposure are likely equally useful for immunogenicity classi cation.
Based on the above results, we hypothesized that an integrated representation of HLA-I binders, incorporating both sequence and dynamic structural features, would enable us to learn additional properties of HLA-I peptide immunogenicity and improve prediction of immunogenic epitopes. We designed a deep convolutional neural network that combines sequential, structural, and dynamical features for immunogenicity prediction. Our featurization approach involves computing residue-speci c solvent-accessible surface areas (SASAs) as a function of time for each peptide sequence, with MD frames sampled every nanosecond (Fig. 2c). This procedure results in de facto coarse-grained images of the HLA-A*02:01 peptide presentation interface, a feature type that is well suited to convolutional architectures. The images were then split into 10 subtrajectories per peptide (representing 50 ns of MD each), with the dual purpose of providing more varied samples (2500, in this case) on which to train/test and to facilitate testing on shorter simulations, if desired. Importantly, samples from any single peptide were never divided between training and test data sets to avoid trivial classi cation outcomes.
Our integrated deep learning architecture is comprised of three modules: the sequence-based module adapted from above, and two additional sets of layers designed to encode molecular dynamics data.
With respect to the MD data, each sample image corresponding to an MD trajectory was split into two channels to re ect our residue exposure observations (Fig. 3a): one for hydrophobic (G, A, V, L, I, F, W, Y, and P) residues, and a second for the complementary residues present in each peptide. These two images were then fed through respective series of three convolutional and pooling layers to generate internal tensor representations for both sets of residues. In these cases, pooling operations were only conducted in the time dimension of each image, as the full resolution in the residue dimension needs to be retained to capture each peptide position's unique exposure statistics and relation to other residues.
The rst two pooling layers in the MD convolution modules were designed to identify the most prominent features in each peptide column (a strategy often employed in convolutional neural networks called max pooling), while the third pooling layer was designed to compute an ensemble average over dynamic features representative of the underlying statistical physics of protein conformations. In standard MD data analysis, one typically employs some kind of clustering and/or ensemble averaging to understand the sampled conformational landscape. In the present case, the deep learning encoder plays the role of these traditional analytical tools in grouping and averaging conformations, using experimental immunogenicity labels as a guide.
The output of the hydrophobic and complementary convolutional modules was then concatenated and fed through a dense layer to yield a uni ed representation for the protein structure and dynamics. This MD-derived representation was then concatenated with the output from the sequence-based module and fed through a last dense layer, yielding an integrated sequential, structural, and dynamical representation that is used to predict a peptide's probability of immunogenicity via a nal softmax layer.
Conducting a ve-fold cross validation on the peptide set for which MD data was collected, we found that the structural and dynamical features improved upon the performance of the sequence-only model. ROC curves for representative test sets from this procedure are presented in Fig. 3b. The MD plus sequence model yields an average test AUC of 0.89 (Fig. 3c), as opposed to the average AUC of 0.80 derived from the sequence-only model on an identical ve-fold data split of this reduced peptide set. These results suggest that our integrated model can distinguish viral immunogenic peptides from non-immunogenic self peptides at a high level of discrimination.
Considering the performance gap between the sequence-only and MD plus sequence models, one can conclude the improvement derived from MD data does not result from trivial bias in how sequences were split between training and test sets. We conducted a pair of feature analyses aimed at explaining how the deep learning model learns from molecular dynamics data: the rst based on residue identity, and the second based on residue position. In the rst test, we arti cially restricted the hydrophobic residue list to G, A, V, L, I, F (removing W, Y, and P, which have presumed importance because of their size and abundance, but can be considered non-hydrophobic in certain environments). This procedure reduces the AUC to 0.85 (Fig. 3c), suggesting that distinguishing between hydrophobic and complementary SASAs, rather than simply using generic residual exposure as a function of position, is important for predicting immunogenicity. Notably, the addition of MD-based features with the reduced hydrophobic residue list still signi cantly outperforms the sequence-only model on this same 250 peptide dataset (Fig. 3c).
In our second feature analysis, we aimed to ascertain the relative impact of residues at each peptide position (P1-P9) on immunogenicity prediction. To do so, we set all input apart from MD-based features at one speci c peptide position to zero in each of nine ve-fold training runs. While underlying correlations in residue dynamics present in MD trajectories cannot be removed, our ablation procedure ensures that features are only explicitly considered from one peptide position at a time.
This feature ablation procedure illustrates differences in immunogenicity prediction depending on the position considered (Fig. 3d). Perhaps unsurprisingly, the model is able to learn the least from dynamics of P2 and P9, which correspond to peptide anchor residue positions for HLA-A*02:01. This result further reinforces the point that peptide immunogenicity is a concept distinct from peptide-HLA binding, since properties of residues at P2 and P9 are known to be essential to the binding process. Outside of the anchor residues, model performance is relatively uniform across the positions considered, suggesting the model is able to distinguish a variety of peptide presentation modes in determining immunogenicity. The worst model performance at a non-anchor position occurs at P4; past work has suggested residue exposure at P4 is particularly important to peptide immunogenicity 26 , but we cannot draw the same conclusion based on the present data. Interestingly, our model is able to perform relatively well with residue solvent exposure information from P1, a peptide position not typically associated with TCR interactions. Regardless, the result from training on the full set of MD features simultaneously (mean test AUC ~0.89 with sequence information; ~0.84 without sequence information) is still signi cantly higher than that of any model trained on a single residue position, implying that explicit correlations between positions do contribute positively to classi cation.
The above data thus indicate that a deep learning model that combines molecular dynamics data with sequence features can be used to improve peptide immunogenicity prediction with respect to distinguishing pathogen peptides from self. Importantly, the addition of high-level MD features serves to add to the performance of a sequence-only baseline computed on identical datasets, suggesting that MDderived improvements do not result from trivial similarities across train-test divides.
We next interrogated how well our convolutional model can be extended to classifying cancer neoantigens, a task that is useful for many therapeutic applications. For example, neoantigen classi cation would be useful if one would like to discriminate between immunogenic and non-immunogenic neopeptides for antigen selection in vaccine design, an approach that goes beyond just making generating HLA-I binding predictions. Though our initial model was trained on features from 17 HLA-A*02:01 immunogenic neoantigens (15), it was not exposed to any non-immunogenic neopeptides, which themselves may be distinct from standard self-peptides. Accordingly, we conducted a full MD simulation and featurization on 15 non-immunogenic neopeptides (and one additional immunogenic neoepitope, yielding 18 immunogenic peptides and 33 peptides in total) to create a neoantigen test set (hereafter called "Small Neo"), and evaluated this test set using on models trained on a range of training sets and features. For reference, the non-self/self (NS/S) MD training set listed in Supplementary Table 3 includes the 95 immunogenic pathogen and 123 self-non-immunogenic peptides fully featurized with MD, and is used as the single training set for the remainder MD-based models discussed in this work. Sequence-only models were constructed using all relevant training sets, and the 291 neoantigen dataset presented in Riley et al. (28) (hereafter known as "Full Neo") was used as an additional test set for the purpose of comparison. Single MD plus sequence models were trained on the NS/S MD training set; for completeness, MD-only models were trained only on MD-derived features.
Not too surprisingly, the sequence-only models trained on either large peptide training (n=2883/n=3955) set could not classify either of the reduced ("Small Neo"; n=33) and full ("Full Neo"; n=291) neoantigen test data sets (Supplementary Table 2). This lack of neoantigen classi cation is consistent with previous results (28), where an AUC of 0.50 was reported for a sequence-only model and 0.60 for a structural model on the full neoantigen test set. Our smaller, diversity-conditioned peptide training set yields better classi cation on both neoantigen test sets (AUC = 0.67 -Small Neo; AUC = 0.57 -Full Neo) (Supplementary Table 3). Our sequence-only result on the full neoantigen set approximates that of previously described more expensive structural model(28) (See Supplementary Information).
Adding MD-derived features to our reduced NS/S training set and small neoantigen test set improved the AUC from 0.67 to 0.71 (Supplementary Table 3). This result falls far short of the 0.89 reported from crossvalidation, suggesting that neoantigens are much more challenging to differentiate in terms of immunogenicity than peptides from pathogens. This observation could have origins in the fact that neoantigens are derived from the human genome, and thus might be more similar to one another than or have structural and dynamic features distinct from pathogen and human self peptides. However, an AUC of 0.71 still represents a reasonable level of classi cation, and the MD plus sequence, MD-only, and sequence-only results suggest that those feature sets can cooperate in determining immunogenicity. Fig.   4a shows box-and-whisker plots of predicted probabilities for the immunogenic and non-immunogenic peptides in the small neoantigen test set derived from the best-performing MD plus sequence model. The model does differentiate the two classes of peptides with reasonable separation: the average predicted probability for immunogenic samples is 0.69, while the average value for non-immunogenic samples is 0.42. By contrast, the corresponding values are approximately 0.90 and 0.05 in models trained and tested on pathogen/self peptides These differences may suggest that non-immunogenic peptides share features correlated with immunogenicity and that immunogenic neoepitopes diverge less from nonimmunogenic neoepitopes than pathogen peptides diverge from self peptides.
To further assess whether our model could identify immunogenic cancer neoepitopes among HLA-A*02:01 binders from individual tumor genomes, we generated neoantigen test sets from peptides derived from two individual patients with melanoma (n=8 peptides for "Patient 1" (38), and n=15 peptides for "Patient 2"(23)). The patient-speci c test results are shown in Fig 4b and Supplementary Table 3. Intriguingly, the results for each patient diverge with respect to dominant feature type. However, the results for both patients were favorable overall (Fig. 3b); with the MD plus sequence model, both of Patient 1's top ranked peptides and three of the top four-ranked Patient 2 peptides were identi ed as immunogenic. It is important to note that these patients' data were obtained under speci c conditions and limited by screening assay sensitivity and the presence of potential immune evasion. It is nonetheless encouraging that our model's predictions feature strong enrichment of immunogenic peptides, providing an example of how neopeptides might be prioritized in personalized vaccine design. Fig. 3c provides two anecdotal examples in which MD-derived features appear to improve classi cation outcomes. In the case of AVGSYVYSV, the MD simulation reveals conformational dynamics that expose Y5, a residue for which such exposure is likely correlated with immunogenicity (and indeed seems to advance the immunogenicity probability prediction from 6% in the sequence-only model to 99% in the MD plus sequence model). By contrast, the non-immunogenic LMASISSFL features large, hydrophobic residues at positions 5 and 8, which triggers a strong immunogenicity prediction in the sequence-only model (97%); however, the MD simulation has the effect of burying F8, reducing the predicted immunogenicity probability to 31% in the MD-only model. The presence of I5/F8 seems to be so strongly correlated with immunogenicity that even the MD plus sequence model produces a high immunogenicity probability (94%). While neural networks learn from a complicated array of internal representations that are often di cult to interpret, these two peptides provide examples of how MD-derived features can improve immunogenicity prediction in a deep learning framework.
To expand our exploration of non-immunogenic neoantigens, we featurized the full (n=274) nonimmunogenic neoantigen set by collecting a single 50 ns sample for each peptide. All peptides were equilibrated for 25 ns prior to production. Combining these non-immunogenic neopeptide data with those previously generated for 17 immunogenic neoantigens and feeding all samples into the MD plus sequence model, we see a modest improvement in AUC relative to the sequence only model (0.61 vs. 0.57). These results are consistent with those reported by Riley et al. (28) for structural featurization (0.60). It is possible that a full (10 sample) featurization could improve our model's performance on these data, though the improvement derived from MD is consistent with that observed for the neoantigens that were fully featurized (0.71 vs. 0.67).
These results on the full non-immunogenic neoantigen set raises the question of why neopeptides are apparently more di cult to discriminate than immunogenic pathogen and self non-immunogenic peptides. A part of the answer may lie in the characteristics of the non-immunogenic neopeptides available for study. Our MD featurization of the 274 neoantigens allows for a direct comparison of A02presented peptide interfaces with those from other peptide classes. Non-immunogenic neopeptide SASA differences relative to the same self-peptide set featured in Fig. 2b are shown in Fig 4d. Surprisingly, the presented interfaces of these non-immunogenic neopeptides are quite similar to those of immunogenic pathogen peptides, with respect to both our chosen hydrophobic and complementary residue exposure metrics.

Discussion
Our simulation data on non-immunogenic neoantigens thus suggest that cancer neopeptide immunogenicity exists on a spectrum, and that neoantigens labeled as "negative" by common assays appear more immunogenic (in terms of HLA-I presentation) than standard human/self controls. Accordingly, our deep neural network must pick up on much more subtle differences between presented neoantigens to identify nonimmunogenic peptides than those that are typically reliable for distinguishing pathogen peptides from self. In short, the non-immunogenic neoantigens considered here are presented in strikingly similar ways to immunogenic peptides, complicating the classi cation problem. Interestingly, this di culty perhaps mirrors the physiological characteristics of neoantigen recognition: while robust and durable T cell responses are normally observed in the presence of pathogen peptides, the same often cannot be said for cancer neoantigens.
Despite these complications, our current model has proven capable of ranking true-positive immunogenic peptides at or near the top of patient-speci c neoantigen sets, an important ability for vaccine design. It remains possible that features orthogonal to current representations of sequence and solvent exposure may play a signi cant role in determining neoantigen immunogenicity. More featureagnostic deep learning architectures like graph convolutions could perhaps also help identify such differentiating factors.
We believe our results add to knowledge of speci c positional features of the HLA-I bound peptides that are associated with immunogenicity. We observe the most signi cant hydrophobic exposure occurs at positions 5 and 8, whereas Riley et al. (28) report their most certain enhancement at position 4. Since both the previous Riley et al. (28) study and the present work identify the importance of nuanced hydrophobic exposure with respect to observed immunogenicity, a more rigorous featurization approach, like the current full MD simulation in explicit solvent, is helpful to improve immunogenicity prediction. It's also important to note that molecular dynamics simulations produce more physically rigorous conformational ensembles that better approximate dynamic functional context.
Single-sample (50 ns) MD trajectories of a peptide-HLA-I complex take approximately one node-day to collect on our present hardware, which makes featurizing datasets of the size presented here expensive but tractable. To further accelerate featurization going forward, advances in generative deep learning (e.g., dynamical autoencoders) are worth exploring in the context of generating MD-derived features with a neural network rather than molecular dynamics code (39,40). Other recent work has shown that HLA allele-dependent factors can have a considerable (and potentially negative) impact on neoantigen presentation (41), suggesting that the impact of various properties of the cancer peptidome, and HLA diversity (42), need to be assessed in future immunogenicity prediction. Pressing needs related to vaccine development for emerging viruses suggest additional use cases for our models in antigen selection, should such advanced strategies be required. The present work suggests that combining molecular dynamics with arti cial intelligence can play a foundational role in immunogenicity prediction in di cult biomedical applications, which are certainly not limited to personalized cancer vaccine design.

Construction of Datasets
The peptides used to train and test models in this study were all of length nine. For analyses involving pathogen peptides, all peptides restricted to the HLA-I supertype A02 were derived from the Immune Epitope Database(29) (IEDB) under T cell assays, last accessed July 2017. Peptides were considered immunogenic if they had an annotated T cell response (column "Qualitative Measure") as positive (N = 1038). HLA-A02 self non-immunogenic binders (N = 1845) were derived from Pearson et al. (30), in which mass-spectrometry was used to elute human self peptides bound to HLA-A*02:01 from the cell surface. Neoantigen data were obtained from four main sources. Immunogenic and non-immunogenic neoantigens restricted to HLA A*02:01 were obtained from Bjerregaard et al. (15) (n=18 immunogenic; 15 non-immunogenic). and Riley et a.l (n=291) (28). For the analyses ranking Patient 1 and Patient 2 immunogenic neoepitopes, data were obtained from Robbins et al. (38) (n=8) and Stronen et al. (23) (n=15). respectively. The reduced dataset on which full MD featurization was selected to maximize sequence diversity at peptide positions 1 and 8, so as to ensure a range of amino acid compositions at both peptide termini. As such, peptides sets were selected from the larger self and non-self databases with roughly equal proportions of available amino acids at the aforementioned positions.
Deep convolutional neural network based on peptide sequence only Once peptide sequences have been encoded as described in the main text, the one-hot sequence representations are run through a 32-lter convolutional layer using the lter 2 x 2 con guration described above (Fig. 1c). The resulting mixed features are then max-pooled across the same lter dimensions to extract the dominant amplitudes from the convolutional procedure, as one often does in image-or graphbased convolutional neural networks. The pooled features are next passed through an L2-regularized dense layer, which is then subjected to a 50% dropout procedure during training. Finally, the surviving output from the dense layer is passed through a softmax layer to yield peptide immunogenicity probabilities. Sequence-based deep learning models were implemented in TensorFlow 1.2.1.
Deep convolutional neural network combining sequential, structural, and dynamical features MD trajectories were collected as described in the main text using NAMD 2.9.1 using the CHARMM27 force eld, Langevin thermostat (damping at 1/ps), a Langevin Piston Barostat, and 2 fs time step. All systems were solvated in explicit TIP3P water and ionized to 150 mM Na+ and Cl-. Prior to full equilibration, all systems were annealed with there iterations of 50000 steepest descent minimization steps and 5 ns molecular dynamics at 310 K and 1 atm. After collection of production simulations, residue-speci c solvent-accessible surface areas (SASAs) were computed every 1 ns using the measure tool in VMD 1.9.2, split into hydrophobic/nohydrophobic channels, normalized by the maximum residuespeci c SASA observed per channel across all peptides used in training/testing, and split into 10 (50 ns) samples per peptide (with the exception of the full non-immunogenic neoantigen set, which contained only one 50 ns sample per peptide). Molecular dynamics data samples were then each fed into a series of three convolutional/pooling units, leveraging 2 x 2 convolutional lters and pooling along the time axis across 33, 33, and 35 frames to yield a 2 x 9 x 128 internal representation for each channel image. In the third convolutional unit, max pooling was replaced with mean pooling to mimic the ensemble averaging that is known to characterize the underlying statistical physics. The hydrophobic and complementary representations are then attened, concatenated, and fed through a dense neural network layer. The output of this dense layer is concatenated with the sequence-only dense layer output; this vector is nally fed through a softmax layer to generate immunogenicity probability predictions. Integrated deep learning models were implemented in TensorFlow 1.2.1.

Deep convolutional neural network training procedure
For applications not involving neoantigens, a ve-fold cross validation procedure was applied in an attempt reproduce the method used by Riley et al. for immunogenicity classi cation. As described by Riley et al., datasets were randomly split into ve equal parts, and models were trained on all possible permutations of three-part training sets, one-part validation sets, and one-part test sets. In practice, hyperparameters were not changed as a result of training runs conducted on each validation set: all L2 regularization weights were set at standard value of 0.01 for, and all sequence and MD plus sequence models were trained for 200000 and 500000 optimization steps, respectively, with a batch size of 50 samples. Stochastic gradient descent was governed by the default settings of the ADAM optimizer in TensorFlow. Standard ReLU activation functions were used in convolutional and dense layers. Cross validation AUCs were reported as averages over withheld test sets. For applications testing neoantigen classi cation, models were trained with all available pathogen/self peptide samples using the same procedures described above. Out of an abundance of caution, pathogen peptides with the ambiguous label "occurrence of autoimmune disease" were excluded from training data for models used to make neoantigen predictions. In models involving molecular dynamics data, no samples drawn from the same peptide were divided between any train/test split. In other words, in cases in which 10 samples per peptide were available, all ten samples were either used as training data or test data, but never a mixture of both.

Statistical analyses
Statistical analyses were performed with available functions in TensorFlow 1.2.1 and standard tools packaged with Python 2.7.12 and MATLAB_R2015b.

Data availability
All peptides used in the present study are provided as a supplementary le.

Declarations
Author Contributions R.Z. and T.A.C. conceived and planned the study. J.K.W. designed and constructed the deep learning architecture, and performed molecular dynamics simulations. D.C. and C.K. collected and prepared all peptide sequences and patient data. D.C. and C.K. helped develop the deep learning architecture for sequence only model. J.K.W., D.C. and C.K. analyzed data. J.K.W., D.C., C.K., T.A.C. and R.Z. interpreted results and co-wrote the manuscript. All authors contributed to the general discussion of the project and manuscript.
Competing interests T.A.C. is a co-founder of Gritstone Oncology and holds equity. T.A.C. holds equity in An2H. T.A.C. acknowledges grant funding from Bristol-Myers Squibb, AstraZeneca, Illumina, P zer, An2H and Eisai. T.A.C. has served as an advisor for Bristol-Myers, MedImmune, Squibb, Illumina, Eisai, AstraZeneca and An2H. N.A.R. is a consultant/advisory board member for AstraZeneca, BMS, Roche, Merck, Novartis, Lilly and P zer. T.A.C. and D.C. hold ownership of intellectual property on using TMB to predict immunotherapy response, with pending patent, which has been licensed to PGDx.