TransMut: a program to predict HLA-I peptide binding and optimize mutated peptides for vaccine design by the Transformer-derived self-attention model

Computational prediction of the interaction between human leukocyte antigen (HLA) and peptide (pHLA) can speed up epitope screening and vaccine design. Here, we develop the TransMut framework composed of TransPHLA for pHLA binding prediction and AOMP for mutated peptide optimization, which can be generalized to any binding and mutation task of biomolecules. Firstly, TransPHLA is developed by using a Transformer-derived self-attention model to predict pHLA binding, which is significantly superior to 11 previous methods on pHLA binding prediction, neoantigen and human papilloma virus vaccine identification. For vaccine design, the AOMP program is then developed to automatically optimize mutated peptides to search for mutant peptides with higher affinity to the target HLA and with high homology to the source peptide. Among 3660 non-binding pHLAs, 3630 were successfully mutated. Of these, 94% were verified by the IEDB recommended method, and 88% have homology higher than 80% to the source peptide.


Introduction
The binding of peptides with the major histocompatibility complex (MHC), also called human leukocyte antigen (HLA) in humans, is essential for antigen presentation, which is a necessary prerequisite for effective T cell recognition 1 .
Only when the peptide is presented to the HLA molecules on the outer cell surface to form a peptide-HLA (pHLA) complex, and then recognized by the T cell, can it trigger a robust immune response 2 . HLA is generally divided into two categories: HLA class I (HLA-I) and HLA class II (HLA-II). HLA-I is encoded by three I loci, including HLA-A, HLA-B, and HLA-C, and it is expressed on the surface of all nucleated cells. In contrast, the encoded HLA-II can only be expressed in professional antigen-presenting cells 3 . Therefore, we focus on HLA-I molecules (hereinafter referred to as HLA) in this study. HLA mainly binds short peptides with a length of 8-12 amino acids, of which 9-mer peptides are the most common 4 . These short peptides are usually obtained by proteasomemediated degradation of intracellular proteins. Then, presenting these pHLAs on the cell surface for CD8 + T cells to recognize 5,6 .
Since the HLA molecule is highly specific and has a polymorphism in the human population 7 , coupled with the source antigen it is randomly processed by proteasome intracellularly, a small proportion of peptides can be presented to the HLA molecules 1 . Determining which peptides are selected for display in an individual's HLA type is crucial to epitope selection. The most reliable method is to verify the affinity of peptides and HLA through experiments, which are time-and labor-consuming. Given that the affinity of the peptide and HLA is closely related to whether it can be presented, many in silico methods have been developed to predict the affinity between the peptide and HLA (the related works are summarized in Section 1 of the Supplementary Information). The existing methods are mainly based on machine learning models, especially neural networks, to predict the binding or affinity between peptides and some HLA alleles 8 . However, over the years of dataset accumulation and modeling improvement, although the accuracy is as high as 90% for peptides of length 9 9 , other performance evaluation metrics and prediction capabilities for other peptide lengths are still not satisfactory 8 . The reason for the above situation is that, compared with other lengths, 9-mer peptides are easier to bind to HLA and have the most pHLA binding data. In contrast, the peptides with lengths 13 and 14 have few pHLA binding data. Moreover, many methods construct multiple allele-specific models 10 , which make it impossible to be applied in HLA or peptide lengths that do not exist in the training data, and there is a tendency for poor performance for HLA or peptide lengths with few training data. Therefore, a great deal of effort is urgently needed to develop pan-specific methods 10 , which are trained on multi-allele data and solve the problem of allele-specific methods, to accurately predict pHLA binding or not, especially for rare HLA and peptide lengths.
Moreover, it is attractive to synthesize short peptides to design highly targeted immune responses. Therefore, understanding the interaction of pHLA can promote the design of short peptide vaccines 11 and play an important role in the development of candidate vaccines for various diseases 12,13 . Several studies 14,15 demonstrate that neoantigens produced by non-synonymous mutations in cancer cells play a key role in the anti-tumor immune response.
Moreover, vaccines for neoantigens have been proven to be beneficial to clinical outcomes 16,17 . In addition, short peptide-based vaccines have many advantages over traditional vaccines 12,18 , such as the lower risk of genetic recombination and integration, higher stability, and the easier to synthesize and store, etc. The principle of short peptide vaccines is that antigen peptides bind to specific HLA to form HLA-peptide-TCR complexes to elicit T cell immune responses. Theoretically, the antigen peptide should have high affinity with specific HLA. At present, the process of identifying neoantigens is as follows 8 .
First, use a next-generation sequencing platform to characterize the nonsynonymous mutations of the primary tumor, and then use docking or other computational methods to predict the binding probability of the mutant peptide and the patient's HLA 19 . Finally, the number of candidate mutant peptides has been reduced, thus speeding up the process of experimental validation 20,21 .
However, the above-mentioned process is relatively complicated, requires high professional knowledge, and is difficult to be applied in large-scale development.
Therefore, the development of an automatic program to optimize mutated peptides (AOMP) would represent a huge breakthrough in the neoantigen design field.
In this study, we designed a Transformer 22 -derived model for pHLA binding prediction (TransPHLA), as shown in Figure 1. It is a pan-specific method 10 , which can achieve more stable and stronger performances and can be applied to rare and unseen HLA alleles. The core idea of the TransPHLA model is to apply self-attention 22 to peptides, HLAs, and pHLA pairs to obtain the binding score. In addition, we also introduced some techniques to optimize the model: (i) Embedding block: Besides the encoding of amino acids in the sequence, we added positional embedding to describe the position information of the sequence; (ii) Encoder block: Multiple self-attentions are applied to focus on different components of sequences, and padding positions of the sequence will be masked to prevent misleading the model; (iii) Feature optimization block: The fully-connected layers with the gyro channel that rise first and then fall are used to process the features obtained by the previous self-attention block to obtain better feature representation; (iv) Projection block: Use multiple fullyconnected layers to predict the final pHLA binding score. The above four submodules together constitute the whole model. We compared TransPHLA with 11 previous PHLA binding prediction methods, including the state-of-the-art method 23 , IEDB recommended method 9 , and 9 benchmark methods 9,24-31 .
TransPHLA not only achieved better performance, but also solved the limitations of many methods in HLA allele and peptide length. Moreover, the speed of TransPHLA is very high, the prediction of 170000 samples only needs 2 minutes on a CPU, and our web server does not limit the number of user inputs. Further, we conduct two types of case studies to demonstrate the usability and validity of the proposed TransPHLA. The TransPHLA shows an extraordinary performance compared with 11 other methods of neoantigen identification 32,33 , and it achieves a positive screening rate of 96%. For human papilloma virus (HPV) vaccine identification 34 , although the positive screening rate is not very high, it is also significantly superior to the other 11 methods.
In another crucial aspect, we developed an AOMP program based on the attention mechanism obtained by trained TransPHLA, which can be used in many aspects especially vaccine design. The details of AOMP are shown in cross-immunization. We tested 366 combinations of different HLAs and peptide binder lengths, and randomly selected 10 non-binding peptides for each combination, for a total of 3660 negative pHLAs. Among them, 3630 successfully found the binding mutant peptide-HLA, and 94% of them were verified by the method recommended by IEDB 9 , which confirmed the usability of our program. Further, 88% of the 3660 source peptides can find successful mutant peptides with a homology of more than 80% (mutate 1-2 sites), which is exciting for vaccine design.
The TransPHLA and AOMP program jointly form the TransMut framework, which is the first successful attempt to apply Transformer to the field of biomolecular mutations. This framework can be applied to any biomolecular mutation task, such as epitope optimization 35 , drug design 36 . Especially in the vaccine development, like TNFα-targeted vaccine, due to the biological activity of TNFα, it will cause inflammation in the body, and long-term medication will cause the risk of autoimmune diseases 37 . The core problem of TNFα vaccine development is how to reduce the biological activity of TNFα while maintaining sufficient immunogenicity 38 . Therefore, the AOMP program is just suitable to be applied to the project. First, use the Transformer-derived model to train the mutation direction data of the biomolecule, and then the attention score toward the mutation direction is obtained. Based on the attention score, the AOMP program will find a better mutant.
We provide a web server for TransPHLA and AOMP programs, the input and output results of which are shown in Figure 3. The data and code are freely available.

Comparison of TransPHLA with existing methods
To verify the effectiveness of the TransPHLA, we compare it with 9 benchmark methods from the Immune Epitope Database (IEDB), the recommended method from IEDB (i.e. NetMHCpan_EL 9 ), and the state-of-theart method published in 2021 (i.e. Anthem 26 ). The benchmark methods are ANN 24 , Consensus 28 , NetMHCcons 29 , NetMHCpan_BA 10 , NetMHCstabpan 31 , PickPocket 30 , CombLib 27 , SMM 25 , and SMMPMBEC 26 , which can be obtained from http://tools.iedb.org/main/tools-api/. Different methods provide different scoring methods to determine whether pHLA can bind, such as predicted IC50, predicted score, and percentile rank. We prefer to use predicted IC50 and predicted score as the criteria for the regression task and binary classification task, and Consensus only provides percentile rank as the criterion.
Supplementary Table 1   pHLA test set In Matchable subfigures, the data used for the performance comparison of the methods are all consistent, so the performance results can be compared uniformly and directly. In Unmatchable subfigures, HLAs and peptide lengths that can be predicted by different methods are different. Therefore, for each method in these subfigures, the data used for performance comparison is a subset of the provided data. To make the performance comparison fairer and more reasonable, the proposed method performs a pairwise comparison with each method on the corresponding subset data. It can be seen that the proposed method is superior to other methods regardless of whether it is independent data or external data. Although NetMHCpan_EL has achieved good performance on external data, its performance on independent data is greatly reduced. Independent data contains 112 types of HLA alleles, while external data contains only 5 HLA alleles. We mentioned before that the two test data are complementary in the performance comparison of the methods, thus, only a method that works well on both two types of data can demonstrate its superiority. The Anthem method shows slightly inferior performance than TransPHLA, but it cannot be extended to some unknown HLAs or peptide lengths due to its limited published data.
Also, the performance of each method for each peptide length on the independent test set and external test set is discussed. We show the violin plot for the distribution of AUC, Accuracy, MCC, and F1 of the 12 methods on the independent test set and external test set ( Supplementary Figures 1-8 Overall, the proposed method shows its superior predictive ability compared with other methods.

Neoantigen identification
Epitope screening and identification mainly depend on the pHLA binding, especially in neoepitope-based immunotherapy recognized as the most promising cancer treatment, the primary determinant of neoantigen screening is the binding of peptide and autologous specific HLA molecule 40 .
We collected neoantigen data from nonsmall-cell lung cancer(NSCLC), melanoma, ovarian cancer, and pancreatic cancer in recent work 32 Table 2. Excitingly, our method was able to screen out 96.4% of neoantigens. Although CombLib has achieved 100% accuracy, it only supports peptides with a length of 9, which limits its application. The remaining 10 methods are not as accurate as our method, and they may be limited by predictable HLAs or peptide lengths.

HPV vaccine identification
In the world, about 80% of people have been exposed to HPV before the age of 50, which is the most common sexually transmitted disease 41 . Currently, HPV cannot be cured, but there are some preventive HPV vaccines. However, the therapeutic effect of these vaccines is limited, and the use rate is very low 42 .
Therefore, it is very urgent to develop therapeutic vaccines to treat HPV infections and diseases.
An important study 34 has displayed some experimentally verified HPV vaccine data, including 278 experimentally verified pHLA binders from HPV16 proteins E6 and E7, composed of 8-11 mer peptides. The comparison results of different methods on this data are shown in Figure 4(f) and Supplementary   Table 3. Although our method only shows a screening rate of 68%, it is still significantly superior to other methods and can assist experimental research.

TransPHLA uncovers the underlying patterns of pHLA binding
The attention mechanism included in TransPHLA provides biological interpretability for the model. Further, we explored the binding rules of pHLA according to our method. Evidence shows that the C-terminal, N-terminal, and anchor site 43 of the peptide are critical for binding to HLA, which are always located by the first, last, and second position of the peptide sequence. The attention score of the position is confirmed, as shown in Figure 5  In addition, since the attention score represents the pattern of pHLA binding, it implies that the key amino acid sites on the peptide sequence which are important for binding or non-binding to the target HLA. To better persuade, we visualized the binding pattern of 5 HLA alleles according to ACME 44 ( Figure   5(c)). As expected, TransPHLA found a similar pattern for amino acid types at different peptide positions to the previous studies 44,45 . For HLA-A*11:01, TransPHLA recognizes the anchor residue that the peptides with K (Lys) at position 9 (9-th K). For HLA-B*40:01, the important residues, which are 2-nd E (Glu) and 9-th L (Leu), were successfully identified by TransPHLA. For HLA-B*57:03, hydrophobic residues usually form the binding pocket, and we identified this preference through 9-th L, 9-th F (Phe), and 9-th W (Trp), which is consistent with 2BVP 46 . For HLA-A*68:01, 4HWZ 47 demonstrates the 9-th K and 9-th R (Arg) residues of the peptide significantly contribute to the binding.
For HLA-B*44:02, the significance of 2-nd E has been proved by 1M6O 48 . All these results have been supported by previous studies, and demonstrate the effectiveness of our methods.

AOMP program
AOMP program is proposed to search for mutant peptides with higher affinity given the specific important peptide with weak affinity for a specific HLA allele.
For example, Figure 2 visualizes the process of AOMP and automatic mutation of strategy 2 for source peptide DLLPETPW and target HLA-B*51:01.
To demonstrate the effectiveness of our program, we tested all 366 HLApeptide length combinations. For each HLA-peptide length combination, we randomly selected 10 negative pHLA that were correctly predicted by TransPHLA. Finally, we ran the automatic mutation program for 3660 negative pHLA, of which 3633 were successful. Among them, 1692, 1536, 350, and 55 were successfully mutated at least 1, 2, 3, and 4 amino acid sites, respectively.
Further, 88% of the source peptides can find successful mutant peptides with a homology of more than 80%, which is exciting for vaccine design.
To further verify the authenticity and usability of the mutation results, we use the NetMHCpan_BA recommended by IEDB 9 to validate mutation results for 3660 pHLA. According to the predicted IC50 threshold of 500, 2905 (80%) are successful. And 3418 (94%) are successful according to percentile rank threshold of 2. Overall, 3419 (94%) are successful in either IC50 or percentile rank.

Discussion
pHLA binding and interaction are critical to epitope presentation and prerequisite for subsequent T-cell recognition which initiates an effective immune response. Epitope screening and identification mainly depend on the affinity of pHLA, especially in neoepitope-based immunotherapy recognized as the most promising cancer treatment, the primary determinant of neoantigen screening is the affinity of peptide and autologous specific HLA molecule. Therefore, accurate pHLA binding prediction is essential for the identification of immunotherapy targets, epitope screening, and vaccine design. On the other hand, the short peptide-based vaccine design is another important field for the treatment of diseases. However, the current vaccine design method is still in its infancy and cannot yet be automated.
Firstly, we proposed a TransPHLA method for pHLA binding prediction, which is a generalized pan-specific model that is not restricted by HLA allele and peptide length. We conducted two types of independent tests and two types of case studies (that are neoantigen and HPV vaccine identification). Compared with the state-of-the-art method (i.e. Anthem), IEDB recommended method (i.e. NetMHCpan_EL), and 9 benchmark methods, the proposed TransPHLA achieves significantly superior performance on all four experiments.
Then, based on TransPHLA, we develop an AOMP program to search for mutant peptides with higher affinity to the target HLA and high homology with the source peptide. Among 3660 negative pHLA for different HLAs and peptide lengths, 3630 samples are successfully found for the binding mutant peptide-HLA, 94% were verified by the method recommended by IEDB, 88% with a homology of more than 80%, which is exciting for vaccine design. This is the first attempt to propose a TransMut framework in the field of automatic mutation of biomolecules. Excitingly, it has achieved success in the field of pHLA binding prediction and peptide-based vaccine design. It is worth noting that this framework can be applied to any binding prediction and mutation task of biomolecules.

Dataset
In this study, the pHLA binding data (that is, positive data) are obtained from  Table 4).
To compare with previous methods conveniently, we followed the training and evaluation strategy of Anthem 23 , which is the state-of-the-art pHLA binding

Experiment settings
To follow the previous studies 8, 23 for pHLA binding prediction, we conducted the 5-fold cross-validation (CV) and independent test. Since the source of the independent test set and the training set are the same, the data distributions between the training set and independent test set are very similar (see Supplementary Figures 9 and 10). When the model is tested on the data with a similar distribution to the training data, it will be easier to obtain better test performance than a model that is not trained with a similar distribution of test data. That is, the proposed method and Anthem 23 may obtain greater advantages than other methods on the independent test set. Thus, we set up an external test to perform a fairer performance comparison of different methods.

5-fold CV
The 5-fold CV used in this study has two purposes, which are trained model

Independent test
Independent test is also a popular strategy for performance evaluation of methods. Independent test data require that it cannot have any overlap with training data. It is used as unified data to evaluate the performance of different methods. Therefore, compared with the CV, using the same independent test data will more objectively evaluate the performance and generalization ability of different methods.

External test
For a fair comparison, we used experimental data as external test data to eliminate possible deviations due to the similar data distribution. According to Supplementary Figures 9 and 10, the data distribution of the external test is a little bit different from that of the training and independent test data. Like the independent test, it can also more objectively evaluate the performance and generalization ability of the method.

Performance evaluation metrics
For each predictive model or method, the following metrics are calculated: where TP is true positive, FP is false positive, FN is false negative, TN is true negative, and MCC is Matthews Correlation Coefficient. In addition, we adopt AUC, i.e. the area under the receiver operating characteristic curve, as another performance evaluation metric. The above four metrics can comprehensively evaluate the performance of models or methods.
Except for MCC, which ranges from -1 to 1, other metrics range from 0 to 1. The higher value of the metric, the better the model or method. It is worthwhile to note that MCC cannot be calculated when two of TN, TP, FN, FP are 0, because the denominator is 0. And this phenomenon is not caused by both FN and FP being 0. Thus, if the MCC cannot be calculated for a specific peptide length of a specific HLA, it implies that the method is invalid for this HLA with this peptide length.

TransPHLA
The core idea of TransPHLA is the application of the self-attention mechanism 22 , and it is composed of four blocks to achieve high prediction performance. The embedding block added positional embedding to the amino acid embedding, to generate the sequence embedding, and then a dropout technology is used to enhance the robustness. Through the embedding block, TransPHLA generates the input embedding for peptide and HLA respectively.
Next, input the peptide and HLA embedding into the encoder block, respectively, which contains the masked multi-head self-attention mechanism and the feature optimization block. The feature optimization block is a combination of fully connected layers in which the channel of the gyro first rises and then falls.
This module makes the feature representation obtained by the attention mechanism better, mainly because more layers are added. Then, the output feature representations of peptide and HLA are concatenated as the embedding of the pHLA pair. After pHLA pair embedding passes through the encoder block, the projection block is used to predict the pHLA binding score. The details can be seen in Figure 1. On the other hand, the order of amino acids is critical to the structure of the peptide and HLA sequence, but the above embedding method does not consider it. Thus, we apply positional embedding to encode the position of the amino acid in the sequence. Given the position p in the sequence, the positional embedding is also encoded as a dX-dimensional vector, the value of the i-th element of this vector is PE(p)i, the formulas are where 2i represents the even dimensions, and 2i+1 represents the odd dimensions. This position embedding method can not only reflect the absolute position information of the amino acid, but also the relative position information.
We visualize positional embedding in Supplementary Figure 11

Masked multi-head self-attention mechanism
The attention mechanism is to focus on the important information and Compared with RNNs, Transformer realizes parallelization and solves the long-term dependencies problem, hence, it can process the data faster than RNNs. Compared with CNNs, which extract local information commendably, Transformer extracts more global information, which is suitable for the information exploration of the whole sequence of peptide and HLA.
The self-attention mechanism belongs to a variant of the attention mechanism, which captures the internal correlation of a sequence, and reduces the dependence on external information. In this mechanism, Q, K, and V matrices are all generated based on the input sequence S. The calculation process of the self-attention mechanism can be summarized into three processes ( Figure 6): (1) Calculate Q, K, and V according to input embedding X of sequence S; (2) The similarity or correlation between Q and K is calculated to generate the weight of V (i.e. attention score); (3) The attention (i.e. output) is calculated by the weighted sum of V.
Multi-head attention uses multiple Q, K, and V to calculate multiple attentions in parallel, where each attention mechanism focuses on a different information pattern of the input sequence. Suppose h attentions are to be calculated, that is, the number of heads is h. The Q, K, and V obtained by linear mapping from X are divided into h parts, and then the attention is calculated for each part. Concatenate these attentions, and then perform linear mapping again to obtain the final output, whose dimension is the same as X. The detailed process can be seen in Figure 6. The whole process is expressed by the formula as follows: where i represents the i-th head attention, dK is the dimension of Ki, and � is the scaled factor to prevent the large dot products.
It is worth noting that this study introduced the mask operation when calculating the attention. For peptide or HLA sequences whose length is less than the corresponding maximum length, non-amino acid characters should not be considered for the model training. Therefore, we use 10 -9 that is very close to 0 as their attention scores, so that non-amino acid characters do not play a role in calculating the attention. � is the scaled factor to prevent the large dot products.

AOMP program
In this study, we developed an AOMP program for the first time. This program aims to search for higher affinity mutant peptides based on the specific important peptide with weak affinity for a specific HLA allele. For example, the specific important peptides can be E6 and E7 peptide from HPV, neoantigen, and TNF epitope.
The program designed four directed mutation strategies based on the attention score obtained by TransPHLA ( Figure 2). According to the analysis in Section 2.2, the attention score not only represents the pattern of pHLA binding, but also reveals that the key amino acid sites on the peptide sequence which are important for binding or non-binding to the target HLA. On the other hand, for effective vaccine design, we also considered the homology of the mutant peptide and the source peptide. The homology between the mutant peptide and the source peptide is calculated by sequence similarity, and experiments show that the similarity is very close to the blast result. The homology of 1, 2, 3, and 4 amino acid positions were mutated on average 90%, 80%, 70%, and 61%, respectively. Therefore, we limit the number of mutations in the amino acid site of the source peptide to no more than 4. Subsequently, four optimization strategies are designed, with details as follows. We calculated two contribution rate matrices based on the above two contribution matrices. The larger the element value in the contribution matrix, means that the corresponding amino acid site is more critical for binding or nonbinding. Intuitively, since the amino acid site contributes more to non-binding prediction, if we replace them with other amino acids which contribute more to binding prediction, the mutated peptide is more likely to have a better affinity with the target HLA. Based on the above four matrices, we designed four strategies to generate mutant peptides. Their main idea is to compare the amino acid sites on the source peptide that have a large impact on weak affinity and the amino acid sites on the target HLA-peptide length that contribute significantly to the high affinity. Then, the corresponding amino acid substitutions are made according to the comparison results. The process is as follows: (1) Predict the binding score for source peptide and target HLA; (2) Find some most important amino acid sites based on the self-attention mechanism; (3) Replace these important sites of weak affinity pHLA with some amino acids which may contribute more on binding prediction; (4) Select some best mutation candidates for evaluation.
For the source peptide and the target HLA (i.e. the specific pHLA), the mutant peptides generated by the four strategies will be merged, and the duplicates will be removed. Then, TransPHLA will screen and retain mutant peptides that can bind to the target HLA. Excitingly, the original intention of this program is for non-binding pHLA, and we found it can also find mutant peptides with a stronger affinity for binding pHLA.