1 Predicting the 3D structure of CDR3-β and epitope sequences using OpenFold.
The model architecture used in this study is illustrated in Fig. 1. The sequences and pairing information for CDR3-β and epitopes were sourced from the VDJdb18, Immune Epitope Database (IEDB)19, and McPAS-TCR20 databases. Together, the three databases contain 128,259 unique CDR3-β-epitope pairs, comprising 127,507 CDR3-β sequences and 1,176 epitope sequences. Following data cleaning (Supplementary Fig. 1), a total of 65,069 records were incorporated into the training and testing sets (Supplementary Table 1).
To incorporate structural information into the training data, we employed OpenFold to predict the 3D structures of the peptide chains used in training, and characterized each structure with a residue contact matrix (RCM). The predicted structures were compared with experimentally determined protein structures from the RCSB Protein Data Bank (RCSB PDB)21 (Fig. 2A). OpenFold demonstrates strong predictive performance for epitopes, with a root mean square deviation (RMSD) of 0.4 ± 0.24 Å between measured and predicted structures (Fig. 2B). However, it shows a relatively higher RMSD for the CDR3-β structure because it cannot predict the folding of the middle segment of CDR3-β (Fig. 2C). This folding involves interactions with other amino acids in the TCR α and β chains, so the sequence information of the CDR3-β region alone is insufficient to predict it. Nevertheless, OpenFold accurately predicts the structures of individual subsegments of CDR3-β, including the N-terminal, the C-terminal, and the complex structure in the middle segment (Fig. 2C). Furthermore, OpenFold's predictions can distinguish the structures of very similar sequences that differ by only one amino acid (Fig. 2D).
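For reference, the RCM used throughout is simply a pairwise residue distance matrix. A minimal sketch, assuming Cα-based distances and that predicted coordinates are available as an (L, 3) array (the function names are ours):

```python
import numpy as np

def residue_contact_matrix(ca_coords: np.ndarray) -> np.ndarray:
    """Pairwise Cα-Cα distance matrix (L x L, in Å) from an (L, 3)
    array of predicted Cα coordinates."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD between two pre-aligned coordinate sets of equal shape,
    as used to compare OpenFold predictions with PDB structures."""
    return float(np.sqrt(((pred - ref) ** 2).sum(axis=-1).mean()))
```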
2 Training a discriminative model for predicting the binding of CDR3-β and epitope.
We trained a discriminative model (CATCR-D) for predicting the binding of CDR3-β to epitopes. The model consists of three parts: a CDR3-β encoder and an epitope encoder for feature extraction, and a multi-layer linear discriminator that outputs a judgment on binding (module 1 in Fig. 1). In each encoder, we extracted sequence and structural features of the peptide chain and concatenated them. For the sequence features, we considered that certain contiguous short amino acid segments (2–5 residues) occur frequently within CDR3-β and epitope sequences, form specific secondary structures, and provide crucial recognition information for binding; highlighting the differences between such similar sequences is therefore meaningful. Instead of single amino acid-based coding methods such as BLOSUM62, we developed a segment-based coding approach in which frequently occurring amino acid segments (2–5 residues) are treated as single characters for coding (Fig. 3A), while infrequent amino acids are encoded individually. Segments appearing more than 1,000 times in the dataset are considered frequent (Supplementary Table 2).
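A minimal sketch of the segment-based coding, using greedy longest-match tokenization; the segment vocabulary shown is hypothetical (the real vocabulary is the data-derived >1,000-occurrence set of Supplementary Table 2):

```python
# Hypothetical stand-in for the data-derived frequent-segment vocabulary.
FREQUENT_SEGMENTS = {"CASS", "GELF", "QYF"}

def segment_tokenize(seq: str, max_len: int = 5, min_len: int = 2) -> list[str]:
    """Greedily match the longest frequent segment (2-5 residues) at each
    position; residues outside any frequent segment become single tokens."""
    tokens, i = [], 0
    while i < len(seq):
        for k in range(min(max_len, len(seq) - i), min_len - 1, -1):
            if seq[i:i + k] in FREQUENT_SEGMENTS:
                tokens.append(seq[i:i + k])   # frequent segment -> one token
                i += k
                break
        else:
            tokens.append(seq[i])             # infrequent residue, encoded alone
            i += 1
    return tokens

print(segment_tokenize("CASSGELFQYF"))  # ['CASS', 'GELF', 'QYF']
```

Whether the published implementation resolves overlapping segments this way is an assumption; greedy longest-match is one common choice.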
We then used a Transformer encoder to extract features from the encoded sequence. For structural features, we employed RCMs, which represent the distances between amino acids. Two CNNs with different kernel sizes extract features from the RCM; these are concatenated with the sequence features from the Transformer encoder before entering the discriminator. The discriminator consists of four progressively narrowing fully connected layers whose output, after Sigmoid activation, is used to compute the cross-entropy loss. We constructed the negative sample set by random selection22, choosing CDR3-β sequences from the dataset that were confirmed not to bind the given epitope, with a negative-to-positive ratio of 1:1. During training, the cross-entropy loss of both the training and validation sets decreased sharply at first; the training loss stabilized after 160 epochs and the validation loss after 320 epochs (Fig. 3B).
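The following PyTorch sketch captures the discriminator's overall shape under our reading of the text; all widths, kernel sizes, and pooling choices are illustrative rather than the published hyperparameters, and the two encoders are collapsed into one module for brevity:

```python
import torch
import torch.nn as nn

class DiscriminatorSketch(nn.Module):
    """Transformer over segment tokens for sequence features, two CNNs
    with different kernel sizes over the RCM, concatenation, then four
    progressively narrowing fully connected layers."""
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.seq_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.cnn_small = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.cnn_large = nn.Conv2d(1, 16, kernel_size=7, padding=3)
        self.pool = nn.AdaptiveAvgPool2d(4)
        self.head = nn.Sequential(                 # four shrinking FC layers
            nn.Linear(d_model + 2 * 16 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),                      # Sigmoid folded into the loss
        )

    def forward(self, tokens: torch.Tensor, rcm: torch.Tensor) -> torch.Tensor:
        seq = self.seq_enc(self.embed(tokens)).mean(dim=1)     # (B, d_model)
        rcm = rcm.unsqueeze(1)                                 # (B, 1, L, L)
        a = self.pool(torch.relu(self.cnn_small(rcm))).flatten(1)
        b = self.pool(torch.relu(self.cnn_large(rcm))).flatten(1)
        return self.head(torch.cat([seq, a, b], dim=1)).squeeze(-1)

# Sigmoid + cross-entropy as described is equivalent to BCEWithLogitsLoss.
criterion = nn.BCEWithLogitsLoss()
```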
The main current challenge is to predict whether an unseen epitope can bind to a given TCR. We therefore set up two test sets: the first contains epitopes that appeared in the training set paired with CDR3-β sequences that did not (internal test set), and the second contains epitopes and CDR3-β sequences that never appeared in the training set (external test set) (Supplementary Fig. 3). On the internal test set, CATCR-D achieved a precision of 92.8%, a recall of 98.9%, and an F1 score of 0.958; on the external test set, it achieved an accuracy of 84.8%, a recall of 82.8%, and an F1 score of 0.837 (Fig. 3C).
The area under the receiver operating characteristic curve (AUROC) values for the internal and external test sets are 0.965 ± 0.003 and 0.891 ± 0.006, respectively. We selected four classic or recently reported models as controls: TITAN23, epiTCR24, TEINet25, and EPIC-TRACE26. TITAN embeds epitopes as SMILES strings and full TCR sequences with BLOSUM6227, using convolution and contextual attention. epiTCR employs a random forest on BLOSUM62-embedded CDR3 sequences. EPIC-TRACE uses ProtBERT embeddings to represent the amino acid sequences of both chains and epitopes, combined with convolution and multi-head attention. TEINet applies transfer learning to the prediction problem. Compared with these models, CATCR-D improved the AUROC on both the internal and external test sets (Supplementary Table 3). In the external test set in particular, CATCR-D showed a notable gain in generalization to unseen antigen epitopes (Fig. 3D and Supplementary Table 3).
We further investigated the key factors behind the improved performance of CATCR-D. When using only a Transformer model, the predicted AUROC for unseen epitope-CDR3-β pairs was only 0.548 ± 0.013, similar to earlier studies23,24; when using only a CNN to extract features from the RCM, the AUROC improved to 0.756 ± 0.008 (Fig. 3E). This suggests that both sequence and 3D structure provide crucial information for predicting epitope-CDR3-β binding. Regarding the impact of sequence coding on prediction performance, segment-based coding significantly improved predictive performance compared with BLOSUM62 (0.779 ± 0.008) (Fig. 3F). In our training and testing data, epitope peptide lengths ranged from 7 to 24 residues, and epitope occurrence frequencies (the number of paired TCRs in the database) ranged from 1 to 500. We analyzed the model's predictive performance for unseen epitope-CDR3-β pairs across epitopes of different lengths and occurrence frequencies; the AUROC ranged between 0.673 and 0.949. The AUROC increased with epitope peptide length, and epitopes with higher occurrence frequencies showed higher predictive accuracy for their pairings (Fig. 3G).
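The stratified analysis can be reproduced schematically as below; `records` is a hypothetical list of (epitope, label, score) triples from the external test set:

```python
from sklearn.metrics import roc_auc_score

def auroc_by_epitope_length(records):
    """Group predictions by epitope length and compute AUROC per group."""
    groups = {}
    for epitope, label, score in records:
        groups.setdefault(len(epitope), []).append((label, score))
    return {
        length: roc_auc_score([y for y, _ in pairs], [s for _, s in pairs])
        for length, pairs in groups.items()
        if len({y for y, _ in pairs}) == 2   # AUROC requires both classes
    }
```

The same grouping applied to occurrence frequency instead of length gives the frequency-stratified results in Fig. 3G.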
3 Training the residue contact matrix transformer (RCMT)
The performance of CATCR-D suggests that our encoders capture generalized features of epitopes and CDR3-β. We therefore developed a generative model in which a decoder leverages features from the CATCR-D encoder to predict CDR3-β sequences that bind unseen epitopes. Previous results indicate that the structural data represented by the RCM contain crucial information about epitope-TCR binding. We aimed to incorporate structural data into the decoder as well, but for a generative model the structure of the target sequence is unknown in advance. We therefore pre-trained the RCMT to estimate the RCM of CDR3-β from the epitope sequence and its RCM. This model takes the features output by the CATCR-D encoder as input and employs a linear decoder to predict the RCM of CDR3-β.
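A minimal sketch of the RCMT under our reading: the pre-trained encoder supplies epitope features, and a linear decoder maps them to a CDR3-β RCM. The padded CDR3-β length of 25 matches the padding used later for the generator; freezing the encoder at this stage is our assumption:

```python
import torch
import torch.nn as nn

class RCMTSketch(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, max_len: int = 25):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # reuse CATCR-D encoder weights
        self.max_len = max_len
        self.decoder = nn.Linear(feat_dim, max_len * max_len)

    def forward(self, epitope_tokens, epitope_rcm):
        feats = self.encoder(epitope_tokens, epitope_rcm)    # (B, feat_dim)
        out = self.decoder(feats).view(-1, self.max_len, self.max_len)
        return 0.5 * (out + out.transpose(1, 2))  # distances are symmetric

# Trained with an element-wise regression loss (e.g. MSE) against
# OpenFold-predicted RCMs used as labels.
```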
During training, the losses of both the training and validation sets decreased with increasing epochs; the training loss stabilized after 200 epochs and the validation loss after 300 epochs (Fig. 4A). Using OpenFold's predictions as labels, the average discrepancy between the RCMs predicted by the RCMT and those predicted by OpenFold is 1.695 ± 2.040 Å. Figure 4B shows the average difference between the two prediction methods at each position; the per-position distance differences range from 0.010 to 6.195 Å, with relatively large deviations at positions 12 and 13 and smaller deviations elsewhere. For a given epitope, the dataset may contain multiple CDR3-β label sequences, whereas the RCMT outputs a single predicted matrix. We therefore analyzed the relationship between the distribution of label distances at each position and the RCMT's predicted values. Figure 4C shows the results for three epitopes with many paired CDR3-β labels (external test set). At most positions, the distances predicted by the RCMT are close to the median of the label distance set, indicating that the RCMT can reflect the landscape of the corresponding CDR3-β structures from the epitope sequence and structure.
4 Generator for predicting CDR3-β sequences that bind to a given epitope.
We previously trained an encoder and an RCMT using the epitope's sequence and structural information. Building on these, we developed a generative model, CATCR-G, that incorporates the pre-trained weights to predict binding CDR3-β sequences for a given epitope. The model uses a Transformer decoder to generate CDR3-β predictions from the epitope encoder's output and the RCM produced by the RCMT. The predicted CDR3-β sequence, together with the epitope's sequence and structural data, is then fed into the pre-trained discriminator, whose loss is used to refine the generator. During training, we froze the weights of the encoder and the RCMT to preserve their pre-trained states.
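Schematically, one CATCR-G training step combines a sequence cross-entropy term with the frozen discriminator's judgment of the generated sequence. How the discriminator loss is back-propagated into the generator is not detailed here, so this sketch decodes with argmax and treats the discriminator term as a placeholder (a differentiable relaxation of the tokens would be needed for its gradient to reach the generator); `lambda_d` is an assumed weighting:

```python
import torch
import torch.nn.functional as F

def catcr_g_step(generator, discriminator, encoder, rcmt,
                 epi_tokens, epi_rcm, target_cdr3, optimizer, lambda_d=0.5):
    feats = encoder(epi_tokens, epi_rcm)            # frozen pre-trained encoder
    pred_rcm = rcmt(epi_tokens, epi_rcm)            # frozen pre-trained RCMT
    logits = generator(feats, pred_rcm, target_cdr3[:, :-1])  # teacher forcing
    ce = F.cross_entropy(logits.flatten(0, 1), target_cdr3[:, 1:].flatten())

    gen_tokens = logits.argmax(-1)                  # hard decode (no gradient)
    d_logit = discriminator(gen_tokens, pred_rcm)   # frozen discriminator
    d_loss = F.binary_cross_entropy_with_logits(
        d_logit, torch.ones_like(d_logit))          # "should look binding"

    loss = ce + lambda_d * d_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```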
Initially, the training and validation losses declined quickly. While the training loss continued to drop up to 300 epochs, the validation loss plateaued after 100 epochs and became more variable after 200 epochs (Fig. 5A); we therefore concluded training at 300 epochs. In testing, we applied beam search to yield the top 7 CDR3-β sequence predictions and evaluated them against reference sequences using BERTScore (Fig. 5B), which leverages contextual embeddings from the BERT model. The external test set yielded a BERTScore precision of 0.959 ± 0.013, recall of 0.955 ± 0.018, and F1 score of 0.957 ± 0.014.
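The BERTScore computation can be sketched with the bert-score package; treating residue strings as text with the package's default model is a simplification, and the paper's exact BERT settings are not specified here:

```python
from bert_score import score  # pip install bert-score

candidates = ["CASSLAPGATNEKLFF"]  # hypothetical beam-search prediction
references = ["CASSLAPGTTNEKLFF"]  # hypothetical database reference
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.mean():.3f}  R={R.mean():.3f}  F1={F1.mean():.3f}")
```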
Our analysis revealed that the pre-trained encoder markedly accelerates early training, whereas models lacking this component showed reduced BERTScore recall, precision, and F1. Furthermore, while models employing only the pre-trained encoder matched the combined approach early in training, those also incorporating the RCMT eventually achieved higher BERTScore values, suggesting a synergistic benefit from using both (Fig. 5C).
We also examined the alignment between predicted and reference CDR3-β sequences at each position using BERTScore, padding shorter sequences to 25 amino acids with placeholders where necessary (most CDR3-β sequences are 8 to 12 amino acids long). This confirmed high similarity across corresponding positions, and the consistency in placeholder regions indicates that CATCR-G appropriately determines CDR3-β length (Fig. 5D). Additional evaluations with alternative metrics, ROUGE-L and Skip-Thought, yielded similarity scores of 0.580 ± 0.145 and 0.959 ± 0.040, respectively.
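The ROUGE-L check can be sketched with the rouge-score package; splitting sequences into single residues, so that the longest-common-subsequence matching operates at the amino-acid level, is our assumption about the tokenization:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
pred = " ".join("CASSLAPGATNEKLFF")   # hypothetical prediction
ref = " ".join("CASSLAPGTTNEKLFF")    # hypothetical reference
print(scorer.score(ref, pred)["rougeL"].fmeasure)
```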