1 Predicting the 3D structure of CDR3-β and epitope sequences using OpenFold.
The model architecture used in this study is illustrated in Fig. 1. The sequences and pairing information for CDR3-β and epitopes were sourced from the VDJdb18, Immune Epitope Database (IEDB)19, and McPAS-TCR20 databases. Together, the three databases contain 128,259 unique CDR3-β-epitope pairs, comprising 127,507 CDR3-β sequences and 1,176 epitope sequences. Following data cleaning (Supplementary Fig. 1), a total of 65,069 records were incorporated into the training and testing sets (Supplementary Table 1).
To incorporate structural information into the training data, we employed OpenFold to predict the 3D structures of the peptide chains used in training, and characterized each structure with a residue contact matrix (RCM). The predicted structures were compared with experimentally determined protein structures from the RCSB Protein Data Bank (RCSB PDB)21 (Fig. 2A). OpenFold demonstrates strong predictive performance for epitopes, with a root mean square deviation (RMSD) of 0.4 ± 0.24 Å between measured and predicted structures (Fig. 2B). However, it shows a relatively higher RMSD for the CDR3-β structure because it cannot predict the folding of the middle segment of CDR3-β (Fig. 2C). This folding involves interactions with other amino acids in the TCR α and β chains, so the sequence information of the CDR3-β region alone is insufficient to predict it. Nevertheless, OpenFold accurately predicts the structures of individual subsegments of CDR3-β, including the N-terminal, the C-terminal, and the complex structure in the middle segment (Fig. 2C). Furthermore, OpenFold's predictions can distinguish the structures of very similar sequences that differ by only one amino acid (Fig. 2D).
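For reference, the RCM used throughout is simply a pairwise residue distance matrix. A minimal sketch, assuming Cα-based distances and that predicted coordinates are available as an (L, 3) array (the function names are ours):

```python
import numpy as np

def residue_contact_matrix(ca_coords: np.ndarray) -> np.ndarray:
    """Pairwise Cα-Cα distance matrix (L x L, in Å) from an (L, 3)
    array of predicted Cα coordinates."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD between two pre-aligned coordinate sets of equal shape,
    as used to compare OpenFold predictions with PDB structures."""
    return float(np.sqrt(((pred - ref) ** 2).sum(axis=-1).mean()))
```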
2 Training a discriminative model for predicting the binding of CDR3-β and epitope.
We trained a discriminative model (CATCR-D) for predicting the binding of CDR3-β to epitopes. The model consists of three parts: a CDR3-β encoder and an epitope encoder for feature extraction, and a multi-layer linear discriminator that outputs a judgment on binding (module 1 in Fig. 1). In each encoder, we extracted sequence and structural features of the peptide chain and concatenated them. For the sequence features, we considered that certain contiguous short amino acid segments (2–5 residues) occur frequently within CDR3-β and epitope sequences, form specific secondary structures, and provide crucial recognition information for binding; highlighting the differences between such similar sequences is therefore meaningful. Instead of single amino acid-based coding methods such as BLOSUM62, we developed a segment-based coding approach in which frequently occurring amino acid segments (2–5 residues) are treated as single characters for coding (Fig. 3A), while infrequent amino acids are encoded individually. Segments appearing more than 1,000 times in the dataset are considered frequent (Supplementary Table 2).
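A minimal sketch of the segment-based coding, using greedy longest-match tokenization; the segment vocabulary shown is hypothetical (the real vocabulary is the data-derived >1,000-occurrence set of Supplementary Table 2):

```python
# Hypothetical stand-in for the data-derived frequent-segment vocabulary.
FREQUENT_SEGMENTS = {"CASS", "GELF", "QYF"}

def segment_tokenize(seq: str, max_len: int = 5, min_len: int = 2) -> list[str]:
    """Greedily match the longest frequent segment (2-5 residues) at each
    position; residues outside any frequent segment become single tokens."""
    tokens, i = [], 0
    while i < len(seq):
        for k in range(min(max_len, len(seq) - i), min_len - 1, -1):
            if seq[i:i + k] in FREQUENT_SEGMENTS:
                tokens.append(seq[i:i + k])   # frequent segment -> one token
                i += k
                break
        else:
            tokens.append(seq[i])             # infrequent residue, encoded alone
            i += 1
    return tokens

print(segment_tokenize("CASSGELFQYF"))  # ['CASS', 'GELF', 'QYF']
```

Whether the published implementation resolves overlapping segments this way is an assumption; greedy longest-match is one common choice.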
We then used a Transformer encoder to extract features from the encoded sequence. For structural features, we employed RCMs, which represent the distances between amino acids. Two CNNs with different kernel sizes extract features from the RCM; these are concatenated with the sequence features from the Transformer encoder before entering the discriminator. The discriminator consists of four progressively narrowing fully connected layers whose output, after Sigmoid activation, is used to compute the cross-entropy loss. We constructed the negative sample set by random selection22, choosing CDR3-β sequences from the dataset that were confirmed not to bind the given epitope, with a negative-to-positive ratio of 1:1. During training, the cross-entropy loss of both the training and validation sets decreased sharply at first; the training loss stabilized after 160 epochs and the validation loss after 320 epochs (Fig. 3B).
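The following PyTorch sketch captures the discriminator's overall shape under our reading of the text; all widths, kernel sizes, and pooling choices are illustrative rather than the published hyperparameters, and the two encoders are collapsed into one module for brevity:

```python
import torch
import torch.nn as nn

class DiscriminatorSketch(nn.Module):
    """Transformer over segment tokens for sequence features, two CNNs
    with different kernel sizes over the RCM, concatenation, then four
    progressively narrowing fully connected layers."""
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.seq_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.cnn_small = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.cnn_large = nn.Conv2d(1, 16, kernel_size=7, padding=3)
        self.pool = nn.AdaptiveAvgPool2d(4)
        self.head = nn.Sequential(                 # four shrinking FC layers
            nn.Linear(d_model + 2 * 16 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),                      # Sigmoid folded into the loss
        )

    def forward(self, tokens: torch.Tensor, rcm: torch.Tensor) -> torch.Tensor:
        seq = self.seq_enc(self.embed(tokens)).mean(dim=1)     # (B, d_model)
        rcm = rcm.unsqueeze(1)                                 # (B, 1, L, L)
        a = self.pool(torch.relu(self.cnn_small(rcm))).flatten(1)
        b = self.pool(torch.relu(self.cnn_large(rcm))).flatten(1)
        return self.head(torch.cat([seq, a, b], dim=1)).squeeze(-1)

# Sigmoid + cross-entropy as described is equivalent to BCEWithLogitsLoss.
criterion = nn.BCEWithLogitsLoss()
```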
The main current challenge is to predict whether an unseen epitope can bind to a given TCR. We therefore set up two test sets: the first contains epitopes that appeared in the training set paired with CDR3-β sequences that did not (internal test set), and the second contains epitopes and CDR3-β sequences that never appeared in the training set (external test set) (Supplementary Fig. 3). On the internal test set, CATCR-D achieved a precision of 92.8%, a recall of 98.9%, and an F1 score of 0.958; on the external test set, it achieved an accuracy of 84.8%, a recall of 82.8%, and an F1 score of 0.837 (Fig. 3C).
The area under the receiver operating characteristic curve (AUROC) values for the internal and external test sets are 0.965 ± 0.003 and 0.891 ± 0.006, respectively. We selected four classic or recently reported models as controls: TITAN23, epiTCR24, TEINet25, and EPIC-TRACE26. TITAN embeds epitopes as SMILES strings and full TCR sequences with BLOSUM6227, using convolution and contextual attention. epiTCR employs a random forest on BLOSUM62-embedded CDR3 sequences. EPIC-TRACE uses ProtBERT embeddings to represent the amino acid sequences of both chains and epitopes, combined with convolution and multi-head attention. TEINet applies transfer learning to the prediction problem. Compared with these models, CATCR-D improved the AUROC on both the internal and external test sets (Supplementary Table 3). In the external test set in particular, CATCR-D showed a notable gain in generalization to unseen antigen epitopes (Fig. 3D and Supplementary Table 3).
We further investigated the key factors behind the improved performance of CATCR-D. When using only a Transformer model, the predicted AUROC for unseen epitope-CDR3-β pairs was only 0.548 ± 0.013, similar to earlier studies23,24; when using only a CNN to extract features from the RCM, the AUROC improved to 0.756 ± 0.008 (Fig. 3E). This suggests that both sequence and 3D structure provide crucial information for predicting epitope-CDR3-β binding. Regarding the impact of sequence coding on prediction performance, segment-based coding significantly improved predictive performance compared with BLOSUM62 (0.779 ± 0.008) (Fig. 3F). In our training and testing data, epitope peptide lengths ranged from 7 to 24 residues, and epitope occurrence frequencies (the number of paired TCRs in the database) ranged from 1 to 500. We analyzed the model's predictive performance for unseen epitope-CDR3-β pairs across epitopes of different lengths and occurrence frequencies; the AUROC ranged between 0.673 and 0.949. The AUROC increased with epitope peptide length, and epitopes with higher occurrence frequencies showed higher predictive accuracy for their pairings (Fig. 3G).
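The stratified analysis can be reproduced schematically as below; `records` is a hypothetical list of (epitope, label, score) triples from the external test set:

```python
from sklearn.metrics import roc_auc_score

def auroc_by_epitope_length(records):
    """Group predictions by epitope length and compute AUROC per group."""
    groups = {}
    for epitope, label, score in records:
        groups.setdefault(len(epitope), []).append((label, score))
    return {
        length: roc_auc_score([y for y, _ in pairs], [s for _, s in pairs])
        for length, pairs in groups.items()
        if len({y for y, _ in pairs}) == 2   # AUROC requires both classes
    }
```

The same grouping applied to occurrence frequency instead of length gives the frequency-stratified results in Fig. 3G.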
3 Training the residue contact matrix transformer (RCMT)
The performance of CATCR-D suggests that our encoders capture generalized features of epitopes and CDR3-β. We therefore developed a generative model in which a decoder leverages features from the CATCR-D encoder to predict CDR3-β sequences that bind unseen epitopes. Previous results indicate that the structural data represented by the RCM contain crucial information about epitope-TCR binding. We aimed to incorporate structural data into the decoder as well, but for a generative model the structure of the target sequence is unknown in advance. We therefore pre-trained the RCMT to estimate the RCM of CDR3-β from the epitope sequence and its RCM. This model takes the features output by the CATCR-D encoder as input and employs a linear decoder to predict the RCM of CDR3-β.
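A minimal sketch of the RCMT under our reading: the pre-trained encoder supplies epitope features, and a linear decoder maps them to a CDR3-β RCM. The padded CDR3-β length of 25 matches the padding used later for the generator; freezing the encoder at this stage is our assumption:

```python
import torch
import torch.nn as nn

class RCMTSketch(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, max_len: int = 25):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # reuse CATCR-D encoder weights
        self.max_len = max_len
        self.decoder = nn.Linear(feat_dim, max_len * max_len)

    def forward(self, epitope_tokens, epitope_rcm):
        feats = self.encoder(epitope_tokens, epitope_rcm)    # (B, feat_dim)
        out = self.decoder(feats).view(-1, self.max_len, self.max_len)
        return 0.5 * (out + out.transpose(1, 2))  # distances are symmetric

# Trained with an element-wise regression loss (e.g. MSE) against
# OpenFold-predicted RCMs used as labels.
```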
During training, the losses of both the training and validation sets decreased with increasing epochs; the training loss stabilized after 200 epochs and the validation loss after 300 epochs (Fig. 4A). Using OpenFold's predictions as labels, the average discrepancy between the RCMs predicted by the RCMT and those predicted by OpenFold is 1.695 ± 2.040 Å. Figure 4B shows the average difference between the two prediction methods at each position; the per-position distance differences range from 0.010 to 6.195 Å, with relatively large deviations at positions 12 and 13 and smaller deviations elsewhere. For a given epitope, the dataset may contain multiple CDR3-β label sequences, whereas the RCMT outputs a single predicted matrix. We therefore analyzed the relationship between the distribution of label distances at each position and the RCMT's predicted values. Figure 4C shows the results for three epitopes with many paired CDR3-β labels (external test set). At most positions, the distances predicted by the RCMT are close to the median of the label distance set, indicating that the RCMT can reflect the landscape of the corresponding CDR3-β structures from the epitope sequence and structure.
4 Generator for predicting CDR3-β sequences that bind to a given epitope.
We previously trained an encoder and an RCMT using the epitope's sequence and structural information. Building on these, we developed a generative model, CATCR-G, that incorporates the pre-trained weights to predict binding CDR3-β sequences for a given epitope. The model uses a Transformer decoder to generate CDR3-β predictions from the epitope encoder's output and the RCM produced by the RCMT. The predicted CDR3-β sequence, together with the epitope's sequence and structural data, is then fed into the pre-trained discriminator, whose loss is used to refine the generator. During training, we froze the weights of the encoder and the RCMT to preserve their pre-trained states.
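Schematically, one CATCR-G training step combines a sequence cross-entropy term with the frozen discriminator's judgment of the generated sequence. How the discriminator loss is back-propagated into the generator is not detailed here, so this sketch decodes with argmax and treats the discriminator term as a placeholder (a differentiable relaxation of the tokens would be needed for its gradient to reach the generator); `lambda_d` is an assumed weighting:

```python
import torch
import torch.nn.functional as F

def catcr_g_step(generator, discriminator, encoder, rcmt,
                 epi_tokens, epi_rcm, target_cdr3, optimizer, lambda_d=0.5):
    feats = encoder(epi_tokens, epi_rcm)            # frozen pre-trained encoder
    pred_rcm = rcmt(epi_tokens, epi_rcm)            # frozen pre-trained RCMT
    logits = generator(feats, pred_rcm, target_cdr3[:, :-1])  # teacher forcing
    ce = F.cross_entropy(logits.flatten(0, 1), target_cdr3[:, 1:].flatten())

    gen_tokens = logits.argmax(-1)                  # hard decode (no gradient)
    d_logit = discriminator(gen_tokens, pred_rcm)   # frozen discriminator
    d_loss = F.binary_cross_entropy_with_logits(
        d_logit, torch.ones_like(d_logit))          # "should look binding"

    loss = ce + lambda_d * d_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```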
Initially, the training and validation losses declined quickly. While the training loss continued to drop up to 300 epochs, the validation loss plateaued after 100 epochs and became more variable after 200 epochs (Fig. 5A); we therefore concluded training at 300 epochs. In testing, we applied beam search to yield the top 7 CDR3-β sequence predictions and evaluated them against reference sequences using BERTScore (Fig. 5B), which leverages contextual embeddings from the BERT model. The external test set yielded a BERTScore precision of 0.959 ± 0.013, recall of 0.955 ± 0.018, and F1 score of 0.957 ± 0.014.
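The BERTScore computation can be sketched with the bert-score package; treating residue strings as text with the package's default model is a simplification, and the paper's exact BERT settings are not specified here:

```python
from bert_score import score  # pip install bert-score

candidates = ["CASSLAPGATNEKLFF"]  # hypothetical beam-search prediction
references = ["CASSLAPGTTNEKLFF"]  # hypothetical database reference
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.mean():.3f}  R={R.mean():.3f}  F1={F1.mean():.3f}")
```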
Our analysis revealed that the pre-trained encoder markedly accelerates early training, whereas models lacking this component showed reduced BERTScore recall, precision, and F1. Furthermore, while models employing only the pre-trained encoder matched the combined approach early in training, those also incorporating the RCMT eventually achieved higher BERTScore values, suggesting a synergistic benefit from using both (Fig. 5C).
We also examined the alignment between predicted and reference CDR3-β sequences at each position using BERTScore, padding shorter sequences to 25 amino acids with placeholders where necessary (most CDR3-β sequences are 8 to 12 amino acids long). This confirmed high similarity across corresponding positions, and the consistency in placeholder regions indicates that CATCR-G appropriately determines CDR3-β length (Fig. 5D). Additional evaluations with alternative metrics, ROUGE-L and Skip-Thought, yielded similarity scores of 0.580 ± 0.145 and 0.959 ± 0.040, respectively.
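The ROUGE-L check can be sketched with the rouge-score package; splitting sequences into single residues, so that the longest-common-subsequence matching operates at the amino-acid level, is our assumption about the tokenization:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
pred = " ".join("CASSLAPGATNEKLFF")   # hypothetical prediction
ref = " ".join("CASSLAPGTTNEKLFF")    # hypothetical reference
print(scorer.score(ref, pred)["rougeL"].fmeasure)
```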