2.1 Evaluation of inter-chain contact prediction for homodimers
We compare CDPred with DeepHomo and GLINTER on the HomoTest1 homodimer test dataset with the results shown in Table 1. The input tertiary structures for all three methods are predicted structures corresponding to the unbound monomer structures. The results of DeepHomo are obtained from its publication. Three versions of CDPred are tested. The first version (CDPred_BFD) uses the MSAs generated from the BFD database as input. The second version (CDPred_Uniclust) uses the MSAs generated from the Uniclust30 database as input. The third version (CDPred) uses the average of the distance maps predicted by CDPred_BFD and CDPred_Uniclust as the prediction. Because DeepHomo and GLINTER predict binary inter-chain contacts at an 8 Å threshold instead of distances, we convert the inter-chain distance predictions of CDPred, CDPred_BFD, and CDPred_Uniclust into binary contact predictions for comparison.
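The averaging and contact-conversion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CDPred implementation: it assumes each model outputs, for every inter-chain residue pair, a probability distribution over distance bins, and that the contact probability is the probability mass in bins ending at or below 8 Å. The bin edges, function names, and nested-list layout are hypothetical.

```python
# Hypothetical upper edges (in angstroms) of four distance bins:
# [0, 4), [4, 8), [8, 12), [12, inf). Real bin definitions may differ.
BIN_UPPER_EDGES = [4.0, 8.0, 12.0, float("inf")]


def average_maps(map_a, map_b):
    """Element-wise average of two L1 x L2 x B distance-probability maps."""
    return [
        [
            [(pa + pb) / 2.0 for pa, pb in zip(bins_a, bins_b)]
            for bins_a, bins_b in zip(row_a, row_b)
        ]
        for row_a, row_b in zip(map_a, map_b)
    ]


def contact_probabilities(dist_map, bin_upper_edges, threshold=8.0):
    """Sum the probability of every bin whose upper edge is <= threshold."""
    return [
        [
            sum(p for p, edge in zip(bins, bin_upper_edges) if edge <= threshold)
            for bins in row
        ]
        for row in dist_map
    ]


# Toy predictions for one residue of chain 1 against two residues of chain 2.
pred_bfd = [[[0.1, 0.5, 0.3, 0.1], [0.0, 0.1, 0.3, 0.6]]]
pred_uniclust = [[[0.3, 0.5, 0.1, 0.1], [0.0, 0.3, 0.3, 0.4]]]

ensemble = average_maps(pred_bfd, pred_uniclust)
contacts = contact_probabilities(ensemble, BIN_UPPER_EDGES)
```

The resulting `contacts` matrix holds one score per inter-chain residue pair, which can then be ranked in the same way as the output of a binary contact predictor.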
Table 1. The precision (%) of top 5, top 10, top L/10, top L/5, top L/2, and top L contact predictions on the HomoTest1 test dataset for DeepHomo, GLINTER, and three versions of CDPred. L: sequence length of a monomer in a homodimer. Bold numbers denote the highest precision.
Predictor | top 5 | top 10 | top L/10 | top L/5 | top L/2 | top L
DeepHomo | - | 52.50 | 43.20 | 37.40 | 28.20 | -
GLINTER | 54.81 | 54.07 | 50.54 | 48.09 | 41.90 | 34.91
CDPred_BFD | 63.57 | 62.50 | 61.24 | 58.26 | 52.78 | 47.08
CDPred_Uniclust | 65.71 | 61.79 | 60.53 | 58.18 | 54.14 | 47.91
CDPred | 68.57 | 66.43 | 64.55 | 61.56 | 55.87 | 50.06
CDPred achieves the highest contact prediction precision of all the methods on every metric. For instance, CDPred has a top L/5 contact prediction precision of 61.56%, which is 24.16 percentage points higher than DeepHomo (37.40%) and 13.47 percentage points higher than GLINTER (48.09%). CDPred also performs better than both CDPred_BFD and CDPred_Uniclust, indicating that averaging the distance predictions made from the two kinds of MSAs can improve the prediction accuracy.
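For reference, the top-k precision reported in the tables can be computed along the following lines. This is a minimal sketch, assuming the prediction is a matrix of inter-chain contact scores and the ground truth is a binary matrix of the same shape. The function names and the rounding of L/10, L/5, and L/2 (integer division, with a minimum of 1) are assumptions, since the rounding rule is not stated here.

```python
def top_k_precision(pred_prob, true_contact, k):
    """Percentage of the k highest-scoring residue pairs that are true contacts."""
    pairs = [
        (pred_prob[i][j], i, j)
        for i in range(len(pred_prob))
        for j in range(len(pred_prob[i]))
    ]
    # Rank all inter-chain residue pairs by predicted score, highest first.
    pairs.sort(key=lambda item: item[0], reverse=True)
    top = pairs[:k]
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top if true_contact[i][j])
    return 100.0 * hits / len(top)


def cutoffs_for_length(length):
    """Map each metric name to its k; the rounding rule is an assumption."""
    return {
        "top 5": 5,
        "top 10": 10,
        "top L/10": max(1, length // 10),
        "top L/5": max(1, length // 5),
        "top L/2": max(1, length // 2),
        "top L": length,
    }


# Toy 3 x 3 example with made-up scores and contacts.
pred = [[0.9, 0.1, 0.2], [0.8, 0.7, 0.05], [0.3, 0.6, 0.4]]
truth = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]

precision_top2 = top_k_precision(pred, truth, 2)
precision_top4 = top_k_precision(pred, truth, 4)
```

In this toy example, one of the two highest-scoring pairs and three of the four highest-scoring pairs are true contacts.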
We also compare the methods above on the HomoTest2 homodimer test dataset (Table 2). Again, CDPred performs best on all the evaluation metrics, and combining the predictions of CDPred from the two kinds of MSAs improves the prediction accuracy.
Table 2. The precision (%) of top 5, top 10, top L/10, top L/5, top L/2, and top L contact predictions on the HomoTest2 test dataset for DeepHomo, GLINTER, and CDPred predictors. Bold numbers denote the highest precision.
Predictor | top 5 | top 10 | top L/10 | top L/5 | top L/2 | top L
DeepHomo | - | 30.43 | 27.32 | 23.08 | - | -
GLINTER | - | 43.04 | 40.18 | 36.74 | - | -
CDPred_BFD | 43.48 | 41.74 | 42.24 | 40.01 | 35.92 | 33.80
CDPred_Uniclust | 48.70 | 45.22 | 43.32 | 39.64 | 37.46 | 32.32
CDPred | 53.04 | 50.00 | 49.23 | 43.26 | 40.08 | 35.98
2.2 Evaluation of inter-chain contact prediction for heterodimers
We compare CDPred and a state-of-the-art heterodimer contact predictor GLINTER on both HeteroTest1 and HeteroTest2 heterodimer test datasets (see results in Table 3 and Table 4, respectively). The input tertiary structures of monomers used by both methods are predicted by AlphaFold2. We use two different orders of monomer A and monomer B (AB and BA) in each heterodimer to generate input features for CDPred to make predictions. The average of the outputs of the two orders is used as the final prediction.
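The order-averaging step above can be sketched as follows. This is an illustration under assumptions rather than the actual CDPred code: it assumes the AB prediction is indexed as (residue of A, residue of B) and the BA prediction as (residue of B, residue of A), so the BA map is transposed before averaging. The function names are hypothetical.

```python
def transpose(matrix):
    """Swap the rows and columns of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]


def average_orders(pred_ab, pred_ba):
    """Average an AB-order map with the transposed BA-order map.

    pred_ab has shape len(A) x len(B); pred_ba has shape len(B) x len(A).
    """
    ba_aligned = transpose(pred_ba)
    if len(ba_aligned) != len(pred_ab) or any(
        len(row_ab) != len(row_ba) for row_ab, row_ba in zip(pred_ab, ba_aligned)
    ):
        raise ValueError("AB and BA predictions have incompatible shapes")
    return [
        [(x + y) / 2.0 for x, y in zip(row_ab, row_ba)]
        for row_ab, row_ba in zip(pred_ab, ba_aligned)
    ]


# Toy example: monomer A has 2 residues, monomer B has 3 residues.
pred_ab = [[0.8, 0.2, 0.1], [0.4, 0.6, 0.0]]
pred_ba = [[0.6, 0.2], [0.4, 0.8], [0.3, 0.0]]

final_pred = average_orders(pred_ab, pred_ba)
```

Averaging the two orders makes the final prediction independent of which monomer is arbitrarily labeled A.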
On the HeteroTest1 dataset (Table 3), CDPred achieves much better performance than GLINTER on all the metrics. For instance, the top L/5 contact prediction precision of CDPred (47.59%) is more than twice that of GLINTER (23.24%). On the HeteroTest2 dataset (Table 4), CDPred also substantially outperforms GLINTER.
Table 3. The precision (%) of top 5, top 10, top L/10, top L/5, top L/2, and top L contact predictions on the HeteroTest1 test dataset for GLINTER and CDPred. L: the sequence length of the shorter monomer in a heterodimer. Bold numbers denote the highest precision.
Predictor | top 5 | top 10 | top L/10 | top L/5 | top L/2 | top L
GLINTER | - | 24.44 | 29.70 | 23.24 | - | -
CDPred | 55.56 | 54.44 | 51.47 | 47.59 | 38.64 | 32.73
Table 4. The precision (%) of top 5, top 10, top L/10, top L/5, top L/2, and top L contact predictions on the HeteroTest2 test dataset for GLINTER and CDPred. L: the sequence length of the shorter monomer in a heterodimer. Bold numbers denote the highest precision.
Predictor | top 5 | top 10 | top L/10 | top L/5 | top L/2 | top L
GLINTER | 14.55 | 13.27 | 13.73 | 13.49 | 12.27 | 10.40
CDPred | 23.27 | 23.82 | 23.93 | 22.87 | 20.17 | 17.51
Furthermore, we divide the top L/10 contact prediction precisions for the heterodimers in the more challenging HeteroTest2 dataset into four equal intervals and plot the number of heterodimers in each interval (Figure 1). The distribution of precision over the four intervals is bimodal, concentrated mainly in a low-precision interval (0%-25%) and a high-precision interval (75%-100%). Forty heterodimers have a contact prediction precision in the range of 0%-25%, indicating that there is still substantial room for improvement. One reason for the low precision is that most of these 40 heterodimers have shallow MSAs. The Pearson correlation coefficient between the logarithm of the number of effective sequences (Neff) of the MSA and the top L/10 complex contact precision is 0.46, indicating a modest correlation between the two.
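The binning and correlation analysis above can be reproduced with a few lines of code. The sketch below is illustrative only: the Neff values and precisions are made-up placeholders rather than data from HeteroTest2, and the function names are hypothetical. The base of the logarithm is not specified here; base 10 is used as an assumption, and the choice does not affect the Pearson coefficient because changing the base only rescales the values.

```python
import math


def bin_precisions(precisions, n_bins=4):
    """Count targets in n_bins equal-width intervals over 0-100%."""
    counts = [0] * n_bins
    width = 100.0 / n_bins
    for p in precisions:
        # A precision of exactly 100% is placed in the last interval.
        index = min(int(p // width), n_bins - 1)
        counts[index] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Placeholder values for four hypothetical targets.
neff = [1.0, 10.0, 100.0, 1000.0]
top_l10_precision = [0.0, 20.0, 80.0, 100.0]

interval_counts = bin_precisions(top_l10_precision)
log_neff = [math.log10(value) for value in neff]
r = pearson(log_neff, top_l10_precision)
```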
We also observe that the inter-chain contact prediction accuracy for heterodimers is lower than for homodimers on average. One reason is that a homodimer requires an MSA for only a single monomer, which is usually much deeper than the MSA of a heterodimer, whose construction requires pairing related sequences from the MSAs of two different monomers. Another reason is that homodimers tend to have a larger interaction interface than heterodimers on average, making the prediction easier.
2.3 Comparison of the co-evolutionary features generated by the statistical optimization method and deep learning method
To compare the co-evolutionary features generated by the statistical optimization tool CCMPred with those generated by the deep learning tool MSA transformer, we trained two models with the same neural network architecture on the same training dataset, each using one of the two kinds of co-evolutionary features. One network (CDPred_PLM) is trained on the PLM co-evolutionary features generated by CCMPred. The other (CDPred_ESM) is trained on the row attention map features generated by the MSA transformer. The precision of the top L/10 contact predictions of the two models on the four test datasets is plotted in Figure 2. CDPred_ESM performs better than CDPred_PLM on all four test datasets, indicating that the co-evolutionary features extracted automatically by the deep learning method are more informative than those produced by the statistical optimization method of maximizing direct co-evolutionary signals. However, combining the two kinds of co-evolutionary features yields even better results (see Tables 1, 2, 3, and 4). Figure 3 plots the top L/10 precision of CDPred_ESM against that of CDPred_PLM for the homodimers in the two homodimer test datasets and for the heterodimers in the two heterodimer test datasets. For 42 out of 51 homodimers and 55 out of 64 heterodimers, CDPred_ESM has higher precision than CDPred_PLM. Each of the two models outperforms the other on some targets, indicating that the co-evolutionary features they use are complementary to some degree.
2.4 Impact of intra-chain distance information on inter-chain distance prediction
Intra-chain residue-residue distance information of monomers is often used as an input for inter-chain distance/contact prediction, but it is unclear what kind of intra-chain distances are useful for this task. To study this issue, we prepare three kinds of intra-chain distance maps as input features for CDPred to predict the inter-chain distance map. The first is the full ground-truth intra-chain distance map extracted from the true tertiary structure of the monomers (denoted as FullMap). The second is the ground-truth distance map that excludes distances less than 8 Å (denoted as NoContact), which keeps only non-contact intra-chain distance information (large distances). The third is the ground-truth distance map that excludes distances greater than 8 Å (denoted as OnlyContact), which keeps only contact intra-chain distance information (small distances). It is worth noting that contact information (small intra-chain distances between residues) is much more important than non-contact information (large intra-chain distances between residues) for determining the tertiary structures of monomers.
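The three input variants can be illustrated with a simple masking sketch. How the excluded distances are encoded is not specified here, so replacing them with a fill value of 0 is purely an assumption for illustration, as are the function names.

```python
def no_contact_map(dist_map, threshold=8.0, fill=0.0):
    """Keep only large distances: entries below the threshold become fill."""
    return [[fill if d < threshold else d for d in row] for row in dist_map]


def only_contact_map(dist_map, threshold=8.0, fill=0.0):
    """Keep only small distances: entries above the threshold become fill."""
    return [[fill if d > threshold else d for d in row] for row in dist_map]


# Toy 3-residue intra-chain distance map in angstroms (FullMap).
full_map = [
    [0.0, 3.8, 12.5],
    [3.8, 0.0, 6.1],
    [12.5, 6.1, 0.0],
]

no_contact = no_contact_map(full_map)
only_contact = only_contact_map(full_map)
```

With this encoding, every off-diagonal distance of the full map other than one exactly equal to 8 Å appears in exactly one of the two masked maps, so the two variants carry complementary parts of the same information.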
The precision of the top L/2 contact predictions made by CDPred with the three kinds of intra-chain distance maps on the two homodimer test datasets and the two heterodimer test datasets is shown in Figure 4. Interestingly, the NoContact intra-chain distance map is much more useful for inter-chain distance map prediction than the OnlyContact intra-chain distance map, indicating that large intra-chain distances are much more important for inter-chain contact prediction than small ones. For homodimers, the NoContact maps work even slightly better than the FullMap maps, but for heterodimers they perform worse. The same phenomenon is observed when predicted monomer distances are fed into CDPred for inter-chain distance prediction (data not shown). In summary, intra-chain contact information is much less informative for inter-chain distance prediction, whereas intra-chain non-contact information plays a critical role. This may be partly because most inter-chain contacts involve residues on the surface of monomers, which do not have many intra-chain contacts. For homodimers, the slight improvement obtained by removing the intra-chain contact information may arise because reducing the largely irrelevant part of the input enhances the prediction capability of the deep learning method.
2.5 High correlation between the precision of inter-chain contact predictions and predicted probability scores
Previous work on intra-chain distance prediction (Guo, et al., 2021) shows that the intra-chain distance prediction accuracy and the predicted probability scores are strongly correlated, which can be used to select predicted intra-chain distance maps. Here, we investigate whether a similar correlation exists in inter-chain distance prediction. Figure 5 plots the precision of the top L/5 inter-chain contact predictions against the average of their probability scores for each target in the four test datasets. The correlation between the top L/5 inter-chain contact precision and the average predicted probability score is 0.7345. This high correlation suggests that the inter-chain contact probabilities predicted by CDPred can be used to estimate the confidence of the inter-chain prediction.
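A confidence estimate of this kind requires only the predicted probabilities and no ground truth. The sketch below is a hypothetical illustration: the function name and the rounding of L/5 (integer division, with a minimum of 1) are assumptions.

```python
def top_k_mean_probability(pred_prob, k):
    """Mean predicted probability of the k highest-scoring residue pairs."""
    scores = sorted((p for row in pred_prob for p in row), reverse=True)
    top = scores[:k]
    return sum(top) / len(top) if top else 0.0


def confidence_score(pred_prob, length):
    """Average probability of the top L/5 predicted inter-chain contacts."""
    k = max(1, length // 5)  # assumed rounding rule for L/5
    return top_k_mean_probability(pred_prob, k)


# Toy 2 x 2 contact-probability map with made-up values.
pred = [[0.9, 0.1], [0.7, 0.3]]
mean_top2 = top_k_mean_probability(pred, 2)
confidence = confidence_score(pred, length=10)
```

A target with a low score can then be flagged as a likely low-precision prediction before any experimental structure is available.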
2.6 An interesting inter-chain distance prediction example
Typically, when the MSA is shallow, the precision of inter-chain distance prediction is low due to the lack of information. However, CDPred can still accurately predict inter-chain distances for some targets with shallow MSAs. Figure 6 shows one such target, the CASP13 homodimer T0991, whose MSA has only one sequence. The TM-score (Zhang and Skolnick, 2004) of the tertiary structure of the T0991 monomer predicted by AlphaFold2 is 0.3104, indicating that the predicted fold is not correct. Nevertheless, the precision of the top L/10, top L/5, and top L/2 inter-chain contacts derived from the distance map predicted by CDPred is high: 72.73%, 68.18%, and 56.36%, respectively. Figures 6a and 6b show the intra-chain distance maps of the AlphaFold2-predicted tertiary structure and the true tertiary structure of the monomer, Figure 6c shows the inter-chain contact map predicted by CDPred, and Figure 6d shows the true inter-chain contact map. The predicted inter-chain contact map accurately recalls a large portion of the true inter-chain contacts.