Binding residues agree for FunFams, less so for EC. After omitting all proteins without binding residue annotations (not in the PDB), those with conflicting sequence or annotation lengths, those with duplicate entries (each UniProt identifier once in each FunFam), and families with single members, 7,172 sequences from 1,856 FunFams were left. The average binding residue similarity score for these 1,856 FunFams was 36.9±0.6% (Table 1); on average each family had 3.9±0.1 proteins (Fig. S1). The average similarity score for randomly constructed sequence families was 5.5±0.2%. Thus, the binding residue similarity within the same FunFam was 6.7-fold higher than that between “random families”.
To put the FunFam results into perspective of other resources, we analyzed three popular resources in the same way, namely PROSITE [17, 18], Pfam [19], and EC classes [2] . 4,090 sequences in our FunFam dataset mapped to 588 different PROSITE patterns. The average binding residue similarity for these groups was 25.7±0.8% (compared to 29.5±0.8% similarity within FunFams computed on the same dataset). 3,530 sequences in our FunFam dataset mapped 656 Pfam families which had an average binding residue similarity of 26.2±0.3% (compared to 30.6±0.8% similarity within FunFams computed on the same dataset). Both approaches outperformed randomly grouped sequences more than five-fold but performed worse than FunFams (1.2-fold).
For comparison with a specialized functional classification, we also computed binding residue similarity for the EC numbers classification. Our FunFam dataset contained 5,789 proteins with 1,080 different EC numbers (all had complete annotations for all four levels of the EC number; the remaining 1,383 proteins were ignored for this investigation). The average binding residue similarity for proteins with the same four-level EC number was 29.9±0.8% (Table 1), a 5.4-fold increase over random. The binding residue similarity was higher for FunFams than for EC numbers across all similarity levels (Fig. 2). The average for FunFams was 1.2-fold higher (1.1-fold on same dataset) than for EC numbers. The same was true for particular points in the distribution, e.g. for families with 100% binding residue similarity (Fig. 2: rightmost values), and those with, e.g. 60% or 50% similarity (Fig. 2: light gray vertical lines on right and in middle). Conversely, the fraction of those with binding residue similarity levels close to random (Fig. 2: intersection of lines with gray shading on left) were higher for EC than for FunFams, except at zero, i.e. no binding residue similarity (FunFams 6.95% vs. EC numbers 6.67%)
>>> Fig. 2 <<<
To use the largest subsets possible, we calculated the similarity within FunFams and within EC classes on different subsets. To ensure that performance differences did not largely result from differences in the sub-sets, we re-computed all values for a smaller subset identical to both (4,143 proteins grouped into 1,103 FunFams and into 833 EC classes). On this subset, the average binding residue similarity for proteins within the same FunFam was 38.6±0.8% that within the same EC class was 34.5±0.9%, i.e. FunFam performed 1.1-times better than EC numbers supporting the statistically more relevant results for the larger subsets (Table 1).
We also extracted all proteins with identical EC number classified into different FunFams (if more than one found, one representative selected randomly). This resulted in 771 groups (each representing one EC number) with 2,893 proteins. These groups had an average binding residue similarity of 9.6±0.4% (Table 1, Fig. 2: gray dashed line). Conversely, we computed the average similarity for proteins in the same FunFam but with different EC numbers (if several sequences in a FunFam had the same EC, we picked one at random). This yielded 404 groups (each representing one FunFam) with 2,817 proteins; the average binding residue similarity in this group was 26.8±0.1% (Table 1, Fig. 2: dark dashed line). Along a similar line, we found that EC number annotations became more consistent when constrained by the superfamily. The average binding residue similarity for identical EC numbers rose to 38.0±0.01% (1.2-fold improvement) for the subset of proteins with the same EC number and the same superfamily (with 4,445 proteins from 1006 EC numbers: Table 1). Notably, 69% of all EC numbers that occurred in a superfamily grouped into its most frequent FunFam. Furthermore, we found that the binding residue similarity of protein pairs with the same EC number but grouped into two different superfamilies dropped to a random level of 5.22±0.01 (Table 1). The dataset contained 1,155 such proteins from 435 EC numbers.
Table 1: Average binding residue similarity for FunFams and EC-numbers.*
Group
|
Number of families
|
Number of proteins
|
Average binding residue similarity (Eqn. 1)
|
Same FunFams
|
1856
|
7172
|
36.9±0.6
|
Same EC numbers
|
1080
|
5789
|
29.9±0.8
|
Same FunFams, EC-FunFams subset
|
1103
|
4143
|
38.6±0.8
|
Same EC numbers, EC-FunFams subset
|
833
|
4143
|
34.5±0.9
|
Same EC, different FunFam
|
771
|
2893
|
9.6±0.4
|
Same FunFam, different EC
|
404
|
2817
|
27.0±1.0
|
Same EC, same superfamily
|
1006
|
4445
|
38.0±.0.01
|
Same EC, different superfamily
|
435
|
1155
|
5.22±0.01
|
* Same FunFams: proteins within same FunFam; Same EC-numbers: proteins with identical EC number; EC-FunFams subset: same subset used for both similarity calculation with FunFams and within EC classes; Same EC different FunFam: subset of proteins with identical EC number classified into different FunFams; Same FunFam different EC: subset of proteins from same FunFam with different EC numbers; Same EC, same superfamily: proteins with identical EC number grouped into a structural superfamily; Same EC, different superfamily: proteins with identical EC number grouped into different superfamilies; ±: refers to one standard error.
Binding annotation transfer within FunFams raises precision. Homology-based inference implies the following transfer: if proteins P1 and P2 are sufficiently sequence similar (e.g. PIDE(P1,P2)<T), experimental annotations obtained for P1 could be transferred to P2. We applied such a homology-based inference by transferring binding residue annotations from one member of a FunFam to all other members. This resulted in an F1 score of 37.97±0.01% (Precision=49.03±0.01%, Recall=47.52±0.01%) and an MCC of 0.36±0.0002. This was further evidence for the high degree of functional similarity within FunFams.
Binding residue prediction improved through FunFam filter. The methods BindPredict-CCS and BindPredict-CC predict binding residues through cumulative coupling scores and clustering coefficients derived from DI scores [15]. We applied these methods to 470 proteins from 138 FunFams. For that set, the prediction with cumulative coupling scores reached an F1-score of 10.5±1% and the prediction with clustering coefficients an F1=14.2±1%. Building consensus predictions at consensus thresholds of 0.01 from all predictions for members of a FunFam raised the F1-score for cumulative coupling scores to 16.2±0.8% corresponding to a 1.5-fold increase (Fig. S2). At the same threshold, the corresponding values for precision, recall, and accuracy were 18.3±0.1% (Eqn. 2), 29.8±0.2% (Eqn. 3) and 71.1±0.1% (Eqn. 4) respectively (Fig. 3A showing precision and recall). This corresponded to roughly 1.4-fold increase for precision, one-third decrease for recall and a one-tenth decrease for accuracy (data not shown). For predictions based on clustering coefficients, the F1-score increased 1.3-fold to 18.4±1% (Fig. S2). Precision decreased 0.7-fold to 17.5±1% (Eqn. 2) while recall reached 49.5±1% (Eqn. 3), a 2.0-fold improvement (Fig. 3B). The accuracy was 55±1% (1.3-fold decrease). The MCC was very low for all predictions. Nevertheless, the consensus prediction still increased the MCC about two-fold (2.1-fold for BindPredict-CCS at consensus threshold 0.01; 2.0-fold at 0.1; Fig. S3).
Varying the consensus threshold at which a binding prediction was included into the consensus, i.e. the number of proteins within a FunFam for which the same residue had to be predicted as binding, provided a convenient way for tuning precision and recall. At a consensus threshold of 1.0, precision reached 60.8±0.4% (2.5-fold increase over standard method) for the cumulative couplings method (Fig. 3A) and 44.0±0.4% (1.9-fold increase over standard method) for clustering coefficient-based predictions (Fig. 3B). At this conservation threshold, about three residues were, on average, predicted in each protein as binding and at least one residue was predicted for 55.2% of the proteins. For comparison: for the clustering coefficients, 10.4 residues were predicted as binding per protein and at least one residue was predicted for 34.4% of the proteins.
>>> Fig. 3 <<<
Consensus prediction vs. machine learning prediction from bindPredictML17. To compare the consensus predictions with the results of a more sophisticated binding residue prediction method not using information from FunFams, we applied bindPredictML17 [20] on 114 sequences from the FunFam dataset that were also part of the development set of bindPredictML17. For these proteins, bindPredictML17 reached F1=25.85±0.01% (precision=31.40±0.02%, recall= 32.59±0.02%). Applying the FunFam filter at a consensus threshold of 0.01 led to F1=14.8% for BindPredict-CC and F1=19.0% for BindPredict-CCS. The highest recall of 43.6% was reached for BindPredict-CC at a consensus threshold of 0.01, and the highest precision of 50.7% for BindPredict-CCS for a threshold of 1.0.