Composition and Position Distribution of the Residues in the Datasets
To give an intuition of the interface residues at protein-nucleic acid interfaces, we analyzed the composition and position distribution of the residues in our datasets. Figure 2 shows the percentages of the 19 types of residues in our training dataset, in the independent test set, and in the two datasets combined. It is clear that the two positively charged residues, ARG and LYS, have the highest frequencies. This is expected because of the negatively charged phosphate groups of nucleic acids.
The positions of the interface residues in the datasets were described by the CORE_RIM feature proposed in our previous paper [14]. The CORE_RIM value is defined as (SAStau - SAStab)/SAStau, where SAStau and SAStab are features 29 and 39 in Table S5 in additional file 1. Figure 3 shows that overall our datasets include residues on both the core and rim parts of the interfaces, although the residues in core positions slightly outnumber those in rim positions (see the blue bars). In addition, the ratio of core residues in the training dataset is higher than that in the independent test set.
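The CORE_RIM calculation can be expressed as a short function; the sketch below is a minimal illustration, assuming SAStau and SAStab are already available as the per-residue total absolute SASA values in the unbound and bound states, respectively.

```python
def core_rim(sas_tau: float, sas_tab: float) -> float:
    """Relative SASA loss upon binding: (SAStau - SAStab) / SAStau.

    Values close to 1 indicate buried (core) interface residues;
    values close to 0 indicate rim residues.
    """
    if sas_tau == 0:
        return 0.0  # residue is fully buried even in the unbound state
    return (sas_tau - sas_tab) / sas_tau
```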
Feature Selection
The correlations between the 97 features
In this study, we generated 97 features in total, derived from 7 different kinds of structural or sequence properties. Features derived from the same property may be interdependent, and features from different properties can also be interdependent. We therefore calculated the correlation coefficients between the features, shown in Figure 4. Features from the same structural or sequence property tend to be interdependent; for example, features 29-48 are highly correlated because they are all solvent accessible surface area related features, and features 49-80 are also highly correlated because they are based on the differences of solvent accessible surface areas between bound and unbound states. Features from different properties are generally less interdependent; for example, the electrostatic potential features (features 81-85) are largely independent of the other features.
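The pairwise correlations visualized in Figure 4 can be computed along the following lines; this is a minimal sketch with placeholder data, assuming the 97 features are stored column-wise in a pandas DataFrame.

```python
import numpy as np
import pandas as pd

# Placeholder feature matrix: 500 residues x 97 features (f1 ... f97)
X = pd.DataFrame(np.random.rand(500, 97),
                 columns=[f"f{i}" for i in range(1, 98)])

# Pearson correlation coefficients between all feature pairs (97 x 97)
corr = X.corr(method="pearson")

# Example: mean absolute correlation within the SASA block (features 29-48)
sasa_block = corr.loc["f29":"f48", "f29":"f48"]
print(sasa_block.abs().values.mean())
```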
Given these correlations between features, feature selection is necessary to find an optimal feature subset for building our model.
Features selected by decision tree
As shown in List 1, the decision tree selected 20 of the 97 original features, including 3 physicochemical features of amino acids, 7 features related to depth index and protrusion index, 6 features related to solvent accessible surface area, 2 features related to electrostatic potential, 1 feature related to secondary structure, and 1 feature related to conservation.
List 1. The features selected by decision tree. The number in parentheses corresponds to the feature number in Table S5 (see List 1 in the Supplementary Files).
The final feature subset selected by SFS
From the preliminary subset of features selected by the decision tree, we further used a sequential forward selection (SFS) process to determine the final subset of features used as input to the final model. In each round of the SFS process, different feature combinations were used to train SVM models, and the cross-validation results (F1 score) were used to evaluate these combinations. In this way, the contribution of each remaining feature was assessed and the features contributing most were retained; this strategy was also used in Yang et al.'s work [47]. We selected the top three feature combinations in each round for the next round. Table S7 in additional file 1 shows the features selected in each round and the corresponding cross-validation F1 scores. The results show that the predictive performance converged at the 7th round, and the best cross-validated F1 score is 0.684. The corresponding feature combination contains 7 features: Nphb, PItu, the difference of the side-chain depth indexes between bound and unbound states, SAStau, the square root of the difference of the absolute side-chain SASAs between unbound and bound states, an electrostatic potential feature, and Helix. Nphb is the number of potential hydrogen bonds of the residue, i.e., the number of hydrogen bonds the residue can form with other molecules. PItu is the total protrusion index of the residue in the unbound state. SAStau is the total absolute SASA of the residue in the unbound state. The electrostatic potential feature is the electrostatic potential of the neighboring residues and the target residue. Helix indicates whether the residue lies in a helical secondary structure.
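The SFS procedure can be sketched as a beam-style forward search; the code below is a minimal illustration only, assuming X is a pandas DataFrame of the candidate features (here the decision-tree-selected subset), y holds the hot spot labels, and the SVM hyperparameters are fixed for brevity (in practice they were tuned).

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sfs_beam(X, y, candidates, beam_width=3, max_rounds=10, cv=10):
    """Sequential forward selection keeping the top `beam_width` feature sets per round."""
    beam = [tuple()]               # start from the empty feature set
    best_overall = (0.0, tuple())  # (F1 score, feature combination)
    for _ in range(max_rounds):
        scored = []
        for subset in beam:
            for f in candidates:
                if f in subset:
                    continue
                trial = subset + (f,)
                clf = SVC(kernel="rbf", C=40.0, gamma=0.1)
                f1 = cross_val_score(clf, X[list(trial)], y,
                                     scoring="f1", cv=cv).mean()
                scored.append((f1, trial))
        scored.sort(reverse=True)
        beam = [combo for _, combo in scored[:beam_width]]
        if scored[0][0] <= best_overall[0]:  # no further improvement: converged
            break
        best_overall = scored[0]
    return best_overall
```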
Based on these 7 features, we built our final model, iPNHOT (identification of protein-nucleic acid interaction hot spots), using SVM. The G (gamma) and C parameters of the radial basis function kernel used in the final model are 0.1 and 40.0, respectively. The cross-validation results show that our model achieved 0.628, 0.750, 0.684, and 0.829 for recall, precision, F1 score, and accuracy, respectively.
In addition, we plotted the ROC and PRC curves based on the cross-validation results, as shown in Figure 5 and Figure 6; the AUCs of the two curves are 0.832 and 0.668, respectively.
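A compact way to reproduce these cross-validated metrics and curve areas is sketched below; it assumes X7 (the 7 selected features) and y (hot spot labels) are preloaded, and uses the SVM decision values for the ROC/PRC areas.

```python
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

clf = SVC(kernel="rbf", C=40.0, gamma=0.1)
loo = LeaveOneOut()

# Leave-one-residue-out (standard leave-one-out) cross-validated predictions
y_pred = cross_val_predict(clf, X7, y, cv=loo)
y_score = cross_val_predict(clf, X7, y, cv=loo, method="decision_function")

print("recall    ", recall_score(y, y_pred))
print("precision ", precision_score(y, y_pred))
print("F1        ", f1_score(y, y_pred))
print("accuracy  ", accuracy_score(y, y_pred))
print("ROC AUC   ", roc_auc_score(y, y_score))
print("PRC AUC   ", average_precision_score(y, y_score))
```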
Our previous work [14] on predicting hot spots on protein-protein interfaces showed that the difference between leave-one-residue-out cross validation and leave-one-protein-out cross validation is small. Briefly, leave-one-residue-out cross validation is the standard leave-one-out cross validation in our case, because each sample in our dataset corresponds to a residue. In leave-one-protein-out cross validation, the samples belonging to one protein are used as the validation set and the samples belonging to the other proteins are used to train the model. In this work, we also performed a leave-one-protein-out cross validation based on the final feature subset. The results show that the leave-one-protein-out cross validation achieved a sensitivity, specificity, and F1 score of 0.535, 0.894, and 0.597, respectively, which is worse than the leave-one-residue-out (i.e. the standard leave-one-out) cross validation.
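Leave-one-protein-out cross validation can be implemented with a grouped splitter; the following is a minimal sketch, assuming groups holds the protein identifier of each residue sample and X7/y are as above.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=40.0, gamma=0.1)

# All residues of one protein form the validation fold; the rest train the model
y_pred = cross_val_predict(clf, X7, y, groups=groups, cv=LeaveOneGroupOut())
print("leave-one-protein-out F1:", f1_score(y, y_pred))
```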
Models based on all features or the features selected only by decision tree or SFS
To validate the effectiveness of our two-step feature selection process, we also built models based on all 97 features (AFmodel), on the 20 features selected only by decision tree (DTmodel), and on the features selected only by SFS (SFSmodel). As shown in Table 2, the AFmodel gives the lowest predictive accuracy among the models, and the iPNHOT model is superior to the DTmodel on all six evaluation metrics. We inferred that using all features or the 20 features selected by decision tree may have over-fitted the models. In addition, we performed SFS feature selection starting from the original 97 features, although each round is about five times more time-consuming than in our two-step feature selection process. Table S8 in additional file 1 shows this SFS process, which converged at the 9th round. Table 2 shows that the SFSmodel is superior to the iPNHOT model on all six evaluation metrics except specificity; however, the SFSmodel is more prone to overfitting.
To further demonstrate the effectiveness of our two-step feature selection strategy, we also combined NSGA-II and SVM to select a relevant feature subset and simultaneously optimize the G and C parameters of the SVM. We tried different population sizes (50-250) and numbers of generations (50-250) for NSGA-II; as shown in Figure 7, the best F1 score of 0.671 was obtained when the population size and the number of generations were both set to 200. In addition, we built a model based on the 16 features selected by the Boruta algorithm; cross validation on the training dataset with these features gave a best F1 score of 0.523. Thus, our two-step feature selection strategy is superior to both NSGA-II and the Boruta algorithm in this study.
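For reference, Boruta feature selection can be run along these lines; this is a sketch assuming the boruta Python package (BorutaPy) is installed and that X is the DataFrame of the 97 features and y the label vector.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# Boruta compares each real feature against randomly shuffled "shadow" copies
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
selector = BorutaPy(rf, n_estimators="auto", random_state=1)
selector.fit(np.asarray(X), np.asarray(y))  # BorutaPy expects numpy arrays

confirmed = [f for f, keep in zip(X.columns, selector.support_) if keep]
print(len(confirmed), "features confirmed:", confirmed)
```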
Comparison of different classifiers on the selected 7 features
To evaluate the effectiveness of the SVM learning method in predicting the hot spots within protein-nucleic acid interfaces, we compared the performance of models built by different machine learning algorithms (KNN, naïve Bayesian (NB), and Logistic Regression (LR)) based on the selected 7 features. Table 3 shows that the model built based on SVM (iPNHOT) achieved the highest recall (0.628), precision (0.750), accuracy (0.829), F1 score (0.684), and MCC (0.572) among these models. These results indicate that the SVM model outperformed the models built by KNN, naïve Bayesian, and logistic regression on the selected 7 features.
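The comparison can be reproduced with a short loop over classifiers; the sketch below again assumes X7 and y are preloaded, and it uses scikit-learn defaults for the alternative learners, which may differ from the exact settings used in the study.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

classifiers = {
    "SVM (iPNHOT)": SVC(kernel="rbf", C=40.0, gamma=0.1),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    y_pred = cross_val_predict(clf, X7, y, cv=10)
    print(f"{name:20s} F1={f1_score(y, y_pred):.3f} "
          f"MCC={matthews_corrcoef(y, y_pred):.3f}")
```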
In addition, to further evaluate the effectiveness of our feature selection process and of the SVM learning method in predicting hot spots on protein-nucleic acid interfaces, we also compared our iPNHOT model with a random forest model built using all 97 features. We used all 97 features because random forest is an ensemble learning method in which the diversity of the trees is important: at each split, the algorithm randomly selects a feature subset and then determines the optimal feature from that subset to divide the examples, so using all features enhances the diversity of the trees. We tried different numbers of trees and selected the one giving the best predictive accuracy; the optimal number of trees is 68. Table 3 shows that iPNHOT achieved higher values than the random forest model on all six evaluation metrics, demonstrating that our two-step feature selection strategy and the SVM learning method are effective in predicting hot spots on protein-nucleic acid interfaces.
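Tuning the number of trees can be sketched as a simple grid search; the code below is illustrative only, assuming X97 holds all 97 features and accuracy is estimated by 10-fold cross validation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

best_n, best_acc = None, 0.0
for n_trees in range(10, 201, 2):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    acc = cross_val_score(rf, X97, y, scoring="accuracy", cv=10).mean()
    if acc > best_acc:
        best_n, best_acc = n_trees, acc
print("optimal number of trees:", best_n, "accuracy:", round(best_acc, 3))
```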
Evaluation of our model on the independent test set
The generalization of the iPNHOT model was evaluated on the independent test set. Table 4 shows that the recall, specificity, and accuracy on the independent test set are 0.571, 0.845, and 0.815, respectively, close to the cross-validation values of 0.628, 0.913, and 0.829, which indicates the good generalization of the iPNHOT model.
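The evaluation on the independent set amounts to training on the full training set and scoring the held-out residues; a minimal sketch follows, assuming X7_train/y_train and X7_test/y_test hold the selected features and labels of the two sets.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.svm import SVC

# Train the final model on the training set, then score the independent test set
clf = SVC(kernel="rbf", C=40.0, gamma=0.1).fit(X7_train, y_train)
y_pred = clf.predict(X7_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("recall     ", recall_score(y_test, y_pred))
print("specificity", tn / (tn + fp))
print("accuracy   ", accuracy_score(y_test, y_pred))
```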
Comparison with other methods
Our iPNHOT model is a single model built to predict interface hot spot residues on both protein-RNA and protein-DNA interfaces. The SBHD server [11] also predicts hot spot residues on both protein-RNA and protein-DNA interfaces; however, it is no longer available. The mCSM-NA server [23] contains modules to predict the mutagenic effects of residues on both protein-RNA and protein-DNA interfaces, and it is available to the community. HotSPRing [18] and PrabHot [24] are two models for predicting hot spots on protein-RNA interfaces. However, the HotSPRing server does not work properly, as no results could be obtained for submitted jobs. The PrabHot server only outputs predicted scores for the residues it predicts as hot spots. In addition, PrabHot defined hot spot residues using a cutoff of 1.0 kcal/mol, which differs from the 2.0 kcal/mol cutoff used in this study. Thus, the AUC is the only metric that can be compared between PrabHot and iPNHOT. PrPDH [22] is a recently developed method for predicting hot spots on protein-DNA interfaces, in which the authors also defined hot spot residues using a cutoff of 1.0 kcal/mol. However, only 11 of the 32 protein-DNA interface residues in our independent test set were not used to train the PrPDH model; thus, a comparison with this method is not suitable because of the small number of samples.
First, we compared our model with mCSM-NA on both the training data set and the independent test set. The prediction results for all examples in the training data set and the independent test set are shown in Table S2 and Table S4, respectively. As shown in Table 4, the cross-validation results of iPNHOT outperform the predictive results of mCSM-NA on all 6 evaluation metrics. However, part of our training data set, collected from ProNIT, had been used to train the mCSM-NA model. To compare with mCSM-NA fairly, we extracted the 102 interface residues obtained from ProNIT and compared the predictive results of iPNHOT and mCSM-NA on these 102 data points. Table 4 indicates that iPNHOT outperforms mCSM-NA on all metrics except specificity.
In addition, we compared iPNHOT with mCSM-NA on the independent test set. Table 4 shows that iPNHOT outperforms mCSM-NA on all metrics except recall.
In addition to the 6 performance metrics, we also plotted ROC and PRC curves to compare the different methods. Figure 5 shows the ROC curves based on the predictive results of mCSM-NA vs. iPNHOT on the training data set: for the 106 data points collected from ProNIT, the area under the ROC curve (AUCROC) of mCSM-NA is 0.668, which is substantially lower than the AUCROC of iPNHOT (0.831). Figure 8 shows that on the independent test set the AUCROC of mCSM-NA is 0.742, which is lower than the AUCROC of iPNHOT (0.754). Figure 6 shows the PRC curves on the training data set: for the 106 data points collected from ProNIT, the area under the PRC curve (AUCPRC) of mCSM-NA is 0.590, which is substantially lower than the AUCPRC of iPNHOT (0.730). Figure 9 shows that on the independent test set the AUCPRC of mCSM-NA is 0.458, which is higher than the AUCPRC of iPNHOT (0.258).
According to the results mentioned above, the iPNHOT model overall outperforms the mCSM-NA model. As for the PRC curve, although some researchers reported that the PRC is suitable for evaluating imbalanced datasets, others reported that the PRC curve is easily affected by the example with the largest output value [48] and that the empirical PRC curve is a highly imprecise estimate of the true curve, especially with a small sample size and class imbalance in favor of negative examples [49].
In addition, we compared the AUCs of iPNHOT and PrabHot. Because part of the data in the independent test set had been used to train the PrabHot model, the AUCs were calculated based on the 23 samples that were not used to train PrabHot and for which PrabHot prediction scores are available. As shown in Figure 10 and Figure 11, the AUCROC of PrabHot is 0.525, which is lower than the AUCROC of iPNHOT (0.658), and the AUCPRC of PrabHot is 0.338, which is also lower than the AUCPRC of iPNHOT (0.517).
Thus, we demonstrated that our model outperforms other state-of-the-art methods for predicting hot spots on protein-nucleic acid interfaces.
Furthermore, we compared our method with two protein-DNA binding site prediction methods and two protein-RNA binding site prediction methods. As shown in Figure 12, the AUCROC of iPNHOT on the protein-DNA interface samples of the independent test set is 0.685, which is higher than that of the two protein-DNA binding site prediction methods (0.167 for DNA-Bind [50] and 0.667 for DP-Bind [51]). Similarly, Figure 13 shows that the AUCROC of iPNHOT on the protein-RNA interface samples of the independent test set is 0.783, which is also higher than that of the two protein-RNA binding site prediction methods (0.338 for Pprint [52] and 0.278 for DRNApred [53]).
Post analysis of the selected features of the final model
To demonstrate the importance of the features used in the final model, we performed a post analysis in which one of the selected features was removed at a time and the performance of a model built on the remaining features was checked. As shown in Table 5, when each of the 7 selected features was removed in turn, the predictive accuracies decreased as expected. In particular, the predictive accuracies decreased substantially when the electrostatic potential feature was removed, which emphasizes the importance of this feature. The electrostatic complementarity on protein-DNA interfaces has been extensively reviewed in Harris et al.'s article [54]. Although the contribution of electrostatic potential to binding affinity is still controversial, our results indicate that the electrostatic potential can be a useful feature for predicting hot spots on protein-RNA/DNA interfaces.
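This leave-one-feature-out ablation can be scripted as follows; the sketch assumes X7 and y as above and uses the cross-validated F1 score as the performance measure.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Remove one selected feature at a time and re-evaluate the model
for dropped in X7.columns:
    X_ablated = X7.drop(columns=[dropped])
    clf = SVC(kernel="rbf", C=40.0, gamma=0.1)
    f1 = cross_val_score(clf, X_ablated, y, scoring="f1", cv=10).mean()
    print(f"without {dropped:12s} F1 = {f1:.3f}")
```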
Statistical analysis of the selected 7 features
To further evaluate the ability of the 7 selected features to distinguish hot spots from non-hot spots, we used the Wilcoxon rank sum test. Figure 14 shows that three of the 7 selected features significantly differentiate hot spots from non-hot spots, with p-values less than 0.05; one of them is esp3. The first of these features reflects the shape complementarity between protein residues and the nucleic acid upon binding. As proposed in the "Feature extraction" section, the second may be related to the desolvation energy upon binding. As for esp3, it is the electrostatic potential of the protein surface patch around the target residue: for hot spots, the average value of this feature is 11.6, compared with 1.77 for non-hot spots. Because of the negative electrostatic potential of the nucleic acid surface, this feature may partially reflect the electrostatic complementarity between the protein surface patch and the nucleic acid surface patch around the target residue. Thus, these 3 features combine the effects of shape complementarity, electrostatic complementarity, and desolvation energy.
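The per-feature significance tests can be run with SciPy; the sketch below assumes X7 is the DataFrame of the 7 selected features and y a label vector aligned with it (1 for hot spots, 0 for non-hot spots).

```python
from scipy.stats import ranksums

# Wilcoxon rank sum test for each selected feature: hot spots vs non-hot spots
for feature in X7.columns:
    stat, p = ranksums(X7.loc[y == 1, feature], X7.loc[y == 0, feature])
    flag = "significant" if p < 0.05 else ""
    print(f"{feature:12s} p = {p:.4f} {flag}")
```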
In addition to the features that were statistically significant on an individual basis, the other 4 of the 7 selected features were also kept in the final model. This suggests the possibility of coordinated effects between different features; in particular, a feature that is not individually significant can gain significance when combined with information from other features.
The analysis of the 20 features selected by decision tree can be found in additional file 1.
Case Study
To visualize hot spot residues on protein-nucleic acid interfaces, we plotted two cases using PyMOL. The first is the complex of U1 small nuclear ribonucleoprotein A (U1A) and an RNA (PDB ID 1AUD). As shown in Figure 15A, 4 hot spot residues and 2 non-hot spot residues at this interface were recorded in the training dataset. Our model correctly identified all 4 hot spot residues as hot spots and the 2 non-hot spot residues as non-hot spots in both the leave-one-out and the leave-one-protein-out cross validations. In contrast, mCSM-NA did not assign any of the 4 residues as a hot spot. The second case is the complex of Pot1 (protection of telomeres) and a DNA (PDB ID 1QZG). As shown in Figure 15B, 2 hot spot residues and 3 non-hot spot residues at this interface were recorded in the training data set. Our model identified both hot spot residues as hot spots and all 3 non-hot spot residues as non-hot spots in both the leave-one-out and the leave-one-protein-out cross validations, whereas mCSM-NA did not detect either of the 2 hot spot residues. Note that both 1AUD and 1QZG were collected from ProNIT, which had been used to train the mCSM-NA model.