Benchmark datasets
Training dataset
The training dataset comes from dbAMEPNI database [22] which was built in our group. The database contains alanine mutagenic effects data from ProNIT database [17] and our curated data collected from literature between 2011 and 2017.
Firstly, we identified 335 interface residues from dbAMEPNI database by defining the interface residue as a residue whose buried solvent accessible surface area is larger than 0.0. Then, we detected redundancy among homologous proteins using the PISCES server [23] with a sequence identity cutoff set to 25%. If homologous pairs are included, the sites of recognition differ between the two proteins. Finally, we obtained a dataset containing 299 interface residues, which come from 106 protein-nucleic acid complexes that consist of 74 protein-DNA complexes, 31 protein-RNA complexes and 1 protein-RNA/DNA complexes. The complexes and the 299 interface residues are listed in Table S1 and S2 of the additional file 1, respectively. According to Table S2, 106 interface residues are common to the data in ProNIT. By using as cutoff, 86 of the 299 interface residues are defined as hot spot residues and the remaining 213 residues are defined as non-hot spot residues.
Independent test set
The independent test set comes from three different sources: (i) The independent test set in Pires et al.’s work [19] which comes from Barik et al.’s paper [18]. In their paper, they collected 80 alanine mutations from 14 protein-RNA complexes. Note that these alanine mutations were also included in the dbAMEPNI database later. After checking these protein-RNA complexes, we found one complex (PDBID: 2Y8W) appeared in our training dataset, so we removed this complex. The other complex 2XS2 was also removed because the only one corresponding residue is not on the protein-RNA interface. (ii) The independent test set reported in Pan et al.’s work [21]. In their work, they collected the 58 mutations as the independent test set. After carefully checking the 58 data, we found several problems of the dataset: (1) It includes 13 non-alanine mutations; (2) The multiple mutants were wrongly considered as single alanine mutants. For example, the mutants of 4JVH are all double mutants, however, they were used as single alanine mutants in their dataset; (3) There is redundancy between 5EN1 and 5HO4. According to these problems, we removed part of the data and kept 33 data points. We have summited the 33 data points on the PrabHot server, and 23 PrabHot scores (Table S4 in additional file 1) were available for plotting the ROC curve (Figure S1 in additional file 1). (iii) Literature corresponding to the 3D structures of protein-nucleic acid complexes deposited in PDB database [24] from 2017 to 2018. We identified 51 protein-nucleic acid complexes, and the corresponding references were carefully examined to find the alanine mutation information. From these articles, we obtained 16 alanine mutation data.
For the proteins collected from the three sources mentioned above, we used PISCES server to determine the sequence identity between them and the proteins in the training dataset using the sequence identity cutoff 25%. If homologous pairs are included, the sites of recognition differ between the two proteins. In all, we obtained 99 interface residues which come from 27 protein-nucleic acids complexes that consist of 22 protein-RNA complexes, 4 protein-DNA complexes and 1 protein-RNA/DNA complexes (Table S3 in additional file 1). By using as cutoff, 13 of the 99 interface residues are hotspot residues and 86 of them are non-hot spot residues (Table S4 in additional file 1).
Feature extraction
Due to the special characteristics of the protein-nucleic acid interfaces, we generated 7 different kinds of features to build our model, which were described as below.
Physicochemical characteristics of 20 amino acids.
As the basic unit that comprises a protein, it is the residues that interact with other molecules. Different properties of the 20 residues have been deposited in AAindex [25], 10 of which were used as features to predict the hotspots on protein-protein interfaces in previous studies [26-29]. The 10 physicochemical properties were considered as the first 10 features as showed in Table S5 (see additional file 1). The numerical values of the 10 features are showed in Table S6 (see additional file 1).
Depth index and protrusion index.
The surface shape complementarity between proteins and nucleic acids is an important factor in protein-nucleic acid binding. The surface geometry of residues in the interface is quantified using features in this study. We used the PSAIA [30] program to calculate the depth index (DI) and protrusion index (PI) for each interface residue. The program calculates several different kinds of depth index and protrusion index including the average values of the entire and side chain of the residue, the maximum and minimum values of the residue’s atoms. We used the former 4 values for each residue in both bound and unbound state as features. More specifically, these features contain the average DI of the entire residue, the average DI of the side chain of the residue, the average PI of the entire residue, and the average PI of the side chain of the residue.
In addition, we calculated the differences of these 4 values of each residue between bound and unbound states by using the following equations:
(1) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(2) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(3) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(4) Due to technical limitations, this equation is only available as a download in the supplemental files section.
where, the and mean the average DIs of the total residue in unbound and bound states, respectively. the and mean the average DIs of the side chain of the residue in unbound state and bound states, respectively. We did the same for PI. Furthermore, we calculated the relative DIs and PIs according to the following equations:
(5) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(6) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(7) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(8) Due to technical limitations, this equation is only available as a download in the supplemental files section.
In all, we obtained 16 features related to depth index and protrusion index.
Features related to solvent accessible surface area (SASA).
The residue’s solvent accessible surface area has been used as features in previous studies [11, 14, 31-33] in predicting hotspots on protein-protein interfaces. In this study, we used the different representations of SASA as features to build our model. The SASA was calculated by NACCESS program [34], which calculated the SASA of a residue in different scenarios, for example, the absolute SASA and the relative SASA, the SASA of all atoms, side chain atoms, backbone atoms, polar, and nonpolar atoms of the residue. We obtained these SASAs in both bound and unbound states.
In addition, we calculated the buried SASA that is the difference of the SASA between proteins in bound and unbound states. The buried SASA has been thought to correlate with different energy terms such as desolvation energy. In this work, we calculated different kinds of buried absolute SASA and relative SASA mentioned above. Furthermore, we considered different powers of the buried absolute SASA and relative SASA as features. The three powers we tested are 0.5, 1.5, and 2.0.
In all, we obtained 54 features related to SASA. These features can be found in Table S5 (see additional file 1).
Features related to electrostatic potential
Considering the electrostatic characteristics of nucleic acids, we supposed the electrostatic potential could be benefit for predicting hotspots on protein-nucleic acid interfaces. In this study, we used the APBS program [35] to calculate the electrostatic potential around the proteins, and the procedure to calculate the electrostatic potential of a residue has been described in our previous study [36]. The description of the 5 features related to electrostatic potential can be found in Table S5.
Hydrogen bond features.
By using the WHATIF server [37] we obtained the hydrogen bonds [38] on protein-nucleic acid interfaces. The hydrogen bond numbers formed by the entire residue and those of the side chain with nucleic acid were counted as two features.
Secondary structure features
A residue’s secondary structure is assigned by the DSSP program [39, 40], which outputs 8 different kinds of secondary structure that include H (α-helix), B (isolated β-bridge), E (extended strand), G (3-helix), I (5-helix), T (hydrogen bonded turn), S (bend), and blank (loops). We re-categorized them into 5 different types by combining B, T, S as turn, and G, I as helix1. Then the 5 different types of secondary structure were represented as binary vectors by using (1, 0, 0, 0, 0) as H, (0, 1, 0, 0, 0) as E, (0, 0, 1, 0, 0) as turn, (0, 0, 0, 1, 0) as helix1, and (0, 0, 0, 0, 1) as loops.
Sequence conservation features
Based on our previous works[36, 41], we obtained 5 features from the PSSM file generated by PsiBlast. The first one is the information entropy that represents the conservation of the corresponding sequence position. In addition, we defined two kinds of relative conservation based on the weighted observed percentage of each kind of residues for each sequence position as follows:
(9) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(10) Due to technical limitations, this equation is only available as a download in the supplemental files section.
where, , is the weighted observed percentage of residue type at the certain sequence position, with the formulas designed to avoid division by 0. is the weighted observed percentage of the residue type “alanine” at the certain sequence position. Label ‘rm’ means the residue type with the maximum percentage, and ‘ra’ means the actual residue type at that sequence position. And ‘wop’ is the abbreviation of ‘weighted observed percentage’. Similarly, we also defined two kinds of relative conservation based on the position specific scores in the position specific scoring matrix (PSSM) as follow:
(11) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(12) Due to technical limitations, this equation is only available as a download in the supplemental files section.
where, is the position specific score of residue type on the certain sequence position. Labels ‘ra’, ‘rm’, and ‘A’ have the same meaning as above, and ‘pps’ is the abbreviation of ‘position specific score’.
In all, we obtained 97 features in this study, and the z-scores were calculated to standardize all the features.
Feature selection
Feature selection has become an important step for building machine learning models, especially for high-dimensional applications. By feature selection, redundant and irrelevant features can be removed, and we can also avoid over-fitting, improve model performance and provide faster and more cost-effective models.
Previous study [41] shows that a hybrid two-step feature selection strategy is effective to detect relevant feature subset. In this work, we combined decision tree and sequential forward feature selection as a two-step strategy to determine the relevant feature subset. First, we used a MATLAB function FITCTREE to select a feature subset. Then, we used the sequential forward feature selection (SFS) method to determine the final feature subset.
Evaluation with SVM
Support vector machine (SVM) has been used to build models for predicting hotspots on protein-protein interfaces in several previous studies [11, 14, 15, 31], due to its low complexity and robust output. In this study, SVMlight [42] and the radial basis function were used to train our models. The two parameters, G and C, were optimized by a grid search with G values from 0 to 2 and C values from 0 to 40. To avoid over-fitting, we used the leave-one-out cross validation to evaluate the models. Then, the model was further validated on an independent test set.
Due to the imbalance of our data set, the overall accuracy is heavily biased by the accuracy of the negative examples. Therefore, we provide several different metrics, sensitivity (SEN), specificity (SPE), accuracy (ACC), precision (PRE), F1 score and Matthew correlation coefficient (MCC) to evaluate the performances of different models. These metrics are defined as follows:
(13) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(14) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(15) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(16) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(17) Due to technical limitations, this equation is only available as a download in the supplemental files section.
(18) Due to technical limitations, this equation is only available as a download in the supplemental files section.
where TP, FP, TN, and FN represent the numbers of true positive (predicted hot spot residues are actual hot spots), false positive (predicted hot spot residues are actual non-hot spots), true negative (predicted non-hot spot residues are actual non-hot spots) and false negative (predicted non-hot spot residues are actual hot spots), respectively. In addition to these 6 parameters, the Area under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve was also used as a metric to evaluate our model. The ROC curve shows the relationship between true positive rate and false positive rate, and the area under the curve (AUC) of ROC curve indicates how strongly the model separates the positive and negative examples.
Statistical analysis to detect the relationship between features and hotspots (Wilcoxon Rank Sum Test)
Statistical analysis is useful to reveal the role of each feature on differentiating hot spots from non-hot spots. Because the normal distribution of our data was not guaranteed, the t-test could not be used to analyze the selected features. Instead, the Wilcoxon Rank Sum test was used in the statistical analysis. The Wilcoxon Rank Sum test is a nonparametric test to assess whether two samples of observations come from the same distribution. The RANKSUM function in MATLAB was used in this statistical analysis.