2.1 Data Collection
Based on multiple databases mentioned in 4.1, we performed data collection of drug molecules, target protein sequences, and Kd and EC50 values characterizing drug molecule-target affinity. Taking drug molecule and target as a whole system, we obtained the EC50 dataset-quantifying drug molecule-target affinity by EC50 and the Kd dataset-quantifying drug molecule-target affinity by Kd, respectively. The EC50 dataset contains 8147 ligands and 544 targets, and 11076 ligand-target-Ec50 pairs. At the same time, The Kd dataset contains 1870 ligands and 778 targets, and 10923 ligand-target-Kd pairs. The two datasets without redundancy were used as benchmark datasets.
In process of data collection, we kept to the following two criteria: (1) maintain entries as many as possible; (2) exclude redundant data as many as possible. Therefore, some drug molecules and targets were removed due to Kd, EC50 has no definite value, or their activity values are inconsistent. These redundant data may strongly affect the accuracy of prediction models for DTIs affinity. It is worth noting that half-maximal effective concentration (EC50) refers to the concentration of a drug, antibody or toxicant that induces a response halfway between the baseline and maximum after a specified exposure time. It was commonly used as a measure of a drug's potency [31]. Dissociation constants (Kd) are often used to describe degree of binding of a ligand to a particular protein [32]. The smaller dissociation constant, the tighter ligand binding, otherwise the higher affinity between ligand and protein. Considering the practical significance of Kd and EC50, we finally chose both as quantitative indexes of DTIs affinity.
2.2 Results of descriptor calculation
2.2.1 Calculation of drug molecule descriptors
After calculation by PaDEL software, we obtained the molecular descriptors. The descriptors calculated in this article were shown in Table 1. There were 1874 descriptors for drug molecules and drug molecular descriptors can be divided into 16 categories, among which E-state descriptors, Autocorrelation descriptors and Topological type descriptors account for a relatively large number. Even though many descriptors in Table 1 are of the same type, each descriptor has its own specific meaning. However not all molecular descriptors are suitable for the construction of predictive models for DTIs affinity.
Table 1
Type and number of drug molecule descriptors
Serial number
|
Descriptor type
|
Number of descriptors
|
1
|
Constitutional descriptors
|
120
|
2
|
Autocorrelation descriptors
|
346
|
3
|
Basak descriptors
|
42
|
4
|
BCUT descriptors
|
6
|
5
|
Burden descriptors
|
96
|
6
|
Connectivity descriptors
|
56
|
7
|
E-state descriptors
|
489
|
8
|
Kappa descriptors
|
3
|
9
|
Molecular property descriptors
|
15
|
10
|
Quantum chemical descriptors
|
5
|
11
|
Topological descriptors
|
265
|
12
|
CPSA descriptors
|
29
|
13
|
RDF descriptors
|
210
|
14
|
Geometrical descriptors
|
21
|
15
|
WHIM descriptors
|
91
|
16
|
3D Autocorrelation descriptors
|
80
|
Therefore, how to measure importance of descriptors and filter out meaningful ones is the key to improve accuracy of prediction models. Some researchers used kernel functions, thresholds, and other methods to filter descriptors to improve accuracy of models [33, 34]. It is worth considering that these method does not take into account properties of drug molecules and that may not be applicable in quantitative prediction of drug molecule-target interactions.
After comprehensive consideration, in this paper, based on properties of drug molecules, we screened characteristic descriptors of drug molecules from the perspective of molecular vibrations. This is because molecular vibrations was caused by vibrations of chemical bonds within molecule and molecular vibrations is a macroscopic representation of properties of drug molecules [35, 36]. Moreover, molecular vibrations are affected by various factors such as conjugation effect, induction effect, spatial effect, hydrogen bonding, vibrational coupling effect, etc. Therefore, molecular vibrations can reflect drug molecular structure and physicochemical properties of drugs to a certain extent [37]. It should be remember that seven physicochemical properties are particularly relevant to the nature of chemical bonds, including electronegativity, π-atomic charge, total charge, and bond polarity [38]. Selecting all descriptors related with these seven properties based on meaning of each descriptor. For instance, Mpe-Constitution Descriptor-mean Atomic Pauling Electronegativity (scaled on carbon atom) was selected as feature descriptors to construct prediction models for DTIs affinity due to it was related to atomic electronegativity. Finally, 813 descriptors associated with molecular vibrations were selected from 1874 descriptors in Table 1 to represent the feature characteristics of drug molecule.
2.2.2 Calculation of target protein descriptors
As was known to all, 3D structures of many proteins are unknown, especially for membranous proteins [27, 28]. Thus, the analysis based on protein sequences rather than 3D structures of proteins can ensure a wide range of applicability of models and its accuracy [39]. The target protein descriptors were shown in Table 2.
Table 2
Type and number of target protein descriptors
Serial number
|
Descriptor type
|
Number of descriptors
|
1
|
Amino acid composition
|
20
|
2
|
Dipeptide composition
|
400
|
3
|
Normalized Moreau-Broto autocorrelation
|
240
|
4
|
Moran autocorrelation
|
240
|
5
|
Geary autocorrelation
|
240
|
6
|
Composition
|
21
|
7
|
Transition、Distribution
|
126
|
8
|
Sequence-order-coupling number
|
60
|
9
|
Quasi-sequence-order descriptors
|
100
|
As shown in Table 2, there are 1437 descriptors for each protein and descriptors can be divided into 9 categories, among which Dipeptide composition, Moran autocorrelation, Moran autocorrelation as well as Normalized Moreau-Broto autocorrelation account for a relatively large number.
813 drug molecule descriptors were integrated with 1437 protein sequence descriptors and Kd, EC50 datasets to obtain the integrated Kd, EC50 datasets.
2.3 Results of feature screening
The Boruta algorithm was used for feature filtering. If a feature attribute was marked as "Confirmed", it means that the attribute is "important". On the contrary, if feature attribute was marked as "Rejected", it means that the attribute is "Not Important". In addition, some data was marked as "Tentative", which means importance of the data is not clear. To ensure reliability of feature filtering, we excluded the data marked "Tentative" and "Rejected". As shown in Fig. 1, for integrated EC50 dataset, 1259 descriptors were marked as "Confirmed" and 683 descriptors were marked as "Rejected", with 308 descriptors being marked as "Tentative". That is, after feature selection, each DTIs in the integrated EC50 dataset was characterized by 1259 feature attributes. Similarly, as shown in Fig. 2, for the integrated Kd dataset, 827 descriptors were marked as "Confirmed" and 1191 descriptors were marked as "Rejected" with 232 descriptors being marked as "Tentative". Each DTIs in integrated EC50 dataset was characterized by 827 feature attributes.
In the process of feature selection, we chose Boruta algorithm because this algorithm can filter the set of features that are associated with the response variable, rather than selecting the set of features that minimizes penalty factor only for a specific model.
The feature subsets of EC50 and Kd were obtained by feature screening for construction of quantitative prediction models for DTIs affinity.
2.4 Results of quantitative prediction model for DTIs affinity
2.4.1 Parameter optimization
The setting of algorithm parameters is crucial to construction of prediction models for DTIs affinity. In RF model, there are two important parameters need to be considered: Ntree and Mtry. After comparison and optimization of several parameters, we finalized RF algorithm parameters: Ntree = 500, Mtry = default value; As for SVM model, we used "Tune" function to determine the optimal parameters of SVM algorithm, with the following algorithm parameters: cost = 1000, gamma = 0.0001 [40]. Kernel selected radial basis kernel function to produce minimum error rate. In same optimization way, ANN algorithm parameters were determined: size = 2, decay = 0.1, linout = T (non-linear function), maxit = 1000, the rest of parameters were default values.
3.4.2 Optimal prediction model for DTIs affinity
Before attempting to construct prediction models for DTIs affinity, EC50 feature subsets were preprocessed to facilitate calculation. Then combined with SVM, RF and ANN to construct quantitative prediction models respectively. The results of 10-fold cross validation for EC50 feature subset were shown in Table 3.
Table 3
Tenfold cross validation of three kinds of algorithms for EC50 feature subset
Model (EC50)
|
R2
|
|
MSE
|
|
SSE
|
Training
|
Test
|
Training
|
Test
|
Training
|
Test
|
SVM
|
0.9317
|
0.5759
|
0.1270
|
0.8356
|
1249
|
8216
|
RF
|
0.9611
|
0.9641
|
0.0891
|
0.0817
|
876
|
803.3
|
ANN
|
0.7350
|
0.5211
|
0.4867
|
0.9590
|
4785
|
9429
|
As shown in Table 3 and Fig. 3, In RF model, R2 of training and test sets are 0.9611, 0.9641 respectively indicated a good fit of RF model to data. MSE of training and test sets were both less than 0.09 and were in same order of magnitude, which indicated that there is no overfitting problem existing, and demonstrated that RF model showed satisfactory predictive performance (Fig. 3-a). As for SVM model, R2 of training and test sets are 0.9317, 0.5759 respectively. SVM model exhibited some differ for training and test sets, but order of magnitude is the same and no greatly obvious overfitting can be observed from SVM model (Fig. 3-b). However, predictive performance of SVM model worse than that of RF model. For training and test sets in ANN model, no obvious overfitting can be observed (Fig. 3-c), but the performance of ANN model in training and test set were lower than both RF model and SVM model. By comparing predictive performance of three models based on evaluation indicators, it can be observed that the performance of RF model is best selection for EC50 data.
The same analysis was appropriate for Kd data, on the basic of data in Table 4 and scatter plot in Fig. 4, we completed selection of optimal model: RF model showed satisfactory predictive performance with R2 of test set being 0.9485 (Fig. 4-a). The SVM model suffered from overfitting and its predictive performance was worse than that of RF model (Fig. 4-b). ANN models are the least effective model (Fig. 4-c). The results indicated that RF model was the optimal quantitative prediction model for KD data.
Table 4
Tenfold cross validation of three kinds of algorithms
Model
(KD)
|
R2
|
|
MSE
|
|
SSE
|
Internal
|
External
|
Internal
|
External
|
Internal
|
External
|
SVM
|
0.9099
|
0.5083
|
0.1254
|
0.7290
|
1230
|
808.4
|
RF
|
0.9425
|
0.9485
|
0.1208
|
0.1191
|
1204
|
132.1
|
ANN
|
0.5857
|
0.2961
|
0.5612
|
1.0190
|
5593
|
1130
|
In summary, whether based on EC50 data or Kd data, the performance of RF model was the best. Therefore, in this paper, random forest (RF) model is more suitable for quantitative prediction of biological activities for DTIs affinity.
2.5 Evaluation of application for optimal models
By comparing analysis in 2.4, we obtained RF optimal models. To demonstrate the reliability and applicability of RF model further, we used RF model for analysis of DTIs in Binding DB database, in which Kd and EC50 quantified affinity of DTIs.
Using same data collection methods and eliminating duplicate data, we collected 1045 ligand-receptor-Ec50 pairs and 89 ligand-receptor-KD pairs from Binding DB database for quantitative analysis of DTIs affinity. Quantitative analysis of new dataset was carried out using optimal models based on Kd and EC50. Calculating absolute value of the difference between true value and predicted value-|d| and dividing |d| into 5 parts in which each part was divided on a scale of 0.5. Therefore, we obtained the results of distribution of |d| (Fig. 5 and Fig. 6) in new EC50 and Kd dataset, reflecting prediction capability of RF model.
The predictive values of RF models were all greater than zero, suggesting that drug molecule-target interactions do exist, which is consistent with the data information gathered from datasets. This indicated that optimal model constructed in this paper could be accurately used for qualitative prediction of DTIs. However, as shown in Fig. 5 and Fig. 6, eighty percent of the |d| distribution was 1.5-2.0. The range of differences was within 2.0 for 98.95% (EC50) 96.63% (Kd) of |d| respectively. This indicated that there is error between predicted value and experimental value. Further comparison of predicted and experimental data revealed that all predicted values are greater than experimental true values and within a certain margin of error. The reason for that maybe a systematic error due to different standards used to store data in different databases. The case can be set with a correction factor - average of all difference values. The above demonstrated that quantitative RF prediction model developed in this paper can predict affinity of DTIs to a certain extent based on Kd and EC50.
2.6 Comprehensive comparisons of models
Besides evaluation of application of RF models, comprehensive comparisons were made with predictive models for DTIs previously reported. In recent years, there have been many reports for predicting DTIs, such as Xie L, et al., adopted transcriptome data and deep-learning algorithm to predict the potential DTIs [41]. Olyan R S, et al., developed a novel method based on RF model to improve DTIs prediction accuracy [42]. Chen N, et al., carried out a quantitative analysis of antioxidant activity of antioxidant tripeptides in free radical systems based on QSAR [43]. In above analysis methods, even current state of prediction analysis for DTIs, there are often only analysis based on structure of ligand or receptor rather than taking ligand-receptor as a whole system for DTIs analysis. This method of analysis, which separated ligands from receptors, can be limited by its own structure and produce non-reciprocal results, leading to poor accurate. Conversely, in this paper, the model was constructed to take full account of ligands and receptors. From perspective of taking molecule-target as a whole system, we integrated molecule-target descriptors to construct predictive models for DTIs affinity, which is able to avoid unequal results based on receptors or ligands only, thus increasing accuracy of prediction model. At the same time, based on whole system of ligand-receptor, we can collect a large amount of molecule-target data rather than building for specific targets or several targets, expanding scope of application.
There were related reports on quantitative prediction of DTIs affinity. Based on 9948 DTIs quantified by Ki, 1589 molecular descriptors and 1080 protein descriptors, Shar P A, et al., constructed quantitative prediction models for DTIs using RF and SVM model, respectively [44]. However, the Coefficient of Determination-R2 of RF and SVM models in training set are 0.88 and 0.86, at the same time, that of modes in test set are 0.63 and 0.61, which showed that there exists over-fitting. That is to say, predictive models have low accuracy. The main reasons for that would be improper characterization of drug molecules-targets and lack of feature screening. Considering this situation, in this paper, we screened characteristic descriptors of drug molecules from the perspective of molecular vibrations [35, 38]. Moreover, the analysis based on protein sequences rather than 3D structure of protein can ensure a wide range of applicability of models and its accuracy [39]. Therefore, the SVM and RF models in this paper had good results better than above research. In addition, the two datasets in this paper involved 544 and 778 targets respectively, which guaranteed that the model had some broad applicability. Likewise, Hakime Öztürk, et al., constructed DeepDTA to quantify the affinity of ligands-receptors, in which the results were not ideal. In process of building model, more attention was paid to amount of data and neglecting molecular feature representation. The R2 of Convolutional Neural Network (CNN) model was less than 0.70, which was lower than optimal RF model in this paper. The MSE of CNN model was high than 0.194, which was high than that of RF model in this paper (0.119) [45]. Abbasi W A, et al., proposed a sequence-based novel protein binding affinity predictor called ISLAND, in which the SVR model for LA kernel was the best model with R = 0.44, MSE = 6.55 [46]. Above comparative result showed that RF model developed based on Kd and EC50 in this paper can perform quantitative prediction of DTIs affinity more accurately with certain applicability and reliability.
Moreover, literature already reported has not characterized drug molecules from the perspective of molecular vibrations. Based on the methods and good results of this paper, it was also shown that parametric characterization based on molecular vibrations is crucial for construction of prediction model for DTIs affinity.