DTIRF: Predicting Drug-Target Interactions Based on Improved Rotation Forest from Drug Molecular Structure and Protein Sequence

Background: The identification and prediction of drug-target interactions (DTIs) is the basis for screening drug candidates and plays a vital role in the development of innovative drugs. However, because biological experimental methods are time-consuming and costly, traditional drug target identification technologies are difficult to apply on a large scale. Therefore, in silico methods are urgently needed to predict drug-target interactions in a genome-wide manner. Results: In this article, we design a new in silico approach, named DTIRF, that predicts DTIs by combining a feature weighted Rotation Forest (FwRF) classifier with protein amino acid sequence information. More specifically, we first use the Position-Specific Score Matrix (PSSM) to convert protein sequences into numerical form and use the Pseudo Position-Specific Score Matrix (PsePSSM) to extract their features. A unified numerical descriptor is then formed by combining these features with molecular fingerprints representing drug information. Finally, the feature weighted rotation forest is applied to the Enzyme, Ion Channel, GPCR, and Nuclear Receptor data sets. In five-fold cross-validation, the prediction accuracy of this approach reaches 91.68%, 88.11%, 84.72% and 78.33% on the four benchmark data sets, respectively. To further validate the performance of DTIRF, we compare it with other existing methods and with a Support Vector Machine (SVM) model. In addition, 7 of the 10 novel DTIs with the highest predictive scores were validated by relevant databases. Conclusions: The cross-validation results indicate that DTIRF is feasible for predicting the relationships between drugs and targets and can help in the discovery of new candidate drugs.

Identifying the interactions between drugs and targets is a key area in genomic drug discovery, which not only helps to understand various biological processes but also contributes to the development of new drugs [1,2]. The emergence of molecular medicine and the completion of the Human Genome Project provide better conditions for the identification of new drug target proteins. Although researchers have made considerable efforts, only a small number of candidate drugs have so far been approved by the Food and Drug Administration (FDA) to enter the market [3][4][5]. An important reason for this situation is the inherent limitations of experimental methods: biological laboratory methods for identifying DTIs are usually limited to small-scale studies and are expensive and time-consuming. Computational methods can narrow the scope of candidate targets and provide supporting evidence for drug target experiments, thus speeding up drug discovery. Therefore, computation-based methods are urgently required to improve efficiency and reduce the time needed to identify potential DTIs across the genome [2, 4, 6-8].
In recent years, researchers have developed a variety of computation-based methods to analyze and predict DTIs, which can be broadly divided into two categories: network-based methods and machine learning-based methods [9]. Network-based methods usually describe the relationships between drugs and target proteins as a heterogeneous network and predict DTIs by analyzing the correlation and similarity between nodes. For example, Wu et al. [10] proposed the SDTBNI model in 2016, which searches for unknown DTIs through new chemical entity-substructure linkages, drug-substructure linkages and known DTI networks. Zhang et al. [11] proposed a novel DTI prediction model based on LPLNI. The model uses data points reconstructed from their neighborhoods to calculate the linear neighborhood similarity between drugs. Based on biomedical related data and a Linked Tripartite Network (LTN), Zong et al. [12] used the target-target and drug-drug similarities calculated by DeepWalk to predict DTIs.
Machine learning-based methods usually extract the features of drug and target data with dedicated algorithms and predict potential DTIs by supervised or semi-supervised methods. For example, Peng et al. [13] combined the biological information of targets and drugs with a PCA-based convex optimization algorithm to predict new DTIs using a semi-supervised inference method. Ezzat et al. [14] used an ensemble learning algorithm to predict DTIs, reducing the feature dimension by means of feature subsets and three dimensionality reduction models. Generally speaking, drugs with chemical similarity also have similar biochemical activity, that is, they can bind to similar target proteins. Based on this assumption, models that use drug molecular structure information and protein sequence information to predict DTIs have achieved good results. For example, Wen et al. [15] extracted drug and target features from their chemical substructure and sequence information and used a deep belief network (DBN) to predict potential DTIs. The model proposed in this paper is a machine learning-based method built on the same assumption.
In this article, based on the assumption that the interactions between drugs and target proteins largely depend on target protein sequence information and drug molecular substructure fingerprints, a novel machine learning-based model is proposed to infer potential DTIs. Our features combine the fingerprint of the drug molecular structure with the protein sequence encoded by a feature extraction method called Pseudo Position-Specific Score Matrix (PsePSSM).

In the evaluation criteria, TP is the number of interacting drug-target pairs that are correctly identified as interacting; FP is the number of non-interacting drug-target pairs that are incorrectly identified as interacting; TN is the number of non-interacting drug-target pairs that are correctly identified as non-interacting; FN is the number of interacting drug-target pairs that are incorrectly identified as non-interacting. Moreover, the receiver operating characteristic (ROC) curve [16,17] and the area under the ROC curve (AUC) are used to visually display the performance of the classifier.
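The equations that these four counts enter are not reproduced in this section. As a hedged illustration, the sketch below computes the indicators named later in the text (accuracy, sensitivity, precision and MCC) from their standard textbook definitions. The function name `classification_metrics` and the example counts are assumptions made for illustration, not values from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification indicators from confusion counts.

    tp/fn count interacting pairs (predicted correctly/incorrectly);
    tn/fp count non-interacting pairs (predicted correctly/incorrectly).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity (recall): share of true interactions that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of predicted interactions that are real.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "mcc": mcc,
    }


# Hypothetical confusion counts, chosen only to exercise the function.
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

MCC is worth reporting alongside accuracy because it stays informative when interacting and non-interacting pairs are unevenly represented.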

Model Construction
To optimize the performance of DTIRF, the grid search method is applied to explore the parameters of PsePSSM and FwRF. When extracting features by PsePSSM, the parameters in formula 5 can be adjusted to increase the amount of information captured. In the experiment, we explored the effects of different PsePSSM parameters on classifier performance on the Enzyme data set. After optimization, we set the parameter λ of PsePSSM to 34, and set the feature selection ratio r, the number of feature subsets K and the number of decision trees L of the FwRF classifier to 0.8, 16 and 21, respectively.
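The grid search described above can be sketched as an exhaustive loop over candidate values. This is a minimal sketch under stated assumptions: the candidate grid is hypothetical, and `toy_score` is a stand-in that merely peaks at the values reported in the text. In practice the score would be the cross-validated accuracy of the trained classifier.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; the paper's actual search ranges are not given here.
EXAMPLE_GRID = {
    "lam": [10, 20, 34, 40],   # PsePSSM parameter lambda
    "r": [0.6, 0.7, 0.8, 0.9], # feature selection ratio
    "K": [8, 16, 32],          # number of feature subsets
    "L": [11, 21, 31],         # number of decision trees
}


def toy_score(params):
    """Placeholder objective (NOT a real accuracy): highest at the reported setting."""
    return -(
        abs(params["lam"] - 34) / 34
        + abs(params["r"] - 0.8)
        + abs(params["K"] - 16) / 16
        + abs(params["L"] - 21) / 21
    )


best_params, best_score = grid_search(EXAMPLE_GRID, toy_score)
```

An exhaustive grid grows multiplicatively with each parameter (here 4 × 4 × 3 × 3 = 144 combinations), which is why coarse candidate lists are usually tried first.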

Evaluation of Model Prediction Ability
After finding the optimal parameters of DTIRF, we applied the model to the benchmark data sets, including Enzyme, Ion Channel, GPCR and Nuclear Receptor. To avoid over-fitting, we use five-fold cross-validation to evaluate the performance of the model. More specifically, we split the data set into five subsets, one of which is taken as the test set while the remaining four are used as the training set. The process is repeated for five rounds so that each subset serves as the test set once, and the results of the five rounds are averaged to produce the final result.

Comparison between the proposed model and LPQ descriptor models
To evaluate the impact of the PsePSSM algorithm on the proposed model, we compare it with Local Phase Quantization (LPQ) on the four benchmark data sets in this section. The LPQ feature extraction algorithm is based on the blur invariance property of the Fourier phase spectrum [18][19][20] and was originally described for texture description by Ojansivu and Heikkila [21]. Table 5 summarizes the cross-validation results generated by the LPQ algorithm combined with the FwRF classifier on the four benchmark data sets. Compared with these results, DTIRF achieves the best results in all the evaluation indicators, including accuracy, sensitivity, precision, MCC and AUC. Detailed five-fold cross-validation results on the four benchmark data sets are presented in Supplementary Materials Tables S1-S4. In the comparison experiment, we set the same parameters for the FwRF classifier. The comparison shows that the PsePSSM algorithm combined with the FwRF classifier does help to improve the performance of the model.
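The five-fold cross-validation protocol used in these experiments can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names, the shuffling seed and the `evaluate` callback (which would train on the training indices and return a score on the test indices) are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average the score returned by evaluate(train, test) over the k rounds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Because every sample appears in exactly one test fold, the averaged score uses all the data for evaluation without ever testing on a sample the model was trained on in that round.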

Comparison between FwRF and SVM classifier models
As one of the most versatile classifiers, the Support Vector Machine (SVM) has been widely used for various problems. To evaluate DTIRF more clearly, we compare the results of DTIRF and the SVM classifier model on the same data sets. The SVM parameters are determined by grid search; the value of c is finally set to 0.5 and the value of g to 0.6. The results of the SVM classifier optimization can be viewed in the supplementary materials.

Comparison with existing methods
The prediction of the relationships between drugs and targets has drawn increasing interest from researchers, and many computational approaches have been designed so far. To better verify the proposed approach, we compare it with other existing methods using five-fold cross-validation on the same benchmark data sets. Table 7 lists the AUC of the other methods and of DTIRF on the four benchmark data sets. The results obtained by DTIRF on the Enzyme data set are significantly higher than those of the other existing methods, and the results achieved by DTIRF on the Ion Channel and GPCR data sets are only 0.73% and 0.13% lower than the highest results, respectively. The performance of DTIRF on the Nuclear Receptor data set is weaker, possibly because the number of samples in this data set is too small to train the classifier sufficiently.

Case study
To further validate DTIRF's ability to predict potential DTIs, we use all known interactions to train the model and then predict unknown interactions. We selected the 10 drug-target pairs with the highest predictive scores and checked them against SuperTarget [26]. SuperTarget is a database that collects drug-target relations and currently stores 332,828 DTIs. As shown in Table 8, 7 of the 10 highest-scoring predictions were confirmed. This result indicates that DTIRF can effectively predict potential DTIs. It is worth noting that although we have not found evidence for the interaction of the remaining 3 drug-target pairs, we cannot completely rule out the possibility that they interact.

Materials And Methodology
Benchmark data sets
In this article, we used four protein target data sets: Enzyme, Ion Channel, GPCR and Nuclear Receptor. These data sets were introduced as benchmark data sets by Yamanishi et al. [27] and were collected from BRENDA [28], DrugBank [29], SuperTarget & Matador [26] and KEGG BRITE [30]. They can be downloaded at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/ [27]. The numbers of drugs are 445, 210, 233 and 54, and the numbers of target proteins are 664, 204, 95 and 26 in these benchmark data sets, respectively. Among these data, 5127 drug-target pairs are confirmed to interact with each other, corresponding to 2926, 1476, 635 and 90 pairs in the four data sets, respectively [25].
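As a quick consistency check on the counts quoted above, the sketch below sums the known interactions and computes what fraction of all possible drug-target pairs in each data set is a confirmed interaction. The dictionary and function names are illustrative assumptions; the numbers are copied from the text.

```python
# (number of drugs, number of targets, known interacting pairs), as quoted in the text.
DATASETS = {
    "Enzyme": (445, 664, 2926),
    "Ion Channel": (210, 204, 1476),
    "GPCR": (233, 95, 635),
    "Nuclear Receptor": (54, 26, 90),
}


def interaction_density(n_drugs, n_targets, n_known):
    """Fraction of all possible drug-target pairs that are known interactions."""
    return n_known / (n_drugs * n_targets)


total_known = sum(known for _, _, known in DATASETS.values())
densities = {name: interaction_density(*counts) for name, counts in DATASETS.items()}

for name, density in densities.items():
    print(f"{name}: {density:.2%} of possible pairs are known interactions")
```

The known interactions make up only a small fraction of all possible pairs in every data set, and the Nuclear Receptor set is by far the smallest in absolute terms.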
The DTI network can be expressed by a bipartite graph in which nodes represent drugs or

Molecules description
In recent years, different types of descriptors have been proposed to represent drug compounds, such as quantum chemical, topological, constitutional and geometrical descriptors. Since the molecular substructure fingerprint does not require the three-dimensional structural information of the molecule and directly reflects the relationship between molecular properties and structure, more and more researchers use it as a descriptor to predict the relationship between drugs and target proteins. Specifically, we first store all the molecular substructures in the form of a dictionary, and then split a given drug molecule into its substructures. When the molecule contains a certain substructure, the corresponding bit of the descriptor is set to 1; otherwise it is set to 0.

where $e^{0}_{i,j}$ denotes the raw score generated by PSI-BLAST, which is typically a positive or negative integer. This is not the final score, because if it exceeds 20 amino acids, the score may contain 0; if the same conversion procedure continues, the score may remain unchanged. A positive number signifies that the corresponding mutation occurs in the alignment more frequently than expected by chance. Conversely, a negative number signifies that the corresponding mutation occurs in the alignment less frequently than expected by chance. However, based on the PSSM formula, proteins of different lengths produce matrices with different numbers of rows. Therefore, equation 3 is used to convert the PSSM matrix into a uniform pattern.
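How a variable-length PSSM can be turned into a fixed-length vector may be easier to see in code. The sketch below assumes the widely used PsePSSM formulation: each row is standardized over its 20 scores, the column means give 20 values, and for each lag ξ = 1..λ a correlation factor averages the squared difference between scores ξ positions apart. The exact equations of this paper are only available in the supplemental files, so the function names and normalization details here are illustrative assumptions.

```python
import math


def normalize_pssm(raw):
    """Standardize each row (one residue position) over its raw scores."""
    normalized = []
    for row in raw:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has zero spread; map it to zeros instead of dividing by 0.
        normalized.append([(v - mean) / std if std else 0.0 for v in row])
    return normalized


def pse_pssm(pssm, lam):
    """Return column means followed by lag-1..lam correlation factors per column.

    The output length is width * (1 + lam), independent of the protein length.
    """
    length, width = len(pssm), len(pssm[0])
    if lam >= length:
        raise ValueError("lam must be smaller than the sequence length")
    # Average score of each amino acid type over all positions.
    features = [sum(row[j] for row in pssm) / length for j in range(width)]
    # Sequence-order information: mean squared difference at each lag.
    for lag in range(1, lam + 1):
        for j in range(width):
            total = sum((pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(length - lag))
            features.append(total / (length - lag))
    return features
```

Under this formulation, a 20-column PSSM with λ = 34 would give 20 + 20 × 34 = 700 values per protein. Whether the segmented variant used in the paper yields the same length is not stated in this section.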
[Due to technical limitations, this equation is only available as a download in the supplemental files section.] (7)(8)
where $\bar{e}_j$ indicates the average score of the amino acid residues in protein P when they evolve into j-type amino acids. However, if only $\bar{M}_{\mathrm{PSSM}}$ is used to represent protein P, all sequence-order information from the evolution process will be lost. To prevent this, we introduce the idea of pseudo amino acid composition to improve equation 3. Therefore, according to formula 5, we can obtain the features of the segmented PsePSSM: (9)
where $e_j$ is a correlation factor for the j-type amino acid, whose contiguous distance is along each

Assuming that the number of decision trees is N, the decision trees can be expressed as $D_1, D_2, \ldots, D_N$. The algorithm is executed in the following steps.
(1) Using an appropriate parameter K, randomly divide the feature set F into K independent, disjoint subsets; the number of features in each subset is $n_k$.
(2) For each feature subset $D_{i,j}$, the corresponding columns are selected from the training set X to form a new matrix $X_{i,j}$. A bootstrap sample containing 75% of the data is then drawn from $X_{i,j}$ to form a new set $X'_{i,j}$.
(3) A feature transformation is applied to the matrix $X'_{i,j}$ to generate the coefficients in matrix $M_{i,j}$.
(4) The coefficients obtained from the matrices $M_{i,j}$ are used to form a sparse rotation matrix $R_i$, which is expressed as follows: [Due to technical limitations, this equation is only available as a download in the supplemental files section.] (13)

In classification, let $d_{i,j}(xR_i^{e})$ denote the probability, produced by classifier $D_i$, that the test sample x belongs to class $y_j$. The confidence of each class is then calculated by the following average combination formula: [Due to technical limitations, this equation is only available as a download in the supplemental files section.] (14)

Finally, x is assigned to the class with the largest confidence value $\lambda_j(x)$.
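The data flow of the steps above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the helper names are hypothetical, the per-subset coefficient blocks are passed in rather than computed (in the original Rotation Forest they come from principal component analysis on each bootstrap sample), and the feature weighting step of FwRF is not shown because it is not described in this section.

```python
import random


def split_features(n_features, k, rng):
    """Step (1): randomly partition the feature indices into k disjoint subsets."""
    indices = list(range(n_features))
    rng.shuffle(indices)
    return [sorted(indices[j::k]) for j in range(k)]


def bootstrap_rows(n_rows, fraction, rng):
    """Step (2): draw row indices with replacement, e.g. fraction=0.75 of the data."""
    size = max(1, int(round(fraction * n_rows)))
    return [rng.randrange(n_rows) for _ in range(size)]


def assemble_rotation(subsets, blocks, n_features):
    """Step (4): place each subset's coefficient block into a sparse square matrix.

    Entries linking features from different subsets stay zero, so after ordering
    the features by subset the matrix is block diagonal.
    """
    rotation = [[0.0] * n_features for _ in range(n_features)]
    for columns, block in zip(subsets, blocks):
        for a, col_a in enumerate(columns):
            for b, col_b in enumerate(columns):
                rotation[col_a][col_b] = block[a][b]
    return rotation


def average_confidence(per_tree_probs):
    """Average the class probabilities over all trees and pick the largest."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    confidence = [sum(p[j] for p in per_tree_probs) / n_trees for j in range(n_classes)]
    label = max(range(n_classes), key=confidence.__getitem__)
    return confidence, label
```

Each tree is trained on the data multiplied by its own rotation matrix. Because the feature split and bootstrap sample differ from tree to tree, the trees see differently rotated views of the same features, which is the source of diversity in the ensemble.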

Conclusions
Prediction of DTIs is a crucial problem for medical progress and genomic drug discovery. Under the hypothesis that drug molecular structures and protein amino acid sequences have a large impact on the relationships between drugs and target proteins, an in silico approach is proposed in this article to infer potential drug-target relationships. We applied it to the Enzyme, Ion Channel, GPCR and Nuclear Receptor data sets and obtained excellent results. To further evaluate the performance of the proposed approach, we compared it with the LPQ descriptor model, the SVM classifier model and other existing methods on the same data sets. The cross-validation results show that DTIRF performs competitively, which indicates that it is stable and reliable. In future work, we plan to try more feature extraction algorithms to better predict DTIs.

Table 5. Experimental results of the FwRF classifier combined with LPQ algorithm on four benchmark data sets.