Autoencoders for Drug-Target Interaction Prediction

Background: Because experimentally identifying Drug-Target Interactions (DTIs) is laborious and expensive, only a small fraction of DTIs has been verified. Computational methods are therefore valuable for identifying DTIs in biological studies of drug discovery and development. Results: For drug-target interaction prediction, we propose a novel neural network architecture, DAEi, extended from the Denoising AutoEncoder (DAE). We assume that the set of verified DTIs is a corrupted version of the full interaction set, and we use DAEi to learn latent features from the corrupted DTIs to reconstruct the full input. To better predict DTIs, we also add similarity information to DAEi and adopt a new nonlinear method for its calculation; similarity information is very effective at improving DTI prediction. Conclusion: Results of the extensive experiments we conducted on four real data sets show that our proposed methods are superior to other baseline approaches. Availability: All code in this paper is open source, and our projects are available at: https://github.com/XiuzeZhou/DAEi .


Introduction
To understand the protein functions and molecular mechanisms of cells, it is important to study Drug-Target Interactions (DTIs) [1]. Although an enormous amount of research has resulted in the discovery of new drugs, the number of approved drugs, which remains modest [2], is far from meeting the rapidly growing demand for identifying DTIs. Thus, there is an urgent need for effective and efficient computational methods to solve this problem.
Great advances in machine learning have contributed significantly to biological research [3,4]. For example, Support Vector Machines (SVMs), among the most widely applied supervised learning methods, are adopted for drug discovery [5,6]. Random Forest, a popular tree-based ensemble method that adapts well to diverse data, is applied to predict protein-protein interaction sites [7,8]. k-Nearest Neighbor (kNN), a simple and commonly used method, is designed to improve prediction accuracy in drug discovery [9,10]. Matrix Factorization (MF), an effective way to learn latent features from matrix data, has been extended to various models to predict DTIs [10,11,12,13,14,15]. Such computational methods are useful for providing critical and reasonable clues for further experimental investigation.
Recently, to predict DTIs, some researchers started to explore the application of powerful deep learning models to capture the higher-order relationship from inputs by their hidden layers. For example, to accurately predict new DTIs between approved drugs and targets, Wen et al. [16] used Deep Belief Network (DBN) architecture, which is constructed by stacking many Restricted Boltzmann Machines (RBMs). Gao et al. [17] combined Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) to obtain meaningful information from protein sequences and drug structures. Zhao et al. [18] proposed an end-to-end model, associated with an attention mechanism, to predict the binding affinity of DTIs. Zhao et al. [19] proposed a neural network, GANsDTA, which combined two Generative Adversarial Networks (GANs) and a regression network to predict binding affinity. Those deep learning models show great promise for identifying DTIs.
Among all deep learning architectures, autoencoders are one of the simplest tools applied in many fields, such as images and text, to learn meaningful features from raw data. For example, Hu et al. [20] used the Stacked AutoEncoder (SAE) to learn deep features from protein sequence descriptors and drug molecular fingerprints, which were then fed to SVMs to make predictions. To fully train rotation forest classifiers, Wang et al. [21] used SAE to extract protein features from protein sequences. Yasuo et al. [22] adopted a Denoising AutoEncoder (DAE) to capture drug features, which are added to MF as auxiliary information. However, a major limitation of those models is that they cannot be applied in settings with little or no auxiliary information, such as protein sequences and drug molecular structures.
For DTI prediction, unlike using existing autoencoders as just ancillary tools to obtain latent features from protein sequences and drug molecular structures, we use autoencoders as the principal methods for directly predicting DTIs. We assume that the set of verified DTIs is a corrupted version of the full set of DTIs. We applied AutoEncoder for drug-target interaction prediction (AEi), and extended Denoising AutoEncoder for drug-target interaction prediction (DAEi) to learn latent features from a set of corrupted DTIs for reconstructing its full inputs. AEi and DAEi are reliable and effective for detecting potential interactions between drugs and targets.
Because DAEi has a simple and flexible structure, additional information can easily be incorporated. To further improve identification performance, drug-drug and target-target similarity information is added to our model. Similarity information improves the ability of the models to identify DTIs [13,23]. However, to calculate similarities, most previous methods adopt a linear method, which ignores nonlinear and complicated relationships between drug-drug and target-target pairs. Different from existing similarity metrics, we design a nonlinear technique to effectively extract these relationships.
Finally, the performance of our models was empirically evaluated using several state-of-the-art methods and four benchmark data sets: Enzymes, Ion channels, G-Protein-Coupled Receptors (GPCRs), and Nuclear receptors. Results of extensive experiments show that our proposed models considerably outperform the baseline approaches for all data sets in terms of the two most used assessment methods: Area Under the Precision-Recall curve (AUPR) and the Area Under the receiver operator characteristic Curve (AUC). In addition, because our models have a very simple and effective framework, they can be easily extended to further research.
The rest of the paper is organized as follows: Section 2 briefly reviews the background and some related work. Section 3 presents our proposed models in detail. Section 4 describes the experimental results for several data sets to show the performance of our models. Section 5 gives the conclusion and provides future directions.

Related Work
First, we define the problem. Then, to solve the problem, we briefly introduce MF and Collaborative Matrix Factorization (CMF), which extends MF by combining similarity information to achieve better performance.

Problem Definition
Given a set of drugs, D = {d_1, ..., d_n}, and a set of targets, T = {t_1, ..., t_m}, where n and m are the numbers of drugs and targets, respectively, their interaction matrix, Y ∈ R^{n×m}, is defined as follows:

$$y_{d,t} = \begin{cases} 1, & \text{if the interaction between drug } d \text{ and target } t \text{ has been verified,} \\ 0, & \text{otherwise.} \end{cases}$$

For typical DTIs, DTI prediction is viewed as a binary classification problem: given a drug and a target, the goal of the model is to use the learned latent features to accurately predict the value of their interaction. In general, the prediction of DTIs is defined as follows:

$$\hat{y}_{d,t} = F(d, t \mid \Theta),$$

where \hat{y}_{d,t} denotes the predicted interaction between drug d and target t; Θ is the set of learnable parameters of the model; and F is the prediction function of the model. Let ℓ(·) denote a loss function, which measures the distance between the true label, y_{d,t}, and the predicted label, \hat{y}_{d,t}, and let Ω(·) denote a regularization term, which prevents the model from over-fitting. The objective function is defined as follows:

$$\min_{\Theta} \sum_{(d,t) \in V} \ell(y_{d,t}, \hat{y}_{d,t}) + \lambda \, \Omega(\Theta),$$

where V denotes the set of instances whose interactions have been experimentally verified, and λ denotes a regularization coefficient.
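The setup above can be illustrated concretely as follows. This is a minimal sketch with a hypothetical toy interaction matrix; `objective` and its arguments are illustrative names, not the authors' code, and the toy treats every entry of Y as observed.

```python
import numpy as np

# Hypothetical toy interaction matrix Y (n = 3 drugs x m = 4 targets):
# Y[d, t] = 1 if the interaction between drug d and target t is verified.
Y = np.array([[1., 0., 0., 1.],
              [0., 1., 0., 0.],
              [1., 0., 1., 0.]])

def objective(Y, Y_hat, params, lam=1e-6):
    """Regularized objective: square loss l over the observed pairs V
    plus a squared-norm regularizer Omega over the model parameters."""
    loss = np.sum((Y - Y_hat) ** 2)                 # sum of l(y, y_hat) over V
    omega = sum(np.sum(p ** 2) for p in params)     # Omega(Theta)
    return loss + lam * omega
```

A perfect predictor with zero-norm parameters attains an objective value of 0, while predicting all zeros pays one unit of loss per verified interaction.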

MF and CMF
Let P ∈ R^{n×k} and Q ∈ R^{m×k} denote the latent feature matrices of drugs and targets, respectively, where k is the number of latent features, and let p_d ∈ R^{1×k} and q_t ∈ R^{1×k} denote the latent features of drug d and target t, respectively. To predict interactions, MF uses a dot product, which linearly combines latent features, and square loss is used to measure its error. The objective function of MF is defined as follows:

$$\min_{P,Q} \sum_{(d,t) \in V} (y_{d,t} - p_d q_t^{\top})^2 + \lambda \left( \|P\|_F^2 + \|Q\|_F^2 \right),$$

where λ denotes a regularization coefficient, and ||·||_F denotes the Frobenius norm. To better predict DTIs, Zheng et al. [13] proposed CMF, which adds multiple similarities to improve MF. CMF assumes that similar drugs will interact with similar targets, and vice versa; thus, the feature vectors of similar drugs/targets should be near each other. The objective function of CMF is defined as follows:

$$\min_{P,Q} \|Y - PQ^{\top}\|_F^2 + \lambda \left( \|P\|_F^2 + \|Q\|_F^2 \right) + \lambda_d \|S^d - PP^{\top}\|_F^2 + \lambda_t \|S^t - QQ^{\top}\|_F^2,$$

where λ_d and λ_t denote the regularization coefficients for drug and target similarities, respectively; S^d ∈ R^{n×n} denotes the similarity matrix for drugs, and S^t ∈ R^{m×m} denotes the similarity matrix for targets. Compared with MF, CMF improves predictive performance for DTIs by incorporating the two similarity matrices as regularization terms. Similarity provides additional information about drug-drug and target-target relationships, through which more accurate representations are captured.
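A minimal MF baseline in the spirit of the objective above can be sketched as follows. This trains on all entries by full-batch gradient descent for simplicity; the learning rate, epoch count, and initialization scale are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def mf_fit(Y, k=4, lr=0.1, lam=1e-6, epochs=1000, seed=0):
    """Plain MF: approximate Y by P @ Q.T under square loss with
    Frobenius-norm regularization, trained by gradient descent."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    P = 0.1 * rng.standard_normal((n, k))   # drug latent features
    Q = 0.1 * rng.standard_normal((m, k))   # target latent features
    for _ in range(epochs):
        E = Y - P @ Q.T                     # residual on all entries
        P += lr * (E @ Q - lam * P)         # gradient step for P
        Q += lr * (E.T @ P - lam * Q)       # gradient step for Q
    return P, Q

Y = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 0., 1.]])
P, Q = mf_fit(Y)
# P @ Q.T closely reconstructs this easy, low-rank toy matrix.
```

CMF would add the two similarity regularizers to the gradient; the update structure stays the same.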

Methods
First, we introduce in detail the AEi framework applied to DTIs. Then, to predict DTIs, we describe DAEi, which learns the correlations between drugs and targets by training on a corrupted version of the known interaction set. Finally, for further development, two similarity matrices are added to our model and a nonlinear method is designed. The main goal of AEi and DAEi is to predict all unknown interactions.

AEi for DTIs
Autoencoders, among the most popular deep neural networks, are widely used for classification and for learning unsupervised features by reconstructing the original input [24,25]. A classical autoencoder is typically implemented as a one-hidden-layer neural network. From the drug's point of view, the set of all targets that interact with a given drug serves as the input for AEi; we call this drug AEi (d-AEi). Target AEi (t-AEi) is defined symmetrically. In the following, we take d-AEi as an example; t-AEi is similar. The framework of d-AEi is shown in Figure 1.
Given a vector y_d ∈ R^m of drug d as input, the vector is mapped to a hidden representation z ∈ R^k through the following mapping function:

$$z = a(y_d W + b),$$

where W ∈ R^{m×k}, b ∈ R^k, and a(·) denote the weight, bias, and activation function of the encoder, respectively.
The learning parameters of the autoencoder are trained by minimizing the average reconstruction error, defined as follows:

$$\min_{\Theta} \frac{1}{n} \sum_{d=1}^{n} \ell(y_d, \hat{y}_d),$$

where ℓ(·) denotes a loss function; square loss and cross-entropy loss are the two functions most commonly used to measure loss in training. We chose square loss as our loss function.
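The d-AEi forward pass described above can be sketched as follows. The dimensions, weight names, and identity decoder are illustrative assumptions; in practice the weights would be learned by minimizing the reconstruction error.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, k = 6, 3                                  # hypothetical: 6 targets, 3 hidden units
rng = np.random.default_rng(1)
W  = 0.1 * rng.standard_normal((m, k))       # encoder weight, W in R^{m x k}
b  = np.zeros(k)                             # encoder bias
W2 = 0.1 * rng.standard_normal((k, m))       # decoder weight
b2 = np.zeros(m)                             # decoder bias

def d_aei(y_d):
    """Encode a drug's target-interaction vector, then map the hidden
    representation back to the input space to reconstruct it."""
    z = sigmoid(y_d @ W + b)                 # hidden representation z in R^k
    return z @ W2 + b2                       # reconstruction y_hat in R^m

y_d = np.array([1., 0., 0., 1., 0., 1.])     # toy interaction profile of one drug
recon_error = np.sum((y_d - d_aei(y_d)) ** 2)  # square reconstruction loss
```

With untrained random weights the reconstruction error is large; training drives it toward zero, and the learned z then serves as the drug's latent feature.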

DAEi for DTIs
d-AEi has two main shortcomings: 1) it simply reconstructs its input, ignoring possible noise in the data; and 2) it models a set of targets, which only considers the associations among targets and ignores their interactions with drugs. To alleviate these problems and achieve enhanced DTI prediction, we propose DAEi, a model extended from DAE that is robust when reconstructing the original data. DAEi learns a representation from corrupted input and reconstructs the input from that representation. We associate a latent feature of the drug with all targets (d-DAEi) and a latent feature of the target with all drugs (t-DAEi); DAEi thus learns from both drugs and targets, not from each separately. Figure 2 shows the network architecture of d-DAEi, which consists of three layers: input, hidden, and output. Next, we illustrate d-DAEi in detail; t-DAEi is similar.

Input Layer. In the input layer, instead of the traditional approach of modeling drugs and targets separately, we combine them. A drug d and its interactions with all targets are recorded as y_d ∈ R^m, i.e., the d-th row of the matrix Y. The input of DAEi is \tilde{y}_d, the corrupted version of y_d. There are two popular ways to corrupt the original inputs: 1) randomly discard some non-zero values of the inputs; or 2) add Gaussian noise to the inputs. We chose to add Gaussian noise to y_d to generate its corrupted vector, defined as follows:

$$\tilde{y}_d = y_d + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I).$$

Hidden Layer. The role of the hidden layer is to translate the drug/target into the latent representation space; the hidden layer also provides a nonlinear representation.
An interaction function for modeling \tilde{y}_d and d, such as concatenation, element-wise product, or element-wise sum, is critical for the model to learn their relationship. We chose element-wise sum as our interaction function. The function of the hidden layer is defined as follows:

$$z = a(\tilde{y}_d W + v_d + b),$$

where W ∈ R^{m×k}, b ∈ R^k, z ∈ R^k, and a(·) denote the weight, bias, output, and activation function of the hidden layer, respectively, and v_d ∈ R^k is the latent feature of drug d. In DAEi, we chose the sigmoid function as the activation function.

Output Layer. The goal of the output layer is to map the latent representation back to the original input space to reconstruct the input vector, defined as follows:

$$\hat{y}_d = f(z W' + b'),$$

where W' ∈ R^{k×m}, b' ∈ R^m, and f(·) denote the weight, bias, and mapping function of the output layer, respectively. We chose the identity function as the mapping function of DAEi.
Finally, the objective function of DAEi is defined as follows:

$$\min_{\Theta} \frac{1}{n} \sum_{d=1}^{n} \ell(y_d, \hat{y}_d) + \lambda \, \Omega(\Theta),$$

where Θ = {W, W', b, b', V} is the set of learning parameters and Ω(·) is the regularization term.
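Putting the three layers together, a trainable d-DAEi can be sketched as follows. This is a numpy sketch with manual backpropagation under stated assumptions: the corruption level, learning rate, and epoch count are illustrative, and the per-drug latent v_d is implemented as a row of the matrix `V`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_d_daei(Y, k=8, sigma=0.1, lr=0.2, lam=1e-6, epochs=1000, seed=0):
    """d-DAEi sketch: corrupt each drug's interaction vector with Gaussian
    noise, add the drug's latent feature v_d to the hidden pre-activation
    (element-wise sum), and reconstruct the CLEAN vector under square loss."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    W  = 0.1 * rng.standard_normal((m, k)); b  = np.zeros(k)   # hidden layer
    V  = 0.1 * rng.standard_normal((n, k))                     # drug latents v_d
    W2 = 0.1 * rng.standard_normal((k, m)); b2 = np.zeros(m)   # output layer
    for _ in range(epochs):
        Yc = Y + sigma * rng.standard_normal(Y.shape)   # corrupted input
        Z = sigmoid(Yc @ W + V + b)                     # hidden representation
        Y_hat = Z @ W2 + b2                             # identity output map
        E = (Y_hat - Y) / n                             # grad of mean square loss
        dZ = (E @ W2.T) * Z * (1.0 - Z)                 # backprop through sigmoid
        W2 -= lr * (Z.T @ E + lam * W2); b2 -= lr * E.sum(axis=0)
        W  -= lr * (Yc.T @ dZ + lam * W); b  -= lr * dZ.sum(axis=0)
        V  -= lr * (n * dZ + lam * V)                   # one row per drug
    return W, b, V, W2, b2

def reconstruct(Y, W, b, V, W2, b2):
    """Predict all interactions from the clean input (no corruption at test)."""
    return sigmoid(Y @ W + V + b) @ W2 + b2

Y = np.array([[1., 0., 0., 1., 0., 1.],
              [0., 1., 0., 0., 1., 0.],
              [1., 0., 1., 0., 0., 0.],
              [0., 0., 1., 1., 0., 1.],
              [1., 1., 0., 0., 0., 0.]])
params = train_d_daei(Y)
mse = np.mean((reconstruct(Y, *params) - Y) ** 2)
```

Unobserved (zero) entries with high reconstructed scores are the candidate new DTIs; t-DAEi is obtained by training the same sketch on `Y.T`.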

Similarity Information
Drug-drug and target-target similarities provide important information: similar entities have similar characteristics. If a drug has been experimentally verified to interact with a target, it is likely to interact with similar targets, and vice versa. Therefore, to improve our models, we added similarity information.
Existing methods use a dot product, which linearly combines latent features, to calculate the similarity between drugs i and j and between targets i and j, as shown in the following:

$$\hat{s}^d_{i,j} = v_i v_j^{\top}, \qquad \hat{s}^t_{i,j} = u_i u_j^{\top},$$

where v and u denote the latent features of drugs and targets, respectively. However, a linear combination of the latent features of drugs and targets might neglect their possible nonlinear relationships. To better capture drug-drug and target-target relationships, we propose a nonlinear method. To calculate similarities, we use an exponential function defined as follows:

$$\hat{s}^d_{i,j} = \exp\!\left(-\|v_i - v_j\|^2\right), \qquad \hat{s}^t_{i,j} = \exp\!\left(-\|u_i - u_j\|^2\right).$$

Then square loss is used to measure the distance between the predicted and true values, as shown in the following:

$$\mathcal{L}_{sim} = \lambda_d \|S^d - \hat{S}^d\|_F^2 + \lambda_t \|S^t - \hat{S}^t\|_F^2,$$

where \hat{S}^d and \hat{S}^t are the predicted similarity matrices of drugs and targets, respectively.
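Assuming the exponential similarity takes an RBF-like form exp(-||v_i - v_j||^2) (the paper's exact exponential form is not reproduced here, so treat this as a sketch), the nonlinear similarity and its square loss can be written as:

```python
import numpy as np

def rbf_similarity(V):
    """Nonlinear pairwise similarity between latent vectors:
    s_ij = exp(-||v_i - v_j||^2). Identical vectors get similarity 1."""
    sq = np.sum(V ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (V @ V.T)   # squared distances
    return np.exp(-np.maximum(d2, 0.0))                # clip tiny negatives

def similarity_loss(S_given, V, lam=1e-6):
    """Square loss between given and predicted similarity matrices,
    weighted by the regularization coefficient (lambda_d or lambda_t)."""
    return lam * np.sum((S_given - rbf_similarity(V)) ** 2)

V = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [3.0, 4.0]])       # toy drug latents; rows 0 and 1 coincide
S = rbf_similarity(V)
# S[0, 1] == 1.0 (identical latents); S[0, 2] == exp(-25) (far apart)
```

Unlike the dot product, this similarity is bounded in (0, 1] and decays smoothly with latent distance, which is what lets it capture the nonlinear relationships discussed above.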

Experiments
First, we describe the data sets used in our experiments and their similarity matrices. Second, we introduce the metrics used to evaluate our models. Then, we present the baseline approaches compared with our models and the parameter settings for all methods. Finally, to illustrate the performance of our models, we compare them with the baselines in extensive experiments and discuss the influence of similarity information, the nonlinear method, and a key parameter, the hidden layer size.

Experimental Setting
Data Sets. To evaluate our proposed methods with respect to the prediction of DTIs, we use the following four benchmark data sets [1]: Nuclear Receptors, GPCRs, Ion Channels, and Enzymes, collected by Yamanishi et al. [26] from KEGG BRITE [27], BRENDA [28], SuperTarget [29], and DrugBank [30], respectively. Each data set contains three types of information: 1) verified DTIs; 2) drug-drug similarity; and 3) target-target similarity. Table 1 lists some statistics of the verified DTIs in all the data sets. The similarity of the chemical structures between compounds was computed by SIMCOMP [31], which measures the similarity score between two compound structures. The amino acid sequences of the target proteins were obtained from the KEGG GENES database [27], and the sequence similarity between two proteins was computed using a normalized version of the Smith-Waterman score [32].
Evaluation Metrics. Following previous studies [10,12,13,33], to evaluate the performance of the DTI prediction methods, we repeated 10-fold Cross-Validation (CV) five times. In CV, the set of all DTIs (Y) is randomly split into ten folds, nine of which are used for training and one for testing.
In this paper, we consider two popular metrics: AUC and AUPR. An AUC score is estimated in each repetition of CV, and the average score over all five repetitions is reported; the AUPR score is estimated in the same way. Because AUPR punishes highly ranked false positives much more than AUC does [34], AUPR was chosen as our major evaluation metric in this paper.
Baseline Approaches. To show the effectiveness of our models, we compared them with five baseline approaches. Each baseline has its own characteristics and represents a particular modeling capability. The approaches are briefly introduced as follows:
- PMF [14]. Probabilistic Matrix Factorization (PMF), one of the most widely applied methods for DTI prediction, effectively learns latent features from matrix data;
- CMF [13]. The state-of-the-art MF-based method for DTI prediction, which simultaneously learns latent features from DTIs and from drug-drug and target-target similarities, and uses a dot product to calculate similarities;
- RBM [35]. Restricted Boltzmann Machine (RBM), the state-of-the-art shallow neural-network-based method for DTI prediction; its visible units encode observed types of DTIs, and its hidden units represent latent features describing DTIs;
- BRDTI [33]. The state-of-the-art Bayesian ranking method for DTI prediction, which extends Bayesian Personalized Ranking (BPR) [36] by adding similarity information and a target bias;
- CnnDTI [37]. The state-of-the-art deep neural-network-based method for DTI prediction, which resembles the LeNet-5 framework and applies a CNN to drug descriptors and target protein sequences.
Parameter Settings. To make a fair comparison, the corresponding parameters in all models are set to the same values.
We set the learning rate in all methods to 0.001; the regularization coefficient (λ) in all methods except RBM to 10^-6; the number of latent features (k) in PMF, CMF, and BRDTI, the hidden size in RBM and DAEi, and the embedding size of CnnDTI to 1024, 256, 256, and 128 for Enzymes, Ion Channels, GPCRs, and Nuclear Receptors, respectively; the batch size in RBM, CnnDTI, and our models to 256; the regularization coefficients for similarity (λ_d and λ_t) in CMF, BRDTI, and the DAEi models to 10^-6; and the number of epochs for the neural network frameworks to 100.
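The two evaluation metrics used above, AUC and AUPR, can be computed without external dependencies. A minimal sketch (assuming no tied scores, and using average precision as the AUPR estimate) is:

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann-Whitney rank statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)    # rank 1 = lowest score
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def aupr_score(y_true, scores):
    """AUPR estimated as average precision over the ranked list."""
    order = np.argsort(-scores)                     # best-scored first
    y = y_true[order]
    cum_tp = np.cumsum(y)
    precision = cum_tp / np.arange(1, len(y) + 1)
    return precision[y == 1].mean()                 # precision at each positive

y = np.array([1, 0, 1, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.2, 0.1])   # one false positive ranked 2nd
# Both metrics equal 5/6 here; a perfect ranking would give 1.0 for both.
```

Note how the single highly ranked false positive already pulls AUPR down, illustrating why AUPR is the stricter metric for the sparse DTI setting.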

Results and Analysis
Overall Performance. First, we conducted experiments on all data sets to determine the overall performance of our models. Table 2 shows the average AUC and AUPR values obtained by all methods on the four data sets; the best results are shown in bold. From the results in Table 2, the following can be observed: (1) In terms of both AUC and AUPR, the DAEi models significantly outperform the other baselines, indicating that the DAEi models effectively predict DTIs. (2) In terms of AUC, d-DAEi performs better than t-DAEi on the Enzymes, Ion Channels, and GPCRs data sets, while in terms of AUPR, t-DAEi performs better than d-DAEi on the GPCRs and Nuclear Receptors data sets, possibly because the numbers of drugs and targets affect the DAEi models: the side with the larger number of drugs/targets is more suitable for learning meaningful features. (3) On the same data set, the results of d-AEi and t-AEi differ significantly, especially on the GPCRs and Nuclear Receptors data sets in terms of the AUPR metric; possibly, an input vector with a large sparsity difference results in an unstable model. (4) The gaps between the DAEi models and CnnDTI, the best baseline method, are significant: in terms of AUPR, the DAEi models achieve results on Enzymes, Ion Channels, GPCRs, and Nuclear Receptors that are 5.2%, 2.8%, 4.2%, and 5.8% higher than CnnDTI, indicating that our architectures, modeled on the sets of drugs and targets, are more reasonable than CnnDTI, which models only a single drug and a single target. (5) RBM has a neural network architecture similar to that of the DAEi models; however, RBM performs much worse than even the d-AEi model, especially in terms of AUPR.
Possibly, one reason is that RBM has only a binary representation, which limits its performance; another reason might be that it does not consider any relationship between drugs and targets. (6) Compared with PMF, the DAEi models achieve better results by a large margin, which indicates that neural networks have a powerful ability to learn nonlinear representations, whereas PMF models only linear features. Finally, it can be concluded that our proposed methods learn sufficient and effective features through neural networks to detect true DTIs.
Effect of Similarity and Nonlinear Calculation. Next, we investigate how similarity information, which provides additional information for building more accurate drug-drug and target-target relationships, affects our models. In this experiment, we set λ_d = λ_t and select their values from {0, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1}; DAEi can be viewed as a special case of the similarity-augmented DAEi when λ_d = λ_t = 0. A linear calculation method, as adopted in CMF, is used for comparison with our nonlinear calculation method. In this experiment, AUPR was chosen as the evaluation metric. Table 3 shows the average AUPR values of d-DAEi under the different similarity settings and calculation methods on all data sets.
It is seen from Table 3 that, in most cases across all the data sets (21 cases out of 24), our method with the nonlinear calculation achieves higher AUPR values than with the linear calculation method, which indicates that the nonlinear technique has a much stronger ability to extract the proper information from similarity data. Also, for the regularization coefficients, small values (smaller than 10^-4) improve the experimental results slightly, whereas large values (larger than 10^-4) adversely affect them; thus, for good performance, choosing an appropriate value for the regularization coefficient is critical.
Effect of Hidden Size (k). Finally, we illustrate the effect that the hidden layer size (embedding size, or the number of latent features) has on the DAEi models. The hidden layer size is a key factor for the DAEi models in learning meaningful latent features from the inputs. Table 4 shows the average AUPR values of the DAEi models on the four data sets. As shown in Table 4, there is a common trend across all the data sets: as the value of k increases, the AUPR value increases quickly and then decreases slowly. When k is small (less than 128), the models under-fit: the model is too simple to achieve optimal results on test data. When k is too large, the models over-fit: the model is too complex to learn from the highly sparse interaction matrix. These observations indicate that choosing an appropriate value for the hidden size (about 1 to 2 times the number of drugs/targets in each data set) is essentially a trade-off between under-fitting and over-fitting.

Conclusion
In drug discovery and development, effective and efficient computational methods are important for detecting potential DTIs. To predict DTIs, we proposed two novel methods, AEi and DAEi, developed from the autoencoder and DAE, respectively. The DAEi models, which reconstruct the full input from a corrupted set of DTIs, are effective at identifying potential interactions between drugs and targets, and they achieve even better performance after similarity information is added. Experimental results show that our models considerably outperform the baseline approaches.
For future work, first, to develop more effective methods for larger data sets, we plan to further investigate the integration of other deep learning models into our methods. Second, in our experiments, we found that our models did not perform well in predicting new drugs and targets; therefore, to improve performance on new drug and target tasks, it will be necessary to incorporate additional information, such as the amino acid sequences of target proteins and drug molecular structures. Finally, other possible directions are to extend our models by adding more hidden layers, and to extend other autoencoders, such as the Stacked AutoEncoder (SAE) and the Variational AutoEncoder (VAE), to predict DTIs.