In this section, we first formulate the DTI prediction problem. Then, we provide an overview of the Graph-DTI framework, followed by its input and GNN layers. Finally, we discuss DTI prediction with Graph-DTI.
Problem Formulation
In a typical DTI scheme, let \({N}_{d}\) and \({N}_{t}\) denote the sets of drugs and target proteins, respectively. We then define the drug-target interaction matrix \(Y\in {\{0, 1\}}^{\left|{N}_{d}\right|\times \left|{N}_{t}\right|}\), where \(\left|{N}_{d}\right|\) denotes the number of drugs and \(\left|{N}_{t}\right|\) denotes the number of target proteins. In the matrix Y, an entry \({y}_{i,j}=1\left(i\in {N}_{d},j\in {N}_{t}\right)\) indicates that protein j has a known interaction with drug i; otherwise, \({y}_{i,j}=0\), which may correspond to an undetected potential DTI. DTI prediction can thus be regarded as a semi-supervised classification problem in machine learning: given the DTI matrix Y, our goal is to predict whether drug i (\(i\in {N}_{d}\)) has a previously undiscovered interaction with protein j (\(j\in {N}_{t}\)). To achieve this goal, our main task is to learn to predict \({\widehat{y}}_{i,j}=\phi \left(i,j\right|\beta ,Y)\), where \({\widehat{y}}_{i,j}\) denotes the probability that protein j interacts with drug i and β denotes the model parameters of the function φ.
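The setup above can be sketched in NumPy; the matrix sizes and known pairs here are toy illustrations, not the paper's dataset:

```python
import numpy as np

# Toy sizes (illustrative only; the paper's dataset has 708 drugs, 1512 targets).
n_drugs, n_targets = 4, 5

# Binary interaction matrix Y: Y[i, j] = 1 means drug i is known to
# interact with protein j; 0 means unknown / potential interaction.
Y = np.zeros((n_drugs, n_targets), dtype=int)
known_pairs = [(0, 1), (2, 3), (3, 0)]  # hypothetical known DTIs
for i, j in known_pairs:
    Y[i, j] = 1

# The semi-supervised prediction task: score every unobserved (i, j) pair.
candidate_pairs = [(i, j) for i in range(n_drugs)
                   for j in range(n_targets) if Y[i, j] == 0]
```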
Overview
Figure 1 shows an overview of Graph-DTI. It takes as input the processed DTI matrix and the heterogeneous network obtained from dataset preprocessing, and outputs interaction values for drug-target pairs. The central idea of Graph-DTI is to encode nodes and their topological neighborhoods into distributed representations by exploiting higher-order neighbor structure information and topological structure information with graph neural networks. Accordingly, we designed Graph-DTI as a three-step DTI prediction framework.
- The original dataset is processed to construct a heterogeneous network graph, and the DTIs in the graph are masked.
- Node features of drug-target pairs and their adjacency structures in the heterogeneous network are encoded.
- Prediction is performed by feeding the encoded vectors from the previous step into the classifier.
Overall, in the first step, we extract the DTI data sources from the dataset, which contain drug-drug pairs, protein-protein pairs, drug-target pairs, drug chemical structure similarity scores, and protein sequence similarity scores, and construct a heterogeneous network. In the second step, we use Graph-DTI to extract the features of nodes and the neighborhood structures of their related nodes from the constructed heterogeneous network. In the third step, to predict the interaction values between drug-target pairs, Graph-DTI outputs the latent representations of the nodes and the neighborhood topology between drug-target pairs; we then calculate the scores between them and output the predicted interaction value. We specify the details of the model below.
DTI Extraction and Heterogeneous Network Construction
The first step of Graph-DTI is divided into two parts, including the construction of a heterogeneous network and the construction of a mask for the DTI.
Graph-DTI predicts unknown DTIs from a heterogeneous network constructed from drugs and target proteins, where drugs and target proteins are denoted as nodes, and DTIs and other interactions or associations are denoted as edges. The heterogeneous network includes 2 types of nodes, drug and target, and 5 types of edges: drug-drug interaction (DDI), drug-target interaction (DTI), protein-protein interaction (PPI), drug chemical structure similarity, and protein sequence similarity.
The heterogeneous network is defined as an undirected graph G = (N, E), where N is the set of nodes and E is the set of edges, with node types O = {Drug, Protein} and edge types R = {Drug-target-interaction, Drug-drug-interaction, Protein-protein-interaction, Drug-structure-similarity, Protein-sequence-similarity}. The number of drugs is 708, the number of targets is 1512, and the number of nodes in N is 2220. The number of drug-target-interaction edges is 1923, the number of drug-drug-interaction edges is 10036, the number of protein-protein-interaction edges is 7363, the number of drug-structure-similarity edges is 708², and the number of protein-sequence-similarity edges is 1512². Specifically, drug-target-interaction, drug-drug-interaction and protein-protein-interaction edges are represented by binary sparse matrices; these three types of edges are undirected and unweighted. The scores in drug-structure similarity and protein-sequence similarity are used as the weights of the corresponding edges, i.e., these two types of edges are undirected and weighted. The construction process is detailed in Fig. 2. In the heterogeneous graph, two nodes can be connected by more than one edge.
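A minimal sketch of such a typed, multi-edge network in plain Python (node names, edge counts, and similarity scores are toy assumptions, not the paper's data):

```python
# Toy heterogeneous network stored as a typed edge list; parallel edges
# between the same pair of nodes are allowed, as in the paper's graph.
nodes = {"d0": "Drug", "d1": "Drug", "t0": "Protein", "t1": "Protein"}

edges = [
    # (u, v, edge type, weight) -- weight is None for unweighted edge types
    ("d0", "t0", "Drug-target-interaction", None),
    ("d0", "d1", "Drug-drug-interaction", None),
    ("t0", "t1", "Protein-protein-interaction", None),
    # similarity scores act as edge weights (hypothetical values)
    ("d0", "d1", "Drug-structure-similarity", 0.83),
    ("t0", "t1", "Protein-sequence-similarity", 0.61),
]

def neighbors(node, etype=None):
    """Undirected neighborhood lookup, optionally filtered by edge type."""
    out = []
    for u, v, t, w in edges:
        if etype is not None and t != etype:
            continue
        if u == node:
            out.append(v)
        elif v == node:
            out.append(u)
    return out
```

Note that `d0` and `d1` appear twice in each other's neighborhood because they are linked by both a DDI edge and a similarity edge, mirroring the parallel-edge property of the heterogeneous graph.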
Neighborhood Message Passing
For a given heterogeneous network graph, the model aims to automatically learn node embeddings from the network topology, that is, feature representations produced by a function mapping each node while preserving the original topological features as much as possible, which greatly facilitates DTI prediction. Suppose the feature of node u is \({h}_{u}\in R\). Because edge features are usually high-dimensional, a linear operation is introduced to reduce the dimension of the node features: \({h}_{u}\) and \({h}_{v}\) are concatenated and passed through a linear layer, W⋅(\({h}_{u}\left|\right|{h}_{v}\)) = \({W}_{u}{h}_{u} +{W}_{v}{h}_{v}\), where \({W}_{u}\) and \({W}_{v}\) are the left and right halves of the linear matrix W, respectively. This edge-feature computation appears in formula (1): the source node feature \({h}_{u}\) is multiplied by the edge feature to obtain the message, the messages are summed over all neighbors, and the result is multiplied by 2 to obtain the output feature \({h}_{v}^{\prime }\).
$${h}_{v}^{\prime }=2\sum _{u\in {N}_{v}}{h}_{u}\ast \left({W}_{u}{h}_{u} +{W}_{v}{h}_{v}\right), \left(1\right)$$
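The update in Eq. (1) can be sketched in NumPy; the feature dimension, neighbor count, and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Feature of the center node v and of its three neighbors u.
h_v = rng.standard_normal(dim)
neighbours = [rng.standard_normal(dim) for _ in range(3)]

# W acts on the concatenation (h_u || h_v); splitting it into halves gives
# W @ concat(h_u, h_v) == W_u @ h_u + W_v @ h_v.
W = rng.standard_normal((dim, 2 * dim))
W_u, W_v = W[:, :dim], W[:, dim:]

# Eq. (1): message = h_u * (W_u h_u + W_v h_v) (elementwise product),
# summed over neighbors, then scaled by 2 to give h_v'.
h_v_new = 2 * sum(h_u * (W_u @ h_u + W_v @ h_v) for h_u in neighbours)
```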
Node Embedding Learning
As a special graph representation method, the GCN uses neural networks to encode graph nodes, embedding the graph structure as a computationally tractable vector matrix. Although the GCN is defined in the Fourier domain, it can also be understood in the spatial domain as the so-called message passing mechanism: gathering information from the neighborhood and then updating the central node at each step [31]. \({H}^{\left(l\right)}\) is the input feature of each node in layer l; when l = 1, it is the original node feature. W is a linear transformation matrix that transforms (or maps) the dimension of the aggregated node features, \(\delta\) is an activation function, and A is the adjacency matrix. The product H · W is multiplied by the adjacency matrix A, which selects the first-order neighbor nodes; this is essentially information transmission among neighbors. The adjacency matrix is renormalized by a constant \({c}_{i}\), yielding formula (2).
$${H}^{(l+1)}=\delta \left(\sum _{j\in {N}_{i}}\frac{1}{{c}_{i}}{W}^{\left(l\right)}{H}^{\left(l\right)}\right), \left(2\right)$$
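Eq. (2) can be sketched as a single normalized aggregation step in NumPy; the toy adjacency matrix, constant features, and weight values are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Toy 3-node graph with 4-dimensional input features H^(l).
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)  # adjacency matrix
H = np.ones((3, 4))                     # H^(l), all-ones for a checkable result
W = np.full((4, 2), 0.5)                # shared weight matrix W^(l)

# c_i is the degree of node i, so dividing each row of A by c_i
# renormalizes the neighborhood sum into a mean.
c = A.sum(axis=1, keepdims=True)

# Eq. (2): H^(l+1) = delta( sum_{j in N_i} (1/c_i) H_j^(l) W^(l) )
H_next = relu((A / c) @ H @ W)
```

With all-ones features, each averaged neighborhood feature is 1 and each output entry is 4 × 0.5 = 2, which makes the normalization easy to verify by hand.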
In formula (2), \({W}^{\left(l\right)}\) is shared by all edges, which implicitly assumes that all edges are of the same type. However, the goal of this study is link prediction of potential DTIs on a heterogeneous network. In the heterogeneous graph G = (N, E), N represents the set of entities and E represents the set of relationships, where drug entities are denoted d (d∈N), target protein entities t (t∈N), and edges of the DTI relationship type \({r}_{dt}=(d,t)\) (\({r}_{dt}\in \mathbf{E}\)). Formula (2) is therefore converted to formula (3) using the R-GCN method. \({C}_{d,{r}_{dt}}\) is a regularization constant: \({C}_{d,{r}_{dt}}=\left|{N}_{d}^{r}\right|\), the number of neighbors of node d connected through edges of type \({r}_{dt}\). \(\delta\) is the nonlinear activation function ReLU(x) = max(0, x).
$${h}_{d}^{(l+1)}=\delta \left({W}_{0}^{(l)}{h}_{d}^{(l)}+\sum _{{r}_{dt}\in E}\sum _{t\in {N}_{d}^{r}}\frac{1}{{C}_{d,{r}_{dt}}}{W}_{t}^{\left(l\right)}{h}_{t}^{\left(l\right)}\right) \left(3\right)$$
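A minimal NumPy sketch of the relation-aware update in Eq. (3) for a single drug node; the relation set, neighbor lists, and dimensions are all toy assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

dim = 3
rng = np.random.default_rng(1)

h_d = rng.standard_normal(dim)  # drug node feature h_d^(l)

# Per-relation neighbor features and per-relation weight matrices
# (hypothetical relations and counts).
neighbours = {
    "Drug-target-interaction": [rng.standard_normal(dim) for _ in range(2)],
    "Drug-drug-interaction":   [rng.standard_normal(dim)],
}
W_rel = {r: rng.standard_normal((dim, dim)) for r in neighbours}
W_0 = rng.standard_normal((dim, dim))  # self-loop weight W_0^(l)

# Eq. (3): self term plus relation-wise normalized neighbor aggregation,
# with C_{d,r} = |N_d^r|, the number of r-neighbors of d.
agg = W_0 @ h_d
for r, hs in neighbours.items():
    C = len(hs)
    for h_t in hs:
        agg += (W_rel[r] @ h_t) / C
h_d_next = relu(agg)
```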
Link Prediction and Negative Sampling
The basic idea of link prediction is to calculate the likelihood score \({y}_{d,t}\) of a link between nodes d and t from their representation vectors \({h}_{d}^{\left(l\right)}\) and \({h}_{t}^{\left(l\right)}\), each obtained from formula (3); \({y}_{d,t}\) is calculated as in formula (4). After an MLP linear layer, a sigmoid produces the final score.
$${y}_{d,t}=\text{sigmoid}\left(W\ast \left({h}_{d}^{\left(l\right)}||{h}_{t}^{\left(l\right)}\right)\right) \left(4\right)$$
$$Loss={\sum }_{{t}_{i}\sim {P}_{n}\left(t\right),i=1,2,...,k}{\mathrm{max}\left(0,1-{y}_{d,t}+{y}_{d,{t}_{i}}\right)}^{2} \left(5\right)$$
Training compares the score difference between two connected nodes and between arbitrary pairs of nodes. DTI prediction usually serves as a binary classification problem, with known drug-target protein interactions as positive examples and unknown interactions as negative examples. Table 1 shows the generalized confusion matrix used to evaluate the performance of classification models. For a given edge (d, t), the model's score for that edge should be higher than its score for pairs formed by d and nodes \({t}_{i}\) sampled from a noise distribution \({t}_{i}\sim {P}_{n}\left(t\right)\), i.e., \({y}_{d,t}>{y}_{d,{t}_{i}}\); this requires negative sampling. The loss function uses the margin loss in formula (5), where k is the number of negative samples drawn per positive edge.
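The scoring and margin loss of Eqs. (4)-(5) can be sketched as follows; the embedding size, the uniform random noise distribution standing in for Pn(t), and k are all assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
dim = 4

def score(h_d, h_t, W):
    """Eq. (4): sigmoid(W . (h_d || h_t))."""
    return sigmoid(W @ np.concatenate([h_d, h_t]))

# One positive edge (d, t) and k negative targets t_i ~ Pn(t)
# (here simply drawn at random -- an assumption, not the paper's sampler).
W = rng.standard_normal(2 * dim)
h_d, h_t = rng.standard_normal(dim), rng.standard_normal(dim)
k = 5
negatives = [rng.standard_normal(dim) for _ in range(k)]

# Eq. (5): squared margin loss pushing y_{d,t} above each y_{d,t_i}.
y_pos = score(h_d, h_t, W)
loss = sum(max(0.0, 1.0 - y_pos + score(h_d, h_n, W)) ** 2
           for h_n in negatives)
```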
Table 1
The confusion matrix for performance evaluation.

|                   | Positive (Predicted) | Negative (Predicted) |
| ----------------- | -------------------- | -------------------- |
| Positive (Actual) | True Positive (TP)   | False Negative (FN)  |
| Negative (Actual) | False Positive (FP)  | True Negative (TN)   |
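The four cells of Table 1 can be computed from binary labels and predictions with a short helper (a generic sketch, not the paper's evaluation code):

```python
def confusion_matrix(y_true, y_pred):
    """Count TP / FN / FP / TN for binary labels (1 = interaction)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn
```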