FDMine: a graph mining approach to predict and evaluate food-drug interactions

Food-drug interactions (FDIs) arise when nutritional dietary consumption regulates biochemical mechanisms involved in drug metabolism. These interactions can create unexpected adverse pharmacological effects. By contrast, particular foods can aid in the recovery process of a patient. Towards characterizing the nature of food’s influence on pharmacological treatment, it is essential to detect all possible FDIs. In this study, we propose FDMine, a novel systematic framework that models the FDI problem as a homogenous graph. In this graph, all nodes representing drug, food and food composition are referenced as chemical structures. This homogenous representation enables us to take advantage of reported drug-drug interactions for accuracy evaluation, especially when accessible ground truth for FDIs is lacking. Our dataset consists of 788 unique approved small molecule drugs with metabolism-related drug-drug interactions (DDIs) and 320 unique food items, composed of 563 unique compounds with 179 health effects. The potential number of interactions is 87,192 and 92,143 when two different versions of the graph, referred to as disjoint and joint graphs, are considered, respectively. We defined several similarity subnetworks comprising food-drug similarity (FDS), drug-drug similarity (DDS), and food-food similarity (FFS) networks, based on similarity profiles. A unique part of the graph is the encoding of the food composition as a set of nodes and calculating a content contribution score to re-weight the similarity links. To predict new FDI links, we applied the path category-based [...] receiver operating characteristic curve, and precision-recall curve. We have performed three types of evaluations to benchmark results using different types of interactions. The shortest path-based method has achieved a precision of 84%, 60% and 40% for the top 1%, 2% and 5% of FDIs identified, respectively. We validated the top FDIs predicted using FDMine to demonstrate its applicability and we relate therapeutic anti-inflammatory effects of food items informed by FDIs. We hypothesize that the proposed framework can be used to gain new insights on FDIs. FDMine is publicly available to support clinicians and researchers.

conditions. Although the impact of a drug depends on its affinity to bind to a specific cell/enzyme receptor, its effectiveness also depends on other factors, such as whether it is taken alongside other drugs or food. Ideally, drug effects should be consistent for all patients and never be impacted by food ingredients or other medical products [1]. However, several studies [2,3] have demonstrated that certain foods decrease or increase the activity of different drugs (food-drug interactions, FDIs).

FDIs often cause changes in drug plasma concentrations, which may significantly increase or decrease the effectiveness of the drug [4]. These changes can occur in three ways: they can increase the action of drugs (i.e., increased metabolism of drugs), decrease the activity of drugs (i.e., decreased bioavailability of drugs), or create an adverse effect.

FDIs can be classified by two basic mechanisms: pharmacokinetic (PK) interactions and pharmacodynamic (PD) interactions [5]. PK interactions denote the circumstance in which foods alter processes related to the absorption, distribution, metabolism, and excretion of medications. For example, for a short time after consumption, grapefruit juice slows the metabolism of cyclosporine (e.g., via cytochrome P450 enzymes) [6,7]. PD interactions are caused by specific interactions between a drug and a food component that result in a particular pharmacological effect [8]. An example of a PD interaction is a diet high in vitamin K, which antagonizes the therapeutic effects of warfarin (used for blood clot treatments) [5].

Considering the potential for increasing or decreasing the absorption of a drug, FDIs can play a vital role in drug discovery as well [9]. For example, Moringa oleifera leaf extract has been used to inhibit cancer cells and to increase the efficacy of chemotherapy in humans [10,11,12]. The roots of Erythroxylum pervillei provide pervilleines A, B, C, and F, effective inhibitors of P-glycoprotein, which is linked to multidrug resistance and low cancer therapeutic response [13].

[20]. Ryu et al. recently developed DeepDDI, a multi-label classification model that calculates structural similarity profiles (SSP) of DDIs, uses principal components analysis to reduce the features, and feeds them into a feed-forward deep neural network (DNN) [21]. A predictive machine learning model [22]

DDIs, and FDIs. FDMine uses the simplified molecular-input line-entry system (SMILES) description to establish similarity profiles, and link prediction algorithms to predict the FDIs. The proposed framework uses two graph representations (disjoint and joint), each consisting of three connected subnetworks: drug-drug similarity, food-drug similarity, and food-food similarity. The rationale behind this approach is to capitalize on the similarity information of the different subnetworks and combine it in a single homogenous graph. We consider a unique representation of food items, their compound composition, and the contribution of each compound. After building the graph network, the framework implements a comprehensive set of link prediction algorithms to predict potential FDIs. The shortest path-based method has achieved a precision of 84%, 60% and 40% for the top 1%, 2% and 5%, respectively. In the joint version of the graph, FDMine recovered 27

In this study, we considered drugs that are assigned to the approved drug group and are small molecules.

foods, compounds, nutrients, and health effects. In this study, we considered the FooDB content dataset that directly maps foods to their chemical compound composition. Initially, we created a subset of the content dataset that stored the required attributes (e.g., food id, original food name, source id, and source type), yielding a total of 19,867 objects. Then, we filtered the extracted data to remove predicted and unknown entries by using the conditions "citation type == DATABASE" and "source type == COMPOUND". This provides a more accurate source of information. Finally, we only considered the food items mapped to a compound, resulting in 16,230 objects for further analysis.
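The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' parsing code: the field names (`citation_type`, `source_type`, `source_id`) and the sample rows are assumptions that mimic a FooDB content export, and the two conditions are read here as the criteria a row must satisfy to be kept.

```python
# Hypothetical rows mimicking a FooDB "content" export; field names are assumed.
rows = [
    {"food_id": 1, "food_name": "Strawberry", "source_id": 10,
     "source_type": "COMPOUND", "citation_type": "DATABASE"},
    {"food_id": 1, "food_name": "Strawberry", "source_id": 11,
     "source_type": "NUTRIENT", "citation_type": "DATABASE"},
    {"food_id": 2, "food_name": "Carrot", "source_id": 12,
     "source_type": "COMPOUND", "citation_type": "PREDICTED"},
    {"food_id": 3, "food_name": "Buckwheat", "source_id": None,
     "source_type": "COMPOUND", "citation_type": "DATABASE"},
]


def filter_content(content_rows):
    """Keep database-cited compound entries that are mapped to a compound."""
    kept = []
    for row in content_rows:
        if row["citation_type"] != "DATABASE":
            continue  # drop predicted or unknown entries
        if row["source_type"] != "COMPOUND":
            continue  # keep only compound-level records
        if row["source_id"] is None:
            continue  # food item is not mapped to any compound
        kept.append(row)
    return kept


filtered = filter_content(rows)
print(len(filtered), "of", len(rows), "rows kept")
```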
After the parsing step, we mapped the resulting dataset to the "Compound" information to collect the required details for each compound, including the SMILES description and the content contribution. In FooDB, the content range of each compound within a food item is presented (e.g., Strawberry has a Potassium content range of 0.000-187.000 mg/100 g). Finally, we have the SMILES description of the corresponding compounds and the International Chemical Identifier key (InChIKey) as a unique identifier.

To relate the food compounds to health effects, we retrieved data from the health effects dataset, which enabled us to know which food compound has a health effect on the human body. The resulting dataset contains 8,846 objects including 320 unique foods and 563 unique food compounds having 179 unique health effects. One extracted example is that benzoic acid from American cranberry has an allergenic health effect.

Since the same compounds can be found in different foods, it is necessary to store these data with a naming convention that allows us to differentiate each food with its composition correctly. In this study, we used

Each food item is composed of a set of chemical compounds. Clearly, the "amount of the original content" of any compound is not the same for each food. For example, the amount of phytic acid is 5270.000 mg/100 g in carrot and 1800.000 mg/100 g in buckwheat, so carrot contains roughly 2.9 times as much phytic acid as buckwheat by mass. Therefore, the contribution of phytic acid is different for carrot and buckwheat. Consequently, we calculated the contribution of each compound for each food based on the amount contained in the food (Eq. 1). The normalized contribution ranges from 0 to 1, where 0 and 1 refer to a food compound with no contribution or full contribution, respectively.
In the graph, the food item and its compound composition are represented as separate nodes. The normalized contribution score scales the edge weights of the links connecting compounds to the food item.
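As an illustration of a contribution score bounded in [0, 1], the sketch below normalizes each compound amount by the total amount recorded for the food. This particular normalization is an assumption made for the example and may differ from the paper's Eq. 1; apart from the phytic acid amount quoted above, the compound names and amounts are hypothetical.

```python
def normalized_contribution(amounts):
    """Share of each compound in a food's total content (assumed normalization).

    amounts: dict mapping compound name -> amount in mg/100 g.
    Returns a dict mapping compound name -> value in [0, 1].
    """
    total = sum(amounts.values())
    if total == 0:
        return {compound: 0.0 for compound in amounts}
    return {compound: amount / total for compound, amount in amounts.items()}


# Phytic acid amount is taken from the text; the other entries are made up.
carrot = {"phytic acid": 5270.0, "compound B": 2000.0, "compound C": 730.0}
carrot_contribution = normalized_contribution(carrot)
print(carrot_contribution)
```

Under this assumption the scores of one food sum to 1, so a food dominated by a single compound assigns that compound a contribution close to 1.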

More details and an example of the contribution score of food compounds are given in Additional file 1: Table S1.

Homogenous Graph Representation
We consider a set of food compounds, $F = \{f_1, f_2, \ldots, f_m\}$, and a set of drugs, $D = \{d_1, d_2, \ldots, d_n\}$, where $m$ represents the number of food compounds and $n$ represents the number of drugs. We merged all drugs and food compounds into a single graph. So, in our representation, we have a set of drugs and food compounds $X = \{d_1, d_2, \ldots, d_n, f_1, f_2, \ldots, f_m\}$. Then, we considered the set of $N \times N$ dimensional structure similarity matrices (taking $N = n + m$) between drugs, between food compounds, and between foods and drugs. The degree of similarity is a score in [0, 1]. A similarity score close to 0 means that two items are dissimilar, whereas the most similar items have a similarity score close to 1. Using this similarity concept, we derived a homogenous graph. On this homogenous graph, we apply different path category-based and neighborhood-based similarity algorithms to predict novel FDIs.

prints [37] (also known as extended-connectivity fingerprint ECFP4 [38]) that is widely used in different studies. ECFP4 was the best-performing fingerprint in target prediction benchmarks [39,40] and in small molecule virtual screening [41]. The procedure for calculating the SSP can be found in Additional file 1: Fig S3.
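To make the similarity score concrete, the following sketch computes the Tanimoto coefficient between fingerprints represented as sets of "on" bit positions and builds a pairwise similarity profile. The fingerprints here are hypothetical toy sets; in practice they would come from a cheminformatics toolkit applied to the SMILES strings, which is outside this example.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def similarity_profile(fingerprints):
    """Pairwise similarity for every unordered pair of nodes."""
    return {
        (u, v): tanimoto(fingerprints[u], fingerprints[v])
        for u, v in combinations(sorted(fingerprints), 2)
    }


# Toy fingerprints for two drugs (d1, d2) and one food compound (f1).
fingerprints = {
    "d1": {1, 2, 3},
    "d2": {2, 3, 4},
    "f1": {7, 8},
}
profile = similarity_profile(fingerprints)
print(profile)
```

Because drugs and food compounds are both described by the same kind of fingerprint, one function covers the drug-drug, food-drug, and food-food cases.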

Sparse Matrix Representation
We used the similarity profile to derive the sparse matrix representation, later used for plotting the graphs. In this matrix, we eliminated all the zero entries and applied a threshold, since some similarity scores contain trivial values and thus may not indicate significant changes. To determine the threshold, we considered the distribution of the similarity scores. The majority of similarity values lie between 0.3 and 0.6; hence, selecting a high similarity threshold may drastically change the dataset size. Also, of note, a high threshold will always lead to potential pairs having an increased probability of interaction. Several studies have referred to different values in the range of 0.5-0.85 when applying a similarity threshold to the Tanimoto coefficient [42,43,44]. While a higher threshold can lead to more potentially valuable hypotheses, it can limit the number of genuinely novel predictions. Table 1
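The sparsification step can be sketched as a simple filter over the pairwise scores. The threshold value of 0.5 and the scores below are placeholders chosen for illustration, not the values used in the study.

```python
def to_sparse_edges(similarity, threshold):
    """Drop zero entries and entries below the threshold.

    similarity: dict mapping (node_u, node_v) -> similarity score in [0, 1].
    Returns a dict containing only the retained weighted edges.
    """
    return {
        pair: score
        for pair, score in similarity.items()
        if score > 0 and score >= threshold
    }


# Hypothetical similarity scores between drugs (D*) and food compounds (F*).
similarity = {
    ("D1", "D2"): 0.82,
    ("D1", "F1"): 0.55,
    ("D2", "F1"): 0.31,
    ("F1", "F2"): 0.0,
}
edges = to_sparse_edges(similarity, threshold=0.5)
print(edges)
```

Raising the threshold shrinks the edge set, which is the trade-off between confident and novel hypotheses discussed above.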

Updating Similarity Scores using Food-Compound Contribution
We obtained a total of 4,177,383 similarities using the SSP. Then, we multiplied the similarity score by the normalized contribution of the food compound (Eq. 1). As illustrated in Table 2, for a food-drug pair (see row 1), we multiplied the similarity score by the contribution of the food compound. Similarly, for a pair of food compounds, we multiplied the similarity score by the higher of the two contributions. For example, the con-
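The re-weighting rule can be sketched as below, assuming that drug-drug similarities are left unchanged (the text does not state this explicitly) and using hypothetical node names, similarities, and contribution scores.

```python
def reweight(edges, contribution):
    """Scale similarity edges by food-compound contribution.

    edges: dict mapping (u, v) -> similarity score.
    contribution: dict mapping food-compound node -> normalized contribution.
    Food-drug pairs use the food compound's contribution, food-food pairs use
    the higher of the two, and drug-drug pairs are kept as-is (assumption).
    """
    updated = {}
    for (u, v), score in edges.items():
        factors = [contribution[node] for node in (u, v) if node in contribution]
        updated[(u, v)] = score * max(factors) if factors else score
    return updated


edges = {("D1", "F1"): 0.8, ("F1", "F2"): 0.6, ("D1", "D2"): 0.9}
contribution = {"F1": 0.5, "F2": 0.25}  # hypothetical contribution scores
weighted = reweight(edges, contribution)
print(weighted)
```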

Link Prediction Algorithms
After applying the similarity thresholds, the generated graph had several disjoint subgraphs. We call this the disjoint version. Some link prediction algorithms cannot handle the disjoint version. Therefore, we prepared a joint graph: we chose a random node from each subgraph and added edges between the chosen nodes to link all subgraphs into the joint graph network. A very small edge weight of 1e-5 was assigned to the newly added links, limiting their effect on generating biased hypotheses. We generated results for both versions. A detailed description is available in Additional file 1: Fig S4.
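The construction of the joint graph can be sketched as follows. The text does not specify which pairs of subgraphs receive the bridging edges, so this example assumes a simple chain in which consecutive components are linked; the graph itself is a toy example.

```python
import random


def connected_components(adj):
    """Return the connected components of an undirected weighted graph."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def join_components(adj, bridge_weight=1e-5, seed=0):
    """Link components through one randomly chosen node per component."""
    rng = random.Random(seed)
    joined = {node: dict(neighbors) for node, neighbors in adj.items()}
    picks = [rng.choice(component) for component in connected_components(adj)]
    bridges = []
    for u, v in zip(picks, picks[1:]):  # chain consecutive components
        joined[u][v] = bridge_weight
        joined[v][u] = bridge_weight
        bridges.append((u, v))
    return joined, bridges


disjoint = {
    "D1": {"F1": 0.7}, "F1": {"D1": 0.7},
    "D2": {"F2": 0.6}, "F2": {"D2": 0.6},
    "F3": {},
}
joint, bridges = join_components(disjoint)
print(len(connected_components(disjoint)), "->", len(connected_components(joint)))
```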

Our goal is to predict novel (unknown) FDIs from the generated homogenous graph. A homogenous graph is one in which all nodes are of the same type. Unlike heterogeneous drug-target interaction (DTI) graphs (e.g., drug-protein), the nodes in our graph are all chemicals. One class of algorithms is based on running the shortest path to find candidate interactions for the considered food and drug pair. Here, we have used 2-length and 3-length paths. For example, the 2-length path "Drug1-Food1-Food2" (see Figure 1) connects the Drug1 node with the Food2 node through the similarity between "Drug1 and Food1" and between "Food1 and Food2". This is defined as a D-F-F path. As illustrated in Figure 1

Dijkstra's algorithm was used for finding the shortest path, where the similarity score is used as the path weight.
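A minimal sketch of the shortest-path step is given below, using Dijkstra's algorithm with the similarity score as the edge weight, as stated above. How path scores are turned into a final ranking is not described in this excerpt, so the example only returns the path, its total weight, and its category; node names and weights are hypothetical, and the category is derived from the first letter of each node name.

```python
import heapq


def dijkstra(adj, source, target):
    """Shortest path by summed edge weight; returns (distance, path)."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbor, weight in adj.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


def path_category(path):
    """Category such as 'D-F-F', assuming names start with D(rug) or F(ood)."""
    return "-".join(node[0] for node in path)


graph = {
    "Drug1": {"Food1": 0.6, "Drug2": 0.9},
    "Food1": {"Drug1": 0.6, "Food2": 0.7},
    "Drug2": {"Drug1": 0.9, "Food2": 0.8},
    "Food2": {"Food1": 0.7, "Drug2": 0.8},
}
distance, path = dijkstra(graph, "Drug1", "Food2")
print(path, path_category(path), len(path) - 1)
```

A candidate pair would be kept for the 2-length category when `len(path) - 1 == 2`, and for the 3-length category when it equals 3.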

In link prediction, given a graph $G$, the main aim is to predict new edges (drug-food) from the existing graph. Predictions are useful for suggesting unknown relations (or interactions) based on the edges in the observed graph. In link prediction, we build a similarity measure between pairs of nodes and link the most similar nodes. Link prediction algorithms are very common in many application domains such as,
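For the neighborhood-based family, three widely used similarity measures can be sketched as follows. It is assumed here that the abbreviations CN, AA, and RA used later in the text refer to the standard common neighbors, Adamic-Adar, and resource allocation indices; the graph is a toy example with unweighted neighbor sets.

```python
import math


def common_neighbors(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over the common neighbors of u and v."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


def resource_allocation(adj, u, v):
    """Sum of 1 / degree over the common neighbors of u and v."""
    return sum(1.0 / len(adj[z]) for z in adj[u] & adj[v])


# Toy graph: D1 and F2 are not linked but share the neighbors F1 and D2.
adj = {
    "D1": {"F1", "D2"},
    "F2": {"F1", "D2"},
    "F1": {"D1", "F2", "F3"},
    "D2": {"D1", "F2"},
    "F3": {"F1"},
}
cn = common_neighbors(adj, "D1", "F2")
aa = adamic_adar(adj, "D1", "F2")
ra = resource_allocation(adj, "D1", "F2")
print(cn, round(aa, 3), round(ra, 3))
```

Every common neighbor has degree of at least 2, so the logarithm in the Adamic-Adar index is always positive.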

Performance evaluation
To measure the performance of the applied link prediction approaches, we adopted the idea of precision@k [60,61] or top predictive rate [53,62]. This metric is also known as -precision

[69,70]. Another statistical measure is the area under the precision-recall curve (PRC), which provides a more accurate assessment, especially when dealing with imbalanced datasets [71]. In this study, we used precision@top, AUC, and PRC as performance metrics.
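The precision@top metric can be sketched as the fraction of known links among the highest-scored predictions. Rounding the cutoff up to at least one prediction is a choice made for this example, and the scored links are hypothetical.

```python
import math


def precision_at_top(scored_links, known_links, top_percent):
    """Precision among the top `top_percent` percent of ranked predictions.

    scored_links: list of (link, score) pairs produced by a predictor.
    known_links: set of links treated as ground truth.
    """
    ranked = sorted(scored_links, key=lambda item: item[1], reverse=True)
    k = max(1, math.ceil(len(ranked) * top_percent / 100))
    hits = sum(1 for link, _ in ranked[:k] if link in known_links)
    return hits / k


# Ten hypothetical predictions with scores 1.0, 0.9, ..., 0.1.
scored = [(("Drug%d" % i, "Food%d" % i), (10 - i) / 10) for i in range(10)]
known = {("Drug0", "Food0"), ("Drug2", "Food2"), ("Drug9", "Food9")}

p_top20 = precision_at_top(scored, known, 20)
p_top50 = precision_at_top(scored, known, 50)
print(p_top20, p_top50)
```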

To compute some of the measures, we had to derive true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). To do this, we ranked the predicted links in descending order of the score given by the link prediction methods. Then, we considered several thresholds as cutoff values. The starting threshold is the minimum score given by the link prediction methods. We then increased the threshold by a step size of 0.1, which was selected to ensure sufficient granularity in computing the area under the curve. We repeated this step until the threshold reached the maximum score given by the link prediction algorithm. For each threshold, if a predicted link matched a known link in the test dataset and its score was greater than the threshold, we counted it as a true positive (TP). If a predicted link did not match the test dataset but its score was greater than the threshold, we counted it as a false positive (FP). Similarly, if a predicted link matched a known link in the test dataset but its score was below the threshold, we counted it as a false negative (FN). Lastly, any unknown link with a score below the threshold was counted as a true negative (TN). Using the TP, FP, TN, and FN, we calculated "precision@top-1%", "precision@top-2%", "precision@top-5%", AUC, and PRC.
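The threshold sweep can be sketched as follows. Scores exactly equal to the threshold are treated as "below" here, which is one reading of the description above, and the predicted scores and test links are hypothetical.

```python
def confusion_sweep(scores, test_links, step=0.1):
    """Count TP, FP, FN, TN at thresholds from the min to the max score.

    scores: dict mapping predicted link -> score.
    test_links: set of held-out known links.
    Returns a list of (threshold, tp, fp, fn, tn) tuples.
    """
    low, high = min(scores.values()), max(scores.values())
    n_steps = int(round((high - low) / step))
    results = []
    for i in range(n_steps + 1):
        threshold = low + i * step
        tp = fp = fn = tn = 0
        for link, score in scores.items():
            known = link in test_links
            if score > threshold:
                if known:
                    tp += 1
                else:
                    fp += 1
            elif known:
                fn += 1
            else:
                tn += 1
        results.append((threshold, tp, fp, fn, tn))
    return results


scores = {"A": 0.9, "B": 0.5, "C": 0.2, "D": 0.1}  # hypothetical predictions
test_links = {"A", "C"}
sweep = confusion_sweep(scores, test_links)
for threshold, tp, fp, fn, tn in sweep:
    print(round(threshold, 1), tp, fp, fn, tn)
```

The resulting counts give one point per threshold for the ROC curve (from TP, FP, TN, FN rates) and for the precision-recall curve.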

Data splitting for testing
To evaluate the performance of the link prediction algorithms, the test data is generated by excluding a collection of links from the full homogenous network. Our homogenous network contains drug-drug similarity, food-drug similarity, and food-food similarity links. We randomly selected 30% of the links as the test dataset, while the remaining 70% of the links are used as the training dataset. For stability, we repeat this evaluation ten times and report the average performance.
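The repeated random split can be sketched as below. The seeds and the toy link list are assumptions for illustration; in the study the links would be the weighted edges of the homogenous network.

```python
import random


def split_links(links, test_fraction=0.3, seed=0):
    """Randomly hold out a fraction of links; returns (train, test)."""
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical weighted links (node_u, node_v, similarity).
links = [("N%d" % i, "N%d" % (i + 1), 0.5) for i in range(10)]

# Ten repetitions with different seeds; metrics would be averaged over runs.
runs = [split_links(links, seed=s) for s in range(10)]
for train, test in runs[:2]:
    print(len(train), len(test))
```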

Our Proposed FDMine Framework
The FDMine framework (see Figure 2) is composed of several phases. In Phase 1, raw data is parsed from the DrugBank and FooDB databases. In Phase 2, we execute two steps: a) building a homogenous network based on the structure similarity profile and b) updating the weights of the homogenous network using food compound contributions. Next, the graph is prepared with nodes representing drugs, food and food compounds' composition. In the graph, links are weighted by similarity and contribution scores (see Phase 3 in Figure 2). When applying the similarity thresholds, the homogenous network produces multiple subgraphs (disjoint graph). We build another version called the joint homogenous graph network and con-

Table 4 provides a summary of the different models over the disjoint graph network. For the disjoint graph, SP_2 outperformed the other methods. The precision rate for the top 1% (i.e., precision@top-1) is 84% for SP_2, while RA, the second best, achieved 64%. For precision@top-2, SP_2 achieved the best result with 60%, and L3, the second best, achieved 42%. The highest value for precision@top-5 was achieved by SP_2 (40%). In the disjoint version of the graph, the neighborhood-based similarity methods achieved 17% on average, with varying standard deviations. However, SP_3 always showed low performance (5%, 3%, and 2% for precision@top-1, precision@top-2, and precision@top-5, respectively) compared to all other methods. SP_2 achieved 52% AUC and 26% PRC. All neighborhood-based similarity methods achieved more than 80% AUC except L3, which reached 60%.

The value of the PRC is also high for the neighborhood-based similarity methods. The PRC scores for RA, AA, and CN were 87%, 86%, and 84%, respectively. However, SP_3 always (for both the disjoint and joint graphs) showed the weakest results in terms of all performance metrics (precision@top, AUC, and PRC). Table 5 summarizes the different models over the joint graph network. A comparison of precision@top-1%, precision@top-2%, and precision@top-5% is provided in Figure 3.

Here we randomly assigned 30% of all (DD, FF, FD) links from the whole dataset to the test dataset, and the remaining 70% was used to train the model. We applied 'shortest path length 2' over the disjoint

Table 8, the interactions we obtained appear to affect key biological pathways including prostaglandin biosynthesis for inflammatory response [72], beta-adrenergic signaling for cardiac output modulation

Prostaglandins are compounds that play a role in the anti-inflammatory pathway during injury [76]. An essential molecular building block in humans is arachidonic acid. It interacts with the peroxisome proliferator-activated receptor (PPAR) to form various prostaglandins [76] or anti-inflammatory compounds. Various dietary fatty acids (see Table 8; Oleic acid, Linoleic acid, Erucic acid, Elaidic acid) are also absorbed via the exogenous chylomicron pathway and hydrolysed so that various tissues can absorb them for further processing [77]. Some of our predicted compounds, including Oleic acid (FDB012858) and Erucic acid (FDB004287), are structural analogs [78] of arachidonic acid, belong to the fatty acid group, and are found in many dietary sources, including Celery (FOOD00015), Peanuts (FOOD00016), and Burdock (FOOD00017) (see Table 8). Our literature review has highlighted reported evidence on the influence of these dietary fatty acids on the arachidonic acid cycle. Arachidonic acid is a precursor for the synthesis of various other biomolecules associated with anti-inflammatory pathways [79]. During injury, inflammation occurs and causes arachidonic acid to bind to PPAR-gamma receptors, as shown in Figure 4, to form prostaglandins or protective anti-inflammatory agents that curb the injury [80]. Fatty acids (see Table 8) also compete with arachidonic acid during injury or inflammation to produce various substituted prostaglandins belonging to a family of derivative compounds known as eicosanoids [81], via PPAR [82].

Since the substituted prostaglandins are not derived directly from arachidonic acid, they show slightly weaker anti-inflammatory profiles than other eicosanoids produced directly from arachidonic acid [83].

It is worth noting that arachidonic acid belongs to the list of essential fatty acids, which includes alpha-linolenic acid and docosahexaenoic acid [83]. There has been evidence to show that dietary sources such as Linoleic acid, Erucic acid and Elaidic acid (see Table 8

[89]. Propranolol (DB00571) and Penbutolol (DB01359), on the other hand, are non-selective beta-adrenergic blockers. Studies have also observed that beta-blockers may contribute to GABA turnover in the cerebrum [90].

We were able to confirm that fatty acids (Oleic acid (FDB012858), Erucic acid (FDB004287), (Z,Z)-9,12-Octadecadienoic (FDB012760) and Elaidic acid (FDB002951)) can cross the blood-brain barrier and be beneficial in relieving anxiety [91]. They are also believed to act via stimulation of GABA-A based receptors.

In summary, the discussed pairs of food ingredients and drugs can influence their own pharmacokinetics. For example, taking beta-adrenergic drugs with food containing terpenes like Eugenol and Methyl chavicol can potentially cause more pronounced antihypertensive effects. Taking antiepileptic medications along with foods containing fatty acids can potentially elevate overall GABA levels significantly more than when they are taken individually. Moreover, dietary fatty acids can also interact with the PPAR receptor during inflammation to produce variations of prostaglandins. This demonstrates the feasibility of using our FDMine framework to identify potential food and drug interactions.

In this study, we introduced FDMine as a framework to infer the interaction between food compounds and drugs using a homogenous graph representation. We considered several resources to construct food-drug,

The authors declare that they have no competing interests.