Drug Components-Disease Network Related to Acute Lung Injury Inference Based on Forest Graph-embedded Deep Feedforward Network


Background: Acute lung injury (ALI) is a serious respiratory disease that can lead to acute respiratory failure or death. It is closely related to the pathogenesis of novel coronavirus pneumonia (COVID-19). Many studies have shown that traditional Chinese medicine (TCM) is effective in its intervention, and that network pharmacology can play a very important role. Results: To construct the "disease-gene-target-drug" interaction network more accurately, a deep learning algorithm is utilized in this paper. Two ALI-related target genes (REAL and STAT3) are considered, and the active and inactive compounds of each target gene are collected as training data. Molecular descriptors and molecular fingerprints are utilized to characterize each compound. A forest graph-embedded deep feedforward network (forgeNet) is trained on these data and used to identify 19 compounds in Erhuang decoction (EhD) and Dexamethasone (DXMS). Conclusions: The experimental results show that forgeNet performs better than support vector machines (SVM), random forest (RF) and gcForest.


Background
Internal and external etiological factors can disturb the body's self-stabilizing regulation, which may change a series of metabolic processes, functions and structures. The resulting abnormal life activity processes are manifested as abnormal symptoms, signs and behaviors [1][2]. Under certain conditions, the abnormal life activity processes caused by the disturbance of homeostasis after such damage give rise to disease [3][4]. Traditional Chinese medicine (TCM) has been utilized to treat diseases for thousands of years [5][6][7].
Traditional Chinese medicine is a kind of material with the function of rehabilitation and health care, which could be utilized to prevent, treat and diagnose diseases under the guidance of TCM theory [8][9][10][11].
Traditional Chinese medicine mainly comes from natural medicine and its processed products, including plant medicine, animal medicine, mineral medicine and some chemical and biological products [12][13].
The most important feature of traditional Chinese medicine in treating diseases is its attention to adjusting the functions of the viscera and organs and to the balance and coordination between them. The focus of treatment is not the specific bacteria, viruses or other pathogenic factors that infect the human body, but the specific reaction of the human body after these pathogenic factors act on it [14][15]. The purpose of treatment is to enhance the disease resistance and recovery ability of the human body; killing bacteria and relieving symptoms are achieved mainly by enhancing the body's own functions. In recent years, traditional Chinese medicine has shown certain advantages in the treatment of pneumonia [16], shock [17], convulsion [18], hemorrhage [19], acute respiratory failure [20], renal failure [21], heart failure [22], cerebrovascular accident [23], etc. It is not only effective, but also safe and simple, with few adverse reactions.
In the past decade, with the rapid development of sequencing technology, a large amount of omics data, such as genomics, proteomics and metabolomics data, has been generated, which has changed the research of traditional Chinese medicine for diseases. Network pharmacology, which was developed on the basis of the rapid development of systems biology and computer technology, has been proposed to generate the "disease-gene-target-drug" interaction network. Through network analysis, we can systematically and comprehensively observe the intervention and influence of drugs on the disease network, reveal the synergistic effect of multi-component drugs on the human body, and find multi-target new drugs with high efficiency and low toxicity. Network pharmacology of traditional Chinese medicine has become a new approach to drug mechanism research and new drug development [24][25][26][27][28]. Lu et al. utilized network pharmacology and molecular docking technology to study the mechanism of Shaoyao Decoction in the treatment of ulcerative colitis, and found that Shaoyao Decoction can improve the pathological damage of the colon [29]. Liu et al. collected the main active components of Portulacae Herba, constructed an interaction network of liver cancer target proteins, and found that ketones may be the main material basis of its anti-liver-cancer effect, which is related to the regulation of the MAPK signaling pathway [30]. Liu et al. utilized network pharmacology to screen 102 active components of Danzhi Xiaoyao Powder, 147 corresponding targets and 52 targets intersecting with insomnia, and obtained the key components, key targets and key pathways of Danzhi Xiaoyao Powder in the treatment of insomnia [31]. Yang et al. used network pharmacology to systematically analyze, at the molecular level, the potential anti-tumor mechanisms of the main active components of Prunella vulgaris [32]. Shen et al. discussed the possible mechanism of Wuling Powder in the treatment of diabetic nephropathy by network pharmacology, and found that Wuling Powder may reduce renal cell damage by regulating apoptosis-related proteins, such as Caspase family proteins and BCL2 family proteins [33].
In recent years, data mining methods have been applied to extract useful information from large amounts of TCM data [33]. Ren et al. utilized data mining methods to screen out 47 prescriptions, and identified 14 core drugs and 7 new prescriptions in order to find the medication rules and mechanism of TCM in the treatment of carotid atherosclerosis (CAS) [34]. Ga et al. utilized a data mining method to select the top five active components of each high-frequency Tibetan medicine, and network pharmacology was utilized to analyze the mechanism of Tibetan medicine in the treatment of high altitude polycythemia [35]. In order to study the medication rule of TCM intervention in iron death (ferroptosis), Ou et al. constructed target-compound, compound-TCM and target-compound-TCM networks, and frequency statistics showed that bitter and pungent herbs were the main herbs that could interfere with iron death, that cold herbs predominated, and that these herbs mainly belonged to the liver and lung meridians.
In order to better mine omics data and construct the "disease-gene-target-drug" interaction network, a deep learning model is utilized in this paper. Taking acute lung injury (ALI) as an example, we selected two disease-related target genes (REAL and STAT3). The active and inactive compounds of the two target genes are collected. Molecular descriptors and molecular fingerprints, which together give 374 features, are utilized to characterize each compound. A forest graph-embedded deep feedforward network is trained and used to identify new compounds related to the target genes.

Results
In this section, active and inactive ligands of two key ALI-related target genes, REAL and STAT3, are collected. For REAL, 966 ligands are collected, comprising 146 positive samples and 820 negative samples (Data1). For STAT3, 193 active ligands and 1210 inactive ligands are collected (Data2). Molecular descriptors and molecular fingerprints are obtained for each ligand, giving 374 features. To assess the effectiveness of forgeNet, three classical classifiers (SVM [42], RF [43] and gcForest [44]) are also utilized to identify the compounds associated with the disease. Five evaluation criteria of classifier performance are utilized: sensitivity (SN), specificity (SP), accuracy (ACC), Matthews correlation coefficient (MCC) and F1 score.
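The five criteria can be computed from the confusion-matrix counts. The following is a minimal sketch using the standard textbook definitions, with active ligands coded as 1 and inactive ligands as 0; the function name `binary_metrics` and the toy labels are illustrative assumptions and do not come from the paper.

```python
import math

def binary_metrics(y_true, y_pred):
    """Return SN, SP, ACC, MCC and F1 for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sn = tp / (tp + fn) if (tp + fn) else 0.0           # sensitivity (recall on actives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0           # specificity (recall on inactives)
    acc = (tp + tn) / len(pairs) if pairs else 0.0      # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation coefficient
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc, "F1": f1}

# Toy example with made-up labels (3 actives, 5 inactives).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Because MCC and F1 account for both classes, they are more informative than ACC alone when inactive ligands greatly outnumber active ones, as in Data1 and Data2.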

Model test
To test the generalization and stability of forgeNet, leave-one-out, 3-fold, 5-fold and 10-fold cross validation are utilized. With Data1, the inference performances of the four methods under leave-one-out, 3-fold, 5-fold and 10-fold cross validation are listed in Table 1, Table 3, Table 5 and Table 7, respectively. With Data2, the corresponding performances are listed in Table 2, Table 4, Table 6 and Table 8, respectively. The results show that gcForest has the highest SN among the four methods, which reveals that gcForest identifies more true active ligands. RF obtains a higher SP than SVM, gcForest and forgeNet, which shows that RF identifies more true inactive ligands. forgeNet has the best ACC, MCC and F1 among the four methods. These results reveal that forgeNet correctly identifies more ligands overall (true active plus true inactive) than SVM, RF and gcForest, and that forgeNet performs best when the two categories have very different sizes. The F1 values show that, on the whole, forgeNet infers the components-disease network more accurately than the other three classifiers.
With Data1, the ROC curves and AUC performances of the four methods with 3-fold, 5-fold, 10-fold cross validation and leave-one-out are depicted in Fig. 1, Fig. 2, Fig. 3 and Fig. 4, respectively. With Data2, the ROC curves and AUC performances of the four methods with 3-fold, 5-fold, 10-fold cross validation and leave-one-out are depicted in Fig. 5, Fig. 6, Fig. 7 and Fig. 8, respectively. In most cases, forgeNet obtains the best AUC among the four classifiers. With both Data1 and Data2, RF performs best in terms of ROC and AUC under the leave-one-out method.

Compound screening
The prediction ranks are listed in Table 9. The ranking results show that DXMS ranks last by forgeNet on average, which is consistent with the molecular docking results of previous research [39].
Thus the results reveal that forgeNet could screen the chemical compounds more accurately than SVM, RF and gcForest.
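One way to obtain an "on average" ranking is sketched below. It assumes that each trained model assigns every candidate compound a predicted probability of being active, that compounds are ranked per target gene by that score, and that the ranks are then averaged across targets. The paper does not spell out this procedure, so the aggregation rule, the function names, the compound names and all scores here are hypothetical placeholders.

```python
def rank_descending(scores):
    """Map each compound to its rank (1 = highest predicted score)."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

def average_rank(score_tables):
    """Average the per-target ranks of every compound across all score tables."""
    rank_tables = [rank_descending(table) for table in score_tables]
    names = list(score_tables[0])
    return {
        name: sum(ranks[name] for ranks in rank_tables) / len(rank_tables)
        for name in names
    }

# Made-up predicted scores for three placeholder compounds on two targets.
scores_target_1 = {"cpd_a": 0.9, "cpd_b": 0.4, "cpd_c": 0.7}
scores_target_2 = {"cpd_a": 0.6, "cpd_b": 0.2, "cpd_c": 0.8}
mean_ranks = average_rank([scores_target_1, scores_target_2])
```

A compound with the largest mean rank is the one the model considers least likely to be active across the targets.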

Discussions
To test the influence of different feature sets on the identification results, we used molecular descriptors alone as the control feature set. Molecular descriptors together with molecular fingerprints make up the full feature set. With these two feature sets, SVM, RF, gcForest and forgeNet are evaluated by the 3-fold, 5-fold, 10-fold and leave-one-out methods. The AUC and F1 results are depicted in Fig. 9 and Fig. 10, respectively. The results show that the full feature set improves the compound identification accuracy of the methods.

Conclusions
Network pharmacology has become a frontier and hot spot in the field of traditional Chinese medicine research. This research method can effectively predict the effective components, targets and side effects of drugs, and is conducive to the modernization of traditional Chinese medicine. To construct the "disease-gene-target-drug" interaction network more accurately, a forest graph-embedded deep feedforward network is utilized to infer the "disease-compound" network in this paper. For acute lung injury, two ALI-related target genes (REAL and STAT3) are selected, and the active and inactive compounds of each target gene are collected. Molecular descriptors and molecular fingerprints are utilized to characterize each compound. Under leave-one-out, 3-fold, 5-fold and 10-fold cross validation, forgeNet achieves better overall performance than SVM, RF and gcForest, with the best ACC, MCC and F1 and, in most cases, the best AUC. forgeNet is also utilized to identify 19 compounds in Erhuang decoction (EhD) and Dexamethasone (DXMS), and the results reveal that forgeNet can infer disease-related compounds more accurately. We also test the influence of different feature sets on the identification results and find that the feature set based on both molecular descriptors and molecular fingerprints improves the compound identification accuracy of the methods.

forgeNet
Forest graph-embedded deep feedforward network (forgeNet) is a novel machine learning algorithm, which has been successfully applied to classification problems with TCGA RNA-seq data. The flowchart of forgeNet is depicted in Fig. 11. From Fig. 12, it can be seen that this method contains two parts: feature graph construction and a deep neural network. Compared with conventional deep learning models, forgeNet addresses the high-dimensionality problem of biological data and is more robust. The algorithm is described as follows [39].
Step 1: feature graph construction
Before the labeled training data are input into the classifier, the features of the data need to be extracted. In forgeNet, the forest $\xi$ contains $p$ decision trees (DT). With the labeled training data, the forest is fitted and $p$ DTs are generated, $\xi(\theta) = \{T_1(\theta_1), T_2(\theta_2), \ldots, T_p(\theta_p)\}$, where $\theta_i$ is the parameter of the $i$-th tree. If a binary tree is regarded as a special case of a directed graph, the following graph set is obtained.

$$\Phi = \{G_1(V_1, E_1), G_2(V_2, E_2), \ldots, G_p(V_p, E_p)\} \qquad (1)$$

where $V_i$ and $E_i$ represent the vertex set and edge set of $G_i$.
To integrate the directed graph set $\Phi$, the final aggregated graph is obtained by the following formula.

$$G = \bigcup_{i=1}^{p} G_i(V_i, E_i) = \left( \bigcup_{i=1}^{p} V_i,\ \bigcup_{i=1}^{p} E_i \right) \qquad (2)$$

Step 2: deep neural network
The feature graph obtained in the previous step is embedded into this part. With the processed features, the graph-embedded deep feedforward network (GEDFN) is used to train and classify the unknown data [12]. The layers of GEDFN are as follows.

$$Z_1 = \sigma\left(X (W_1 \odot A) + b_1\right), \qquad Z_k = \sigma\left(Z_{k-1} W_k + b_k\right), \quad k \geq 2 \qquad (3)$$

where $X$ is the input data, $Z_k$ is the $k$-th hidden layer, $A$ is the adjacency matrix of the aggregated feature graph $G$, $\sigma$ is the activation function, $\odot$ denotes the Hadamard product, and $W_k$ and $b_k$ are the weights and bias of the $k$-th hidden layer, respectively.
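The two steps can be illustrated with a small pure-Python sketch: the directed feature edges of every tree are unioned into one adjacency matrix, which then masks the weights of the first hidden layer through an element-wise (Hadamard) product. The edge-list representation of a tree, the added self-connections, the ReLU activation and all function names are assumptions made for this illustration and are not taken from the paper.

```python
def aggregate_graph(trees, n_features):
    """Union the (parent_feature, child_feature) edges of all trees into one matrix."""
    adjacency = [[0.0] * n_features for _ in range(n_features)]
    for edges in trees:
        for parent, child in edges:
            adjacency[parent][child] = 1.0
    for i in range(n_features):
        adjacency[i][i] = 1.0  # assumption: every feature keeps a connection to itself
    return adjacency

def relu(value):
    return value if value > 0 else 0.0

def graph_embedded_layer(X, W, A, b):
    """First hidden layer: relu(X (W * A) + b), with * the element-wise product."""
    n = len(A)
    masked = [[W[i][j] * A[i][j] for j in range(n)] for i in range(n)]
    return [
        [relu(sum(row[i] * masked[i][j] for i in range(n)) + b[j]) for j in range(n)]
        for row in X
    ]

# Two toy trees over three features, given as directed feature-to-feature edges.
toy_trees = [[(0, 1)], [(0, 2), (2, 1)]]
A = aggregate_graph(toy_trees, n_features=3)

W = [[1.0] * 3 for _ in range(3)]  # placeholder weights
b = [0.0, 0.0, 0.0]                # placeholder bias
Z1 = graph_embedded_layer([[1.0, 2.0, 3.0]], W, A, b)
```

Entries of the weight matrix that do not correspond to an edge of the aggregated graph are zeroed out, so the first layer is sparse and each hidden unit only receives input from features linked to it in the forest.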

Inference algorithm
(1) Data preparation. Two key target genes, signal transducer and activator of transcription 3 (STAT3) and nuclear transcription factor-κB/p65 (nuclear factor kappa B/p65, REAL), have been shown in the literature to be mainly involved in the key pathways related to acute lung injury (ALI) and to be closely related to ALI [40]. The BindingDB database is then searched for the known active compounds of the two key target genes [38]. The active ligands are screened with the condition IC50 < 5000 nmol·L⁻¹. The collected active compounds are labeled as positive samples. To collect the negative samples, 20% of the active ligands are randomly selected and uploaded to the DUD-E database (http://dude.docking.org/) to generate the inactive ligands [41]. Molecular descriptors and molecular fingerprints are then obtained for each collected active and inactive ligand and used as its feature vector.
(2) Model training. The feature vector of each collected ligand is used as input for forgeNet. After the training phase, the unknown compounds are screened for the target disease.
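The selection rules in the data preparation step can be sketched as follows. The record layout (a dictionary with `id` and `ic50_nm` keys), the function names, the fixed random seed and the toy IC50 values are hypothetical; querying BindingDB and submitting ligands to DUD-E are not shown.

```python
import random

def select_actives(records, ic50_cutoff_nm=5000.0):
    """Keep ligands whose IC50 is strictly below the cutoff (positive samples)."""
    return [record for record in records if record["ic50_nm"] < ic50_cutoff_nm]

def pick_decoy_seeds(actives, fraction=0.2, seed=0):
    """Randomly choose a fraction of the actives as seeds for decoy generation."""
    count = max(1, round(len(actives) * fraction))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(actives, count)

# Made-up IC50 values in nmol/L for twelve placeholder ligands.
toy_ic50 = [12, 850, 4999, 5000, 30, 7200, 100, 2500, 640, 4100, 75, 990]
records = [{"id": f"lig{i}", "ic50_nm": value} for i, value in enumerate(toy_ic50)]

actives = select_actives(records)
decoy_seeds = pick_decoy_seeds(actives)
```

The decoys returned for the seed ligands would then be labeled as negative samples and featurized in the same way as the actives.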