Supervised learning techniques to predict compounds in pathway modules based on molecular properties

doi:10.21203/rs.3.rs-1140648/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Machine learning algorithms are useful in metabolomics for predicting which chemical compounds participate in metabolic pathways and in their modules. The modules of a metabolic pathway are subnetworks of functionally related genes, defined by rules such as protein-protein interactions, co-regulated expression, coordinated physiological activity, and successive reaction steps. Fully functional modules help to improve the understanding of disease processes, drug discovery, and the prediction of missing reactions. Not all modules in a metabolic pathway are functional, because some reaction steps are missing. Mapping the structures of chemical compounds onto pathway modules helps to understand how an unknown reaction step can be predicted. The main purpose of this paper is to predict the chemical compounds in pathway modules and their classes. We constructed binary and multi-label classification data sets to predict pathway modules and module classes, respectively. To identify the pathway module and its classes, we built an ensemble Extra Trees classifier that learns the molecular and atomic properties of chemical compounds. We also experimented with other ensemble machine learning algorithms for the prediction of pathway modules. The overall prediction accuracy of the classifier is 98.59%, indicating that the features used by the Extra Trees classifier are interpretable and that the classifier has high predictive performance on the tasks considered.

Evolutionary Biology

Chemical compound

machine learning

metabolic pathway

pathway module

binary and multi-label classification

A metabolic pathway is a series of biochemical reactions between the starting substrates and the final product, where the product of one enzyme is the substrate of the subsequent enzyme. A metabolic pathway can be decomposed into subnetworks (modules) of functionally related genes based on rules such as phenotypic features, protein-protein interaction, and co-regulated expression. In public pathway databases these modules are categorized into three types: pathway modules, signature modules, and reaction modules. Several public databases are available, such as KEGG [1, 2], MetaCyc [3, 4], BRENDA [5], and Rhea [6], which cover the major components of pathway modules, including chemical compounds, reactions, and enzyme-substrate binding. However, some reactions, enzymes, and chemical compounds have not yet been detected in pathway modules, which is a major barrier to understanding the function of these modules. Mapping small molecules to their corresponding pathway modules can play a major role in predicting the compounds associated with the functional part of a pathway, which may help drug design by identifying better drug molecules more efficiently.

A variety of in vivo methods have been established to analyze the role of compounds in metabolic pathways based on physical, biological, or biochemical experiments. However, these methods suffer from high cost and low efficiency, and they require high-throughput equipment for the analysis of compounds in metabolic pathways. To overcome these limitations, machine learning methods can learn from the structure and chemical activity of compounds to predict metabolic pathways. Many researchers have applied machine learning methods to predict the metabolic pathway classes of chemical compounds. Among them, Cai et al. [7] proposed a single-label nearest neighbor algorithm (NNA) to classify compounds into the pathway types existing in KEGG. In a similar approach, Hu et al. [8] replaced the K-nearest neighbor (KNN) algorithm with AdaBoost to predict the metabolic pathways of compounds. Later, random forest [9] was adopted as a multi-label classifier by Macchiarulo et al. [10], Baranwal et al. [11], and Jia et al. [12] for the prediction of metabolic pathway classes. Baranwal et al. [11] combined random forest with a graph convolutional neural network (GCN) to predict metabolic pathway classes, and Jia et al. [12] targeted the actual metabolic pathway to which a chemical compound belongs. However, all of these methods consider only the association of compounds with the metabolic pathway based on biological function. In addition, a metabolic pathway is a large network consisting of several modules. Not all pathway modules in a metabolic pathway are fully functional, owing to the lack of experimentally identified compounds. These unknown compounds may lead to misleading interpretations and non-functional modules. To activate the non-functional modules, it is necessary to predict which types of compounds are linked to the pathway module.
Besides, analyzing compounds together with the functional unit (module) of a pathway may improve the understanding of diseases, diagnosis, and drug discovery. In this article we develop a machine learning method to predict whether or not a query compound belongs to a pathway module. We define feature vectors representing the chemical transformation patterns of a compound using MACCS (Molecular ACCess System) chemical fingerprints. We apply an ensemble Extra Trees classifier (ETC) [13], which has been widely applied in bioinformatics and other domains to a variety of computational tasks, including disease diagnosis [14], tumor segmentation [15], cyber-attack detection [16], and time-series classification [17]. The purpose of ensemble machine learning is to combine different base estimators in order to improve robustness and generalizability over a single estimator. This paper makes two main contributions. First, we apply the ETC to predict the metabolic pathways and pathway modules to which query compounds belong. In machine learning terms this prediction is a binary classification, in which the model assigns each compound to one of two possible outcomes: (a) the query compound belongs to a pathway module, or (b) the query compound does not belong to any pathway module. Second, to support diagnosis and drug discovery, it is essential to analyze the structure, chemical behavior, and biological function of compounds. Based on structure and common biological function, the compounds are classified into 10 different categories, such as carbohydrate, energy, and lipid metabolism. Some compounds participate in multiple classes with different functions; for example, glyceraldehyde 3-phosphate belongs to carbohydrate metabolism, energy metabolism, biosynthesis of terpenoids and polyketides, and biosynthesis of other secondary metabolites. For this multi-class classification we develop an ETC that predicts the classes of pathway modules.
The probabilistic outputs of the classifier indicate the class or classes of each compound. The performance of our proposed classifier is also competitive with previous work and with other ensemble machine learning classifiers.

The chemical compound data sets of metabolic pathways and modules were retrieved from the KEGG databases (https://www.genome.jp/kegg/module.html and https://www.kegg.jp/kegg/pathway.html), accessed in October 2020. For binary classification we retrieved 6664 different compounds, of which 4612 were retained and labelled according to whether or not they are linked to a pathway module. The remaining compounds were discarded from this experiment because of structural similarity. For binary classification, the model evaluates each compound against 2 possible outcomes, that is 2n evaluations, where n is the number of compounds in the data set; in total this gives 9224 evaluations. Furthermore, 1985 different compounds were retrieved for multi-class classification from the KEGG module database (https://www.genome.jp/kegg/module.html), accessed in October 2020. These compounds are associated with 10 different pathway module classes: carbohydrate metabolism (CM), energy metabolism (EM), lipid metabolism (LM), nucleotide metabolism (NM), amino acid metabolism (AM), glycan metabolism (GM), metabolism of cofactors and vitamins (MCV), biosynthesis of terpenoids and polyketides (BTP), biosynthesis of other secondary metabolites (BOSM), and xenobiotic biodegradation (XB). In total, these compounds carry 2908 class labels, counting both single-label and multi-label compounds. Each compound has 10 possible class outcomes, so over all n compounds the classifier produces 10n = 19850 possible outcomes.

The compounds were downloaded in SMILES (simplified molecular input line-entry system) format. Many compounds were not available in SMILES format at retrieval time. For these compounds, the SDF (structure-data file) format was translated into SMILES using the online chemoinformatics web server of the Computer-Aided Drug Design (CADD) Group. Some compounds are involved in more than one pathway module, which makes the prediction for these compounds a multi-label classification problem.
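As an illustration of the multi-label setting, the following minimal sketch encodes the class membership of a compound as a 10-element 0/1 vector. The ordering of the abbreviations follows the list in the text; the function name `encode_labels` and the use of plain Python lists are assumptions made for this example, not the authors' implementation.

```python
# Class abbreviations in the order listed in the text (index 0..9).
CLASSES = ["CM", "EM", "LM", "NM", "AM", "GM", "MCV", "BTP", "BOSM", "XB"]


def encode_labels(labels):
    """Return a 10-element 0/1 vector with a 1 for every class in `labels`."""
    unknown = set(labels) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown class abbreviation(s): {sorted(unknown)}")
    return [1 if name in labels else 0 for name in CLASSES]


# Glyceraldehyde 3-phosphate is described in the text as belonging to four classes.
g3p = encode_labels({"CM", "EM", "BTP", "BOSM"})
# A single-label compound has exactly one non-zero entry.
glucose = encode_labels({"CM"})

print(g3p)
print(glucose)
```

A vector with more than one non-zero entry marks a multi-label compound, which is why a single compound can contribute several of the class labels counted in the data set.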

We developed an ensemble Extra Trees classifier for predicting pathway modules and module classes using molecular and atomic properties as features. The classifier takes compounds in SMILES format and represents them by 166 MACCS keys and 7 additional molecular descriptors, including molecular weight, rotatable bonds, ring counts, lipophilicity, aromaticity, and polarizability. These descriptors were applied by Baranwal et al. [11] for the prediction of metabolic pathway classes. To obtain optimal performance, the hyper parameters of the ETC were determined and optimized through grid search [18]. The optimized hyper parameters were then applied to the prediction of pathway modules and their classes. For binary classification, our algorithm classifies the data into two classes, C ∈ (pathway module) or C ∉ (pathway module), so the output states whether or not a compound belongs to a pathway module in the metabolic pathway. For multi-class classification, the classifier assigns input chemical compounds to one or several classes of pathway modules. In this case, the output of the ensemble classifier is a probability distribution over ten different classes.

More precisely, the pathway module classes can be expressed by the numbers C ∈ {0, 1, 2, 3, ..., 9}, where the numbers 0 to 9 index the module classes in KEGG, each of which may be assigned to a query compound. In multi-class classification with a set of classes C = (c1, c2, c3, c4, ..., c10), the task of the classifier is to assign classes c from C to a query compound, for all samples in the data set, where each sample x = (d1, d2, d3, ..., dm) consists of m attribute values di. For classification, the model must be trained on training data before it can predict new data. The data D are divided into training data DT and testing data. The model M is trained on DT and can then map an unseen query compound to its corresponding classes c ∈ C. The model M predicts module classes based on the likelihood that the query compound belongs to each class. As the machine learning model, we developed an ETC that aggregates the outputs of multiple de-correlated decision trees collected in a forest to produce its classification, as shown in Figure 1. Each decision tree is constructed from the training sample and provided with k features. Each decision tree selects the best feature for splitting the data based on a mathematical criterion. An iteration of the ETC starts by selecting m features at random as the candidate set of splitting features. For each of these features Fi, with i ∈ (1, ..., m), it draws a single random cut-point uniformly from the interval [min(Fi), max(Fi)] and evaluates the performance of the feature with this cut-point in terms of entropy. Finally, among the features paired with their randomly drawn cut-points, the best-scoring pair is selected as the split. The entropy of the data can be measured by the following formula.

$$E\left(S\right) = -\sum_{i=1}^{h} p_{i} \log_{2} p_{i}, \quad i \in \left(1, \ldots, 10\right) \qquad \left(1\right)$$

Here S is the set of labelled samples, h is the number of unique class labels, and pi is the frequentist probability of class i in S. For the pathway module classes, i ∈ (1, ..., 10), since the input data are labelled across ten different classes. The disorder of the data with respect to our target classes can be reduced by the information gain.
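The entropy in equation (1) can be checked numerically with a few lines of code. The sketch below is a plain-Python illustration (the function name `entropy` is chosen for this example); it estimates each class probability as a relative frequency, as described above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of -p_i * log2(p_i) over the classes that actually occur.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


balanced_two = entropy([0, 0, 1, 1])         # two equally likely classes
pure = entropy([3, 3, 3, 3])                 # a single class, no disorder
uniform_ten = entropy(list(range(10)))       # ten equally likely classes

print(balanced_two, pure, uniform_ten)
```

With ten equally frequent classes the entropy reaches its maximum of log2(10) bits, while a node containing a single class has entropy zero.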

$$\text{IG}\left(Y, X\right) = E\left(Y\right) - E\left(Y \mid X\right) \qquad \left(2\right)$$

Here Y is the target pathway module class and X is the input of the classifier. The reduction in uncertainty is calculated by subtracting the entropy of Y given X from the entropy of Y; the more information about Y is gained from X, the greater the reduction in uncertainty.
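The split selection described above (random candidate features, one uniformly drawn cut-point per feature, best pair kept by information gain) can be sketched as follows. This is a simplified, self-contained illustration under the stated description, not the scikit-learn implementation; the helper names and the toy data are assumptions.

```python
import math
import random
from collections import Counter


def entropy(labels):
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, cut):
    """E(Y) minus the weighted entropy of the two sides of the cut."""
    left = [y for v, y in zip(values, labels) if v < cut]
    right = [y for v, y in zip(values, labels) if v >= cut]
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional


def pick_random_split(rows, labels, n_candidates, rng):
    """Draw one uniform cut-point per randomly chosen feature; keep the best pair."""
    best = None
    for f in rng.sample(range(len(rows[0])), n_candidates):
        column = [row[f] for row in rows]
        lo, hi = min(column), max(column)
        if lo == hi:
            continue  # a constant feature cannot split the node
        cut = rng.uniform(lo, hi)
        gain = information_gain(column, labels, cut)
        if best is None or gain > best[2]:
            best = (f, cut, gain)
    return best


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [[0.0, 5.0], [0.1, 1.0], [0.9, 4.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
perfect_gain = information_gain([r[0] for r in rows], labels, cut=0.5)
feature, cut, gain = pick_random_split(rows, labels, n_candidates=2, rng=random.Random(0))
print(perfect_gain, feature, round(gain, 3))
```

Because the cut-points are drawn at random rather than optimized, an individual split is cheap to compute; the quality of the ensemble comes from aggregating many such trees.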

We performed binary and multi-class classification on two different types of data sets, predicting pathway modules in the metabolic pathway and module classes by constructing ensemble classifiers. We also compared our ensemble model with RF and with a collection of classifiers comprising KNN, DT, and RF. All ensemble classifiers were trained on the same molecular descriptors of the compound data sets, each with its own hyper parameters. We tuned the hyper parameters of all classifiers by performing a grid search over the set of possible hyper parameter settings to achieve high performance.

Extra Tree Classifier

The Extra Trees algorithm works by building a large number of unpruned decision trees from the training data set. For classification, the final prediction is obtained by majority voting. The ETC differs from other tree-based ensemble methods in that it splits nodes by picking cut-points at random, which can reduce variance more than other randomization strategies. Because of this random splitting, the ETC also executes faster. Owing to its computational efficiency, the Extra Trees algorithm has been widely applied to classification and regression [16] [19] [20].

Group of Classifiers

We integrated a group of classifiers, comprising KNN, DT, and RF, for both module and module-class prediction, to test whether the combination could outperform the ETC. The hyper parameters of each classifier were selected from its possible hyper parameter settings and tuned by grid search. However, the integrated classifiers performed worse than both the ETC and a single RF in both experiments.

K-Nearest Neighbor (KNN)

K-Nearest Neighbor is a supervised learning algorithm used for classification and regression problems in bioinformatics [21] [22] [23]. It assumes that data points close to each other fall into the same class. The KNN algorithm classifies a new data point by a majority vote of its neighbors under a similarity measure. We integrated KNN with the number of neighbors set to 1, a leaf size of 6, and the power parameter p set to 1, which is equivalent to using the Manhattan distance (l1) [24].
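With one neighbor and p = 1, the prediction reduces to copying the label of the training point with the smallest Manhattan distance. The sketch below illustrates this in plain Python on made-up binary fingerprints; the function names and the data are hypothetical, and the leaf size is omitted because it affects only the search data structure, not the result.

```python
def manhattan(a, b):
    """l1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_1nn(train_x, train_y, query):
    """Return the label of the single nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: manhattan(train_x[i], query))
    return train_y[nearest]


# Three toy fingerprints with labels 1 (in a module) and 0 (not in a module).
train_x = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0]]
train_y = [1, 0, 1]

query = [0, 1, 0, 0]
distances = [manhattan(x, query) for x in train_x]
prediction = predict_1nn(train_x, train_y, query)
print(distances, prediction)
```

On binary fingerprints the Manhattan distance equals the number of differing bits, so the nearest neighbor is the compound whose fingerprint disagrees with the query in the fewest positions.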

Decision Tree (DT)

A decision tree is a supervised learning algorithm for regression and classification problems. The DT [25] [26] solves a problem using a tree representation in which each internal node corresponds to an attribute and each leaf node corresponds to a class label. The leaf nodes are the final nodes, or decision nodes, of the model. The DT creates a training model that predicts a class by learning decision rules inferred from the training data. In this experiment, the DT used the criterion "Gini" and a maximum tree depth of 10.

Random Forest (RF)

Random forest is an ensemble of DTs trained with the bagging method. It is used for classification and regression and operates by constructing a group of DTs and predicting the mode of their classes [27]. In general, bagging combines several learning models to increase overall performance. We applied the RF algorithm in both experiments to compare its abilities with those of our model. The hyper parameters of the RF are similar to those of the ETC. However, RF subsamples the input data with replacement, whereas ETC uses the whole original sample.
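The sampling difference between the two ensembles can be made concrete with a small sketch. The code below is illustrative only (the function names are assumptions): it contrasts a bootstrap sample, drawn with replacement as in random forest, with the whole training set that each Extra Trees tree receives.

```python
import random


def bootstrap_sample(indices, rng):
    """Random-forest style: draw len(indices) items with replacement."""
    return [rng.choice(indices) for _ in indices]


def whole_sample(indices):
    """Extra-trees style, as described above: every tree sees all items."""
    return list(indices)


rng = random.Random(42)
indices = list(range(10))

rf_view = bootstrap_sample(indices, rng)   # may contain repeats and omissions
et_view = whole_sample(indices)            # always the full training set

print(sorted(rf_view))
print(et_view)
```

A bootstrap sample typically repeats some training items and leaves others out, whereas the Extra Trees view is identical for every tree, so the diversity among Extra Trees comes from the random cut-points rather than from resampling.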

Implementation

All classifier models were implemented in Python 3.8 with the Keras library on an Intel(R) Core(TM) i7-4600U CPU @ 2.10 GHz and 2.70 GHz, with a 64-bit operating system on an x64-based processor. The structures of the compounds were encoded into the corresponding physicochemical descriptors using MACCS keys and RDKit. For the ensemble classifiers, we used the readily available implementations in the Scikit-learn module. The hyper parameters of the classifiers were tuned and optimized by the grid search optimization method.

Hyper parameter Optimization

As with other machine learning algorithms, the performance of the Extra Trees classifier can depend greatly on the choice of hyper parameters, such as the number of estimators, the criterion, the maximum number of features, and the maximum depth. With default hyper parameters, the accuracy, precision, and recall of the Extra Trees classifier were 97.94, 84.67, and 85.43, respectively. To maximize the classifier's performance, we performed hyper parameter optimization via grid search [18]. We tuned the most important hyper parameters for our data and selected the number of estimators (200), the maximum features (1.0), the maximum depth of the decision trees (60), and the criterion (entropy). With this selection, the classifier shows high performance: after optimization, the accuracy, precision, and recall are 98.59, 90.70, and 91.71, respectively. The grid search over different hyper parameters is illustrated in Figures 2A, 2B, and 2C.
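A grid search of this kind can be sketched with scikit-learn's `GridSearchCV` and `ExtraTreesClassifier`. The example below is a hedged illustration: the data are synthetic (generated with `make_classification`), the grid is a small assumed one that merely contains the reported values, and the three-fold cross-validation is chosen for speed rather than taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the fingerprint matrix (assumption, not the KEGG data).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=0
)

# Small illustrative grid; it includes the values reported in the text
# (200 estimators, max_features 1.0, max_depth 60, entropy criterion).
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [10, 60],
    "criterion": ["gini", "entropy"],
    "max_features": [1.0],
}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best combination found on synthetic data has no bearing on the reported results; the sketch only shows how every combination in the grid is scored by cross-validation and the best one retained.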

Prediction of pathway module (Binary Classification)

Prediction of the pathway module is a binary classification problem in machine learning. The binary classification data set is labelled with zeros and ones, where 1 and 0 indicate that a chemical compound does or does not belong to a pathway module, respectively. For binary classification, we retrieved 4614 biochemical compounds as SMILES structures; 2117 compounds do not belong to any pathway module in the metabolic pathways and were marked as negatives, and the remaining 2497 belong to pathway modules and were marked as positives in our data. The data were randomly split into training and testing sets for training and validation of the classifier. The prediction of the classifier belongs to one of (n = 1, 2) possible outcome classes. Based on the input data, the model predicts whether or not the query compound belongs to a pathway module. After preprocessing the data, the Extra Trees classifier was applied to predict the pathway module, as illustrated in Figure 1. After predicting the pathway modules, we also implemented RF and the integrated classifiers to compare their performance metrics with those of the ETC. The metric scores of the ETC and the other ensemble classifiers are shown in Table 1.

Table 1

Comparison of classifiers for pathway module prediction in metabolic pathways
Methods	Reference	Accuracy	Precision	Recall	MCC
Random forest (local ten-fold cross validation)	Jia et al. [12]	93.42	94.37	92.49	93.35
Random forest (global ten-fold cross validation)	Jia et al. [12]	93.29	93.99	92.49	93.24
Group of classifiers (KNN, RF, DT)	Compared with ETC	92.49	92.22	92.29	92.24
Extra Trees Classifier	Our	94.11	94.05	94.05	94.05
Random Forest	Compared with ETC	93.23	93.23	93.29	93.23

Comparison with other ensemble classifiers

We also adopted other machine learning classifiers widely used in bioinformatics: RF, and a collection of classifiers comprising KNN, DT, and RF. We evaluated our model on the prediction of pathway modules in the metabolic pathway. The performance metrics of these classifiers are shown in Figures 3A, 3B, 3C, 3D, 3E, and 3F. Besides, we compared our model with existing methods for the binary classification problem. Jia et al. [12] used the same type of data set for the prediction of actual metabolic pathways. We calculated the specificity (SP), sensitivity (SN), accuracy (ACC), precision, F1-measure [28] [29], and Matthews correlation coefficient (MCC) [30] according to the formulas given here:

The accuracy of binary classification is defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, the accuracy is the fraction of all query compounds whose association with a pathway module in the metabolic pathway is predicted correctly. Precision and recall are also needed to measure the performance of the model:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Here, a true positive (TP) is a chemical compound that belongs to a pathway module and that the model declares as belonging to the module; a true negative (TN) is a compound that is not present in a pathway module and that the model declares as not linked. A false positive (FP) is a compound that does not belong to a pathway module but that the model declares as belonging to one; a false negative (FN) is a compound that the model declares as not linked although it is part of a pathway module.
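The metrics named above can be computed directly from the four confusion counts. The sketch below uses the standard textbook definitions of accuracy, precision, recall (sensitivity), specificity, F1-measure, and MCC; the function name `binary_metrics` and the example counts are made up for illustration and are not taken from the experiments.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity (SN)
    specificity = tn / (tn + fp) if tn + fp else 0.0     # SP
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for 200 query compounds.
m = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that MCC ranges from -1 to 1 rather than from 0 to 1, so it is not directly comparable in scale with accuracy, precision, or recall.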

Prior studies in metabolomics have frequently focused on ensemble RF rather than other classifiers for binary, multi-label, and multi-class prediction, because of its effective performance. Therefore, we implemented RF alone and in an ensemble with other classifiers on our data, and compared both with the ETC. The bold values in Table 1 indicate the best-performing classifier. The ETC has higher performance metrics than prior work and than the other classifiers used in these experiments, as shown in Figure 4. We carried out precision-recall (PR) curve and receiver operating characteristic (ROC) curve analyses on the three ensemble models shown in Figure 3. The resulting curves, AUROCs, and AUPRs show that the ETC performs better than the other two models, the ensemble RF and the integrated classifiers. As a consequence of these analyses, the ETC appears to relate more closely to the atomic and molecular property features of compounds in pathway modules.

Prediction of pathway module (Multi-class classification)

We performed a second experiment to predict compounds in multiple pathway module classes. In machine learning terms, module class prediction is a multi-class classification in which inputs are categorized into multiple classes. In our experiment, the data belong to (N = 1, 2, 3, ..., 10) ten different classes. A query compound belongs either to a single class or to multiple classes, depending on its input labels. For each class, the data are divided into positive and negative samples: the positive samples are all the points in class i, and the negative samples are all the points not in class i. For the prediction of pathway module classes, we used 1985 labelled compounds with labels L ∈ (0, 1, 2, ..., 9). The classifier predicts a probabilistic outcome for a single class or for multiple classes, depending on the input labels. The compounds in the data set are shown in Figure 5A, and the performance statistics of each class are illustrated in Figure 5B. Our model shows high performance (precision, recall, F1-score) for each class. The accuracy for the multi-class classification problem is computed over all compounds and all classes.

Here N is the total number of compounds in the data set and c ranges over the ten classes of pathway modules. An individual class prediction counts as 1 if the model correctly predicts the label of the ith compound for pathway module class c. The performance metrics of our algorithm for multi-class classification, compared with other machine learning algorithms, are given in Table 2.
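One reading of this description is a label-wise accuracy: the number of correctly predicted (compound, class) labels divided by N times the number of classes. The sketch below implements that reading; it is an assumption based on the wording above, since the original formula is not reproduced here, and the function name and toy labels are made up for the example.

```python
def labelwise_accuracy(y_true, y_pred):
    """Fraction of (compound, class) pairs whose 0/1 label is predicted correctly."""
    n_compounds = len(y_true)
    n_classes = len(y_true[0])
    correct = sum(
        1
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
        if t == p
    )
    return correct / (n_compounds * n_classes)


# Two toy compounds over ten classes (order: CM, EM, LM, NM, AM, GM, MCV, BTP, BOSM, XB).
y_true = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],   # actual: CM, EM
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # actual: CM
]
y_pred = [
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],   # predicted: CM, LM (two labels wrong)
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # predicted: CM (all labels right)
]

acc = labelwise_accuracy(y_true, y_pred)
print(acc)
```

Because most compounds are negative for most classes, a label-wise accuracy is dominated by true negatives, which is one reason accuracy can be much higher than precision and recall on the same predictions.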

Table 2

Comparison of the multi-class classifier for the prediction of pathway module classes
Methods	References	Accuracy	Precision	Recall
Random forest	Baranwal et al.[11]	97.58	83.69	83.63
Ada Boost	Hu et al.[8]	94.64	77.97	67.83
Ensemble classifiers	This experiment	98.14	87.44	88.94
Extra Trees Classifiers	This experiment	98.59	90.70	91.71
Random forest	This Experiment	97.14	87.44	88.94

The Extra Trees classifier showed state-of-the-art performance metrics for the prediction of pathway module classes. We evaluated our selected classifier against other ensembles and groups of machine learning classifiers, and also compared it with existing methods that used similar data sets to predict metabolic pathway classes [8] [11]. The performance metrics of these models are shown in Table 2. We evaluated our model by calculating accuracy, sensitivity, and precision using the formulas shown in 5.2. The performance of the ETC is higher than that of the other methods on all metrics. To illustrate how the metrics are computed, Table 3 gives a hypothetical set of model predictions.

Table 3

Classifiers prediction of pathway module classes
Chemical compounds	Actual classes	Predicted classes	True positive	False positive	True negative	False negative
Glycerone phosphate	CM, EM	CM, LM	CM	LM	NM, AM, GM, MCV, BTP, BOSM, XB	EM
Glyceraldehyde 3-phosphate	CM, EM, BTP, BOSM	EM, BOSM, MCV	EM, BOSM	MCV	LM, NM, AM, GM, XB	CM, BTP
Alpha-D-glucose	CM	CM	CM	-	EM, LM, NM, AM, GM, MCV, BTP, BOSM, XB	-
Alpha-D-glucose 6-phosphate	CM, BOSM	EM, LM, NM, AM, GM, MCV, BTP	-	EM, LM, NM, AM, GM, MCV, BTP	XB	CM, BOSM

The true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) can be calculated from the actual and predicted classes of pathway modules. The table shows the actual classes, the assumed predicted classes, and the resulting confusion entries for each chemical compound. CM, EM, LM, NM, AM, GM, MCV, BTP, BOSM, and XB are the classes of pathway modules in KEGG, described in section 2. Table 3 presents four query compounds and their corresponding metric values. This procedure is repeated for all compounds in the data set, and the cumulative counts of TPs, FPs, TNs, and FNs are used to evaluate the performance.
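The per-compound bookkeeping described here amounts to simple set operations over the ten classes. The sketch below (function and variable names are assumptions for illustration) derives the four sets for one compound and then accumulates counts over several compounds.

```python
CLASSES = ["CM", "EM", "LM", "NM", "AM", "GM", "MCV", "BTP", "BOSM", "XB"]


def confusion_sets(actual, predicted):
    """Split the ten classes into TP, FP, FN, and TN for one compound."""
    actual, predicted = set(actual), set(predicted)
    return {
        "TP": actual & predicted,                 # actual and predicted
        "FP": predicted - actual,                 # predicted but not actual
        "FN": actual - predicted,                 # actual but not predicted
        "TN": set(CLASSES) - actual - predicted,  # neither
    }


# Glycerone phosphate: actual CM and EM, predicted CM and LM.
row = confusion_sets(actual={"CM", "EM"}, predicted={"CM", "LM"})

# Cumulative counts over several compounds (second one: actual CM, predicted CM).
examples = [({"CM", "EM"}, {"CM", "LM"}), ({"CM"}, {"CM"})]
totals = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
for actual, predicted in examples:
    for key, members in confusion_sets(actual, predicted).items():
        totals[key] += len(members)

print(row)
print(totals)
```

For every compound the four sets partition the ten classes, so the cumulative counts always sum to ten times the number of compounds.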

In this paper, we predict the functional units of gene sets (modules) in the metabolic pathway to which compounds are linked. The compound data were retrieved from two different KEGG databases, the pathway module database and the metabolic pathway database. We divided the task into two classification problems: predicting the pathway module and predicting its classes. In machine learning terms, the prediction of modules and the prediction of module classes are binary and multi-class classification problems, respectively. For both types of classification, we adopted an ensemble ETC based on the SMILES molecular structures of the compounds. The classifier builds a number of DTs from the training data set and predicts using adaptive voting.

We then chose several base classifiers, namely KNN, random forest, and decision tree, and designed an ensemble adaptive voting algorithm to improve the prediction accuracy. However, the performance of random forest and of the ensemble classifiers is not as good as that of the ETC. Further, we compared the performance of the ETC with previously published work. As a result, our model showed better performance than the others in terms of accuracy, precision, and recall, as shown in Tables 1 and 2. Overall, the ETC is easy to implement for processing metabolomics data for drug design, for building predictors of new reactions, for predicting enzymes, and for studying the interaction of molecules with the functional parts of metabolic pathways. The ETC can also be trained on atom-bond specifications and biological function to predict reactant pairs from the available chemical compound data, in order to determine unknown reactions for the optimization and reconstruction of metabolic pathways.

This article proposed an ensemble machine learning classifier, the ETC, to predict pathway modules in the metabolic pathway based on the chemical-chemical interactions, molecular structure, and physical descriptors of chemical compounds. The experimental results show that our ETC reached state-of-the-art performance on both the module and the module-class classification problems. This article predicts only the pathway module, and the classes of the module, to which chemical compounds map. The work can be extended to predict metabolic pathways and the functional units of gene sets in a pathway based on chemical reactions. Using hybrid data sets of chemical compounds and reactions to predict metabolic pathways and modules is a further direction for future work in metabolomics.

Ethics and consent to participate

Not applicable, and no administrative permissions were required to access the raw data from KEGG.

Consent for publication

Not applicable

Availability of data and materials

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgements

We are very thankful to Wuhan University for its generous support in conducting this research work.

Funding

This work was funded by the National Key R&D Program of China Technological Innovation in Hubei Province (2019AEA170), and the Frontier Projects of Wuhan for Application Foundation (2019010701011381). The National Key R&D Program of China (No. 2019YFA0904303) paid the open access publication fees.

Authors’ contributions

JL proposed the ideas, HAS collected data and wrote the manuscript, and JL, HAS, ZY, XZ, and JF discussed the outline of the manuscript.

Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research. 2000;28(1):27–30.
Kanehisa M, Sato Y, Furumichi M, Morishima K, Tanabe M. New approach for understanding genome variations in KEGG. Nucleic acids research. 2019;47(D1):590–595.
Besteiro S, Duy SV, Perigaud C, Lefebvre-Tournier I, Vial HJ. Exploring metabolomic approaches to analyse phospholipid biosynthetic pathways in Plasmodium. Parasitology. 2010;137(9):1343.
Caspi R, Billington R, Ferrer L, Foerster H, Fulcher CA, Keseler IM, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic acids research. 2016;44(D1):471–480.
Placzek S, Schomburg I, Chang A, Jeske L, Ulbrich M, Tillack J, et al. BRENDA in 2017: new perspectives and new tools in BRENDA. Nucleic acids research. 2016;p. gkw952.
Lombardot T, Morgat A, Axelsen KB, Aimo L, Hyka-Nouspikel N, Niknejad A, et al. Updates in Rhea: SPARQLing biochemical reaction data. Nucleic acids research. 2019;47(D1):596–600.
Cai YD, Qian Z, Lu L, Feng KY, Meng X, Niu B, et al. Prediction of compounds’ biological function (metabolic pathways) based on functional group composition. Molecular diversity. 2008;12(2):131–137.
Hu LL, Chen C, Huang T, Cai YD, Chou KC. Predicting biological functions of compounds based on chemical-chemical interactions. PloS one. 2011;6(12):e29491.
Phiri D, Morgenroth J, Xu C, Hermosilla T. Effects of pre-processing methods on Landsat OLI-8 land cover classification using OBIA and random forests classifier. International journal of applied earth observation and geoinformation. 2018;73:170–178.
Macchiarulo A, Thornton JM, Nobeli I. Mapping human metabolic pathways in the small molecule chemical space. Journal of chemical information and modeling. 2009;49(10):2272–2289.
Baranwal M, Magner A, Elvati P, Saldinger J, Violi A, Hero AO. A deep learning architecture for metabolic pathway prediction. Bioinformatics. 2020;36(8):2547–2553.
Jia Y, Zhao R, Chen L. Similarity-based machine learning model for predicting the metabolic pathways of compounds. IEEE Access. 2020;8:130687–130696.
Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Machine learning. 2006;63(1):3–42.
Baranidharan B, Pal A, Muruganandam P. Cardio-Vascular Disease Prediction based on Ensemble technique enhanced using Extra Tree Classifier for Feature Selection.
Goetz M, Weber C, Bloecher J, Stieltjes B, Meinzer HP, Maier-Hein K. Extremely randomized trees based brain tumor segmentation. Proceeding of BRATS challenge-MICCAI. 2014;p. 006–011.
Acosta MRC, Ahmed S, Garcia CE, Koo I. Extremely randomized trees-based scheme for stealthy cyber-attack detection in smart grid networks. IEEE access. 2020;8:19921–19933.
Wehenkel L, Ernst D, Geurts P. Ensembles of extremely randomized trees and some generic applications. Proceedings of robust methods for power system state estimation and load forecasting. 2006.
Orlenko A, Kofink D, Lyytikäinen LP, Nikus K, Mishra P, Kuukasjärvi P, et al. Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning. Bioinformatics. 2020;36(6):1772–1778.
Ampomah EK, Qin Z, Nyame G. Evaluation of Tree-Based Ensemble Machine Learning Models in Predicting Stock Price Direction of Movement. Information. 2020;11(6):332.
Shafique R, Mehmood A, Choi GS. Cardiovascular disease prediction system using extra trees classifier. 2019.
Shen M, Xiao Y, Golbraikh A, Gombar VK, Tropsha A. Development and validation of k-nearest-neighbor QSPR models of metabolic stability of drug candidates. Journal of medicinal chemistry. 2003;46(14):3013–3020.
Yao Z, Ruzzo WL. A regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data. In: BMC bioinformatics. vol. 7; 2006. p. 1–11.
de Magalhães CR, Carrilho R, Schrama D, Cerqueira M, da Costa AMR, Rodrigues PM. Mid-infrared spectroscopic screening of metabolic alterations in stress-exposed gilthead seabream (Sparus aurata). Scientific reports. 2020;10(1):1–9.
Istiqhfarani WA, Cholissodin I, Bachtiar FA. Klasifikasi Penyakit Dental Caries Menggunakan Algoritme Modified K-Nearest Neighbor. Jurnal Pengembangan Teknologi Informasi dan Ilmu Komputer e-ISSN. 2020;2548:964.
Anwar S, Widyantoro DH, Pancoro A. Metabolic Pathway Prediction using HMM. In: 2019 5th International Conference on Science and Technology (ICST). vol. 1; 2019. p. 1–5.
Cuperlovic-Culf M. Machine learning methods for analysis of metabolic data and metabolic pathway modeling. Metabolites. 2018;8(1):4.
Murrugarra D, Veliz-Cuba A, Aguilar B, Arat S, Laubenbacher R. Modeling stochasticity and variability in gene regulatory networks. EURASIP Journal on Bioinformatics and Systems Biology. 2012;2012(1):1–11.
Shaikh A, Mahoto N, Khuhawar F, Memon M. Performance evaluation of classification methods for heart disease dataset. Sindh University Research Journal-SURJ (Science Series). 2015;47(3).
Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061. 2020.
Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure. 1975;405(2):442–451.
