Prediction of pathway module (Binary Classification)
Predicting whether a compound belongs to a pathway module is a binary classification problem in machine learning. The data set is labeled with ones and zeros, where 1 indicates that a chemical compound belongs to a pathway module and 0 indicates that it does not. For binary classification, we retrieved 4614 biochemical compounds as SMILES structures: 2117 compounds do not belong to any pathway module in the metabolic pathways and are labeled as negatives, and the remaining 2497 belong to pathway modules and are labeled as positives. Compounds were randomly assigned to the training, testing, and validation sets. The classifier assigns each query compound to one of two possible outcome classes (n = 1, 2); that is, it predicts from the input data whether the compound belongs to a pathway module. After preprocessing the data, we implemented an Extra Trees Classifier (ETC) to predict pathway module membership, as illustrated in Figure 1. We also implemented random forest (RF) and an integrated group of classifiers to compare their performance with that of the ETC. The metric scores of the ETC and the other ensemble classifiers are shown in Table 1.
Table 1
Comparison of classifiers for pathway module prediction in metabolic pathways
Methods | Reference | Accuracy | Precision | Recall | MCC
Random forest (local ten-fold cross-validation) | Jia et al. [12] | 93.42 | 94.37 | 92.49 | 93.35
Random forest (global ten-fold cross-validation) | Jia et al. [12] | 93.29 | 93.99 | 92.49 | 93.24
Group of classifiers (KNN, RF, DTs) | Compared with ETC | 92.49 | 92.22 | 92.29 | 92.24
Extra Trees Classifier | This work | 94.11 | 94.05 | 94.05 | 94.05
Random forest | Compared with ETC | 93.23 | 93.23 | 93.29 | 93.23
Comparison with other ensemble classifiers
We also applied other machine learning classifiers that are widely used in bioinformatics: RF alone, and a group of classifiers comprising k-nearest neighbors (KNN), decision trees (DT), and RF. We evaluated each model on the prediction of pathway modules in the metabolic pathway. The performance statistics of these classifiers are shown in Figures 3A, 3B, 3C, 3D, 3E, and 3F. In addition, we compared our model with existing methods for this binary classification problem. Jia et al. [12] used the same type of data set to predict actual metabolic pathways. We calculated the specificity (SP), sensitivity (SN), accuracy (ACC), precision, F1-measure [28] [29], and Matthews correlation coefficient (MCC) [30] according to the following formulas:
The accuracy of binary classification is defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, accuracy is the fraction of all query compounds whose association with a pathway module in the metabolic pathway is predicted correctly. Precision and recall are also needed to measure the performance of the model:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Here, a true positive (TP) is a chemical compound that belongs to a pathway module and that the model predicts as belonging to it. A true negative (TN) is a compound that does not belong to a pathway module and that the model predicts as not belonging. A false positive (FP) is a compound that does not belong to a pathway module but that the model predicts as belonging. A false negative (FN) is a compound that belongs to a pathway module but that the model predicts as not belonging.
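As an illustration of how the metrics named above follow from the four confusion-matrix counts, the following is a minimal Python sketch. It uses the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from our experiments.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from confusion counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (SN)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # SP
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient, in the range [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

Note that MCC computed this way lies between -1 and 1; reporting it on a percentage scale, as in Table 1, corresponds to multiplying by 100.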
Prior metabolomics studies have frequently relied on RF ensembles rather than other classifiers for binary, multi-label, and multi-class classification because of their effective performance. We therefore implemented RF, both alone and combined with other classifiers, on our data and compared the results with the ETC. Bold values in Table 1 indicate the best performance. The ETC achieves higher performance metrics than prior work and than the other classifiers used in these experiments, as shown in Figure 4. We also performed precision-recall (PR) curve and receiver operating characteristic (ROC) curve analyses on the three ensemble models shown in Figure 3. The resulting curves, AUROCs, and AUPRs show that the ETC outperforms the other two models, that is, RF and the integrated classifiers. These analyses suggest that the ETC makes better use of the atomic and molecular property features of compounds in pathway modules.
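The comparison described above can be sketched with scikit-learn as follows. This is only an illustrative outline under several assumptions: the feature matrix is synthetic (generated with `make_classification`, not our compound data), the hyperparameters and the 80/20 split are arbitrary, the integrated classifier is approximated by a soft-voting ensemble of KNN, RF, and DT, and average precision is used as a summary of the PR curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the compound features (NOT the data used in the paper).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed model settings; the paper's hyperparameters are not reproduced here.
models = {
    "ETC": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN+RF+DT": VotingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier()),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("dt", DecisionTreeClassifier(random_state=0)),
        ],
        voting="soft",  # average predicted probabilities across members
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    scores[name] = {
        "auroc": roc_auc_score(y_test, proba),
        "aupr": average_precision_score(y_test, proba),
    }

for name, s in scores.items():
    print(f"{name}: AUROC={s['auroc']:.3f}, AUPR={s['aupr']:.3f}")
```

On synthetic data the ranking of the three models carries no meaning; the sketch only shows how the AUROC and AUPR values would be obtained for each model on a held-out test set.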
Prediction of pathway module (Multi-class classification)
We performed a second experiment to predict the pathway module classes of compounds. In machine learning terms, module class prediction is a multi-class classification problem, in which inputs are categorized into several classes. In our experiment, the data set covers ten different classes. A query compound belongs to a single class or to multiple classes, depending on its input labels. For each class i, the data are divided into positive and negative samples: the positive samples are all the points in class i, and the negative samples are all the points not in class i. For the prediction of pathway module classes, we used 1985 labeled compounds with labels L ∈ {0, 1, 2, ..., 9}. The classifier outputs a probabilistic prediction over a single class or multiple classes, depending on the input labels. The distribution of compounds in the data set is shown in Figure 5A, and the performance statistics of each class are illustrated in Figure 5B. Our model shows high performance (precision, recall, F1-score) for each class. The accuracy for multi-class classification is defined as follows:

$$\mathrm{Accuracy} = \frac{1}{N \, |C|} \sum_{i=1}^{N} \sum_{c \in C} \mathbb{1}\left(\hat{y}_{i,c} = y_{i,c}\right)$$

Here, N is the total number of compounds in the data set, and C is the set of ten pathway module classes. The indicator term is 1 if the model correctly predicts the label of the ith compound for pathway module class c, and 0 otherwise. The performance metrics of our algorithm and of other machine learning algorithms for multi-class classification are given in Table 2.
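To make the accuracy definition above concrete, the following minimal Python sketch counts, over every compound and every class, how often the predicted 0/1 label matches the true label, and divides by the number of compound-class pairs. The normalization by N times the number of classes is our reading of the description above, and the function name `labelwise_accuracy` and the two example compounds are hypothetical.

```python
def labelwise_accuracy(y_true, y_pred):
    """Fraction of (compound, class) pairs whose 0/1 label is predicted correctly.

    y_true and y_pred are lists of equal-length 0/1 rows, one row per compound
    and one column per pathway module class.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n_classes = len(y_true[0])
    correct = 0
    for row_true, row_pred in zip(y_true, y_pred):
        if len(row_true) != n_classes or len(row_pred) != n_classes:
            raise ValueError("every row must have one entry per class")
        # Indicator is 1 when the label for this compound and class matches.
        correct += sum(int(t == p) for t, p in zip(row_true, row_pred))
    return correct / (len(y_true) * n_classes)


# Two hypothetical compounds over ten classes (illustration only).
y_true = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
y_pred = [
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # one missed class, one spurious class
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # fully correct
]
acc = labelwise_accuracy(y_true, y_pred)
print(acc)
```

Because most compounds belong to only a few of the ten classes, the many correctly predicted zeros raise this accuracy, which is one reason it can be much higher than precision and recall.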
Table 2
Comparison of the multi-class classifier for the prediction of pathway module classes
Methods | References | Accuracy | Precision | Recall
Random forest | Baranwal et al. [11] | 97.58 | 83.69 | 83.63
AdaBoost | Hu et al. [8] | 94.64 | 77.97 | 67.83
Ensemble classifiers | This experiment | 98.14 | 87.44 | 88.94
Extra Trees Classifier | This experiment | 98.59 | 90.70 | 91.71
Random forest | This experiment | 97.14 | 87.44 | 88.94
The Extra Trees Classifier shows state-of-the-art performance for the prediction of pathway module classes. We evaluated our selected classifier against other ensembles and groups of machine learning classifiers, and also compared it with existing methods that used similar data sets to predict metabolic pathway classes [8] [11]. The performance metrics of these models are shown in Table 2. We calculated accuracy, sensitivity, and precision using the formulas given in Section 5.2. The ETC outperforms the other methods on all of these metrics. As an illustration, suppose that the model predictions are those given in Table 3.
Table 3
Classifiers prediction of pathway module classes
Chemical compounds | Actual classes | Predicted classes | True positive | False positive | True negative | False negative
Glycerone phosphate | CM, EM | CM, LM | CM | LM | NM, AM, GM, MCV, BTP, BOSM, XB | EM
Glyceraldehyde 3-phosphate | CM, EM, BTP, BOSM | EM, BOSM, MCV | EM, BOSM | MCV | NM, AM, GM, BOSM | CM, BTP
Alpha-D-glucose | CM | CM | CM | - | EM, LM, NM, GM, MCV, BTP, BOSM, XB | -
Alpha-D-glucose 6-phosphate | CM, BOSM | EM, LM, NM, AM, GM, MCV, BTP | - | EM, LM, NM, AM, GM, MCV, BTP | CM, BOSM | -
The true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) can be calculated from the actual and predicted classes of pathway modules. The table shows the actual classes, the assumed predicted classes, and the resulting confusion-matrix entries for each chemical compound. CM, EM, LM, NM, AM, GM, MCV, BTP, BOSM, and XB are the classes of pathway modules in KEGG, described in Section 2. Table 3 presents four query compounds and their corresponding values. This procedure is repeated for all the compounds in the data set, and the cumulative counts of TPs, FPs, TNs, and FNs are used to evaluate the performance.
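The per-compound bookkeeping described above can be expressed with set operations. The sketch below is a minimal illustration, assuming each compound's actual and predicted classes are given as sets of the ten class abbreviations; the function names are hypothetical. It uses the actual and predicted classes of glycerone phosphate and alpha-D-glucose from Table 3 as input.

```python
CLASSES = ["CM", "EM", "LM", "NM", "AM", "GM", "MCV", "BTP", "BOSM", "XB"]


def per_compound_confusion(actual, predicted, classes=CLASSES):
    """Split the class set into TP, FP, FN, and TN for a single compound."""
    actual, predicted = set(actual), set(predicted)
    return {
        "TP": actual & predicted,                     # actual and predicted
        "FP": predicted - actual,                     # predicted but not actual
        "FN": actual - predicted,                     # actual but not predicted
        "TN": set(classes) - (actual | predicted),    # neither
    }


def cumulative_counts(pairs, classes=CLASSES):
    """Sum the TP, FP, FN, and TN counts over (actual, predicted) pairs."""
    totals = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in pairs:
        cells = per_compound_confusion(actual, predicted, classes)
        for key in totals:
            totals[key] += len(cells[key])
    return totals


# Glycerone phosphate: actual CM and EM, predicted CM and LM.
row = per_compound_confusion({"CM", "EM"}, {"CM", "LM"})
print(row)

# Cumulative counts over glycerone phosphate and alpha-D-glucose.
totals = cumulative_counts([
    ({"CM", "EM"}, {"CM", "LM"}),
    ({"CM"}, {"CM"}),
])
print(totals)
```

For every compound the four sets partition the ten classes, so the four cumulative counts always sum to ten times the number of compounds.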