Evaluation metrics
We used the following evaluation metrics, computed with 5-fold cross-validation, to evaluate the performance of our algorithm and to compare it with other state-of-the-art algorithms. For simplicity, we use the abbreviations TP, FP, TN, and FN for true positive, false positive, true negative, and false negative, respectively.
1. AUC: The AUC, or area under the Receiver Operating Characteristic (ROC) curve, was the primary scoring metric we used to compare models against each other. To obtain it, we calculate the area under a curve whose points have the model's false positive rate (FPR) as their x-coordinate and its true positive rate (TPR) as their y-coordinate, evaluated over different classification thresholds. The formulas for FPR and TPR are shown below.
$$FPR= \frac{FP}{FP+TN}$$
$$TPR= \frac{TP}{TP+FN}$$
2. Accuracy (Acc): The ratio of correctly classified samples to all samples. It is calculated as follows:
$$Acc=\frac{TP+TN}{TP+TN+FP+FN}$$
3. Precision (Pre): The ratio of true positive samples to all samples labeled as positive. It is calculated as follows:
$$Pre= \frac{TP}{TP+FP}$$
4. Sensitivity (Sen): The ratio of true positive samples to all ground-truth positive samples (also known as Recall). It is calculated with the same formula as TPR.
5. F1 Score: The harmonic mean of Precision and Recall, calculated as follows:
$$F1= \frac{2 \times Precision \times Recall}{Precision+Recall}$$
6. Specificity (Spe): The ratio of true negative samples to all ground-truth negative samples. It is calculated as follows (a short computational sketch of all six metrics is given after this list):
$$Spe= \frac{TN}{TN+FP}$$
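As a concrete cross-check on the definitions above, the following minimal sketch computes all six metrics for a single cross-validation fold, assuming NumPy and scikit-learn are available; the names y_true, y_score, and evaluate_fold are illustrative and not part of our implementation.

```python
# Compute Acc, F1, Pre, Sen, Spe (at a 0.5 threshold) and AUC for one CV fold.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate_fold(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    sen = tp / (tp + fn) if (tp + fn) else 0.0   # recall / TPR
    spe = tn / (tn + fp) if (tn + fp) else 0.0   # 1 - FPR
    f1 = 2 * pre * sen / (pre + sen) if (pre + sen) else 0.0
    auc = roc_auc_score(y_true, y_score)         # threshold-independent
    return {"Acc": acc, "F1": f1, "Pre": pre, "Sen": sen, "Spe": spe, "AUC": auc}
```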
Benchmark dataset
To evaluate our method, we need a labeled set of circRNA-Disease pairs as a benchmark dataset, in which the label is 1 if the pair is associated and 0 otherwise. These labels are later used for supervised binary classification. To create the benchmark dataset, we adopted the approach in [52], in which a number of unknown pairs equal to the number of positive samples is randomly selected as negative samples. Our dataset has 575 known circRNA-Disease pairs constructed from 474 unique circRNAs and 64 unique diseases. Hence, there are 474 × 64 = 30336 possible circRNA-Disease combinations, 474 × 64 − 575 = 29761 of which have no known association. We randomly selected 575 pairs from these as our negative samples (label = 0). This approach gives us a balanced dataset and keeps the risk of including false negatives (i.e., circRNA-Disease pairs that are truly associated but whose associations have not yet been discovered) low, since we sample only \(\frac{575}{29761} \cong 1.93\%\) of the unknown pairs.
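The sampling step can be summarized by the following minimal sketch, assuming the known associations are given as (circRNA, disease) tuples; the function build_benchmark and its arguments are illustrative names, not our exact implementation.

```python
# Build a balanced benchmark: all known (positive) pairs plus an equal number of
# randomly sampled unknown pairs as negatives. Names here are illustrative.
import random
from itertools import product

def build_benchmark(circrnas, diseases, known_pairs, seed=0):
    all_pairs = set(product(circrnas, diseases))            # 474 * 64 = 30336 combinations
    unknown_pairs = sorted(all_pairs - set(known_pairs))    # 29761 candidate negatives
    random.seed(seed)
    negatives = random.sample(unknown_pairs, len(known_pairs))  # 575 sampled negatives
    dataset = [(c, d, 1) for c, d in known_pairs] + [(c, d, 0) for c, d in negatives]
    random.shuffle(dataset)                                  # label 1 = associated, 0 = unknown
    return dataset
```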
Evaluation of classification methods
For each classifier, the evaluation statistics depend on the size of the circRNA and disease feature vectors in the circRNA-Disease dataset fed to it as input. Therefore, using DeepWalk, we created sets of feature vectors of different sizes (multiples of 10, ranging from 10 to 200). We obtained the classification results on the benchmark dataset for each classifier and found the optimal number of features in terms of AUC. Overall, the best results were achieved by the XGB and ABRF classifiers. Figure 2 shows how the average AUC of each classifier changes with the number of features extracted by DeepWalk. The optimal number of features for each classifier was used for further evaluation.
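The feature-size sweep can be expressed as in the minimal sketch below. It assumes a hypothetical helper deepwalk_features(dim) that runs DeepWalk with the given vector size and returns the benchmark feature matrix X (one row per circRNA-Disease pair) and labels y; only XGBoost is shown, but the same loop applies to each classifier.

```python
# Sketch of selecting the optimal DeepWalk vector size by mean 5-fold AUC.
# `deepwalk_features(dim)` is a hypothetical helper returning (X, y) for the
# benchmark pairs; it is not part of scikit-learn or xgboost.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

def sweep_embedding_sizes(deepwalk_features, dims=range(10, 201, 10)):
    """Return {vector size: mean AUC} and the best size for an XGBoost classifier."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    results = {}
    for dim in dims:
        X, y = deepwalk_features(dim)      # features for this embedding size
        clf = XGBClassifier()
        results[dim] = float(np.mean(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")))
    best_dim = max(results, key=results.get)
    return results, best_dim
```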
Table 1 shows the values of the evaluation metrics for our six classifiers at their optimal number of features. SVM and LR show the weakest performance in our experiment, with average accuracies of about 72% and 71%, respectively. Overall, the boosting algorithms perform better than the others. Random forest also performs well: in terms of accuracy it has the best result after XGBoost, and in terms of AUC it is very close to the AdaBoost-based model (ABRF). We employed AdaBoost on top of random forest to improve the random forest results, but as the table shows, the results of these two approaches are very close. XGBoost obtained the best overall result, so we chose it as the classifier in our final pipeline. It is noteworthy, however, that the training time of the XGBoost classifier is far longer than that of ABRF and random forest.
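As a point of reference, the sketch below shows one way the two best-performing classifiers could be instantiated: XGBoost, and AdaBoost with a random forest base estimator (our reading of ABRF). The hyperparameter values are illustrative defaults, not the tuned settings behind Table 1.

```python
# Illustrative setup of the two strongest classifiers in Table 1 (hyperparameters
# are placeholders, not the tuned values used in our experiments).
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from xgboost import XGBClassifier

# XGBoost: gradient-boosted trees, the classifier used in the final pipeline.
xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)

# ABRF: AdaBoost whose weak learner is itself a (small) random forest.
# Note: scikit-learn >= 1.2 uses `estimator`; older versions use `base_estimator`.
abrf = AdaBoostClassifier(
    estimator=RandomForestClassifier(n_estimators=50),
    n_estimators=10,
)
```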
Table 1
The values of evaluation metrics for different classifiers based on their optimal number of features.
Classifier (optimal feature vector size) | Fold | Acc (%) | F1 (%) | Pre (%) | Sen (%) | Spe (%) | AUC (%) |
ABRF (10) | 1 | 89.57 | 89.09 | 93.33 | 85.22 | 93.91 | 96.79 |
2 | 88.70 | 88.29 | 91.59 | 85.22 | 92.17 | 95.58 |
3 | 90.43 | 90.43 | 90.43 | 90.43 | 90.43 | 97.40 |
4 | 90.87 | 90.58 | 93.52 | 87.83 | 93.91 | 96.19 |
5 | 89.13 | 88.48 | 94.12 | 83.48 | 94.78 | 96.96 |
Ave | 89.74 | 89.37 | 92.60 | 86.44 | 93.04 | 96.58 |
LR (80) | 1 | 71.30 | 70.54 | 72.48 | 68.70 | 73.91 | 77.64 |
2 | 71.30 | 71.05 | 71.68 | 70.43 | 72.17 | 77.04 |
3 | 74.78 | 75.63 | 73.17 | 78.26 | 71.30 | 81.07 |
4 | 70.00 | 69.33 | 70.91 | 67.83 | 72.17 | 75.26 |
5 | 69.13 | 70.04 | 68.03 | 72.17 | 66.09 | 76.42 |
Ave | 71.30 | 71.32 | 71.25 | 71.48 | 71.13 | 77.48 |
MP (10) | 1 | 90.87 | 90.83 | 91.23 | 90.43 | 91.30 | 96.56 |
2 | 90.43 | 90.83 | 87.20 | 94.78 | 86.09 | 93.44 |
3 | 87.83 | 87.72 | 88.50 | 86.96 | 88.70 | 96.43 |
4 | 88.26 | 88.41 | 87.29 | 89.56 | 86.96 | 94.82 |
5 | 90.43 | 90.18 | 92.66 | 87.83 | 93.04 | 96.47 |
Ave | 89.56 | 89.59 | 89.37 | 89.91 | 89.22 | 95.54 |
RF (10) | 1 | 89.13 | 88.69 | 92.45 | 85.21 | 93.04 | 97.20 |
2 | 88.70 | 88.39 | 90.83 | 86.09 | 91.30 | 95.74 |
3 | 91.74 | 91.63 | 92.86 | 90.43 | 93.04 | 96.54 |
4 | 90.43 | 90.09 | 93.46 | 86.96 | 93.91 | 95.56 |
5 | 90.43 | 90.09 | 93.46 | 86.96 | 93.91 | 97.17 |
Ave | 90.09 | 89.78 | 92.61 | 87.13 | 93.04 | 96.44 |
XGB (20) | 1 | 90.87 | 90.83 | 91.23 | 90.43 | 91.30 | 97.99 |
2 | 93.04 | 92.92 | 94.59 | 91.30 | 94.78 | 97.76 |
3 | 90.00 | 90.21 | 88.33 | 92.17 | 87.83 | 96.82 |
4 | 92.17 | 92.10 | 92.92 | 91.30 | 93.04 | 97.49 |
5 | 94.35 | 94.32 | 94.74 | 93.91 | 94.78 | 98.78 |
Ave | 92.09 | 92.08 | 92.36 | 91.82 | 92.35 | 97.77 |
SVM (90) | 1 | 70.87 | 71.73 | 69.67 | 73.91 | 67.83 | 75.86 |
2 | 73.04 | 74.17 | 71.20 | 77.39 | 68.70 | 79.10 |
3 | 77.83 | 79.52 | 73.88 | 86.09 | 69.56 | 81.65 |
4 | 69.13 | 70.04 | 68.03 | 72.17 | 66.09 | 72.77 |
5 | 69.56 | 70.83 | 68.00 | 73.91 | 65.22 | 72.66 |
Ave | 72.09 | 73.26 | 70.16 | 76.69 | 67.48 | 76.41 |
The permutation of the samples in the cross-validation folds was identical for all six classifiers. Figure 3 compares the ROC curves of the different classifiers for each fold of the data in the 5-fold CV. The other algorithms outperformed SVM and logistic regression by a gap of roughly 20% in terms of all six metrics; moreover, SVM took the longest training time of all models. The multilayer perceptron was superior to SVM and logistic regression but fell short of the ensemble models by a margin of about 1%. XGBoost had the best AUC in 4 of the 5 folds of the dataset.
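For completeness, a minimal sketch of producing per-fold ROC curves like those in Fig. 3 is shown below. It assumes matplotlib and scikit-learn, NumPy arrays X and y holding the benchmark features and labels, and a dictionary mapping classifier names to unfitted estimators; all names are illustrative.

```python
# Plot one ROC figure per CV fold, overlaying all classifiers on identical folds.
import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

def plot_fold_rocs(classifiers, X, y):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for every model
    for fold, (tr, te) in enumerate(cv.split(X, y), start=1):
        plt.figure()
        for name, model in classifiers.items():
            score = clone(model).fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]
            fpr, tpr, _ = roc_curve(y[te], score)
            plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
        plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
        plt.xlabel("False positive rate")
        plt.ylabel("True positive rate")
        plt.title(f"Fold {fold}")
        plt.legend()
    plt.show()
```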
Comparison with existing methods
We compared CircWalk with three state-of-the-art algorithms based on the benchmark dataset: DMFCDA (Deep Matrix Factorization CircRNA-Disease Association) [53], GCNCDA [54], and SIMCCDA (Speedup Inductive Matrix Completion for CircRNA-Disease Associations prediction) [55].
Table 2 shows the results of the evaluation process for the selected algorithms and our method on the benchmark dataset. As shown in this table, CircWalk outperforms the other algorithms in our experiment, and its average values for all evaluation metrics are larger than 90%. After CircWalk, DMFCDA obtained the highest accuracy. Among the comparison methods, GCNCDA is the most similar to ours; although it shows lower accuracy than CircWalk and DMFCDA, it is more stable and gives approximately the same results in all folds. SIMCCDA has acceptable performance in all metrics except precision and F1: it accurately identified the negative class (unassociated circRNA-Disease pairs), but it produced far more false positives than true positives, resulting in very low precision.
Table 2
Evaluation metric values for our algorithm and three other state-of-the-art algorithms based on the benchmark dataset.
Algorithm | Fold | Acc (%) | F1 (%) | Pre (%) | Sen (%) | Spe (%) | AUC (%) |
CircWalk | 1 | 90.87 | 90.83 | 91.23 | 90.43 | 91.30 | 97.99 |
2 | 93.04 | 92.92 | 94.59 | 91.30 | 94.78 | 97.76 |
3 | 90.00 | 90.21 | 88.33 | 92.17 | 87.83 | 96.82 |
4 | 92.17 | 92.10 | 92.92 | 91.30 | 93.04 | 97.49 |
5 | 94.35 | 94.32 | 94.74 | 93.91 | 94.78 | 98.78 |
Ave | 92.09 | 92.08 | 92.36 | 91.83 | 92.35 | 97.77 |
DMFCDA | 1 | 77.83 | 77.83 | 71.92 | 91.30 | 64.35 | 77.83 |
2 | 80.00 | 80.00 | 79.49 | 80.87 | 79.13 | 80.00 |
3 | 88.26 | 88.26 | 87.29 | 89.57 | 86.96 | 88.26 |
4 | 86.84 | 86.84 | 85.59 | 88.60 | 85.09 | 86.84 |
5 | 85.53 | 85.53 | 83.47 | 88.60 | 82.46 | 85.53 |
Ave | 83.69 | 83.69 | 81.55 | 87.79 | 79.60 | 83.69 |
GCNCDA | 1 | 73.48 | 73.59 | 73.28 | 73.91 | 73.04 | 82.17 |
2 | 76.09 | 76.79 | 74.59 | 79.13 | 73.04 | 82.92 |
3 | 73.91 | 73.45 | 74.77 | 72.17 | 75.65 | 82.45 |
4 | 74.35 | 74.24 | 74.56 | 73.91 | 74.78 | 84.62 |
5 | 74.78 | 76.42 | 71.76 | 81.74 | 67.83 | 81.42 |
Ave | 74.52 | 74.90 | 73.79 | 76.17 | 72.87 | 82.72 |
SIMCCDA | 1 | 78.04 | 11.88 | 6.41 | 81.82 | 77.97 | 68.82 |
2 | 82.80 | 16.73 | 9.30 | 83.20 | 82.79 | 71.30 |
3 | 86.96 | 18.33 | 10.25 | 86.41 | 86.97 | 76.80 |
4 | 84.00 | 18.44 | 10.35 | 84.64 | 83.98 | 73.91 |
5 | 85.00 | 16.64 | 9.20 | 86.67 | 84.97 | 75.45 |
Ave | 83.36 | 16.40 | 9.10 | 84.54 | 83.34 | 73.30 |
Figure 4 compares the ROC curves of the algorithms in each fold of the validation. As shown in this figure, CircWalk obtained an AUC of more than 96% in every fold (about 97% on average). GCNCDA and DMFCDA have almost the same results, and SIMCCDA has the worst results in our experiment, with an average AUC of about 73%.
Case study
This step aims to evaluate the performance of CircWalk in predicting novel circRNA-Disease associations for some selected common diseases. To this end, we selected three common cancers (lung, gastric, and colorectal) that are the target of much circRNA-related research. We trained our model on the feature vectors of the positive pairs and one third of the negative pairs. As pointed out earlier, the negative pairs are a subset of unverified circRNA-Disease associations, which means there may be true associations among them. We therefore decided to train our model on as few negative pairs as possible to reduce learning from these false negatives; however, we could not omit them completely, as the dataset must contain at least two classes for XGBoost to be trained on it. We then made a list of all circRNA-Disease pairs whose circRNA is present in our initial circRNA-Disease dataset and whose disease is one of the three diseases selected for this part, and filtered out the pairs that were present in the training data of this step. We gave this list of candidate circRNA-Disease associations as input to our trained model; instead of labeling them as positive (1) or negative (0), we used the model to calculate the probability of association for each pair. Finally, for each disease, we found the circRNAs most likely to be associated with that disease and searched the existing literature in PubMed to check whether empirical studies had already confirmed the association. A sketch of this ranking procedure is given below; Table 3 shows the result of this investigation.
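The following minimal sketch illustrates the ranking step, assuming a trained XGBoost model exposing predict_proba and a hypothetical helper pair_features(circrna, disease) that returns the DeepWalk-based feature vector of a candidate pair; the function and argument names are illustrative.

```python
# Score all unseen circRNA-disease pairs for one disease and return the most
# probable associations (as reported in Table 3). Helper names are illustrative.
import numpy as np

def rank_candidates(model, pair_features, circrnas, disease, training_pairs, top_k=20):
    # Keep only pairs the model has not been trained on.
    candidates = [c for c in circrnas if (c, disease) not in training_pairs]
    X = np.vstack([pair_features(c, disease) for c in candidates])
    probs = model.predict_proba(X)[:, 1]        # probability of association
    order = np.argsort(probs)[::-1][:top_k]     # highest probability first
    return [(candidates[i], float(probs[i])) for i in order]
```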
Table 3
Predicted CircRNA-Disease relations with the highest probability for some selected diseases.
Disease | circRNA | Probability | Related article (PMID) |
Lung cancer | hsa_circ_0007534 | 0.996 | 30017736 |
hsa_circ_0001946 | 0.995 | 31249811 |
hsa_circ_0002874 | 0.992 | 33612481 |
hsa_circ_0014130 | 0.991 | 29440731, 31241217, 31818066, 32060230, 32616621, 34349347 |
hsa_circ_0002702 | 0.990 | 32962802 |
hsa_circ_0007874 | 0.988 | 30975029 |
hsa_circ_0074930 | 0.985 | 32962802 |
hsa_circ_0086414 | 0.983 | 30777071 |
hsa_circ_0079530 | 0.972 | 29689350 |
hsa_circ_0007385 | 0.972 | 29372377, 32602212, 32666646 |
hsa_circ_0016760 | 0.968 | 29440731 |
hsa_circ_0012673 | 0.960 | 29366790, 32141553 |
hsa_circ_0067934 | 0.954 | 33832139 |
hsa_circ_0000567 | 0.950 | 32328186, 33768996, 34435479 |
hsa_circ_0072088 | 0.941 | 32308427, 34135596 |
hsa_circ_0001727 | 0.934 | 32010565 |
hsa_circ_0008305 | 0.901 | 30261900 |
Gastric cancer | hsa_circ_0001313 | 0.999 | 32253030 |
hsa_circ_0004771 | 0.998 | 29098316 |
hsa_circ_0002874 | 0.998 | 34388244 |
hsa_circ_0000615 | 0.998 | 34049561 |
hsa_circ_0006404 | 0.977 | 32445925 |
hsa_circ_0001982 | 0.977 | 33000178 |
hsa_circ_0032683 | 0.910 | 33449227 |
hsa_circ_0014130 | 0.819 | 32190005 |
Colorectal cancer | hsa_circ_0006054 | 0.995 | 30585259 |
hsa_circ_0000745 | 0.990 | 28974900 |
hsa_circ_0044556 | 0.989 | 32884449 |
hsa_circ_0005075 | 0.964 | 31081084, 31476947, 34015582 |
hsa_circ_0040809 | 0.958 | 34438465 |
hsa_circ_0004771 | 0.945 | 31737058, 32419229 |
hsa_circ_0007874 | 0.924 | 32419229 |
hsa_circ_0080210 | 0.914 | 34222420 |
As shown in Table 3, all the predicted pairs (except one gastric cancer pair) had an association probability larger than 90 percent. Much of this output is supported by experimentally validated evidence. For instance, CircWalk predicted an association between hsa_circ_0001313 and gastric cancer with a probability of almost 100 percent; according to a recent study by Zhang et al. [56], this circRNA is a key regulator of drug resistance in gastric cancer. CircRNA hsa_circ_0007534 (predicted with a probability of 99.6 percent) is an important oncogene in lung cancer related to cancer cell proliferation and apoptosis [57]. Another example is the association between hsa_circ_0044556 and colorectal cancer (predicted with a probability of 98.9 percent): knocking down this circRNA inhibits the proliferation, migration, and invasion of colorectal cancer cells [58]. These results demonstrate the power of CircWalk to predict truly novel circRNA-Disease associations.