The five-fold cross-validation performance of NEMPD
To evaluate the prediction performance of NEMPD, we adopted 5-fold cross-validation in our experiments. Specifically, we first divided the training set into five parts, each with the same ratio of positive to negative samples. In each round, four parts were used as the training set and the remaining part as the test set, and the experiment was repeated five times so that each part served as the test set once. We selected six evaluation indicators: accuracy (Acc.), precision (Prec.), Matthews correlation coefficient (MCC), specificity (Spec.), sensitivity (Sen.) and area under the ROC curve (AUC). Table 1 shows the results of each fold in detail. These results indicate that NEMPD performs well in predicting potential miRNA-disease associations.
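The protocol above can be illustrated with a minimal sketch, assuming binary 0/1 labels. The function names `stratified_folds` and `binary_metrics` are hypothetical helpers introduced here for illustration and are not part of NEMPD; the sketch shows only how a class-balanced split and the threshold-based indicators (Acc., Prec., Sen., Spec., MCC) could be computed.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while keeping the class ratio per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Compute Acc., Prec., Sen., Spec. and MCC from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Return 0.0 when a denominator is empty (e.g. no predicted positives).
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": safe_div(tp + tn, tp + tn + fp + fn),
        "prec": safe_div(tp, tp + fp),
        "sen": safe_div(tp, tp + fn),
        "spec": safe_div(tn, tn + fp),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Toy labels (not the NEMPD data): 10 positive and 10 negative samples.
toy_labels = [1] * 10 + [0] * 10
toy_folds = stratified_folds(toy_labels, k=5, seed=0)
for test_idx in toy_folds:
    # Each fold serves once as the test set; the other four form the training set.
    train_idx = [i for fold in toy_folds if fold is not test_idx for i in fold]
```

In each round, a classifier would be fitted on `train_idx` and scored on `test_idx` with `binary_metrics`; averaging the per-fold values gives the kind of summary reported in Table 1.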
The ROC (receiver operating characteristic) curve is often used to evaluate a binary classifier and to assess classification on imbalanced data. The abscissa of the ROC curve is the FPR (false positive rate), the proportion of all negative cases that are predicted to be positive. The ordinate is the TPR (true positive rate), the proportion of all positive cases that are correctly predicted to be positive. The AUC is defined as the area under the ROC curve; it ranges from 0 to 1, with 0.5 corresponding to random guessing, so the values of useful classifiers generally lie between 0.5 and 1. AUC is commonly used as an evaluation indicator because ROC curves alone often cannot clearly show which classifier performs better, whereas AUC summarizes each curve as a single value: the larger the AUC, the better the performance of the classifier. The PR (precision-recall) curve is another tool for evaluating the classification ability of machine learning algorithms on a given data set. When dealing with highly imbalanced data sets, the PR curve can display more information and reveal more problems. The AUPR is defined as the area under the PR curve. As with AUC, the larger the AUPR, the better the performance of the classifier. The ROC and PR curves of NEMPD under 5-fold cross-validation are shown in Figure 3 and Figure 4, respectively. As can be seen from the figures, the mean AUC and AUPR of NEMPD are 0.9158 and 0.9233, respectively. These results demonstrate that NEMPD performs well in predicting potential miRNA-disease associations.
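As an illustration of the two summary values, the following sketch computes AUC and one common estimate of AUPR from prediction scores. It is not the authors' evaluation code: `roc_auc` uses the rank-based definition of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative), and `average_precision` uses the step-wise "average precision" estimate of the area under the PR curve, which may differ slightly from a trapezoidal AUPR. Both function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the PR curve: sum of precision times recall increment."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    area, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score so they enter the curve together.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            i += 1
        precision = tp / i      # i samples are predicted positive at this threshold
        recall = tp / n_pos     # recall equals the TPR
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy scores (not NEMPD output): one negative is ranked above one positive.
toy_y = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = roc_auc(toy_y, toy_scores)
toy_aupr = average_precision(toy_y, toy_scores)
```

The pairwise loop is quadratic in the number of samples and is meant only to make the definition explicit; a sort-based implementation would be preferable for large data sets.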
Comparison with Different Feature Combinations
To verify the validity of the proposed feature representation, we examined the influence of different feature combinations on the results of NEMPD. Combination 1 consists only of the attribute information of miRNAs and diseases, combination 2 consists only of the behavior information of miRNAs and diseases, and combination 3 consists of both attribute and behavior information. Each of the three feature combinations was used to train the random forest classifier and evaluated under 5-fold cross-validation. The detailed results are shown in Table 2, and the ROC and PR curves in Figure 5. The experimental results show that NEMPD achieves better prediction performance when combination 3 is used as the final training feature vector.
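The three combinations can be sketched as follows, under the assumption that a miRNA-disease pair is represented by concatenating the selected feature vectors of the miRNA and of the disease. This assumption, the function name `build_pair_vector`, the dictionary keys, and the toy values are all introduced here for illustration; the actual construction in NEMPD may differ.

```python
# Which kinds of information each combination uses (as described in the text).
COMBINATIONS = {
    1: ("attribute",),
    2: ("behavior",),
    3: ("attribute", "behavior"),
}


def build_pair_vector(mirna, disease, combination):
    """Concatenate the chosen miRNA and disease features for one sample pair."""
    if combination not in COMBINATIONS:
        raise ValueError("combination must be 1, 2 or 3")
    vector = []
    for kind in COMBINATIONS[combination]:
        vector.extend(mirna[kind])    # miRNA part of this information type
        vector.extend(disease[kind])  # disease part of this information type
    return vector


# Toy feature vectors (placeholders, not real NEMPD features).
toy_mirna = {"attribute": [0.1, 0.2], "behavior": [0.5]}
toy_disease = {"attribute": [0.3], "behavior": [0.7, 0.9]}
toy_vectors = {c: build_pair_vector(toy_mirna, toy_disease, c) for c in COMBINATIONS}
```

Each of the three resulting feature sets would then be passed to the same classifier and cross-validation procedure, so that any difference in the indicators can be attributed to the features alone.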
Comparison with Different Classifier Models
To verify the performance of the random forest classifier in NEMPD, we compared it with three other classifier models (KNN, Naive Bayes and Decision Tree). All four classifiers use the same data set and their default parameters for training and prediction, so that the comparison is fair. We again use the six evaluation indicators: accuracy (Acc.), precision (Prec.), Matthews correlation coefficient (MCC), specificity (Spec.), sensitivity (Sen.) and area under the ROC curve (AUC). The KNN model achieves an average AUC of 90.14±0.48%, with per-fold AUC values of 89.86%, 89.52%, 90.12%, 90.73%, and 90.47%. The Naive Bayes model achieves an average AUC of 88.98±0.44%, with per-fold AUC values of 88.79%, 88.52%, 88.84%, 89.69%, and 89.07%. The Decision Tree model achieves an average AUC of 82.20±0.80%, with per-fold AUC values of 81.66%, 81.07%, 82.59%, 82.96%, and 82.70%. The Random Forest model achieves an average AUC of 91.58±0.54%, with per-fold AUC values of 91.72%, 90.70%, 91.50%, 92.06%, and 91.93%. Details of the remaining five indicators are shown in Table 3, and Figure 6 shows the ROC and PR curves of the different classifiers. The results of the comparison indicate that the random forest classifier is the most suitable for NEMPD. Although its sensitivity is lower than that of KNN and Naive Bayes, random forest performs better in accuracy and AUC, which better reflect the overall classification ability of a model.
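The summary values can be recomputed from the per-fold AUC values listed above. In the short check below, the "±" term is taken to be the sample standard deviation (denominator n-1), which is an assumption, although it reproduces the reported figures; the variable and function names are illustrative.

```python
from statistics import mean, stdev

# Per-fold AUC values (in %) as reported in the text.
fold_auc = {
    "KNN": [89.86, 89.52, 90.12, 90.73, 90.47],
    "Naive Bayes": [88.79, 88.52, 88.84, 89.69, 89.07],
    "Decision Tree": [81.66, 81.07, 82.59, 82.96, 82.70],
    "Random Forest": [91.72, 90.70, 91.50, 92.06, 91.93],
}


def summarize(values):
    """Return (mean, sample standard deviation), both rounded to two decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


summary = {name: summarize(values) for name, values in fold_auc.items()}

# Print the models from highest to lowest mean AUC.
for name, (avg, sd) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: {avg:.2f} ± {sd:.2f}%")
```

With only five folds, the standard deviation is a rough indication of fold-to-fold variability rather than a precise estimate.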
Case studies
To further verify NEMPD's ability to discover potential miRNA-disease associations, we conducted case studies on three common and complex human cancers (colon neoplasms, breast neoplasms, and lung neoplasms); such case studies are the most common experiment for miRNA-disease association prediction methods. After the experiment was completed, we selected the top 50 predicted associations between miRNAs and each cancer and checked them against two other databases, dbDEMC [32] and miR2Disease [33].
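The verification step can be sketched as a ranking followed by a set lookup. In the sketch below, the function name `top_k_confirmed`, the placeholder miRNA names, the scores, and the database contents are all invented for illustration; in practice the two sets would be loaded from dbDEMC and miR2Disease for the cancer under study.

```python
def top_k_confirmed(scores, databases, k=50):
    """Rank candidate miRNAs by score and record which databases support each of the top k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    report = []
    for mirna, score in ranked:
        # List every database that contains this miRNA for the disease.
        sources = [name for name, known in databases.items() if mirna in known]
        report.append((mirna, score, sources))
    n_confirmed = sum(1 for _, _, sources in report if sources)
    return report, n_confirmed


# Placeholder candidates and scores for one disease (not real predictions).
toy_scores = {"miR-a": 0.9, "miR-b": 0.8, "miR-c": 0.7, "miR-d": 0.2}
# Placeholder database contents (not real records).
toy_databases = {
    "dbDEMC": {"miR-a", "miR-c"},
    "miR2Disease": {"miR-a"},
}
toy_report, toy_confirmed = top_k_confirmed(toy_scores, toy_databases, k=3)
```

A miRNA is counted as confirmed when at least one database lists it, while a candidate with an empty `sources` list is an unconfirmed prediction rather than a proven false positive.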
Colon neoplasms are currently the third most common gastrointestinal disease in the world [34, 35]. Some miRNA-colon neoplasm associations have been verified by previous experiments, such as those involving miR-17, miR-92a, miR-31, miR-155, and miR-21 [36]. These studies have demonstrated that miRNAs are crucial for the prediction of colon neoplasms and can be used as important biomarkers for them. Therefore, the prediction of miRNA-colon neoplasm associations is very important for the treatment and diagnosis of colon neoplasms. In this work, we sorted the final prediction results of NEMPD by prediction score. As a result, 48 of the top 50 miRNAs were verified to be associated with colon neoplasms through the miR2Disease and dbDEMC databases (see Table 4). For example, hsa-miR-20a-5p has been experimentally confirmed to be associated with colon neoplasms [37]. That study drew its conclusions through statistical analysis of population-based colorectal cancer studies conducted in Utah and the Kaiser Permanente Medical Care Project (PMID: 26963002).
Breast neoplasms are another common malignant tumor that occurs mainly in women. In the United States, there are about 180,000 new breast neoplasm patients each year, and about 40,000 die from the disease. In recent years, the incidence of breast neoplasms in China has also been rising, and they have become the second leading cause of cancer death after lung neoplasms. As small RNA molecules, miRNAs can inhibit breast neoplasms by repressing their target mRNAs. In addition, miRNA-breast neoplasm associations have been verified by many previous studies. For example, miR-21 has been found to be overexpressed in breast neoplasms [38], while miR-429 and miR-200c are down-regulated [39]. Similarly, we sorted the final prediction results by prediction score. As a result, 47 of the top 50 miRNAs were verified to be associated with breast neoplasms through the miR2Disease and dbDEMC databases (see Table 5). For example, hsa-miR-93-5p has been experimentally shown to be related to breast neoplasms [40] (PMID: 24865188).
Lung neoplasms are a common tumor disease worldwide and one of the leading causes of cancer death. They also have some of the fastest-growing morbidity and mortality rates and pose one of the greatest threats to the health and life of the population. In recent years, the incidence and mortality of lung cancer have increased significantly in many countries. In addition, many previous studies have confirmed that miRNAs are crucial in the early treatment and diagnosis of lung neoplasms. For example, using microarray analysis, Yanaihara et al. [41] found that the expression of 17 miRNAs in lung cancer cells was altered compared with normal cells. Mascaux et al. [42] found that the expression profile of miRNAs changed during the entire process of lung cancer. Similarly, we sorted the final prediction results of NEMPD by prediction score. As a result, 47 of the top 50 miRNAs were verified to be related to lung neoplasms by the dbDEMC and miR2Disease databases (see Table 6).