Redundancy of the benchmark dataset
In order to examine the redundancy of the benchmark dataset, we calculated the similarity of any two miRNAs, diseases and miRNA-disease association pairs, respectively. For two miRNAs, we performed sequence alignment by using CD-HIT software [29] to calculate similarity. For two diseases, we calculated the Jaccard coefficient to evaluate similarity based on the disease symptoms feature vector. For two miRNA-disease association pairs m(i)-d(p) and m(j)-d(q), their similarity (mDPS) can be calculated according to the Eq. 1:
[Due to technical limitations, the formula could not be displayed here. Please see the supplementary files section to access the formula.] (1)
Where, Sm(i,j) and Sd(p,q) means the similarity of two miRNAs m(i) and m(j) as well as two diseases d(p) and d(q).
Similarity values of any two miRNAs, diseases and miRNA-disease association pairs were shown in Fig. 1(A), (B) and (C), respectively. Meanwhile, statistical results of similarity values were shown in Fig. 1(D). For miRNAs, their similarity mainly concentrated in [0.4, 0.5) (value higher than or equal to 0.4 and lower than 0.5, the same below), accounting for 70.88% and the proportion of interval [0.2, 0.3), [0.3, 0.4), [0.5, 0.6), [0.6, 0.7), [0.7, 0.8), [0.8, 0.9) and [0.9, 1.0) is 0.14%, 0.89%, 19.56%, 0.22%, 15.45%, 2.11%, 0.32% respectively. Similarity values of any two miRNAs are not included in the range of [0, 0.1) and [0.1, 0.2).The similarity value between any two diseases is usually low. The 81.98% and 15.45% of all similarity values are located in the range of [0, 0.1) and [0.1, 0.2). Only 2.11%, 0.32%, 0.0698%, 0.0471%, 0.0070% and 0.0087% concentrated in the range of [0.2, 0.3), [0.3, 0.4), [0.4, 0.5), [0.5, 0.6), [0.6, 0.7) and [0.7, 0.8). And, similarity value of any two diseases is always lower than 0.8.
The distribution of miRNA-disease associations similarity values is more dispersed. The 62.75% and 32.76% of associations similarity values concentrated in the range of [0.2, 0.3) and [0.3, 0.4). The 0.27%, 2.15%, 0.55%, 0.0969%, 1.41%, 0.0135% and 0.0017% are located in the range of [0.1, 0.2), [0.4, 0.5), [0.5, 0.6), [0.6, 0.7), [0.7, 0.8), [0.8, 0.9) and [0.9, 1.0), respectively. These results indicate that the constructed benchmark dataset has a low redundancy, miRNA sequences are diverse and diseases fall into different categories.
[Figure 1]
Performance evaluation
To evaluate the identification performance of model, five-fold cross-validation was utilized based on the benchmark dataset, which was divided into 5 parts, each part would be treated as test set in turn and other 4 parts would be treated as training set. We repeated 10 times for each evaluation to minimize the performance difference.
We adopt several evaluation metrics: accuracy (Acc), specificity (Spe), sensitivity (Sen), precision (Pre) and Matthews correlation coefficient (Mcc), the areas under receiver operating characteristic curve (AUROC) and precision recall curve (AUPRC) to assess prediction performance.
Optimization of model parameter
Random forest is a classifier with multiple decision trees, and the numbers of decision trees and predictor variables at each decision split have an important impact on prediction performance. Therefore, prediction performance of constructed model was optimized based on the grid search strategy, in which change the tree number from 100 to 1000 with interval 100 and vary number of features selected at each split from 21, 22, …, to 25 and a default value (i.e. square root of the total number of variables). The Acc, Sen, Spe, Pre, Mcc, as well as AUROC and AUPRC for each combination of tree number and selected variable number were listed in supplement Table 1~7.
From results, we can see that the number of trees has less impact on model performance. When the number of trees is determined, the model can obtain better performance as the number of selected feature increases. And, the performance remains stable when the number of selected variables is greater than the default value. Moreover, it takes more time to train the model when the numbers of trees and selected variables are larger. Therefore, the optimal model was constructed by setting the number of trees and choose feature as 100 and 22, respectively. Finally, the optimized model achieved the average Acc of 92.27%, Spe of 92.93%, Sen of 91.62%, Pre of 92.85%, and Mcc of 0.8456, respectively. The corresponding AUROC and AUPRC were 0.9756 and 0.9787, respectively.
Performance comparing of different negative examples selection strategy
In order to demonstrate the reliability of the current negative sample selection strategy, we compared it with the random selection strategy. For better comparison, the result of two selection strategy was shown in Fig. 2. When the negative sample was randomly selected, the valudes of Acc, Spe, Sen, Pre, Mcc, AUROC and AUPRC were 83.67%, 82.93%, 84.40%, 83.17%, 0.6733, 0.9184 and 0.9072, respectively. It is 8.60%, 10.00%, 7.22%, 9.68%, 17.23%, 5.72%, 7.15% lower than current method, respectively. Further, we calculated the relative standard deviation (RSD). For the current method, the RSD of Acc, Spe, Sen, Pre and Mcc is 0.28%, 0.31%, 0.24%, 0.32% and 0.60%, respectively. These values are always lower than those of the random selection strategy (0.43%, 0.53%, 0.41%, 0.48% and 1.08%). The results demonstrate that the selection of negative sample in the current method is effective and reliable. The lower RSD also prove that current selection strategy can make model more stable.
[Figure 2]
The reliability of negative sample
In this part, we further construct five reliable negative sample datasets based on the different threshold (1.5AOD, 1.0AOD, 0.8AOD, 0.6AOD and 0.5AOD) to investigate the impact on the performance of the model (nAOD represents n×AOD). Results of different threshold were shown in Fig. 3. We can see that the performance of the model decrease gradually with the decrease of the threshold. When the threshold is lower than 0.5AOD, there is no performance difference between the current negative samples selection strategy and the random negative samples selection method. We can conclude that the reliability of negative samples has a positive impact on the prediction performance. Finally, we chose 1.5AOD as threshold to select reliable negative sample
[Figure 3]
The proportion of positive and negative samples
In random forest models, the ratio between positive samples and negative samples also affects the model performance. To further explore this effect, we chose five different ratios to construct five training datasets. By increasing the number of negative samples, we get five training datasets with the ratio in 1:1, 1:2, 1:3, 1:5 and 1:10 respectively. Results of five-fold cross-validation test were shown in Fig. 4 (In this section, the selecting threshold of reliable negative samples was set to 1.0AOD to ensure that there are sufficient negative samples). As the number of negative samples increases, Acc, Spe and Pre rise slowly, Mcc and AUROC fluctuate randomly, AUPRC decreases continuously. It is worth noting that Sen decreases significantly when the ratio increases from 1:1 to 1:10. Because the purpose of our current research is to identify potential positive samples, the optimal ratio choice of positive and negative samples was set to 1:1
[Figure 4]
Prediction capability for new miRNA
Identifying potential miRNAs related to known diseases has an important role in drug target discovery and new drug development. To further verify the prediction capability of our method for disease-related potential miRNAs, we remove the miRNA from our dataset if the similarity is higher than a specific threshold. Here, four non-redundant datasets were constructed based on the different thresholds: 0.9, 0.8, 0.7 and 0.6 (When the threshold is lower than 0.6, the number of miRNAs is so small that lost its statistical significance). Results of five-fold cross-validation test based on the various non-redundant datasets were listed in Table 1. For the four non-redundant datasets, AUROC and AUPRC maintain stability at about 0.9962 and 0.9970. When the threshold is higher than 0.6, Acc, Sen, Spe, Pre and Mcc have a smaller fluctuations range (<1%). In summary, these results reveal that our method has a better robustness and insensitive for the redundancy of dataset, and has outstanding performance for identifying disease-associated potential miRNA.
Table 1 The results of five-fold cross-validation test from the different non-redundant datasets
Threshold
|
Acc(%)
|
Sen(%)
|
Spe(%)
|
Pre(%)
|
Mcc
|
AUROC
|
AUPRC
|
0.9
|
97.98
|
96.34
|
99.61
|
99.60
|
0.9601
|
0.9968
|
0.9974
|
0.8
|
97.87
|
96.42
|
99.32
|
99.30
|
0.9578
|
0.9955
|
0.9965
|
0.7
|
97.39
|
95.47
|
99.31
|
99.29
|
0.9486
|
0.9973
|
0.9978
|
0.6
|
97.57
|
95.35
|
99.80
|
99.80
|
0.9524
|
0.9950
|
0.9964
|
Identification ability for new diseases
Identifying potential diseases-associated with known miRNAs contributes to the study of pathological mechanisms of diseases. Similarly, we remove the disease from our dataset if its similarity value is greater than the threshold: 0.7, 0.6, 0.5, 0.4, 0.3 and 0.2, respectively. Please note that the highest similarity score is 0.7370 in our dataset, and only 290 positive samples remained when the threshold is set to 0.1. Finally, we constructed 6 non-redundant datasets, and the corresponding results were shown in Table 2. As the threshold decreases from 0.7 to 0.2, all indicators change slightly around a certain value. For example, Acc and Sen fluctuate slowly around the average of 97.81% and 96.34%. Even if the threshold is reduced to 0.2, our method still achieves Spe of 98.68%, Pre of 98.61%, Mcc of 0.9469, AUROC of 0.9934 and AUPRC of 0.9942. These results demonstrate that our method can identify potential miRNA-related diseases.
Table 2 The results of different non-redundant datasets
Threshold
|
Acc(%)
|
Sen(%)
|
Spe(%)
|
Pre(%)
|
Mcc
|
AUROC
|
AUPRC
|
0.7
|
97.89
|
96.40
|
99.37
|
99.35
|
0.9582
|
0.9958
|
0.9967
|
0.6
|
98.03
|
96.51
|
99.55
|
99.54
|
0.9609
|
0.9955
|
0.9963
|
0.5
|
97.97
|
96.52
|
99.43
|
99.41
|
0.9598
|
0.9974
|
0.9979
|
0.4
|
97.90
|
96.41
|
99.40
|
99.38
|
0.9550
|
0.9962
|
0.9975
|
0.3
|
97.73
|
96.13
|
99.31
|
99.30
|
0.9550
|
0.9977
|
0.9981
|
0.2
|
97.33
|
96.04
|
98.68
|
98.61
|
0.9469
|
0.9934
|
0.9942
|
Prediction capability for potential miRNA-disease associations
To further verify the robustness of our method, we construct a series of non-redundant datasets according to the different thresholds following these steps:: (1) Set a threshold, and randomly select a miRNA-disease association involved in the positive sample set. (2) Calculate the similarity between the selected association and other associations contained in the positive sample set. (3) Remove the selected miRNA-disease association if its similarity value is higher than the specific threshold, otherwise, keep it in the positive sample set. (4) Repeat step 3, until the similarity value of any two association pairs is lower than the threshold, and the obtained set is called as non-redundant positive sample set. (5) Randomly choose an association from reliable negative samples, and calculate its similarity values with each association in the positive sample set as well as in the negative sample set. (6) Delete the chosen association from the reliable negative samples if similarity values are higher than the threshold, otherwise, remain it. The obtained set is known as non-redundant negative sample set. (7) Repeat step 6, until the size of the non-redundant positive sample set and non-redundant negative sample set is equal. (8) Combine the non-redundant positive sample set with the non-redundant negative sample set to construct the non-redundant training dataset. Finally, we set two thresholds 0.9 and 0.8 to build the two non-redundant training datasets (Only 53 samples contained in non-redundant negative sample set when the threshold is set as 0.7, it is unable to construct the non-redundant training dataset with positive-negative ratio 1:1.). Results of five-fold cross-validation test are shown in Table 3. The Acc of 97.88%, Sen of 96.44%, Spe of 99.33%, Pre of 99.31%, Mcc of 0.9581, AUROC of 0.9964 and AUPRC of 0.9973 are obtained when the threshold is set to 0.9. The Acc, Sen, Spe, Pre, Mcc, AUROC and AUPRC only decrease 0.29%, 0.52%, 0.06%, 0.07%, 0.0057, 0.0023 and 0.0015, when change the threshold from 0.9 to 0.8. There results demonstrate that the current method has a prominent robustness and can identify potential miRNA-disease associations.
Table 3 The results of 5-fold cross-validation based on the two non-redundant miRNA-disease associations datasets.
Threshold
|
Acc(%)
|
Sen(%)
|
Spe(%)
|
Pre(%)
|
Mcc
|
AUROC
|
AUPRC
|
0.9
|
97.88
|
96.44
|
99.33
|
99.31
|
0.9581
|
0.9964
|
0.9973
|
0.8
|
97.59
|
95.92
|
99.27
|
99.24
|
0.9524
|
0.9941
|
0.9958
|
Comparison with other methods
We further compare our method with three existing models PBMDA [22], WBSMDA [18] and RLSMDA [24] by constructing a new dataset including 4119 known associations between 415 miRNAs and 327 diseases from HMDD (v2.0). For PBMDA, we downloaded the code from its supplementary material to obtain the prediction results. For WBSMDA and RLSMDA, we rewritten the code based on their articles. The three approaches have been confirmed to achieve excellent prediction accuracy. The ROC and PRC curves of various methods were shown in Fig. 6. After implemented five-fold cross-validation, Seq-SymRF achieved the average AUROC and ARPRC were 0.9640 and 0.9704. The AUROC and AUPRC of PBMDA, RLSMDA and WBSMDA were 0.9189 and 0.9211, 0.8328 and 0.7964, 0.8293 and 0.8346, respectively. The AUROC of current method was 4.51%, 13.12% and 13.47% higher than those of three models (PBMDA, RLSMDA and WBSMDA), and AUPRC was 4.93%, 17.40% and 13.94% improvement. In conclusion, Seq-SymRF is an effective and accurate method for predicting miRNA-disease associations.
[Figure 5]
Case Study
To further evaluate the prediction performance and demonstrate application ability of our method, we implement the following two case studies: (1) two common diseases breast neoplasms and leukemia are removed from our dataset and predict their association score with each miRNA. (2) We choose a miRNA hsa-mir-21 un-included in the dataset and predict its association score with each disease. Top 50 results of case study are validated by dbDEMC [30] or the literature.
Breast cancer is one of the most fatal diseases among female. According to the Global Cancer Statistics 2018 [31], female breast cancer is the second most common cancer, accounting for 18.6% of cancer deaths, behind lung cancer. MiRNA has practical effect on the treatment and diagnosis of breast cancer. For example, miRNA-223 can maintain cell proliferation of breast cancer cell through targeting FOXO1 [32] and let-7a can inhibit growth and migration of breast cancer cell by targeting HMGA1 [33]. Taking breast neoplasms as case study, our random forest model was implemented to prioritize candidate miRNAs. Table 4 lists the prediction score with different miRNA. From the table, we can see that 33 of top 50 potential related miRNAs were confirmed to be associated with breast neoplasms based on the dbDEMC database.
Table 4 here
Leukemia is a group of life-threatening malignant disorders of the blood and bone marrow. It is a dangerous disease which may occur at all ages, from the newborn to the very old [34]. As a special biomarker, miRNA also has a significant impression in leukemia treatment. The identification of potential association between miRNA and leukemia is very helpful in the development of therapeutic drugs. The prediction scores between leukemia and each candidate miRNA are listed in Table 5. In our model, 36 of top 50 potential related miRNAs were confirmed based on the dbDEMC database.
Table 5 here
Mir-21 is a well know miRNA, and it has been proved to be associated with many diseases, especially cancer. Its association scores with various diseases are predicted and listed in Table 6. We can see that 32 of top 50 potential associations were confirmed based on the literature.
Table 6 here