In this study, we integrated CNA, DNA methylation, TF-gene interactions, gene, and miRNA expression datasets in the miRDriver tool to compute miRNA-gene interactions based on DNA copy number aberrated regions in eighteen different cancer types from TCGA. Table 1 shows the cohort sizes for each data modality, the number of all GISTIC regions, the count of trans genes in the LASSO step, and the computed miRNA-gene interactions in eighteen different cancer types.
Computed miRNAs were significantly enriched in the experimentally-validated oncogenic miRNAs. We performed a two-sided Fisher's exact test to check the association between the cancer-related miRNAs in OncomiRDB (see Materials and methods) and the computed miRNAs by miRDriver. For each cancer type, the background set in the Fisher's exact test consisted of all TCGA miRNAs used in the LASSO step (see Materials and methods) for that cancer type. For all cancer types, computed miRNAs were significantly enriched (Fisher's exact test p-value < 0.05) in the oncogenic miRNAs in OncomiRDB (Table 1).
Computed miRNA-gene interactions were enriched in the known miRNA-gene interactions. To check if the miRNA-gene interactions computed by miRDriver were significantly enriched in the known miRNA-gene interactions, we performed a hypergeometric test for each miRNA's computed target genes in each cancer type. We considered only those miRNAs that had at least one known target in the ground truth data (i.e., known miRNA-gene interactions) (see Materials and Methods) from the computed target list. We labeled them as "Eligible miRNAs" for the hypergeometric test. The background set, i.e., the hypergeometric test universe, was the set of all the trans genes in the HGNC symbol21 that were
common to the ground truth data. For fourteen cancer types, at least 50% of the "Eligible miRNAs" had significant enrichment (p-value < 0.05) (Table 2). The entire list of the computed miRNAs with individual hypergeometric p-values for all eighteen cancer types can be accessed in Supplemental Table S1.
miRDriver outperformed five state-of-the-art methods in inferring significant miRNA-gene interactions. We compared miRDriver with five state-of-the-art methods, namely, ARACNe, ProMISe, HiddenICP, idaFast and jointIDA, by running them on eighteen different cancer types from TCGA. For all these methods, we used gene expression data to compute miRNA-gene interaction networks for our comparison (see Materials and methods). We performed the hypergeometric test to measure each miRNA's computed targets' enrichment significance in the known miRNA-gene interaction data. We selected only "Eligible miRNAs" (i.e., miRNAs with at least one known target in the ground truth data) for this test. We computed the overlapping "Eligible miRNAs" for miRDriver and each comparable method. We checked if the count of the "Significant miRNAs"(i.e., miRNAs with target enrichment test p-value < 0.05) in miRDriver was more (i.e., miRDriver won), less (i.e., miRDriver lost), or equal (i.e., there was a draw) than the other method in the overlap. miRDriver had more "Significant miRNAs" than all other methods for most of the cancer types. For ACC, LUSC and THCA, miRDriver and the different methods had no common "Eligible miRNAs"; hence, we eliminated these three cancer types from this test. Table 3 summarizes the comparison results in all the cancer types. Table 4 presents the comparison results for ovarian cancer (OV) in detail with the number of "Eligible miRNAs" and "Significant miRNAs" in all the methods. For the detailed comparison with all the cancer types, see Supplemental Table S2.
Computed genes were enriched in biological pathways, cancer Hallmark and GO terms. To evaluate the functional roles of the computed target genes by miRDriver for each cancer type, we checked whether these genes were enriched in the biological pathways and GO terms20. For this purpose, we performed pathway enrichment analysis with the pathways in REACTOME22 and KEGG23 databases. For REACTOME pathway enrichment, we used R package Pathfinder24 and for KEGG pathways, Hallmark gene set from the MSigDB25,26 database and GO enrichment, we used R package clusterProfiler27. We selected the pathways and GO terms with significant enrichment (multiple tests corrected, i.e., adjusted p-value < 0.05). We found 213 unique REACTOME pathways spanning over seventeen cancer types, twelve unique KEGG pathways in twelve cancer types and 224 unique enriched GO terms spanning over fifteen cancer types. Table 5 shows the enriched pathways and GO terms that were common in multiple cancer types. We provided the entire list of enriched pathways and GO terms for all the cancer types in Supplemental Table S3. Among these pathways, "Immune System" related pathways were found to play essential roles in cancer28,29. The G protein-coupled receptors (GPCRs)-related REACTOME pathways such as "Signaling by GPCR", "GPCR ligand binding" and "GPCR downstream signalling", which were implicated in several cancer-related studies, were found to be enriched in the computed target genes in more than ten cancer types in our study. These pathways were found to play crucial roles in tumor development, invasion, migration, survival, and metastasis30,31. The GO terms, such as "receptor ligand activity" and "receptor regulator activity", enriched in at least five cancer types, were highlighted in several cancer studies for playing roles in drug toxicity, cell function, tumor growth32–34. The compted target genes in each cancer type were also enriched in the cancer Hallmark gene set (Table 6).
Furthermore, miRDriver computed 22 common miRNAs that were shared in at least eight different cancer types among eighteen total cancer types used in the study (Table 7). The targets of these miRNAs could regulate the common biological processes in cancer. Hence, we performed a GO enrichment test with 1,161 computed genes targeted by at least one of these 22 miRNAs among eighteen cancer types and found 49 GO terms with significant enrichment. Table 8 shows a few of these GO terms with their cancer related citations; the entire list can be found in Supplemental Table S4.
Although there were common miRNAs across multiple cancer types, there were not much common miRNA-gene interactions due to a much higher number of trans genes than the miRNAs in this pan-cancer analysis. Table 9 presents fourteen common gene-miRNA interactions shared in at least two cancer types among ~10,000 selected interactions from pan-cancer. Among these, RSPO3 and miR-22 interaction have been selected in LAML (leukemia) and LUAD (lung cancer). Interestingly, RSPO3 was found to play a role in leukemia35 and promote tumors in lung cancer36. miR-22 was found to play the anti-tumor role with therapeutic potential in acute myeloid leukemia37 and found to have roles in lung cancer via CNAs38. Another interaction PAX5 with miR-5699 was found in BLCA (bladder cancer) and OV (ovarian). Interestingly, PAX5 was found to have a role in bladder cancer39 and ovarian cancer40 as a co-regulator of PAX8. miR-5699 has a proven role in ovarian cancer treatment's oxidative response41. Another interaction, LINC01833- miR-1226, was found in BRCA (breast cancer) and LGG (brain cancer). LINC01833 was listed in the top five long non-coding RNA (lncRNA) according to the prioritization of variation in ER-negative-associated lncRNAs in breast cancer42. miR-1266 was found to target the expression of the mucin 1 oncoprotein and induces cell death in a breast cancer study43.
Several cancer-related terms and pathways were enriched in the targets of the computed miRNAs. We checked the involvement of the computed miRNAs in cancer-related pathways. For this analysis, we collected all 556 miRNAs that were computed by miRDriver in at least one of the cancer type. We collected the computed target genes for each of these miRNAs from all the cancer types where that miRNA was present. We performed cancer Hallmark gene set enrichment with these collected target genes of each miRNA. We found 38 unique enriched cancer Hallmark terms (adjusted p-values < 0.05) for 134 miRNAs (Supplemental Table S5).
We also performed REACTOME pathway enrichment analysis with these collected target genes of each miRNA. We found 240 unique enriched REACTOME pathways (adjusted p-values < 0.05) for 69 miRNAs with these target genes (Supplemental Table S5). Eleven of these enriched pathways, such as, "Epithelial-Mesenchymal Transition", "Hypoxia", "Inflammatory Response", "KRAS Signaling Up", "p53 Pathway", "P13 AKT MTOR Signaling", "Xenobiotic Metabolism", "Apoptosis", "DNA Repair" and "Immune" were present in nineteen experimentally-validated cancer-related pathways for miRNAs54. Furthermore, we performed an analysis to find cancer-driving miRNAs (i.e., tumor-suppressor, oncogenes or both) using the enriched cancer Hallmark terms (Supplemental Table S5). We hypothesized that a miRNA could be a candidate cancer-driving miRNA if its target genes that were found to be enriched in the cancer Hallmark terms could also be enriched in the known cancer-driving genes. Hence, for each of the enriched cancer Hallmark terms, we gathered all the miRNAs with their target genes for which that term was enriched (Table 10). We downloaded a list of 83 cancer-driving genes found to be frequently mutated in different cancer types from the Catalogue Of Somatic Mutation In Cancer (COSMIC) database from the cancer gene census project55. We performed a hypergeometric test for the overlapping target genes with the 83 cancer-driving genes for each cancer Hallmark term. The background gene set for this test was all 5,604 the target genes computed by miRDriver in pan-cancer. We considered the miRNAs related to the hypergeometric p-values < 0.05 as the candidate miRNAs to be evaluated as cancer-driving miRNAs since their targets were enriched in known cancer-driving genes. Furthermore, considering the fact that the up or down-regulation of a miRNA causes the inverse regulation of it's target genes56–58, we specifically checked the target genes of these candidate miRNAs for different cancer types that were found to have negative LASSO regression coefficient computed by miRDriver (Table 11). Interestingly, all of the target genes in this group (Table 11), except OLIG2, were found to be oncogene in the previous studies59–65. OLIG2 was found to be working as a tumor-suppressor gene (TSG) in human glioblastoma66. All the miRNAs except miR-5001 and miR-2276 were found to act as TSGs in cancer in several studies67–71. miR-5001 and miR-2276 were found to have evidence of working as oncogenes in endometrial cancer and colorectal cancer, respectively72,73. These studies support the findings of miRDriver in terms of connecting miRNAs and genes that were related inversely, having a possibility to be working as drivers in pairs of TSG-oncogene in different cancer types.
Computed target genes revealed the subtype-specific expression signature in multiple cancer types. We checked the subtype-specific association of gene expression of computed target genes in BRCA, LGG, LUSC and PAAD. We used the R package TCGAbiolinks74 to download the different subtype labels for the different cancer types. Since TPM (transcript per million reads) values are normalized and comparable across samples, for this analysis, we utilized RNA-Seq data in TPM of TCGA samples whose subtype labels were available. We applied log2(TPM + 1) transformation from Cancer Dependency Map [https://depmap.org]. For all these cancer types, we performed unsupervised clustering using gene expression of these target genes and compared these clusters with baseline (i.e., known) subtype clusters using Rand Index (RI) and Uniform Manifold Approximation and Projection (UMAP)75 plots.
For BRCA, we computed a UMAP plot using around 1,000 BRCA samples and 106 high-degree genes (i.e., computed genes targeted by more than three miRNAs) to check the PAM50 gene-based subtypes76. These subtypes were, Basal-like (BL), HER2-enriched (HER2+), LuminalA (LA), LuminalB (LB) and Normal-like (NL) (Figure 2A). We also computed the UMAP plot using the PAM50 genes with PAM50 gene-based subtypes (Figure 2B). These UMAP plots show a clear separation between different subtype-specific clusters. We also performed an unsupervised clustering (k-means) (with R base package Stats with k = 5 and all other parameters as default) on the BRCA cohort with high-degree target genes (Figure 2C) and with PAM50 genes (Figure 2D). The computed RIs between five known subtype labels with the five predicted clusters by computed high-degree target genes and PAM50 genes were 0.74 and 0.82, respectively. This result shows that both the computed high-degree target genes and PAM50 gene set were able to detect subtype structure in BRCA samples with high accuracy.
Furthermore, we used the high-degree genes to classify the BRCA cohort into five different classes. For this purpose, we used R package keras (https://github.com/rstudio/keras) implementation of the Random Forest classifier with 80% samples for training with 10-fold cross-validation where 20% of data was held out to test the performance of the model. We achieved a high classification accuracy of 0.86. The same sample cohort was classified with PAM50 genes and achieved a classification accuracy of 0.89. Figure 2E and Figure 2F present the classification matrices for both the cases with F1 scores. The F1 scores for the classification with high-degree target genes were comparable to F1 scores of the PAM50-based classification, which suggests that these high-degree target genes can serve as potential markers for PAM50-based subtype signatures in BRCA.
For the other cancer types except for LGG, we computed UMAP plots to check the baseline subtype clusters with the selected high-degree target genes. For these cancer types, since there was a fewer number of genes targeted by more than three miRNAs, we defined high-degree genes as the genes targeted by more than two miRNAs. For LGG, we used 402 samples with all 151 computed target genes since no gene was targeted by multiple miRNAs (Figure 2G). For LUSC, we used 178 patient samples with 75 high-degree target genes (Figure 2H), and in PAAD, we used 150 patient samples with 101 selected high-degree target genes (Figure 2I). We also performed k-means clustering for all these cancer types. For LGG, LUSC and PAAD, the computed RIs between known subtype clusters with the predicted clusters were 0.71, 0.62 and 0.70, respectively. For LGG and PAAD in which we achieved high RI values, we visualized clear separation among the known subtype-specific clusters based on UMAP plots. For LUSC, although we achieved lower RI value, the "Basal" cluster was separated from other clusters (Figure 2H). These results showed that the computed high-degree target genes could reveal subtype-specific expression signatures in multiple cancer types.
Computed miRNAs were found to be potential biomarkers for patients' survival and progression of the disease in each cancer type. We performed survival analysis with the computed miRNAs to assess the miRNAs' prognostic relevance as clinical biomarkers for patients' survival (Figure 3). For each miRNA, we divided the patient cohort of each cancer type into two groups, such as high expression and low expression for that miRNA. We considered the available clinical variables among age, race, gender, stage, and grade as independent variables (see Materials and Methods). To remove the confounding effect of multiple factors, we used the Adjusted Kaplan-Meier Estimator and computed adjusted survival curves by weighting the individual contributions by the inverse probability weighting (IPW) using the R package IPWsurvival79. We considered four different survival endpoints, namely, Overall Survival (OS), Progression Free Interval (PFI), Disease Specific Survival (DSS) and Disease Free Interval (DFI) (see Materials and Methods). We found several prognostic miRNAs (adjusted log-rank test p-value < 0.05) based on Adjusted Kaplan-Meier survival plots in multiple cancer types. Figure 3 shows the survival plots for the common miRNAs in different cancer types. Among 22 common miRNAs (Table 7), eighteen had significant survival differences in high and low miRNA expression patient groups in at least one cancer type (Figure 3). We provided the survival plots for all miRNAs for eighteen cancer types in Supplemental Figure S1-S18.
miRDriver discovered several cancer-specific miRNAs. In this study, miRDriver discovered 240 cancer-specific miRNAs, i.e., these miRNAs were selected in only one cancer type. We used the R package OncoScore80 to measure these miRNAs' association with cancer based on citation frequencies in cancer-related biomedical literature. Fifty percent of these miRNAs (i.e., 121) were found to be cited in cancer-related studies (Supplemental Table S6). Moreover, several of these miRNAs were found to be prognostic, i.e., associated with patients' survival based on Adjusted Kaplan-Meier survival analysis (adjusted log-rank test p-value < 0.05) (Table 12).
Selected high-degree genes were highly significant as potential biomarkers to predict prognosis in cancer patients than low-degree genes in several cancer types. We computed the hazard ratio (HR) of the selected high-degree target genes as the genes targeted by four or more miRNAs and low-degree target genes as the genes targeted by only one miRNA to get optimized list of high-degree and low-degree genes. We performed by the multivariate Cox regression analysis81 using these genes. Due to the low sample size of the high-degree target genes, we computed effect size using the r-value of the Mann-Whitney test with |ln (HR)|. Higher |ln (HR)| implies a higher association with an event's risk with an increase or decrease of gene expression. The r-value was negative if the |ln (HR)| values in the high-degree group were higher than the low-degree group and positive otherwise. We used OS, PFI, DSS and DFI as clinical endpoints in this analysis. We ran this analysis on fifteen different cancer types omitting the cancer types with no high-degree target gene (THCA and PRAD) and no clinical endpoint (LAML). In our previous work17 with BRCA and OV, we discussed the significance of high-degree target genes; hence, we omitted these two cancer types as well, leaving us thirteen cancer types for this analysis. Although the Wilcoxon rank-sum test p-values for the comparison between the boxplots of the two groups were insignificant ( > 0.05), we found negative r-values in most of the cancer types (see Figure 4). The hazard ratio boxplots of all thirteen cancer types with r-values in different clinical endpoints can be found in Supplemental Figure S19-S22. Table 13 shows the high-degree target genes with OS in seven cancer types that had negative r-values. These genes were found to be cited in cancer-related work in a high percentage (≥ 50%) among total citations in biomedical literature by OncoScore. The entire list of high-degree genes with OncoScore frequencies has been provided in Supplemental Table S7.