DEGdriver outperforms the cutting-edge approaches for identifying cancer driver genes
To assess its efficacy, we compared DEGdriver with seven cutting-edge approaches for driver gene prediction on 11 cancer datasets: DriverNet, DawnRank, MaxMIF, MutPanning, driverMAPS, WITER, and DriverML. We applied the methods to the TCGA datasets, using HumanNet-PI as the PPI network (Table S1-S3). DEGdriver's parameters were set to α=0.99 and β=1/3 (see Methods Section). DriverNet was run with version 1.34.0, the most recent version, released in 2021. Because MutPanning, driverMAPS, WITER, and DriverML do not rely on PPI networks, we retrieved their driver gene predictions directly from the Cancer Driver Catalog [9]. All predicted driver genes were evaluated against 729 cancer driver genes from the Cancer Gene Census (CGC) repository (https://cancer.sanger.ac.uk/cosmic) [43] and 3,347 cancer driver genes from the NCG 7.0 repository (http://ncg.kcl.ac.uk/download.php) [44].
We evaluated the approaches by comparing their AUC (area under the ROC curve) and AUPR (area under the PR curve). A receiver operating characteristic (ROC) curve plots the true positive rate (TPR, or recall) against the false positive rate (FPR) across the range of thresholds on the real-valued score. A precision-recall (PR) curve plots the precision (positive predictive value) against the recall for different thresholds. Precision is the fraction of predicted positive examples that are truly positive, and recall is the fraction of positive examples that are correctly labeled [45]. We used the genes common to the mutation dataset, the gene expression dataset, and the considered PPI network to plot the ROC and PR curves. We analyzed the top 100, 200, and 300 genes for each technique (Figure 2a-2l, Table S4-S15). For convenience, we refer to the AUC/AUPR calculated using the CGC/NCG cancer genes as the CGC/NCG AUC/AUPR. In the vast majority of cases, the AUC/AUPR distributions of DEGdriver and DriverNet across cancer types are superior to those of the other techniques.
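As an illustration of the two metrics defined above, the following minimal Python sketch computes an AUC and an AUPR for a ranked gene list labeled against a reference set such as CGC or NCG. It is not the evaluation code used in this study: the gene names, scores, and function names are hypothetical, the AUC uses the pairwise (rank-sum) formulation, and the AUPR is approximated by average precision, which is only one of several common estimators of the area under the PR curve.

```python
def labels_from_ranking(ranked_genes, reference):
    """Label each ranked gene 1 if it is in the reference set, else 0."""
    return [1 if g in reference else 0 for g in ranked_genes]


def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Sort indices by descending score; ties keep their input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy example with hypothetical genes and scores.
ranked = ["G1", "G2", "G3", "G4", "G5"]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
reference = {"G1", "G3", "G4"}  # stand-in for CGC or NCG membership
labels = labels_from_ranking(ranked, reference)
print(roc_auc(scores, labels), average_precision(scores, labels))
```

The pairwise AUC is quadratic in the number of genes, which is acceptable for a sketch; a rank-based implementation would be preferable for genome-scale lists.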
To further highlight DEGdriver's superiority, we recalculated each technique's AUPR as a relative score, normalizing by DriverNet's AUPR so that DriverNet's relative score is 1 for each cancer type (Table S16-S21). For the top 100, 200, and 300 genes, the relative CGC AUPR scores of DEGdriver were 1.119, 1.220, and 1.127, respectively, corresponding to improvements of 11.9%, 22.0%, and 12.7% over DriverNet. As shown in Table S16-S21, DEGdriver outperformed DriverNet by at least 10.0% in CGC AUPR for 4 cancer types (BRCA, LIHC, LUAD, THCA) with the top 100 genes, 4 cancer types (COAD, LIHC, LUSC, THCA) with the top 200 genes, and 7 cancer types (COAD, KIRC, LIHC, LUAD, LUSC, PRAD, THCA) with the top 300 genes.
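The relationship between a relative AUPR score and the reported percentage improvement can be sketched as follows. This is a minimal illustration that assumes the relative score is each method's AUPR divided by DriverNet's AUPR for the same cancer type; the raw AUPR values and function names below are hypothetical, and only the relative score 1.119 is taken from the text.

```python
def relative_scores(aupr_by_method, baseline="DriverNet"):
    """Divide every method's AUPR by the baseline's, so the baseline scores 1."""
    base = aupr_by_method[baseline]
    if base <= 0:
        raise ValueError("baseline AUPR must be positive")
    return {method: value / base for method, value in aupr_by_method.items()}


def percent_improvement(relative_score):
    """Convert a relative score into a percentage improvement over the baseline."""
    return (relative_score - 1.0) * 100.0


# Hypothetical raw AUPR values for one cancer type.
raw = {"DriverNet": 0.20, "DEGdriver": 0.25}
rel = relative_scores(raw)
print(rel["DEGdriver"], percent_improvement(rel["DEGdriver"]))

# The relative CGC AUPR of 1.119 reported for the top 100 genes
# corresponds to an improvement of about 11.9%.
print(round(percent_improvement(1.119), 1))
```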
We calculated the concordance scores of all approaches across 11 cancer types under CGC/NCG cancer genes for N=20, 50, 100, 200, and 300, shown as violin plots in Figures 2m-2n. The concordance score of the top N genes predicted by a technique is the percentage of CGC/NCG cancer genes among those N genes. Overall, DEGdriver outperformed the approaches being compared. In some cancer types, such as COAD, DEGdriver showed appreciably better precision and specificity than the other approaches, followed by DriverNet (Figure 2o, 2p). Specificity is the fraction of negative examples that are correctly labeled as negative.
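A minimal sketch of the concordance score, together with specificity in its standard sense TN/(TN+FP), is given below. The gene names, the choice of the full ranked list as the gene universe, and the function names are assumptions made for illustration; the score is returned as a fraction and can be multiplied by 100 to obtain a percentage.

```python
def concordance_score(ranked_genes, reference, n):
    """Fraction of the top-n ranked genes that belong to the reference set."""
    top = ranked_genes[:n]
    if not top:
        raise ValueError("n must select at least one gene")
    return sum(g in reference for g in top) / len(top)


def specificity(predicted, reference, universe):
    """TN / (TN + FP): share of non-reference genes that were not predicted."""
    negatives = universe - reference
    if not negatives:
        raise ValueError("universe contains no negative genes")
    true_negatives = negatives - predicted
    return len(true_negatives) / len(negatives)


# Toy example with hypothetical genes.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
reference = {"G1", "G3", "G6"}   # stand-in for CGC or NCG
universe = set(ranked)           # assumed background gene set

print(concordance_score(ranked, reference, 4))
print(specificity(set(ranked[:4]), reference, universe))
```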
Given the high performance of DEGdriver and DriverNet, we compared the concordance curves of the top N ranked genes predicted by each method for each cancer type under CGC and NCG cancer genes (Figure 3a, 3d). DEGdriver outperformed DriverNet for the majority of cancer types. Figures 3b and 3e show the numbers of recovered CGC and NCG cancer genes, respectively, and Figures 3c and 3f show the corresponding concordance scores of the top genes output by DEGdriver and DriverNet for each cancer type when N=20, 50, 100, 200, and 300.
To make a more intuitive comparison between DEGdriver and DriverNet, we denote by XN (resp. YN) the set of top N genes output by DEGdriver (resp. DriverNet) for a specific cancer type, and by sN (resp. tN) the number of NCG cancer genes that are in XN (resp. YN) but not in YN (resp. XN). For PRAD, we have s20 = 8, t20 = 3; s50 = 18, t50 = 15; s100 = 30, t100 = 21; s200 = 62, t200 = 46; s300 = 83, t300 = 53. For BRCA, we have s20 = 4, t20 = 2; s50 = 9, t50 = 9; s100 = 23, t100 = 18; s200 = 42, t200 = 30; s300 = 63, t300 = 39. For THCA, we have s20 = 8, t20 = 6; s50 = 17, t50 = 14; s100 = 33, t100 = 30; s200 = 52, t200 = 37; s300 = 71, t300 = 53. For LIHC, we have s20 = 8, t20 = 7; s50 = 18, t50 = 12; s100 = 32, t100 = 25; s200 = 53, t200 = 45; s300 = 71, t300 = 60. Tables S22-S25 summarize the comparisons for all cancer types under CGC/NCG cancer genes.
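The quantities sN and tN are set differences restricted to the reference genes, which the following minimal sketch makes explicit. The rankings and the reference set are hypothetical toy data, and the function name is introduced here for illustration only.

```python
def exclusive_counts(x_ranked, y_ranked, reference, n):
    """Return (s_n, t_n, shared_n) for the top-n genes of two rankings.

    s_n: reference genes in the top n of x but not in the top n of y.
    t_n: reference genes in the top n of y but not in the top n of x.
    shared_n: reference genes in the top n of both rankings.
    """
    x_top = set(x_ranked[:n])
    y_top = set(y_ranked[:n])
    s_n = len((x_top - y_top) & reference)
    t_n = len((y_top - x_top) & reference)
    shared_n = len(x_top & y_top & reference)
    return s_n, t_n, shared_n


# Toy rankings standing in for the outputs of DEGdriver (x) and DriverNet (y).
x_ranked = ["A", "B", "C", "D"]
y_ranked = ["B", "E", "F", "A"]
reference = {"A", "C", "E"}  # stand-in for NCG cancer genes

for n in (3, 4):
    print(n, exclusive_counts(x_ranked, y_ranked, reference, n))
```

The third value counts the reference genes recovered by both rankings, which is the quantity relevant when examining the overlap between the two methods' top N genes.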
We also examined the overlap between DEGdriver's top N genes and DriverNet's top N genes (Table S26-S29). The majority of the genes output by both methods were NCG cancer genes across all cancer types, and for each cancer type the number of NCG cancer genes predicted by both DEGdriver and DriverNet was always greater than the number of NCG cancer genes output by just one of them (Figure 3e). For instance, for KIRC, all genes predicted by both methods for N=50 except TTN, CFTR, HSP90AA1, PRPF8, and FN1 are NCG cancer genes; for HNSC, all genes predicted by both methods for N=50 except TTN, APP, CEP250, CFTR, and EIF4G1 are NCG cancer genes; and for LUAD, all genes predicted by both methods for N=50 except TTN, APP, FLNC, RGPD4, LAMA1, and RYR2 are NCG cancer genes. The results for the other cancer types are summarized in Table S30.
We further confirmed that DEGdriver is robust to its parameter settings. We set α=0.8, 0.9, 0.99 with β=1/3 fixed, and β=0.1, 0.2, 0.25, 1/3 with α=0.99 fixed. By analyzing all output genes, we determined the CGC/NCG AUC/AUPR (Figure 4, Table S31-S38). We use the NCG AUPR analysis as an illustration. In the first scenario, for all cancer types, the NCG AUPRs of DEGdriver are similar across the values of α and larger than those of DriverNet (Figure 4f, Table S34). We therefore use α=0.99 as the default value because of its low computational cost and running time. In the second scenario, the NCG AUPRs of DEGdriver are all similar for β=0.2, 0.25, and 1/3. The NCG AUPRs of DEGdriver are larger than those of DriverNet for all cancer types except COAD, BRCA, LUAD, and UCEC when β=0.1; for all except COAD when β=0.2 or β=0.25; and for all cancer types when β=1/3 (Figure 4h, Table S38). We set the default β to 1/3 since DEGdriver achieves the largest NCG AUPR at β=1/3 for the majority of cancer types.
To assess the effects of different PPI networks on the outcomes, we tested DEGdriver and DriverNet using each of four PPI networks: HumanNet-PI, its two component networks HumanNet-LC and HumanNet-HT, and Multinet. We evaluated all of the output genes to determine the CGC/NCG AUC/AUPR. DEGdriver exhibits advantages over DriverNet in practically all cases (Figure 5, Table S39-S54). We take the NCG AUPR analysis as an example. The NCG AUPRs of DEGdriver are greater than those of DriverNet for all cancer types on HumanNet-LC and HumanNet-HT (Figure 5n, 5o). Concretely, the NCG AUPRs of DEGdriver improved by at least 6.4% for THCA, PRAD, KIRC, HNSC, and UCEC on HumanNet-LC, and by at least 6.1% for THCA, LUSC, PRAD, BRCA, and LIHC on HumanNet-HT. On Multinet, DEGdriver's NCG AUPRs are greater than DriverNet's for nine cancer types (Figure 5p), and the NCG AUPRs of the two methods differ only marginally for the two cancer types COAD and LUAD. For all cancer types except HNSC, STAD, COAD, and LUAD, DEGdriver's NCG AUPR improved by more than 7.1%. For THCA and PRAD, the improvements were 16.4% and 15.9%, respectively.
DEGdriver is capable of identifying functional modules or pathways
DEGdriver provides both new therapeutic targets and new insights into the pathologic pathways behind cancer. To demonstrate this point, we compared DEGdriver and DriverNet, the two approaches that performed exceptionally well, on the cancer types prostate adenocarcinoma (PRAD) and liver hepatocellular carcinoma (LIHC). To keep the comparison fair, we considered only the top 20 genes produced by each approach.
We performed enrichment analysis in Metascape (http://metascape.org) [46] with the default settings. For each given gene list, pathway and process enrichment analysis was carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, Cell Type Signatures, CORUM, TRRUST, DisGeNET, PaGenBase, Transcription Factor Targets, WikiPathways, PANTHER Pathway, and COVID. The disease association analysis was carried out with DisGeNET [47]. The enriched pathway/process terms are clustered based on similarities in their gene membership, and the statistically most significant term is selected to represent each cluster.
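Over-representation of a term in a gene list is commonly scored with a hypergeometric tail probability. The sketch below is a generic illustration of that test and is not Metascape's implementation; the function name and all numbers in the usage example are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, term_size, list_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe_size, term_size, list_size).

    universe_size: number of background genes.
    term_size: number of background genes annotated to the term.
    list_size: number of genes in the query list (e.g. a top-20 list).
    overlap: number of query genes annotated to the term.
    """
    if not 0 <= overlap <= min(term_size, list_size):
        raise ValueError("overlap must lie between 0 and min(term_size, list_size)")
    total = comb(universe_size, list_size)
    upper = min(term_size, list_size)
    # Sum the probabilities of observing k = overlap, ..., upper annotated genes.
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 7 of 20 query genes fall in a 300-gene term
# drawn from a background of 20,000 genes.
print(hypergeom_enrichment_p(20000, 300, 20, 7))
```

A small p-value indicates that the term contains more query genes than expected by chance; in practice the p-values for all tested terms would also be adjusted for multiple testing.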
Prostate adenocarcinoma (PRAD)
For PRAD, 17 (12) of the top 20 genes output by DEGdriver (DriverNet) are NCG cancer genes. According to the pathway and process enrichment analyses, the enriched terms of the top 20 genes are grouped into 17 (17) clusters, respectively (Figure 6). The 17 representative enriched terms by DEGdriver include two Canonical Pathways terms, nine GO Biological Processes terms, one KEGG Pathway term, two Reactome Gene Sets terms, and three WikiPathways terms. The top five representative enriched terms are, in order, hsa05165: Human papillomavirus infection, GO:0010942: positive regulation of cell death, M145: PID P53 DOWNSTREAM PATHWAY, WP1984: Integrated breast cancer pathway, and R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers.
The 17 representative enriched terms by DriverNet include two Canonical Pathways terms, nine GO Biological Processes terms, one KEGG Pathway term, two Reactome Gene Sets terms, and three WikiPathways terms. The top five representative enriched terms are, in order, GO:0010564: regulation of cell cycle process, R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers, hsa05226: Gastric cancer, M261: PID P53 REGULATION PATHWAY and GO:0042176: regulation of protein catabolic process.
We identified only two representative enriched terms common to both methods: GO:0042176: regulation of protein catabolic process and R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers. GO:0042176 includes the genes APC, CDKN1B, EGFR, SMAD4, PTEN, HUWE1, and LRRK2 output by DEGdriver, and the genes CDKN1B, DDB1, SMAD4, PTEN, RPL11, and HUWE1 output by DriverNet. R-HSA-5663202 includes the genes APC, CDKN1B, CTNNB1, EGFR, EP300, SMAD4, PTEN, and AKAP9 output by DEGdriver, and the genes CDKN1B, CLTC, CTNNB1, SMAD4, MET, PTEN, and AKAP9 output by DriverNet. Further research on the terms GO:0042176 and R-HSA-5663202 may benefit PRAD diagnosis and care.
Enrichment analysis in DisGeNET indicates that the genes produced by both approaches are strongly enriched for many terms, such as C0007112: Adenocarcinoma of prostate, C1654637: androgen independent prostate cancer, C4722328: Hereditary Prostate Carcinoma, C1328504: Hormone refractory prostate cancer, C1708566: Invasive Prostate Carcinoma, C1282496: Metastasis from malignant tumor of prostate, C4721208: Metastatic castration-resistant prostate cancer, C0936223: Metastatic Prostate Carcinoma, and C1739135: Progression of prostate cancer. Some terms are enriched only by genes output by DEGdriver, such as C0278838: Prostate cancer recurrent, C2931456: Prostate cancer, familial, C4722327: PROSTATE CANCER, HEREDITARY, 1, C1853195: Prostate Cancer, Hereditary, 7, and C0033575: Prostatic Diseases. This analysis indicates that DEGdriver found a few disease-related genes that DriverNet had overlooked.
Liver hepatocellular carcinoma (LIHC)
For LIHC, 14 (13) of the top 20 genes output by DEGdriver (DriverNet) are NCG cancer genes. According to the pathway and process enrichment analysis, the enriched terms of the top 20 genes are grouped into 18 (19) clusters, respectively (Figure 7). The 18 representative enriched terms by DEGdriver consist of one Canonical Pathways term, ten GO Biological Processes terms, three Reactome Gene Sets terms, and four WikiPathways terms. The top five representative enriched terms are, in order, M145: PID P53 DOWNSTREAM PATHWAY, GO:0060341: regulation of cellular localization, WP399: Wnt signaling pathway and pluripotency, WP4879: Overlap between signal transduction pathways contributing to LMNA laminopathies, and GO:0051099: positive regulation of binding.
The 19 representative enriched terms by DriverNet consist of two Canonical Pathways terms, twelve GO Biological Processes terms, four Reactome Gene Sets terms, and one WikiPathways term. The top five representative enriched terms are, in order, M261: PID P53 REGULATION PATHWAY, GO:0051098: regulation of binding, GO:0090399: replicative senescence, GO:0070997: neuron death, and WP366: TGF-beta signaling pathway.
We found only two representative enriched terms common to both methods: GO:0060341: regulation of cellular localization and GO:0051099: positive regulation of binding. GO:0060341 includes the genes APC, CFTR, CLTC, CTNNB1, DMD, HTT, LRP1, RB1, CEP250, and LRRK2 output by DEGdriver, and the genes CFTR, CTNNB1, RB1, HUWE1, TRIM28, SIN3A, and LRRK2 output by DriverNet. GO:0051099 includes the genes APP, CTNNB1, EP300, LRP1, RB1, and LRRK2 output by DEGdriver, and the genes APP, CTNNB1, RB1, TRIM28, and LRRK2 output by DriverNet. The terms GO:0060341 and GO:0051099 need further research for their potential roles in the diagnosis and treatment of LIHC.
Enrichment analysis in DisGeNET indicates that the genes generated by both approaches are substantially enriched for many terms, including C0279607: Adult Hepatocellular Carcinoma, C0279606: Childhood Hepatocellular Carcinoma, C0086404: Experimental Hepatoma, C0206624: Hepatoblastoma, C2676033: Hepatoblastoma Caused By Somatic Mutation, C0206669: Hepatocellular Adenoma, C0019207: Hepatoma, Morris, C0019208: Hepatoma, Novikoff, and C0023904: Liver Neoplasms, Experimental. Some terms are enriched only by genes output by DEGdriver, such as C0334287: Fibrolamellar Hepatocellular Carcinoma, C0267792: Hepatobiliary disease, and C0861876: Recurrent Hepatocellular Carcinoma. This analysis indicates that DEGdriver found a few disease-related genes that DriverNet had missed.