Prediction of cancer driver genes through integrated analysis of differentially expressed genes at the individual level

DOI: https://doi.org/10.21203/rs.3.rs-1982883/v1

Abstract

Driver mutations are anticipated to change the gene expression of their related or interacting partners, or cognate proteins. We introduce DEGdriver, a novel method that can discriminate between mutations in drivers and passengers by utilizing gene differential expression at the individual level. Tested on eleven TCGA cancer datasets, DEGdriver substantially outperforms cutting-edge approaches in distinguishing driver genes from passengers and exhibits robustness to varying parameters and protein-protein interaction networks. We further show, through enrichment analysis, that DEGdriver is capable of identifying functional modules or pathways in addition to novel driver genes.

Background

Cancer, a complex hereditary disease, is a severe threat to public health. The Cancer Genome Atlas (TCGA) [1] and other large-scale cancer genomics sequencing initiatives have produced whole genome and transcriptome data for many cancer types. Distinguishing functional cancer-causing driver mutations from stochastic passenger mutations is a crucial step in making sense of the genomic variation data [2, 3]. Early efforts focused on identifying frequently mutated genes [4-6]. Because most mutated genes occur at low population frequencies, these techniques are hampered by less frequently mutated driver genes and the long-tail phenomenon of genomic variation data [7, 8].

Based on cancer mutation data, significant efforts have been made to discover driver mutations and genes. These techniques can be divided into three categories: frequency-based, function-based, and network-based [9]. Frequency-based approaches, such as MutSig2.0 [10], MutSigCV [11], InVEx [12], MuSiC [13], and the more recent driverMAPS [14], WITER [15], and DriverML [16], identify genes whose mutation rates are significantly higher than the estimated background mutation rate. Accurately estimating the background mutation rate remains a challenge for these approaches. Function-based approaches, such as Mutation Assessor [17], CHASM [18], transFIC [19], FATHMM [20], and MutPanning [21], discover driver mutations by evaluating the functional impact of missense mutations. Network-based approaches, such as MUFFINN [22] and MaxMIF [23], use protein-protein interaction (PPI) networks to rank putative driver genes.

Numerous algorithms have been developed to detect significantly altered pathways [24, 25], but their applicability is constrained by our incomplete knowledge of pathways [26, 27]. Dendrix [28], a linear function that balances the coverage and mutual exclusivity of a gene module, was proposed, subsequently improved [28-30], and further generalized to Multi-Dendrix [31] and CoMDP [32] for the evaluation of multiple gene modules. The goal of HotNet [33] and HotNet2 [34] is to locate subnetworks with notably high mutation coverage. MEMo [35] and MEMCover [36] were developed to identify mutually exclusive gene modules with PPI networks involved, but they suffer from excessive computational complexity. To reduce the complexity, we proposed CovEx [37], which solves a series of subproblems derived from the PPI network, UniCovEx [38], which balances mutual exclusivity and enhances prediction accuracy, and ComCovEx [39], which evaluates cancer associations and identifies shared driver gene modules between cancers.

Several algorithms incorporate gene expression data to improve the identification of driver mutations and genes. For convenience, we refer to these techniques as expression-based. The expression of genes in the same pathway has been observed to be more highly correlated than that of genes in different pathways. Following this observation, a new function was suggested [29] that integrates Dendrix with the Pearson correlation coefficient of gene expression. Since driver mutations are anticipated to change the gene expression of their cognate proteins and of their interacting or functionally related partners, DriverNet [40], which has been widely used to analyze driver mutations and gene expression, was developed by integrating PPI networks, mutation data, and gene expression data. However, as stated by the DriverNet authors themselves [40], DriverNet might not be able to catch less dramatic but significant changes in individual expression. DawnRank [41] ranks putative driver genes in a personalized fashion according to their influence on the differential expression of downstream genes in the molecular interaction network.

To address the shortcomings of DriverNet, we developed DEGdriver, which incorporates RankComp [42], an individual-level gene expression analysis tool (Fig. 1). RankComp uses the disrupted ordering of gene expression in individual disease samples to identify genes that are differentially expressed in those samples. Tested on 11 cancer datasets annotated by the TCGA project and compared with 7 state-of-the-art algorithms, DEGdriver shows its superiority in identifying cancer driver genes and its robustness to both parameters and PPI networks. Additionally, enrichment analyses show that DEGdriver can discover novel functional modules or pathways that other tools missed.

Results

DEGdriver outperforms the cutting-edge approaches for identifying cancer driver genes

To assess its efficacy, we thoroughly compared DEGdriver with 7 cutting-edge approaches for driver gene prediction on 11 cancer datasets: DriverNet, DawnRank, MaxMIF, MutPanning, driverMAPS, WITER, and DriverML. The methods were applied to the TCGA datasets using the PPI network HumanNet-PI (Table S1-S3). The parameters of DEGdriver were set to α=0.99 and β=1/3 (see Methods Section). DriverNet was run with its most recent version, 1.34.0, released in 2021. Because MutPanning, driverMAPS, WITER, and DriverML do not rely on PPI networks, their driver gene predictions were retrieved directly from the Cancer Driver Catalog [9]. All predicted driver genes were evaluated against 729 cancer driver genes from the Cancer Gene Census (CGC) repository (https://cancer.sanger.ac.uk/cosmic) [43] and 3,347 cancer driver genes from the NCG 7.0 repository (http://ncg.kcl.ac.uk/download.php) [44].

We evaluated the approaches by comparing their AUC (area under the ROC curve) and AUPR (area under the PR curve). A receiver operating characteristic (ROC) curve is a plot of the true positive rate (TPR or recall) against the false positive rate (FPR) across the range of thresholds for the real-valued marker or feature at hand. A precision-recall (PR) curve is a plot of the precision (positive predictive value) against the recall for different thresholds. Precision is the fraction of examples predicted as positive that are truly positive. Recall is the fraction of positive examples that are correctly labeled [45]. We used the genes common to the mutation dataset, the gene expression dataset, and the considered PPI network to plot the ROC and PR curves. We analyzed the top 100, 200, and 300 genes for each technique (Figure 2a-2l, Table S4-S15). For convenience, we refer to the AUC/AUPR calculated using the CGC/NCG cancer genes as the CGC/NCG AUC/AUPR. In the vast majority of instances, the distributions of AUC/AUPR for DEGdriver and DriverNet across cancer types are superior to those of the alternative techniques.
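As an illustration of the two metrics defined above, the following minimal sketch computes AUC and AUPR for a scored gene list in plain Python. The function names, the toy scores and labels, and the use of average precision as the AUPR estimate are assumptions made here for illustration; the text does not state which implementation was used.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise AUPR estimate: mean precision at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Toy example: label 1 marks a gene found in the reference set (e.g. CGC).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]
print(roc_auc(toy_scores, toy_labels), average_precision(toy_scores, toy_labels))
```

The pairwise AUC is quadratic in the number of genes, which is acceptable for a few hundred top-ranked genes but would be replaced by a rank-based formula on larger inputs.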

To further highlight DEGdriver's superiority, we recalculated each method's AUPR relative to DriverNet by normalizing DriverNet's AUPR to 1 for each cancer type (Table S16-S21). For the top 100, 200, and 300 genes, the relative CGC AUPR scores of DEGdriver were 1.119, 1.220, and 1.127, that is, improvements of 11.9%, 22.0%, and 12.7% over DriverNet, respectively. As shown in Table S16-S21, the CGC AUPR of DEGdriver exceeded that of DriverNet by at least 10.0% in 4 cancer types (BRCA, LIHC, LUAD, THCA) for the top 100 genes, in 4 cancer types (COAD, LIHC, LUSC, THCA) for the top 200 genes, and in 7 cancer types (COAD, KIRC, LIHC, LUAD, LUSC, PRAD, THCA) for the top 300 genes.
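The relative score is a simple normalization, sketched below. The function names and the AUPR values in the usage example are hypothetical; only the conversion from a relative score such as 1.119 to an 11.9% improvement follows the text.

```python
def relative_scores(aupr_by_method, baseline="DriverNet"):
    """Divide every method's AUPR by the baseline's AUPR, so the baseline scores 1."""
    base = aupr_by_method[baseline]
    if base <= 0:
        raise ValueError("baseline AUPR must be positive")
    return {method: value / base for method, value in aupr_by_method.items()}


def percent_improvement(relative_score):
    """Convert a relative score into a percentage improvement over the baseline."""
    return (relative_score - 1.0) * 100.0


# Hypothetical AUPR values for one cancer type.
rel = relative_scores({"DriverNet": 0.20, "DEGdriver": 0.25})
print(rel["DEGdriver"], percent_improvement(rel["DEGdriver"]))
```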

We calculated the concordance scores of all approaches across 11 cancer types under CGC/NCG cancer genes for N=20, 50, 100, 200, and 300, as shown by the violin plots in Figures 2m-2n. The concordance score of the top N genes predicted by a technique is the percentage of CGC/NCG cancer genes among those N genes. Overall, DEGdriver outperformed the approaches being compared. In some cancer types, such as COAD, DEGdriver showed appreciably higher precision and specificity than the other approaches, followed by DriverNet (Figure 2o, 2p). Specificity is the fraction of negative examples that are correctly predicted as negative (the true negative rate).
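The concordance score defined above can be sketched in a few lines. The function name and the toy reference set are assumptions for illustration; dividing by the number of genes actually returned (when a method outputs fewer than N genes) is also an assumption not stated in the text.

```python
def concordance_score(ranked_genes, reference_genes, n):
    """Fraction of the top-n ranked genes that appear in the reference set."""
    top = ranked_genes[:n]
    if not top:
        raise ValueError("no genes to evaluate")
    hits = sum(1 for gene in top if gene in reference_genes)
    return hits / len(top)


# Toy ranking and toy reference set (not the real CGC or NCG lists).
ranking = ["PTEN", "TTN", "APC", "CTNNB1", "CFTR"]
reference = {"PTEN", "APC", "CTNNB1"}
print(concordance_score(ranking, reference, 3))
print(concordance_score(ranking, reference, 5))
```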

Given the high performance of DEGdriver and DriverNet, we compared the concordance curves of the top N ranked genes predicted by each method for each cancer type under CGC and NCG cancer genes (Figure 3a, 3d). DEGdriver outperformed DriverNet for the majority of cancer types. Figures 3b and 3e show the numbers of recovered CGC and NCG cancer genes, respectively, and Figures 3c and 3f show the corresponding concordance scores of the top genes output by DEGdriver and DriverNet for each cancer type when N=20, 50, 100, 200, and 300.

To make a more intuitive comparison between DEGdriver and DriverNet, we denote by XN (resp. YN) the set of top N genes output by DEGdriver (resp. DriverNet) for a specific cancer type, and by sN (resp. tN) the number of NCG cancer genes in XN (resp. YN) but not in YN (resp. XN). For PRAD, s20 = 8, t20 = 3; s50 = 18, t50 = 15; s100 = 30, t100 = 21; s200 = 62, t200 = 46; s300 = 83, t300 = 53. For BRCA, s20 = 4, t20 = 2; s50 = 9, t50 = 9; s100 = 23, t100 = 18; s200 = 42, t200 = 30; s300 = 63, t300 = 39. For THCA, s20 = 8, t20 = 6; s50 = 17, t50 = 14; s100 = 33, t100 = 30; s200 = 52, t200 = 37; s300 = 71, t300 = 53. For LIHC, s20 = 8, t20 = 7; s50 = 18, t50 = 12; s100 = 32, t100 = 25; s200 = 53, t200 = 45; s300 = 71, t300 = 60. Tables S22-S25 summarize the comparisons for all cancer types under CGC/NCG cancer genes.
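The counts sN and tN are set differences restricted to the reference genes. The sketch below uses a hypothetical function name and toy gene lists; it is not the authors' code.

```python
def exclusive_hits(x_ranked, y_ranked, reference_genes, n):
    """Return (s_n, t_n): reference genes found only in X's top n, and only in Y's top n."""
    x_top, y_top = set(x_ranked[:n]), set(y_ranked[:n])
    s_n = len((x_top - y_top) & reference_genes)
    t_n = len((y_top - x_top) & reference_genes)
    return s_n, t_n


# Toy rankings and toy reference set, for illustration only.
x_list = ["PTEN", "APC", "TTN", "EGFR"]
y_list = ["PTEN", "MET", "CFTR", "DDB1"]
toy_reference = {"PTEN", "APC", "EGFR", "MET"}
print(exclusive_hits(x_list, y_list, toy_reference, 4))
```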

We also examined the overlap between the top N genes of DEGdriver and those of DriverNet (Table S26-S29). Across all cancer types, the majority of the genes predicted by both methods were NCG cancer genes, and the number of NCG cancer genes predicted by both methods for each cancer type was always greater than the number predicted by only one of them (Figure 3e). For instance, for KIRC, all genes predicted by both methods for N=50 except TTN, CFTR, HSP90AA1, PRPF8, and FN1 are NCG cancer genes; for HNSC, all except TTN, APP, CEP250, CFTR, and EIF4G1; and for LUAD, all except TTN, APP, FLNC, RGPD4, LAMA1, and RYR2. The results for the other cancer types are summarized in Table S30.

We further confirmed that DEGdriver is robust to its parameters. We varied α over 0.8, 0.9, and 0.99 with β=1/3 fixed, and β over 0.1, 0.2, 0.25, and 1/3 with α=0.99 fixed. By analyzing all output genes, we determined the CGC/NCG AUC/AUPR (Figure 4, Table S31-S38). We use the NCG AUPR analysis as an illustration. In the first scenario, the NCG AUPRs of DEGdriver are similar across the values of α and larger than those of DriverNet for all cancer types (Figure 4f, Table S34). We therefore use α=0.99 as the default because of its lower computation and running time. In the second scenario, the NCG AUPRs of DEGdriver are all similar for β=0.2, 0.25, and 1/3. The NCG AUPRs of DEGdriver are larger than those of DriverNet for all cancer types except COAD, BRCA, LUAD, and UCEC when β=0.1; for all except COAD when β=0.2 or β=0.25; and for all cancer types when β=1/3 (Figure 4h, Table S38). We set the default of β to 1/3 since DEGdriver achieves the largest NCG AUPR at β=1/3 for the majority of cancer types.

To assess the effects of different PPI networks on the outcomes, we tested DEGdriver and DriverNet on each of four PPI networks: HumanNet-PI, its two component networks HumanNet-LC and HumanNet-HT, and Multinet. We evaluated all of the output genes to determine the CGC/NCG AUC/AUPR. DEGdriver exhibits advantages over DriverNet in practically all cases (Figure 5, Table S39-S54). We take the NCG AUPR analysis as an example. The NCG AUPRs of DEGdriver are greater than those of DriverNet for all cancer types on HumanNet-LC and HumanNet-HT (Figure 5n, 5o). Concretely, the NCG AUPRs of DEGdriver improved by at least 6.4% for THCA, PRAD, KIRC, HNSC, and UCEC on HumanNet-LC, and by at least 6.1% for THCA, LUSC, PRAD, BRCA, and LIHC on HumanNet-HT. On Multinet, the NCG AUPRs of DEGdriver are greater than those of DriverNet for nine cancer types (Figure 5p), and the two methods differ only marginally for the remaining two cancer types, COAD and LUAD. With the exception of HNSC, STAD, COAD, and LUAD, the NCG AUPR of DEGdriver improved by more than 7.1% across all cancer types. For THCA and PRAD, the improvement reached 16.4% and 15.9%, respectively.

DEGdriver is capable of identifying functional modules or pathways

DEGdriver provides both new therapeutic targets and new insights into the pathologic pathways behind cancer. To demonstrate this point, we contrasted DEGdriver and DriverNet, the two best-performing approaches, on prostate adenocarcinoma (PRAD) and liver hepatocellular carcinoma (LIHC). To keep the comparison fair, we considered only the top 20 genes produced by each approach.

We performed enrichment analysis in Metascape [http://metascape.org] [46] with the default settings. For each given gene list, pathway and process enrichment analysis was carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, Cell Type Signatures, CORUM, TRRUST, DisGeNET, PaGenBase, Transcription Factor Targets, WikiPathways, PANTHER Pathway, and COVID. The disease association analysis was carried out with DisGeNET [47]. The enriched pathway/process terms are clustered based on similarities in gene membership, and the statistically most significant term is selected to represent each cluster.

Prostate adenocarcinoma (PRAD) 

Of the top 20 genes predicted by DEGdriver (DriverNet) for PRAD, 17 (12) are NCG cancer genes. According to the pathway and process enrichment analyses, the enriched terms of the top 20 genes are grouped into 17 (17) clusters, respectively (Figure 6). The 17 representative enriched terms by DEGdriver include two Canonical Pathways terms, nine GO Biological Processes terms, one KEGG Pathway term, two Reactome Gene Sets terms, and three WikiPathways terms. The top five representative enriched terms are, in order, hsa05165: Human papillomavirus infection, GO:0010942: positive regulation of cell death, M145: PID P53 DOWNSTREAM PATHWAY, WP1984: Integrated breast cancer pathway, and R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers.

The 17 representative enriched terms by DriverNet include two Canonical Pathways terms, nine GO Biological Processes terms, one KEGG Pathway term, two Reactome Gene Sets terms, and three WikiPathways terms. The top five representative enriched terms are, in order, GO:0010564: regulation of cell cycle process, R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers, hsa05226: Gastric cancer, M261: PID P53 REGULATION PATHWAY and GO:0042176: regulation of protein catabolic process. 

We identified only two representative enriched terms common to the two methods: GO:0042176: regulation of protein catabolic process and R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers. GO:0042176 includes the genes APC, CDKN1B, EGFR, SMAD4, PTEN, HUWE1, and LRRK2 output by DEGdriver, and the genes CDKN1B, DDB1, SMAD4, PTEN, RPL11, and HUWE1 output by DriverNet. R-HSA-5663202 includes the genes APC, CDKN1B, CTNNB1, EGFR, EP300, SMAD4, PTEN, and AKAP9 output by DEGdriver, and the genes CDKN1B, CLTC, CTNNB1, SMAD4, MET, PTEN, and AKAP9 output by DriverNet. The diagnosis and care of PRAD may benefit from further research on the terms GO:0042176 and R-HSA-5663202.

Enrichment analysis in DisGeNET indicates that the genes produced by each approach are both strongly enriched for many terms, such as C0007112: Adenocarcinoma of prostate, C1654637: androgen independent prostate cancer, C4722328: Hereditary Prostate Carcinoma, C1328504: Hormone refractory prostate cancer, C1708566: Invasive Prostate Carcinoma, C1282496: Metastasis from malignant tumor of prostate, C4721208: Metastatic castration-resistant prostate cancer, C0936223: Metastatic Prostate Carcinoma and C1739135: Progression of prostate cancer. Some terms are enriched by genes output by DEGdriver only, such as C0278838: Prostate cancer recurrent, C2931456: Prostate cancer, familial, C4722327: PROSTATE CANCER, HEREDITARY, 1, C1853195: Prostate Cancer, Hereditary, 7 and C0033575: Prostatic Diseases. The analysis above indicates that DEGdriver did find a few disease-related genes that DriverNet had overlooked. 

Liver hepatocellular carcinoma (LIHC)

Of the top 20 genes predicted by DEGdriver (DriverNet) for LIHC, 14 (13) are NCG cancer genes. The pathway and process enrichment analysis groups the enriched terms of the top 20 genes into 18 (19) clusters (Figure 7). The 18 representative enriched terms by DEGdriver consist of one Canonical Pathways term, ten GO Biological Processes terms, three Reactome Gene Sets terms, and four WikiPathways terms. The top five representative enriched terms are, in order, M145: PID P53 DOWNSTREAM PATHWAY, GO:0060341: regulation of cellular localization, WP399: Wnt signaling pathway and pluripotency, WP4879: Overlap between signal transduction pathways contributing to LMNA laminopathies, and GO:0051099: positive regulation of binding.

The 19 representative enriched terms by DriverNet consist of two Canonical Pathways terms, twelve GO Biological Processes terms, four Reactome Gene Sets terms, and one WikiPathways term. The top five representative enriched terms are, in order, M261: PID P53 REGULATION PATHWAY, GO:0051098: regulation of binding, GO:0090399: replicative senescence, GO:0070997: neuron death, and WP366: TGF-beta signaling pathway. 

We found only two representative enriched terms common to the two methods: GO:0060341: regulation of cellular localization and GO:0051099: positive regulation of binding. GO:0060341 includes the genes APC, CFTR, CLTC, CTNNB1, DMD, HTT, LRP1, RB1, CEP250, and LRRK2 output by DEGdriver, and the genes CFTR, CTNNB1, RB1, HUWE1, TRIM28, SIN3A, and LRRK2 output by DriverNet. GO:0051099 includes the genes APP, CTNNB1, EP300, LRP1, RB1, and LRRK2 output by DEGdriver, and the genes APP, CTNNB1, RB1, TRIM28, and LRRK2 output by DriverNet. The terms GO:0060341 and GO:0051099 need further research for their potential roles in the diagnosis and treatment of LIHC.

Enrichment analysis in DisGeNET indicates that the genes generated by each approach are both substantially enriched for many terms, including C0279607: Adult Hepatocellular Carcinoma, C0279606: Childhood Hepatocellular Carcinoma, C0086404: Experimental Hepatoma, C0206624: Hepatoblastoma, C2676033: Hepatoblastoma Caused By Somatic Mutation, C0206669: Hepatocellular Adenoma, C0019207: Hepatoma, Morris, C0019208: Hepatoma, Novikoff, and C0023904: Liver Neoplasms, Experimental. Some terms are enriched by genes output by DEGdriver only, such as C0334287: Fibrolamellar Hepatocellular Carcinoma, C0267792: Hepatobiliary disease, C0861876: Recurrent Hepatocellular Carcinoma. The analysis above indicates that DEGdriver did find a few disease-related genes that DriverNet had missed. 

Discussion

By combining tumor genomes and transcriptomes, numerous algorithms have been created to locate driver mutations and describe the relationship between mutations and differential expression. DriverNet can only detect genes whose expression values are outliers relative to the average expression of those genes across all patients. According to its authors, DriverNet might not be able to detect less drastic but significant changes in expression that are influenced by a genetic event. We offer DEGdriver to address this issue by identifying genes that are differentially expressed at the individual level. Like DriverNet, DEGdriver builds a bipartite graph to find potential driver genes among the mutated genes.

DEGdriver exhibits a significant advantage over prominent frequency-based, function-based, and network-based approaches in discovering cancer driver genes. The concordance scores of the top genes and the AUC/AUPR values improved in the majority of instances. We found that the majority of the genes shared by DEGdriver and DriverNet are NCG cancer genes, which suggests that these common genes merit additional research in the corresponding cancer types. We also confirmed that DEGdriver is robust to its parameter settings. The enrichment analysis demonstrates that the individual-level analysis of differentially expressed genes captures valuable signals disregarded by DriverNet and overcomes DriverNet's shortcomings by identifying not only novel cancer genes but also novel pathways.

For the majority of cancer types, DEGdriver produces more genes than DriverNet (Table S29). A possible explanation is that detecting differentially expressed genes at the individual level captures the subtle relationships between gene mutations and differential expression. At the population level, attention is typically paid to the top-ranked genes of a method, for which DEGdriver has demonstrated its accuracy. At the individual level, DEGdriver can predict patient-specific driver genes owing to the design of its bipartite graph: the candidate drivers for a given patient are the mutated genes chosen by the algorithm that are connected to that patient's identified differentially expressed genes.

The model's outcomes are directly affected by the differentially expressed genes identified at the individual level. In the current version of DEGdriver, these genes are identified with RankComp, which is used as a subroutine and applied only to the gene expression dataset. Consequently, replacing RankComp with a different method for identifying differentially expressed genes at the individual level will not affect the subsequent steps of DEGdriver. We anticipate that RankComp will be replaced by faster, more precise approaches for analyzing differentially expressed genes at the individual level.

Conclusions

By integrating analysis of tumor genomes and transcriptomes, we created an algorithm called DEGdriver that was inspired by DriverNet for identifying cancer driver genes. It has been shown that DEGdriver outperforms state-of-the-art methods for locating cancer driver genes. DEGdriver has been demonstrated to be able to identify those less extreme but significant changes in expression that DriverNet may have missed when applied to 11 different cancer types. DEGdriver can find functional modules and pathways in addition to new driver genes, according to enrichment analysis. DEGdriver might be a helpful addition to DriverNet to better comprehend pathologic pathways, identify targets, diagnose cancer, and provide individualized care.

Methods

1. Individual-level analysis of differentially expressed genes

To identify genes that are differentially expressed, we use RankComp to calculate the p-value of each gene for each patient. RankComp identifies genes that are differentially expressed for each patient based on the distribution of gene expression in normal samples. The gene pair (gi, gj) is considered positive/negative stable if the expression values of gene gi are consistently larger/smaller than those of gene gj across 100*α% of the normal samples of a specific cancer type. Note that the gene order within a stable gene pair determines whether it is positive or negative. The parameter α defaults to 0.99. For a patient, a gene pair (gi, gj) is referred to as a positive/negative gene pair if the expression value of gene gi is larger/smaller than that of gene gj. For a gene gi, let G denote the set of stable gene pairs (gi, g) in normal samples. Let a and b represent the numbers of positive and negative stable gene pairs belonging to G. Let c and d represent the numbers of positive and negative gene pairs belonging to G for a specific patient k. The ratios of the positive and negative gene pairs are then A/B=a/b in normal samples and C/D=c/d in patient k, respectively. Under the null hypothesis that A/B=C/D, Fisher's exact test is used to obtain the p-value that determines whether gene gi is differentially expressed in patient k. A p-value matrix P is thus obtained, where the matrix element P(i, j) denotes the p-value of gene i in patient j. The matrix is then binarized with a cutoff to determine whether a gene is differentially expressed. In our experiments, the 100*β% of elements of P with the smallest p-values are set to 1 and all others to 0 in the corresponding DEG matrix, and the parameter β defaults to 1/3. The DEG matrix can be used to evaluate the impact of mutations on gene expression.
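The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RankComp implementation: the function names are hypothetical, "across 100*α% normal samples" is read as "in at least a fraction α of normal samples", the test is taken to be two-sided on the table [[a, b], [c, d]], and the number of elements set to 1 is obtained by rounding β times the matrix size.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Sum every table that is no more likely than the observed one.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))


def gene_p_value(gene, normal_samples, patient, alpha=0.99):
    """p-value of `gene` in one patient; samples are dicts mapping gene -> expression.

    All dicts are assumed to share the same genes.
    """
    a = b = c = d = 0
    n_normal = len(normal_samples)
    for other in patient:
        if other == gene:
            continue
        larger = sum(s[gene] > s[other] for s in normal_samples) / n_normal
        smaller = sum(s[gene] < s[other] for s in normal_samples) / n_normal
        if larger >= alpha:
            a += 1  # positive stable pair in normal samples
        elif smaller >= alpha:
            b += 1  # negative stable pair in normal samples
        else:
            continue  # pair is not stable, so it is not in G
        if patient[gene] > patient[other]:
            c += 1  # positive pair in the patient
        elif patient[gene] < patient[other]:
            d += 1  # negative pair in the patient
    return fisher_exact_two_sided(a, b, c, d)


def binarize(p_matrix, beta=1 / 3):
    """Set the fraction beta of smallest p-values to 1 and the rest to 0."""
    flat = sorted(p for row in p_matrix for p in row)
    k = round(len(flat) * beta)
    if k == 0:
        return [[0 for _ in row] for row in p_matrix]
    cutoff = flat[k - 1]  # ties at the cutoff are all set to 1
    return [[1 if p <= cutoff else 0 for p in row] for row in p_matrix]
```

With exact integer binomials the test stays accurate for the large pair counts that arise when thousands of genes are compared, at the cost of speed; a library routine would be the natural replacement in practice.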

2. Formulation of associations between mutations and expression levels

A bipartite graph B = (X, Y) is constructed to formulate associations between mutations and expression levels, where X and Y are independent sets that form a partition of the node set of B. A node in X represents a gene mutated in at least one patient. A node in Y represents a differentially expressed gene in a patient. A node gi in X and a node gj in Y are connected by an edge if the corresponding genes interact with each other in the given PPI network.
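A minimal sketch of this construction is given below. It follows the description literally: Y nodes are represented as (gene, patient) pairs, and an edge is added whenever the mutated gene and the differentially expressed gene are neighbors in the PPI network. The function name, the data layout, and the toy inputs are assumptions; any further edge condition used by the actual tool is not modeled.

```python
from collections import defaultdict


def build_bipartite(mutated_genes, deg_events, ppi_edges):
    """Map each mutated gene (X node) to the (gene, patient) Y nodes it is linked to.

    mutated_genes: set of genes mutated in at least one patient.
    deg_events: set of (gene, patient) pairs flagged as differentially expressed.
    ppi_edges: iterable of undirected (gene_a, gene_b) interactions.
    """
    neighbors = defaultdict(set)
    for u, v in ppi_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    graph = {}
    for gene in mutated_genes:
        partners = neighbors.get(gene, set())
        linked = {(g, p) for (g, p) in deg_events if g in partners}
        if linked:  # keep only X nodes that have at least one edge
            graph[gene] = linked
    return graph


# Toy inputs with placeholder gene and patient names.
toy_graph = build_bipartite(
    mutated_genes={"A", "B"},
    deg_events={("C", "p1"), ("D", "p1"), ("C", "p2")},
    ppi_edges=[("A", "C"), ("B", "D"), ("B", "E")],
)
print(toy_graph)
```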

3. The inference algorithm

The top n genes (nodes) in X are defined as the genes that are connected to the maximum number of differentially expressed genes (nodes) in Y. The inference algorithm aims to identify a minimum set of genes in X that is connected to all nodes in Y that have at least one edge. A greedy algorithm is applied to solve the problem. At each stage, the gene in X that connects to the largest number of uncovered nodes in Y is selected; ties are broken arbitrarily. When a gene is selected, the nodes in Y connected to it are marked as covered. The algorithm terminates when all nodes in Y that have edges are covered. All selected mutated genes are output, in order of selection, as predicted driver genes.
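The greedy selection described above is the standard greedy heuristic for set cover and can be sketched as follows. The function name and input layout are hypothetical, and ties are broken alphabetically here only to make the output deterministic, whereas the text breaks them arbitrarily.

```python
def greedy_drivers(graph):
    """Rank mutated genes by greedy cover of the differentially expressed nodes.

    graph: dict mapping each mutated gene (X node) to the set of Y nodes it touches.
    Returns the selected genes in the order they were chosen.
    """
    uncovered = set().union(*graph.values())  # every Y node that has an edge
    remaining = dict(graph)
    ranked = []
    while uncovered:
        # Gene covering the most still-uncovered Y nodes; first in sorted order on ties.
        best = max(sorted(remaining), key=lambda g: len(remaining[g] & uncovered))
        ranked.append(best)
        uncovered -= remaining.pop(best)  # mark its Y nodes as covered
    return ranked


# Toy graph: A covers three events, B and C compete for the last one.
toy = {
    "A": {("g1", "p1"), ("g2", "p1"), ("g3", "p2")},
    "B": {("g3", "p2"), ("g4", "p2")},
    "C": {("g4", "p2")},
}
print(greedy_drivers(toy))
```

Because the loop stops as soon as every Y node is covered, mutated genes that add no new coverage (such as C in the toy graph) are never output.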

Availability of data and materials

We downloaded the TCGA gene expression RNAseq (HTSeq-Counts) datasets which were derived from GDC API using the Sangerbox tools (http://www.sangerbox.com/tool) on June 5, 2021. For the RNAseq datasets, we only consider the disease samples with the sample label 01 (Primary Solid Tumor TP) and the normal samples with the sample label 11 (Solid Tissue Normal NT). We obtained 11 cancer types with at least 400 samples in total and at least 20 normal samples (Table S1). 
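The sample selection by label can be sketched as below, assuming the standard TCGA barcode layout in which the fourth hyphen-separated field starts with the two-digit sample type code. The function names and the barcodes in the usage example are made up for illustration.

```python
def sample_type_code(barcode):
    """Return the two-digit sample type code from a TCGA sample barcode."""
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode}")
    return fields[3][:2]


def split_tumor_normal(barcodes):
    """Keep primary solid tumors (code 01) and solid tissue normals (code 11)."""
    tumor = [b for b in barcodes if sample_type_code(b) == "01"]
    normal = [b for b in barcodes if sample_type_code(b) == "11"]
    return tumor, normal


# Made-up barcodes; code 06 (metastatic) is dropped by the filter.
samples = ["TCGA-AA-0001-01A", "TCGA-AA-0001-11A", "TCGA-AA-0002-06A"]
print(split_tumor_normal(samples))
```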

The 11 cancer types include breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), prostate adenocarcinoma (PRAD), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA) and uterine corpus endometrial carcinoma (UCEC). 

We downloaded the TCGA somatic mutation (SNP and INDEL) datasets of the 11 cancer types via the TCGA hub of the UCSC Xena platform (https://tcga.xenahubs.net) (Table S2). For the somatic mutation datasets, the MC3 gene-level non-silent mutation data, version 2016-12-29, was selected. The datasets can be represented as a binary gene mutation matrix M, where M(i, j)=1 indicates that gene i is mutated in patient j and M(i, j)=0 otherwise.
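For illustration, a binary matrix M of this form can be assembled from a list of (gene, patient) mutation calls as sketched below. The function name, the input layout, and the toy calls are assumptions and do not reflect the Xena file format.

```python
def mutation_matrix(mutation_calls, genes, patients):
    """Build M with M[i][j] = 1 if genes[i] is mutated in patients[j], else 0."""
    gene_index = {g: i for i, g in enumerate(genes)}
    patient_index = {p: j for j, p in enumerate(patients)}
    matrix = [[0] * len(patients) for _ in genes]
    for gene, patient in mutation_calls:
        # Calls for genes or patients outside the analysis are ignored.
        if gene in gene_index and patient in patient_index:
            matrix[gene_index[gene]][patient_index[patient]] = 1
    return matrix


# Toy calls with placeholder patient identifiers.
calls = [("PTEN", "p1"), ("APC", "p2"), ("PTEN", "p2"), ("TTN", "p9")]
print(mutation_matrix(calls, genes=["PTEN", "APC"], patients=["p1", "p2"]))
```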

We downloaded PPI networks HumanNet-PI and its two component networks HT and LC from HumanNet v2 [48] (http://www.inetbio.org/humannet/) and Multinet from HotNet2 [34] (http://compbio-research.cs.brown.edu/pancancer/hotnet2/#!/) (Table S3). The component networks HT and LC were denoted as HumanNet-HT and HumanNet-LC respectively in our analysis.

Declarations

Author contributions

Conceived and designed the project: GL. Analyzed the data and performed the experiments: YZ, BG. Wrote the paper: GL, BG, and YZ. 

Funding

This work was supported by the National Natural Science Foundation of China (No. 61902390 to BG, and No. 11931008 to GL), and the National Key R&D Program of China (No. 2020YFA0712400 to GL). This work was also supported by the Beijing Municipal Key Laboratory of Clinical Epidemiology and the Jinan Innovation Team Project. The funders had no role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript.

Conflicts of interest

All authors have disclosed no potential conflicts of interest.

References

  1. Cancer Genome Atlas Research Network: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455:1061–1068.
  2. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, et al: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446:153–158.
  3. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458:719–724.
  4. Greenman C, Wooster R, Futreal PA, Stratton MR, Easton DF: Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 2006, 173:2187–2198.
  5. Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee JC, Huang JH, Alexander S, et al: Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma. Proceedings of the National Academy of Sciences of the United States of America 2007, 104:20007–20012.
  6. Youn A, Simon R: Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics 2011, 27:175–181.
  7. Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, et al: The genomic landscapes of human breast and colorectal cancers. Science 2007, 318:1108–1113.
  8. Torkamani A, Schork NJ: Identification of rare cancer driver mutations by network reconstruction. Genome Res 2009, 19:1570–1578.
  9. Shi X, Teng H, Shi L, Bi W, Wei W, Mao F, Sun Z: Comprehensive evaluation of computational methods for predicting cancer driver genes. Brief Bioinform 2022, 23.
  10. Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, Meyerson M, Gabriel SB, Lander ES, Getz G: Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 2014, 505:495–501.
  11. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al: Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499:214–218.
  12. Hodis E, Watson IR, Kryukov GV, Arold ST, Imielinski M, Theurillat JP, Nickerson E, Auclair D, Li L, Place C, et al: A landscape of driver mutations in melanoma. Cell 2012, 150:251–263.
  13. Dees ND, Zhang Q, Kandoth C, Wendl MC, Schierding W, Koboldt DC, Mooney TB, Callaway MB, Dooling D, Mardis ER, et al: MuSiC: identifying mutational significance in cancer genomes. Genome Res 2012, 22:1589–1598.
  14. Zhao S, Liu J, Nanga P, Liu Y, Cicek AE, Knoblauch N, He C, Stephens M, He X: Detailed modeling of positive selection improves detection of cancer driver genes. Nat Commun 2019, 10:3399.
  15. Jiang L, Zheng J, Kwan JSH, Dai S, Li C, Li MJ, Yu B, To KF, Sham PC, Zhu Y, Li M: WITER: a powerful method for estimation of cancer-driver genes using a weighted iterative regression modelling background mutation counts. Nucleic Acids Res 2019, 47:e96.
  16. Han Y, Yang J, Qian X, Cheng WC, Liu SH, Hua X, Zhou L, Yang Y, Wu Q, Liu P, Lu Y: DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies. Nucleic Acids Res 2019, 47:e45.
  17. Reva B, Antipin Y, Sander C: Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 2011, 39:e118.
  18. Carter H, Chen SN, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, Vogelstein B, Karchin R: Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res 2009, 69:6660–6667.
  19. Gonzalez-Perez A, Deu-Pons J, Lopez-Bigas N: Improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation. Genome Med 2012, 4:89.
  20. Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GL, Edwards KJ, Day IN, Gaunt TR: Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum Mutat 2013, 34:57–65.
  21. Dietlein F, Weghorn D, Taylor-Weiner A, Richters A, Reardon B, Liu D, Lander ES, Van Allen EM, Sunyaev SR: Identification of cancer driver genes based on nucleotide context. Nat Genet 2020, 52:208–218.
  22. Cho A, Shim JE, Kim E, Supek F, Lehner B, Lee I: MUFFINN: cancer gene discovery via network analysis of somatic mutation data. Genome Biol 2016, 17:129.
  23. Hou YN, Gao B, Li GJ, Su ZC: MaxMIF: a new method for identifying cancer driver genes through effective data integration. Advanced Science 2018, 5.
  24. Boca SM, Kinzler KW, Velculescu VE, Vogelstein B, Parmigiani G: Patient-oriented gene set analysis for cancer mutation data. Genome Biol 2010, 11.
  25. Efroni S, Ben-Hamo R, Edmonson M, Greenblum S, Schaefer CF, Buetow KH: Detecting cancer gene networks characterized by recurrent genomic alterations in a population. PLoS One 2011, 6.
  26. Raphael BJ, Dobson JR, Oesper L, Vandin F: Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. Genome Med 2014, 6.
  27. Ding L, Raphael BJ, Chen F, Wendl MC: Advances for studying clonal evolution in cancer. Cancer Letters 2013, 340:212–219.
  28. Vandin F, Upfal E, Raphael BJ: De novo discovery of mutated driver pathways in cancer. Genome Res 2012, 22:375–385.
  29. Zhao J, Zhang S, Wu LY, Zhang XS: Efficient methods for identifying mutated driver pathways in cancer. Bioinformatics 2012, 28:2940–2947.
  30. Li HT, Zhang YL, Zheng CH, Wang HQ: Simulated annealing based algorithm for identifying mutated driver pathways in cancer. BioMed Research International 2014.
  31. Leiserson MD, Blokh D, Sharan R, Raphael BJ: Simultaneous identification of multiple driver pathways in cancer. PLoS Comput Biol 2013, 9:e1003054.
  32. Zhang JH, Wu LY, Zhang XS, Zhang SH: Discovery of co-occurring driver pathways in cancer. BMC Bioinformatics 2014, 15.
  33. Vandin F, Upfal E, Raphael BJ: Algorithms for detecting significantly mutated pathways in cancer. J Comput Biol 2011, 18:507–522.
  34. Leiserson MD, Vandin F, Wu HT, Dobson JR, Eldridge JV, Thomas JL, Papoutsaki A, Kim Y, Niu B, McLellan M, et al: Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet 2015, 47:106–114.
  35. Ciriello G, Cerami E, Sander C, Schultz N: Mutual exclusivity analysis identifies oncogenic network modules. Genome Res 2012, 22:398–406.
  36. Kim YA, Cho DY, Dao P, Przytycka TM: MEMCover: integrated analysis of mutual exclusivity and functional network reveals dysregulated pathways across multiple cancer types. Bioinformatics 2015, 31:i284–i292.
  37. Gao B, Li G, Liu J, Li Y, Huang X: Identification of driver modules in pan-cancer via coordinating coverage and exclusivity. Oncotarget 2017.
  38. Gao B, Zhao Y, Li Y, Liu JT, Wang LS, Li GJ, Su ZC: Prediction of driver modules via balancing exclusive coverages of mutations in cancer samples. Advanced Science 2019, 6.
  39. Gao B, Zhao Y, Gao YH, Li GJ, Wu LY: Identification of common driver gene modules and associations between cancers through integrated network analysis. Global Challenges 2021, 5.
  40. Bashashati A, Haffari G, Ding JR, Ha G, Lui K, Rosner J, Huntsman DG, Caldas C, Aparicio SA, Shah SP: DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol 2012, 13.
  41. Hou JP, Ma J: DawnRank: discovering personalized driver genes in cancer. Genome Med 2014, 6:56.
  42. Wang HW, Sun Q, Zhao WY, Qi LS, Gu YY, Li PF, Zhang MM, Li Y, Liu SL, Guo Z: Individual-level analysis of differential expression of genes and pathways for personalized medicine. Bioinformatics 2015, 31:62–68.
  43. Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, Boutselakis H, Cole CG, Creatore C, Dawson E, et al: COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res 2019, 47:D941–D947.
  44. Dressler L, Bortolomeazzi M, Keddar MR, Misetic H, Sartini G, Acha-Sagredo A, Montorsi L, Wijewardhane N, Repana D, Nulsen J, et al: Comparative assessment of genes driving cancer and somatic evolution in non-cancer tissues: an update of the Network of Cancer Genes (NCG) resource. Genome Biol 2022, 23:35.
  45. Davis J, Goadrich M: The Relationship Between Precision-Recall and ROC Curves. 2006.
  46. Zhou YY, Zhou B, Pache L, Chang M, Khodabakhshi AH, Tanaseichuk O, Benner C, Chanda SK: Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat Commun 2019, 10.
  47. Pinero J, Bravo A, Queralt-Rosinach N, Gutierrez-Sacristan A, Deu-Pons J, Centeno E, Garcia-Garcia J, Sanz F, Furlong LI: DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res 2017, 45:D833–D839.
  48. Hwang S, Kim CY, Yang S, Kim E, Hart T, Marcotte EM, Lee I: HumanNet v2: human gene networks for disease research. Nucleic Acids Res 2019, 47:D573–D580.