DriverRWH: Discovering Cancer Driver Genes By Random Walk On a Gene Mutation Hypergraph

doi:10.21203/rs.3.rs-1192205/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background: Recent advances in next-generation sequencing technologies have helped investigators generate massive amounts of cancer genomic data. A critical challenge in cancer genomics is the identification of a few driver mutation genes among a much larger number of passenger mutation genes. However, the majority of existing computational approaches underuse the co-occurrence information within individual tumors, which is deemed important in tumorigenesis and tumor progression. Driver gene lists predicted by these tools are prone to false positives, and current research is far from achieving the ultimate goal of discovering a complete catalog of driver genes.

Results: To make full use of co-mutation information, we present DriverRWH, a random walk algorithm on a weighted gene mutation hypergraph model that uses somatic mutation data and molecular interaction network data to prioritize candidate driver genes. Applied to tumor samples of different cancer types from The Cancer Genome Atlas (TCGA), DriverRWH shows significantly better performance than state-of-the-art prioritization methods in terms of the area under the curve (AUC) scores and the cumulative number of known driver genes recovered in top-ranked candidate genes. For more than half of the cancer types, approximately 50% of the top 30 ranked candidate genes are known driver genes. In addition, DriverRWH is highly robust to perturbations in the mutation data and gene functional network data.

Conclusion: DriverRWH is effective at prioritizing cancer driver genes across various cancer types and provides considerable improvement over other tools, with a better balance of precision and sensitivity. It can be a useful tool for detecting potential driver genes and facilitating targeted cancer therapies.

Bioinformatics

cancer driver genes

somatic mutation

gene network

hypergraph model

random walk

candidate gene prioritization

Cancer is a complex genetic disease characterized by abnormal and uncontrolled cellular growth, which is caused primarily by the accumulation of genomic alterations that together enable malignant growth [1, 2]. Recent advances in next-generation sequencing (NGS) technologies have generated massive amounts of cancer genomic data, such as The Cancer Genome Atlas (TCGA), which provides somatic mutation landscapes to better characterize the molecular signatures of cancer [3]. The consensus view of tumorigenesis is that only a few mutational events affect driver genes, whereas the other mutations are ‘passenger’ mutations that have no impact on cancer progression [4, 5]. Distinguishing ‘driver’ mutations from functionally neutral ‘passenger’ mutations is therefore a key step in understanding tumor biology and developing targeted anticancer therapies.

A number of computational tools have been developed to identify cancer driver genes from multidimensional genomic data. Most of these tools can be classified into three categories based on their underlying principles [6]. Frequency-based approaches, such as MutSigCV and MuSiC, assume that the most frequently mutated genes are more likely to be drivers [7, 8]. Unfortunately, frequency-based methods are underpowered for uncovering driver genes with low recurrence [9]. Functional impact-based approaches, such as OncodriveFM, integrate information from multiple domains to predict the functional impact of single nucleotide variants (SNVs) [10]. However, most of these methods are machine learning-based models, and building a gold-standard positive or negative data set for training is a difficult task that restricts their use [9]. The third category comprises network-based methods. Motivated by the observation that mutations in a cancer genome tend to converge on a few biological pathways, these methods attempt to identify groups of driver genes based on prior knowledge of pathways and protein or genetic interactions [11, 12]. DawnRank adopts the PageRank algorithm to rank potential drivers based on their impact on the overall differential expression of downstream genes [13]. HotNet2 uses a random walk with restart algorithm to identify mutated subnetworks; it considers the mutation frequency of each gene together with the frequencies of its network neighbors, and hub genes often receive high predicted scores [12]. Methods of this kind are able to identify driver genes with low recurrence and improve the accuracy of driver gene prediction to some extent [14].

Despite the rapid progress in computational approaches for prioritizing cancer driver genes since the advent of next-generation sequencing technologies, the false positive rates of existing methods are still too high. Most published methods take single-gene mutation frequencies as input, a practice that discards the information about co-occurring alterations within individual tumors, which is considered to play a key role in cancer initiation and progression [15–17]. In this study, we present a novel approach, DriverRWH, which makes full use of each individual’s co-mutation information to improve the prioritization of driver mutated genes [18]. To capture the co-occurrence of mutated genes within single individuals, we adopt a hypergraph to represent the complete somatic mutation information. A random walk algorithm is then used to exploit both co-mutation and protein-protein interaction (PPI) network information. To verify our method, we applied it to 31 cancer types from TCGA and found that, regardless of which reference network is used, it outperforms state-of-the-art tools for the majority of cancer types. We also evaluated the robustness of our method and found that DriverRWH is highly robust to various data perturbations.

Overview

In the present model, we incorporated somatic mutation profiles and PPI datasets to prioritize driver genes. First, a weighted mutation hypergraph was constructed, in which tumor samples are represented as hyperedges and mutated genes as vertices. A hypergraph is a generalization of a simple graph in which edges (called hyperedges) may connect an arbitrary number of vertices rather than exactly two. Following our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, we weighted the genes within the hyperedge of each sample according to their degrees in the corresponding induced subnetwork of the PPI network. Second, we extended the typical random walk process on a simple graph to the hypergraph in a modified manner and carried out this process iteratively until the random walk stabilized. Finally, all candidate mutated genes were ranked in descending order of their DriverRWH scores.

Datasets and networks

Somatic mutation data for 9183 tumor samples across 31 cancer types (Table-S1) used in this work are available from TCGA and were downloaded via the UCSC Xena browser (https://xenabrowser.net/datapages/) [19]. We downloaded two independently developed PPI datasets, HumanNet (http://www.functionalnet.org/humannet/) [20] and STRINGv10 (https://string-db.org) [21].

Performance evaluation

To evaluate the method, an unbiased, comprehensive set of known cancer genes is needed. Unfortunately, such a gold-standard set is currently unavailable. Instead, we used four complementary cancer gene sets derived from various sources as the reference driver gene set for all cancer types. First, 616 cancer genes were downloaded from the Cancer Gene Census (CGC) database, which includes genes for which mutations have been causally implicated in cancer and is widely used as a gold-standard cancer gene set [22]. Second, the HiConf cancer gene panel consists of 99 driver genes that have previously been detected through genetic criteria and that could plausibly be detected with exome sequencing data [23]. The third set has 291 high-confidence cancer driver genes identified by a rule-based method (HCD) [24]. The fourth set contains 125 driver genes defined by the "20/20 rule", which identifies Mut-driver genes based on the characteristic mutational patterns of oncogenes and tumor suppressor genes [25]. Because each cancer gene set is biased toward particular features or study methods, we used the union of these four lists as the reference driver gene set, with a total of 785 genes. This reduces, to some degree, the bias caused by using a single reference gene list. Using these reference driver genes as a benchmark, we generated receiver operating characteristic (ROC) curves and computed the areas under the curves (AUCs) to evaluate the true positive and false positive rates. For practical reasons, only top-ranked candidate genes are likely to enter follow-up experimental validation. Because high prioritization performance over all genes does not guarantee good prioritization among the top-ranked candidates, we also assessed the number of known driver genes recovered in the top 20, 50, 100, 150 and 200 candidate genes.
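The two evaluation measures described above can be illustrated with a minimal Python sketch. It assumes a ranked candidate list (best first) without tied scores and evaluates only the genes that appear in that list; the function names and the toy gene labels are hypothetical and are not taken from the DriverRWH implementation.

```python
def roc_auc(ranked_genes, reference):
    """AUC of a ranking against a reference set (pair-counting formulation).

    Equals the fraction of (reference, non-reference) gene pairs in which
    the reference gene is ranked higher. Assumes no tied scores.
    """
    n_pos = sum(1 for g in ranked_genes if g in reference)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one reference and one non-reference gene")
    positives_seen = 0
    pairs_won = 0
    for gene in ranked_genes:
        if gene in reference:
            positives_seen += 1
        else:
            # Every reference gene seen so far outranks this non-reference gene.
            pairs_won += positives_seen
    return pairs_won / (n_pos * n_neg)


def recovered_in_top_k(ranked_genes, reference, ks=(20, 50, 100, 150, 200)):
    """Number of reference genes among the top-k candidates, for each k."""
    return {k: sum(1 for g in ranked_genes[:k] if g in reference) for k in ks}


# Toy ranking; gene labels are placeholders, not real results.
ranking = ["G1", "G2", "G3", "G4", "G5", "G6"]
reference_set = {"G1", "G3", "G6"}
auc = roc_auc(ranking, reference_set)
top_counts = recovered_in_top_k(ranking, reference_set, ks=(2, 4))
```

In practice the cutoffs would be the 20, 50, 100, 150 and 200 used in this work; the small values of `ks` here only suit the six-gene toy list.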

Because cancer types are diverse, we are more interested in tumor-specific drivers than in drivers common to all tumor types. We downloaded the IntOGen database (https://www.intogen.org/download) [25]. This database harnesses the strengths of different driver prediction methods and provides lists of tumor-specific driver genes, which are considered to offer the best trade-off between sensitivity and specificity. The lists cover 31 types of cancer, among which Kidney Chromophobe (KICH) has the fewest specific drivers (7) and Uterine Corpus Endometrial Carcinoma (UCEC) has the most (55). All of the above lists are shown in Table-S2. From an application point of view, we should also assess the ability of our method to identify novel driver genes that are not yet recorded in IntOGen. Genes that appear in the top 200 candidate lists predicted by DriverRWH with both HumanNet and STRINGv10 but are not among the tumor-specific drivers were evaluated by enrichment analysis using the DAVID online database [26, 27].

Specifically, the top 30 genes are selected as significant drivers. We leveraged a literature mining method named CoCiter, which calculates the co-citation significance between predicted driver genes and the keywords (the cancer type, ‘driver’ and ‘cancer’), to verify the top 30 significant genes [28]. A higher co-citation score indicates a stronger association between the genes and the key terms. We also compared DriverRWH with 21 driver gene prediction methods across 31 cancer types. Some of these methods identify significant drivers by P-value (genes with an FDR-adjusted P-value < 0.05), whereas the rest provide priority scores for candidate driver genes (the top 30 genes are selected as significant drivers). This cutoff is reasonable because the median number of significant genes reported by the other methods across all data sets is 30.

The DriverRWH algorithm

In this study, we propose DriverRWH, which prioritizes candidate driver genes by a random walk on a hypergraph. First, a hypergraph consisting of the mutated genes of all samples was constructed, in which samples are represented as hyperedges and genes as vertices (Fig. 1B). If a gene is mutated in a sample, it is a vertex in the hyperedge corresponding to that sample. Without loss of generality, the hypergraph can be defined as $HG(V,\mathcal{E})$, where $V$ is the set of vertices and $\mathcal{E}$ is the set of hyperedges. Each hyperedge $e$ is a subset of $V$, and the hyperedges satisfy $\bigcup _{e\in \mathcal{E}}e=V$. Hyperedge $e$ is said to be incident with vertex $u$ if $u\in e$; thus, the incidence matrix $H\in {R}^{\left|V\right|\times \left|\mathcal{E}\right|}$ can be defined as follows:

$h\left(u,e\right)=\left\{\begin{array}{ll}1& \text{ if }u\in e\\ 0& \text{ if }u\notin e\end{array}\right.$
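As an illustration of this definition, the following minimal Python sketch builds the incidence matrix from a mapping of samples to their mutated genes. The function name, the data layout and the toy mutation assignments are assumptions made for this example only; they do not describe the actual DriverRWH code or real TCGA samples.

```python
def build_incidence(sample_to_genes):
    """Return (genes, samples, H) with H[i][j] = 1 if gene i is mutated in sample j.

    Rows correspond to vertices (genes) and columns to hyperedges (samples),
    both in sorted order so that the layout is reproducible.
    """
    samples = sorted(sample_to_genes)
    genes = sorted({g for mutated in sample_to_genes.values() for g in mutated})
    H = [[1 if g in sample_to_genes[s] else 0 for s in samples] for g in genes]
    return genes, samples, H


# Invented toy data: three samples (hyperedges) over four genes (vertices).
mutations = {
    "S1": {"TP53", "PTEN"},
    "S2": {"TP53", "PIK3CA", "TTN"},
    "S3": {"TTN"},
}
genes, samples, H = build_incidence(mutations)
```

Each column of `H` lists the members of one hyperedge, and each row sum gives the number of samples in which that gene is mutated.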

After constructing the hypergraph, we developed a random walk process on it. Like a random walk on a simple graph, this walk is a Markov process consisting of transitions between vertices. A transition on the hypergraph can occur only between two vertices that are incident to a common hyperedge, so the random walk on a hypergraph is defined as a two-step process. In the first step, the surfer selects a hyperedge $e$ incident with the current vertex $u$; in the second, it selects a target vertex $v$ within the chosen hyperedge (Fig. 1C).

Following our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, we chose the weight of each vertex in a hyperedge to be its degree in the corresponding induced subnetwork of the PPI network. A vertex that is isolated in the subnetwork may still be a driver gene, so it is assigned a small weight of 0.01. Let $Ne$ be the subnetwork induced by the vertices in hyperedge $e$, and let ${d}_{Ne}\left(u\right)$ denote the degree of $u$ in this subnetwork.

$$w(u,e)=\left\{\begin{array}{ll}{d}_{Ne}\left(u\right),& \text{ if }u\in e\text{ and }{d}_{Ne}\left(u\right)>0\\ 0.01,& \text{ if }u\in e\text{ and }{d}_{Ne}\left(u\right)=0\\ 0,& \text{ if }u\notin e\end{array}\right.$$

Thereafter, the surfer selects vertex $v$ with probability proportional to the weight of $v$ within the hyperedge. Notably, in our model, the weight of a vertex may differ from one hyperedge to another. With these definitions, the degree of vertex $u$ and the degree of hyperedge $e$ in hypergraph $HG(V,\mathcal{E})$ are defined as follows:

$$d\left(u\right)=\sum _{e\in \mathcal{E}}h\left(u,e\right)$$

$$\delta \left(e\right)=\sum _{u\in e}w(u,e)$$
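A minimal Python sketch of these quantities is given below. It follows the description in the text: a vertex in a hyperedge is weighted by its degree in the PPI subnetwork induced by that hyperedge, and an isolated vertex receives the small weight 0.01. The function names are hypothetical, the PPI network is assumed to be a collection of unordered gene pairs without duplicates, and the toy interactions are invented for illustration.

```python
ISOLATED_WEIGHT = 0.01  # small weight assigned in the text to isolated vertices


def vertex_weights(hyperedge, ppi_edges):
    """w(u, e) for every gene u in hyperedge e (degree in the induced subnetwork)."""
    degree = {u: 0 for u in hyperedge}
    for a, b in ppi_edges:
        # Keep only interactions whose two endpoints are both mutated in this sample.
        if a != b and a in degree and b in degree:
            degree[a] += 1
            degree[b] += 1
    return {u: (d if d > 0 else ISOLATED_WEIGHT) for u, d in degree.items()}


def hyperedge_degree(weights):
    """delta(e): sum of the vertex weights inside one hyperedge."""
    return sum(weights.values())


def vertex_degree(gene, hyperedges):
    """d(u): number of hyperedges (samples) that contain the gene."""
    return sum(1 for members in hyperedges.values() if gene in members)


# Invented toy data.
hyperedges = {"S1": {"TP53", "PTEN"}, "S2": {"TP53", "PIK3CA", "TTN"}}
ppi = [("TP53", "PIK3CA"), ("TP53", "PTEN")]
w_s2 = vertex_weights(hyperedges["S2"], ppi)
delta_s2 = hyperedge_degree(w_s2)
d_tp53 = vertex_degree("TP53", hyperedges)
```

Here `TTN` has no interaction partner inside sample `S2`, so it receives the weight 0.01, while the TP53-PTEN interaction is ignored for `S2` because `PTEN` is not mutated in that sample.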

With all the elements defined, we calculated the transition probability from vertex $u$ to vertex $v$ as follows:

$$P\left(u,v\right)=\sum _{e\in \mathcal{E}}\frac{h\left(u,e\right)}{d\left(u\right)}\frac{w\left(v,e\right)}{\sum _{\widehat{v}\in e}w\left(\widehat{v},e\right)}$$

which can also be written in matrix form:

$$P={{D}_{u}}^{-1}H{D}_{e}^{-1}{W}^{T}$$

where ${D}_{u}\in {R}^{\left|V\right|\times \left|V\right|}$ is the diagonal vertex degree matrix with elements $d\left(u\right)$, ${D}_{e}\in {R}^{\left|\mathcal{E}\right|\times \left|\mathcal{E}\right|}$ is the diagonal hyperedge degree matrix with elements $\delta \left(e\right)$, and $W\in {R}^{\left|V\right|\times \left|\mathcal{E}\right|}$ is the weighted incidence matrix of hypergraph $HG(V,\mathcal{E})$. Note that the transition matrix $P$ is stochastic; that is, each row sums to 1.
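The element-wise formula for $P\left(u,v\right)$ can be checked on a small example. The sketch below is a plain-Python illustration under the assumption that each hyperedge is supplied as a mapping from its member genes to positive weights $w(v,e)$; the function name and the toy weights are hypothetical.

```python
def transition_matrix(weighted_hyperedges):
    """P[u][v] = sum over e of h(u,e)/d(u) * w(v,e)/delta(e).

    weighted_hyperedges maps sample -> {gene: w(gene, sample)} with w > 0.
    """
    genes = sorted({g for w in weighted_hyperedges.values() for g in w})
    d = {u: sum(1 for w in weighted_hyperedges.values() if u in w) for u in genes}
    P = {u: {v: 0.0 for v in genes} for u in genes}
    for w in weighted_hyperedges.values():
        delta = sum(w.values())
        for u in w:  # h(u, e) = 1 only for members of the hyperedge
            for v, w_v in w.items():
                P[u][v] += (1.0 / d[u]) * (w_v / delta)
    return P


# Toy hypergraph with two hyperedges and placeholder gene labels.
edges = {
    "S1": {"A": 1, "B": 1},
    "S2": {"A": 2, "B": 1, "C": 1},
}
P = transition_matrix(edges)
row_sums = {u: sum(P[u].values()) for u in P}
```

For vertex `A`, which lies in both hyperedges, the walk picks each hyperedge with probability 1/2 and then moves to `C` with probability 1/4 inside `S2`, giving `P["A"]["C"] = 0.125`. Every row of `P` sums to 1, as stated in the text, and self-transitions are allowed by the formula.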

Furthermore, we implemented a random walk with restart on the hypergraph. All genes are considered potential driver genes and are assigned equal probabilities; i.e., each element of the initial normalized probability vector $\overrightarrow{v}\left(0\right)\in {R}^{\left|V\right|\times 1}$ is set to $\frac{1}{\left|V\right|}$. Moreover, the restart probability at every step is set to $1-\alpha$ $(0<\alpha <1)$. In this article, we set $\alpha$ to 0.2. Finally, the random walk can be expressed as follows:

$$\overrightarrow{v}\left(t+1\right)=\alpha {P}^{T}\overrightarrow{v}\left(t\right)+\left(1-\alpha \right)\overrightarrow{v}\left(0\right), \quad t=0,1,2,\dots$$

In the formula above, the $i$th element of $\overrightarrow{v}\left(t\right)$ is the probability that the surfer is at vertex $i$ at step $t$. After a number of steps, the random walk stabilizes; the stable vector is denoted $\overrightarrow{v}\left({\infty }\right)$. The walk is regarded as stable when the L1 distance between $\overrightarrow{v}(t+1)$ and $\overrightarrow{v}\left(t\right)$ falls below a given cutoff, which we set to ${10}^{-6}$ in this paper. The elements of the stabilized vector $\overrightarrow{v}\left({\infty }\right)$ are defined as the DriverRWH scores, which reflect the role that the mutated genes play in cancer. A higher score indicates a more likely driver gene.
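The iteration can be sketched in a few lines of Python. This is an illustrative implementation of the update rule and stopping criterion stated above, with $\alpha = 0.2$ and an L1 cutoff of $10^{-6}$; the function name, the `max_iter` safeguard and the toy transition matrix are assumptions of this sketch.

```python
def random_walk_with_restart(P, alpha=0.2, tol=1e-6, max_iter=1000):
    """Iterate v(t+1) = alpha * P^T v(t) + (1 - alpha) * v(0) until the L1 change < tol.

    P is a row-stochastic matrix given as {u: {v: probability}}.
    """
    genes = sorted(P)
    v0 = {g: 1.0 / len(genes) for g in genes}  # uniform initial vector
    v = dict(v0)
    for _ in range(max_iter):
        nxt = {g: (1 - alpha) * v0[g] for g in genes}
        for u in genes:
            for g, p_ug in P[u].items():
                nxt[g] += alpha * p_ug * v[u]  # (P^T v)[g] = sum_u P[u][g] * v[u]
        if sum(abs(nxt[g] - v[g]) for g in genes) < tol:
            return nxt
        v = nxt
    raise RuntimeError("random walk did not converge within max_iter steps")


# Toy row-stochastic matrix with placeholder gene labels.
P = {
    "A": {"A": 0.5, "B": 0.375, "C": 0.125},
    "B": {"A": 0.5, "B": 0.375, "C": 0.125},
    "C": {"A": 0.5, "B": 0.25, "C": 0.25},
}
scores = random_walk_with_restart(P)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because $0<\alpha<1$ and $P$ is row-stochastic, each update shrinks the L1 distance between successive vectors by at least a factor of $\alpha$, so the loop is expected to stop after a handful of steps for $\alpha = 0.2$.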

Known driver genes have higher degree in the PPI network

In DriverRWH, we hypothesized that a gene is more likely to be a cancer driver if it tends to be associated with other mutated genes in cancer. This hypothesis has already been proposed in some studies [12, 29]. To further validate it, we analyzed the connectivity of mutated genes in the PPI network. For a given cancer type, we built the induced subnetwork of the PPI network that contains only the mutated genes pooled from all samples. Mutated genes were divided into two groups according to whether they are in the reference driver gene set: known driver genes and the others. We then calculated the degree of each vertex in the induced subnetwork. Taking the three cancer types LUSC, BRCA and UCEC for illustration, we found that the degrees of known driver genes were significantly larger than those of the other mutated genes (P-value < 0.001), regardless of which gene interaction network was used (Fig 2). This result suggests that cancer driver genes are adjacent to more mutated genes than other genes are.
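The text does not state which statistical test produced the reported P-value. As one possible way to compare the two degree distributions, the following Python sketch uses a one-sided permutation test on the difference of means; the choice of test, the function name and the toy degree values are all assumptions made for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """One-sided p-value for mean(group_a) > mean(group_b) by label permutation."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Invented degrees in the induced subnetwork, not real data.
driver_degrees = [10, 12, 15, 9, 11]
other_degrees = [1, 2, 0, 3, 1, 2, 1, 0]
p = permutation_p_value(driver_degrees, other_degrees)
```

A rank-based test such as the Mann-Whitney U test would be a common alternative when degree distributions are heavy-tailed, as is typical for PPI networks.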

Performance of DriverRWH

We applied DriverRWH to 31 independent datasets from TCGA using the PPI networks HumanNet (H) and STRINGv10 (S), respectively. We first show that DriverRWH outperforms four well-regarded prioritization methods (i.e., MutSigCV [7], Gravity [29], DawnRank [30] and OncodriveFML [31]). We present three cancer types here, namely Lung squamous cell carcinoma (LUSC), Breast invasive carcinoma (BRCA), and Uterine Corpus Endometrial Carcinoma (UCEC). After that, we evaluate DriverRWH across 31 cancer types and compare it with 21 other driver gene prediction tools.

Results for Lung squamous cell carcinoma

Lung cancer is the leading cause of cancer death, accounting for 18.0% of cancer deaths [32]. In this study, we applied DriverRWH to 480 LUSC samples from the TCGA database.

Using the reference driver genes as a benchmark, we generated receiver operating characteristic (ROC) curves and computed the area under each curve (AUC). The AUCs of DriverRWH are all above 0.75, regardless of the choice of network. Both the ROC curves and the AUCs show that DriverRWH outperforms the other four tools in terms of sensitivity and specificity in identifying driver genes (Fig 3A). We further assessed the predictive power for the top-ranked candidate genes. As shown in Fig 3B, DriverRWH identified more known cancer driver genes among its top 20, 50, 100, 150 and 200 genes. Moreover, more than half of the 20 top-ranked candidates retrieved by DriverRWH with the STRINGv10 network are known driver genes.

To assess the ability of DriverRWH to discover potential novel cancer driver genes, we considered the genes that are among the top 200 candidates predicted with both HumanNet and STRINGv10 but are not in the tumor-specific driver list, which yielded 72 genes after screening. Enrichment analysis using DAVID against the Genetic Association Database (GAD) shows that 36 genes (48.6%) are cancer-related (P-value = 5.92 × 10⁻⁶, FDR = 5.92 × 10⁻⁴) [33]. In particular, these genes are enriched for "lung cancer" (P-value = 1 × 10⁻³, FDR = 0.1217). The KEGG pathway enrichment analysis of the potential drivers is also encouraging: 8 genes (11.2%) are significantly enriched in the "PI3K−Akt signaling pathway" (adjusted P-value < 0.05), which is significantly related to lung cancer (supplementary Fig. 1).

Specifically, taking the top 30 candidate genes as significant drivers, we searched for these genes on the CoCiter website with the key terms ‘Cancer’, ‘Driver’ and ‘Lung’. As Table 1 shows, well-known driver genes such as TP53, PTEN and PIK3CA are near the top of the list. Although they are also identified by most of the other methods, those methods rank them lower than ours. The well-known tumor suppressor TP53, whose mutation disrupts the cell cycle arrest and apoptosis pathways in human cancer, ranks first in our method, but 527th in Gravity and 1414th in DawnRank. PTEN, a recognized tumor suppressor gene with phosphatase activity, has been shown to be related to small cell lung cancer [34]. It is co-cited with ‘Lung’ and ‘Cancer’ 253 and 2597 times, respectively, and is regarded as a driver gene in 35 publications. PTEN ranks 6th in our list (Table 1) but 26th in MutSigCV and 116th in DawnRank. Mutation of PIK3CA can lead to abnormally enhanced catalytic activity of PI3Ks and promote carcinogenesis in lung cancer [34]. It ranks 7th in our method but 22nd in both MutSigCV and OncodriveFML, and 476th in DawnRank. In addition, KDR (kinase insert domain-containing receptor), ranked 24th, has been reported to play a critical role in cancer metastasis and is used as a molecular target in cancer therapy [35]. KDR is co-cited with ‘Cancer’ 207 times and with ‘Lung’ 105 times, yet it is not currently regarded as a driver gene in lung cancer and can be considered a potential driver. GAD and KEGG pathway enrichment analyses show that the top candidate genes are enriched in small cell lung cancer, the PI3K-Akt signaling pathway, etc., which are significantly related to lung cancer (supplementary Fig. 1).

Table 1

Cociter mining analysis of top 30 candidate driver genes identified by DriverRWH (STRINGv10)
Genes	Lung	Cancer	Driver	Is_Specificity	MutsigCV	Dawnrank	Gravity	OncodriveFML
TP53	854	5942	55	1	1	1414	527	3
TTN	1	8	1	0	2	1	3959	13175
DNAH8	0	1	1	0	15	5960	NA	2741
RYR2	3	3	2	0	4	4528	400	11456
LRRK2	5	18	1	0	58	NA	1556	10604
PTEN	253	2597	35	1	26	116	588	2
PIK3CA	94	576	13	1	22	476	2536	22
NOTCH1	84	486	23	1	49	6836	1591	20
CSMD3	0	3	1	0	3	NA	NA	46
ANK2	1	4	0	0	99	5467	576	13775
SYNE1	1	2	1	0	11	NA	181	4035
KMT2D	1	18	1	1	NA	NA	3147	1
DMD	14	19	3	0	23	1774	1998	11399
USH2A	2	4	1	0	6	5828	4591	13959
OBSCN	0	4	0	0	209	4819	204	9252
RYR1	0	3	1	0	53	4830	1788	1299
NF1	12	137	8	1	52	17	3255	2881
LRP1B	8	15	2	0	5	5732	399	9317
APOB	3	23	1	0	84	6	253	8676
RELN	0	9	2	0	113	3	40	13119
MYH1	1	14	1	0	122	6876	NA	6933
EPHA5	2	6	1	0	172	31	NA	7697
MYH2	4	3	1	0	44	7419	NA	5025
KDR	105	207	3	0	131	543	549	2832
HERC2	0	14	1	0	155	NA	148	8436
POTEE	1	9	0	0	1153	NA	3138	8733
PIK3CG	36	119	1	0	426	2213	602	8186
CPS1	2	6	1	0	71	9	3852	13518
KMT2C	3	21	4	1	NA	NA	5041	479
HDAC9	5	18	1	0	371	4080	3783	10894

Results for Breast invasive carcinoma

Breast cancer is the most commonly diagnosed cancer, with an estimated 2.3 million new cases, accounting for 11.7% of all cancer cases in 2020 [32]. We used 791 BRCA samples from the TCGA database to construct the hypergraph.

The AUCs of DriverRWH with HumanNet and STRINGv10 reach 0.743 and 0.751, respectively, which are higher than those of the other methods. Although DriverRWH identified fewer known driver genes than MutSigCV among the top 20 candidates, it recovered more known driver genes in the top 50, 100, 150 and 200 candidates (Fig. 4B).

We next evaluated the capacity of DriverRWH to identify potential driver genes in breast cancer. As before, we took the 61 genes that are among the top 200 candidates predicted with both HumanNet and STRINGv10 but are not in the tumor-specific driver list, and conducted GAD and pathway enrichment analyses. Notably, 29 genes (44.6%) are enriched for "CANCER" (P-value = 1.67 × 10⁻⁴, FDR = 1.67 × 10⁻⁴) and 12 (18.5%) are enriched for "breast cancer" (P-value = 2.15 × 10⁻⁵, FDR = 0.0087). In the pathway analysis, these genes are significantly enriched in "Breast cancer". The top 25 pathways are shown in supplementary Fig. 2.

Table 2

Cociter mining analysis of top 30 candidate driver genes identified by DriverRWH
Genes	Breast	Cancer	Driver	Is_Specificity	MutsigCV	Dawnrank	Gravity	OncodriveFML
TP53	1177	5942	55	1	1	1447	884	5
PIK3CA	170	576	13	1	2	1	3949	9885
CDH1	291	1143	13	1	3	3	448	6
GATA3	84	114	4	1	4	4	179	1
TTN	2	8	1	0	NA	2	197	16438
PTEN	595	2597	35	1	6	86	300	7
MAP3K1	59	129	2	1	5	6102	208	4
KMT2C	3	21	4	1	NA	NA	2121	3
DNAH8	0	1	1	0	115	5364	NA	1942
AKT1	477	1863	13	1	NA	8	1226	2233
OBSCN	1	4	0	0	NA	4317	642	15684
DMD	1	19	3	0	16	1478	1457	5795
NF1	19	137	8	1	13	11	1641	25
UBC	176	653	4	0	197	NA	NA	2704
PRDM10	1	1	2	0	916	NA	4104	49
ERBB2	3631	4422	36	1	126	3407	1465	35
MYH9	10	32	4	0	30	6	1540	4141
NCOR1	16	58	2	1	9	330	76	22
FOXA1	82	128	5	1	26	2858	33	21
ANK3	3	4	2	0	NA	4782	2415	4803
LRRK2	4	18	1	0	218	NA	7388	3480
MTOR	321	1896	21	0	NA	NA	664	1419
EGFR	722	4091	94	0	NA	64	2909	14413
RYR2	2	3	2	0	NA	4570	2525	12929
PRKDC	57	274	4	0	77	563	127	9326
ANK2	0	4	0	0	NA	4686	3532	971
ASH1L	0	2	1	0	109	2754	957	6540
KDM6A	5	24	3	0	64	NA	2140	78
SYNE1	0	2	1	0	NA	NA	1233	12287
RUNX1	14	110	6	1	7	4105	748	17

The CoCiter scores of the top 30 candidate genes predicted by DriverRWH using the STRINGv10 network are shown in Table 2. In particular, 8 of the top 10 candidate genes are known driver genes, including the acknowledged driver gene TP53 (ranked 1st) and the most recurrently mutated gene PIK3CA (ranked 2nd). KMT2C, with high CoCiter scores, ranked 8th in DriverRWH but was not identified by MutSigCV or DawnRank and ranked 2121st in Gravity. AKT1, which co-occurs with "Cancer" 1863 times and with "Breast" 477 times, ranked 10th in DriverRWH but only 1226th in Gravity and 2233rd in OncodriveFML. ERBB2, which ranked 16th in DriverRWH, is confirmed to be related to breast cancer, but it ranked 35th in OncodriveFML, 126th in MutSigCV, 1465th in Gravity and 3407th in DawnRank [36]. Moreover, DriverRWH can identify genes that are highly related to breast cancer but were not recognized by the other four methods. For instance, EGFR, one of the first identified important targets of novel antitumor agents, co-occurs with "Breast" 722 times, with "Cancer" 4091 times, and with "Driver" 94 times [37]. MTOR ranked 22nd and co-occurs 321 times with "Breast", 1896 times with "Cancer", and 21 times with "Driver". We performed GAD and pathway enrichment analysis of the top 30 candidate driver genes. The identified genes are enriched for "breast cancer" in GAD. In the KEGG enrichment analysis, these genes are significantly enriched in "Breast cancer", "Proteoglycans in cancer", "Endometrial cancer", etc., which are associated with breast cancer (supplementary Fig. 2).

Results for Uterine corpus cancer

Uterine corpus cancer is the sixth most common type of cancer and the second most common gynecological malignancy in women, with more than 417,000 new cases and 97,000 deaths worldwide in 2020 [38]. We used data from 448 patients with 40,543 candidate genes from the TCGA database.

DriverRWH outperforms the other four prioritizing methods with the same reference driver genes as benchmarks when assessed by the AUCs and percentage of known driver gene in the top candidate genes (Fig. 5).

For the discovery of potential drivers, we selected 41 genes using the same criteria as above, of which 22 genes (51.2%) are associated with cancer (P-value = 1.37 × 10⁻⁴, FDR = 1.37 × 10⁻⁴). These genes are significantly enriched in the PI3K−Akt signaling pathway and the MAPK signaling pathway, both of which play an important role in cellular growth and survival and have been implicated in endometrial cancer pathogenesis [39].

Table 3

Cociter mining analysis of top 30 candidate driver genes identified by DriverRWH
Genes	Endometrial	Cancer	Driver	Is_Specificity	MutsigCV	Dawnrank	Gravity	OncodriveFML
PTEN	380	2597	35	1	1	1	233	168
TP53	143	5942	55	1	2	5	1687	403
PIK3CA	39	576	13	1	3	4	2	673
CTNNB1	112	2014	29	1	5	7	10	22
KRAS	51	2538	95	1	4	9	1787	8653
DNAH8	0	1	1	0	3741	75	6012	NA
LRRK2	0	18	1	0	2433	222	NA	7114
OBSCN	0	4	0	0	1171	80	5199	2055
PRDM10	0	1	2	0	7658	3312	NA	992
RANBP2	0	12	1	0	148	233	40	1176
NOTCH1	15	486	23	1	61	5537	7054	2680
TAF1	0	9	1	0	45	35	5227	171
ARID1A	14	67	4	1	31	3	4505	2
ANK3	0	4	2	0	216	58	5808	157
ATM	4	1222	5	1	241	17	1097	493
ALB	0	32	1	0	7085	2440	7566	7390
EP300	2	145	2	0	18	55	79	17
DMD	0	19	3	0	180	24	2776	3448
MTOR	63	1896	21	1	182	301	NA	1308
PRKDC	3	274	4	0	337	197	1217	41
CTCF	2	50	3	1	1214	6	24	19
TTN	0	8	1	0	12	12	3	1195
FGFR2	23	294	5	1	177	13	4875	1394
CAD	0	40	1	0	17	293	7174	1091
NSD1	0	16	2	1	1167	217	34	631
ASH1L	0	2	1	0	1279	195	4260	3296
TRRAP	0	26	1	0	81	207	8	96
POTEE	0	9	0	0	4902	1181	NA	4873
GLI3	2	36	2	0	21	1046	2263	3088
KMT2D	0	18	1	1	23	NA	NA	3113

Taking the top 30 candidate drivers into consideration, Table 3 shows the CoCiter scores between these candidate genes and the terms "Endometrial", "Cancer" and "Driver". The apoptosis-suppressing gene MTOR, which co-occurs with "Endometrial" 63 times and with "Cancer" 1896 times, ranked 19th in DriverRWH but 182nd, 301st and 1308th in MutSigCV, DawnRank and OncodriveFML, respectively. NOTCH1, which is tumor-suppressive in human endometrial cancer cells [40], ranked 11th in DriverRWH but 61st in MutSigCV, 2680th in OncodriveFML, 5537th in DawnRank and 7054th in Gravity. We performed GAD and pathway enrichment analysis of these candidate genes (supplementary Fig. 3). In the GAD enrichment analysis, these genes are enriched for "endometrial cancer", etc. In the pathway enrichment analysis, they are significantly enriched in "Endometrial cancer".

The stability of the performance across 31 cancer types

Furthermore, we compared the performance of DriverRWH with 21 up-to-date driver gene prediction methods in order to assess the stability of DriverRWH across 31 cancer types.

For DriverRWH and the four methods mentioned above, which provide ranks of candidate driver genes, the top 30 genes were selected as significant drivers [41]. For methods that generate P-values, an adjusted P-value < 0.05 was used as the threshold for calling driver genes [42, 43]. The details of the 21 tools and the criteria for candidate driver genes are provided in supplementary Table-S3. Fig 6 displays, for each method, the proportion of predicted driver genes that are present in the reference driver set across 31 cancer types, with methods ordered by their medians. In more than half of the 31 cancer types, approximately 50% of the top 30 candidate genes ranked by DriverRWH are known driver genes, which is significantly better than the results of the other methods. The median fractions of genes predicted by DriverRWH(S) and DriverRWH(H) that belong to the reference driver gene set are 53.3% and 46.7%, respectively. Although e.Driver obtains almost the same median as DriverRWH(H), its variance is much higher than that of DriverRWH [42].
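The quantity summarized in Fig 6 is, for each cancer type, the share of a method's predicted driver genes that fall in the reference set, followed by the median of these shares over cancer types. A minimal Python sketch is shown below; the function name, the placeholder gene labels and the per-type values are invented for illustration. With 30 predicted genes, a share of 53.3% corresponds to 16 of 30 and 46.7% to 14 of 30.

```python
from statistics import median


def fraction_in_reference(predicted, reference):
    """Share of predicted driver genes that appear in the reference set."""
    if not predicted:
        raise ValueError("no predicted genes")
    return sum(1 for g in predicted if g in reference) / len(predicted)


# Placeholder gene labels: 30 predicted genes, 16 of them in the reference set.
top30 = [f"G{i}" for i in range(30)]
reference = {f"G{i}" for i in range(16)}
frac = fraction_in_reference(top30, reference)

# Invented per-cancer-type fractions for one method.
per_cancer_type = {"TYPE_A": 16 / 30, "TYPE_B": 14 / 30, "TYPE_C": 20 / 30}
med = median(per_cancer_type.values())
```

For methods that call drivers by an adjusted P-value threshold, `predicted` would be the list of genes passing the threshold rather than a fixed top 30.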

Robustness of DriverRWH

To test the robustness of DriverRWH, we applied our method to perturbed mutation data and perturbed network data. For the mutation data, two types of perturbation were applied: (1) randomly retaining 50% or 10% of the samples and (2) randomly retaining 50% or 10% of the original mutation entries in the somatic mutation matrix. As shown in Fig. 7, averaged over 20 repeats, the AUC scores and the cumulative number of recovered driver genes show only a slight decrease when 50% or 10% of the samples, or 50% of the mutation information, were used. As expected, when only 10% of the mutation information was retained, there was a significant decrease. Interestingly, the performance for the top 20 candidates always remained at a high level. For the network data, two forms of perturbation were also applied: (1) randomly retaining 50% or 10% of the original network information and (2) adding 50% or 10% noise to the PPI data. Again, there was only a minor decrease in the AUC scores and the cumulative number of recovered cancer genes. These results suggest that perturbation of the mutation data and the network does not seriously affect the result, indicating that DriverRWH is highly robust to the quality of the input data.
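The two mutation-data perturbations can be sketched as follows in Python. This is an illustrative reading of the procedure, assuming the mutation data are stored as a mapping from samples to sets of mutated genes; the function names, the rounding rule and the toy data are hypothetical, and the exact sampling scheme used in this work may differ.

```python
import random


def subsample_samples(sample_to_genes, fraction, rng):
    """Keep a random fraction of the samples (hyperedges) with all their mutations."""
    k = max(1, round(fraction * len(sample_to_genes)))
    kept = rng.sample(sorted(sample_to_genes), k)
    return {s: set(sample_to_genes[s]) for s in kept}


def subsample_mutations(sample_to_genes, fraction, rng):
    """Keep a random fraction of the (sample, gene) entries of the mutation matrix."""
    pairs = sorted((s, g) for s, genes in sample_to_genes.items() for g in genes)
    k = max(1, round(fraction * len(pairs)))
    perturbed = {}
    for s, g in rng.sample(pairs, k):
        perturbed.setdefault(s, set()).add(g)
    return perturbed


# Invented toy data: 10 samples with 4 mutated genes each (40 entries in total).
rng = random.Random(0)
data = {f"S{i}": {f"G{i}", f"G{i + 1}", "TP53", "TTN"} for i in range(10)}
half_of_samples = subsample_samples(data, 0.5, rng)
tenth_of_mutations = subsample_mutations(data, 0.1, rng)
```

Repeating each perturbation with different seeds and averaging the resulting AUCs would mirror the repeated runs described above. Samples that lose all of their mutations simply disappear from the perturbed hypergraph.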

Discussion

In recent years, many methods have been developed to distinguish driver genes from passengers. Limited by the design of their models, most of them cannot express many-to-many association relationships. The information that such models compress away is the co-mutation information between samples. In this study, we propose a network-based method, DriverRWH, which effectively integrates mutation data and PPI network data to predict cancer driver genes. The novelty of our method lies in the introduction of a hypergraph, which is constructed to describe the mutations. We extended the typical random walk process on a simple graph to the hypergraph in a modified manner. Instead of taking the mutation frequency of single genes as model input, our model retains the co-mutation information of individual tumors, which captures their inherent characteristics and avoids the loss of information.
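To make the idea of a random walk on a hypergraph concrete, the sketch below implements the standard hypergraph random walk with restart. It is not the modified walk used by DriverRWH, whose exact form is not given in this section. It assumes that each tumor sample forms one hyperedge over its mutated genes; the incidence matrix, the hyperedge weights and the restart probability of 0.3 are hypothetical values chosen for illustration.

```python
import numpy as np


def hypergraph_transition_matrix(H, w):
    """Standard hypergraph random walk.

    H is the |genes| x |hyperedges| 0/1 incidence matrix and w the hyperedge
    weights. From gene u, pick an incident hyperedge e with probability
    proportional to w(e), then move to a gene chosen uniformly within e.
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    d_v = H @ w            # weighted vertex degrees
    d_e = H.sum(axis=0)    # hyperedge sizes
    if np.any(d_v == 0) or np.any(d_e == 0):
        raise ValueError("every gene and every hyperedge must be non-isolated")
    # P[u, v] = sum_e w(e) * H[u, e] * H[v, e] / (d_v[u] * d_e[e])
    return (H * (w / d_e)) @ H.T / d_v[:, None]


def random_walk_with_restart(P, p0, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * P^T p + restart * p0 until convergence."""
    p0 = np.asarray(p0, dtype=float)
    p0 = p0 / p0.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (P.T @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy hypergraph: 4 genes (rows), 3 samples as hyperedges (columns).
H = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1]])
w = np.array([1.0, 2.0, 1.0])        # hypothetical hyperedge weights
P = hypergraph_transition_matrix(H, w)
scores = random_walk_with_restart(P, p0=np.ones(4))
ranking = np.argsort(-scores)        # gene indices, highest score first
```

Because every hyperedge links all genes mutated in the same sample, the transition matrix uses the full co-mutation pattern rather than per-gene mutation frequencies alone.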

Using a reference driver gene set as a benchmark, DriverRWH consistently outperformed the other four state-of-the-art prioritization methods in terms of ROC analysis, AUC scores, rank of driver genes and the cumulative number of known driver genes recovered among the top-ranked candidate genes. Moreover, DriverRWH can also discover new potential driver genes that are co-cited in the cancer-associated literature, and its high-ranking genes are enriched in significant cancer pathways. Finally, by taking the top 30 genes as predicted candidate drivers, we can compare DriverRWH with non-ranking methods. The results show that DriverRWH achieves higher performance than four prioritization methods and 19 other non-ranking methods across 31 cancer types.

Despite these encouraging results, the current model has several limitations. First, for TCGA data, tumor heterogeneity may increase data bias, and future work should reduce false-positive discoveries by using single-cell genomics data. Second, DriverRWH relies on a broad-context molecular network that is still incomplete, so refined gene functional networks could improve the performance of our method in the near future. A cancer-specific network might better represent the natural interactions of genes in cancer and potentially provide a more reliable network. Third, our method focuses on general driver gene detection and does not aim to offer personalized diagnosis, which is more useful in real applications. In the future, we plan to extend our method to discover drivers in a personalized manner.

Conclusions

Recently, many computational methods and tools have been proposed to identify driver genes. However, the long-tail distribution of gene mutation frequencies in cancer genomes remains a major concern. Many widely accepted methods are based on mutation frequencies, but they fail to comprehensively consider the co-mutation information in individuals. Because a hypergraph retains complete co-occurrence information, we introduced hypergraph theory into driver gene prediction, thus compensating for the loss of co-mutation information in existing methods. For each hyperedge, the vertex degrees in the corresponding induced subnetwork of the PPI network were used to design the weighted hypergraph, which is how we integrated the mutation data and the PPI data. Subsequently, motivated by the PageRank algorithm, we implemented a random walk with restart on the hypergraph and proposed a novel approach, DriverRWH, to prioritize mutated genes. As demonstrated in this paper, DriverRWH not only outperforms existing methods in identifying known driver genes but is also capable of discovering potential driver genes. Furthermore, the model behaves robustly under perturbation of the mutation data and the network data. Our results show that DriverRWH can be a useful tool for prioritizing driver genes. The source code of DriverRWH is freely available at https://github.com/ShandongUniversityZhanglab/DriverRWH.
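The weighting step summarized above can be illustrated with a small sketch. The exact weight formula is not stated in this section, so the choice below (one plus the mean degree in the induced PPI subnetwork) is an assumption made purely for illustration, and the function names and toy data are hypothetical. The authors' actual implementation is the one in the repository linked above.

```python
def induced_degrees(hyperedge, ppi_edges):
    """Degree of each gene in the PPI subnetwork induced by the hyperedge's genes.

    ppi_edges is assumed to hold each undirected interaction once.
    """
    genes = set(hyperedge)
    degree = {gene: 0 for gene in genes}
    for a, b in ppi_edges:
        if a != b and a in genes and b in genes:
            degree[a] += 1
            degree[b] += 1
    return degree


def hyperedge_weight(hyperedge, ppi_edges):
    """ASSUMED weight: 1 + mean induced degree (not the formula from the paper)."""
    degree = induced_degrees(hyperedge, ppi_edges)
    if not degree:
        raise ValueError("hyperedge must contain at least one gene")
    return 1.0 + sum(degree.values()) / len(degree)


# Toy example: one sample with three mutated genes and a four-gene PPI network.
sample_genes = {"A", "B", "C"}
ppi = [("A", "B"), ("B", "C"), ("C", "D")]
degrees = induced_degrees(sample_genes, ppi)   # the C-D edge lies outside the sample
weight = hyperedge_weight(sample_genes, ppi)
```

Under this assumed scheme, samples whose mutated genes interact densely in the PPI network receive larger hyperedge weights, which is one way the two data sources could be combined.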

Abbreviations

TCGA: The Cancer Genome Atlas
NGS: Next-Generation Sequencing
SNV: Single Nucleotide Variants
BRCA: Breast invasive carcinoma
LUSC: Lung squamous cell carcinoma
UCEC: Uterine Corpus Endometrial Carcinoma
ROC: Receiver Operating Characteristic
AUC: Area Under Curve
PPI: Protein to Protein Interaction network
CGC: Cancer Gene Census
HCD: High Confidence cancer Driver genes
GAD: Genetic Association Database

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and material

The source code and example datasets used in this research can be downloaded from https://github.com/ShandongUniversityZhanglab/DriverRWH

Competing interests

The authors declare that they have no competing interests.

Funding

This work has been supported by the National Natural Science Foundation of China [62072277 to NZ, 61972257 to XZ, 61877064 to YZ, 12071351 to JC].

Authors' contributions

This project was designed and supervised by NZ. NZ, JC, YZ and XZ proposed the hypergraph model. NZ, CW and JS designed and performed the data analyses. JS implemented the software package. The manuscript was written by NZ, CW and JS. All authors read and approved the final manuscript.

Acknowledgments

The results here are in whole or part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.

Authors' information

¹School of Mathematics and Statistics, Shandong University, Weihai 264209, China. ²Department of Mathematics, Weifang University, Weifang, Shandong 261061, China. ³Department of Mathematics, Shanghai Normal University, Shanghai 200234, China.

*To whom correspondence should be addressed.

†The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.

References

Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719–724.
Stratton MR: Exploring the Genomes of Cancer Cells: Progress and Promise. Science 2011, 331(6024):1553–1558.
Chin L, Meyerson M, Aldape K, Bigner D, Mikkelsen T, VandenBerg S, Kahn A, Penny R, Ferguson ML, Gerhard DS et al: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061–1068.
Brin S, Page L: The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 1998, 30(1-7):107–117.
Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr, Kinzler KW: Cancer Genome Landscapes. Science 2013, 339(6127):1546–1558.
Han Y, Yang J, Qian X, Cheng WC, Liu SH, Hua X, Zhou L, Yang Y, Wu Q, Liu P et al: DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies. Nucleic Acids Res 2019, 47(8):e45.
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA et al: Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499(7457):214–218.
Reimand J, Wagih O, Bader GD: The mutational landscape of phosphorylation signaling in cancer. Scientific Reports 2013, 3.
Cheng F, Zhao J, Zhao Z: Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes. Briefings in Bioinformatics 2016, 17(4):642–656.
Gonzalez-Perez A, Lopez-Bigas N: Functional impact bias reveals cancer drivers. Nucleic Acids Res 2012, 40(21).
Gao B, Li GJ, Liu JT, Li Y, Huang XZ: Identification of driver modules in pan-cancer via coordinating coverage and exclusivity. Oncotarget 2017, 8(22):36115–36126.
Leiserson MD, Vandin F, Wu HT, Dobson JR, Eldridge JV, Thomas JL, Papoutsaki A, Kim Y, Niu B, McLellan M et al: Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet 2015, 47(2):106–114.
Hou JP, Ma J: DawnRank: discovering personalized driver genes in cancer. Genome Medicine 2014, 6:16.
Cheng W-C, Chung IF, Chen C-Y, Sun H-J, Fen J-J, Tang W-C, Chang T-Y, Wong T-T, Wang H-W: DriverDB: an exome sequencing database for cancer driver gene identification. Nucleic Acids Res 2014, 42(D1):D1048-D1054.
Skoulidis F, Heymach JV: Co-occurring genomic alterations in non-small-cell lung cancer biology and therapy. Nature Reviews Cancer 2019, 19(9):495–509.
Uren AG, Kool J, Matentzoglu K, de Ridder J, Mattison J, van Uitert M, Lagcher W, Sie D, Tanger E, Cox T et al: Large-scale mutagenesis in p19(ARF)- and p53- Deficient mice identifies cancer genes and their collaborative networks. Cell 2008, 133(4):727–741.
Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB et al: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455(7216):1069–1075.
Bellaachia A, Al-Dhelaan M: Random Walks in Hypergraph. 2013.
Tomczak K, Czerwinska P, Wiznerowicz M: The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary oncology (Poznan, Poland) 2015, 19(1A):A68-77.
Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM: Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 2011, 21(7):1109–1121.
Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP et al: STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015, 43(Database issue):D447-452.
Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, Cole CG, Ward S, Dawson E, Ponting L et al: COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res 2017, 45(D1):D777-D783.
Kumar RD, Searleman AC, Swamidass SJ, Griffith OL, Bose R: Statistically identifying tumor suppressors and oncogenes from pan-cancer genome-sequencing data. Bioinformatics 2015, 31(22):3561–3568.
Reimand J, Wagih O, Bader GD: The mutational landscape of phosphorylation signaling in cancer. Sci Rep 2013, 3:2651.
Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, Tamborero D, Schroeder MP, Jene-Sanz A, Santos A, Lopez-Bigas N: IntOGen-mutations identifies cancer drivers across tumor types. Nat Methods 2013, 10(11):1081–1082.
Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009, 37(1):1–13.
Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols 2009, 4(1):44–57.
Qiao N, Huang Y, Naveed H, Green CD, Han J-DJ: CoCiter: An Efficient Tool to Infer Gene Function by Assessing the Significance of Literature Co-Citation. Plos One 2013, 8(9).
Cheng F, Liu C, Lin CC, Zhao J, Jia P, Li WH, Zhao Z: A Gene Gravity Model for the Evolution of Cancer Genomes: A Study of 3,000 Cancer Genomes across 9 Cancer Types. PLoS Comput Biol 2015, 11(9):e1004497.
Hou JP, Ma J: DawnRank: discovering personalized driver genes in cancer. Genome Medicine 2014, 6.
Mularoni L, Sabarinathan R, Deu-Pons J, Gonzalez-Perez A, Lopez-Bigas N: OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations. Genome Biol 2016, 17(1):128.
Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F: Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021, 71(3):209–249.
Becker KG, Barnes KC, Bright TJ, Wang SA: The Genetic Association Database. Nature Genetics 2004, 36(5):431–432.
Tan AC: Targeting the PI3K/Akt/mTOR pathway in non-small cell lung cancer (NSCLC). Thoracic Cancer 2020, 11(3):511–518.
An SJ, Chen ZH, Lin QX, Su J, Chen HJ, Lin JY, Wu YL: The-271 G > A polymorphism of kinase insert domain-containing receptor gene regulates its transcription level in patients with non-small cell lung cancer. Bmc Cancer 2009, 9.
Waks AG, Winer EP: Breast Cancer Treatment A Review. Jama-Journal of the American Medical Association 2019, 321(3):288–300.
Masuda H, Zhang DW, Bartholomeusz C, Doihara H, Hortobagyi GN, Ueno NT: Role of epidermal growth factor receptor in breast cancer. Breast Cancer Research and Treatment 2012, 136(2):331–345.
Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A: Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. Ca-a Cancer Journal for Clinicians 2018, 68(6):394–424.
Slomovitz BM, Coleman RL: The PI3K/AKT/mTOR Pathway as a Therapeutic Target in Endometrial Cancer. Clinical Cancer Research 2012, 18(21):5856–5864.
Sasnauskiene A, Jonusiene V, Krikstaponiene A, Butkyte S, Dabkeviciene D, Kanopiene D, Kazbariene B, Didziapetriene J: NOTCH1, NOTCH3, NOTCH4, and JAG2 protein levels in human endometrial cancer. Medicina-Lithuania 2014, 50(1):14–18.
Guo W-F, Zhang S-W, Liu L-L, Liu F, Shi Q-Q, Zhang L, Tang Y, Zeng T, Chen L: Discovering personalized driver mutation profiles of single samples in cancer by network control strategy. Bioinformatics 2018, 34(11):1893–1903.
Porta-Pardo E, Godzik A: e-Driver: a novel method to identify protein regions driving cancer. Bioinformatics 2014, 30(21):3109–3114.
Jia PL, Wang Q, Chen QX, Hutchinson KE, Pao W, Zhao ZM: MSEA: detection and quantification of mutation hotspots through mutation set enrichment analysis. Genome Biol 2014, 15(10).


TableS1.xls
Table-S1: The details for 31 cancer types used in this work. (XLS, 29KB)
TableS2.xls
Table-S2: The known driver gene lists we used as benchmarks in this work. (XLS, 119KB)
TableS3.xls
Table-S3: The details of 21 tools and the criteria for candidate driver genes. (XLS, 29KB)
supplementary.docx
Supplementary: It contains the results of cociter mining and KEGG enrichment analysis in LUSC, BRCA and UCEC. (DOCX, 58111KB)


Editorial decision: Major revision (04 Apr, 2022)
Reviews received at journal (19 Mar, 2022)
Reviewers agreed at journal (18 Mar, 2022)
Reviewers agreed at journal (09 Mar, 2022)
Reviewers invited by journal (25 Feb, 2022)
Editor assigned by journal (25 Feb, 2022)
Editor invited by journal (27 Dec, 2021)
Submission checks completed at journal (27 Dec, 2021)
First submitted to journal (21 Dec, 2021)
