Overview
In the present model, we incorporated somatic mutation profiles and PPI datasets to prioritize the driver genes. At first, a weighted mutation hypergraph was constructed, wherein tumor samples are presented as hyperedges and mutant genes are presented as vertices. To be clear, hypergraph is a generalization of a traditional simple graph where its edges (called hyperedges) are allowed to connect arbitrary number of vertices rather than two. According to our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, we differentiated gene within a hyperedge of sample in accordance with their degrees in the corresponding induced subnetwork of the PPI network. Second, we extended the typical random walk process on a simple graph to the hypergraph in a modified manner and carry out this process iteratively. After some steps, the random walk would stabilize. At last, all candidate mutant genes are ranked in descending order based on their DriverRWH score.
Datasets and networks
Somatic mutation data for 9183 tumor samples across 31 cancer types (Table-S1) used in this work are available from TCGA, which were downloaded by UCSC Browser (https://xenabrowser.net/datapages/) [19]. We downloaded two independently developed PPI datasets from the HumanNet (http://www.functionalnet.org/humannet/) [20] and the STRINGv10 (https://string-db.org) [21] .
Performance evaluation
To evaluate the method, an unbiased comprehensive known cancer gene set is needed. Unfortunately, such a gold-standard set of cancer genes is currently unavailable. Alternatively, we used four complementary cancer gene sets derived from various sources as the reference driver gene set for all the cancer types. First, 616 cancer genes were downloaded from the Cancer Gene Census (CGC) database, which includes genes for which mutations have been causally implicated in cancer and is widely used as a gold-standard cancer gene set [22]. Second, the list of HiConf cancer gene panels consists of 99 driver genes that have previously been detected through genetic criteria and that could plausibly be detected with exome sequencing data [23]. The third set has 291 high-confidence cancer driver genes identified by a rule-based method (HCD) [24]. The fourth set contains 125 driver genes defined by the "20/20 rules", which identifies Mut-driver genes based on the characteristic mutational patterns for oncogenes and tumor suppressor genes [25]. Now that each cancer gene set is biased toward particular features or study methods, we utilized a union of these four lists as the reference driver gene set, with a total of 785 genes. This operation can reduce the bias caused by using a single reference gene list to some degree. Using aforementioned reference driver genes as a benchmark, we generated receiver operating characteristic (ROC) curves and areas under the curve (AUCs) to evaluate the true positive and false positive rate. For practical reasons, only top-ranked candidate genes might enter into follow-up experimental validation. Considering that the high performance of prioritization for all genes cannot guarantee successful prioritization for the top ranked candidates, we also assessed the number of known driver gene recovered in the top 20, 50, 100,150 and 200 candidate genes.
Due to the diversity of cancer types, we are more interested in tumor-specific drivers than the general common drivers across all tumor types. We downloaded IntOGen database (https://www.intogen.org/download) [25]. This database harnesses the strengths of different driver prediction methods and provides a tumor-specific driver genes list, which is considered to be the best trade-off between sensitivity and specificity. This list contains 31 types of cancer among which Kidney Chromophobe (KICH) has 7 specific drivers (minimum) and Uterine Corpus Endometrial Carcinoma (UCEC) has 55 (maximum). All of the above lists are shown in Table-S2. From an application point of view, we should assess the ability of our method to identify novel driver genes that may not have been discovered in IntOGen. The genes in top 200 candidate gene list predicted by DriverRWH with both HumanNet and the STRINGv10 while not in tumor-specific drivers were evaluated by the enrichment analysis using DAVID on-line database [26, 27].
Specifically, the top 30 genes are selected as significant drivers. We leveraged a literature mining method named CoCiter, which calculates the co-citation significance between predicted driver genes and the keywords cancer type, ‘driver’ and ‘cancer’ to verify the top 30 significant genes [28]. The higher co-citation score implicates the stronger association between the genes and the key terms. Without loss of generality, we compared DriverRWH with 21 driver gene prediction methods across 31 cancer type, some of which identify significant drivers by p-value (the genes with FDR adjusted P value < 0.05) and the rest of methods provide the priority scores for candidate driver genes (the top 30 genes are selected as significant drivers). It is acceptable for the reason that the median number of significant genes for other methods in all data sets is 30.
The DriverRWH algorithm
In this study, we propose the DriverRWH by random walk on hypergraph to prioritize candidate driver gene. First, a hypergraph consisting of the mutated genes of all samples was constructed, wherein samples were presented as hyperedges and genes were presented as vertices (Fig. 1B). If a gene is mutated in a sample, it would be presented as a vertex in the hyperedge corresponding to the sample. Without loss of generalization, the hypergraph can be defined as \(HG(V,\mathcal{E})\), where \(V\) is the set of vertices and \(\mathcal{E}\) is the set of hyperedges. A hyperedge \(e\) is a subset of \(V\), satisfying \(\bigcup _{e\in \mathcal{E}}=V\). Hyperedge \(e\) is said to be incident with vertex \(u\) if \(u\in e\); thus, the incidence matrix \(H\in {R}^{\left|V\right|\times \left|\mathcal{E}\right|}\) can be defined as follows:
\(\)\(h\left(u,e\right)=\left\{\begin{array}{ll}1& \text{ if }u\in e\\ 0& \text{ if }u\notin e\end{array}\right.\)
After construction of the hypergraph, we developed a random walk process on it. Similar to a random walk on a simple graph, this walk is a type of Markov process, which is seen as the transition between two vertices. Note that the transition on the hypergraph occurs only if two vertices are incident to a hyperedge, so the random walk on a hypergraph is defined to be a two-step process. In the first step, the surfer selects a hyperedge \(e\) incident with the current vertex \(u\); thereafter, it selects a target vertex \(v\) within the chosen hyperedge (Fig. 1C).
According to our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, a fairly standard choice of the weight of vertices in each hyperedge are their degrees in the corresponding induced subnetwork of the PPI network. If one vertex is an isolated node in the subnetwork, it also has the potential to be a driver gene, so a small weight of 0.01 is set. Let \(Ne\) be the subnetwork containing vertices in hyperedge \(e\) and denote \({d}_{Ne}\left(u\right)\) as the degree of \(u\) in the subnetwork.
$$w(u,e)=\left\{\begin{array}{cc}{d}_{Ne}\left(u\right),& \text{ if }u\in e\\ 0.01,& \text{ if }u\notin e\end{array}\right.$$
Thereafter, the surfer selects vertex \(v\) proportional to the weight of \(v\) within the hyperedge. Notably, in our model, the weights of vertices may vary in accordance with the hyperedges. According to the aforementioned definition, the degree of vertex \(u\) and hyperedge \(e\) in hypergraph \(HG(V,\mathcal{E})\) can be defined as follows:
$$d\left(u\right)=\sum _{e\in \mathcal{E}}h\left(u,e\right)$$
$$\delta \left(e\right)=\sum _{u\in e}w(u,e)$$
With all the elements defined, we calculated the transition probability from vertex \(u\) to vertex \(v\) as follows:
$$P\left(u,v\right)=\sum _{e\in \mathcal{E}}\frac{h\left(u,e\right)}{d\left(u\right)}\frac{w\left(v,e\right)}{\sum _{\widehat{v}\in e}w\left(\widehat{v},e\right)}$$
which can also be written in matrix form:
$$P={{D}_{u}}^{-1}H{D}_{e}^{-1}{W}^{T}$$
where \({D}_{u}\in {R}^{\left|V\right|\times \left|V\right|}\) is the diagonal vertex degree matrix, \({D}_{e}\in {R}^{\left|\mathcal{E}\right|\times \left|\mathcal{E}\right|}\) is the diagonal hyperedge degree matrix with element \(\delta \left(e\right)\) and \(W\in {R}^{\left|V\right|\times \left|\mathcal{E}\right|}\) is the weighted incident matrix of hypergraph \(HG(V,\mathcal{E})\). Note that the transition matrix \(P\) is stochastic, where each row sums to 1.
Furthermore, we implemented a random walk with restart on the hypergraph. All genes are considered to be potential driver genes and are assigned with equal probabilities; i.e., the initially normalized probability vector \(\overrightarrow{v}\left(0\right)\in {R}^{\left|V\right|\times 1}\) such that each element is assigned with equal probability \(\frac{1}{\left|V\right|}\). Moreover, the restart probability at every step is set to be \(1-\alpha (0cript>\). In this article, we set \(\alpha\) to be 0.2. Finally, the random walk formula can be expressed as follows:
$$\overrightarrow{v}\left(t+1\right)=\alpha {P}^{T}\overrightarrow{v}\left(t\right)+\left(1-\alpha \right)\overrightarrow{v}\left(0\right) , t=\text{0,1},2,\dots$$
In the formula above, \(\overrightarrow{v}\left(t\right)\) is defined such that the \(i\)th element means the probability that the surfer stops at vertex \(i\) at step \(t\). After a number of steps, the random walk will be stable, which can be defined as \(\overrightarrow{v}\left({\infty }\right)\). The stabilized state implies that the distance between \(\overrightarrow{v}(t+1)\) and \(\overrightarrow{v}\left(t\right)\) by the L1 norm is smaller than the provided cutoff value. In this paper, we set the cutoff as \({10}^{-6}\). The elements of the stabilized vector \(\overrightarrow{v}\) are defined as the DriverRWH score, which can reflect the role that the mutated genes play in cancer. A higher score implies a potential driver gene.