Integrating Protein–Protein Interaction Networks and Somatic Mutation Data to Detect Driver Modules in Pan-Cancer

With the constant update of large-scale sequencing data and the continuous improvement of cancer genomics data, such as International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), it gains increasing importance to detect the functional high-frequency mutation gene set in cells that causes cancer in the field of medicine. In this study, we propose a new recognition method of driver modules, named ECSWalk to solve the issue of mutated gene heterogeneity and improve the accuracy of driver modules detection, based on human protein–protein interaction networks and pan-cancer somatic mutation data. This study first utilizes high mutual exclusivity and high coverage between mutation genes and topological structure similarity of the nodes in complex networks to calculate interaction weights between genes. Second, the method of random walk with restart is utilized to construct a weighted directed network, and the strong connectivity principle of the directed graph is utilized to create the initial candidate modules with a certain number of genes. Finally, the large modules in the candidate modules are split using induced subgraph method, and the small modules are expanded using a greedy strategy to obtain the optimal driver modules. This method is applied to TCGA pan-cancer data and the experimental results show that ECSWalk can detect driver modules more effectively and accurately, and can identify new candidate gene sets with higher biological relevance and statistical significance than MEXCOWalk and HotNet2. Thus, ECSWalk is of theoretical implication and practical value for cancer diagnosis, treatment and drug targets.


Introduction
In recent years, with the continuous advancement of cancer research at the bio-molecular level, targeted therapy for cancer-causing genes has become a major area in cancer research and treatment [1]. For the driver genes that have functional mutations in cancer, specific drugs can be used to control their expression and transcription levels, and effectively control the development and deterioration of cancer. However, all driver mutations in cancer cannot could be identified effectively by statistical analysis of a single mutation gene [2]. Therefore, compared with the screening of a single driver mutation gene, detecting mutated driver gene sets in cancer is of high biological relevance and statistical significance [3,4]. In particular, the driver gene sets can be used not only to investigate the pathophysiology of cancer in more depth, but also to identify therapeutic targets for clinical cancer treatment by inferring the interaction of upstream and downstream genes in the driver modules, which thus provides reliable theoretical basis and data support for precision medicine and personalized medicine [5].
Previous studies have found that screening for high-frequency mutations is conducive to identifying the group of mutated driver genes in cancer [6][7][8], such as EGFR [9], TP53 [10], PIK3CA [11] and other high-frequency mutation genes, which are screened as driver factors during tumorigenesis. However, it normally involves a large amount of time and efforts to identify high-frequency mutation gene sites in a large number of tumor samples for biological experimental verification, which is difficult to achieve with the current technical methods, experimental conditions, and research capabilities. In addition, the current methods tend to ignore the problem of mutation heterogeneity in complex diseases [12,13]. To solve this problem, most of the previous studies utilized the high mutual exclusivity and high coverage widely existed in the genome map to explore the driver pathways that lead to the occurrence of cancer [6,7,14,15].
Based on the characteristics of high mutual exclusivity and high coverage, Vandin et al. [6] proposed the Dendrix algorithm to identify driver pathways from somatic mutation data. This method introduces a penalty overlap and a reward coverage mechanism to solve the driver pathway identification maximum weight sub-matrix problem based on gene mutation data. The Dendrix algorithm not only improves coverage but also guarantees mutual exclusivity among the genes in a driver pathway. However, this iterative search method is prone to producing local optimal solutions, and in advance, it is necessary to specify the number of genes in a driver pathway. Leiserson et al. [7] proposed a Multi-Dendrix algorithm based on the Dendrix algorithm, which detects multiple driver pathways at the same time. The algorithm utilizes linear regression to simultaneously detect multiple gene sets that meet high coverage and high mutual exclusivity, and can generate a globally optimal solution. However, the Multi-Dendrix algorithm needs to pre-specify both the maximum number of genes in a driver pathway and the number of driver pathways. Thus, the two algorithms do not have good universality and robustness, and tend to have certain limitations when being applied to different data sets.
Pan-cancer data analysis provides new ideas and methods for clinical diagnosis and treatment across cancer types [16][17][18][19]. As one application, Leiserson et al. [17] proposed the HotNet2 algorithm by integrating differential expression genes, significant mutation genes and protein-protein interaction networks to detect combinations of rare somatic mutations. Under the principle of thermal diffusion and the random walk with restart model, the edge weight that becomes stable after the random walk is restarted as the weight of the directed edge. By removing the directed edge with a small weight, the method detects strongly connected components as driver modules in the directed graph. Although the algorithm reduces the output of false positive results and improves the accuracy of the prediction results, the conversion probability just considers the degree of the vertex during the random walk process, but ignores the mutual exclusivity among genes. Therefore, the problem of gene mutation heterogeneity cannot be effectively solved in the pan-cancer data of great different varieties.
Based on the HotNet2 algorithm, Rafsan et al. [19] proposed a MEXCOWalk algorithm to detect driver modules using split and expansion techniques. The algorithm utilizes mutual exclusivity between genes and coverage scores of gene set to reflect the edge weight of the network. Then the random walk with restart strategy is used to construct a weighted directed network, and the split and expansion techniques are utilized to identify driver modules. Although the set of driver genes identified by this algorithm has high mutual exclusivity and high coverage, the algorithm ignores the mutual exclusivity between expanded leaf nodes and seed modules in the expansion stage of small modules. Therefore, this algorithm reduces the accuracy of driver module identification to a certain extent.
As mentioned above, although the previous methods can detect the gene sets with high mutual exclusivity and high coverage, they just focus on the mutual exclusivity and coverage between genes, instead of the topological structure of complex networks. To effectively solve the problem of mutated gene heterogeneity and improve the accuracy of driver modules, this study proposes a driver module detection algorithm (ECSWalk) based on gene mutation and human protein-protein interaction network. The algorithm takes into account aspects, such as high mutual exclusivity and high coverage between genes, and high similarity of topological structure. First, the complex network topology analysis method is used in human protein-protein interaction network data to calculate the topological similarity between network nodes, and then the two characteristics of high coverage and high mutual exclusivity of the mutated genes are combined to obtain the weight of the vertices and edges in the human protein-protein network. The weights of vertices in the human protein-protein network are obtained according to the coverage of mutated gene, and the random walk with restart strategy is utilized to calculate the weights of edges in the network by the three characteristics, namely, the coverage, the mutual exclusivity, and the similarity of the topological structure between the nodes. Second, based on the weighted network constructed in the previous step, the large modules are split into several candidate gene sets using the method of the induced subgraph. In addition, the greedy strategy is utilized to add the nodes in the leaf module to the seed module to achieve the optimal gene sets. These mutated gene sets with high mutual exclusivity, high coverage and high similarity of the topological structure are likely to work as driver modules in cancer [20,21]. This study not only applies the analysis method of complex network topology to the biological network, but also improves the method of determining module size based on split and expansion. The study clarifies the interactions between genes in the detected driver modules, which promote the study of cancer pathogenesis and drug targets.

Mutual Exclusivity and Coverage
To accurately detect and classify a large number of genes in the cancer genome map, and reduce the error in the actual biological sequencing experiment process, this study adopts the definition of mutual exclusivity and coverage [19] as shown below.
Let G(V, E) denote the protein-protein interaction (PPI) network, each vertex u i ∈ V corresponds to a protein in PPI network, and each protein u i corresponds to a mutated gene g i . The undirected edge (u i , u j ) ∈ E in PPI network corresponds to the interaction between gene pair (g i , g j ) . Therefore, the node g i represents both a gene and the corresponding protein in G. The sample with gene g i mutated is represented by S i , and M ⊆ V is a subset of genes. For any pair of genes If CD(M) = 1 , then the gene subset M completely covers all patients, That is, at least one gene within the subset M is mutated in each sample, and it is possible that none gene of some samples within the subset M is mutated. Figure 1 is a mutation matrix, in which each row represents a patient, each column represents a gene, the black rectangular box M ij represents g j is mutated in sample S i , and the white rectangular box M ij represents g j is not mutated in sample S i . The mutation matrix includes two gene sets ( M 1 , M 2 ), and the exclusive degree of each gene set is 1, that is, a gene set is considered mutually exclusive if at most one gene within the gene set is mutated in each sample. Errors in real biological data will interfere with the calculation of mutual exclusivity and coverage, therefore, a gene set that is approximately mutually exclusive and high coverage is usually considered as a driver pathway.

Node Similarity
The abnormal local area network composed of nodes and their neighbor nodes may affect the performance of the entire network in complex networks, where neighbor nodes refer to the nodes within one step of a core node. Therefore, to measure the topological relationship of each mutated gene in the same driver module, we set the local area network with node g i as the center and its direct neighbor node as the radius. Combined with the characteristics of the local area network structure, we propose node similarity based on Jensen-Shannon (JS) divergence and analyze the topological structure of the protein-protein interaction network [22][23][24]. The similarity is mainly defined by the discrete probability set, and the index construction process mainly includes the following two steps, namely, to construct the probability set [25], and to define the node similarity according to the JS divergence.
Construction of the probability set. Let d i be the degree of the ith gene node, and d max be the maximum node degree in the local area network. Suppose there are N nodes in the probability set of one node, N = d max + 1 . The sum of node degrees D g i of gene node g i in its local area network is expressed as follows: where N is the number of genes in the local area network, and d(j) denotes the degree of the jth gene in the local area network of gene g i .
In the local area network, the discrete probability of gene g i is expressed as follows: Standardize the discrete probabilities of gene g i , and sort the discrete probabilities of genes in the local area network of gene g i from large to small, and obtain the set of discrete probabilities as P(i): where p i (n) represents the discrete probability value of the nth gene in the local area network with N gene nodes (n ⩽ N) . Therefore, in the discrete probability set P(i), the N elements in set P(i) are the discrete probability values of each node in the local area network.
Jensen-Shannon (JS) divergence is an improved method based on KullbackLeibler (KL) divergence, which is known as relative entropy. In information theory, relative entropy is a measure of the difference between two probability distributions P(x) and Q(x). The difference between the Shannon entropies of the two probability distributions P(x) and Q(x) is the value of KL divergence, which is expressed as follows: The KL divergence is asymmetric, so the position exchange of the two probability distributions P(x) and Q(x) will get different results. JS divergence is a variant form of KL that solves asymmetric problems, which is expressed as follows: In this study, we suppose that two adjacent genes g i and g j in the constructed gene network correspond to two different probability sets P(i) and P(j), where P(i) and P(j) have the same number of genes in the local area network. According to the set of discrete probability constructed in formula (6), KL divergence value between genes g i and g j is expressed as JS divergence value is obtained based on the KL divergence value, which is expressed as Therefore, we investigate the similarity of gene pairs using network topology characteristics. Node similarity is defined as follows: Obviously, the larger the SIM(g i , g j ) value, the more similar the two gene nodes in the network. It can be seen from formula (10) that the value range of SIM( indicates that the two gene nodes in the network have the same topological structure. As shown in Fig. 2, each gene in the network has a specific topological structure. We first choose three nodes g 2 , g 4 and g 6 and calculate the discrete probability set of the three nodes according to formula (5), that is, Topological similarity of the two nodes can be calculated according to formula (10), that is, SIM(g 2 , g 4 ) = 0.95 , SIM(g 2 , g 6 ) = 0.95 , SIM(g 4 , g 6 ) = 1 . From the results of similarity, we can see that nodes g 4 and g 6 have the highest topological similarity. It can also be seen from Fig. 2 that nodes g 4 and g 6 have the same neighbors g 3 and g 5 ; however, nodes g 2 and g 4 have only one common

Fig. 2 Network Topology
neighbor g 3 , and nodes g 2 and g 6 also have only one common neighbor g 3 .

Construction of Edge-Weighted Networks
Given a PPI network G(V, E), where node set V = (u 1 , u 2 , u 3 , ..., u n ) represents the set of mutated genes corresponding to the PPI network, edge set E = e = (u i , u j ) includes these edges satisfying the condition that if there are edges between proteins in the protein-protein network corresponding to the mutated genes, then there are also edges between these mutated genes.
Construction of a weighted undirected graph G . For each vertex g i ∈ V , the coverage of vertex g i represents the weight of vertex g i , that is, (g i ) = CD(g i ) . Obviously, the more the mutated samples in gene g i , the greater the weight of the vertex g i .
Taking into account the chance of increasing the coexistence of a gene and its surrounding genes, where surrounding genes refer to the genes within one step of a core gene. This study defines the set of node g i and its direct neighbor nodes as the local area network Ne(g i ) as follows: To balance the mutual exclusivity between genes and the opportunities for the coexistence between a gene and its surrounding genes, this study utilizes the average value of ED(Ne(g i )) and ED(Ne(g j )) as the mutual exclusivity ED(g i , g j ) of gene pairs in the network, as shown below: To reduce the chance of a single gene with large coverage added to the edge weight, the product of the coverage of two genes is used to represent the coverage CD(g i , g j ) between gene pairs, as shown below: The study integrates the three characteristics of mutual exclusivity, coverage, and similarity among gene pairs to calculate the edge weight of the weighted undirected graph as follows: The principle of thermal diffusion is utilized to construct a weighted directed graph by performing a random walk with restart on G [17]. The random walk with restart means that the source node gene g i transfers to its neighboring nodes with a certain probability, and they utilize the restart probability to transfer to the source node again. This process is repeated until it reaches a stable state. The formula is expressed as follows: where F 0 is the initial state of the source node gene g i , and F 0 = CD(g i ).F t is the probability distribution at time t; represents the probability of returning from the current node to the initial node in the process of restarting the random walk. There are two options for restarting the random walk from the current node, returning to the initial node or going to the neighboring node. The probability of the two options are and 1-, respectively, where 0 ⩽ ⩽ 1 , which is used to control the heat of the source node diffusion to the rest nodes of the network. It is necessary to choose a suitable , where all source nodes retain most of the heat in their direct neighbor nodes [17]. According to [17,19], the value of in the study is set to be 0.4; E represents the transition probability matrix of the restart random walk process, which is positively correlated with the edge weight, as shown below: where ∑ k (g i , g j ) represents the sum of the edge weights between the source node g i and its direct neighbor nodes.
As the value of t increases, F t+1 gradually converges, and then the random walk restarts until it reaches a stable state [26]. The edge weight value F is calculated according to the following formula [17]: where I is the identity matrix. Restart the random walk to create a directed edge with weight F for each pair of gene pair g i and g j (i ≠ j) [17], and finally realize the construction of a weighted directed graph G d . The algorithm is described as follows.

Driver Modules Detection
In a directed graph, two vertices are strongly connected if there is at least one directed path between the two vertices. A graph is regarded as a strongly connected graph if any two vertices in the graph are strongly connected. The strongly connected component (SCC) refers to the maximal strongly connected subgraph in the directed graph. The strongly connected component (SCC) division method of the directed graph is utilized to generate the driver modules in this study [17]. Figure 3 is a directed non-strongly connected graph; however, {g 3 , g 4 , g 5 , g 6 } is the strongly connected component (SCC) of the directed graph. The detection process of driver modules is composed of the following three steps.
The first step is to create a set of initial candidate modules. The SCC is first employed as an initial set of candidate modules. The minimum weight edge in G d is iteratively deleted until strongly connected subgraphs are generated from G d , and then the strongly connected subgraph is added to the initial module set P. Finally, all the modules in P whose gene number is less than min_module_size are removed. The above process is carried out iteratively until the number of genes in P decreases to total_genes . We finally obtain the initial module set P = (M 1 , M 2 , ..., M r ).
The second step is to split the large and medium-sized modules into module set P [19]. For the weighted directed graph G d and a module M q , G d (M q ) denotes the gene set of derived subgraphs in the directed graph G d (the derived subgraph is different from the strongly connected subgraph), which corresponds to the genes in M q . L denotes the set of derived subgraphs, as shown below: Let split_size be the degree of the node with the largest value in the subgraph derived by module M q , the modules with more nodes than split_size will be split as large modules. In the splitting process, G c ∈ L is a subgraph derived from directed graph M q , v ′ is the node with the largest outdegree value in G c , and IN(v � ) represents the local area network of v ′ in G c . If the number of nodes in IN(v � ) is above min_module_size , then they will be classified as seed modules, otherwise they will be classified as leaf modules, where the leaf module is a small module with the number of nodes less than min_module_size . All the strongly connected subgraphs that meet the conditions in the directed graph G c are classified in the same way. The third step is to add leaf modules to the seed module. The leaf node g m connected to any node in the seed module is selected to extend the seed module by utilizing the greedy strategy, and the extension function is defined as follows: where G in (g m ) represents the average weight of the edge between node g m in the leaf module and the node in the seed module, G out (g m ) represents the average weight of the edge between node g m in the leaf module and the rest of the nodes in the leaf module. If G in (g m ) is higher G out (g m ) , then the node is added to the seed module.

Data Preprocessing
This study utilizes the somatic mutation data and the combined human PPI network data from HINT+HI2012 [17]. Somatic mutation data comes from the TCGA pan-cancer data set containing 12 cancer types, which are composed of 3281 samples with 20472 SNVs and 4334 samples with 720 CNAs. According to the data preprocessing method of [17], the samples with hyper-mutation and the genes that are low-expression in all tumor types are first screened out. And then, MutSigCV tool is applied to filtering the genes without obvious mutations in SNVs, after that 218 genes are deleted, 5 samples without SNV and 1973 samples with only CNA are deleted, and 7894 genes with <3 RNA-seq reads in> 30% of tumors of each cancer type are deleted. Finally, we obtain the data set containing 3110 samples and a total of 11565 somatic mutation genes. (2) HINT+HI2012 as the PPI network data includes high-quality interaction database (HINT) and human interaction database (HI2012). The data preprocessing is as follows, a merge operation is first performed based on the interaction relationship in the HINT and HI2012 database. Then, we delete closed loops and duplicate edges in the new PPI network. Finally, we obtain a protein-protein interaction network composed of 9858 proteins and 40704 interactions.

Parameter Setting
If the number of genes in a driver module is less than 3, the gene set is not usually considered as a driver module

Enrichment Analysis
To verify the enrichment effect of the modules identified by ECSWalk, we utilize functional annotation tools (DAVID) to analyze the enrichment of the driver modules detected by ECSWalk, HotNet2, and MEXCOwalk algorithm. The results in nine types of cancer are shown in Fig. 4. The value is set to be 100 to obtain the driver modules in the ECSWalk and MEXCOWalk, and HotNet2 obtains 14 consensus driver modules.
As shown in Fig. 4, the enrichment effect of the driver modules detected by ECSWalk is better than that detected by HotNet2 and MEXCOWalk algorithms among the nine types of cancer, especially in Glioblastoma Multiforme (GBM), Melanoma, Uterine Corpus Endometrial Carcinoma (UCEC), Non-Small-Cell Lung Cancer (NSCLC), Pancreatic Adenocarcinoma (PAAD) and Chronic Myelocytic Leukemia (CML). Therefore, the driver modules identified by ECSWalk show extremely high statistical significance and biological relevance.

Comparison of Module Accuracy
To evaluate the accuracy of the driver modules detected by ECSWalk, this study uses the Accuracy and F-measure evaluation indices to measure the accuracy of the driver modules based on the known pathways [27]. The higher the Accuracy value, the better the classification effect. The higher the F-measure value, the more driver modules can be enriched in the known biological pathways, indicating that the method is more accurate in mining driver modules. The calculation formulas of Accuracy and F-measure are as follows: The following formula is utilized to calculate the enrichment of the driver modules in one known biological pathway.
where N represents the total number of genes, K represents the number of genes in a known biological pathway, n represents the number of genes in a driver module, and k represents the number of overlapping genes between a known biological pathway overlaps and a driver module. The driver modules with p value <0.01 are set to be the positive type, and the driver modules with p value ⩾ 0.01 are set to be the negative type. The Benjamin-Hochberg method is used to correct all p values. Figure 5 shows the enrichment performance of the driver modules obtained by ECSWalk, MEXCOWalk and HotNet2 based on the known biological pathways obtained using the DAVID tool for enrichment analysis. It can be seen from Fig. 5a that the ACC value of the ECSWalk is 1 in the nine types of cancer, which shows that ECSWalk has extremely high accuracy in detecting driver modules. Specifically, the ACC value of the ECSWalk is 100%, 200% and 50% higher than that of the HotNet2, respectively, in the three cancers of GBM, BLCA and PAAD, and the ACC value of the ECSWalk is 50%, 200%, 50%, 33.3%, 100% and 100% higher than MEXCOWalk, respectively, in GBM, Bladder Urothelial Carcinoma (BLCA), UCEC, NSCLC, PAAD and Acute Myeloid Leukemia (LAML). In addition, of note, the ACC value of the ECSWalk is 1, but the ACC values of the MEXCOWalk and HotNet2 are 0 in Colon Adenocarcinoma (COAD). Therefore, ECSWalk has better performance than MEXCOWalk and HotNet2 in detecting driver modules in cancer.
As can be seen in Fig. 5b, the F-measure values of the ECSWalk algorithm are 1 in the nine types of cancer, which means that the accuracy and recall of the ECSWalk algorithm are both 1 in these nine cancers. This shows all the predicted driver modules are positive. It is interesting to note that the F-measure values of gene sets detected by MEX-COWalk and HotNet2 algorithms are both 0 in COAD. Therefore, ECSWalk has a better capability in detecting driver modules in nine types of cancers.

EGFR Module
The weighted diagram of the EGFR module detected by ECSWalk is shown in Fig. 6A. The EGFR module detected by ECSWalk, MEXCOWalk and HotNet2 is shown in Table 1. The genes of the EGFR module identified by each method together with the overlapping genes and non-overlapping genes are shown in Table S1 (SI Appendix, Table S1).
It can be seen from Table 1 that the p value of the EGFR module detected by ECSWalk is 1.09E-13, and the p values of the EGFR module detected by MEXCOWalk and Hot-Net2 are 8.1E-07 and 6.9E-06, respectively. Although the number of genes in the EGFR module detected by ECSWalk is less than that detected by MEXCOWalk, the EGFR module detected by ECSWalk has higher coverage than those of the other two algorithms. Therefore, the gene set detected by ECSWalk has higher biological relevance and statistical significance than those detected by the other two algorithms.
The EGFR module detected by ECSWalk mutates in 62.77% (1952/3110) samples, in which PIK3CA, EGFR, APC, ERBB2 and PIK3R1 have high mutation frequency, and mutate in 19.36%, 8.39%, 7.52%, 6.37% and 4.98% samples, respectively. This module is enriched in a variety of cancers, especially when the p value of this module in UCEC is 1.09E-13 according to the DAVID enrichment analysis. It can be also seen from Fig. 6A that PIK3CA and NRAS, PIK3R1 have large edge weights, and PIK3R1 and HRAS have a large edge weight, which indicates that the four mutated genes have strong interaction relationships. In the ErbB signaling pathway (Fig. 7a), EGFR and IGF1R promote the expression of PIK3CA and PIK3R1, and phosphorylation promotes the expression of HRAS and NRAS. At the same time, HRAS and NRAS promote the expression of PIK3CA and PIK3R1. Besides, CTNNB1 and APC have large weight values, which indicates that the two mutated genes have a strong interaction relationship. In the Wnt signaling pathway (Fig. 7b), CTNNB1 promotes the expression of itself and APC. This shows that the mutations of CTNNB1 and APC constitute two key genes in the Wnt/ -Catenin pathway, and thus can be used as predictors of cervical cancer susceptibility [28].

PPFIA1 Module
The weighted diagram of the PPFIA1 module detected by ECSWalk is shown in Fig. 6B. The PPFIA1 module detected by ECSWalk, MEXCOWalk and HotNet2 is shown in Table 2. The genes of the EGFR module identified by each method together with the overlapping genes and non-overlapping genes are shown in Table S2 (SI Appendix, Table S2) . HotNet2 does not mine any genes contained in the module, and the gene set detected by ECSWalk contains one more gene PPFIA1 than that detected by MEXCOwalk. This module has a coverage of 7.72% and mutual exclusivity of 96.77%. Although the PPFIA1 modules detected by the three algorithms do not show enrichment information in the DAVID enrichment analysis, studies have shown that there is an interaction between genes within the module [29,30]. The accuracy of the module can be verified in the previous research [30] in which PTPRF, a surface tyrosine phosphatase receptor, and its adapter, PPFIA1, govern fibronectin fibrillogenesis and vascular morphogenesis by driving active 5 1 integrin recycling. Besides, the proteins encoded by PTPRF and PTPRS are important members of the PTP family of protein tyrosine phosphatases. PTPS is a signal molecule that regulates cell growth cycle, differentiation process, mitosis, gene mutations and other cell life processes. The lack of PTPRS and PTPRF affects the proliferation of mandibular cells, and results in craniofacial deformities. In cells lacking PTPRS and PTPRF, the WNT and BMP signaling pathways are dysregulated [29]. Therefore, PPFIA1, PTPRS and PTPRF may be mutated in the same pathway and cause cancer. In summary, the PPFIA1 module detected by ECSWalk is of more biological relevance compared with the modules detected by the other two algorithms.

TP53 Module
The weighted diagram of the TP53 module detected by ECSWalk is shown in Fig. 6C. The TP53 module detected by ECSWalk, MEXCOWalk and HotNet2 is shown in Table 3. The genes of the EGFR module identified by each method together with the overlapping genes and non-overlapping genes are shown in Table S3 (SI Appendix, Table S3) . The TP53 module detected by ECSWalk has higher coverage than those detected by MEXCOWalk and HotNet2. The p value of the TP53 module detected by ECSWalk is 3.25E-20 according to the DAVID enrichment analysis, and the p values detected by MEXCOWalk and HotNet2 are 5.6E-08 and 3.84E-06, respectively. Therefore, the gene set detected by ECSWalk has higher biological relevance and statistical significance than those detected by the other two algorithms. The TP53 module detected by ECSWalk mutates in 70.09% (2180/3110) samples. This module has significant enrichment in CML cancer, and the p value of this module in CML cancer 1 3 is 1.1E-11 according to the DAVID enrichment analysis, which indicates that the module has high biological relevance. It can be seen from Fig. 6C that TP53, ATM, CHEK2, MDM2, MDM4, PTEN, CDKN2A, CDK4 and CDK6 have large edge weights, which indicates that the nine mutated genes have strong relationships. In the P53 signaling pathway (Fig. 7d), phosphorylation of ATM promotes the expression of CHEK2, and phosphorylation of CHEK2 promotes the expression of TP53; TP53 inhibits the expression of CDK4 and CDK6 by promoting the expression of CDKN1A, and TP53 promotes the expression of upstream MDM2 and the expression of downstream MDM2 and PTEN; MDM4 promotes the expression of MDM2, but inhibits the expression of TP53; CDKN2A inhibits the expression of MDM2, and MDM2 inhibits the expression of MDM4. This indicates that regulating the p53 signaling pathway can interfere with CML, thereby regulating the occurrence and development of CML [31].

BAP1 Module
The weighted diagram of the BAP1 module detected by ECSWalk is shown in Fig. 6D, and the mutual exclusivity of the BAP1 module is 96.95%. The BAP1 module is a PR-DUB protein complex composed of BAP1 and ASXL1, which activates its downstream tumor suppressor genes, such as SOCS1/2, VHL and TXNIP by combining with transcription factors [32]; In blood tumor cells, the BAP1 mutation makes the C-terminal truncated ASXL1 mutation protein completely lose its ability to bind to transcription factors, causes the ASXL1 mutation protein to significantly weaken the transcription and regulation functions of the BAP1-ASXL1-FOXK1/K2 complex through a dominant negative mutation effect, and reduces the expression of tumor suppressor genes, thereby regulating glucose metabolism, JAK-STAT, hypoxia perception and other tumor-related signaling pathways. This thus contributes to promoting the proliferation and self-renewal of leukemia cells, and further inhibiting cell apoptosis under hypoxia [32]. It can be seen that ASXL1 and BAP1 have high biological relevance, and may exist in the same driver pathway and work together to promote the occurrence of cancer.

NOTCH3 Module
The weighted diagram of the NOTCH3 module is shown in Fig. 6E. The NOTCH3 gene has high coverage and is altered in 129 samples. The mutual exclusivity of this module is 98.82%, and it has been reported that the NOTCH signaling pathway plays a vital role in the tumor microenvironment, and the ligand genes DLL1 and JAG1 of the NOTCH signaling pathway have mutations in a variety of cancers, such as DLL1, JAG1 and NOTCH3 existing high probability of mutations in patients with colon cancer [33]. Besides, DLL1 and JAG1 can control the rate of neural development in the NOTCH3 gene expression domain, and JAG1 can activate the NOTCH signal transduction in the V1 and DL6 domains. DLL1 can also send signals to nerve cells outside the NOTCH gene expression domain [34]. Therefore, DLL1, JAG1 and NOTCH3 genes may exist in the same driver pathway.

PAF1 Module
The weighted diagram of the PAF1 module detected by ECSWalk is shown in Fig. 6F. The PAF1 module detected by ECSWalk mutates in 4.34% (135/3110) samples, and the PAF1 gene mutates in 104 samples. The mutual exclusivity between genes in the module is 100%. This module has a significant enrichment relationship in CDC73/PAF1 complex, and the p value of this module in CDC73/PAF1 complex is 2.9E-9 according to the DAVID enrichment analysis, which indicates that the genes in this module have high biological relevance. Besides, CTR9, as the main component of the RTF1 complex, participates in the assembly of the PAF1 complex by the TRP domain, and CDC73/PAF1 is a polyprotein complex associated with general RNA polymerase II and RNA polymerase II transcription factor complexes [35], and it may be involved in the beginning and extension of transcription. Furthermore, the mutated genes PAF1 and CTR9 in the CDC73/PAF1 complex can cause a variety of diseases including malignant tumors. The PAF1 module also has a significant enrichment relationship in transcription elongation from RNA polymeraseII promoter pathways. The p value of this module is 2.7E-8 according to the DAVID enrichment analysis. It has been reported in [36] that RTF1, LEO1 and CTR9 are the main members of the PAF1/RNA polymerase II complex, PAF1, RTF1 and LEO1 have frequent interactions with PAF1, CDC73 and PolII, and the deletion of PAF1 or CTR9 will lead to similar severe pleiotropic phenotypes. Therefore, the genes in this module have high biological relevance and may exist in the same driver pathway to cause cancer.

MAP3K1 Module
The weighted diagram of the MAP3K1 module detected by ECSWalk is shown in Fig. 6G. The MAPK signaling pathway is significantly enriched in this module, and its p value is 1.3E-3 according to the DAVID enrichment analysis. The phosphorylation of MAP3K1 in this module activates MAP2K4, MAP2K1 and MAP2K2, and the phosphorylation of BRAF activates MAP2K1 and MAP2K2 (Fig. 7c). It has been reported in [37] that inhibiting the MAPK pathway can lead to lung cancer cell apoptosis. Therefore, the genes in this module have high biological relevance and may exist in the same driver pathway to cause cancer.

MCL1 Module
The weighted diagram of the MCL1 module detected by ECSWalk is shown in Fig. 6H. It has been reported in [38] that USP9X and MCL1 in this module have high biological relevance. For example, USP9X stabilizes the expression of MCL1 and promotes cell survival, and high expression of USP9X leads to an increase in the amount of MCL1 protein in human follicular lymphoma and diffuse large B-cell lymphoma, and vice versa. Therefore, USP9X is usually used as a target for clinical treatment by maintaining the stability of the amount of MCL1 and other proteins in human malignant tumors [38]. It can be seen that MCL1 and USP9X have high biological relevance and may exist in the same driver pathway.

MYC Module
The weighted diagram of the MYC module detected by ECSWalk is shown in Fig. 6I. The module has more significant enrichment in the nucleoplasm, and the p value of this module in nucleoplasm is 8.3E-5 according to the DAVID enrichment analysis, which indicates that this module has high biological relevance. It can be seen from Fig. 6I that MYC has the largest coverage in the module, and MYC is altered in 9.10% (283/3110) samples. It has been reported in [39] that MYC-encoded protein is a multifunctional nucleolar phosphate protein, which regulates target genes as a transcription factor during the cell life cycle and cell transformation process. Overexpression, mutation, translocation and rearrangement of MYC are closely related to the occurrence and development of a variety of cancers. It can be seen from Fig. 6I that FBXW7 and MYC have a large edge weight value, so these two genes may likely have biological relevance. It has been reported in [40] that the decrease of FBXW7 function is linked to a lower overall survival rate in muscle invasive bladder cancer and is thus linked to MYC accumulation.

Discussion
The HotNet2, MEXCOWalk, and ECSWalk models utilize the restart random walk algorithm to construct directed weighted biological networks. HotNet2 uses only gene coverage information to construct the biological networks, which ignores the mutual exclusivity among genes, so it cannot effectively solve the problem of gene mutation heterogeneity. MEXCOWalk improves HotNet2 by adding mutually exclusive information among mutated genes to construct biological networks. However, errors in the actual biological sequencing data processing will interfere the calculation of mutual exclusivity and coverage. To improve the accuracy of driver module detection effectively, ECSWalk adds the topological similarity of nodes in biological networks. The biological network constructed by ECSWalk not only contains the biological correlation between genes, but also reflects the topological correlation between genes. Driver genes with the same or similar biological attributes will be divided into the same module, which can more accurately reflect the true functional relevance. , a greedy strategy is utilized to expand the detected leaf modules, which can add effective genes with topological significance and screen out the detected unrelated genes in the seed module. The EGFR module just verify this fact. The EGFR module detected by ECSWalk includes fewer mutated genes compared with that by MEXCOWalk, but the EGFR module detected by ECSWalk has higher coverage and smaller p value. It shows that it is reasonable and necessary to use greedy strategy to improve the accuracy of driver module detection. ECSWalk identifies more new candidate gene sets for biological verification, and the findings indicate that the identified candidate gene sets have high biological relevance and statistical significance. The enrichment effect of the driver modules detected by ECSWalk is better than that detected by HotNet2 and MEXCOWalk among the nine types of cancer, especially in GBM, Melanoma, UCEC, NSCLC, and CML. The EGFR, PPFIA1 and TP53 modules detected by ECSWalk have higher coverage than those detected by the other two algorithms. In particular, the PPFIA1 module detected by ECSWalk contains more PTPRS genes compared with the other two algorithms, which makes the PPFIA1 module higher biological important and statistical significant. Besides, we found that the EGFR module has extremely high enrichment in UCEC (p value: 1.09E-13), and the TP53 module has extremely high enrichment in CML (p value: 1.1E-11). The results are of theoretical significance and practical value for cancer diagnosis, treatment and drug targets. There are also limitations in this study, that is, although the greedy strategy can accurately identify the driver modules, this also leads to a decrease in the operating efficiency of the algorithm to some extent.

Conclusion
This study proposes a carcinogenic driver module detection algorithm (ECSWalk) by integrating somatic mutation data and protein-protein interaction network. This study first calculates the similarity between connected nodes in the protein-protein interaction network. Then, a weighted network is created by calculating mutual exclusivity, coverage and topological structure similarity between mutation genes, and a restart random walk clustering method is utilized to detect the carcinogenic driver modules. Finally, an induced subgraph method is utilized to split the large modules, and a greedy strategy is utilized to expand the small modules to generate a set of driver modules. The experimental results show that the ECSWalk can detect more accurate driver modules than the other two algorithms. The driver modules detected by ECSWalk have a lower p value by the DAVID enrichment analysis, which indicates that the results of the algorithm have higher biological relevance and statistical significance. It also shows that ECSWalk can achieve good results by combining biological characteristics and complex network characteristics in detecting carcinogenic driver modules. Therefore, the application of complex network topology to biological networks is conducive to the study of the pathogenesis of cancer based on the inherent properties of the data itself, and it is also useful for researchers and medical practitioners when carrying out research from the aspect of complex network topology. Besides, ECSWalk can identify the target gene set in some common cancers accurately, and analyze the biological relevance and statistical significance of the driver modules, which thus helps to enrich our understanding of the pathogenesis of cancer.