Integrating Protein Interaction Networks and Somatic Mutation Data to Detect Driver Modules in Pan-cancer

Methods: In this study, to solve the issue of mutated gene heterogeneity and improve the accuracy of driver modules, we propose a new recognition method of driver modules, named ECSWalk, based on the human protein interaction networks and pan-cancer somatic mutation data. This study firstly utilizes high mutual exclusivity and high coverage between mutation genes and topological structure similarity of the nodes in complex networks to calculate interaction weights between genes. Secondly, the method of random walk with restart is utilized to construct a weighted directed network, and the strong connectivity principle of the directed graph is utilized to create the initial candidate modules with a certain number of genes. Finally, the large modules in the candidate modules are reasonably split using the way of the induced subgraph, and the small modules are expanded using a greedy strategy to obtain the optimal driver modules.

In recent years, with the continuous advancement of cancer research at the bio-2 molecular level, targeted therapy for cancer-causing genes has become a hot research 3 area in cancer [1]. For the driver genes that have functional mutations in cancer, Based on the characteristics of high mutual exclusivity and high coverage, Vandin 27 et al. [6] proposed the Dendrix algorithm to identify driver pathways from somatic 28 mutation data. This method introduces a penalty overlap and a reward coverage 29 mechanism to solve the maximum weight sub-matrix problem of driver pathway 30 identification based on gene mutation data. The Dendrix algorithm not only im- 31 proves coverage but also guarantees mutual exclusivity among the genes in a driver 32 pathway. However, this iterative search method is prone to produce local optimal 33 solutions, and it requires specifying the number of genes in a driver pathway in 34 advance. Leiserson et al.[7]proposed a Multi-Dendrix algorithm based on the Den- 35 drix algorithm, which detects multiple driver pathways at the same time. The al-36 gorithm utilizes linear regression to simultaneously detect multiple gene sets that 37 meet high coverage and high mutual exclusivity, and can generate a globally opti-38 mal solution. However, the Multi-Dendrix algorithm needs to pre-specify both the 39 maximum number of genes in a driver pathway and the number of driver pathways. 40 Therefore, the two algorithms do not have good universality and robustness, and 41 have certain limitations when applied to different data sets. 42 Pan-cancer data analysis provides new ideas and methods for the study of clinical 43 diagnosis and treatment across cancer types [13,14,15,16]. As one application, 44 Leiserson et al. [14] proposed the HotNet2 algorithm by integrating differential ex-45 pression genes, significant mutation genes and protein interaction networks to detect 46 combinations of rare somatic mutations. Using the principle of thermal diffusion and 47 the random walk with restart model, the edge weight that becomes stable after the 48 random walk is restarted as the weight of the directed edge. By removing the di-49 rected edge with a small weight, the method detects strongly connected components 50 as driver modules in the directed graph. Although the algorithm reduces the output 51 of false positive results and improves the accuracy of the prediction results, the con-52 version probability just considers the degree of the vertex during the random walk 53 process, but ignores the mutual exclusivity among genes. Therefore, the problem 54 of gene mutation heterogeneity cannot be effectively solved in the pan-cancer data 55 with large differences. 56 Based on the HotNet2 algorithm, Rafsan et al. [16]proposed a MEXCOWalk al-57 gorithm to mine driver modules by using split and expansion techniques. The algo-58 rithm utilizes mutual exclusivity between genes and coverage scores of gene set to 59 reflect the edge weight of the network. Then the random walk with restart strat-60 egy is used to construct a weighted directed network, and the split and expansion 61 techniques are utilized to mine driver modules. Although the set of driver genes 62 identified by this algorithm has high mutual exclusivity and high coverage, the al-63 gorithm ignores the relationships of mutual exclusivity between expanded leaf nodes 64 and seed modules in the expansion stage of small modules. Therefore, this algorithm 65 reduces the accuracy of driver module identification to a certain extent. 66 As mentioned above, although the previous methods can detect the gene sets with 67 high mutual exclusivity and high coverage, they just focus on the mutual exclusivity 68 and coverage between genes, instead of the topological structure of complex net-69 works. To effectively solve the problem of mutated gene heterogeneity and improve 70 the accuracy of driver modules, this paper proposes a driver module detection algo-71 rithm (ECSWalk) based on gene mutation and human protein interaction network.
The algorithm optimizes the definition in terms of the following three characteris-73 tics of high mutual exclusivity, high coverage among genes and high similarity of 74 topological structure. Firstly, the complex network topology analysis method is used 75 in the human protein interaction network data to calculate the topology similarity 76 between network nodes, and then it is combined with the two characteristics of high 77 coverage and high mutual exclusivity of the mutated genes to obtain the weight of 78 the vertices and edges in the human protein network. Among them, the weights of 79 vertices in the human protein network are obtained according to the coverage of 80 mutated gene; the random walk with restart strategy is utilized to calculate the 81 weights of edges in the network by the three characteristics, namely the coverage, 82 the mutual exclusivity, and the similarity of the topological structure between the 83 nodes. Secondly, based on the weighted network constructed in the previous step, 84 the large modules are split into several candidate gene sets using the method of the 85 induced subgraph. And the greedy strategy is utilized to add the nodes in the leaf 86 module to the seed module to achieve the optimal gene sets. These mutated gene 87 sets with high mutual exclusivity, high coverage and high similarity of the topo-88 logical structure are likely to work as driver modules in cancer [17,18]. This study 89 not only applies the analysis method of complex network topology to the biological 90 network, but also improves the method of determining module size based on split 91 and expansion. The study clarifies the interactions between genes in the detected 92 driver modules, which promote the study of cancer pathogenesis and drug targets.  Let G(V, E) denote the protein interaction network (PPI), each vertex u i ∈ V 99 corresponds to a protein in PPI network, and each protein u i corresponds to a 100 mutated gene g i . The undirected edge (u i , u j ) ∈ E in PPI network corresponds to 101 the interaction between gene pair (g i , g j ). Therefore, the node g i represents both 102 a gene and the corresponding protein in G. The sample with gene g i mutated is 103 represented by S i , and M ⊆ V is a subset of genes. For any pair of genes g i , g j ∈ M , 104 g i = g j , if S i ∩ S j = ∅ , the genes in M are mutually exclusive.

105
The mutual exclusivity of gene subset M is represented as: if ED(M ) = 1, then the genes in the subset M are mutually exclusive, that is, at 106 most one gene within the subset M is mutated in each sample.

107
The coverage of gene subset M is represented as: If CD(M ) = 1, then the gene subset M completely covers all patients, that is, at 108 least one gene within the subset M is mutated in each sample. [19, 20, 21]. The similarity is mainly defined by the discrete probability set, and the 118 index construction process mainly includes two steps, the first step is to construct 119 the probability set [22], and the second step is to define the node similarity according 120 to the JS divergence.

121
Construction of the probability set. Let d i be the degree of the ith gene node, and d max be the maximum node degree in the local area network. Suppose there are N nodes in the probability set of one node, N = d max + 1. The sum of node degrees D gi of gene node g i in its local area network is expressed as follows: Where N is the number of genes in the local area network, and d(j) is the degree 122 of the jth gene in the local area network of gene g i .

123
In the local area network, the discrete probability of gene g i is expressed as follows: Standardize the discrete probabilities of gene g i , and sort the discrete probabilities 124 of genes in the local area network of gene g i from large to small, and obtain the set 125 of discrete probabilities as P (i). 126 Where p i (n) represents the discrete probability value of the nth gene in the local 127 area network with N gene nodes (n N ). Therefore, in the discrete probability set 128 P (i), the N elements in set P (i) are the discrete probability values of each node in 129 the local area network.

130
Suppose that two adjacent genes g i and g j in the constructed gene network correspond to two different probability sets P (i) and P (j) with the same number of genes in the local area network. According to the set of discrete probability constructed in formula (5), the Kullback-Leibler (KL) divergence value between genes g i and g j is expressed as: Due to the asymmetry of KL divergence, different results can be obtained by exchanging the positions of P (i) and P (j) . To solve this asymmetry problem, the same result can be obtained by exchanging the positions of P (i) and P (j) , The JS divergence value is obtained based on the KL divergence value, which is expressed as: Therefore, we use an analysis of gene pair similarity based on the network topology, which is defined as follows: Obviously, the larger the SIM (g i , g j ) value, the more similar the two gene nodes represents the set of protein interaction relationships.

138
Construction of a weighted undirected graph G ω . For each vertex g i ∈ V , the 140 Obviously, the more the mutated samples in gene g i , the greater the weight of the 141 vertex g i .

142
Taking into account the chance of increasing the coexistence of a gene and its surrounding genes, this study defines the set of node g i and its direct neighbor nodes as the local area network N e(g i ) as follows: To balance the mutual exclusivity between genes and the opportunities for the coexistence between a gene and its surrounding genes, this study utilizes the average value of ED(N e(g i )) and ED(N e(g j )) as the mutual exclusivity ED(g i , g j ) of gene pairs in the network, as shown below.
ED(g i , g j ) = ED(N e(g i )) + ED(N e(g j )) 2 (10) To reduce the chance of a single gene with large coverage added to the edge weight, the product of the coverage of two genes is used to represent the coverage CD(g i , g j ) between gene pairs, as shown below.
The study integrates the three characteristics of mutual exclusivity, coverage, and similarity among gene pairs to calculate the edge weight of the weighted undirected graph as follows: The principle of thermal diffusion is utilized to construct a weighted directed graph by performing a random walk with restart on G ω [14]. The random walk with restart means that the source node gene g i transfers to its neighboring nodes with a certain probability, and they utilize the restart probability to transfer to the source node again. This process is repeated until it reaches a stable state. It is expressed as follows: Where F 0 is the initial state of the source node gene g i , and F 0 = CD(g i ).F t is the probability distribution at time t; β is the restart probability of its neighbor nodes transferring to the source node, and 0 β 1, which is used to control the heat of the source node diffusion to the rest nodes of the network. It is necessary to choose a suitable β that all source nodes retain most of the heat in their direct neighbor nodes [14]. According to [14,16], the value of β in the study is set to be 0.4; E represents the transition probability matrix of the restart random walk process, which is positively correlated with the edge weight, as shown below: Where k ω(g i , g j ) represents the sum of the edge weights between the source node 143 g i and its direct neighbor nodes.

144
As the value of t increases, F t+1 gradually converges, and then the random walk restarts until it reaches a stable state [23]. The edge weight value F is calculated according to the following formula [14]: Where I is the identity matrix. Restart the random walk to create a directed edge 145 with weight F for each pair of gene pair g i and g j (i = j) [14], and finally realize 146 the construction of a weighted directed graph G d . The algorithm is described as 147 follows.

Algorithm 1 Construction of the weighted directed network
if ED(g i , g j ) = 0 and CD(g i , g j ) = 0 and SIM (g i , g j ) = 0 then 5: In this study, the strongly connected component (SCC) division method of the 150 directed graph is utilized to generate the driver modules [14]. The process is divided 151 into three steps.

152
The first step is to create a set of initial candidate modules. The SCC is firstly 153 employed as an initial set of candidate modules. The minimum weight edge in G d is 154 iteratively deleted until strongly connected subgraphs are generated from G d , and 155 then the strongly connected subgraph is added to the initial module set P . Finally, 156 all the modules in P whose gene number is less than min module size are removed.

157
The above process is carried out iteratively until the number of genes in P decreases 158 to total genes. We finally obtain the initial module set P = (M 1 , M 2 , ..., M r ).

159
The second step is to split the large and medium-sized modules into module set Let split size be the degree of the node with the largest value in the subgraph 160 derived by module M q , Modules with more nodes than split size will be split as not less than min module size, then they will be divided into seed modules, or else 165 they will be divided into leaf modules. All the strongly connected subgraphs that 166 meet the conditions in the directed graph G c are split in the same way.

167
The third step is to add leaf modules to the seed module. The leaf node g m connected to any node in the seed module is selected to extend the seed module by utilizing the greedy strategy, and the extension function is defined as follows: Where G in (g m ) represents the average weight of the edge between node g m in the 168 leaf module and the node in the seed module, G out (g m ) represents the average 169 weight of the edge between node g m in the leaf module and the rest of the nodes 170 in the leaf module. If G in (g m ) is not less than G out (g m ), then the node is added to 171 the seed module.
for M q ∈ P and |M q | > split size do 7: for M j ∈ SCC(G c ) do Accuracy and F-measure are as follows: Where T P indicates the number of modules in which positive classes are predicted The following formula is utilized to calculate the enrichment of the driver modules in one known biological pathway.
Where N represents the total number of genes, K represents the number of genes 220 in a known biological pathway, n represents the number of genes in a driver mod- shows that ECSWalk has extremely high accuracy in detecting driver modules.

231
Specifically, the ACC value of the ECSWalk is 100%, 200% and 50% higher than performance than MEXCOWalk and HotNet2 in detecting driver modules in cancer.

238
As can be seen in Figure 2B,  Table 1.

250
It can be seen from  The weighted diagram of the TP53 module detected by ECSWalk is shown in Figure   296 3C. The TP53 module detected by ECSWalk, MEXCOWalk and HotNet2 is shown 297 in indicates that the module has high biological relevance. It can be seen from Figure  The weighted diagram of the NOTCH3 module is shown in Figure 3E.

363
It has been reported in [33] Figure 3H. It has been reported in [35] that USP9X and MCL1 in this module have in human malignant tumors [35]. It can be seen that MCL1 and USP9X have high 386 biological relevance and may exist in the same driver pathway.

MYC module 388
The weighted diagram of the MYC module detected by ECSWalk is shown in Figure   389 3I. The module has more significant enrichment in the nucleoplasm, and the p-value 390 of this module in nucleoplasm is 8.3E-5 according to GO enrichment analysis, which 391 indicates that this module has high biological relevance. It can be seen from Figure   392 3I that MYC has the largest coverage in the module, and MYC is altered in 9 HotNet2: An algorithm for finding significantly altered subnetworks in a large gene interaction network.