Hierarchical hidden community detection for protein complex prediction

Background: Discovering functional modules in protein-protein interaction networks through optimization remains a longstanding challenge in biology. Traditional algorithms consider only the strong protein complexes that can be found in the original network by optimizing some metric, which hinders the discovery of weak and hidden complexes that are overshadowed by strong ones. Additionally, protein complexes differ not only in density but also in scale, which makes them extremely difficult to detect. We address these issues and propose a hierarchical hidden community detection approach to accurately predict protein complexes of various strengths and scales. Results: We propose a meta-method called HirHide (Hierarchical Hidden Community Detection). It is the first combination of hierarchical structure with hidden structure, which provides a new perspective for finding protein complexes of various strengths and scales. We compare the performance of several standard community detection methods with their HirHide versions. Experimental results show that the HirHide versions achieve better performance and sometimes significantly outperform the baselines. Conclusions: HirHide can adopt any standard community detection method as the base algorithm and enable it to discover hidden hierarchical communities as well as boost the detection of strong hierarchical communities. Some biological networks are too complex for standard community detection algorithms to perform well. Most of the time, a better choice is to select an algorithm based on the characteristics of the specific biological network. Under these circumstances, HirHide has clear advantages because of its flexibility. At the same time, given the natural hierarchy of cells, organelles, intracellular compounds, etc., combining hierarchical structure with hidden structure is in line with the characteristics of the data itself, thus helping researchers to study biological interactions more deeply.


Background
A protein complex is a group of proteins that interact with each other for specific biological activities [1]. The identification of protein complexes is crucial for predicting protein functions [2-5], disease genes [6,7], phenotypic effects of genetic mutations [8], and drug-disease associations [9]. Given a protein-protein interaction (PPI) network, where nodes represent proteins and edges represent interactions, protein complexes can be searched for by detecting densely connected subgraphs in the network. Mathematically, such subgraphs are called communities, in which nodes are joined together in tightly-knit groups. Communities that are shielded by some dominant communities are sometimes of high value. For example, real-world protein complexes are not always dense, and sometimes they can be very sparse [21]. As proteins typically take part in several interactions, there are many overlaps among protein complexes, and thus weak protein complexes are usually hidden behind the stronger ones.
Additionally, some protein interactions may still be undiscovered [22], so the corresponding proteins are not connected in current PPI networks. As a consequence, sparse protein complexes may be sparse only in the existing, incomplete PPI networks, and they would be more likely to be dominant in real-world PPI networks if all protein interactions had been discovered. That is, sparse protein complexes may also be potentially strong protein complexes. However, these sparse complexes are overlooked by previous techniques.
Though standard community detection methods can find a portion of the sparse protein complexes by detecting weak communities, they cannot deal with the case where most nodes of the weak communities also belong to other, stronger communities. In such a case, these weak communities are defined as hidden communities [23,24]. In PPI networks, we call these communities hidden protein complexes. For instance, in Fig. 1 (a), we build a network with hidden community structure. Because standard algorithms focus on discovering dominant communities, the weaker community with green nodes is generally overlooked. Even if it is detected, its structure is taken to be that of the green nodes in Fig. 1 (a), which contains four smaller blocks with dense intra-connections. However, its actual structure is that of the green nodes in Fig. 1 (b). The community in (b) becomes detectable because the stronger communities have been weakened. As edges belonging to stronger communities are removed, this hidden community no longer contains the four smaller blocks with dense intra-connections shown in (a). In their graph representation, many protein complexes correspond to sparser communities that are hidden communities; we call them hidden complexes. They are partially or completely covered by stronger protein complexes, and finding such hidden complexes is very difficult. Traditional algorithms simply treat hidden protein complexes as part of the stronger protein complexes, which makes the hidden ones hard to discover and partially explains why standard methods do not work well.
To address this problem, we design HirHide, which is inspired by a meta-approach called HiCode (Hidden Community Detection) [23,25], a novel approach that was the first to address hidden community structure. However, HiCode does not consider hierarchical community structure, and hence it cannot handle complicated networks where communities are organized hierarchically, as is the case in many real-world networks. In large-scale PPI networks, many protein complexes are organized hierarchically, indicating that protein complexes may consist of sub-complexes extending several hierarchical levels deep. An example of such a deeply embedded complex is the SAGA complex (MIPS identifier 510.190.10.20.10), a multi-functional coactivator that regulates transcription by RNA polymerase II [26].
The main contributions of this work include:
• We propose HirHide, which combines the detection of hierarchical structure with that of hidden structure and can detect communities of various strengths (related to density) as well as communities of various scales (related to size).
• HirHide is designed as a general method that can be combined with standard community detection methods. It enables them to discover hidden hierarchical communities and boosts the detection of dominant hierarchical communities.
• We also propose the concept of hidden complexes and try to explain the biological significance of hidden complexes based on gene ontology.
For easy distinction, we use level to denote the stratification of hierarchical structure and layer to denote the stratification of hidden structure throughout the experiments. Multi-level hierarchical structure together with multi-layer hidden structure gives HirHide its multi-granular character (see Fig. 1 (c)). In the experiments, to avoid over-complication, we set two layers for the hidden structure (one dominant, the other hidden) and two levels for the hierarchical structure, for a total of four granular divisions. Layer1-level1 represents the first level of layer 1, which corresponds to the root, strong communities; Layer1-level2 indicates the second level of layer 1, which is a more detailed division of the root, strong communities; Layer2-level1 indicates the root, hidden communities in layer 2; Layer2-level2 denotes a more detailed division of the root, hidden communities. In the following, the four community divisions are abbreviated as $L_{11}$, $L_{12}$, $L_{21}$, and $L_{22}$, respectively.

Hiddenness Value
A PPI network can be represented as a graph $G = (V, E)$, where $V$ is the node set and $E$ is the edge set. Suppose the network is divided into $K$ communities, denoted by $C = \{C_1, ..., C_k, ..., C_K\}$.
Let $S_k$ be the set of communities stronger than a community $C_k$:

$$S_k = \{ C_i \mid F_i > F_k \} \qquad (1)$$

where $F_k$ represents the strength of community $C_k$, which can be calculated by the modularity metric [11]. The larger the value is, the stronger the corresponding community structure is. The modularity $Q$ is defined as:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta(Com_i, Com_j) \qquad (2)$$

where $A_{ij}$ is 1 if nodes $i$ and $j$ are connected by an edge and 0 otherwise, $m$ indicates the number of edges in the graph, $d_i$ and $d_j$ indicate the degrees of nodes $i$ and $j$, and $Com_i$ and $Com_j$ indicate the communities to which nodes $i$ and $j$ belong. $\delta(Com_i, Com_j)$ indicates whether nodes $i$ and $j$ are in the same community: if so, $\delta(Com_i, Com_j) = 1$, and otherwise 0. He et al. propose a formula to calculate the hiddenness value of a community [23]:

$$H(C_k) = \frac{\left| C_k \cap \bigcup_{C_i \in S_k} C_i \right|}{|C_k|} \qquad (3)$$

This definition calculates the fraction of nodes of $C_k$ belonging to other, stronger communities.
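As a worked illustration of the modularity definition above, the following Python sketch evaluates $Q$ for a given node-to-community assignment on a simple, unweighted, undirected graph. The function name `modularity`, the edge-list input format, and the toy graph are assumptions made for this example; they are not taken from the HirHide implementation.

```python
def modularity(edges, membership):
    """Modularity Q of a partition of a simple undirected graph.

    edges:      list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Summing A_ij * delta over ordered pairs counts each intra-community
    # edge twice, so (1 / 2m) * sum equals intra_edges / m.
    intra_edges = sum(1 for u, v in edges if membership[u] == membership[v])

    # Summing d_i * d_j * delta over ordered pairs gives, per community,
    # the square of its total degree.
    total_degree = {}
    for node, d in degree.items():
        label = membership[node]
        total_degree[label] = total_degree.get(label, 0) + d

    expected = sum(t * t for t in total_degree.values()) / (4 * m * m)
    return intra_edges / m - expected


# Toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(toy_edges, toy_membership)
print(round(q, 4))  # 0.3571
```

Grouping the double sum by community avoids iterating over all node pairs, which keeps the sketch linear in the number of edges.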
The larger the hiddenness value is, the higher the probability that the community is hidden, and the less likely it is that the community can be detected. Note that there is no single specific threshold between a hidden community and a dominant community, because the hiddenness value indicates the fraction of nodes of $C_k$ belonging to other, stronger communities. For two communities $C_i$ and $C_j$, if $H(C_i) > H(C_j)$, then community $C_i$ is comparatively hidden compared to $C_j$, and community $C_j$ is comparatively dominant compared to $C_i$.
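The hiddenness value described above can be sketched in a few lines of Python. The function name `hiddenness`, the representation of communities as sets, and the strength values in the example are hypothetical; in particular, the strengths are passed in directly rather than computed from modularity.

```python
def hiddenness(communities, strengths):
    """Return the hiddenness value of each community.

    communities: list of sets of node ids.
    strengths:   list of strength values F_k, in the same order.
    """
    values = []
    for k, nodes in enumerate(communities):
        # Union of all communities strictly stronger than community k.
        stronger = set()
        for i, other in enumerate(communities):
            if i != k and strengths[i] > strengths[k]:
                stronger |= other
        # Fraction of the community's nodes covered by stronger communities.
        values.append(len(nodes & stronger) / len(nodes) if nodes else 0.0)
    return values


communities = [{1, 2, 3, 4}, {5, 6, 7, 8}, {3, 4, 5, 9}]
strengths = [0.40, 0.35, 0.10]  # illustrative values only
print(hiddenness(communities, strengths))  # [0.0, 0.0, 0.75]
```

In this toy input the third community is the weakest and shares three of its four nodes with stronger communities, so it is comparatively hidden relative to the other two.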

Hidden communities versus hierarchical communities
Hierarchical community structure and hidden community structure are two aspects of community structure. An algorithm combining these two concepts may be confusing. For example, a subgraph with a hidden community and one or multiple stronger communities can be wrongly detected as a hierarchical structure. Fig. 1 (a) and (b) show this case. So we first discuss the difference between hidden structure and hierarchical structure to clarify the concepts. When considering the hierarchical community structure, an algorithm gradually detects stronger and smaller communities and ignores weaker communities. When considering the hidden community structure, an algorithm detects communities weaker than the stronger communities that cover them. Intuitively, when the density of a community is calculated by modularity, if the average density of root communities is 0.5, then the average density of the sub-hierarchical level is larger than 0.5, and the average density of the next hidden layer is less than 0.5. Note that a hidden community can be completely covered by one stronger community or partially covered by one or multiple stronger communities, and it does not need to be smaller than the communities that cover it. Due to their different characteristics, each layer of hidden structure can contain several levels of hierarchical structure, and vice versa (Fig. 1 (c)). HirHide combines the advantages of both types of algorithms and aims at a more complete detection of protein complexes.

HirHide
We propose a hierarchical hidden detection approach (HirHide) for community mining tasks. Our algorithm consists of three steps. In the first step, initialization, HirHide identifies a layer of communities, Layer 1, via the base algorithm (Fig. 1 (c)), which can be any standard community detection algorithm with promising performance.
The second step is called the hierarchical detection step. HirHide constructs a hierarchical structure by recursively capturing sub-communities and iterating until an appropriate number of levels is found. A crucial aspect of this step is determining the number of levels in a network. We simply count the number of nodes in each community. When the average community size is smaller than a certain threshold, the algorithm stops capturing sub-communities. In our experiments, the default threshold is set to 9.
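The stopping rule of the hierarchical detection step can be sketched as follows. This is a minimal illustration under stated assumptions: `detect_levels` and `halve` are hypothetical names, the level whose average size falls below the threshold is kept as the last level, and `halve` is only a stand-in for a real base algorithm such as MOD or Infomap.

```python
def detect_levels(nodes, edges, base_algorithm, min_avg_size=9):
    """Return a list of levels; each level is a list of node sets.

    base_algorithm(nodes, edges) must return a non-empty list of node sets.
    """
    levels = [base_algorithm(set(nodes), edges)]
    while True:
        current = levels[-1]
        avg_size = sum(len(c) for c in current) / len(current)
        if avg_size < min_avg_size:
            break  # communities are small enough: stop refining
        finer = []
        for community in current:
            # Restrict the graph to the community before re-running the base algorithm.
            inner = [(u, v) for u, v in edges if u in community and v in community]
            finer.extend(base_algorithm(community, inner))
        if len(finer) == len(current):
            break  # nothing was split, so further recursion cannot help
        levels.append(finer)
    return levels


def halve(nodes, edges):
    """Stand-in base algorithm: split the sorted node list into two halves."""
    ordered = sorted(nodes)
    if len(ordered) < 2:
        return [set(ordered)]
    mid = len(ordered) // 2
    return [set(ordered[:mid]), set(ordered[mid:])]


levels = detect_levels(range(40), [], halve)
print([len(level) for level in levels])  # [2, 4, 8]
```

With 40 nodes and the default threshold of 9, the average community size goes from 20 to 10 to 5, so the recursion stops after the third level.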
The third step is called the hidden detection step. HirHide weakens the structure of the previously detected layer, Layer 1, to get a reduced graph $G'$. In $G'$, the base algorithm is used to identify new communities that form a hidden layer, Layer 2. Because $G'$ does not contain the strong communities of Layer 1, the weaker communities can be discovered more easily. The hierarchical detection step can be applied to Layer 2 in the reduced graph $G'$ to find hierarchical structures for this layer. Then HirHide weakens Layer 1 and Layer 2 and calls the base algorithm again to detect Layer 3. This process iterates until no communities can be detected.
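The layer-by-layer loop described above can be illustrated with a small, self-contained Python sketch. Several parts are assumptions rather than the authors' implementation: weakening is modeled as simply removing every edge inside an already detected community, `toy_base_algorithm` is a hypothetical stand-in detector (connected components over the edges whose endpoints share the most common neighbours), and the hierarchical step inside each layer is omitted.

```python
from itertools import combinations


def toy_base_algorithm(edges):
    """Stand-in detector: components over the edges with the most shared neighbours."""
    if not edges:
        return []
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    shared = {(u, v): len(neighbours[u] & neighbours[v]) for u, v in edges}
    best = max(shared.values())
    adjacency = {}
    for (u, v), count in shared.items():
        if count == best:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    seen, found = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first search over the strong edges
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        found.append(component)
    return found


def weaken(edges, communities):
    """Assumed weakening rule: drop every edge lying inside a detected community."""
    return [(u, v) for u, v in edges
            if not any(u in c and v in c for c in communities)]


def detect_layers(edges, base_algorithm, max_layers=10):
    layers, reduced = [], list(edges)
    while len(layers) < max_layers:
        found = base_algorithm(reduced)
        if not found:
            break  # no communities left to detect
        layers.append(found)
        detected = [c for layer in layers for c in layer]
        reduced = weaken(edges, detected)  # weaken all layers found so far
    return layers


# Dominant layer: three 4-cliques (rows). Hidden layer: four triangles (columns).
rows = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}]
columns = [{0, 4, 8}, {1, 5, 9}, {2, 6, 10}, {3, 7, 11}]
edges = [pair for group in rows + columns for pair in combinations(sorted(group), 2)]
layers = detect_layers(edges, toy_base_algorithm)
print(len(layers))  # 2
```

On the original graph the stand-in detector recovers only the denser row communities; the column communities appear only after the row edges have been removed, which mirrors the role of the reduced graph in the hidden detection step.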
In HirHide, the step of digging out hidden structure is similar in spirit to HiCode [23]. But the step of combining hidden structure and hierarchical structure is new and is designed so that both hierarchical communities and hidden communities are well captured. A key issue is which level of hierarchical communities should be weakened in the detected layer. When these hierarchical levels are organized into trees with each community in the first level as a root, we can weaken the communities in the roots or the communities in the leaves (Fig. 1 (c)). Because the edge connections and the total number of nodes in each hierarchical level are similar (considering that isolated nodes and very small communities are removed) and the weakening step is a global operation, which level is weakened does not make a big difference. To double-check, we analyze the performance of the HirHide framework in the confirmatory experiments when it weakens the communities in the roots and when it weakens the communities in the leaves.
In the hierarchical step, the sub-communities are captured on the original graph. They can also be captured on the reduced graph $G'$ after weakening the structure of strong communities. Intuitively, the reduced graph has weakened the influence of other layers, so the detected sub-communities should be more precise. In our confirmatory experiments, we compare recursively capturing sub-communities on the original graph with capturing them on the weakened graph.
Although the base algorithm is recursively called in both the hierarchical and hidden steps, its role is significantly different. In the hierarchical step, the base algorithm is repeatedly applied in sub-communities to discover smaller sub-communities. In the hidden step, the base algorithm is repeatedly used in the reduced graphs to dig out hidden communities.

Results
Algorithms, data, and metrics

Algorithms
We select three state-of-the-art algorithms as the baseline methods as well as the base algorithms of HirHide, namely MOD [12], Infomap [14], and LC [15]. None of them can detect hidden communities without HirHide. After they are combined with HirHide, their HirHide versions are called HirHide-MOD, HirHide-Infomap, and HirHide-LC. We also compare HirHide with the HiCode [23] versions of the three methods, called HiCode-MOD, HiCode-Infomap, and HiCode-LC. One of our metrics comes from ClusterONE [26], but the authors did not provide the complete source code. As a consequence, we use ClusterONE as a baseline method without a corresponding HirHide version. Overall, the algorithms involved in the comparisons are HirHide-MOD, HirHide-Infomap, and HirHide-LC, together with HiCode-MOD, HiCode-Infomap, HiCode-LC, MOD, Infomap, LC, and ClusterONE.
Apart from HirHide and HiCode, the above comparison involves four other algorithms. MOD [12] treats each node as a separate community at initialization and gradually optimizes the modularity value by expanding the size of each community. After repeated iterations, MOD obtains the community division with the largest modularity value it can find. Infomap [14] is based on information theory and defines communities from the perspective of coding. To get the maximum compression ratio, Infomap uses Huffman coding and a two-level coding of the community structure. In this way, the problem of community detection is transformed into an optimization problem: finding a community division such that the code length needed to describe random walks within and between communities is the smallest. LC [15] reinvents communities as groups of links rather than nodes. This approach reconciles the antagonistic organizing principles of overlapping communities and hierarchical structure: link communities naturally incorporate overlap while revealing hierarchical organization. ClusterONE [26] introduces the concept of a cohesiveness score and uses a greedy growth process to find groups that are likely to correspond to protein complexes in a PPI network.

Data
We compare these algorithms on three large-scale yeast PPI networks: a core experimental yeast PPI network [27], a combined computational interaction network [28], and the entire set of physical protein-protein interactions in yeast from BioGRID [29]. These three datasets are referred to as Krogan core, YeastNet, and BioGRID.
To evaluate the performance of each algorithm, we use the Munich Information Center for Protein Sequences (MIPS) [30] and CYC [31] as the reference sets. The yeast protein complexes cataloged by the MIPS database have been widely used to generate protein-protein interaction reference sets. CYC is a comprehensive catalog of 408 manually curated heteromeric protein complexes in S. cerevisiae. To make the evaluation convincing, we use the latest versions of the two reference sets. Additionally, we only consider complexes containing 3 to 100 proteins as reference protein complexes to avoid selection bias. Table 1 shows the basic information of the three PPI networks. Because different PPI networks contain different nodes and edges and the two reference sets have different reference complexes, we have removed proteins that exist only in the PPI network or only in the reference set.

Evaluation Metrics
It has become standard practice to compare the performance of different methods by assessing their ability to identify the reference communities. When evaluating the performance of community detection algorithms, the recognized metric is the F1 score [32], which is the harmonic mean of precision and recall. However, the reference communities, which only contain protein complexes whose interactions can be discovered under the current experimental conditions, are incomplete by nature [33]. Using the F1 score as an evaluation metric is unreasonable when the reference sets are incomplete, because under the same conditions, the more communities an algorithm detects, the smaller its precision is. Traditional algorithms are typically designed to detect fewer complexes than the reference complexes (and far fewer than the real-world complexes) to increase precision. Consequently, comparing the F1 score is unfair to algorithms that detect more complexes.
As the current reference sets are incomplete [33], we evaluate the algorithms' performance by two other measures. One is the maximum matching ratio (MMR), which is designed specifically for protein complex detection [26]. MMR guarantees that each detected community matches only one reference community and vice versa, and it maximizes the total score of all one-to-one connections between predicted and reference complexes. This measure is inspired by the bipartite graph maximum matching problem, in which the two sets of nodes represent detected complexes and reference complexes, respectively. MMR tries to find the best match for each reference complex, so even if an algorithm detects more complexes, MMR is not reduced. The other is the recall, which only measures the capacity for discovering the reference complexes, so additional hidden hierarchical communities do not decrease the score. The recall scores each pair composed of a predicted complex and a reference complex by their similarity. Given a set of detected communities $D$ and a set of reference communities $G$, each reference community $G_j$ has its recall:

$$R(G_j) = \max_{D_i \in D} \mathrm{sim}(D_i, G_j)$$

where $\mathrm{sim}(D_i, G_j)$ is the similarity between the detected community $D_i$ and the reference community $G_j$. The final recall is defined as the average of $R(G_j)$ over all reference communities.
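Both measures can be sketched in Python for small inputs. The sketch rests on assumptions that the text above does not fix: the recall uses the Jaccard index as the similarity, the MMR edge weight is the overlap score $|A \cap B|^2 / (|A| |B|)$ with the total matched weight divided by the number of reference complexes, and the maximum-weight matching is found by an exhaustive search that is only practical for a handful of communities. All function names are hypothetical.

```python
from functools import lru_cache


def jaccard(a, b):
    """Assumed similarity for the recall: Jaccard index of two node sets."""
    return len(a & b) / len(a | b)


def overlap(a, b):
    """Assumed MMR edge weight between a detected and a reference complex."""
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))


def recall(detected, reference):
    """Average, over reference complexes, of the best similarity to any detected one."""
    return sum(max(jaccard(d, g) for d in detected) for g in reference) / len(reference)


def mmr(detected, reference):
    """Maximum matching ratio via exhaustive one-to-one matching (small inputs only)."""
    @lru_cache(maxsize=None)
    def best(j, used):
        if j == len(reference):
            return 0.0
        score = best(j + 1, used)  # option: leave reference complex j unmatched
        for i, d in enumerate(detected):
            if not used & (1 << i):  # detected complex i is still free
                score = max(score, overlap(d, reference[j]) + best(j + 1, used | (1 << i)))
        return score

    return best(0, 0) / len(reference)


reference = [{1, 2, 3}, {4, 5, 6}]
detected = [{1, 2, 3}, {4, 5}, {7, 8, 9}]
print(round(recall(detected, reference), 4))
print(round(mmr(detected, reference), 4))
```

In this toy input the extra detected community `{7, 8, 9}` matches nothing in the reference set, and neither score changes when it is removed, which is the property that motivates using these two measures with incomplete reference sets.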

Experimental results on real-world data
We compare the performance of standard algorithms with their HirHide versions and their HiCode versions on real-world networks. Fig. 2 shows the comparative performance of the ten algorithms using MIPS and CYC separately as the reference sets. The different bar clusters correspond to the different datasets (BioGRID, Krogan core, and YeastNet) used as the PPI networks. The bars on the bottom indicate the maximum matching ratio (MMR) scores, and the bars on the top represent the recall scores. Higher bars represent better performance. In each bar, the specific score is shown if it is larger than 0.2. We can see that when benchmark algorithms are combined with HirHide, they achieve better MMR, recall, and composite scores on most PPI datasets. As mentioned before, HirHide does not change the core of a standard algorithm but enables it to detect hidden hierarchical community structures. So the better performance demonstrates that detecting hidden hierarchical community structures helps detect complexes in PPI networks. Furthermore, the HirHide versions of the three benchmark algorithms perform better than their HiCode versions on all the datasets, which demonstrates that the hierarchical structure is necessary. Fig. 3 illustrates the performance of different algorithms on communities with increasing hiddenness values. We show the results of HirHide-MOD versus MOD and HirHide-Infomap versus Infomap on BioGRID, with MIPS as the reference dataset. A higher hiddenness value indicates that a community is more deeply hidden. Because there are hundreds of communities, we smooth the results to make them more concise. As illustrated in Fig. 3, the advantage of HirHide-MOD and HirHide-Infomap grows as the hiddenness value increases, indicating that HirHide has significant benefits in detecting hidden communities.

Synthetic data and experimental results
To some extent, the two evaluation metrics mitigate the problem that the reference sets are incomplete, but they may not be persuasive enough. A comparison on synthetic networks is therefore necessary. We build three synthetic networks, each of which contains two layers for hidden structure. The first layer consists of strong communities, and the second layer consists of relatively weak communities. In addition to the multi-layer feature, hierarchical structures are added. Specifically, in each layer, smaller and denser communities are added to make up the next hierarchical level.
The choice of edge probability follows a certain rule. According to the hidden community conception, the interactions between proteins in the hidden layer are relatively sparse, so the edge probability is lower than that of the first layer. When constructing the hierarchical structure, the edge probability of the leaf communities at the second level is slightly higher than that of the root communities, to further highlight the structure of sub-communities. The first network, called $G_{360}$, contains 360 nodes. At the first layer, the 360 nodes are divided into three communities, each with 120 nodes. According to the definition of a community, internal edges are added to the three communities with probability $P_{11} = 0.2$, and no edge is added outside (Fig. 4 (a)). Then the hierarchical structure is appended. The 120 nodes of each community are subdivided into two sub-communities. Six sub-communities make up the second hierarchical level, each of which contains 60 nodes. Edges are added inside the six sub-communities with probability $P_{12} = 0.3$ (Fig. 4 (b)). This completes the first layer. According to the idea of hidden structure, before building the second layer, the node numbers in the first layer are randomly scrambled so that the previous community structure is evenly distributed in the adjacency matrix (Fig. 4 (c)).
The second layer is constructed on the randomly scrambled graph. In the second layer, the 360 nodes are divided into four communities, each of which has 90 nodes. Internal edges are added in these communities with probability $P_{21} = 0.15$ (Fig. 4 (d)). Then we add a hierarchical structure: each community is divided into three small sub-communities, each of which has 30 nodes. Edges are added inside these sub-communities with probability $P_{22} = 0.25$ (Fig. 4 (e)).
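The construction of the 360-node network can be sketched in Python as follows. The block sizes and edge probabilities come from the description above; everything else is an assumption of this sketch, including the function names, the use of independent Bernoulli trials per node pair, the merging of an edge drawn more than once into a single edge, and the use of a shuffled node order to model the random scrambling between layers.

```python
import random
from itertools import combinations


def chunks(items, size):
    """Split a list into consecutive blocks of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def add_block_edges(edges, blocks, probability, rng):
    """Add each within-block node pair as an edge with the given probability."""
    for block in blocks:
        for u, v in combinations(block, 2):
            if rng.random() < probability:
                edges.add((min(u, v), max(u, v)))


def build_g360(seed=0):
    rng = random.Random(seed)
    edges = set()

    # Layer 1: three root communities of 120 nodes, each split into two of 60.
    order1 = list(range(360))
    l11, l12 = chunks(order1, 120), chunks(order1, 60)
    add_block_edges(edges, l11, 0.20, rng)
    add_block_edges(edges, l12, 0.30, rng)

    # Layer 2 is built on a scrambled node order so that it cuts across layer 1:
    # four root communities of 90 nodes, each split into three of 30.
    order2 = order1[:]
    rng.shuffle(order2)
    l21, l22 = chunks(order2, 90), chunks(order2, 30)
    add_block_edges(edges, l21, 0.15, rng)
    add_block_edges(edges, l22, 0.25, rng)

    return edges, {"L11": l11, "L12": l12, "L21": l21, "L22": l22}


edges, truth = build_g360(seed=1)
print({name: len(blocks) for name, blocks in truth.items()})
# {'L11': 3, 'L12': 6, 'L21': 4, 'L22': 12}
```

Because consecutive blocks are cut from the same node order, every 60-node block lies inside one 120-node block and every 30-node block lies inside one 90-node block, which gives the two-level hierarchy within each layer.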
Similarly, we also construct a network with 2000 nodes and a network with 3000 nodes, called $G_{2000}$ and $G_{3000}$.
Because MOD has the best performance among the three benchmark algorithms, we compare HirHide-MOD with MOD on our synthetic data. Table 2 shows the experimental results. $L_{11}$ and $L_{12}$ represent the dominant community layer, and $L_{21}$ and $L_{22}$ represent the hidden community layer. As illustrated in Table 2, HirHide-MOD has a slight advantage over MOD on $L_{11}$ and $L_{12}$, which means HirHide does not reduce the performance of traditional algorithms on the dominant communities. Moreover, HirHide-MOD has a significant advantage over MOD on $L_{21}$ and $L_{22}$: the scores of MOD on $L_{21}$ and $L_{22}$ are extremely low, while the scores of HirHide-MOD on $L_{21}$ and $L_{22}$ are high. Consequently, we can conclude that MOD cannot detect the hidden community layer, but once it is combined with the HirHide framework, HirHide-MOD can detect the hidden community layer well. These results are consistent with the results on real-world networks.

Confirmatory experiments
In the HirHide framework, the strong community structure detected by the base algorithm needs to be weakened. Once the concept of hierarchical structure is included, there is a choice between weakening communities at the top level of the hierarchical structure and weakening communities at the bottom level. To determine which level we should choose, we experiment on the synthetic network $G_{3000}$. Because the synthetic data is complete, we choose the F1 score as the evaluation metric. As illustrated in Table 3, weakening communities in the roots gives HirHide better performance, especially in level 2. In HirHide, there are two options for capturing sub-communities in a layer: capturing them on the original graph or capturing them on the reduced graph after weakening other layers. We again use the F1 score to evaluate the performance. As illustrated in Table 4, in graph $G_{3000}$, capturing the sub-communities on the original graph has a slight advantage, but in graph $G_{2000}$ we have the opposite result. Consequently, which graph is chosen does not make a big difference overall.
We compare the gene ontology information of the two sets of complexes, the hidden complexes and the dominant complexes, in AmiGO 2 [34], and find a significant difference between the two sets in the amount of ancestor evidence used in manual assertion. The hidden complexes typically have much more ancestor evidence (about 2.5 times) than the dominant complexes. Fig. 5 visualizes the difference. Approximately 82% of the hidden complexes have more than 480 pieces of ancestor evidence. For the dominant complexes, this fraction is around 40% (see details for each complex in Additional file 1: Table S1). Although the amount of evidence can be affected by other factors such as the importance of a complex, such a significant difference indicates consistency between the hiddenness value and the amount of ancestor evidence. Therefore, we conjecture that complexes with high hiddenness values have more ancestors or have a deeper relationship with their ancestors. Traditional methods determine hierarchy by the relationship of containing or being contained, which cannot work for the hierarchy of gene ontology because of the complicated relationships among gene products. We find that determining hierarchy by hiddenness value is more consistent with gene ontology. So the hidden structure found by HirHide is a useful supplement to the gene ontology. Cases where the ancestor evidence is inconsistent with the hiddenness value for a gene product are worth further exploration.

Discussion
A HirHide-combined algorithm can detect additional hidden hierarchical communities that cannot be detected without HirHide. Some of them fit the characteristics of protein complexes with sparse internal connections, so they can serve as predicted protein complexes. Here, predicted protein complexes are communities that do not appear in the reference sets but have the potential to be protein complexes.
However, the detection of these hidden hierarchical communities is based solely on the density property of PPI networks. Not all of them are reliable enough, which is a main weakness of HirHide. Emerging patterns (EPs) are conjunctive patterns that contrast sharply between different classes of data and that contain more informative properties such as degree statistics, clustering coefficients, topological coefficients, and eigenvalues of a sub-graph. Recently, EPs have been exploited to address the complex prediction problem [21]. In this approach, a feature vector is first constructed to describe the critical properties of the reference protein complexes as well as those of random non-complex communities. EPs are then discovered by contrasting the feature vectors of reference protein complexes and random non-complex communities. Next, the discovered EPs are used to find potential complexes. We combine the results of the two methods to screen for more reliable predicted protein complexes. Each complex predicted by HirHide is considered a more reliable predicted complex if it is quite similar to complexes discovered by EPs. Examples of the predicted complexes are illustrated in Additional file 1: Figure S1. Note that the communities detected by HirHide are based on mathematical prediction. More reliable results need to be confirmed by biological experiments.