Protein complexes detection based on node local properties and gene expression in PPI weighted networks

Background Identifying protein complexes from protein–protein interaction (PPI) networks is a crucial task, and many related algorithms have been developed. Most algorithms use only the direct neighbors of nodes and ignore resource allocation and second-order neighbors, although the effective use of such information is crucial to protein complex detection. Results Based on this observation, we propose a new method, NRAGE-WPN, which combines node resource allocation and gene expression information to weight the protein network, and in which protein complexes are detected based on core-attachment structure and second-order neighbors. Conclusions Through comparison with eleven methods on yeast and human PPI networks, the experimental results demonstrate that this algorithm not only performs better than other methods on 75% in terms of f-measure+, but also achieves an ideal overall performance in terms of a composite score consisting of five performance measures. This identification method is simple and can accurately identify more complexes. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04543-4.

Although great progress has been made in identifying protein complexes, laboratory-based methods are expensive, ineffective and sometimes even infeasible, and only some protein complexes have been located. In addition, laboratory experiments are often incomplete because of the constraints of experimental conditions. To overcome the limitations of laboratory-based methods, a large number of computational algorithms have been designed as alternative methods to identify protein clusters, such as density-based clustering [4][5][6][7][8], hierarchical clustering [8][9][10], partition-based clustering [11,12], flow simulation-based clustering [13][14][15][16] and other methods that integrate multiple kinds of biological and topological information [17][18][19][20]. Although protein complex detection methods have achieved some effective results, how to reasonably integrate local node information of the PPI network with gene expression information to construct weighted graphs, and how to define effective detection methods to identify complexes from the weighted network, still need further study. Most methods apply only direct neighbors to PPI network clustering problems, which is not sufficient. In fact, node resource allocation information and second-order neighbors often contain important latent information in PPI networks.
To address the above problems, we introduce a novel method based on resource allocation and gene expression in weighted PPI networks (called NRAGE-WPN), which relies on core-attachment structure and second-order neighbor searching. First, based on the resource allocation and gene expression of the PPI network, a new weight metric is designed to accurately describe the interaction between proteins. Then our method detects a series of dense complex cores based on density and network diameter constraints, and the final complexes are recognized by expanding the cores with the second-order neighbors of their nodes. This identification method is simple and can accurately identify more complexes.

Methods
Detecting protein complexes from PPI data with a computational approach is a useful supplement to the limited experimental methods. Besides improvements in graph clustering techniques, successful and accurate protein complex prediction depends more on the construction of weighted graphs. Therefore, constructing a weighted graph for protein interactions is essential. In this section, we introduce a novel method based on resource allocation and gene expression in weighted PPI networks, which has two main steps. First, the reliability of the protein interaction data is evaluated by constructing a weighted graph that considers both common neighbor information and gene expression profiles. Second, protein complexes are detected in this new weighted graph based on core-attachment structure and second-order neighbors. The workflow of our method is shown in Fig. 1.

Assessing the reliability of protein interaction
To represent a PPI network, a 3-element tuple G = (V, E, W) is used, where V is a set of N proteins, E = {e_ij} is the set of PPI edges, and the edge values are stored in matrix W. For each pair of nodes i, j ∈ V, the edge e_ij is assigned a score w_ij. Inspired by reference [21], the resource allocation (RA) index is introduced to measure the similarity of interacting proteins in a network, and a weighted graph based on resource allocation (WRA) is constructed in this step.
Taking Fig. 2 as an example, there is an edge between node 1 and node 2 and the two nodes have no common neighbors, but e_12 is an important bridge for information transmission between node group {1, 2, 6, 7} and node group {1, 2, 3, 4, 5}. For simplicity, it is assumed that transmitter 1 carries resources and delivers them equally among all its neighbors. Based on this, the similarity of two nodes is given in Eq. (1). Consider node i and node j, which are directly connected but have no common neighbors: node i can transmit information to node j through edge e_ij, which supports the communication between the two clusters {1, 2, 6, 7} and {1, 2, 3, 4, 5}. WRA takes values in [0, 1]. This measure requires only information about the nearest neighbors and therefore has very low computational complexity. N(i) is the set consisting of node i and its neighbors, and N(j) is the set consisting of node j and its neighbors.
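The idea of scoring an edge from closed neighborhoods can be sketched as follows. This is a minimal illustration only, not the paper's Eq. (1): it assumes the classical resource allocation index from the link-prediction literature (each shared node z contributes 1/degree(z)), applies it to the closed neighborhoods N(i) and N(j) described above, and rescales by the largest score so that values fall in [0, 1]. The function names `build_adjacency` and `ra_weights` and the rescaling step are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def ra_weights(edges):
    """Score each edge with an RA-style index over closed neighborhoods.

    Assumption: score(i, j) = sum of 1/degree(z) over z in N[i] & N[j],
    where N[i] contains i itself; scores are then divided by the maximum.
    """
    adj = build_adjacency(edges)
    raw = {}
    for u, v in edges:
        if u == v:
            continue
        shared = (adj[u] | {u}) & (adj[v] | {v})
        raw[frozenset((u, v))] = sum(1.0 / len(adj[z]) for z in shared)
    top = max(raw.values(), default=1.0)
    return {edge: score / top for edge, score in raw.items()}
```

Because each endpoint belongs to its own closed neighborhood, a bridge edge whose endpoints share no common neighbor still receives a positive score under this assumption, which matches the motivation given for edge e_12.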

Pearson's correlation of expression levels
Co-expressed genes tend to encode interacting proteins [22]. In this paper, we mainly concentrate on linear gene expression networks unless explicitly stated otherwise, and Pearson's correlation coefficient of expression levels (PCC) is employed as the biological information for an interacting protein pair p and q. According to the GBA principle (i.e. genes with similar expression spectra have similar biological functions) [23], a higher correlation suggests a higher confidence in their interaction. PCC is generally used to measure the strength of the linear relationship between two variables and is also commonly used to measure the linear relationship between two sets of gene expression values. Suppose there are two gene expression profiles X = (x_1, ..., x_n) and Y = (y_1, ..., y_n). Matrix W_P is formed by the PCC calculation formula, which is defined in Eq. (3). PCC takes values in [−1, 1]. PCC(X, Y) < 0 means that genes X and Y are negatively correlated, PCC(X, Y) > 0 means that they are positively correlated, and PCC(X, Y) = 0 means that there is no linear correlation between them. If PCC(X, Y) < 0, the protein pair is removed from the PPI network in order to reduce the negative effect of noisy data on the detection of protein complexes. Only PCC values in [0, 1] are therefore used in this step.
where x̄ denotes the average expression value of gene X over the 36 time points and ȳ denotes the average expression value of gene Y over the 36 time points.
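The PCC step can be illustrated with a short sketch. It uses the standard Pearson formula and drops edges with a negative coefficient, as described above. The names `pearson` and `pcc_weights`, the dictionary layout of the expression data, and the choice to skip pairs with a missing or constant profile are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def pcc_weights(edges, expression):
    """Keep only edges whose endpoints are non-negatively correlated.

    `expression` maps a protein to its list of expression values.
    Assumption: pairs with a missing or constant profile are skipped.
    """
    weights = {}
    for u, v in edges:
        if u not in expression or v not in expression:
            continue
        r = pearson(expression[u], expression[v])
        if r is not None and r >= 0:
            weights[frozenset((u, v))] = r
    return weights
```

With 36 time points per gene, each profile passed to `pearson` would simply be a list of 36 values.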

Weighted graph construction
In this part, we describe how to compute the edge weight by combining gene expression information (GEI), based on PCC, with RA information between two interacting proteins. The final weight construction formula is given in Eq. (5).
Matrix W_P is constructed from the Pearson correlation coefficient and matrix W_N is constructed from RA. A simple calculation shows that the values range from 0 to 2; the final values are normalized to [0, 1]. α (0 ≤ α ≤ 1) is a constant: a smaller α means that the importance of the modules depends more on the RA information of the network, and a larger α means that it depends more on gene expression information. When α = 0, the weighting method considers only RA information. When α = 1, it considers only gene expression information. Therefore, Eq. (5) can measure the differential importance of interactions in protein networks by integrating local node information and biological information.
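The role of α can be made concrete with a small sketch. It assumes a convex combination, W = αW_P + (1 − α)W_N, which is the form suggested by the two limiting cases α = 0 and α = 1; the exact form of Eq. (5) and its normalization are not reproduced here. Under this assumption the result already lies in [0, 1] when both inputs do, so no rescaling is applied. The name `combine_weights` and the dictionary representation of the matrices are hypothetical.

```python
def combine_weights(w_p, w_n, alpha=0.3):
    """Blend expression-based and RA-based edge weights.

    Assumption: W = alpha * W_P + (1 - alpha) * W_N, evaluated on the edges
    that survive the PCC filter (the keys of w_p). Edges missing from w_n
    contribute an RA weight of 0.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        edge: alpha * p + (1.0 - alpha) * w_n.get(edge, 0.0)
        for edge, p in w_p.items()
    }
```

For example, an edge with an expression weight of 0.8 and an RA weight of 0.5 receives 0.3 × 0.8 + 0.7 × 0.5 = 0.59 at α = 0.3.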

Detecting protein complexes in weighted graphs
The proposed algorithm, NRAGE-WPN, consists of two phases: weighted graph construction, and core-attachment protein complex detection based on second-order neighbor searching. In the weighted graph construction phase, gene expression information and common neighbor information are integrated. A detailed description of the algorithm is outlined in Algorithm 1. Line 1 constructs matrix W_N from the given PPI datasets. Line 2 constructs matrix W_P from the gene expression data. Line 3 constructs the new matrix W from W_N and W_P, where the protein interaction confidence is the sum of the weights of W_N and W_P. Lines 4-8 identify the core clusters. Lines 9-11 enlarge the core clusters based on the second-order neighbors of the nodes in each core.
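As an illustration of the second-order neighbor search used when enlarging a core, the sketch below collects the first-order and second-order neighbors of a core. It shows only how the candidate set could be gathered; the rule that decides which candidates are attached is part of Algorithm 1 and is not reproduced here. The name `second_order_neighbors` and the adjacency-map input are assumptions.

```python
def second_order_neighbors(adj, core):
    """Return (first, second): nodes one hop and two hops away from a core.

    `adj` maps each node to the set of its neighbors. Nodes already in the
    core are excluded from both sets, and first-order neighbors are excluded
    from the second-order set.
    """
    core = set(core)
    first = set()
    for node in core:
        first |= adj.get(node, set())
    first -= core
    second = set()
    for node in first:
        second |= adj.get(node, set())
    second -= core | first
    return first, second
```

On a path a-b-c-d with core {a}, this returns {b} as the first-order neighbors and {c} as the second-order neighbors.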
In this algorithm, density and diameter are employed as the conditions for complex detection.
If a node meets the two constraints in condition (7), it is added to the current cluster (subgraph). The density threshold is usually set to 0.7 and δ is set to 2, according to references [12,24].
(1) Density: The degree of a node v is the sum of the weights of the edges connected to this node. The density of the weighted subgraph G = (V, E) is defined in (6), where |N| is the number of nodes in G and w(e) is the weight of edge e_ij in G.
(2) Network diameter: The diameter of a cluster is the length of the longest shortest path between any two of its nodes.
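The two constraints can be checked with a short sketch. It assumes a commonly used weighted density, 2·Σw(e) / (|N|(|N| − 1)), which may differ from the paper's Eq. (6), and it measures the diameter in hops with breadth-first search. The function names and the default values `delta=2` and `min_density=0.7` (taken from the thresholds quoted above) are illustrative.

```python
from collections import deque
from itertools import combinations


def weighted_density(nodes, weights):
    """Assumed density: twice the total internal edge weight over n(n-1)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(weights.get(frozenset(pair), 0.0) for pair in combinations(nodes, 2))
    return 2.0 * total / (n * (n - 1))


def diameter(nodes, weights):
    """Longest shortest path (in hops) inside the subgraph; inf if disconnected."""
    nodes = set(nodes)
    adj = {u: set() for u in nodes}
    for edge in weights:
        if len(edge) != 2:
            continue
        u, v = tuple(edge)
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    longest = 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(nodes):
            return float("inf")
        longest = max(longest, max(dist.values()))
    return longest


def meets_condition(nodes, weights, delta=2, min_density=0.7):
    """True when the subgraph has diameter <= delta and density >= min_density."""
    return diameter(nodes, weights) <= delta and weighted_density(nodes, weights) >= min_density
```

Here `weights` maps `frozenset({u, v})` to the edge weight, so a fully connected triangle with unit weights has density 1 and diameter 1 and passes both checks.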

Datasets
The effectiveness of our method is evaluated using PPI networks and gold standards of protein complexes from yeast and human. Detailed information is shown in Table 1, and further details can be found in reference [25]. The GSE3431 dataset [26], which records expression data at 36 time points during three successive metabolic cycles, is employed in our paper.

Evaluation criteria
To evaluate our method on benchmark datasets and compare NRAGE-WPN with other methods, the following evaluation measures are used in this section: sensitivity (SN), positive predictive value (PPV), accuracy (ACC), separation (SEP), fraction match (FRM), maximum matching ratio (MMR), precision (Prec), recall (Rec), f-measure, precision+, recall+, f-measure+, the sum (F_MMR) of MMR and f-measure+, and the composite score (CS) of MMR, FRM, SEP, ACC and f-measure [25]. Given a set of benchmark protein complexes R = {R_1, R_2, ..., R_n} and a set of predicted clusters P = {P_1, P_2, ..., P_m}, let R_i and P_j denote a benchmark complex from R and a predicted complex from P, respectively. T_ij is the number of proteins shared by the ith benchmark complex R_i and the jth predicted complex P_j. SN, PPV and ACC are defined as follows.
Condition (7): diameter ≤ δ and density ≥ the density threshold.
Table 1 (excerpt) PPI networks: STRING [39], PIPS [40]. Gold standards: CYC2008 [41], MIPS [42], CORUM [43].
N_i denotes the number of proteins in the ith benchmark complex. Here, n is the number of benchmark complexes and m is the number of predicted complexes.
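The three measures can be computed from the overlap matrix T as in the following sketch. It assumes the definitions that are standard in the protein complex literature: SN = Σ_i max_j T_ij / Σ_i N_i, PPV = Σ_j max_i T_ij / Σ_j Σ_i T_ij, and ACC as the geometric mean of SN and PPV. These are assumed to correspond to the paper's equations, and the function name `sn_ppv_acc` is hypothetical.

```python
import math


def sn_ppv_acc(reference, predicted):
    """Compute (SN, PPV, ACC) for lists of protein sets.

    T[i][j] is the number of proteins shared by reference complex i and
    predicted complex j. ACC is the geometric mean of SN and PPV.
    """
    if not reference or not predicted:
        return 0.0, 0.0, 0.0
    t = [[len(set(r) & set(p)) for p in predicted] for r in reference]
    sn_den = sum(len(set(r)) for r in reference)
    sn = sum(max(row) for row in t) / sn_den if sn_den else 0.0
    ppv_num = sum(max(t[i][j] for i in range(len(reference))) for j in range(len(predicted)))
    ppv_den = sum(sum(row) for row in t)
    ppv = ppv_num / ppv_den if ppv_den else 0.0
    return sn, ppv, math.sqrt(sn * ppv)
```

For two benchmark complexes {1, 2, 3} and {4, 5} and two predictions {1, 2} and {4, 5, 6}, this gives SN = 4/5 and PPV = 4/4.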
To evaluate protein complex prediction in terms of precision and recall, the Jaccard index is employed. A predicted complex P_j is considered to match a real complex R_i if their Jaccard similarity is greater than 0.5.
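A minimal sketch of Jaccard-based matching is shown below. It assumes the usual conventions that precision is the fraction of predicted complexes matching at least one benchmark complex, recall is the fraction of benchmark complexes matched by at least one prediction, and f-measure is their harmonic mean. The names `jaccard` and `precision_recall_f` are illustrative.

```python
def jaccard(a, b):
    """Jaccard index of two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def precision_recall_f(reference, predicted, threshold=0.5):
    """Precision, recall and f-measure with Jaccard matching (> threshold)."""
    matched_pred = sum(
        1 for p in predicted if any(jaccard(p, r) > threshold for r in reference)
    )
    matched_ref = sum(
        1 for r in reference if any(jaccard(p, r) > threshold for p in predicted)
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, the prediction {1, 2} matches the benchmark complex {1, 2, 3} because their Jaccard index is 2/3, which exceeds 0.5.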
For precision+, recall+ and f-measure+, the neighborhood affinity score NA(P_j, R_i) between P_j and R_i, as defined in Eq. (10), is used to determine whether they match. If NA(P_j, R_i) = ω and ω ≥ t, P_j and R_i are considered to match. In this paper, t is set to 0.20. |P_j| and |R_i| are the numbers of proteins in P_j and R_i, respectively.
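The matching rule based on neighborhood affinity can be sketched as follows. The sketch assumes the widely used definition NA(P, R) = |P ∩ R|² / (|P| · |R|), which is taken here to correspond to Eq. (10), and counts a match when NA ≥ t. The names `neighborhood_affinity` and `plus_scores` are hypothetical.

```python
def neighborhood_affinity(p, r):
    """Assumed NA score: squared overlap divided by the product of the sizes."""
    p, r = set(p), set(r)
    if not p or not r:
        return 0.0
    return len(p & r) ** 2 / (len(p) * len(r))


def plus_scores(reference, predicted, t=0.2):
    """Return (precision+, recall+, f-measure+) using NA >= t as the match rule."""
    matched_pred = sum(
        1 for p in predicted if any(neighborhood_affinity(p, r) >= t for r in reference)
    )
    matched_ref = sum(
        1 for r in reference if any(neighborhood_affinity(p, r) >= t for p in predicted)
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

As a check on the threshold, a prediction {1, 9, 10} shares one protein with the benchmark complex {1, 2, 3}, giving NA = 1/9, which is below t = 0.2 and therefore not a match.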
$$\mathrm{recall}^{+} = \frac{|\{R_j \mid R_j \in R \wedge \exists P_i \in P,\ P_i \text{ matches } R_j\}|}{n}$$

$$\mathrm{precision}^{+} = \frac{|\{P_i \mid P_i \in P \wedge \exists R_j \in R,\ R_j \text{ matches } P_i\}|}{m}$$

$$f\text{-measure}^{+} = \frac{2 \cdot \mathrm{recall}^{+} \cdot \mathrm{precision}^{+}}{\mathrm{recall}^{+} + \mathrm{precision}^{+}}$$

Comparative analysis is performed with the sum score of MMR, FRM, SEP, ACC and f-measure. The performance of the different methods is compared for yeast and human with the corresponding complex datasets and PPI networks. First, as illustrated in Fig. 3, NRAGE-WPN achieves the best performance on the Collins, Gavin and KroganExt networks and performs better than the other ten methods, with the exception of PC2P, on KroganCore in terms of CS on CYC2008. On MIPS, NRAGE-WPN outperforms all methods on the Collins network and outperforms ten methods, all except PC2P, on the Gavin, KroganExt and KroganCore networks (Additional file 1: Table S1). On CORUM, in 2 combinations, NRAGE-WPN achieves the best performance in terms of CS. Second, in terms of f-measure+, NRAGE-WPN achieves the best performance except in Collins on MIPS. Third, on the remaining measures, NRAGE-WPN performs better than most other methods; all details are given in Additional file 1: Table S1.

Assessment performances of f-measure+ and accuracy with parameter α
Evaluating the influence of parameter α lets us observe more directly how it affects the experimental results, and helps to understand the advantages and disadvantages of the algorithm and to improve it. The critical parameter α is examined in Figs. 4 and 5. In Fig. 4, when α is greater than or equal to 0.3, the f-measure+ tends to be stable. In Fig. 5, the best accuracy is achieved when α = 0.3. In this article, we take α = 0.3.

Robustness to the different thresholds (t)
To illustrate the comprehensive performance of NRAGE-WPN, Fig. 6 shows the f-measure+ of the different methods for nine thresholds t = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Figure 6a compares the f-measure+ on the CYC2008 benchmark dataset in Collins, where NRAGE-WPN outperforms the other eleven methods. Similar results are found on the CYC2008 benchmark in Gavin in Fig. 6b. Other comparisons are shown in Additional file 1: Fig. S1, which illustrates that NRAGE-WPN performs better than other combinations on 50%. This further demonstrates the effectiveness of fusing local node information and gene expression data.

Functional analysis
For the protein complexes identified by the NRAGE-WPN algorithm, we measure the effectiveness of the algorithm quantitatively and qualitatively by analyzing the biological significance of the identified complexes. Real protein complexes often present high functional homogeneity, so the function enrichment test is employed to demonstrate the biological significance of the detected protein complexes [31]. The function enrichment analysis of protein complexes identified from the yeast PPI network is carried out to further verify the effectiveness of the NRAGE-WPN algorithm. The analysis and comparison of P values are shown in Table 2. When the P value is greater than 0.001, it is generally considered that the function of the complex is very likely to be randomly assigned and has no biological significance. The percentages in brackets in Table 2 indicate the ratio of the number of complexes in a given interval to the number of complexes in all intervals. For example, a total of 325 complexes are predicted by NRAGE-WPN on CYC2008 in Collins, and the effective percentage of NRAGE-WPN is greater than that of the other eleven algorithms. Further, with respect to biological relevance, the enrichment scores of the annotations are employed to evaluate the quality of the predicted complexes. The average of detected complexes with at least one enriched annotation over all clusters is compared among eleven approaches on six datasets in Additional file 1: Table S2. The results illustrate that NRAGE-WPN predicts biologically relevant clusters with enrichment scores within the top 70% of other methods in terms of the different GO categories.

Effectiveness of RA
Because of the noisy data in the PPI network, NRAGE-WPN uses gene expression and RA information to assign a weight to each interaction of the PPI network. To assess the effect of RA on complex detection, we run NRAGE-WPN without RA information and compare its results with those of the normal NRAGE-WPN, which employs both gene expression and RA information. Without RA, the weighted PPI network is constructed from gene expression only. Figure 7 shows the results of NRAGE-WPN in RA-OFF and RA-ON modes on the Collins, Gavin, KroganCore and KroganExt datasets with the CYC2008 and MIPS benchmarks, respectively. Figure 7 shows that introducing RA improves F_MMR.
In RA-ON mode on the Collins data, F_MMR increases by 8.8% for the CYC2008 benchmark and by 8.3% for the MIPS benchmark. According to Fig. 7, the same trend is seen in the other three PPI datasets on both benchmarks. This experiment shows that using RA can reduce the effect of noisy data and improve the overall performance of complex detection.
Fig. 7 The effectiveness of NRAGE-WPN when RA is off/on with CYC2008 and MIPS benchmarks
Fig. 8 The effectiveness of using SNS in NRAGE-WPN compared with NRAGE-WPN without SNS in four PPI datasets with CYC2008 and MIPS benchmarks

Effectiveness of second-order neighbors searching (SNS)
The second phase of the NRAGE-WPN method enlarges the core complexes using second-order neighbors. After core protein complexes are detected from the weighted PPI network, the core-attachment nature of complexes means that there may be many attachment parts to add to the cores. The cores and attachment parts are then combined to form the final complexes. To assess the effect of introducing second-order neighbor searching (SNS), we run NRAGE-WPN without its second phase. Figure 8 compares the SNS-ON and SNS-OFF modes in terms of f-measure+. On the CYC2008 benchmark, using the SNS phase yields a rise of 5.2%, 3.2%, 7.8% and 7.7% in Collins, Gavin, KroganCore and KroganExt, respectively. As the results show, f-measure+ can be improved by introducing second-order neighbor searching.

Assessment of density in different weighted graphs
Although PCC cannot identify whether gene variables are directly or indirectly regulated [33][34][35], in this paper we mainly use PCC as the biological information for constructing the weighted graph from gene expression, as it is one of the most commonly used methods for constructing gene regulatory networks. At the same time, we discuss the influence of nonlinear correlation of gene expression on the density of the whole network. We construct another four weighted graphs based on the KBRV [32] method using Eq. (12), and the densities of the networks are compared in Fig. 9.

$$W = \alpha W_P + (1 - \alpha)\,\mathrm{KBRV} \qquad (12)$$

First, the results show that the four weighted networks based on KBRV increase the density of the PPI network, because the weights of the protein pairs can be increased by Eq. (12). Second, when α belongs to [0.3, 0.5], the densities of the four weighted graphs built by Eq. (5) decrease slowly. In our experiment, α = 0.3 is used. Lastly, in our future work, we will focus on the nonlinear correlation of gene expression for weighted graph construction and complex detection.

Fig. 9 The density obtained with different α, compared with the KBRV method, for the construction of four weighted graphs in yeast

Conclusions
The identification of protein complexes is important for discovering and understanding cellular organization and biological processes in PPI networks. In this paper, a new approach named NRAGE-WPN is proposed for identifying protein complexes in protein–protein interaction networks. Based on the resource allocation and gene expression of the PPI network, we first design a new weight metric to accurately describe the interaction between proteins. Our method then constructs a series of dense complex cores based on density and network diameter constraints, and the final complexes are recognized by expanding the cores with the second-order neighbors of their nodes. Through comparison with eleven methods on yeast and human PPI networks, the experimental results demonstrate that this algorithm not only performs better than other methods on 75% in terms of f-measure+, but also achieves an ideal overall performance in terms of a composite score consisting of five performance measures. In future work, we will focus on locating sparse and dense protein complexes by integrating multiple sources of information.