Identifying protein complexes and functional modules in E. coli PPI networks

Background: Escherichia coli has been at the center of microbial research for decades, making it a standard organism for studying molecular mechanisms. Molecular complexes, operons and functional modules are important functional units of Escherichia coli. Most previous studies detected E. coli protein complexes by experimental methods, whereas comparatively few have predicted protein complexes, and especially functional modules, from large-scale proteomic data. Identifying the protein complexes and functional modules of E. coli is crucial for revealing the principles of cellular organization, processes and functions.
Results: In this study, the protein complexes and functional modules in two high-quality binary interaction datasets of E. coli are predicted with an efficient edge clustering algorithm (ELPA) for complex biological networks. Measured against the gold-standard protein complexes and function annotations provided by the EcoCyc dataset, most topological modules predicted in the two datasets match the real protein complexes, cellular processes and biological functions very well. Analysis of the corresponding complexes and functional modules shows that all predicted protein complexes are fully covered by one or more functional modules. Furthermore, we compared the results of ELPA with those of a well-known node clustering algorithm (MCL) on the same PPI network of E. coli and found that ELPA outperforms MCL in matching the gold-standard complexes.
Conclusions: We therefore surmise that the topological modules of PPI networks detected by ELPA fit well with real protein complexes and functional units. In most predicted topological modules, the protein complexes and the corresponding functional modules overlap strongly. ELPA is an effective tool for predicting protein complexes and functional modules in PPI networks of E. coli.
The function annotations of malE, malG/malF, and malK are maltose ABC transporter periplasmic binding protein, maltose ABC transporter membrane subunit, and maltose ABC transporter ATP-binding subunit, respectively. The above results suggest that protein complexes are closely related to their corresponding functional modules.


Introduction
Escherichia coli is a primary model organism in microbiology and perhaps the most intensively studied bacterial species [1][2][3][4]. Even so, only two-thirds of the protein-coding gene products of E. coli K-12 currently have experimental evidence indicating a biological role; the others remain functionally unannotated (orphans) [5]. The growing body of large-scale genomic data makes E. coli particularly well suited to systematic investigation of microbial protein components and functional relationships from a global perspective. Experiments and data analysis based on PPI network models are two effective ways to identify the protein complexes and functional modules of E. coli. For decades, research on E. coli has been dominated by experimental methods [5][6][7]. Experimental methods have the advantage of direct verification, but they also suffer from high false-positive and false-negative rates. With the development of genomic techniques, several high-quality binary protein-protein interaction (PPI) maps of E. coli have been released. Although the construction and understanding of these networks are far from complete, the protein complexes and functional modules predicted from them are useful complements to experimental methods.
A protein complex is formed when two or more functionally related peptide chains associate through disulfide bonds or other proteins to perform a given biological function. A functional module is a basic functional unit of proteins that reflects complex relationships involving multiple types of biological interaction [8]. Unveiling protein complexes and functional modules from the PPI network of E. coli is therefore important for understanding the basic biological functions of its proteins. Some studies have predicted protein complexes from PPI networks of E. coli [5,8,9], while others have focused on functional relations in transcriptional regulation [10][11][12][13] and metabolic pathways [13][14][15] of E. coli. However, few studies have predicted functional modules of E. coli, and fewer still have compared and analyzed protein complexes and functional modules together.
In recent years, the development of high-throughput techniques for detecting pairwise protein interactions (such as two-hybrid systems and mass spectrometry) has made it possible to construct PPI networks of E. coli at the genomic level. In this study, the protein complexes and functional modules of two high-quality PPI networks of E. coli are predicted with the ELPA method [16]. Measured against the gold-standard protein complexes and function annotations provided by the EcoCyc dataset, most topological modules predicted by ELPA match real protein complexes, cellular processes or biological functions well. For example, in the PPI network provided by Hu et al., 75.8% of the predicted topological modules match one or more real protein complexes of E. coli, 88.1% of the real protein complexes match one or more topological modules, and 88.3% of the predicted topological modules match at least one functional unit of E. coli. On the same PPI network of E. coli, we compared the results of ELPA with those of the Markov Clustering algorithm (MCL, the most popular method for detecting protein complexes) [17] and found that ELPA outperforms MCL in matching the gold-standard complexes. Finally, we compared the protein complex and the functional module found within the same topological module. The results show that in most topological modules the protein complexes and the corresponding functional modules overlap strongly, and that the functions of the protein complexes are broadly consistent with those of the corresponding functional modules. We therefore surmise that ELPA may serve as an effective method for predicting protein complexes and functional modules in PPI networks of E. coli.

PPI Datasets of E. coli
To better validate the method, two different large-scale PPI datasets of Escherichia coli are assembled in this study. The first is the combined interaction dataset (also known as 'Coreexperimental') of E. coli released by Bacteriome.org [18], which integrates the classic PPI dataset of Hu et al. [5] and functional datasets [18]. It contains 7613 binary interactions between 2283 E. coli proteins, derived from large-scale tandem affinity purification followed by mass spectrometry (AP/MS) experiments. The second is a PPI dataset provided by Rajagopala et al. [7], generated by the yeast two-hybrid (Y2H) method. It contains 3929 binary Y2H interactions between 2039 E. coli proteins. To predict protein complexes more effectively, only binary interactions associated with proteins of known complexes are considered. In the end, 3280 interactions between 1298 proteins were retained from the first PPI dataset, and 1602 interactions between 1144 proteins from the second.
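The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the toy protein names, and the `require_both` switch are assumptions, since the text does not state whether one or both interaction partners must belong to a known complex.

```python
def filter_interactions(interactions, complexes, require_both=True):
    """Keep binary interactions whose proteins occur in known complexes.

    interactions: iterable of (protein_a, protein_b) pairs
    complexes:    iterable of sets of protein names (the benchmark)
    require_both: assumption; if True both partners must be benchmark
                  proteins, if False one partner is enough
    """
    known = set().union(*complexes)          # all proteins in any complex
    keep = all if require_both else any      # membership rule for a pair
    return [(a, b) for a, b in interactions
            if keep(p in known for p in (a, b))]


# Toy data with hypothetical protein names
complexes = [{"A", "B"}, {"C"}]
interactions = [("A", "B"), ("A", "X"), ("X", "Y")]
print(filter_interactions(interactions, complexes))
print(filter_interactions(interactions, complexes, require_both=False))
```

The stricter rule keeps the reduced network entirely inside the benchmark, which makes matching cleaner but discards interactions that link benchmark proteins to unannotated ones.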

Protein complexes and functional classes
The latest literature-curated protein complexes of E. coli released by the EcoCyc database [19] are taken as the "gold standard" protein complexes. Many protein complexes of E. coli contain only two proteins, so those protein complexes that contain at least two proteins showed in E. coli K-benchmark, 732 proteins included. Figure 1 shows the overlap between proteins involved in the E.

Clustering method
Many clustering algorithms for complex networks have been developed in the past decade, and most of them are based on node clustering. However, few clustering algorithms can be applied directly to complex biological networks. In this study, a novel algorithm based on edge clustering, the edge label propagation algorithm (ELPA) [16], was used to identify topological modules of E. coli in the two PPI networks above. Compared with node clustering, edge clustering naturally accommodates both the node attributes and the link attributes of a complex network and reflects the network topology better. ELPA is compared with a well-established node clustering algorithm later in the paper. The topological modules detected by ELPA are matched against the "gold standard" protein complexes and the functional annotations of EcoCyc to identify the corresponding protein complexes and functional modules, respectively.
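The idea of clustering edges rather than nodes can be illustrated with a generic label propagation over edges. This is only a sketch and is not the ELPA of [16], whose update rule is not given in this text. The adjacency rule (two edges are neighbours when they share a protein), the tie-breaking by smallest label, and all names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def edge_label_propagation(edges, max_iter=100):
    """Generic edge label propagation (illustrative; NOT the ELPA of [16]).

    Every edge starts with its own label and repeatedly adopts the most
    frequent label among edges sharing an endpoint with it (ties go to the
    smallest label). Each final label yields one module of proteins.
    """
    edges = sorted({tuple(sorted(e)) for e in edges})
    incident = defaultdict(list)             # protein -> indices of its edges
    for idx, (u, v) in enumerate(edges):
        incident[u].append(idx)
        incident[v].append(idx)

    labels = list(range(len(edges)))
    for _ in range(max_iter):
        changed = False
        for idx, (u, v) in enumerate(edges):  # asynchronous, fixed order
            counts = Counter(labels[j] for n in (u, v)
                             for j in incident[n] if j != idx)
            if not counts:
                continue                      # isolated edge keeps its label
            top = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == top)
            if new_label != labels[idx]:
                labels[idx] = new_label
                changed = True
        if not changed:
            break

    modules = defaultdict(set)
    for idx, (u, v) in enumerate(edges):
        modules[labels[idx]].update((u, v))
    return sorted(modules.values(), key=sorted)


# Toy network with hypothetical proteins: two disconnected triangles
toy = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f")]
print(edge_label_propagation(toy))
```

Because a protein belongs to every module that contains one of its edges, modules obtained this way may overlap, which is one reason edge clustering suits proteins that take part in more than one complex.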

Evaluation metrics
Precision, recall and F-measure are three commonly used evaluation metrics for the quality of predicted complexes [20]; they quantify how well the predicted complexes match the real ones. Let N_P denote the number of complexes predicted by a clustering method and N_B the number of real complexes in the gold standard. Let N_PC be the number of predicted complexes that match at least one real complex, and N_BC the number of real complexes that match at least one predicted complex. The three metrics are then defined as follows:

Precision = N_PC / N_P
Recall = N_BC / N_B
F = 2 × (Precision × Recall) / (Precision + Recall)
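The three metrics can be computed from sets of proteins as in the sketch below. The overlap score used to decide whether two complexes match, |P ∩ B|² / (|P| × |B|), and the threshold of 0.2 are assumptions chosen for illustration; the score actually used in this study is the one defined in [20]. All function names and protein names are hypothetical.

```python
def matching_score(predicted, real):
    """Assumed overlap score |P ∩ B|^2 / (|P| * |B|); not taken from the paper."""
    shared = len(predicted & real)
    return shared * shared / (len(predicted) * len(real))


def evaluate(predicted, gold, threshold=0.2):
    """Return (precision, recall, F) for lists of protein sets.

    A predicted and a real complex match when their score is larger than
    the threshold (the value 0.2 is an illustrative assumption).
    """
    n_pc = sum(any(matching_score(p, b) > threshold for b in gold)
               for p in predicted)            # predicted with >= 1 match
    n_bc = sum(any(matching_score(p, b) > threshold for p in predicted)
               for b in gold)                 # real with >= 1 match
    precision = n_pc / len(predicted)
    recall = n_bc / len(gold)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example with hypothetical proteins
gold = [{"A", "B", "C"}, {"D", "E"}, {"F", "G"}]
predicted = [{"A", "B", "C", "X"}, {"D", "Y", "Z", "W"}]
print(evaluate(predicted, gold))
```

In the toy data only the first predicted module passes the threshold (score 0.75), while the second shares a single protein with its closest real complex (score 0.125) and is not counted.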
The matching score (MS_PB) between a predicted complex and a real complex in the EcoCyc dataset is defined as in [20]. The set of true-positive predictions is obtained by selecting the predictions whose matching score is larger than a threshold. The scale of the PPI network is found to have a great influence on matching quality.
To evaluate the predicted complexes, the set of effectively matched complexes was obtained by selecting those whose matching score is larger than the threshold. Precision, Recall and F-measure were then used to evaluate the quality of each module based on these effectively matched complexes. The results show that most protein complexes predicted by ELPA match the corresponding real complexes well in both NetH and NetR. For example, in NetH, the 50th protein module consists of 8 proteins; among them, potF, potH and potI are three proteins of the putrescine ABC transporter complex, and potA, potB, potC and potD cover all four proteins of the putrescine/spermidine ABC transporter complex (shown in Fig. 2a). The 54th protein module consists of 5 proteins and is fully covered by two complexes, the ferrichrome transport system and the ferric coprogen transport system: fhuA, fhuB, fhuC and fhuD are ferrichrome transport system proteins, while fhuB, fhuC, fhuD and fhuE are ferric coprogen transport system proteins. Figure 2b shows that fhuB, fhuC and fhuD are common to the two complexes. As Fig. 3 shows, in NetR, among the 8 proteins of the 76th protein module, ccmA, ccmC, ccmD and ccmE are protoheme IX ABC transporter proteins, while ccmE, ccmF and ccmH are CcmEFGH holocytochrome synthetase proteins, and ccmE links the two complexes. The above analysis shows that ELPA can effectively predict the protein complexes implied in the E. coli PPI networks.

Identification of functional module
To determine whether the predicted protein modules have biological significance, each topological module is analyzed against the gold-standard protein functional annotations of the EcoCyc dataset. The topological modules of NetH and NetR detected by ELPA are matched with the benchmark functional annotations. If the majority of proteins (> 50%) in a predicted topological module are covered by a single functional term, the module is defined as a significant functional module. In NetH, most of the predicted modules (82.5%) are significant functional modules, and about 30% of them match perfectly (fully covered by a single functional term). For example, as Fig. 4a shows, 24 of the 25 proteins of the 19th protein module are annotated with GO:0006810, so the functions of these proteins are clearly similar; all 7 proteins of the 40th protein module are fully covered by each of GO:0005886, GO:0016020 and GO:0017004 (shown in Fig. 4b). In NetR, 60.8% of the predicted modules are significant functional modules, and 24.7% of them are fully covered by a single functional term. For example, as Fig. 5 shows, malY, malT, fixB, ybdM, recA and aes are annotated with GO:0005515; aes, recA and yhfW are annotated with GO:0005737; and yhfW and pyrC are annotated with GO:0046872. Clearly, some proteins have more than one function. The above analysis shows that ELPA can also effectively detect the functional modules implied in the E. coli PPI networks.
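The majority rule for significant functional modules can be sketched as below. The function names are hypothetical, and the example assumes, purely for illustration, that the module in Fig. 5 consists of exactly the eight proteins named in the text with exactly the annotations listed there.

```python
from collections import Counter


def best_term_coverage(module, annotations):
    """Return (term, fraction of module proteins annotated with it).

    annotations maps protein -> set of functional terms; ties between
    equally frequent terms go to the alphabetically first term.
    """
    counts = Counter(t for p in module for t in annotations.get(p, ()))
    if not counts:
        return None, 0.0
    term, n = max(sorted(counts.items()), key=lambda kv: kv[1])
    return term, n / len(module)


def classify(module, annotations):
    """Apply the > 50% rule; 'perfect' means full coverage by one term."""
    _, coverage = best_term_coverage(module, annotations)
    if coverage == 1.0:
        return "perfect"
    return "significant" if coverage > 0.5 else "not significant"


# Annotations as listed in the text for the NetR example (assumed complete)
annotations = {
    "malY": {"GO:0005515"}, "malT": {"GO:0005515"},
    "fixB": {"GO:0005515"}, "ybdM": {"GO:0005515"},
    "recA": {"GO:0005515", "GO:0005737"},
    "aes": {"GO:0005515", "GO:0005737"},
    "yhfW": {"GO:0005737", "GO:0046872"},
    "pyrC": {"GO:0046872"},
}
module = set(annotations)
print(best_term_coverage(module, annotations), classify(module, annotations))
```

Under these assumptions GO:0005515 covers six of the eight proteins, so the module clears the 50% threshold without being a perfect match.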

Comparative evaluations
Most clustering methods for complex networks are based on node clustering; among them, MCL has been shown to be superior to other methods in identifying functional modules or protein complexes in most cases [21,22]. ELPA is a novel edge clustering method that considers both node and link attributes and reflects the network structure better [16,23]. Next, we compare the clustering results of ELPA and MCL on the same PPI networks. ELPA is a parameter-free method, and MCL is run with its default parameters.
To compare the performance of MCL and ELPA, three metrics (Precision, Recall and F-measure) are used to evaluate the quality of the predicted protein complexes. Figure 6a compares the two methods on the topological modules of NetH and the corresponding protein complexes. ELPA achieves higher precision and F-measure than MCL, while MCL achieves higher recall: the precision, recall and F-measure of ELPA are 72.5%, 61.5% and 66.5%, while those of MCL are 55.1%, 65.5% and 59.9%, respectively. Figure 6b compares the two methods on the topological modules of NetH and the corresponding functional modules. The effective matching rate and the average matching rate of significant functional modules are used to evaluate the quality of the predicted functional modules. The effective matching rate and average matching rate of ELPA are 82.5% and 70.9%, while those of MCL are 85.9% and 70%, respectively. Similar results are obtained in NetR. As Fig. 7a shows, the precision, recall and F-measure of ELPA are 35.7%, 28.6% and 31.8%, while those of MCL are 24.3%, 32.1% and 27.7%, respectively. As Fig. 7b shows, the effective matching rate and average matching rate of ELPA are 60.8% and 74.3%, while those of MCL are 68.2% and 74.4%, respectively. The above results show that ELPA is an effective method for predicting protein complexes and functional modules of E. coli.
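As a consistency check, the reported F-measures can be recomputed from the reported precision and recall values with F = 2 × (Precision × Recall) / (Precision + Recall). The function and dictionary names below are illustrative; the numbers are those quoted above.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as reported for each network and method
reported = {
    ("NetH", "ELPA"): (72.5, 61.5),
    ("NetH", "MCL"): (55.1, 65.5),
    ("NetR", "ELPA"): (35.7, 28.6),
    ("NetR", "MCL"): (24.3, 32.1),
}

for (network, method), (p, r) in reported.items():
    print(network, method, round(f_measure(p, r), 1))
```

The recomputed values agree with the reported F-measures to one decimal place. In both networks the F-measure advantage of ELPA comes from its higher precision, since MCL has the higher recall.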

Comparative analysis of protein complexes and functional modules
PPIs can be divided into permanent interactions and transient interactions. Permanent interactions are strong and stable and give rise to protein complexes, while transient interactions vary with cellular processes and form functional modules. Comparative analysis of protein complexes and the corresponding functional modules is therefore of great scientific significance. For example, as Fig. 8a shows

Conclusion
Most research on protein complexes or functional modules is based on experimental methods. For a long time, the lack of large-scale PPI datasets and the existence of too many orphan proteins made it difficult to predict protein complexes or functional modules from network models. In recent years, several high-throughput PPI datasets of E. coli have been released [7,18,19], making it possible to predict protein complexes and functional modules of E. coli on a large scale with computational methods. In this study, we focus on identifying and analyzing protein complexes and functional modules in two large-scale PPI networks of E. coli. The results support cluster analysis of PPI networks as a useful supplement to experimental methods.
Node clustering and edge clustering are two different types of methods that uncover network structure from different perspectives. Edge clustering has a natural advantage over node clustering in network community detection. Using the edge clustering method ELPA, some interesting protein complexes and functional modules are identified. Furthermore, comparative analyses are performed on the protein complexes and functional modules that correspond to the same topological modules, which helps to understand the dynamic relationship between the two. In addition, the results of ELPA are compared with those of the MCL method, and ELPA performs better than MCL in most cases. Our conclusion is that ELPA may be used as an effective cluster analysis tool for different types of biological networks.
The 76th predicted module in NetR matched with EcoCyc benchmark complexes.
The 10th predicted module in NetR matched with benchmark GO classifications.