A Non-negative Matrix Factorization Based Method for Identifying Essential Proteins

Identification of essential proteins is very important for understanding the basic requirements to sustain a living organism. In recent years, various computational methods have been proposed to identify essential proteins based on protein-protein interaction (PPI) networks. However, there is reliable evidence that PPI data contain a large number of false negatives and false positives. Therefore, it is necessary to reduce the influence of false data on the accuracy of essential protein prediction by integrating multi-source biological information with PPI networks. In this paper, we propose a non-negative matrix factorization and multiple biological information based model (NDM) for identifying essential proteins. The first stage is to construct a weighted PPI network by combining protein domain information, protein complex information and the topological characteristics of the original PPI network. Then, the non-negative matrix factorization technique is used to reconstruct an optimized PPI network with enough weighted edges. In the final stage, the ranking score of each protein is computed by the PageRank algorithm, in which the initial scores are calculated from homologous and subcellular localization information. To verify the effectiveness of the NDM method, we compared NDM with other state-of-the-art essential protein prediction methods. The comparison of the results obtained from the different methods indicated that our NDM model performs better in predicting essential proteins. Conclusion: Employing non-negative matrix factorization and integrating multi-source biological data can effectively improve the quality of the PPI network, which in turn improves the performance of essential protein identification. This also provides a new perspective for other prediction tasks based on protein-protein interaction networks.


Background
Essential genes and their products (essential proteins) are necessary for the survival of the organism, and their functions are thought to be the basis of life. The identification of essential proteins can help us understand the basic requirements for sustaining life forms. Besides that, it plays an important role in the emerging field of synthetic biology, which aims to create a cell with the smallest genome [1]. It can also provide important reference information for biology, medicine and other disciplines [2]. At present, a variety of methods have been proposed to identify essential proteins through biological experiments, such as single-knock-out [3] and RNA interference (RNAi) [4].
Nonetheless, these experimental methods have limitations, such as high cost and long duration. Although large-scale experimental techniques for identifying essential proteins have been greatly improved, there is still a large gap between computational methods for detecting essential proteins and genome sequences.
In recent years, many computational methods have been proposed to identify essential proteins. Based on topological features of PPI networks, many centrality models have been proposed to predict essential proteins, including Degree Centrality (DC) [5], Information Centrality (IC) [6], Closeness Centrality (CC) [7], Betweenness Centrality (BC) [8], Subgraph Centrality (SC) [9] and sum of Edge Clustering Coefficient Centrality (NC) [10]. Singh et al. [11] proposed the Graph Fourier Transform Centrality (GFT-C) to quantify the importance of nodes in complex networks. Wang et al. [12] designed a new efficiency centrality (EffC) sorting algorithm, which identifies influential nodes by considering the degree of change in overall network efficiency after deleting each node in turn. Li et al. [13] found that essential proteins appear in triangular structures significantly more often than non-essential proteins and proposed a new pure centrality measure named Neighborhood Closeness Centrality (NCC). However, most of these methods rely only on the topological characteristics of essential proteins, so their performance depends largely on the reliability of the PPI network.
To compensate for the incompleteness of PPI networks, many research groups have combined PPI networks with other biological information to improve the accuracy of essential protein identification. Tew et al. [14] integrated the functional information of proteins and network topology attributes when designing essential protein identification methods. Zhang  proteins. In spite of significant advances in network-based essential protein prediction methods, it remains a challenge to effectively improve PPI network quality through multi-omics data integration and thereby enhance the performance of essential protein identification.
Here, we developed a novel prediction model called NDM for the task of essential protein discovery. NDM takes full account of the protein-protein interactions in the PPI network and other multi-source biological information, such as protein complex, protein domain, homologous and subcellular localization information. Specifically, the matrix factorization technique is employed in NDM to construct a protein interaction network that is more tolerant of false negatives than those used by other methods. On that basis, we adopt the PageRank algorithm to score and rank all proteins. Comparative performance experiments were conducted for NDM and other state-of-the-art methods using yeast data sets. The experimental results indicated that NDM performs better than the other methods and can be effectively applied to the discovery of essential proteins.

Methods
The NDM method consists of three steps. (1) Constructing multiple networks to represent the complex relationships among proteins from topological features of PPI network, protein complex and domain information, and integrating into a reliable weighted network. (2) Reconstructing a comprehensive protein interactome network by using the non-negative matrix factorization technique to discover potential protein interactions from the weighted network. (3) Scoring and ranking proteins through random walk on the above comprehensive network.

Construction of a reliable weighted network based on multi-source biological data
To reduce the negative impact of false positives on prediction performance, we transform the original PPI network into a reliable weighted PPI network (rPPI) by combining the topological features of the original PPI network with multi-source biological data, such as protein complex and protein domain information. Here, three types of associations between nodes are built and denoted as Neighbor_PPI, Domain_PPI and Complex_PPI, respectively. Their construction is described in detail below.
Due to the limitations of high-throughput technology, a significant proportion of experimental PPI data contains errors. Many prediction algorithms [24] explore common neighbors between pairs of proteins in PPI networks to measure the reliability of the interactions between them. Intuitively, the more common neighbors two proteins share, the more likely they are to interact with each other. In this paper, the proteins pi and pj are considered to be connected in the Neighbor_PPI network if they have at least one common neighbor. This kind of connection between proteins is the first type of relationship in NDM, and its reliability can be calculated as follows: where NSi and NSj denote the neighborhood sets of pi and pj, respectively.
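The common-neighbor rule described above can be illustrated with a short Python sketch. It only shows the idea of linking two proteins when their neighborhood sets overlap. The reliability formula itself is not recoverable from the text, so the Jaccard ratio used here is a stand-in assumption, and the function name `neighbor_ppi` and the toy edge list are hypothetical.

```python
from itertools import combinations


def neighbor_ppi(edges):
    """Build a common-neighbor network from an undirected PPI edge list."""
    # Collect the neighborhood set NS_i of every protein.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    weights = {}
    for p_i, p_j in combinations(sorted(neighbors), 2):
        common = neighbors[p_i] & neighbors[p_j]
        if common:  # connected only if at least one common neighbor exists
            union = neighbors[p_i] | neighbors[p_j]
            # ASSUMPTION: Jaccard ratio as a placeholder for the paper's formula.
            weights[(p_i, p_j)] = len(common) / len(union)
    return weights


# Toy network with four hypothetical proteins.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
weights = neighbor_ppi(edges)
for pair in sorted(weights):
    print(pair, round(weights[pair], 3))
```

In this toy network, C and D interact directly but share no neighbor, so they are not linked in the common-neighbor network, while A and D are linked through C.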
Proteins are usually composed of one or more domains that have independent functions. Researchers [15] studied the relationship between essential proteins and their domain composition and found that the number of domain types in a protein is closely related to its importance. In particular, they pointed out that essential proteins tend to contain more types of domains, while non-essential proteins contain fewer. Accordingly, the proteins pi and pj are considered to be connected in the Domain_PPI network if they have the same type of domain.
The weight of this kind of connection between proteins can be calculated as follows: Here, PD(pi) is the domain score of protein pi; the scores of different proteins are independent of each other and can be calculated as follows: In the Complex_PPI formula, CSpi and CSpj are the sets of protein complexes that contain pi and pj, respectively. The numerator represents the set of protein complexes to which both pi and pj belong.
After constructing the above three kinds of protein interaction networks, we integrate them into a weighted PPI network (rPPI) by equation 5.
Figure 1 summarizes the overall transformation process.

Reconstruction of a comprehensive protein interactome network based on NMF
The second stage of our NDM method is to exploit potential associations between proteins from the above reliable weighted network using the non-negative matrix factorization (NMF) technique. As an effective data representation technique, NMF has been widely used in lncRNA-disease association prediction [25], conserved functional module detection [26], etc. For our purpose, we represent the reliable weighted network constructed in the first stage as an adjacency matrix $\mathrm{rPPI} \in \mathbb{R}^{N \times N}$.
In this work, we wish to establish a new matrix $Y \in \mathbb{R}^{N \times N}$ in which enough elements are filled with computed values:

$$Y \approx WH \qquad (6)$$

where $W \in \mathbb{R}^{N \times C}$ represents a low-rank matrix, $C$ denotes the number of selected features and is far smaller than $N$, and $H \in \mathbb{R}^{C \times N}$ is the coefficient matrix. For a given non-negative data matrix rPPI, the problem can be posed as the following optimization problem:

$$\min_{W \ge 0,\; H \ge 0} \; \lVert \mathrm{rPPI} - WH \rVert_F^2 \qquad (7)$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm. Since the objective function in Equation (7) is jointly non-convex in $W$ and $H$, we solve it with multiplicative iteration rules derived by means of auxiliary functions. The squared Frobenius norm can be written as $\lVert A \rVert_F^2 = \mathrm{Tr}(A^T A)$; therefore, Equation (7) is equivalent to:

$$\min_{W \ge 0,\; H \ge 0} \; J = \mathrm{Tr}(\mathrm{rPPI}^T \mathrm{rPPI}) - 2\,\mathrm{Tr}(\mathrm{rPPI}^T W H) + \mathrm{Tr}(H^T W^T W H)$$

Its partial derivatives with respect to the factors $W$ and $H$ are, respectively:

$$\frac{\partial J}{\partial W} = -2\,\mathrm{rPPI}\,H^T + 2\,W H H^T, \qquad \frac{\partial J}{\partial H} = -2\,W^T \mathrm{rPPI} + 2\,W^T W H$$

The stationary point can be found by the Karush-Kuhn-Tucker (KKT) complementarity conditions. The KKT condition for factor $W$ is as follows:

$$\left(-2\,\mathrm{rPPI}\,H^T + 2\,W H H^T\right)_{ik} W_{ik} = 0$$

Solving this condition for $W_{ik}$ gives the first update rule:

$$W_{ik} \leftarrow W_{ik}\,\frac{(\mathrm{rPPI}\,H^T)_{ik}}{(W H H^T)_{ik}}$$

Similarly, the second update rule, for $H$, can be derived from $\left(-2\,W^T \mathrm{rPPI} + 2\,W^T W H\right)_{kj} H_{kj} = 0$:

$$H_{kj} \leftarrow H_{kj}\,\frac{(W^T \mathrm{rPPI})_{kj}}{(W^T W H)_{kj}}$$

Together, these two rules are the multiplicative iteration rules used to update $W$ and $H$. From equations (6)-(13), we can obtain an optimal matrix $Y$ that is closest to rPPI. To restore the symmetry of the protein-protein interactions, the matrix $Y$ finally needs to be transformed into a symmetric transition probability matrix cPPI. After this process, a new network cPPI is set up accordingly.
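The multiplicative iteration scheme described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation. It uses the standard multiplicative updates for the Frobenius-norm objective. The helper names (`matmul`, `transpose`, `nmf`), the toy 4×4 matrix, the rank, the iteration count, the small `eps` guard against division by zero and the averaging used to symmetrize Y are all assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of x - w h."""
    y = matmul(w, h)
    return sum((x[i][j] - y[i][j]) ** 2 for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, rank, iterations=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix x into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)] for k in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)] for i in range(n)]
    return w, h


# Toy weighted network standing in for rPPI (hypothetical values).
r_ppi = [
    [0.0, 0.8, 0.6, 0.0],
    [0.8, 0.0, 0.5, 0.1],
    [0.6, 0.5, 0.0, 0.7],
    [0.0, 0.1, 0.7, 0.0],
]
w, h = nmf(r_ppi, rank=2)
y = matmul(w, h)
# ASSUMPTION: symmetrize by averaging Y with its transpose; the paper's exact
# transformation into cPPI is not given in the text.
c_ppi = [[(y[i][j] + y[j][i]) / 2 for j in range(4)] for i in range(4)]
print(round(frobenius_error(r_ppi, w, h), 4))
```

Entries that are zero in the toy input can receive small positive values in the reconstruction, which is how the factorization suggests potential interactions that the original network lacks.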

Random walk on the comprehensive protein interactome network
The last stage of our NDM method is scoring proteins and generating candidate essential proteins. In this work, we run a random walk with restart (RWR) algorithm on the comprehensive protein interactome network to rank all proteins. As part of the RWR algorithm, we first define the initial vector IS according to the conservation and functional features of essential proteins, which are derived from homologous proteins and subcellular localization information, respectively. Tang et al. [27] analyzed whether each protein in the Saccharomyces cerevisiae PPI network has orthologous proteins in 99 reference species. They showed that the more homologous proteins a protein has in the reference species, the more likely it is to be an essential protein. For a given protein pi in the comprehensive network cPPI, HS(pi) indicates its conservation score and can be calculated as follows [28]: where H(pi) represents the number of reference organisms in which protein pi has orthologous proteins.
Proteins are localized to their appropriate subcellular compartments to perform their biological functions. Researchers [29][30] where C(pi) is the set of subcellular locations corresponding to the protein pi.
Finally, based on equations (15)-(17), the unique initial score of pi, IS(pi), can be defined as follows: where NSi is the set of neighbors of protein pi and the parameter λ (0≤λ<1) is used to balance the iteration information and the initial scores. From the above definition, we can see that the ranking score of a protein can be regarded as a linear combination of its initial score and the neighbor correlation scores of edges in the cPPI network. Equation (19) can be rewritten in matrix-vector format as follows: In order to solve equation (19), the Jacobi iterative procedure is used in this work as follows: where RS t is the ranking score vector obtained in the t-th iteration.
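The iterative scoring step can be sketched as follows. Because the ranking equations are not reproduced in the text, the update used here, RS = (1 - λ)·IS + λ·cPPI·RS, is an assumed form that matches the description of a linear combination of initial scores and neighbor scores. The function name `rank_scores`, the tolerance and the toy inputs are hypothetical.

```python
def rank_scores(c_ppi, initial, lam=0.3, tol=1e-8, max_iter=1000):
    """Jacobi-style fixed-point iteration for RS = (1 - lam) * IS + lam * cPPI * RS."""
    n = len(initial)
    scores = list(initial)
    for _ in range(max_iter):
        updated = [
            (1 - lam) * initial[i] + lam * sum(c_ppi[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        # Stop once no score changes by more than the tolerance.
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


# Toy row-normalized network and initial scores (hypothetical values).
c_ppi = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
initial = [1.0, 0.2, 0.2]
scores = rank_scores(c_ppi, initial, lam=0.3)
print([round(s, 4) for s in scores])
```

With a row-normalized matrix and λ below 1, each iteration shrinks the distance to the fixed point by at least a factor of λ, so the loop converges quickly. Setting λ to 0 returns the initial scores unchanged.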
Based on the above description, the overall framework of the NMF-based model for identifying essential proteins (NDM) is described in Algorithm 1.

Results and Discussion
The study uses qualitative analysis in order to assess the performance of NDM in predicting essential proteins. Here, eleven other state-of-the-art methods are used for comparison, such as DC [5], IC [6], CC [7], BC [8], SC [9],

Experimental data
The protein-protein interaction data are mainly concentrated in yeast, because this species has been well characterized by knockout experiments and its data are the most complete and convincing. In our experiment, the data related to essential protein identification mainly include the benchmark essential protein dataset, PPI data and multi-source biological data. The benchmark essential proteins collected by experimental methods are mainly derived from four databases: MIPS [31], SGD [32], DEG [33], and SGDP [34], including 1,285 essential proteins. In this work, DIP data [35] is used to assess the effectiveness of our proposed method; it contains 5,093 proteins and 24,743 interactions, including 1,167 essential proteins. The commonly used protein complex data are CYC2008 [36] and MIPS [31], including 408 and 428 complexes detected by biological experiments, respectively. The protein domain information is derived from the PFAM 25.0 database [37] and contains 2,671 different types of domains. Subcellular location data is obtained from the COMPARTMENTS [38] database. To avoid specificity of data, 11 categories of subcellular localization are retained, including Endoplasmic, Nucleus, Cytoskeleton, Golgi, Cytosol, Vacuole, Plasma, Mitochondrion, Endosome, Peroxisome and Extracellular [39]. Homologous protein information is obtained from the 7th edition of the InParanoid database [40], which contains pairwise comparisons of 100 whole genomes (99 eukaryotes and 1 prokaryote).

Parameter sensitivity analysis
For computational algorithms that predict essential proteins, the optimal parameter selection may differ from one experimental scenario to another. In this section, we focus on the parameter λ of NDM, which is used to balance the iteration information and initial scores in the final ranking score as described in equations (19)-(21). We set λ to 0, 0.1, 0.2, ..., 0.9 and 1, and select the top 100, 200, 300, 400, 500 and 600 ranked proteins as candidates. Table 1 illustrates the impact of the parameter λ on the performance of NDM. The accuracy of the prediction is measured as the percentage of true essential proteins among the candidate proteins. As can be seen from the table, the best prediction results are obtained when λ is set to 0.3 or 0.4. In particular, for the top 100 and top 200, the best prediction accuracy (93% and 88%, respectively) is achieved when λ is 0.3. Therefore, the optimum λ value in this work is 0.3.
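The accuracy measure used in this analysis, the fraction of true essential proteins among the top-ranked candidates, can be written as a short Python sketch. The function name `top_k_accuracy`, the protein identifiers and the scores below are hypothetical and only illustrate the calculation.

```python
def top_k_accuracy(scores, essential, k):
    """Fraction of the k highest-scoring proteins that are in the essential set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for protein in ranked[:k] if protein in essential)
    return hits / k


# Toy ranking scores and benchmark set (hypothetical values).
toy_scores = {"P1": 0.9, "P2": 0.7, "P3": 0.4, "P4": 0.2}
toy_essential = {"P1", "P3"}
for k in (1, 2, 3):
    print(k, round(top_k_accuracy(toy_scores, toy_essential, k), 3))
```

Repeating this calculation for each candidate value of λ and each cutoff would produce a table of the same shape as Table 1.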

Comprehensive comparison with other methods
In this section, comprehensive comparisons of NDM with other methods are carried out to demonstrate the effectiveness of our proposed prediction method. After all protein scores are calculated by each method, different numbers of top-ranked proteins are selected as candidate essential proteins. As shown in Figure 2, the top 100, 200, 300, 400, 500 and 600 proteins are taken from the ranked results of the twelve methods as candidates, and each candidate is labeled as an essential protein or not.
Figure 2 shows that our proposed method identifies essential proteins with significantly higher accuracy than the other eleven methods.
When the top 100 to top 600 proteins are selected as candidates, NDM improves on NC by 69.1%, 39.7%, 33.5%, 30.9%, 26.5% and 28.4%, respectively, where NC obtains the best results among the classical network topology-based centrality methods (DC, IC, BC, CC, SC and NC). Compared with the other multi-source based prediction methods (CoEWC, PeC, POEM, ION and FDP), NDM still achieves an across-the-board improvement. Most notably, when the top 100 to top 400 candidates are selected, the accuracy of our method is 4.5%, 7.3%, 8.9% and 7.5% higher, respectively, than that of FDP, which has the best performance among the other multi-source based prediction methods. Moreover, NDM has nearly 70.9% and 66.3% detection accuracy, which is 6.3% and 4.3% higher than ION, which has the best results among the other competitive methods, as the number of top essential candidates is set to 500 to

Validated by precision-recall curves
Furthermore, we plot the precision-recall (PR) curve at different cutoffs to assess the performance of each method. After the scores of all proteins are calculated with each method, the proteins are sorted in descending order, and the top k proteins are selected as the essential candidates (positive set) while the others are treated as the non-essential candidates (negative set). Here, the value of k ranges from 1 to 5093 (the total number of proteins). The recall and precision of each method are computed and reported at each cutoff. The PR curves of NDM and the eleven other methods are shown in Figure 3. Figure 3(b) shows the results for NDM and five multi-source based methods, including PeC, CoEWC, POEM, ION and FDP. As can be seen from the figure, NDM performs significantly better than the other eleven methods. In particular, the identification rate of NDM is 100% for the first 39 essential candidates. This is a remarkable result that none of the other competitive methods accomplishes.
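The cutoff procedure described above can be sketched in Python: for each k, the top k proteins form the positive set, and precision and recall are computed against the benchmark. The function name `precision_recall_points` and the toy ranking are hypothetical, and the sketch assumes the ranked list is already sorted by descending score.

```python
def precision_recall_points(ranked, essential):
    """Return (recall, precision) for every cutoff k = 1 .. len(ranked)."""
    points = []
    hits = 0
    for k, protein in enumerate(ranked, start=1):
        if protein in essential:
            hits += 1
        recall = hits / len(essential)  # share of all essential proteins found
        precision = hits / k            # share of the top k that are essential
        points.append((recall, precision))
    return points


# Toy ranked list, strongest prediction first (hypothetical values).
toy_ranked = ["P1", "P2", "P3", "P4"]
toy_benchmark = {"P1", "P3"}
pr_points = precision_recall_points(toy_ranked, toy_benchmark)
for recall, precision in pr_points:
    print(round(recall, 3), round(precision, 3))
```

A method whose first candidates are all essential, as reported for the first 39 candidates of NDM, keeps precision at 1.0 over that initial stretch of the curve.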

Validated by jackknife methodology
In this subsection, the jackknife methodology is used to compare our NDM method with other state-of-the-art methods (DC, BC, CC, SC, curve can be computed to quantify the overall performance. Figure 4 plots the jackknife curves of the competitive methods, in which the horizontal axis denotes the number of top essential candidates ranked in descending order by each method, and the vertical axis represents the number of essential proteins identified. To make the result clear, we separate the comparison into three subgraphs and include ten random assortments for comparison. Figure 4(a) illustrates the jackknife curves of NDM, DC, IC, SC and ten random assortments. It is evident that NDM almost always has the highest value for the same number of essential candidates. The curves of NDM, BC, CC, NC and ten random assortments are shown in Figure 4(b), which shows that NDM achieves a higher precision than all other competitive methods for any given number of essential candidates. Figure 4(c) shows the comparison of NDM with the other multi-source based methods (PeC, CoEWC, POEM, ION and FDP). As can be seen from Figure 4(c), NDM has lower performance than ION beyond roughly the top 900 essential candidates of the predicted list, but still achieves higher values than the other methods. However, NDM obtains a higher precision than the other multi-source based methods for the top portion of the predicted list. This portion of the predicted results is important because it contains the potential essential proteins predicted with high confidence.
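A jackknife curve of the kind plotted in Figure 4 is a running count of essential proteins along the ranked list, and the area under it summarizes overall performance. The sketch below illustrates that calculation. The function names, the toy ranking and the use of the trapezoidal rule with unit spacing for the area are assumptions made for the example.

```python
from itertools import accumulate


def jackknife_curve(ranked, essential):
    """Cumulative number of essential proteins among the top 1, 2, ..., n candidates."""
    return list(accumulate(1 if protein in essential else 0 for protein in ranked))


def area_under_curve(curve):
    """Trapezoidal area under the curve, starting from the origin with unit steps."""
    points = [0] + curve
    return sum((a + b) / 2 for a, b in zip(points, points[1:]))


# Toy ranked list, strongest prediction first (hypothetical values).
jk_ranked = ["P1", "P2", "P3", "P4"]
jk_essential = {"P1", "P3"}
curve = jackknife_curve(jk_ranked, jk_essential)
print(curve, area_under_curve(curve))
```

A ranking that places essential proteins earlier makes the curve rise sooner and therefore yields a larger area, which is why the area can be used to compare methods.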

Conclusions
As mentioned in the literature review, many computational methods have been developed to predict essential proteins based on PPI networks or multi-source biological data, and they achieve good performance. However, these methods do not take full advantage of the relationships between multiple sources of data. This work set out to develop a better-performing model, named NDM, which integrates the PPI network with protein complex, protein domain, subcellular localization and homologous protein information. To get the most out of the multi-source data, non-negative matrix factorization is introduced into our proposed method. A comprehensive experiment was carried out, and its results show that our new method performs better than six topology-based centrality methods (DC, BC, CC, SC, IC and NC) and five multi-source based methods (PeC, CoEWC, POEM, ION and FDP). A possible explanation for these results is that there are deep relationships between multiple sources of data. These results add to the rapidly expanding field of computational methods based on multi-source biological information. A limitation of this study is that it did not take other biological data into account. This is an important issue for future research.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent to publish
Not applicable.

Availability of data and materials
Publicly available datasets were analyzed in this study. These data and the NDM program can be found here: https://github.com/husaiccsu/NDM.

Figure Legends
Figure 2. The top 100, 200, 300, 400, 500 and 600 of the ranked proteins are selected as candidates for essential proteins. According to the list of benchmark essential proteins, the number of true essential proteins is used to judge the performance of each method. The figure shows the number of true essential proteins identified by each method.
Figure 4. Jackknife curves for NDM and the eleven other approaches. The x-axis represents proteins ranked by NDM and the eleven other methods, ordered from left to right from strongest to weakest prediction of essentiality. The y-axis is the cumulative count of essential proteins encountered moving left to right through the ranked list. The areas under the curve for NDM and the eleven other methods are used to compare their prediction performance. In addition, the 10 random assortments are also plotted for comparison.