Better Network Modeling For Link Prediction In Protein-Protein Interaction Networks

Background: Protein-protein interaction (PPI) data is an important type of data used in functional genomics. However, inaccuracies in high-throughput experiments often result in incomplete PPI data. Computational techniques are thus used to infer missing data and to evaluate confidence scores, with link prediction being one such approach that uses the structure of the network of PPIs known so far to find good candidates for missing PPIs. Recently, a new idea called the L3 principle introduced biological motivation into PPI link predictions, yielding predictors that are superior to general-purpose link predictors for complex networks. However, the previously developed L3 principle-based link predictors are only an approximate implementation of the L3 principle. As such, not only is the full potential of the L3 principle not realized, but these predictors may even penalize candidate PPIs that otherwise fit the L3 principle.
Result: In this article, we propose a formulation of link predictors without approximation that we call ExactL3 (L3E) by addressing missing elements within L3 predictors from the perspective of network modeling. Through statistical and biological metrics, we show that in general, L3E predictors perform better than the previously proposed methods on seven datasets across two organisms (human and yeast) using a reasonable amount of computation time. In addition to L3E being able to rank the PPIs more accurately, we also found that L3-based predictors, including L3E, predicted a different pool of real PPIs than the general-purpose link predictors. This suggests that different types of PPIs can be predicted based on different topological assumptions and that even better PPI link predictors may be obtained in the future by improved network modeling.


Introduction
In the post-genomic era, high-throughput techniques have been developed to retrieve and analyze high-level and dynamic cellular activities. An important example is the development of techniques that enable large-scale characterization of protein interactions [1]. This has led to a new type of interactome for systems biology, the Protein-Protein Interaction (PPI) network [2]. A PPI network is a form of complex network where a node represents a protein, and an edge indicates that two proteins can interact with each other. Since PPIs describe signal transduction of protein physical docking [3], large-scale studies can provide insights into the molecular machinery of living systems [4]. On a basic level, researchers can abstract biological components such as signaling pathways as a chain of PPIs [5], or protein complexes as graph clusters [6] for network analysis. In larger-scale studies, a PPI network can even be used as a building block that associates with other biological networks for better prioritization of candidate disease proteins or improved drug repurposing [7,8].
The basis of meaningful and comprehensive discoveries is a complete and reliable PPI network. However, measurement errors or incomplete experimental data may lead to some parts of the constructed PPI network having the wrong structure. For this reason, computational tools have been developed to evaluate the accuracy of the proposed edges in an existing PPI network or to find good candidates for new edges that should be added in order to make the resulting network more biologically sound. The most direct approaches use protein sequence data [9,10], since protein sequences compare proteins' functions genetically. Some of the other approaches include the use of protein structures, RNA co-expression, and protein annotations [11,12]. Undoubtedly, the success of these methods stems from utilizing features to describe proteins, subsequently characterizing PPIs.
On the other hand, general-purpose link prediction techniques have been developed for complex networks such as computer networks, recommender systems, and social networks [13]. These link predictors can also be applied to PPI data, but they are usually not specific enough to characterize PPIs well, and there are no guarantees on their correctness and reliability. Due to this concern, Kovács et al. [14] introduced a novel link predictor based on a biological motivation that they called the L3 principle. This principle hypothesizes that two proteins linked by many different paths of length three have a higher likelihood of also interacting directly with each other. Using the L3 principle, the L3 link predictor infers new PPIs by scoring the structure of candidate PPIs, and keeping the candidates with the highest scores. The study also argued that for PPI networks, being linked by many paths of length two has the opposite effect, and showed experimentally that the L3 link predictor outperforms a vast number of general link predictors, including the famous Common Neighbor [15] that favors paths of length two. Since then, studies have already successfully improved existing network biology techniques by incorporating the L3 principle, including drugs-disease network analysis [16] and protein fold recognition [17].
Despite the strength of the L3 principle, some researchers claim that our understanding of the L3 link predictor is limited and that it was derived empirically rather than from any theoretical knowledge [18]. In fact, one can regard the L3 link predictor as an approximation in the sense that it penalizes the score of a neighborhood if some of its properties imply that it is a coincidence. This generally happens to any link predictor and each one has a different way of addressing the issue. However, the penalization in the L3 link predictor is applied even to PPIs that should be rewarded for such properties. So, a better approach would be to evaluate its fitness to the L3 principle by characterizing neighborhoods of PPIs more precisely, namely to reward desirable graph structures such as paths of length three, and penalize undesirable graph structures such as paths of length two. In this article, we define the link predictor in a way that more accurately corresponds to the biological motivation of the L3 principle. Our approach is coined ExactL3 (L3E ). We show experimentally that L3E is better at inferring unknown PPIs than the previous methods, which gives further evidence that the network structure of PPIs can be accurately reconstructed from partial data.
We would like to remark here that the preliminary conference version of this article [19] contains an error in the presented formula for P (L3E) xy (Formula (4) in [19]) and that the experimental results were obtained using a slightly different (and correct) version of the formula that was implemented as intended in the program code that was provided. In this article, we have corrected the error and also generalized the formula to further improve the performance of our link predictor; see Formulas (6) and (7) in Section 4.2 below.
The article is organized as follows. Section 2 reviews some known general and PPI-specific link prediction techniques. Then, we provide the problem definition and the formulation of L3E in Section 3 and Section 4, respectively. Using the materials described in Section 5, we evaluate how well these link predictors perform in synthetic datasets of certain structures in Section 6. In Section 7, we then evaluate the predictive power and biological significance of L3E using statistical and biological metrics. Finally, in Section 8, we discuss the underlying differences between L3E and other link predictors and give some ideas for potential future improvements.

Previous Work
Link prediction infers new edges based on the properties of the nodes as well as the overall topology of the existing edges [13]. Many classes of link prediction approaches exist, and this article will focus on similarity-based link predictions, where candidate edges are selected based on the similarity of nodes' immediate or extended neighborhoods. Some link predictors of this type are reviewed next. From here on, for any node a, let N(a) denote the set of neighbor nodes of a, and for any set A of nodes, let $N(A) = \bigcup_{a \in A} N(a)$.

General Link Prediction
The Common Neighbors (CN) concept originates from social networks [15]. It models a social phenomenon: the more friends two individuals share, the more likely they are to also be friends of each other. Then, the CN score of any two nodes a and b is |N (a) ∩ N (b)|. The assumption here is that the higher the CN score, the more confident we can be that the two nodes should be adjacent. In the context of PPIs, a high CN score of two proteins implies that they have similar functions [20]. That is, if two proteins interact with a similar set of proteins then their functions should be similar.
However, a high-degree node will contribute to the CN scores of many more node pairs than a low-degree node will. Consequently, to reduce the influence that a single node may have, it is a good idea to penalize high-degree nodes in the CN index. To do so, the Resource Allocation (RA) algorithm [21] makes high-degree nodes contribute less by using the following formula instead for every pair of nodes a and b: $\sum_{z \in N(a) \cap N(b)} \frac{1}{|N(z)|}$. In addition to RA, there exist many other normalization schemes. In the Adamic-Adar (AA) Index [22], a logarithmic modifier (motivated in the context of social network mining) is used to do the normalization: $\sum_{z \in N(a) \cap N(b)} \frac{1}{\log |N(z)|}$. For a survey of the normalization schemes used in many other general link predictors, see [13].
Fig. 1: [...] conditions that would lead CN and the L3 principle, respectively, to predict that an edge between the two non-adjacent nodes x and y is in fact missing. (c) A graphical representation of the occurrence of a physical PPI between protein x and protein y. (d) Using the abstraction in (c), if the PPIs are arranged as shown on the left, we can infer the existence of a PPI between protein y and protein x as shown on the right.
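The three general-purpose scores reviewed above can be illustrated with a minimal Python sketch. The helper names (`neighbors`, `cn_score`, `ra_score`, `aa_score`) and the toy edge list are assumptions made for this illustration only; the natural logarithm is assumed for AA.

```python
import math

def neighbors(edges):
    """Build a dict mapping each node to the set of its neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def cn_score(adj, a, b):
    """Common Neighbors: |N(a) ∩ N(b)|."""
    return len(adj[a] & adj[b])

def ra_score(adj, a, b):
    """Resource Allocation: each common neighbor z contributes 1/|N(z)|."""
    return sum(1.0 / len(adj[z]) for z in adj[a] & adj[b])

def aa_score(adj, a, b):
    """Adamic-Adar: each common neighbor z contributes 1/log|N(z)|.
    A common neighbor of two distinct nodes has degree >= 2, so log > 0."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[a] & adj[b])

# Toy graph (hypothetical): z1 has degree 2, z2 has degree 4.
edges = [("a", "z1"), ("b", "z1"), ("a", "z2"), ("b", "z2"),
         ("z2", "c"), ("z2", "d")]
adj = neighbors(edges)
cn, ra, aa = cn_score(adj, "a", "b"), ra_score(adj, "a", "b"), aa_score(adj, "a", "b")
```

In this toy graph both common neighbors count equally for CN, whereas RA and AA give the high-degree node z2 a smaller weight than z1.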

PPI-Specific Link Prediction
Link predictors can also consider parts of the network beyond the immediate neighborhoods of nodes. For example, in the context of PPI networks, [23] applies random walks to identify and connect pairs of nodes with similar distances to the other nodes in the network. This can be classified as a global approach in similarity-based link predictions.
In another study, Nakajima et al. [24] used protein complex datasets on top of PPI datasets to investigate how many PPIs might be missing from those PPI datasets. Assuming that each protein complex must induce a connected subgraph in the corresponding PPI network, the minimum number of edges that have to be added to ensure that this condition holds in the network thus gave lower bounds on the number of missing PPIs in various databases. This also shows how PPI datasets can be augmented with external feature data, utilizing the biological context.
Finally, in the study of our focus [14], Kovács et al. presented the so-called L3 algorithm, which is biologically motivated by the following observation: Since a physical PPI is the physical docking of two proteins, it can only occur if the interfaces of the two proteins are compatible. Now, if nodes x and y in a PPI network share many neighbors, it can be expected that the interface of x is similar to the interface of y. Two proteins with identical or nearly identical interfaces are usually not compatible (they cannot dock with each other), which means that the PPI network will not have an edge between x and y in this case. See Fig. 1(a) for an illustration. On the other hand, if there are many paths of length 3 between x and y in the network then x and y are likely to be compatible, as shown in Fig. 1(b). Following standard graph theory notation, $P_3$ will denote an undirected length-2 path consisting of three nodes and two edges, and $P_4$ will denote an undirected length-3 path consisting of four nodes and three edges. Using this notation, the observation above can be stated as: the more $P_4$-subgraphs and the fewer $P_3$-subgraphs that connect a pair of nodes x and y, the more certain it is that x and y should be connected by an edge. From here on, we shall refer to this principle as the L3 principle.
After the L3 principle was proposed [14], other researchers have also taken inspiration from it to formulate better link predictors for PPI networks. This includes CH2 L3 (abbreviated as CH2 below) [18], a link predictor that extends the general link predictor CRA [25], as well as the Sim [26] link predictor. Both of these are similarity-based link predictions that use information from immediate neighborhoods, just like our method L3E. For this reason, they are also included in the experimental comparison below. The mechanisms of CH2 and Sim are described in more detail in Section 4.1 and Section 5.2.

Problem Definitions
Given an undirected graph G = (V, E), the task is to determine, for each pair of non-adjacent nodes in V, whether or not an edge between them should be added to E. Every non-adjacent node pair {x, y} will be assigned a score $P_{xy}$ that measures, in a relative sense, the confidence with which one can say that x and y should be connected by an edge. As explained in Section 2.2, one can compute $P_{xy}$ based on the L3 principle simply by counting the number of $P_4$-subgraphs between x and y. For this purpose, define U = N(x) ∩ N(N(y)) and V = N(y) ∩ N(N(x)), i.e., let U be the set of neighbors of x at distance 2 from y and analogously for V. Then, every $P_4$-subgraph between x and y is an undirected simple path of the form (x, u, v, y), where u ∈ U and v ∈ V. Note that a node may belong to N(x) as well as N(y) and also to both U and V, in which case it will be able to take the role of either u or v in a $P_4$-subgraph. With these definitions, one can count the number of $P_4$-subgraphs between x and y using Formula (1). This kind of double summation will be abbreviated as in Formula (2) to simplify the notation from now on.
However, similar to what was mentioned in Section 2.1, in this formula, high-degree nodes in the sets U and V will contribute to many more $P_4$-subgraphs than low-degree nodes, giving them a disproportionate influence on the value of $P_{xy}$. Hence, Formula (2) should be adjusted to penalize high-degree nodes. The L3 link predictor [14] does this by using a square root modifier according to Formula (3) below.
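In the notation of this section, the unnormalized $P_4$ count and the square-root-normalized L3 score of [14] can be sketched as follows. The indicator $a_{uv}$ and the superscript labels are introduced only for this illustration, so the typesetting of the numbered formulas may differ.

```latex
% a_{uv} = 1 if \{u, v\} \in E and 0 otherwise (assumed indicator notation)
P_{xy}^{\mathrm{count}} = \sum_{u \in U} \sum_{v \in V} a_{uv}
\qquad\qquad
P_{xy}^{\mathrm{L3}} = \sum_{u \in U} \sum_{v \in V} \frac{a_{uv}}{\sqrt{|N(u)| \cdot |N(v)|}}
```

For instance, if U and V each contain two nodes, every u is adjacent to every v, and each of these four nodes has degree 3, then the count is 4 and the normalized score is 4/3.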

Our Contributions
We observe that the normalization modifier in Formula (3) does not completely implement the L3 principle. More precisely, Formula (3) only uses the set U, the set V, and the node degrees to evaluate an xy-node pair. It does not take $P_3$-subgraphs into account and may penalize xy-links that are highly likely despite having high-degree nodes u and v. ExactL3 addresses these problems by employing an alternative approach to normalization. Before presenting the formulation, we first give more intuition behind the L3 principle introduced in Section 2.2.
Recall that in the L3 principle, the interface compatibility of node x and node y can be evaluated using the number of $P_4$-subgraphs, where each $P_4$ can be represented as (x, u, v, y). This can be condensed into one central idea, that the size of N(y) reflects the compatibility of the binding interfaces of x and y (the same idea applies analogously to N(x)). In the original L3 predictor, counting $P_4$'s enables the evaluation because in a $P_4$, node u represents the interface of y, and u ∈ N(x); thus, node u provides evidence that x is compatible with y (and analogously for v). In contrast, L3E performs link predictions more accurately by directly evaluating the ratio of compatible and incompatible nodes in the neighborhoods. For example, one should penalize the score $P_{xy}$ if either x or y has neighbors that cannot form a $P_4$, and reward it otherwise. In the next section, we formulate the ExactL3 link predictor.

Detailed Formulation of ExactL3
To describe the properties that characterize the L3 principle, we define an ideal L3 graph $G_{L3}$ as a graph that can be obtained by taking a complete bipartite graph with two parts U and V, and attaching a new node x as a neighbor of all nodes in U and attaching a new node y as a neighbor of all nodes in V. This results in a graph with the four basic L3-elements: node x, node y, set U, and set V, which are the fundamental components of an ideal L3 graph. Fig. 2(a) illustrates an example ideal L3 graph. Its nodes have been colored white and black in such a way that no pair of nodes with the same color are adjacent and every pair of nodes with different colors (except x and y) are adjacent.
To model real PPI networks, we need to consider non-ideal L3 graphs that can deviate from ideal L3 graphs in the following ways:
• An edge between x and U is missing, or an edge between y and V is missing.
• An edge between U and V is missing.
• An edge between two nodes in U , or between two nodes in V , exists.
• An edge between x and V exists, or an edge between y and U exists.
Recall that we defined U = N(x) ∩ N(N(y)) and V = N(y) ∩ N(N(x)) in Section 3. These definitions induce, for any specified pair of nodes x and y, the L3-elements of an L3 graph whose fitness to the L3 principle can be evaluated by measuring how well the following conditions are met:
I. N(x) = U and N(y) = V (see Fig. 2) [...]
As an example, consider a non-ideal L3 graph obtained by inserting a single edge of the form $\{u_i, u_j\}$ into an ideal L3 graph, which violates condition III. However, this graph is still quite close to being an ideal L3 graph. To quantify how well conditions I, II, III are met, we introduce two similarity metrics in the next subsection.

Similarity Metrics
Similarity metrics are formulas that score the similarity of two sets with appropriate penalization so that the size of the two sets has a minimum effect on the score. In the case of PPI networks, the sets would be node subsets such as the neighborhood of a node. Such metrics allow us to formalize the relationships in Figure 2(b) as mentioned above. In the following sections, we review two well-studied similarity metrics that will be included in our improved link predictor. (See the summary in Table 2 in Section 5.2 for their precise formulas.)

Simple Ratio
Given two sets A and B, one of the simplest possible metrics is the Simple Ratio in Formula (4), which measures the size of the intersection relative to the size of one of the sets.
To give an example, the CRA link predictor [25] utilizes this to extend the CN principle for general link prediction (including PPI networks). CRA computes the link prediction score of nodes x and y by first extracting the common neighbors, A = N(x) ∩ N(y). Then, each node a ∈ A is evaluated according to $f_1(N(a), A)$. The sum of these scores, which is $\sum_{a \in A} f_1(N(a), A)$, will then be the link prediction score for nodes x and y. It is defined in this way because CRA is only interested in whether N(a) is a subset of A, regardless of the size of set A.

Jaccard coefficient
Formula (5) is the Jaccard coefficient [27] for set A and set B. Note that it uses a different denominator than the one in Section 4.1.1.
This evaluation assumes that both sets are equally important and the maximum possible score can only be obtained when A = B. (In comparison, in Formula (4) in Section 4.1.1, the best score can be obtained even if A ⊂ B or B ⊂ A.) This idea is utilized in the Sim link predictor [26]. To be precise, Sim independently scores the similarity of node x and nodes v using $f_2(N(x), N(v))$, and node y and nodes u using $f_2(N(y), N(u))$. The sum of these scores then becomes the link prediction score for the corresponding nodes x and y.
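The two similarity metrics and the CRA-style use of the first one can be sketched in Python as follows. This is only an illustration under stated assumptions: the Simple Ratio is taken to be $f_1(A, B) = |A \cap B| / |A|$ (the intersection relative to the first set, so that the score is 1 exactly when A is a subset of B), the Jaccard coefficient is the standard $f_2(A, B) = |A \cap B| / |A \cup B|$, and all function names and the toy graph are hypothetical.

```python
def simple_ratio(A, B):
    """Assumed form of f1: |A ∩ B| / |A| (0 for an empty A)."""
    return len(A & B) / len(A) if A else 0.0

def jaccard(A, B):
    """f2: |A ∩ B| / |A ∪ B| (0 if both sets are empty)."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def cra_score(adj, x, y):
    """CRA as described in the text: sum of f1(N(a), A) over the
    common neighbors A = N(x) ∩ N(y)."""
    A = adj[x] & adj[y]
    return sum(simple_ratio(adj[a], A) for a in A)

# Toy graph (hypothetical): x and y share a1 and a2, which are adjacent.
adj = {
    "x": {"a1", "a2"},
    "y": {"a1", "a2"},
    "a1": {"x", "y", "a2"},
    "a2": {"x", "y", "a1"},
}
A, B = {1, 2}, {1, 2, 3}
f1_ab, f1_ba, f2_ab = simple_ratio(A, B), simple_ratio(B, A), jaccard(A, B)
cra = cra_score(adj, "x", "y")
```

Under this assumed form, $f_1$ is not symmetric in its arguments, whereas $f_2$ is, which mirrors the remark that the Jaccard coefficient treats both sets as equally important.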

ExactL3 Formulations
Using any similarity metric f, we can quantify how close a non-ideal L3 graph is to being ideal by accounting for conditions I, II, and III described at the beginning of this section as follows: [...] where the notation $N_{\neg b}(a)$ is a shorthand for $N(a) \setminus \{b\}$. Then, we complete the formulation by combining them as in Formula (6). We formulate the link prediction score as a sum taken over all pairs of nodes (u, v) for u ∈ U and v ∈ V since each $P_4$ that increases the likelihood of the edge between x and y corresponds to one such (u, v). Note that f(N(x), U) and f(N(y), V) can be evaluated outside of the inner sum since they do not depend on both u and v at the same time. Fig. 3 gives a graphical explanation of Formula (6) using the similarity metric $f_1$ from Section 4.1.1. From now on, the link predictor obtained by letting $f = f_1$ in Formula (6) will be denoted by L3E($f_1$); similarly, plugging in $f_2$ from Section 4.1.2 into Formula (6) gives a link predictor that we will refer to as L3E($f_2$). To illustrate the L3E formulation with an example, consider the non-ideal L3 graph mentioned previously in this section that was obtained by inserting a single edge of the form $\{u_i, u_j\}$ into an ideal L3 graph. For this graph, $u_j \in N(u_i)$ although $u_j \notin V$, which means that $N_{\neg x}(u_i)$ and V are not completely identical and the third term in Formula (6) will be slightly smaller than its maximum possible value. Moreover, the fact that $N_{\neg x}(u_j) \neq V$ will also contribute to the third term not being maximized, and $N(y) \neq N_{\neg x}(u_i)$ and $N(y) \neq N_{\neg x}(u_j)$ will prevent the sixth term from being maximized.
Formula (6) uses neighborhoods with the node x or y excluded (e.g., $N_{\neg x}(u)$). For normalization purposes, it may in fact be advantageous to include x or y in the neighborhoods. Similar modifications appear explicitly in CH2 predictors [18], where an offset of one is appended as compensation, and implicitly in L3 [14] and CRA [25]. To see why the normalization might be useful, suppose that we are evaluating (x, y) and that N( [...] because the larger size of $V_j$ provides stronger evidence that $u_j$ and $V_j$ are compatible. Here, if we use neighborhoods that include x then we would get $f(N(u_i), V_i) < f(N(u_j), V_j)$, which might be preferable. Formula (7) below introduces an alternative L3E formulation based on this observation, which we shall refer to as L3E' in the experiments in later sections.
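A six-term score of the kind described above can be sketched in Python as follows. This is an illustrative sketch and not necessarily identical to Formula (6): the factors f(N(x), U), f(N(y), V), the comparison of $N_{\neg x}(u)$ with V, and the comparison of N(y) with $N_{\neg x}(u)$ follow the text, while the two symmetric factors for v, the argument order inside f, and the restriction of the sum to adjacent pairs (u, v) are assumptions. The Jaccard coefficient is used as f because it is symmetric in its arguments; all names are hypothetical.

```python
def jaccard(A, B):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (0 if both sets are empty)."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def l3e_like_score(adj, x, y, f=jaccard):
    """Sketch of a six-term L3E-style score (assumed form, see lead-in)."""
    Nx, Ny = adj[x], adj[y]
    U = Nx & set().union(*(adj[v] for v in Ny))  # U = N(x) ∩ N(N(y))
    V = Ny & set().union(*(adj[u] for u in Nx))  # V = N(y) ∩ N(N(x))
    total = 0.0
    for u in U:
        Nu = adj[u] - {x}                        # neighborhood of u without x
        for v in V:
            if v not in adj[u]:                  # only pairs forming a path x-u-v-y
                continue
            Nv = adj[v] - {y}                    # neighborhood of v without y
            total += f(Nu, V) * f(Nv, U) * f(Nx, Nv) * f(Ny, Nu)
    return f(Nx, U) * f(Ny, V) * total

def build(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

ideal_edges = [("x", "u1"), ("x", "u2"), ("y", "v1"), ("y", "v2"),
               ("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2")]
ideal = l3e_like_score(build(ideal_edges), "x", "y")
perturbed = l3e_like_score(build(ideal_edges + [("u1", "u2")]), "x", "y")
```

On the small ideal L3 graph every factor equals 1, so the score equals the number of (u, v) pairs; inserting the edge {u1, u2} lowers the two factors that involve $N_{\neg x}(u)$, which matches the qualitative behavior described in the example above.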

Time Complexity
Here, the computational complexity of L3 (evaluating Formula (3)) and L3E' (evaluating Formula (7)) will be analyzed. The analysis of the latter also applies to L3E. Let n denote the number of nodes in G. In comparison, the CN link predictor is known to run in $O(n^3)$ time [28].
The main operations in both L3 and L3E' are the set operations on node neighborhoods. We first discuss the two set operations included in our formulation, the

Materials
In this section, we will give a brief overview of the PPI datasets that were used in our experiments and the other link predictors that were compared to L3E predictors.

Datasets
Our experiments used seven real PPI datasets from two organisms, the well-known model yeast (Saccharomyces cerevisiae, strain S288C) and human (Homo sapiens).
We included multiple datasets for the same organism because the methodology used to obtain them and their confidence thresholds often differ [29]. Six of the datasets were integrated PPI datasets from three different literature sources (BioGRID [30], STRING [31], and MINT [32]) where both the yeast and the human variations were considered for each one. The seventh dataset that we used was HuRI [33], a more recent human PPI dataset obtained from a single experimental source. We used the datasets' annotations to extract physical PPIs (binary PPIs) only as follows: 'physical' for BioGRID; 'binding' for STRING; 'direct interaction', 'physical association', and 'association' for MINT. (All PPIs in HuRI are physical PPIs.) Next, every directional PPI was converted into a non-directional PPI, and all duplicate PPIs (due to multiple evidence in the literature) as well as all self-interactions were excluded. The number of nodes, PPIs, and candidate PPIs for each of the seven processed datasets is listed in Table 1.

Link Predictors
L3E predictors, using each of the two similarity metrics $f_1$ and $f_2$ from Section 4.1, were compared to five other link predictors in the literature, with an extra negative-control predictor that selected PPIs uniformly at random. Table 2 summarizes the link predictors used in the experiments. The mechanism of each link predictor is as elaborated in the previous sections: among the CN-based link predictors, CN infers edges according to the principle shown in Fig. 1(a) and CRA infers edges using the $f_1$ similarity metric as explained in Section 4.1.1, while among the L3-based link predictors, L3 infers edges based on the principle shown in Fig. 1(b), Sim infers edges using the $f_2$ similarity metric defined in Section 4.1.2, and CH2 rewards edges for which the nodes in U and V are connected to many other nodes in U ∪ V but not connected to many nodes outside of U ∪ V.

Table 2: Link predictors used in the experiments (columns: Type; Link predictor; Score function $P_{xy}$).
• ExactL3 (L3E) predictors: plug in either $f_1$ or $f_2$ into Formula (6) or (7).
• control; rand: rank the edges uniformly at random.

Link Prediction in Synthetic Datasets
In this section, we present the results of our first set of experiments, designed to test how well the L3E link predictors realized the L3 principle compared to the other predictors in Table 2 on some synthetic datasets. The rand link predictor is not considered here since it cannot generate a link prediction score, so we use an alternative control predictor that simply counts the number of $P_4$'s between the two given nodes x and y instead. To generate the synthetic datasets, we start with an ideal L3 graph G (recall the definitions from Section 4) having 50 nodes in U and 50 nodes in V. Then, in the experiments, we add or remove edges from G that induce changes in the scores computed by the link predictors. By modifying an ideal L3 graph in this way, we can see how sensitive each link predictor is when dealing with changes that make G diverge from its ideal form. Since different link predictors use different scales, we normalize all their scores to values between zero and one. From here on, an edge of the form $\{u_i, v_j\}$, where $u_i \in U$, $v_j \in V$, and $i \neq j$, will be referred to as a compatible edge. Similarly, an edge of the form $\{u_i, u_j\}$, $\{v_i, v_j\}$, $\{x, v_i\}$, or $\{y, u_i\}$ is called an incompatible edge.
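The synthetic starting point and the control predictor can be sketched in Python as follows. The node labels, function names, and the inclusion of the square-root-normalized L3 score of [14] for comparison are choices made for this illustration only.

```python
import math

def ideal_l3_graph(size_u, size_v):
    """Complete bipartite graph on U and V, with x attached to all of U
    and y attached to all of V."""
    adj = {"x": set(), "y": set()}

    def add_edge(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    U = [f"u{i}" for i in range(size_u)]
    V = [f"v{j}" for j in range(size_v)]
    for u in U:
        add_edge("x", u)
        for v in V:
            add_edge(u, v)
    for v in V:
        add_edge("y", v)
    return adj

def p4_paths(adj, x, y):
    """Yield (u, v) for every simple path x-u-v-y."""
    for u in adj[x]:
        if u == y:
            continue
        for v in adj[u]:
            if v != x and y in adj[v]:
                yield u, v

def count_p4(adj, x, y):
    """Control predictor: number of P4's between x and y."""
    return sum(1 for _ in p4_paths(adj, x, y))

def l3_score(adj, x, y):
    """L3 score: each path contributes 1 / sqrt(|N(u)| * |N(v)|)."""
    return sum(1.0 / math.sqrt(len(adj[u]) * len(adj[v]))
               for u, v in p4_paths(adj, x, y))

G = ideal_l3_graph(50, 50)
n_paths = count_p4(G, "x", "y")
l3 = l3_score(G, "x", "y")
```

In the ideal graph with 50 nodes on each side, every node in U and V has degree 51, so each of the 2500 paths contributes only 1/51 to the L3 score.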

Removing Compatible Edges
Our first experiment started with the ideal L3 graph G and removed one of the compatible edges, chosen uniformly at random, from G in each iteration until all the (50 · 50) − 50 = 2450 compatible edges had been removed. Since the 50 edges of the form $\{u_i, v_i\}$ were never removed, the four L3-elements x, y, U, and V remained the same throughout the experiment. In every iteration, $P_{xy}$ was computed for each link predictor. We repeated the above ten times and calculated the median, minimum, and maximum scores to capture the variance. The results are plotted in Fig. 4(a). (The results for L3E are plotted separately in Fig. S1(a) since they overlap with L3E'.) As can be seen by looking at the curve for the control predictor, the number of $P_4$'s decreases as the number of remaining compatible edges decreases. This implies that the scores for a link predictor that realizes the L3 principle should decrease as well, and that to be more sensitive than the control predictor in score penalization, a link predictor's area under the curve (AUC) should be smaller than that of the control predictor. In this regard, the predictor L3E'($f_2$) outperformed all the other predictors, and CH2 and L3E'($f_1$) also did quite well. The same applies for L3E($f_1$) and L3E($f_2$) in Fig. S1(a). Note that CN and CRA have a constant score throughout the iterations: since x and y have no common neighbors in G, the scores computed by CN and CRA never change when edges are removed. For L3, its underwhelming performance can be attributed to the following: in early iterations, many pairs of nodes from U and V contribute to the score, and since these nodes have a high node degree, each pair has a low L3 score. Their individual contributions are consequently very small, which means that when one edge is deleted, the score computed by L3 remains close to its initial score. In contrast, in later iterations, few pairs of nodes from U and V contribute (and these nodes have a lower degree), so deleting an edge affects the score more.

Adding Incompatible Edges
The second experiment was complementary to the one in Section 6.1. Starting from the ideal L3 graph G, one incompatible edge was inserted into G in every iteration until all the $\binom{50}{2} + \binom{50}{2} + 50 + 50 = 2550$ incompatible edges had been inserted. Each edge to be inserted was chosen uniformly at random among the incompatible edges that had not been inserted yet. The experiment was repeated ten times, as in Section 6.1.
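The edge counts used in the two synthetic experiments can be checked by enumerating the edges explicitly. The sketch below assumes that compatible edges are the pairs $\{u_i, v_j\}$ with $i \neq j$ and that incompatible edges are the pairs within U, the pairs within V, the pairs {x, v}, and the pairs {y, u}; the node labels are hypothetical.

```python
from itertools import combinations

n = 50
U = [f"u{i}" for i in range(n)]
V = [f"v{i}" for i in range(n)]

# Compatible edges: {u_i, v_j} with i != j (the n edges {u_i, v_i} are kept).
compatible = [(U[i], V[j]) for i in range(n) for j in range(n) if i != j]

# Incompatible edges: within U, within V, between x and V, between y and U.
incompatible = (list(combinations(U, 2)) + list(combinations(V, 2))
                + [("x", v) for v in V] + [("y", u) for u in U])

n_compatible = len(compatible)      # (50 * 50) - 50
n_incompatible = len(incompatible)  # C(50, 2) + C(50, 2) + 50 + 50
```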
The results are plotted in Fig. 4(b). (As above, the results for L3E are plotted separately in Fig. S1(b) since they overlap with L3E'.) In this experiment, one might expect to see strictly decreasing scores as additional edges are inserted into G, disrupting its ideal L3 structure. However, as shown by the control predictor, the addition of incompatible edges increases the number of $P_4$'s non-linearly because the more edges that already exist in G, the more $P_4$'s between x and y will be created for each additional edge. Therefore, any L3-based predictor will eventually show an increasing score. Yet, L3-based predictors with proper penalization should still be less sensitive than the control predictor. By this, we mean that for a link predictor, the partial AUC starting from the point on the x-axis where its minimum normalized score occurs should be smaller than that of the control predictor. Here, L3E'($f_1$) and L3E'($f_2$) outperformed all the other predictors: all L3 predictors show an initially decreasing score as explained, but L3E is the least sensitive during the increase in scores as demonstrated by it having the smallest partial AUC. The same applies for L3E($f_1$) and L3E($f_2$) in Fig. S1(b). For CN-based predictors, the scores are not directly related to the number of $P_4$'s, but rather the number of common neighbors created by incompatible edges of the form $\{y, u_i\}$ and $\{x, v_i\}$. (Thus, we do not compare them to the control predictor.) Here, CRA is better than CN since it compensates for interconnectedness within common neighborhoods by also taking edges of the form $\{u_i, u_j\}$ and $\{v_i, v_j\}$ into account.
[...] between the two corresponding samples using Student's t-test. In (b), a Savitzky-Golay filter using a polynomial of degree 3 and a window size of 21 was applied to make the curves smoother.

Summary
We conclude that the L3E predictors are able to penalize the absence of compatible edges as well as the presence of incompatible edges better than the other link predictors that were considered in the experiments, which suggests that L3E provides a more accurate implementation of the L3 principle for this particular kind of synthetic data. In the next section, we will evaluate the performance of L3E on some real PPI datasets.

Link Prediction in PPI Datasets
In this section, we experimentally evaluate the predictive power and biological significance of L3E using real PPI datasets. The datasets and link predictors used are described in Section 5. To evaluate the predictive power of L3E, we prepared our datasets by removing 50%, 40%, 30%, 20%, and 10% of the edges chosen uniformly at random (different sample sizes), and repeated this ten times for each sample size, yielding a total of 50 samples for each dataset. Next, for each of the 50 samples and each link predictor, we computed the scores of all non-neighboring pairs of nodes x and y (called candidate edges) and ranked them according to their scores. In each such experiment, we then selected the k top-ranked candidate edges to be the set of predicted edges, where k denotes the number of edges that were removed from that dataset in the sampling preprocessing step. (In other words, the accuracy would be 100% if and only if the predicted edges were exactly those that had been removed earlier.) Finally, the performance of the various link predictors was evaluated by analyzing and comparing the sets of edges that they predicted in the experiments.
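The sampling and top-k selection procedure described above can be sketched in Python as follows. Common Neighbors is used purely as a stand-in scorer, and all function names and the toy graph are hypothetical; any of the link predictors could be passed in as `score`.

```python
import random
from itertools import combinations

def build_adj(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def cn_score(adj, a, b):
    """Stand-in scorer (Common Neighbors)."""
    return len(adj[a] & adj[b])

def sample_removed(edges, fraction, rng):
    """Choose round(fraction * |E|) edges uniformly at random for removal."""
    k = round(len(edges) * fraction)
    return set(rng.sample(sorted(edges), k))

def top_k_precision(nodes, edges, removed, score):
    """Rank all non-adjacent pairs of the sampled graph, keep the top k
    (k = number of removed edges, assumed > 0), and return the fraction
    of them that are removed edges."""
    kept = [e for e in edges if e not in removed]
    adj = build_adj(nodes, kept)
    candidates = [p for p in combinations(sorted(nodes), 2)
                  if p[1] not in adj[p[0]]]
    candidates.sort(key=lambda p: score(adj, *p), reverse=True)
    k = len(removed)
    predicted = {frozenset(p) for p in candidates[:k]}
    truth = {frozenset(e) for e in removed}
    return len(predicted & truth) / k

nodes = ["a", "b", "c", "z1", "z2", "z3"]
edges = [("a", "b"), ("a", "z1"), ("b", "z1"), ("a", "z2"), ("b", "z2"),
         ("a", "z3"), ("b", "z3"), ("c", "z1")]
precision = top_k_precision(nodes, edges, {("a", "b")}, cn_score)
sampled = sample_removed(edges, 0.5, random.Random(0))
```

Because exactly k candidates are selected, the number of false positives equals the number of false negatives in this setup, so precision and recall of the selected set coincide.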

ExactL3 Improves PPI Link Predictions
A standard tool for evaluating statistical predictors is the precision-recall (PR) curve [34]. Using the true-positive PPIs (tp), the false-positive PPIs (fp), and the false-negative PPIs (fn) in the outcomes of the experiments, precision is defined as tp / (tp + fp) and recall is defined as tp / (tp + fn). As implied by the name, a PR curve consists of a link predictor's precision and recall values computed for various datasets, and thus illustrates the trade-off between precision and recall. In general, the larger the area under the precision-recall curve (also referred to as the PR AUC), the better [35].
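To make the definitions concrete, the following sketch computes precision and recall along a ranked list of candidate PPIs and accumulates an area under the resulting curve. It assumes that the curve is traced by sweeping a cutoff down the ranking and that the area is accumulated step-wise; both are common conventions adopted here for illustration, and the function names are hypothetical.

```python
def pr_points(ranked_hits, total_positives):
    """Return (recall, precision) after each position of a ranked candidate list.

    ranked_hits[i] is True if the candidate at rank i+1 is a real (removed) PPI.
    """
    points = []
    tp = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        tp += bool(hit)
        fp = rank - tp              # predicted but not real
        fn = total_positives - tp   # real but not yet predicted
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        points.append((recall, precision))
    return points


def pr_auc(points):
    """Step-wise area under the PR curve (assumption: rectangle rule per recall step)."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in points:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Toy ranking: real PPIs at ranks 1 and 3, with two real PPIs in total.
toy_points = pr_points([True, False, True, False], total_positives=2)
toy_auc = pr_auc(toy_points)
```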
In our experiments, we first considered the datasets in which 50% of the PPIs had been removed. Fig. 5 shows the PR curves and PR AUC-values of the link predictors. Because L3E' and L3E have similar performance, with L3E' being slightly better than L3E (see Fig. S2 for a detailed comparison), only the former is included in Fig. 5. For the same reason, we shall focus on L3E' in the experiments from now on. According to the figure, L3E'(f1) is generally the best predictor, both among the L3E predictors and among all other predictors, in the sense that its PR curve upper-bounds those of the other link predictors most of the time and it has the largest PR AUC; the only exception is the STRING Yeast dataset in Fig. 5(a2).
To ensure the proper design of our methodology, we employed the random link predictor (rand) as a negative control. However, the probability of randomly choosing a real PPI from any of the pools of candidate PPIs in the datasets summarized in Table 1 is at most roughly 1%, and so the PR AUC of rand is almost 0 (see Fig. S3). Because of its insignificance, the rand predictor was therefore excluded from Fig. 5. We also computed the p-value of the PR AUC for all the predictors against rand in Tables S1 and S2, which confirmed that all the predictors are statistically significant and thus far better than selecting PPIs at random. (The largest p-value was 1e-13, i.e., far from statistically insignificant.)
Next, we conducted the same experiments for the other datasets, in which 40%, 30%, 20%, and 10% of the edges had been removed. The computed PR curves and PR AUC-values are plotted in Figs. S4-S7. The outcomes are similar to those of the experiment described above, where 50% of the edges had been removed. To give a summary of Figs. S4-S7, we extracted the PR AUC for each of the predictors in the experiments, and plotted them in Fig. 6 to show the changes in PR AUC as the number of edges removed in the dataset decreases. As in Fig. 5, L3E'(f1) outperforms all the other link predictors in most datasets with high statistical significance. Another observation is that the PR AUC decreases along the x-axis, which may be because of the rapid drop in precision-recall or the drop in maximum recall as the percentage of removed edges in the datasets decreases (see Figs. S3-S7). To investigate the reason for this, we evaluated the PR AUC of the random predictor as a negative control (Table S3). There is a gradual decrease in the PR AUC as the number of removed PPIs decreases, suggesting that if fewer PPIs are removed then it is more difficult for a predictor to pick a real PPI at random.
In addition to the predictive power, another important aspect to consider in the evaluation of link predictors is the computation time. Table 3 summarizes the computation times taken by the experiments in Fig. 5. The experiments were conducted using a setup of 14 cores and 32GB RAM. A larger setup consisting of 24 cores and 128GB RAM was used for the BioGRID Human and STRING Human datasets due to their massive size. For L3E', the computation time increases more rapidly than for the simpler predictors CN, L3, and CRA as the datasets scale up (e.g., BioGRID
Overall, the above findings lead us to conclude that L3E'(f1) has the best predictive power (in terms of precision-recall across datasets of different sample sizes) using a reasonable amount of computation time.

ExactL3 Predictions are Biologically Relevant
In addition to the statistical significance of L3E' in PPI link prediction, we are also interested in measuring the strength of the biological evidence that supports the predicted PPIs. The first measure that we consider is the STRING confidence score [31]. The STRING confidence score estimates the confidence of a PPI by evaluating evidence for the two proteins such as whether their genes co-express, whether the proteins co-occur phylogenetically, whether the proteins appear together frequently in the literature, and more. We extracted the STRING confidence scores from the STRING datasets, interpreting every null score as a zero. Fig. 7(a) shows the mean STRING confidence scores of the predicted PPIs across different sample sizes of all datasets for every predictor. The random predictor (rand) has been omitted from the figure due to its insignificance. According to the plots, L3E' is the best predictor for the human dataset and the second-best predictor for the yeast dataset (after CRA). This indicates that PPIs predicted by L3E' are biologically relevant.
To investigate whether there is a correlation between the ranking of a predicted PPI and its STRING confidence score, we plotted the moving mean (window size of 100, 10 steps forward in each iteration) of the STRING confidence score along with the ranking of the predicted PPIs in Fig. 7(b) for datasets with 50% of the PPIs removed. The figures for the rest of the sample sizes are included in Figs. S8 and S9. The moving mean shows that for every predictor, the predicted PPIs that are ranked higher indeed have a higher STRING confidence score than those that are ranked lower. As in Fig. 7(a), L3E' is the best predictor for the human dataset, where its predicted PPIs in general have higher confidence scores than those of the other predictors, and the second-best predictor for the yeast dataset.
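The moving mean used for Fig. 7(b) can be sketched as follows. The sketch assumes that the scores are already ordered by the rank of the predicted PPIs and that windows extending past the end of the list are dropped; the function name and the boundary handling are assumptions made for illustration.

```python
def moving_mean(scores, window=100, step=10):
    """Mean of each length-`window` slice, advancing `step` positions per iteration.

    Assumption: only complete windows are used, so the trailing partial
    window (if any) is ignored.
    """
    return [
        sum(scores[start:start + window]) / window
        for start in range(0, len(scores) - window + 1, step)
    ]


# Toy example: 120 rank-ordered scores yield windows starting at ranks 0, 10, and 20.
toy_scores = list(range(120))
toy_means = moving_mean(toy_scores, window=100, step=10)
```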
Next, we computed the Gene Ontology (GO) Semantic Similarity (GOSemSim) scores of the predicted PPIs. The GOSemSim score estimates the similarity of two proteins based on the similarity of their so-called GO annotations that describe proteins in terms of their role as a cellular component, their role in molecular functions, and their role within biological processes. The implementation that we used was a GOSemSim package written in the R programming language [36] based on Wang's method [37] with the BMA strategy; null GOSemSim scores are ignored in the computations. Fig. S10 shows the GOSemSim scores of the predicted PPIs of all predictors across the datasets of different sample sizes. The link predictors are separated for comparison according to the principle they are based on (Table 2: CN-based, L3-based, or control). While the differences between predictors are less striking than in the experiments above, we can observe that the CN-based predictors have better GOSemSim scores than the L3-based predictors in general. This is natural because CN characterizes protein pairs with similar functions (Section 2.1). Among the L3-based predictors, we can see that L3E' beats the others with statistical significance (in terms of AUC-values using Student's t-test) for four of the seven datasets. Hence, PPIs that are ranked highly by L3E' may possess some functional bias that is encouraged by GOSemSim, e.g., physical PPIs with high L3 scores may reside in neighboring cellular components.

[Fig. 7 caption fragment: (a) Mean STRING confidence score of the top predicted PPIs for the link predictors, for dataset sample sizes from 50% to 90% in increments of 10%.]

Table 4: Overlap ratios of predicted PPIs between different types of link predictors for datasets with 50% of the PPIs removed (Tables S4 and S5 show the complete data). 'CN-based' and 'CRA & L3E'(f1)' denote the overlap ratio of the predicted PPIs between CN and CRA, and between CRA and L3E'(f1), respectively. For 'L3-based', since there are multiple L3-based predictors (L3, CH2, Sim, L3E'(f1), and L3E'(f2)), we calculated the overlap ratio for each pair of predictors. We then took the mean of these ratios as the final value, and also computed the standard deviation. The same applies to 'CN & L3-based', where a CN-based predictor is compared to an L3-based predictor. Blue denotes a relatively higher overlap ratio and red a relatively lower one. Ratios are rounded to the nearest integer.
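The pairwise overlap computation described for Table 4 can be sketched as follows. Several details are assumptions made for illustration: the overlap ratio is taken to be the number of shared predicted PPIs divided by k (each predictor predicts the same number k of PPIs), the population standard deviation is used, and the predictor names and PPI identifiers are toy values.

```python
from itertools import combinations
from statistics import mean, pstdev


def overlap_ratio(predicted_a, predicted_b):
    """Fraction of predicted PPIs shared by two predictors (assumed: |A & B| / k)."""
    a, b = set(predicted_a), set(predicted_b)
    if len(a) != len(b):
        raise ValueError("both predictors are assumed to predict the same number of PPIs")
    return len(a & b) / len(a)


def pairwise_overlap_summary(predictions):
    """Mean and standard deviation of the overlap ratio over all predictor pairs."""
    ratios = [
        overlap_ratio(predictions[p], predictions[q])
        for p, q in combinations(sorted(predictions), 2)
    ]
    # Assumption: population standard deviation; the text does not say which variant was used.
    return mean(ratios), pstdev(ratios)


# Toy predicted-PPI sets (integers stand in for PPIs).
toy_predictions = {
    "L3": {1, 2, 3, 4},
    "CH2": {1, 2, 3, 5},
    "Sim": {1, 2, 6, 7},
}
toy_mean, toy_std = pairwise_overlap_summary(toy_predictions)
```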

Discussion
We have proposed a way to implement the L3 principle in link predictors that we call ExactL3 (L3E). Using the L3E predictors, we are able to deal with hypothetical PPI subgraphs much better than other link predictors (Section 6). L3E can also predict PPIs with strong statistical significance (Section 7.1) and sufficient biological relevance (Section 7.2). In summary, we have demonstrated that the L3E predictors are effective predictors of missing protein-protein interactions that are better than previous methods. The modeling strength of L3E comes from two main ideas: the realization that the L3 principle can be decomposed into a series of computations that compare graph neighborhoods, and the observation that these comparisons can be computed using similarity metrics. These address what the other L3 predictors are lacking: the original L3 predictor [14] simplifies the L3 principle into counting the number of P4's, which does not address compatibility of protein interfaces; the CH2 predictors [18] merely adopt the modeling approach of the CRA predictor in L3 subgraphs, which again does not address protein compatibility; and the Sim predictor [26] models protein compatibility using the Jaccard coefficient, but only partially, since it lets the sets U and V contribute to the final score independently, ignoring the biological motivation of the L3 principle (see also Fig. 1(d)).
The CRA predictor, a CN-based predictor, is one of the best link predictors within our experiments but with a huge variance in its performance. For example, it appears to outperform L3E in Fig. 6(a2) and Fig. 7(a1), but it also underperforms in some cases such as in Fig. 6(a3) and Fig. 6(b3). We hypothesized that this is due to the different paradigms adopted by L3E and CRA in their respective network modeling, so we further investigated the similarity between the pools of PPIs predicted by CRA and L3E'(f1), i.e., the ratio of the overlap. Surprisingly, as shown in Table 4, these two predictors show a lower overlap ratio compared to the mean overlap ratios of L3-based predictor pairs or CN-based predictor pairs. A lower overlap ratio can also be seen when we compute the overlap ratios of pairs where one predictor is CN-based and the other is L3-based. This implies that the PPIs predicted by L3E are similar to those predicted by other L3 predictors, although L3E is better at ranking them (see Section 7). Furthermore, this suggests that since L3E and CRA predict differing sets of PPIs with competing performance based on different assumptions, the two methods could perhaps be used together in a complementary way to obtain even better link predictions.
Apart from the improved link prediction performance of L3E, these predictors can also be used as a heuristic to narrow down candidate proteins for biological problems. A study from Liu et al. [17] improves protein folding recognition by constructing a protein similarity network based on the L3 principle to identify proteins that could fold in similar ways as the query protein. Since adding network data yields better performance than using protein sequence and profile data only, we believe that L3E could also be used in other similar scenarios.
We anticipate that the use of biological network data will become even more prevalent in various biological problems. Therefore, methods such as L3E may turn out to be useful for many other applications beyond link prediction in protein-protein interaction networks in the future.