Personalized sampling graph collection with local differential privacy for link prediction

Link prediction (LP) is an attractive research problem on social network data. Yet, link information may be leaked by an untrusted collector. As countermeasures, a few methods have been specially designed for link prediction under local differential privacy (LDP), which allow a third-party data collector to collect user data for link prediction while protecting user connection privacy. In this paper, we propose a Community-based graph collection with Personalized sampling Randomized Response (CPRR), a novel graph collection algorithm with LDP that reaches a decent trade-off between sensitive link protection and link prediction performance. The proposed mechanism adopts a personalized sampling technique for each connection of a user and then utilizes the randomized response mechanism on the sampled subset. Based on the personalized sampling technique, we can reduce the noise injected by LDP. Meanwhile, considering the different edge distributions of different regions of the original graph, we propose a community-based sampling strategy. Then, we prove that the proposed CPRR satisfies LDP. Through extensive experiments on several real-life graph datasets, we demonstrate that CPRR achieves better results in balancing privacy protection and link prediction performance than state-of-the-art baselines.


Introduction
Graph data is a prevalent data format in real applications, for example, social network graphs and protein graphs; thus, learning knowledge from graph data or analyzing graph data is attractive in various scientific research and industrial fields. Link prediction on a graph is to infer possible link relationships between nodes in the future when the current graph structure is known [1]. Link prediction (LP) can be used to predict new co-workers or recommend new Facebook friends [2]. The problem of how to avoid private data leakage in the analysis of different data types has also attracted more and more attention [3][4][5][6][7]. In particular, the Facebook privacy breach has made people realize that third-party service platforms that provide graph data analysis may expose users to the risk of privacy leakage [8]. Traditional privacy-preserving models on graphs always assume the existence of a trusted centralized platform to serve users, such as k-neighborhood [9], k-degree anonymity [10], and differential privacy (DP) [11,12]. However, these third-party service platforms are not always credible in protecting users' private data. At the same time, in some decentralized scenarios, each user has his own partial graph data view and there is no centralized view at all, for example, the World Wide Web, telephone contact networks, mail contact networks, and COVID-19 transmission trajectories, wherein decentralized privacy protection is required.
To rescue decentralized scenarios, local differential privacy (LDP) has been proposed to protect the sensitive information of distributed users. It assumes the third party is untrusted and thus can solve the problem of third-party privacy leakage [13]. LDP requires users to perturb their data by themselves and then send the perturbed data to the collector. Based on this, user privacy is protected even if the data collector is unreliable. Since LDP is defined in decentralized scenarios, it is becoming more and more commonly used in data collection [14]. Previous work proposed LDPGen [15] for synthesizing graphs under LDP. In LDPGen, the authors use LDP to collect the degree information of nodes to synthesize graphs that preserve the graph topology. To balance the noise addition between overall statistical properties and edge-grained statistical characteristics of the graph, LDPGen loses some important local features of the graph. But in link prediction algorithms, especially some heuristic methods, the local features around the target link carry sufficient information to compute the edge existence probability [16]. Furthermore, [8] proposed LF-GDPR for estimating metrics on graphs under LDP, and [17] proposed triangle counting and k-star estimation on graphs under LDP. However, [8,17] do not pay attention to preserving the topology structure of graphs when they collect graph data. The aforementioned graph processing tasks that satisfy LDP mainly fall into two branches: one is graph synthesis, and the other is unbiased estimation of statistical indicators on the graph. Existing graph synthesis methods under LDP can retain the structural character of the graph as a whole but cannot preserve fine-grained features at the edge level. The LP task is to predict the probability that target links will appear in the future. When using heuristic methods or similarity-based properties to compute the existence probability of edges, the local enclosing subgraph around the target node pair has a stronger effect than the character of farther-away nodes [16]. The unbiased estimation of statistics on the graph focuses on accurate estimation of the target statistics but ignores the structural preservation of the original graph. These methods do not directly optimize for the LP task, and they show poor link prediction performance.
To address the above challenges, we propose a Community-based graph collection with Personalized sampling Randomized Response (CPRR), which can meet the privacy protection needs of LDP as well as ensure the performance of the downstream LP analysis task. Our contributions in this paper are summarized as follows:
• We design a method to use a different sampling probability for each piece of data in the user record, thereby reducing the noise added to the data on the user side.
• We propose a two-round graph data collection approach combined with community partition to further preserve the edge distribution character of the original graph.
• We prove that the proposed algorithms still satisfy ε-LDP.
• Through experiments on several real-life graph datasets, we demonstrate that our proposed method outperforms existing graph collection methods under LDP in the LP task.
This paper is organized as follows. We discuss related works in Section 2. Section 3 introduces preliminaries on LDP and link prediction; then we define the problem of link prediction under LDP. Section 4 presents our method and analyzes its privacy guarantee. Finally, extensive experiments are conducted in Section 5, and this paper is concluded in Section 6.

Related work
We introduce related works from three aspects, i.e., graphs with centralized DP, graphs with local DP, and link prediction methods.

Graph with centralized differential privacy
The application of centralized DP on graphs can be divided into privacy-protected graph publishing and graph analysis with differential privacy. Graph publishing under differential privacy uses a graph generation model to approximate the original graph and then produces a synthetic graph for analysis. Existing graph generation models include dk-series [18], stochastic Kronecker product [19], the attribute graph model [20], the hierarchical random graph model [5], and the exponential random graph model [21]. Besides traditional graph generation models, [22] proposed a graph generation technique combined with deep learning. Graph analysis with differential privacy studies how to estimate metrics and statistics on graphs under differential privacy. [11] proposed how to estimate the triangle count and the cost of the minimum spanning tree of a graph under differential privacy. Moreover, it also includes subgraph count queries [23,24], for instance k-star count, k-triangle count, and k-cliques, as well as mining frequent graph patterns [12,25]. There are also estimates of node degree distributions [26,27] and clustering coefficients [28].

Graph with local differential privacy

[15] first proposed to synthesize graphs under LDP in localized scenarios, in which the graph can be collected through a well-designed node degree collection strategy. [29] further considered the intra-group and inter-group correlation between nodes and proposed a synthetic graph method based on node correlation. [30] considered the range of the noise when collecting the degree vector while considering data utility when generating the graph. The aforementioned three works are all related to graph synthesis; hereafter, some works about estimating graph statistics are introduced. [31] proposed a subgraph counting method that protects the user's private connections as well as the privacy of the neighbors connected to the user in the localized scenario. [8] proposed a general framework, LF-GDPR, to estimate graph metrics that can be expressed as polynomials of the adjacency matrix and node degrees; the local clustering coefficient and modularity are given as application examples. [17] estimated triangle counts and k-star counts through multiple rounds of interaction to reduce estimation errors. Besides, there are some works combined with graph neural networks [32].

Link prediction method
Early link prediction works were based on node similarity, measuring link existence around the local neighbors of the target node pair [33][34][35]. These methods all make a strong assumption about the distribution of link existence and lack generalization ability. Graph embedding methods are used to encode the global structural character of a graph for link prediction [36,37]. However, these methods rely only on the connection information of the graph and do not consider node attributes at all. Recently, methods based on graph neural networks have been proposed to jointly use graph connection information and node attribute information. [16] proposed that the local enclosing subgraph contains sufficient information for link prediction and that such information can be exploited using GNNs. [38] further considers the richness of connections in the local enclosing subgraph of target node pairs. [39] uses k-hop common neighbors around the target node pair to form an enclosing subgraph. [40] introduced multi-scale learning into the learning process of graph structure to maximize graph structure feature extraction. The above works are all defined in Euclidean space, while [41] considered graph metrics in non-Euclidean space and proposed a general graph convolutional network based on hyperbolic space for link prediction. [42] considered the relationship between link prediction and community discovery and performed matrix completion by considering both simultaneously. Recently, in order to preserve privacy, [43] proposed a graph embedding learning method that satisfies edge DP. [2] divided user connection relationships into protected connections and public connections, proposing a variant of DP called protected-pair DP.
The above graph analysis methods mainly focus on graph synthesis or statistic estimation in the decentralized scenario and are not optimized for the LP task. The early LP methods did not consider privacy protection, and the recent privacy-aware methods [2,43] are only suitable for centralized scenarios and cannot be directly used in distributed scenarios. In this paper, we propose a graph collection method that is optimized for LP tasks in decentralized scenarios.

Local differential privacy
Differential privacy (DP) [44] is a provable privacy protection mechanism that can achieve statistical indistinguishability of an algorithm A on sensitive datasets. We say that an algorithm A satisfies ε-DP if and only if, for any two neighboring databases D and D′ that differ in only one record, and for any possible output S of A, we have Pr[A(D) ∈ S] / Pr[A(D′) ∈ S] ≤ e^ε. Essentially, DP guarantees that an attacker cannot speculate with high confidence whether the input database is D or D′, even if the attacker can observe any output of algorithm A, and thus it protects the privacy of a record in the dataset.
Centralized differential privacy requires sensitive data to be kept on a trusted third-party server on which the differentially private algorithm A is executed. However, the assumption that a trusted third-party server exists is not always true. To relax this assumption, LDP [13] was proposed, which assumes that each user is responsible for his record in the database. In LDP, the data collector is untrustworthy. Before sending their data to the collector, users first perform local processing with LDP and then send the noised version to the collector. In this setting, the input database contains a single user's record. The neighboring database D′ of D depends on the specific user data type and the goal of privacy protection. In the social network scenario, according to the need for privacy protection, the privacy algorithm can be designed to satisfy edge differential privacy or node differential privacy [15]. The former guarantees that a randomized algorithm does not reflect the inclusion or deletion of any edge of an individual, and the latter hides whether a node and its surrounding edges are included or deleted.
For LDP, we have the following notations. U = {u_1, ..., u_n} represents the set of users in the social network, where n is the number of users. The neighbors of a user u_i are defined as an n-dimensional user bit vector υ_i = (b_1, ..., b_n), i.e., b_j = 1 for j ∈ {1, ..., n} if and only if there is an edge (u_i, u_j) in the social network; otherwise b_j = 0. The n-dimensional user bit vectors of all users form the user adjacency matrix M = {υ_1, ..., υ_n}.
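As a concrete illustration of this notation, the user bit vectors of a small undirected network can be assembled into the adjacency matrix M. This is a minimal sketch; the 0-indexed user IDs and the helper name are our own:

```python
import numpy as np

def user_bit_vectors(n, edges):
    """Build the n x n adjacency matrix M whose i-th row is the user bit
    vector v_i: bit b_j = 1 iff the edge (u_i, u_j) exists."""
    M = np.zeros((n, n), dtype=int)
    for i, j in edges:
        M[i, j] = 1
        M[j, i] = 1  # the social network is undirected
    return M

# Four users; u_0 is connected to u_1 and u_3, and u_1 to u_2.
M = user_bit_vectors(4, [(0, 1), (1, 2), (0, 3)])
print(M[0])  # user bit vector of u_0: [0 1 0 1]
```

Because the graph is undirected, M is symmetric and its diagonal is zero (no self-links).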
Definition 1 (Node local differential privacy) [8] A randomized algorithm A satisfies ε-node local differential privacy (ε-node LDP) if, for any two neighboring user bit vectors υ and υ′, we have Pr(A(υ) = s) / Pr(A(υ′) = s) ≤ e^ε, where s ∈ range(A).

Definition 2 (Edge local differential privacy) [8] A randomized algorithm A satisfies ε-edge local differential privacy (ε-edge LDP) if, for any two neighboring user bit vectors υ and υ′ that differ in one bit, we have Pr(A(υ) = s) / Pr(A(υ′) = s) ≤ e^ε, where s ∈ range(A).
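For intuition, a quick numeric check confirms that classical randomized response on a single bit satisfies this ratio bound, since flipping one input bit changes each output's probability by at most a factor of e^ε. This is a sketch with a hypothetical helper name:

```python
import math

def rr_prob(output_bit, input_bit, eps):
    """Probability that randomized response with budget eps maps input_bit
    to output_bit: keep with p = e^eps / (1 + e^eps), flip with 1 - p."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    return p if output_bit == input_bit else 1.0 - p

# Two user bit vectors differing in one bit (values 1 vs 0): for every
# output s, the probability ratio is bounded by e^eps, as Definition 2 asks.
eps = 1.0
for s in (0, 1):
    ratio = rr_prob(s, 1, eps) / rr_prob(s, 0, eps)
    assert ratio <= math.exp(eps) + 1e-12
print("edge LDP ratio bound holds for eps =", eps)
```

The bound is tight: for s = 1 the ratio is exactly p / (1 − p) = e^ε.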

Theorem 1 (Sequential composition) If algorithms A_1, ..., A_m satisfy ε_1-, ..., ε_m-node (resp. edge) local differential privacy, then combining these algorithms satisfies (ε_1 + ... + ε_m)-node (resp. edge) local differential privacy.
Besides sequential composition, LDP also enjoys immunity to post-processing [45]. The neighboring databases of node LDP are defined over the whole user bit vector, while the neighboring databases of edge LDP differ in only one bit of the user bit vector. In general, node LDP is stronger than edge LDP in privacy-preserving strength (in fact, node LDP implies edge LDP). According to the application scenarios and the nature of social networks, a privacy model with an appropriate protection strength is more meaningful than the strongest possible privacy guarantee. Although node LDP is stronger than edge LDP in privacy preservation, edge LDP still provides strong indistinguishability on each edge, which is sufficient for large graph application scenarios, such as community detection and triangle counting, while maintaining high data utility [8,17]. Therefore, this paper adopts the edge LDP privacy model, like existing works on graph LDP.

Link Prediction with local differential privacy
We consider a social network snapshot represented as an undirected graph G = (V, E), where V is the set of nodes and E is the set of edges. The graph contains no multiple links or self-links. Let M represent the adjacency matrix corresponding to the graph, and let U denote the universal set of all |V|(|V| − 1)/2 possible links between nodes, where |V| indicates the number of nodes in V. Let U − E denote all edges that do not currently exist. We assume that some edges in U − E will appear in the future; the LP task is to discover these edges, and a non-private LP algorithm can predict whether a target edge has a high probability of existing in the near future.
Suppose we are given a non-private LP algorithm B. To protect individual privacy under LDP, we use an algorithm A that satisfies edge LDP to perturb each user's local graph data; the data collector then collects all users' perturbed data to obtain a privacy-protected graph G̃ = (V, Ẽ) and runs the non-private LP algorithm B on G̃ to calculate the target edge existence probabilities. We aim to make the result similar to that calculated from the original graph G, so as to discover which edges will appear in the near future. According to the post-processing property of LDP [45], the output of the LP algorithm B satisfies LDP, so individual privacy is preserved.

Method
Firstly, we propose the personalized sampling randomized response. Then, the community-based graph collection with personalized sampling randomized response is presented. Finally, we analyze the privacy guarantee and complexity of our proposed method.

Personalized sampling randomized response
In previous work, an intuitive way to collect graph data is to perturb each bit of the neighbor list independently using the classical randomized response (RR); this is called the Randomized Neighbor List (RNL), which maintains edge-level information but requires heavy noise injection [15]. Given the user bit vector of user u_i, denoted by υ_i = {b_1, ..., b_n}, and a privacy budget ε, we can obtain the perturbed vector υ̃_i = {b̃_1, ..., b̃_n} as follows:

b̃_j = b_j with probability e^ε / (1 + e^ε), and b̃_j = 1 − b_j with probability 1 / (1 + e^ε).    (1)

As shown in (1), the probability of keeping a bit unchanged is p = e^ε / (1 + e^ε), which is not proportional to the proportion of real edges among the collected edges. If the edge density of the original graph is η, the proportion of real edges among the collected edges can be expressed as ηp / (ηp + (1 − η)(1 − p)) [8]. Alternatively, the proportion can be computed from each user's counts of real and collected edges. This is a posterior probability, which also reflects the probability of an attacker inferring a true edge from the collected edges. Fortunately, most real-life graphs are sparse [46], meaning that most individuals are connected with only a small number of neighbors, so the attacker has a small probability of success. At the same time, because of the edge sparsity of the original graph, directly using the randomized response method to collect the graph expands the edge density of the collected graph, which will then contain too many fake edges. It is easy to see that the graph edge density after perturbation with the RR mechanism is η̃ = ηp + (1 − η)(1 − p). For example, the Facebook dataset has an original edge density of η = 0.0108. Assuming a small privacy budget ε = 1, the retention probability of the RR mechanism is p = e^1 / (1 + e^1) ≈ 0.7310. The perturbed edge density is then η̃ ≈ 0.2739, and the proportion of real edges among the collected edges is 0.0288; the perturbed edge density is 25 times larger than the original. With a larger privacy budget such as ε = 5, p ≈ 0.9933, the edge density after perturbation is η̃ ≈ 0.0173, which is 1.6 times that of the original graph, and the proportion of real edges after perturbation is 0.6183. Therefore, RR introduces excessive fake edges when ε is small; only when the privacy budget is large is it possible to inject fewer fake edges and obtain good utility, as shown in the experiments.
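The arithmetic above can be reproduced directly from the two closed-form expressions. This is a minimal sketch; the function names are our own:

```python
import math

def rr_keep_prob(eps):
    """Retention probability p = e^eps / (1 + e^eps) of classical RR."""
    return math.exp(eps) / (1.0 + math.exp(eps))

def perturbed_density(eta, eps):
    """Edge density after RR perturbation: eta*p + (1 - eta)*(1 - p)."""
    p = rr_keep_prob(eps)
    return eta * p + (1.0 - eta) * (1.0 - p)

def real_edge_ratio(eta, eps):
    """Proportion of real edges among collected edges: eta*p / perturbed density."""
    p = rr_keep_prob(eps)
    return eta * p / perturbed_density(eta, eps)

eta = 0.0108  # original edge density of the Facebook dataset
print(perturbed_density(eta, 1.0))  # ≈ 0.2739, about 25x the original density
print(real_edge_ratio(eta, 1.0))    # ≈ 0.0288
print(perturbed_density(eta, 5.0))  # ≈ 0.0173, about 1.6x the original density
print(real_edge_ratio(eta, 5.0))    # ≈ 0.6183
```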
Besides, [8] pointed out that RNL using the classical randomized response mechanism achieves ε-edge LDP for each user but only 2ε-edge LDP for the third-party collector on an undirected graph, because the collector receives the same edge twice, independently perturbed. In other words, each user independently randomizes and sends all bits of his user bit vector to the third-party collector, which also results in high computational and communication overhead. In the following, we adopt an approach similar to RABV [8] and only perturb the upper triangular matrix.
Based on the above observations, we propose the Personalized sampling Randomized Response mechanism (PRR), which is shown in Algorithm 1. Sampling [47] and personalized sampling [48] have been proposed for differential privacy, but they have not previously been combined with LP tasks under LDP.
Our algorithm takes the user bit vectors {υ_1, ..., υ_n}, the privacy budget ε, and the expected ratio of real edges r ∈ (0, 1) as input. Each user first calculates his transmission range t based on his user number i (Lines 1-6): for the first users with 1 ≤ i ≤ ⌈n/2⌉, the range is t = ⌈n/2⌉; otherwise, the range is t = ⌊(n − 1)/2⌋. Only the bits between the (i + 1)-th and the ((i + 1 + t) mod n)-th position of the user bit vector will be used. Then, the user assigns to each bit of the user bit vector an appropriate sampling probability, defined below (Lines 7-9). Assume that the user bit vector of u_i contains m neighboring nodes within the transmission range. Figure 1 presents an example of the process.
The adjacency matrix includes n rows, and each row corresponds to a user bit vector. The sampling probability of each bit b_j in the original user vector is π_j, as shown in (2):

π_j = m · e^ε · (1 − r) / (r · (t − m)),    (2)

where ε is the privacy budget of the RR mechanism. If t = m, π_j = 1. According to Theorem 3, π_j ∈ [0, 1]; when the computed value exceeds 1, we set π_j = 1, in which case all bits of the user vector υ_i within the transmission range are used for RR and the personalized sampling RR degenerates into classical RR. According to the calculated sampling probability, the user samples each bit of his user bit vector and records the sampled bit values and their corresponding bit indexes in the bit subset Π_i (Line 10). After that, the user performs randomized response on the bit values of the sampled subset Π_i as in (1) with privacy budget ε, and then sends the bit indexes whose corresponding bit value is 1 to the collector (Lines 12-15). The third-party collector builds the adjacency matrix by duplicating and aligning the values at symmetric positions while padding the missing positions with zeros (Lines 17-19). The algorithm finally outputs the sanitized adjacency matrix M̃.
Algorithm 1 Personalized sampling randomized response.
In the above process, only one bit is processed at each pair of symmetric positions in the adjacency matrix, and each bit is perturbed only once, so the mechanism satisfies ε-edge LDP for the collector.
Finally, we discuss the proportion of real edges among the collected edges. When π_j ≤ 1, it is easy to see that the expected ratio of real edges of user u_i within the transmission range is mp / (mp + (t − m)π_j(1 − p)) = r, where p = e^ε / (1 + e^ε). All the bits within the transmission ranges together form an upper triangular matrix, so the expected ratio of real edges in the upper triangular matrix is r. Duplicating the upper triangular matrix to the lower triangular matrix, the expected ratio of real edges in the whole graph is r as well. However, because the sampling probability must satisfy π_j ∈ [0, 1], when some bits have sampling probability m·e^ε·(1 − r) / (r·(t − m)) > 1, or when t = m, we set π_j = 1, meaning that the expected ratio of real edges among the collected edges may exceed r. Computing the exact expected ratio of real edges among the collected edges requires the original neighbor number of each user; fortunately, the collector does not have it, so it is difficult for the collector to determine the true real-edge ratio.
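Under the assumption, made explicit here, that 1-bits are always sampled while each 0-bit is sampled with probability π_j (this matches the expectation mp / (mp + (t − m)π_j(1 − p)) above), the per-user client step of PRR can be sketched as follows. The helper name and the dict-based bookkeeping are our own simplifications of Algorithm 1:

```python
import math
import random

def prr_client(bits, eps, r, rng=None):
    """Sketch of one user's PRR step on the t bits inside his transmission
    range: 0-bits are sampled with pi = m*e^eps*(1-r) / (r*(t-m)) (clipped
    to 1), 1-bits with probability 1; every sampled bit then goes through
    classical RR with keep-probability p = e^eps / (1 + e^eps)."""
    rng = rng or random.Random()
    t, m = len(bits), sum(bits)
    p = math.exp(eps) / (1.0 + math.exp(eps))
    pi = 1.0 if t == m else min(1.0, m * math.exp(eps) * (1 - r) / (r * (t - m)))
    reported = {}
    for j, b in enumerate(bits):
        if b == 1 or rng.random() < pi:                     # personalized sampling
            reported[j] = b if rng.random() < p else 1 - b  # randomized response
    # Only indexes whose perturbed value is 1 are sent to the collector.
    return [j for j, v in reported.items() if v == 1]
```

In expectation, the collected edges then contain a fraction r of real edges, matching the analysis above; with a very large ε the sampling probability clips to 1 and the step reduces to classical RR.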

Community-based graph collection with personalized sampling randomized response
Many real-life graphs are low-rank [46], which shows that there are several obvious community partitions in a graph. Besides, community discovery and link prediction can be complementary problems [42], and other methods consider community character to enhance the performance of link prediction [49,50]. These works show that community properties are helpful for the downstream link prediction task, so the perturbed graph should keep community properties similar to those of the original graph.
Inspired by these works, we propose a community-based graph collection with personalized sampling randomized response mechanism (CPRR), detailed in Algorithm 2. Our algorithm takes the original user bit vectors υ_1, ..., υ_n, the total privacy budget ε, the privacy budget allocation coefficient α, and the expected ratio of real edges r as input. The algorithm consists of two rounds of graph collection. In the first round, our method obtains a preliminary community partition (Line 1): the collector computes the privacy budgets ε_1 = αε and ε_2 = (1 − α)ε and sends them to each user, uses the PRR proposed in Section 4.1 to obtain the adjacency matrix M̃_1, and then runs a community discovery algorithm on M̃_1, such as the Louvain community discovery algorithm [51], to obtain a preliminary community partition C = {c_1, c_2, ..., c_k}, where k is the number of communities. In this paper, we assume that each node belongs to exactly one community. The collector then sends the community partition C to each user. In the second round of interaction, the user first calculates the transmission range t (Lines 3-7). Then user u_i divides the user vector bits within the transmission range into the corresponding communities according to the received community partition C = {c_1, ..., c_k}, and counts his neighbor numbers δ_i = {δ^i_1, ..., δ^i_k} in each community within the transmission range and the element numbers S_i = {s^i_1, ..., s^i_k} of every community within the transmission range (Lines 8-11). Assuming that the community corresponding to bit b_j within the transmission range is c_j and s^i_{c_j} ≠ δ^i_{c_j}, the probability π^i_j of bit b_j being sampled is as follows:

π^i_j = δ^i_{c_j} · e^{ε_2} · (1 − r) / (r · (s^i_{c_j} − δ^i_{c_j})),

where ε_2 is the privacy budget used in the RR mechanism. If s^i_{c_j} = δ^i_{c_j}, then π^i_j = 1. The user calculates a sampling probability for each bit and samples a subset Π_i (Lines 12-16); our method sets the same sampling probability for all 0-bits that belong to the same community. According to Theorem 3, π^i_j ∈ [0, 1]; when the computed value exceeds 1, we also set π^i_j = 1, in which case all bits of the user bit vector υ_i are used for RR and CPRR degrades into classical RR. After the subset Π_i is obtained, the RR algorithm (1) is applied to the subset with privacy budget ε_2, and the perturbed result is sent to the collector (Line 16). The collector forms the adjacency matrix M̃_2 from the received bit indexes (Lines 17-19).
Because in each round of the graph collection process only one bit is processed at each pair of symmetric positions in the adjacency matrix, and each bit is perturbed only once, CPRR satisfies ε-edge LDP for the collector.
Finally, we also analyze the proportion of real edges among the collected edges. When π^i_j ≤ 1, the numbers of real edges and collected edges of user bit vector υ_i within the transmission range in community c_j are δ^i_{c_j} p and δ^i_{c_j} p + (s^i_{c_j} − δ^i_{c_j}) π^i_j (1 − p), respectively, where p = e^{ε_2} / (1 + e^{ε_2}). Then the expected ratio of real edges of one user bit vector υ_i within the transmission range is Σ_j δ^i_{c_j} p / Σ_j (δ^i_{c_j} p + (s^i_{c_j} − δ^i_{c_j}) π^i_j (1 − p)) = r, where m = Σ_j δ^i_{c_j} is the number of neighboring nodes of user u_i within the transmission range. All the bits within the transmission ranges together form an upper triangular matrix. Following the same analysis as in Section 4.1, we obtain the same result as in Section 4.1.
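The community-wise sampling step can be sketched as follows; this is a hypothetical helper that mirrors the counting and probability assignment of Algorithm 2, with dict-based bookkeeping of our own:

```python
import math

def cprr_sampling_probs(bits, community_of, eps2, r):
    """For each bit b_j in the transmission range, compute its sampling
    probability: 1-bits (and bits in communities where s_c = delta_c) get 1;
    a 0-bit in community c gets pi = delta_c*e^eps2*(1-r) / (r*(s_c-delta_c)),
    clipped to 1, where s_c counts range bits in community c and delta_c
    counts the user's neighbors in c."""
    s, delta = {}, {}
    for j, b in enumerate(bits):
        c = community_of[j]
        s[c] = s.get(c, 0) + 1
        delta[c] = delta.get(c, 0) + b
    probs = []
    for j, b in enumerate(bits):
        c = community_of[j]
        if b == 1 or s[c] == delta[c]:
            probs.append(1.0)
        else:
            probs.append(min(1.0, delta[c] * math.exp(eps2) * (1 - r)
                             / (r * (s[c] - delta[c]))))
    return probs

# Two communities: the user has 2 of 4 neighbors in c_0 but only 1 of 4 in
# c_1, so 0-bits in the denser community c_0 get a higher sampling probability.
probs = cprr_sampling_probs([1, 1, 0, 0, 0, 0, 0, 1],
                            [0, 0, 0, 0, 1, 1, 1, 1], eps2=0.1, r=0.9)
```

This illustrates the key difference from PRR: noise is allocated per community, so regions where the user has more neighbors contribute proportionally more sampled 0-bits.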

Privacy analysis
To prove that the proposed mechanism satisfies the edge LDP of Definition 2, we first introduce two results. Because υ and υ′ are any neighboring user bit vectors, [15] proved that when the randomized response mechanism with privacy budget ε is applied to υ, the perturbed result satisfies ε-edge LDP. We use o ∈ Range(RR) to represent any possible output of RR. So the following Theorem 2 is established in terms of the definition of edge LDP.

Theorem 2 [15] For any neighboring user bit vectors υ and υ′ and any output o ∈ Range(RR), the randomized response mechanism with privacy budget ε satisfies Pr[RR(υ) = o] ≤ e^ε · Pr[RR(υ′) = o], i.e., RR satisfies ε-edge LDP.

Theorem 3 Sample on υ according to the sampling probability to obtain the subset Π, and then apply the RR mechanism with privacy budget ε to the sampled subset. The resulting personalized sampling and RR mechanism satisfies ε-edge LDP.
Proof To prove that the personalized sampling and RR mechanism satisfies ε-edge LDP, given any output s ∈ Range(PRR), we need to show:

Pr[PRR(υ) = s] ≤ e^ε · Pr[PRR(υ′) = s].

When the edge τ in which the neighboring user bit vectors υ and υ′ differ is sampled, the sampled results Π and Π′ are also neighbors that differ in edge τ. According to Theorem 2, the following inequality is obtained:

Pr[RR(Π) = s] ≤ e^ε · Pr[RR(Π′) = s].

And when ε > 0, the inequality e^ε > 1 holds. So we have

Pr[PRR(υ) = s] ≤ e^ε · Pr[PRR(υ′) = s],

and by symmetry,

Pr[PRR(υ′) = s] ≤ e^ε · Pr[PRR(υ) = s].

So the personalized sampling and RR mechanism satisfies ε-edge LDP.

Theorem 4 Community-based graph collection with Personalized sampling Randomized Response (CPRR) satisfies ε-edge LDP.
Proof The total privacy budget ε is separated into two parts ε_1 and ε_2 with ε = ε_1 + ε_2. Since ε_1 and ε_2 are assigned to the first-round and second-round collections of CPRR, respectively, by the sequential composition of edge LDP, the community-based graph collection with PRR satisfies ε-edge LDP.

Complexity analysis
First, we investigate the computational complexity of the graph collection process, excluding the later link prediction process. Both PRR and CPRR have low computational complexity on the client side: O(n) each. The computational complexity on the collector side is O(n^2) for PRR and O(n log n + n^2) for CPRR. For the previous methods, LDPGen [15] is O(n^2 + n(k_0 + k_1)) and RABV [8] is O(n^2). Compared with these methods, our method has similar computational complexity but significantly better link prediction accuracy.
Second, we analyze the communication complexity, taking the communication width between a user and the collector as the measure. PRR requires only one interaction with the collector, with communication overhead O(n). Although CPRR needs to communicate with the collector twice, its communication overhead is still O(n). The communication overhead of both LDPGen and RABV is also O(n). Thus, our method achieves better link prediction performance with similar communication overhead.

Discussion
The expected ratio r of real edges in the collected graph is a manually set hyperparameter. Although a larger r reduces the number of added fake edges, it also gives the attacker greater confidence to infer the real edges from the collected edges. We suggest setting r to at most 0.5. At the same time, because the collector has no knowledge of a user's real neighbor number, the collector does not know the accurate proportion of real edges among the collected edges but only knows r. In addition, when a very large privacy budget is used, our method fully degenerates into RR on the whole graph, and r loses its effect.
In the two rounds of graph data collection, the allocation coefficient α of the privacy budget is also a manually set hyperparameter. Because finding the ideal privacy budget allocation coefficient is non-trivial, we empirically use 10% of the privacy budget for the first round of data collection and the remaining budget for the second round. Experiments show that this setting achieves good results on most datasets with common link prediction algorithms. A more granular privacy allocation strategy is left for future work.
The Louvain community discovery algorithm [51] is used as the default community discovery algorithm for its low computational complexity, while other robust community discovery algorithms may perform better. The impact of different community discovery algorithms on the method is also left as future work.

Experiments
We implement CPRR and perform a series of comprehensive experiments to study its effectiveness. In particular, we set up three baselines for comparison, i.e., RABV [8], LDPGen [15], and using the PRR algorithm to collect graph data in only one round of interaction. To evaluate the accuracy of the collected graph data on the LP task, we select four common link prediction algorithms B: Common Neighbors (CN), Katz [52], Node2Vec [53], and SEAL [16]. The first two are heuristic methods and the latter two are supervised learning methods. We set the damping factor to 0.001 for the Katz method. The embedding dimension of Node2Vec is 80, and the other hyperparameters are set as in the original paper. Generally, we do not know which edges will appear in the future, so it is difficult to do link prediction tests [52]. To validate the performance of the LP task, we randomly separate the currently observed edge set E into two parts: the training set E_T, which is perturbed by the local users with the LDP algorithm before the data collector collects all perturbed local user data and computes edge existence probabilities with a non-private LP algorithm, and the probing set E_P, which is not perturbed and is used only to test the accuracy of the LP algorithm. Formally, E_T ∪ E_P = E and E_T ∩ E_P = ∅. Following the settings of [16], for each experiment we report the average result over 10 graph collections. We randomly choose 10% of the existing edges of the original graph as positive probing data and sample the same number of nonexistent edges as negative probing data, thereby constructing the final probing set E_P. The remaining edges constitute the training set E_T and the corresponding training graph G_T = (V, E_T), which is perturbed under LDP to obtain the privacy-protected graph G̃_T. The experiments use 4 real-life social networks that are commonly used for link prediction.
• USAir [54]: a network of US Air lines including 332 nodes and 2,126 edges.
• NS [55]: a collaboration network of researchers in network science including 1,589 nodes and 2,742 edges.
• PB [56]: a network of US political blogs including 1,222 nodes and 16,714 edges.
• Facebook [57]: a snapshot of a part of Facebook's social network including 4,039 nodes and 88,234 edges.

Evaluation metrics
We adopt the common metric AUC [52] to measure the performance of LP. AUC ranks the scores of all currently unobserved edges, and can be interpreted as the probability that a randomly chosen missing edge (i.e., an edge in E_P) is assigned a higher score than a randomly chosen nonexistent edge (i.e., an edge in U − E).
Definition 3 Running a link prediction algorithm on the set of currently unobserved edges, its results determine the AUC as follows. We compare missing edges from E_P against nonexistent edges, giving z independent comparisons. If there are x comparisons in which the missing edge scores higher than the nonexistent edge, and y comparisons in which the scores are equal, then the AUC is:

AUC = (x + 0.5y) / z.

If all scores are drawn from an independent and identical distribution, the AUC will be about 0.5. Thus, the more the AUC value exceeds 0.5, the better the performance compared to random choice.
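The definition above translates directly into code (scores here are illustrative; any LP algorithm's scores can be plugged in):

```python
# A direct implementation of the AUC definition: compare scores of missing
# edges (in E_P) against nonexistent edges; AUC = (x + 0.5*y) / z.
def auc_from_scores(missing_scores, nonexistent_scores):
    x = y = 0
    z = len(missing_scores) * len(nonexistent_scores)  # all pairwise comparisons
    for s_m in missing_scores:
        for s_n in nonexistent_scores:
            if s_m > s_n:
                x += 1
            elif s_m == s_n:
                y += 1
    return (x + 0.5 * y) / z

# A perfect predictor ranks every missing edge above every nonexistent one.
assert auc_from_scores([3, 4], [1, 2]) == 1.0
# Identically distributed scores give AUC = 0.5.
assert auc_from_scores([1, 2], [1, 2]) == 0.5
```

In practice the z comparisons are often sampled rather than fully enumerated; the exhaustive loop here is the literal form of the definition.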

Results
We first compare the AUC of our CPRR and the baseline graph collection algorithms on different datasets and different link prediction algorithms under the given privacy budget ε = 0.1. We set the same expectation of the real edge ratio r = 0.5 for CPRR and PRR. Meanwhile, the privacy allocation coefficient of CPRR is set to α = 0.1. The results in Table 1 show that CPRR outperforms the competing LDP protocols for all four link prediction algorithms CN, Katz, Node2Vec, and SEAL. For each link prediction algorithm, CPRR and PRR significantly outperform RABV and LDPGen when ε is small. This is because CPRR and PRR efficiently reduce the addition of fake edges while retaining fine-grained graph structure. CPRR is not significantly better than PRR here, because the two add a similar number of fake edges, and CPRR improves data utility only by adjusting the noise allocation across different regions of the collected graph.
Next, we study how the AUC of all algorithms varies as the privacy budget ε increases. Figure 2 summarizes the results over the 4 real datasets, with privacy budgets from 0.1 to 7. Our method outperforms the baselines for almost all privacy budgets. As ε increases, the AUC of CPRR, PRR, and RABV increases until ε reaches a reasonable value, e.g., ε = 6; beyond that, the AUC becomes very close to the real AUC and does not increase further. When ε is small, the AUC of LDPGen improves with ε, and when ε is large the AUC of LDPGen becomes stable. This is because LDPGen adds Laplace noise to the degree vector: when the privacy budget is large enough, say ε > 2, the added noise has little effect on the degree vector, and LDPGen does not fully use a large privacy budget. In addition, in Figure 2(b), CPRR and PRR perform similarly when ε is small, while CPRR is significantly better than PRR when ε is large. This is because running the Louvain community discovery algorithm on the NS dataset yields a partition with extremely high modularity, indicating that NS has a very obvious community structure. However, the partition contains very few nodes per community and a large number of communities, which does not meet the low-rank requirement of the CPRR algorithm. Therefore, when ε is small, the community discovery result after the first round leads to the merger of small communities, producing a large gap between the discovered communities and the real communities. As a result, CPRR cannot effectively obtain the community structure and degrades into PRR. When the privacy budget is large, a better approximation of the real community partition can be obtained through the first round of collection; based on this partition, CPRR retains a better graph structure than PRR.

Figure 2 Performance (AUC) with different ε for the candidate link prediction algorithms Katz and SEAL. The first row is Katz and the second row is SEAL. We set α = 0.1 and r = 0.5 for CPRR, and r = 0.5 for PRR.

Figure 3 Performance (AUC) with different expectations of the real edge ratio r and different privacy budgets. The link prediction algorithm is Katz with privacy allocation coefficient α = 0.1.
Then we discuss the effect of different values of the expectation of the real edge ratio r and the privacy budget allocation coefficient α on the performance (AUC) of link prediction with CPRR.
Figure 3 shows that the AUC of link prediction gradually improves as the expectation of the real edge ratio r increases, across different datasets and privacy budgets. To reduce an attacker's confidence in inferring real edges from observed edges, we suggest setting a small expectation of the real edge ratio, provided that performance does not drop too much.
Figure 4 presents the performance of link prediction with different privacy allocation coefficients. The link prediction performance fluctuates as the privacy budget allocation coefficient changes, and the best-performing coefficient differs across datasets and privacy budgets.

Conclusion
Graph data is widely used in real life, and collected graph data supports various analyses, among which the LP task is one of the most significant applications. However, graph data contains users' sensitive connection information, and directly perturbing this information can expose user privacy. In this paper, we propose two personalized sampling randomized response techniques, CPRR and PRR, to collect user connection information and then use the collected information for link prediction analysis. Both proposed mechanisms achieve edge LDP as a privacy guarantee. The proposed CPRR algorithm effectively improves the accuracy of LP tasks by reducing the generation of fake edges and preserving the edge distribution characteristics. The effectiveness and superiority of the method are demonstrated by comprehensive experiments on various datasets.
To the best of our knowledge, we are the first to optimize LP performance on graphs while achieving LDP protection in the decentralized scenario. In future work, we intend to extend our method further, for example, by ensuring that the collected graph and the original graph have more similar community partitions, or by reducing the communication overhead. We will also try to extend our personalized sampling to other special graph types, such as attributed graphs, and to other local differential privacy scenarios, such as mean estimation and multidimensional data collection. In addition, we intend to employ stronger privacy guarantees, for example, node differential privacy.

Figure 1
Figure 1 Illustration of the PRR mechanism. The left side is an original graph and the right side is the corresponding adjacency matrix.

The user bit vector is υ = {b_1, . . ., b_n} with privacy budget ε. Assume υ = {b_1, . . ., b_n} and υ′ = {b′_1, . . ., b′_n} are neighboring databases that differ in only one edge τ. Without loss of generality, suppose b_n ≠ b′_n for υ and υ′. PRR first performs personalized sampling to select a subset of positions and then applies the randomized response (RR) on the sampled subset.
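The sample-then-perturb idea behind PRR can be sketched as follows. This is a minimal illustration, not the paper's exact mechanism: the per-edge sampling probability `p_sample`, the treatment of unsampled positions, and the RR flip probability derived from ε are all simplifying assumptions here.

```python
# Illustrative sketch of sample-then-RR on a user's adjacency bit vector.
import math
import random

def prr_sketch(bits, epsilon, p_sample=0.5, rng=None):
    """Sample each position, then apply randomized response to the sampled ones."""
    rng = rng or random.Random()
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1)  # RR "truthful" prob.
    out = []
    for b in bits:
        if rng.random() >= p_sample:
            out.append(0)          # unsampled position: report no edge (assumed)
        elif rng.random() < p_keep:
            out.append(b)          # report truthfully
        else:
            out.append(1 - b)      # flip the bit
    return out

noisy = prr_sketch([1, 0, 1, 1, 0], epsilon=1.0, rng=random.Random(0))
```

The point of sampling before RR is that fewer positions are perturbed, which reduces the number of fake edges injected compared with running RR over the full bit vector.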

Figure 4
Figure 4 Performance (AUC) with different privacy allocation coefficients α and different privacy budgets. The link prediction algorithm is Katz with the expectation of the real edge ratio r = 0.5.

Table 1
Performance (AUC) for the non-private LP algorithm and four different LDP protocols, i.e., RABV, LDPGen, PRR, and CPRR. The privacy budget is ε = 0.1 and r = 0.5. The bold numbers correspond to the best-performing LDP algorithm.