Re-identification risk prediction paradigm using incomplete statistical information and recursive hypergeometric distribution

Abstract


Today we are living in an era of data explosion 1. We have easier access to information services than at any time in history, but we also face unprecedented privacy risks because service providers are extremely likely to know us better than we know ourselves 2,3,4. Although service providers often allege that they have to collect as much personal data as possible to improve user experience, they fail to properly protect user privacy 5,6. In this regard, the governments of many countries have promulgated privacy protection laws, such as the General Data Protection Regulation 7 (GDPR) in Europe and the Personal Information Security Specification (PISS) in China 8. PISS emphasizes that all collected personal data should be immediately de-identified and stored separately from their profile data 9,10. However, even after de-identification, anonymized personal data still face re-identification risk and are vulnerable to linkage attacks launched by either honest-but-curious data collectors or malicious hackers 11,12. Therefore, the re-identification risk of individual data not only reflects the privacy risk level of individuals but also supports regulators in formulating privacy protection policies. Beyond this, it is difficult for individuals and regulatory agencies to obtain the complete dataset maintained by service providers, and they can only infer the re-identification risk from the released incomplete statistical information.

The re-identification risk of an individual is closely related to her/his tuple frequency. The tuple frequency is defined as the count of a specific data value combination, where a high tuple frequency signifies a low re-identification risk. If an attacker has sufficient background knowledge for the linkage attack, individuals with unique data records will be re-identified with 100% probability. Therefore, the uniqueness of individual data has attracted extensive research attention 13. According to the 1990 and 2000 U.S. census data releases, it takes only three attributes, namely the date of birth, gender, and zip code, to uniquely identify 87% and 63% of the population, respectively 14,15. De Montjoye et al. found that it takes only four spatiotemporal points in trajectory data to uniquely identify 95% of the individuals in a location dataset and 90% in a credit card dataset 16,17. By exploiting the uniqueness contained in sampled data records or the statistical characteristics of datasets, a latent attacker can measure the uniqueness of individuals given incomplete statistical information 18 and even recover the original personal data 19. However, using uniqueness to describe the re-identification risk is sometimes inaccurate, because non-unique data records can still be exploited to re-identify individuals from anonymized datasets with a certain probability 20.
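As a minimal illustration of tuple frequency and uniqueness, the following Python sketch counts value combinations in a toy table. The records, the attribute choice (birth year, gender, zip code), and the function names are hypothetical and are not taken from the datasets cited above.

```python
from collections import Counter

# Hypothetical toy table: each row is one record over
# (birth_year, gender, zip_code). Values are made up for illustration.
records = [
    (1980, "F", "10001"),
    (1980, "F", "10001"),
    (1975, "M", "10002"),
    (1990, "F", "10003"),
    (1990, "F", "10003"),
    (1990, "F", "10003"),
]


def tuple_frequencies(rows):
    """Count how often each value combination (tuple) occurs in the table."""
    return Counter(rows)


def unique_fraction(rows):
    """Fraction of records whose tuple occurs exactly once in the table."""
    freq = tuple_frequencies(rows)
    return sum(1 for r in rows if freq[r] == 1) / len(rows)


freq = tuple_frequencies(records)
uniq = unique_fraction(records)
```

Here only the record `(1975, "M", "10002")` is unique, so it is the one that a linkage attack with full background knowledge would re-identify with certainty.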

The attribute dependence of experimental datasets.

Inspired by k-anonymity 21,22,23, we propose to leverage k-indistinguishability as an indicator of the re-identification risk of individuals. If the tuple frequency of an individual in an anonymized dataset is not less than k, then this individual is k-indistinguishable. If the probability of a specific individual being k-indistinguishable can be derived for $k = 2, 3, \ldots$, one can gain a more comprehensive understanding of her/his re-identification risk. Unfortunately, given incomplete dataset information, state-of-the-art privacy risk research cannot determine the probability of an individual being k-indistinguishable when $k \geq 2$. In light of this, this paper presents how to accurately predict the re-identification risk for a given individual with only incomplete statistical information about the target dataset. Specifically, given some statistical information, the probability mass function (PMF) of the recursive hypergeometric (RH) distribution can be used to estimate the frequency of the tuples containing strongly dependent attribute pairs. In real-world applications, an approximation of the RH distribution is employed to calculate the tuple frequency in an anonymized dataset for computational efficiency, and to further derive the probability of an individual being k-indistinguishable in the target dataset. Our experiments use random 24, demographic 25, medical 13, and educational 26 datasets, and the results show that, for all involved datasets, the average AUC of our proposed TFRR is 0.86 to 0.98, suggesting a high prediction accuracy. For datasets containing strongly dependent attribute pairs, the value dependence knowledge is introduced to rectify the prediction results, and the average AUC reaches 0.95 to 0.98.
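The definition of k-indistinguishability can be illustrated on a fully observed toy dataset. The sketch below is only an empirical check under the assumption that the whole table is available; it does not implement the prediction from incomplete statistical information proposed in this paper, and all names and values are hypothetical.

```python
from collections import Counter

# Hypothetical anonymized records, already reduced to one label per tuple.
rows = ["a", "a", "b", "c", "c", "c"]


def is_k_indistinguishable(rows, record, k):
    """True if the tuple frequency of `record` in `rows` is not less than k."""
    return Counter(rows)[record] >= k


def k_indistinguishable_share(rows, k):
    """Fraction of records in `rows` that are k-indistinguishable."""
    freq = Counter(rows)
    return sum(1 for r in rows if freq[r] >= k) / len(rows)


# Share of k-indistinguishable records for k = 2 and k = 3.
share_2 = k_indistinguishable_share(rows, 2)
share_3 = k_indistinguishable_share(rows, 3)
```

Evaluating the share for several values of k gives the kind of graded risk profile that uniqueness alone (frequency equal to one) cannot provide.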
Our research reveals a general rule determining the distribution of the tuple frequency, which is applicable to all random datasets and most real-world datasets and provides a concise yet effective tool for predicting the re-identification risk of anonymized datasets. With the incomplete statistical information of the target dataset, both individuals and regulators can easily use this tool to predict the re-identification risk. Beyond this function, one can even predict the re-identification risk of submitting data to service providers according to their published data formats, statistical information, and privacy protection plans, and accordingly question whether they obey the existing privacy protection laws, so as to foresee and prevent privacy threats.

Consider a dataset D, a table with columns representing attributes and rows representing data records.

Each cell in the table maintains the value of a particular attribute of a particular data record. A tuple is defined as an ordered list drawing one value per attribute; the tuples enumerate all possible cases of data records in D, some of which may not appear in D. From the perspective of probability theory, the frequency of a specific tuple in a target dataset follows the RH distribution (see Methods). However, the dependence between the values in the tuple will affect the tuple frequency distribution. Therefore, we define value dependence (see Methods) to describe the dependence between the value pairs of a tuple, and use the value dependence knowledge of a particular tuple to rectify the prediction results. To gain a general understanding of the dependence between an attribute pair, we define the attribute dependence and analyze the dependence between each attribute pair in the experimental datasets (including random and real-world datasets).

The attribute dependence profiles an asymmetric relation between two attributes. The dependence of attribute B on attribute A can be calculated as follows,

The approximate RH distribution.

Because of the computational complexity of the RH distribution, we seek an approximate distribution to reduce the computational burden. According to the analysis in Methods, we find that, when the condition given there holds, the binomial distribution can be employed to approximate the RH distribution. To have a clearer understanding of the difference between them, we randomly select many tuples from the random datasets and use the PMFs of the two distributions to calculate the occurrence probability of these tuples. The maximum probability distance (MPD) is used to measure the difference between the binomial distribution and the RH distribution, defined as,

We use the same 64 parameter sets as in the previous experiment to generate random datasets, and randomly select 1000 tuples from each dataset. The MPD between the binomial distribution and the RH distribution is shown in Fig. 3.
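The idea of comparing the two PMFs can be sketched in a few lines of Python (used here for illustration only). Two assumptions are made loudly: first, the ordinary hypergeometric distribution stands in for the RH distribution, whose recursion is not reproduced here; second, the MPD is taken to be the largest absolute difference between the two PMFs over all outcomes, which may differ from the exact definition used in the paper. The parameter names N, K, and n are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, N, K, n):
    """P(X = x) when drawing n items without replacement from N items,
    K of which are marked (stand-in for the RH distribution)."""
    if x < max(0, n - (N - K)) or x > min(n, K):
        return 0.0
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)


def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def max_probability_distance(N, K, n):
    """Assumed MPD: max over x of |hypergeometric PMF - binomial PMF|,
    with the binomial success probability set to p = K / N."""
    p = K / N
    return max(
        abs(hypergeom_pmf(x, N, K, n) - binom_pmf(x, n, p))
        for x in range(n + 1)
    )


# Same marked fraction (0.2) and sample size, small vs. large population.
mpd_small = max_probability_distance(N=50, K=10, n=20)
mpd_large = max_probability_distance(N=5000, K=1000, n=20)
```

Under these assumptions the distance shrinks as the population grows relative to the sample, which is the regime in which a binomial approximation is usually considered safe.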

The possibility of an individual being k-indistinguishable in random datasets.

We randomly select the data records of 1000 individuals from 64 random datasets and use Eq. 11 to estimate the possibility of these individuals being k-indistinguishable (see Methods for details). The result of binary classification is shown in Fig. 4.

The possibility of an individual being k-indistinguishable in real-world datasets.

The knowledge of attribute dependence and value dependence can reveal the internal relation between different attributes and values of data records, which are important indicators for the value distribution of data 29. Existing privacy protection methods, such as differential privacy 30,31,32, can hide the original data while ensuring their availability by adding random noise regularly. Although we can obtain useful information, such as the tuple frequency distribution and top-k data, from the de-identified dataset 33, the value and utilization of the de-identified dataset are substantially reduced due to the impact of the added noise on the attribute dependence and value dependence 34. Therefore, we plan to study how to customize the differential privacy budgets and noise generation methods according to the predicted re-identification risk for specific individuals, and how to maximize the preservation of the attribute dependence and value dependence information. In addition, trajectory datasets typically contain temporal and sequential location data with strong attribute dependence 35, which makes the binomial approximation ineffective in privacy risk prediction. This also points out a new research direction for future work.

Consider that $X_j$ ($1 \leq j \leq d$) follows the $j$-RH distribution characterized by $N, j, n_1, \ldots, n_j$. When $j \geq 2$, the PMF of the $j$-RH distribution can be obtained recursively as follows,

Equation 4 can be interpreted as that, given that sub-tuple

Therefore, we consider that the hypergeometric distribution is only a special case of the d-RH distribution.

The binomial approximation of the d-RH distribution.

x is as follows,

Then the probability of record r matching tuple $x_i$ for k rounds is as follows,

called a strongly dependent value pair, and the threshold is set to 0.5 in this paper.

The frequency distribution of tuples with strongly dependent value pairs.

The physical significance of the d-RH distribution can be summarized as follows. Let

The frequency distribution of x in D can be approximated by $B(n, p')$.
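Once a tuple frequency is approximated by a binomial distribution, the probability that the frequency reaches at least k is a binomial tail sum. The Python sketch below computes that tail for hypothetical values of n and p. It is an illustration of the general idea only and is not the formula used in this paper for k-indistinguishability (Eq. 11), which may, for example, condition on the individual's own record already being present in the dataset.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p), computed as 1 - P(X <= k - 1)."""
    below = sum(
        comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(min(k, n + 1))
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - below)


# Hypothetical parameters: n records, each matching the tuple with probability p.
n, p = 1000, 0.002
tail_by_k = {k: binom_tail(k, n, p) for k in (1, 2, 3, 4, 5)}
```

The tail probability is non-increasing in k, so a tuple that is likely to occur at least twice may still be unlikely to occur five times, which is why a single uniqueness figure understates the range of risk levels.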

The probability of a specific individual being k-indistinguishable.

Then the probability of p being k-indistinguishable when $k \geq 2$ can be calculated as follows

All simulations were implemented in MATLAB. The source code to reproduce the experiments will be deposited in Code Ocean or GitHub.