Measuring The Impact of Spatial Perturbations on The Relationship Between Data Privacy and Validity of Descriptive Statistics
Background: Like many scientific fields, epidemiology is addressing issues of research reproducibility. Spatial epidemiology, which often uses the inherently identifiable variable of participant address, must balance reproducibility with participant privacy. In this study, we assess the impact of several different data perturbation methods on key spatial statistics and patient privacy.
Methods: We analyzed the impact of perturbation on spatial patterns in the full set of address-level mortality data from Lawrence, MA, for the period from 1911 to 1913. The original death locations were perturbed using seven different published approaches to stochastic and deterministic spatial data anonymization. Key spatial descriptive statistics were calculated for each perturbation, including changes in spatial pattern center, Global Moran’s I, Local Moran’s I, distance to the k-th nearest neighbors, and the L-function (a normalized form of Ripley’s K). A spatially adapted form of k-anonymity was used to measure the privacy protection conferred by each method and its compliance with HIPAA privacy standards.
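To make the masking families named above concrete, the following is a minimal Python sketch of donut masking, random perturbation (treated here as a donut with no inner radius), and grid center masking. It is an illustration under stated assumptions, not the implementation used in this study: the function names `donut_mask` and `grid_center_mask` are hypothetical, coordinates are assumed to be projected and expressed in meters, and displacement is assumed to be uniform over the area of the ring.

```python
import math
import random


def donut_mask(points, r_min, r_max, rng):
    """Displace each (x, y) point by a random distance in [r_min, r_max).

    Assumption: displacement is uniform over the area of the ring, so the
    radius is drawn via the square-root transform rather than uniformly.
    Setting r_min = 0 gives plain random perturbation within r_max.
    """
    masked = []
    for x, y in points:
        u = rng.random()
        r = math.sqrt(u * (r_max ** 2 - r_min ** 2) + r_min ** 2)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        masked.append((x + r * math.cos(theta), y + r * math.sin(theta)))
    return masked


def grid_center_mask(points, cell):
    """Snap each point to the center of its square grid cell of side `cell`."""
    return [
        ((math.floor(x / cell) + 0.5) * cell, (math.floor(y / cell) + 0.5) * cell)
        for x, y in points
    ]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the illustration is repeatable
    deaths = [(120.0, 260.0), (480.0, 35.0)]  # made-up coordinates in meters
    print(donut_mask(deaths, 5, 50, rng))     # donut masking, 5 to 50 meters
    print(donut_mask(deaths, 0, 50, rng))     # random perturbation at 50 meters
    print(grid_center_mask(deaths, 100))      # 100 x 100 meter grid cells
```

Grid center masking is deterministic, so every point in a cell collapses onto one location, whereas the donut variant guarantees a minimum displacement but preserves local scatter.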
Results: Random perturbation at 50 meters, donut masking between 5 and 50 meters, and Voronoi masking maintained the validity of descriptive spatial statistics better than other perturbations. Grid center masking with both 100 × 100 and 250 × 250 meter cells led to large changes in descriptive spatial statistics. None of the perturbation methods adhered to the HIPAA standard that all points have a k-anonymity > 10. All other perturbation methods employed had at least 265 points, or over 6%, not adhering to the HIPAA standard.
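The k-anonymity check reported above can be sketched as follows. The exact spatial adaptation used in this study is not spelled out in the abstract, so this sketch assumes one common definition: a point's k is the number of original locations that lie at least as close to its masked location as its own true location does (the true location itself included). The function names are hypothetical; the threshold of 10 follows the "k-anonymity > 10" criterion stated in the text.

```python
import math


def spatial_k_anonymity(original, masked):
    """Return one k value per point under the assumed definition.

    Assumption: k counts original locations within the displacement
    distance of the masked location, including the true location.
    """
    ks = []
    for (ox, oy), (mx, my) in zip(original, masked):
        d = math.hypot(ox - mx, oy - my)  # how far this point was moved
        k = sum(
            1 for px, py in original
            if math.hypot(px - mx, py - my) <= d
        )
        ks.append(k)
    return ks


def count_not_adhering(ks, threshold=10):
    """Count and share of points whose k does not exceed the threshold."""
    failing = sum(1 for k in ks if k <= threshold)
    return failing, failing / len(ks)


if __name__ == "__main__":
    # Made-up coordinates in meters, for illustration only.
    original = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    masked = [(1.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    ks = spatial_k_anonymity(original, masked)
    print(ks, count_not_adhering(ks))
```

Under this definition a point that is not moved at all has k = 1, which is why deterministic methods that leave isolated points in place tend to fail a strict per-point threshold.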
Conclusions: Using the set of published perturbation methods applied in this analysis, HIPAA-compliant de-identification was not compatible with maintaining key spatial patterns as measured by our chosen summary statistics. Further research should investigate alternative methods for balancing the tradeoffs between spatial data privacy and the preservation of key patterns in public health data that are of scientific and medical importance.
Figure 1
Figure 2
Figure 3
Posted 18 Sep, 2020