Background: Distributed statistical analyses provide a promising approach for privacy protection when analyzing data spread across several databases. This approach brings the analysis to the data rather than the data to the analysis. Instead of operating directly on individual-level data, the analyst receives anonymized summary statistics, which are combined into an aggregated result. Furthermore, in model development it is essential to evaluate a trained model with respect to its prognostic or predictive performance. For binary classification, one standard technique is analysis of the receiver operating characteristic (ROC). Hence, we are interested in calculating the ROC curve and the area under the curve (AUC) for a binary classification task using a distributed and privacy-preserving approach.
Methods: We employ DataSHIELD as the technology to carry out distributed analyses, and we use a newly developed algorithm to validate the prediction score by conducting a distributed and privacy-preserving ROC analysis. Calibration curves are constructed from mean values over sites. The determination of the ROC curve and its AUC is based on a generalized linear model (GLM) approximation of the ROC curve, the ROC-GLM, as well as on ideas from differential privacy (DP). DP adds noise, quantified by the ℓ2 sensitivity Δ2(f̂), to the data. The appropriate choice of the ℓ2 sensitivity was studied by simulations.
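As an illustration of the DP step described above, the following sketch applies the standard Gaussian mechanism, where the noise scale is calibrated to the ℓ2 sensitivity Δ2(f̂) and the privacy parameters (ε, δ). This is not the paper's implementation; the function name and parameter values are hypothetical.

```python
import numpy as np

def gaussian_mechanism(values, l2_sensitivity, epsilon, delta, rng=None):
    """Release `values` with (epsilon, delta)-DP via the Gaussian mechanism.

    The noise scale follows the classic analytic bound
        sigma = Delta_2 * sqrt(2 * ln(1.25 / delta)) / epsilon,
    so larger sensitivity or stricter privacy budgets yield more noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    values = np.asarray(values, dtype=float)
    return values + rng.normal(loc=0.0, scale=sigma, size=values.shape)

# Hypothetical usage: perturb prediction scores before they leave a site.
scores = np.array([0.12, 0.55, 0.87])
noisy_scores = gaussian_mechanism(scores, l2_sensitivity=0.05,
                                  epsilon=1.0, delta=1e-5)
```

In this setup, a model with higher sensitivity Δ2(f̂) forces a larger σ, which illustrates why the approximation error of the distributed AUC grows with the sensitivity.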
Results: In our simulation scenario, the true and distributed AUC measures differ by ΔAUC < 0.01, depending on the choice of the differential privacy parameters. When Δ2(f̂) > 0.07, we recommend checking the accuracy of the distributed AUC estimator in dedicated simulation scenarios, since it may be impaired by too much artificial noise added by DP.
Conclusions: The applicability of our algorithms depends on the sensitivity of the underlying statistical/predictive model. The simulations carried out show that the approximation error is acceptable for the majority of simulated cases. For models with high sensitivity, the privacy parameters must be increased accordingly to ensure sufficient privacy protection, which in turn increases the approximation error. This work shows that complex measures, such as the AUC, can be used for validation in distributed setups while preserving an individual's privacy.