End-to-End Pseudonymization of Fine-Tuned Clinical BERT Models

doi:10.21203/rs.3.rs-3302707/v1

Research Article

https://doi.org/10.21203/rs.3.rs-3302707/v1

This work is licensed under a CC BY 4.0 License

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models have a very large number of parameters, which are tuned on vast amounts of training data. Together, these factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is a cause for concern, especially when the models are applied in the clinical domain, where data are highly sensitive. One privacy-preserving technique that aims to mitigate these problems is training data pseudonymization, which automatically identifies sensitive entities and replaces them with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks.

This study evaluates how end-to-end pseudonymization of clinical BERT models affects predictive performance on five clinical NLP tasks, compared to pre-training and fine-tuning on unaltered sensitive data. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also show no deterioration from end-to-end pseudonymization of both pre-training and fine-tuning data. These results demonstrate that training data can be pseudonymized to reduce privacy risks without harming its utility for training PLMs.
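
As a rough illustration of the replacement step described above, the following minimal sketch swaps already-detected sensitive spans for surrogates of the same type. It is not the authors' pipeline: the function `pseudonymize`, the `SURROGATES` pools, the label names, and the example note are all hypothetical, and the entity spans are assumed to come from an upstream detection step that is not shown.

```python
import random

# Hypothetical surrogate pools; a real system would draw from much larger lists.
SURROGATES = {
    "FIRST_NAME": ["Anna", "Erik", "Maria", "Lars"],
    "LOCATION": ["Uppsala", "Lund", "Umeå"],
}


def pseudonymize(text, entities, seed=0):
    """Replace each (start, end, label) span with a surrogate of the same label.

    `entities` is assumed to be non-overlapping output of an upstream
    entity-detection step (not shown here).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # copy the non-sensitive text
        original = text[start:end]
        # Never pick a surrogate identical to the original entity.
        pool = [s for s in SURROGATES.get(label, []) if s != original]
        # Fall back to a placeholder tag when no surrogate pool exists.
        pieces.append(rng.choice(pool) if pool else f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Patient Karin was admitted in Stockholm."
spans = [(8, 13, "FIRST_NAME"), (30, 39, "LOCATION")]
result = pseudonymize(note, spans)
print(result)
```

Replacing entities with realistic surrogates rather than masking them keeps the text looking like ordinary clinical prose, which is the property the abstract highlights; the placeholder fallback here is only a convenience of the sketch.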

Natural language processing

language models

BERT

electronic health records

clinical text

de-identification

pseudonymization

privacy preservation

Swedish

No competing interests reported.

SupplementaryInformationforEndtoendpseudonymizedclinicalNLPBMCHealthinformationprivacyandsecurity.pdf

Editorial decision: Revision requested
20 Mar, 2024
Reviews received at journal
16 Mar, 2024
Reviews received at journal
08 Feb, 2024
Reviews received at journal
05 Feb, 2024
Reviewers agreed at journal
31 Jan, 2024
Reviewers agreed at journal
31 Jan, 2024
Reviewers invited by journal
30 Jan, 2024
Editor assigned by journal
12 Sep, 2023
Submission checks completed at journal
08 Sep, 2023
First submitted to journal
28 Aug, 2023

Status:

Version 1
