Figure 1 illustrates the research process. Of the 11,181,617 radiology reports in the SNUBH database, we randomly selected and annotated 0.1% (11,182 documents). Among the annotated documents, 1,112 contained PHI. Based on the annotated documents, we built 51 regular expression rules, as described in the following sections. The constructed regular expression rules were used as pseudo-labelers to generate training data for fine-tuning the BERT-based models. For the validation data, we prepared 342 notes and labeled them at the token level. The performance of each method on notes from other departments was measured separately on 12 functional test reports from the Department of Neurology, the Department of Rehabilitation Medicine, and the Endoscopy Center. The steps are described in detail in the following subsections.
3.1 Data preparation
All notes for constructing the regular expression rules were provided by the Department of Radiology at SNUBH, and the manually annotated notes served as the ground truth. The PHI categories were DATE, NAME (both medical staff and patients), HOSPITAL, REGION, NUMBER (SNUBH extension numbers and patient numbers), and other miscellaneous categories (nationality, sex, and age), as requested by SNUBH.
Table 1
The number of notes containing each PHI category among the documents used to build regular expression rules

| DATE | NAME (medical staff) | NAME (patients) | HOSPITAL | REGION | NUMBER | ETC | non-PHI | Total |
|------|----------------------|-----------------|----------|--------|--------|-----|---------|-------|
| 1045 | 56                   | 50              | 47       | 7      | 3      | 5   | 10070   | 11182 |
As shown in Table 1, approximately 10% (1,112) of the 11,182 annotated notes contained PHI. The most frequent PHI category was DATE (94% of PHI-containing notes), followed by names of medical staff (5%) and names of patients (4.5%).
3.2 Regular expression
Regular expression rules were constructed after identifying repeating patterns from clinical notes. Figure 2 shows examples of regular expressions. Detailed information for each category is described below.
DATE If an expression contains a year, month, and day with corresponding numbers, it is classified as a DATE. However, as shown in Fig. 3, mixing English and Korean produces different types of DATE notation. Owing to the different writing styles of individual physicians and the use of bilingual expressions, 19 additional rules were added to detect patterns such as Korean-specific date notation (년 (year), 월 (month), 일 (day)), periods (.), and quotation marks (").
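For illustration, the sketch below shows how such bilingual date notation can be matched in Python; the three patterns are simplified stand-ins and do not reproduce the 19 rules used in this study.

```python
import re

# Simplified, illustrative DATE patterns; the study's 19 rules are more
# extensive and tuned to the writing styles found in SNUBH notes.
DATE_PATTERNS = [
    r"\d{4}\s*[.\-/]\s*\d{1,2}\s*[.\-/]\s*\d{1,2}",       # 2019-02-04, 2019. 2. 4
    r"\d{4}\s*년\s*\d{1,2}\s*월\s*\d{1,2}\s*일",           # 2020년 3월 7일 (Korean notation)
    r"[\"']\d{2}\s*[.\-/]\s*\d{1,2}\s*[.\-/]\s*\d{1,2}",  # "19.2.4 (quoted two-digit year)
]
DATE_RE = re.compile("|".join(DATE_PATTERNS))

def find_dates(text: str):
    """Return (start, end, matched_text) spans for DATE candidates."""
    return [(m.start(), m.end(), m.group()) for m in DATE_RE.finditer(text)]

print(find_dates("f/u since 2019-02-04, 2020년 3월 7일 촬영"))
```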
NAME OF MEDICAL STAFF, NAME OF PATIENT Eight regular expression rules were developed for NAME OF MEDICAL STAFF and three for NAME OF PATIENT. NAME OF MEDICAL STAFF appears in specific contexts: (1) a person who takes an action, as in "NAME 확인함 (confirmed by NAME)," "confirmed by NAME," or "from NAME"; (2) a person whose job title is given, such as "의료진 (medical staff)," "판독의 (reading radiologist)," or "교수/Pf. (professor)." If these patterns appeared in a note, the matched name was masked as NAME OF MEDICAL STAFF. NAME OF PATIENT appeared as "환자이름 (patient name): NAME"; because every patient name in the notes occurred in this pattern, these rules sufficed to separate patients from medical staff. In addition, certain non-PHI words were used as identifiers because they often co-occur with PHI. For example, in Fig. 2, '환자이름 (patient name)' and '판독의 (reading radiologist)' are non-PHI words, but they were included in our regular expression rules to capture the adjacent PHI (names of patients and medical staff).
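A minimal sketch of these context-based rules follows; the patterns are illustrative, and the assumption that a name consists of two to four Hangul syllables is ours, not the paper's.

```python
import re

NAME = r"[가-힣]{2,4}"  # assumption: a name is 2-4 Hangul syllables

# Context patterns for medical staff: an action taken by NAME, or a job
# title preceding NAME (the study used eight rules; three are shown).
STAFF_RES = [
    re.compile(rf"({NAME})\s*확인함"),                           # "NAME 확인함"
    re.compile(rf"(?:confirmed by|from)\s+({NAME})"),
    re.compile(rf"(?:의료진|판독의|교수|Pf\.)\s*:?\s*({NAME})"),  # job-title identifier
]
PATIENT_RE = re.compile(rf"환자이름\s*:\s*({NAME})")              # "환자이름: NAME"

def find_names(text: str):
    staff = {m.group(1) for p in STAFF_RES for m in p.finditer(text)}
    patients = {m.group(1) for m in PATIENT_RE.finditer(text)}
    return {"medical_staff": sorted(staff), "patient": sorted(patients)}

print(find_names("환자이름: 홍길동 / 판독의: 김철수 확인함"))
```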
ORGANIZATION Thirteen rules were developed for the ORGANIZATION category. Even a single organization is referred to in many different ways. For example, "서울대병원 (Seoul National University Hospital)," "연건," and "본원" all denote the same hospital; similarly, "고려대학교병원 (Korea University Hospital)," "고려대병원," "고대병원," "고대안암," and "안암" all denote the same institution. To handle the abbreviations used by medical staff, both a vocabulary and regular expressions were built. Based on a manually selected list of well-known hospitals in Korea, the following rules were established: (1) words containing "병원" (hospital); (2) words containing "대학교" (university); (3) a vocabulary to catch known abbreviations. To identify smaller hospitals not covered by the vocabulary, additional rules such as '(병원|의원)' (hospital|clinic) and '[가-힣]+피부과' (dermatology clinic) were established. Consequently, HOSPITAL has the second-largest number of regular expressions after DATE, despite the small number of documents containing it.
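The combination of a hand-built vocabulary and generic suffix rules can be sketched as follows; the vocabulary entries and suffix patterns below are examples rather than the full thirteen rules.

```python
import re

# Example vocabulary of well-known hospitals and their abbreviations;
# the study's manually curated list is not reproduced here.
HOSPITAL_VOCAB = {"서울대병원", "연건", "본원", "고려대병원", "고대병원", "고대안암", "안암"}

GENERIC_RES = [
    re.compile(r"[가-힣]+(?:병원|의원)"),  # words ending in hospital/clinic
    re.compile(r"[가-힣]+대학교"),          # words ending in university
    re.compile(r"[가-힣]+피부과"),          # words ending in dermatology
]

def find_organizations(text: str):
    hits = {w for w in HOSPITAL_VOCAB if w in text}  # vocabulary lookup
    for pattern in GENERIC_RES:                      # generic suffix rules
        hits.update(pattern.findall(text))
    return sorted(hits)

print(find_organizations("김포우리병원에서 전원, 고대안암 f/u"))
```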
REGION, NUMBER, ETC The REGION category has only one rule. In many cases, REGION overlaps with ORGANIZATION because physicians mention most regions when referring to hospitals, e.g., "대구 (Daegu) local," "김포우리병원 (Gimpo Woori Hospital)," and "강동성심병원 (Gangdong Sungsim Hospital)." Therefore, specific city names combined with well-known hospitals, such as Bundang and Gangdong, were identified as ORGANIZATION. Four rules were developed for the NUMBER category, which comprises SNUBH extension numbers and patient numbers. However, because the hospital uses four-digit extension numbers, distinguishing them from years is challenging; therefore, the complete set of SNUBH extension numbers was collected, and a vocabulary set was constructed to match them manually. "전화번호 (telephone number)," "Tel.," "T.," and "환자번호 (patient number)" were used as identifiers to detect NUMBER. Only five notes contained the ETC category, which includes the patient's age, sex, and nationality; the patterns in these notes were covered by four regular expressions.
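The identifier-driven NUMBER rules might be sketched as below; the extension set and patterns are placeholders, since the actual SNUBH extension list is not public.

```python
import re

EXTENSION_NUMBERS = {"7732", "1234"}  # placeholder for the collected SNUBH set

# An identifier ("전화번호", "환자번호", "Tel.", "T.") followed by digits.
NUMBER_RE = re.compile(r"(?:전화번호|환자번호|Tel\.|T\.)\s*:?\s*([\d\-]+)")

def find_numbers(text: str):
    hits = [m.group(1) for m in NUMBER_RE.finditer(text)]
    # A bare four-digit number is kept only if it is a known extension,
    # because the same format also appears in years.
    for token in re.findall(r"\b\d{4}\b", text):
        if token in EXTENSION_NUMBERS and token not in hits:
            hits.append(token)
    return hits

print(find_numbers("문의: Tel. 031-787-7732, 내선 7732"))
```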
3.3 Machine-learning and pseudo-labeling
KoBERT16 is a BERT-based Korean-language model developed by SKTBrain. KoBERT has 92M parameters and was pre-trained on 5M sentences from public Wikipedia and news articles. KoBERT-NER17 is a named-entity recognition (NER) model built on KoBERT and trained with the Naver NLP Challenge 2018 dataset, which contains 81,000 training and 9,000 validation samples18. The label types used to train KoBERT-NER are similar to our PHI categories. The 'Base' training level refers to the model trained only on the named-entity annotations of the Naver NLP Challenge 2018. This dataset contains newspaper articles and automatically generated sentences, which are more structured than other narrative texts, but it does not include clinical notes. Five of the six PHI categories (DATE, PERSON, ORGANIZATION, LOCATION, and NUMBER) were covered by the Naver NLP dataset; ETC (age, sex, and nationality) was not.
Pseudo-labels were obtained from the regular expressions we developed, as shown in Fig. 4. When labeling, not all of the original KoBERT-NER labels were used; instead, parts of the output were merged into predefined labels for convenience. NAME OF MEDICAL STAFF and NAME OF PATIENT were labeled as [PER], DATE as [DAT], HOSPITAL as [ORG], REGION as [LOC], NUMBER as [NUM], and ETC as the newly defined [ETC]. The context of each label in the original KoBERT-NER differs slightly from that of a medical record. For example, in the original KoBERT-NER, [DAT] covers expressions such as 'today' or 'the end of 2019,' whereas specific dates such as '2019-02-04' and '2020년 3월 7일' appear far more frequently in clinical notes than in typical narrative text. Similarly, [NUM] originally covered all numbers, but our model was trained only on the four-digit phone (extension) numbers. Although the token-label relationships do not precisely match those of the original KoBERT-NER, the labeling was performed in as similar a context as possible.
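As an illustration of this pseudo-labeling step, the sketch below converts regex matches into token-level tags; label_spans() is a hypothetical stand-in for the 51 rules (it applies a single DATE pattern), and the BIO tagging scheme is our assumption, as the paper reports only the label names.

```python
import re

LABEL_MAP = {"DATE": "DAT", "NAME": "PER", "HOSPITAL": "ORG",
             "REGION": "LOC", "NUMBER": "NUM", "ETC": "ETC"}

def label_spans(text: str):
    """Hypothetical stand-in for the 51 rules; applies one DATE pattern."""
    date_re = re.compile(r"\d{4}[.\-/]\d{1,2}[.\-/]\d{1,2}")
    return [(m.start(), m.end(), "DATE") for m in date_re.finditer(text)]

def pseudo_label(text: str):
    """Assign a BIO tag to each whitespace token based on rule matches."""
    spans = label_spans(text)
    tagged, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = pos = start + len(token)
        tag = "O"
        for s, e, cat in spans:
            if start < e and end > s:  # token overlaps a matched span
                tag = ("B-" if start <= s else "I-") + LABEL_MAP[cat]
        tagged.append((token, tag))
    return tagged

print(pseudo_label("CT f/u 2019-02-04 판독"))  # [..., ('2019-02-04', 'B-DAT'), ...]
```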
Table 2
The number of PHI words in the training data

| Training level   | DAT   | PER  | ORG  | NUM | LOC | ETC |
|------------------|-------|------|------|-----|-----|-----|
| Small (8,000)    | 12618 | 721  | 847  | 17  | 8   | 15  |
| Medium (16,000)  | 25425 | 1475 | 1765 | 39  | 15  | 25  |
| Large1 (32,000)  | 50907 | 2909 | 3522 | 190 | 79  | 143 |
| Large2 (31,884)  | 51158 | 2969 | 1957 | 215 | 83  | 272 |
In total, 87,015 PHI words were pseudo-labeled in 400,000 clinical notes, and 48,684 of these notes contained at least one PHI category. To compare model performance across data sizes, 8,000, 16,000, and 32,000 PHI-containing notes were selected from the 48,684 notes; these sets were designated "Small," "Medium," and "Large 1," respectively. "Large 2" is a manually corrected version of "Large 1" in which frequent false positives were fixed; 179 notes from non-radiology reports were also added after eliminating false positives, with the aim of improving performance on non-radiology reports. The number of PHI words at each level is presented in Table 2.
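The subset construction can be sketched as follows; whether the Small/Medium/Large subsets were nested is not stated in the paper, so the nesting here is an assumption.

```python
import random

def build_subsets(phi_notes, seed=0):
    """phi_notes stands in for the 48,684 notes containing at least one PHI."""
    rng = random.Random(seed)
    shuffled = rng.sample(phi_notes, len(phi_notes))
    return {
        "Small": shuffled[:8000],
        "Medium": shuffled[:16000],
        "Large1": shuffled[:32000],  # Large2 = Large1 after manual correction
    }
```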
In addition to KoBERT, we fine-tuned SciBERT19, a BERT-based model trained on large-scale scientific articles. SciBERT has achieved state-of-the-art results on several NLP tasks in the scientific domain; in particular, Tai et al. (2020) showed that a SciBERT-based NER model achieved the highest F1-score on biomedical NER tasks20. For comparison with KoBERT, we fine-tuned SciBERT on the same levels of training data in the same manner.
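A minimal fine-tuning sketch using the HuggingFace Trainer is shown below; the model name, label set, hyperparameters, and dataset objects are assumptions of this sketch, as the paper does not specify its training configuration.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["O", "B-DAT", "I-DAT", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-NUM", "I-NUM", "B-ETC", "I-ETC"]

def fine_tune(model_name, train_dataset, eval_dataset):
    """Fine-tune a BERT-style encoder for token-level PHI tagging."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(LABELS))
    args = TrainingArguments(output_dir="phi-ner", num_train_epochs=3,
                             per_device_train_batch_size=32)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return tokenizer, model

# e.g., fine_tune("allenai/scibert_scivocab_uncased", train_ds, eval_ds)
```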
Table 3
The number of PHI words in the test notes

| Source                   | DAT | PER | ORG | NUM | LOC | ETC | Total |
|--------------------------|-----|-----|-----|-----|-----|-----|-------|
| Department of Radiology  | 537 | 30  | 27  | 7   | 9   | 5   | 615   |
| Non-Radiology            | 11  | 27  | 4   | 6   | 1   | 19  | 68    |