Figure 1 illustrates the research process. Of the 11,181,617 radiology reports in the SNUBH database, we randomly selected and annotated 0.1% (11,182 documents). Among the annotated documents, 1,112 contained PHI. Based on the annotated documents, we built 51 regular expression rules, as described in the following sections. The constructed regular expression rules were used as pseudo-labelers to generate training data for fine-tuning the BERT-based models. For the validation data, we prepared 342 notes and labeled them at the token level. The performance of each method on notes from other departments was measured separately on 12 functional test reports from the Department of Neurology, the Department of Rehabilitation Medicine, and the Endoscopy Center. The steps are described in detail in the following subsections.
3.1 Data preparation
All notes for constructing the regular expression rules were provided by the Department of Radiology at SNUBH, and the manually annotated notes served as the ground truth. The PHI categories were DATE, NAME (both medical staff and patients), HOSPITAL, REGION, NUMBER (SNUBH extension numbers and patient numbers), and other miscellaneous categories (nationality, sex, and age), as requested by SNUBH.
Table 1
The number of notes containing each PHI category among the documents used to build regular expression rules

| DATE | NAME (medical staff) | NAME (patients) | HOSPITAL | REGION | NUMBER | ETC | non-PHI | Total |
|------|----------------------|-----------------|----------|--------|--------|-----|---------|-------|
| 1045 | 56                   | 50              | 47       | 7      | 3      | 5   | 10070   | 11182 |
As shown in Table 1, approximately 10% (1,112) of the 11,182 annotated notes contained PHI. The most frequent PHI category was DATE (94% of PHI-containing notes), followed by names of medical staff (5%) and names of patients (4.5%).
3.2 Regular expression
Regular expression rules were constructed after identifying repeating patterns from clinical notes. Figure 2 shows examples of regular expressions. Detailed information for each category is described below.
DATE If an expression contains a year, month, and day with corresponding numbers, it is classified as a DATE. However, as shown in Fig. 3, mixing English and Korean produces different types of DATE notation. Owing to the different writing styles of individual physicians and the use of bilingual expressions, 19 additional rules were added to detect patterns such as Korean-specific date notation (년 (year), 월 (month), 일 (day)), periods (.), and quotation marks (").
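For illustration, the sketch below shows how such bilingual date notation can be matched in Python; the three patterns are simplified stand-ins and do not reproduce the 19 rules used in this study.

```python
import re

# Simplified, illustrative DATE patterns; the study's 19 rules are more
# extensive and tuned to the writing styles found in SNUBH notes.
DATE_PATTERNS = [
    r"\d{4}\s*[.\-/]\s*\d{1,2}\s*[.\-/]\s*\d{1,2}",       # 2019-02-04, 2019. 2. 4
    r"\d{4}\s*년\s*\d{1,2}\s*월\s*\d{1,2}\s*일",           # 2020년 3월 7일 (Korean notation)
    r"[\"']\d{2}\s*[.\-/]\s*\d{1,2}\s*[.\-/]\s*\d{1,2}",  # "19.2.4 (quoted two-digit year)
]
DATE_RE = re.compile("|".join(DATE_PATTERNS))

def find_dates(text: str):
    """Return (start, end, matched_text) spans for DATE candidates."""
    return [(m.start(), m.end(), m.group()) for m in DATE_RE.finditer(text)]

print(find_dates("f/u since 2019-02-04, 2020년 3월 7일 촬영"))
```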
NAME OF MEDICAL STAFF, NAME OF PATIENT Eight regular expression rules were developed for NAME OF MEDICAL STAFF and three for NAME OF PATIENT. NAME OF MEDICAL STAFF appears in specific contexts: (1) a person who takes an action, as in "NAME 확인함 (confirmed by NAME)," "confirmed by NAME," or "from NAME"; (2) a person whose job title is given, such as "의료진 (medical staff)," "판독의 (reading radiologist)," or "교수/Pf. (professor)." If these patterns appeared in a note, the matched name was masked as NAME OF MEDICAL STAFF. NAME OF PATIENT appeared as "환자이름 (patient name): NAME"; because every patient name in the notes occurred in this pattern, these rules sufficed to separate patients from medical staff. In addition, certain non-PHI words were used as identifiers because they often co-occur with PHI. For example, in Fig. 2, '환자이름 (patient name)' and '판독의 (reading radiologist)' are non-PHI words, but they were included in our regular expression rules to capture the adjacent PHI (names of patients and medical staff).
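A minimal sketch of these context-based rules follows; the patterns are illustrative, and the assumption that a name consists of two to four Hangul syllables is ours, not the paper's.

```python
import re

NAME = r"[가-힣]{2,4}"  # assumption: a name is 2-4 Hangul syllables

# Context patterns for medical staff: an action taken by NAME, or a job
# title preceding NAME (the study used eight rules; three are shown).
STAFF_RES = [
    re.compile(rf"({NAME})\s*확인함"),                           # "NAME 확인함"
    re.compile(rf"(?:confirmed by|from)\s+({NAME})"),
    re.compile(rf"(?:의료진|판독의|교수|Pf\.)\s*:?\s*({NAME})"),  # job-title identifier
]
PATIENT_RE = re.compile(rf"환자이름\s*:\s*({NAME})")              # "환자이름: NAME"

def find_names(text: str):
    staff = {m.group(1) for p in STAFF_RES for m in p.finditer(text)}
    patients = {m.group(1) for m in PATIENT_RE.finditer(text)}
    return {"medical_staff": sorted(staff), "patient": sorted(patients)}

print(find_names("환자이름: 홍길동 / 판독의: 김철수 확인함"))
```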
ORGANIZATION Thirteen rules were developed for the ORGANIZATION category. Even a single organization is referred to in many different ways. For example, "서울대병원 (Seoul National University Hospital)," "연건," and "본원" all denote the same hospital; similarly, "고려대학교병원 (Korea University Hospital)," "고려대병원," "고대병원," "고대안암," and "안암" all denote the same institution. To handle the abbreviations used by medical staff, both a vocabulary and regular expressions were built. Based on a manually selected list of well-known hospitals in Korea, the following rules were established: (1) words containing "병원" (hospital); (2) words containing "대학교" (university); (3) a vocabulary to catch known abbreviations. To identify smaller hospitals not covered by the vocabulary, additional rules such as '(병원|의원)' (hospital|clinic) and '[가-힣]+피부과' (dermatology clinic) were established. Consequently, HOSPITAL has the second-largest number of regular expressions after DATE, despite the small number of documents containing it.
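The combination of a hand-built vocabulary and generic suffix rules can be sketched as follows; the vocabulary entries and suffix patterns below are examples rather than the full thirteen rules.

```python
import re

# Example vocabulary of well-known hospitals and their abbreviations;
# the study's manually curated list is not reproduced here.
HOSPITAL_VOCAB = {"서울대병원", "연건", "본원", "고려대병원", "고대병원", "고대안암", "안암"}

GENERIC_RES = [
    re.compile(r"[가-힣]+(?:병원|의원)"),  # words ending in hospital/clinic
    re.compile(r"[가-힣]+대학교"),          # words ending in university
    re.compile(r"[가-힣]+피부과"),          # words ending in dermatology
]

def find_organizations(text: str):
    hits = {w for w in HOSPITAL_VOCAB if w in text}  # vocabulary lookup
    for pattern in GENERIC_RES:                      # generic suffix rules
        hits.update(pattern.findall(text))
    return sorted(hits)

print(find_organizations("김포우리병원에서 전원, 고대안암 f/u"))
```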
REGION, NUMBER, ETC The REGION category has only one rule. In many cases, REGION overlaps with ORGANIZATION because physicians mention most regions when referring to hospitals, e.g., "대구 (Daegu) local," "김포우리병원 (Gimpo Woori Hospital)," and "강동성심병원 (Gangdong Sungsim Hospital)." Therefore, specific city names combined with well-known hospitals, such as Bundang and Gangdong, were identified as ORGANIZATION. Four rules were developed for the NUMBER category, which comprises SNUBH extension numbers and patient numbers. However, because the hospital uses four-digit extension numbers, distinguishing them from years is challenging; therefore, the complete set of SNUBH extension numbers was collected, and a vocabulary set was constructed to match them manually. "전화번호 (telephone number)," "Tel.," "T.," and "환자번호 (patient number)" were used as identifiers to detect NUMBER. Only five notes contained the ETC category, which includes the patient's age, sex, and nationality; the patterns in these notes were covered by four regular expressions.
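The identifier-driven NUMBER rules might be sketched as below; the extension set and patterns are placeholders, since the actual SNUBH extension list is not public.

```python
import re

EXTENSION_NUMBERS = {"7732", "1234"}  # placeholder for the collected SNUBH set

# An identifier ("전화번호", "환자번호", "Tel.", "T.") followed by digits.
NUMBER_RE = re.compile(r"(?:전화번호|환자번호|Tel\.|T\.)\s*:?\s*([\d\-]+)")

def find_numbers(text: str):
    hits = [m.group(1) for m in NUMBER_RE.finditer(text)]
    # A bare four-digit number is kept only if it is a known extension,
    # because the same format also appears in years.
    for token in re.findall(r"\b\d{4}\b", text):
        if token in EXTENSION_NUMBERS and token not in hits:
            hits.append(token)
    return hits

print(find_numbers("문의: Tel. 031-787-7732, 내선 7732"))
```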
3.3 Machine-learning and pseudo-labeling
KoBERT16 is a BERT-based Korean-language model developed by SKTBrain. KoBERT has 92M parameters and was pre-trained on 5M sentences from public Wikipedia and news articles. KoBERT-NER17 is a named-entity recognition (NER) model built on KoBERT and trained with the Naver NLP Challenge 2018 dataset, which contains 81,000 training and 9,000 validation samples18. The label types used to train KoBERT-NER are similar to our PHI categories. The 'Base' training level refers to the model trained only on the named-entity annotations of the Naver NLP Challenge 2018. This dataset contains newspaper articles and automatically generated sentences, which are more structured than other narrative texts, but it does not include clinical notes. Five of the six PHI categories (DATE, PERSON, ORGANIZATION, LOCATION, and NUMBER) were covered by the Naver NLP dataset; ETC (age, sex, and nationality) was not.
Pseudo-labels were obtained from the regular expressions we developed, as shown in Fig. 4. When labeling, not all of the original KoBERT-NER labels were used; instead, parts of the output were merged into predefined labels for convenience. NAME OF MEDICAL STAFF and NAME OF PATIENT were labeled as [PER], DATE as [DAT], HOSPITAL as [ORG], REGION as [LOC], NUMBER as [NUM], and ETC as the newly defined [ETC]. The context of each label in the original KoBERT-NER differs slightly from that of a medical record. For example, in the original KoBERT-NER, [DAT] covers expressions such as 'today' or 'the end of 2019,' whereas specific dates such as '2019-02-04' and '2020년 3월 7일' appear far more frequently in clinical notes than in typical narrative text. Similarly, [NUM] originally covered all numbers, but our model was trained only on the four-digit phone (extension) numbers. Although the token-label relationships do not precisely match those of the original KoBERT-NER, the labeling was performed in as similar a context as possible.
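As an illustration of this pseudo-labeling step, the sketch below converts regex matches into token-level tags; label_spans() is a hypothetical stand-in for the 51 rules (it applies a single DATE pattern), and the BIO tagging scheme is our assumption, as the paper reports only the label names.

```python
import re

LABEL_MAP = {"DATE": "DAT", "NAME": "PER", "HOSPITAL": "ORG",
             "REGION": "LOC", "NUMBER": "NUM", "ETC": "ETC"}

def label_spans(text: str):
    """Hypothetical stand-in for the 51 rules; applies one DATE pattern."""
    date_re = re.compile(r"\d{4}[.\-/]\d{1,2}[.\-/]\d{1,2}")
    return [(m.start(), m.end(), "DATE") for m in date_re.finditer(text)]

def pseudo_label(text: str):
    """Assign a BIO tag to each whitespace token based on rule matches."""
    spans = label_spans(text)
    tagged, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = pos = start + len(token)
        tag = "O"
        for s, e, cat in spans:
            if start < e and end > s:  # token overlaps a matched span
                tag = ("B-" if start <= s else "I-") + LABEL_MAP[cat]
        tagged.append((token, tag))
    return tagged

print(pseudo_label("CT f/u 2019-02-04 판독"))  # [..., ('2019-02-04', 'B-DAT'), ...]
```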
Table 2
The number of PHI words in the training data

| Training level   | DAT   | PER  | ORG  | NUM | LOC | ETC |
|------------------|-------|------|------|-----|-----|-----|
| Small (8,000)    | 12618 | 721  | 847  | 17  | 8   | 15  |
| Medium (16,000)  | 25425 | 1475 | 1765 | 39  | 15  | 25  |
| Large1 (32,000)  | 50907 | 2909 | 3522 | 190 | 79  | 143 |
| Large2 (31,884)  | 51158 | 2969 | 1957 | 215 | 83  | 272 |
In total, 87,015 PHI words were pseudo-labeled in 400,000 clinical notes, and 48,684 of these notes contained at least one PHI category. To compare model performance across data sizes, 8,000, 16,000, and 32,000 PHI-containing notes were selected from the 48,684 notes; these sets were designated "Small," "Medium," and "Large 1," respectively. "Large 2" is a manually corrected version of "Large 1" in which frequent false positives were fixed; 179 notes from non-radiology reports were also added after eliminating false positives, with the aim of improving performance on non-radiology reports. The number of PHI words at each level is presented in Table 2.
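The subset construction can be sketched as follows; whether the Small/Medium/Large subsets were nested is not stated in the paper, so the nesting here is an assumption.

```python
import random

def build_subsets(phi_notes, seed=0):
    """phi_notes stands in for the 48,684 notes containing at least one PHI."""
    rng = random.Random(seed)
    shuffled = rng.sample(phi_notes, len(phi_notes))
    return {
        "Small": shuffled[:8000],
        "Medium": shuffled[:16000],
        "Large1": shuffled[:32000],  # Large2 = Large1 after manual correction
    }
```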
In addition to KoBERT, we fine-tuned SciBERT19, a BERT-based model trained on large-scale scientific articles. SciBERT has achieved state-of-the-art results on several NLP tasks in the scientific domain; in particular, Tai et al. (2020) showed that a SciBERT-based NER model achieved the highest F1-score on biomedical NER tasks20. For comparison with KoBERT, we fine-tuned SciBERT on the same levels of training data in the same manner.
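A minimal fine-tuning sketch using the HuggingFace Trainer is shown below; the model name, label set, hyperparameters, and dataset objects are assumptions of this sketch, as the paper does not specify its training configuration.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["O", "B-DAT", "I-DAT", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-NUM", "I-NUM", "B-ETC", "I-ETC"]

def fine_tune(model_name, train_dataset, eval_dataset):
    """Fine-tune a BERT-style encoder for token-level PHI tagging."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(LABELS))
    args = TrainingArguments(output_dir="phi-ner", num_train_epochs=3,
                             per_device_train_batch_size=32)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return tokenizer, model

# e.g., fine_tune("allenai/scibert_scivocab_uncased", train_ds, eval_ds)
```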
Table 3
The number of PHI words in the test notes

| Source                   | DAT | PER | ORG | NUM | LOC | ETC | Total |
|--------------------------|-----|-----|-----|-----|-----|-----|-------|
| Department of Radiology  | 537 | 30  | 27  | 7   | 9   | 5   | 615   |
| Non-Radiology            | 11  | 27  | 4   | 6   | 1   | 19  | 68    |