Our models have essentially the same architecture as BERT-Base. We begin with an overview of BERT and describe the available models used for medical text-mining tasks. Next, we illustrate our method and introduce our models. Finally, we explain the fine-tuning procedure used to evaluate our models.
2.1 BERT: Bidirectional encoder representations from transformers
BERT [2] is a contextualized word-representation model based on masked language modeling (MLM), and it is pre-trained using bidirectional transformers [1]. There are two steps in the BERT framework: pre-training and fine-tuning. During pre-training, the model is trained on large unlabeled corpora. For fine-tuning, the BERT model is first initialized with the pre-trained weights, and all the weights are then fine-tuned using labeled data from the downstream tasks. We fed the task-specific inputs and outputs into BERT with minimal architectural modification and fine-tuned all the parameters in an end-to-end manner.
2.1.1 Pre-training
BERT pre-training is optimized for two unsupervised classification tasks (Fig. 2). The first is MLM. One training instance of MLM is a single modified sentence. Each token in the sentence has a 15% chance of being selected for prediction. A selected token is replaced with the special token [MASK] 80% of the time, with another random token 10% of the time, and left unchanged the remaining 10% of the time. The MLM objective is a cross-entropy loss on predicting the masked tokens.
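A minimal sketch of this masking rule in Python follows (the function and variable names are ours, and special-token handling is simplified):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Apply the BERT MLM corruption to one tokenized sentence.

    Each token is selected with probability `select_prob`; a selected token
    is replaced by [MASK] 80% of the time, by a random vocabulary token 10%
    of the time, and left unchanged the remaining 10% of the time.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = not selected; otherwise the MLM target
    for i, token in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token
    return corrupted, labels

print(mask_tokens(["the", "patient", "developed", "acute", "jaundice"],
                  vocab=["fever", "stroke", "edema"]))
```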
The second task is next-sentence prediction (NSP), which is a binary classification loss for predicting whether two segments follow each other in the original text. Positive instances are created by taking consecutive sentences from the text corpus. Negative instances are created by pairing segments from different documents. Positive and negative instances are sampled with equal probability. The NSP objective is designed to improve the performance of downstream tasks, such as natural language inference (NLI) [6], which require reasoning regarding the relationships between pairs of sentences.
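The construction of NSP instances can be sketched as follows (a simplified illustration assuming each document is given as a list of sentences; the real implementation also packs segments up to the maximum sequence length):

```python
import random

def make_nsp_instance(documents, doc_idx, sent_idx):
    """Build one NSP instance: (segment A, segment B, label).

    With probability 0.5, segment B is the true next sentence ("IsNext");
    otherwise it is a random sentence from a different document ("NotNext").
    """
    segment_a = documents[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(documents[doc_idx]):
        return segment_a, documents[doc_idx][sent_idx + 1], "IsNext"
    other_doc = random.choice([d for i, d in enumerate(documents) if i != doc_idx])
    return segment_a, random.choice(other_doc), "NotNext"
```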
While creating the training instances, we set a duplicate factor, which serves as data augmentation during BERT pre-training. It specifies how many instances are created from each input sentence; these instances originate from the same sentence but differ in their [MASK] positions.
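The effect of the duplicate factor can be illustrated as follows (a sketch with a simplified masking rule; in the actual pipeline, each duplicate would also receive the 80/10/10 replacement scheme described above):

```python
import random

def create_duplicated_instances(sentence_tokens, dup_factor=5, select_prob=0.15):
    """Create `dup_factor` MLM instances from a single sentence.

    Every instance shares the same underlying sentence but differs in the
    randomly chosen [MASK] positions, which is how the duplicate factor
    augments the pre-training data.
    """
    return [
        ["[MASK]" if random.random() < select_prob else tok for tok in sentence_tokens]
        for _ in range(dup_factor)
    ]

print(create_duplicated_instances(["fever", "and", "edema", "were", "noted"], dup_factor=3))
```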
2.1.2 Vocabulary
To manage the problem of out-of-vocabulary words, BERT uses a vocabulary of subword units generated by WordPiece [7], which is based on byte-pair encoding (BPE) [8], for the unsupervised tokenization of the input text. The vocabulary is built such that it contains the most frequently used words or subword units. A main benefit of pre-training from scratch is the ability to leverage a domain-specific custom vocabulary. For example, appendicitis, a common disease name, is divided into four pieces ([app, ##end, ##ici, ##tis]) by BERT [2] and three pieces ([append, ##icit, ##is]) by SciBERT [9]. Table 1 compares the vocabularies used by BERT variants. We refer to the original vocabulary released with BERT as BaseVocab, which is based on general-domain corpora. In this study, custom vocabularies were used by SciBERT and by our models in English, whereas vocabularies based on BaseVocab were used by the other BERT variants.
Table 1
Comparison of common medical terms in vocabularies used by BERT variants.
Medical Term | Category | BERT | SciBERT | ouBioBERT (Ours) |
stroke | disease | ✓ | ✓ | ✓ |
malaria | disease | ✓ | ✓ | ✓ |
bleeding | symptom | ✓ | ✓ | ✓ |
seizure | symptom | ✓ | ✓ | ✓ |
pulmonary | organ | ✓ | ✓ | ✓ |
stomach | organ | ✓ | ✓ | ✓ |
surgery | procedure | ✓ | ✓ | ✓ |
prescription | procedure | ✓ | ✓ | ✓ |
cocaine | chemical | ✓ | ✓ | ✓ |
glucose | chemical | ✓ | ✓ | ✓ |
osteoporosis | disease | | ✓ | ✓ |
edema | symptom | | ✓ | ✓ |
pancreas | organ | | ✓ | ✓ |
laparotomy | procedure | | ✓ | ✓ |
dexamethasone | chemical | | ✓ | ✓ |
appendicitis | disease | | | ✓ |
jaundice | symptom | | | ✓ |
duodenum | organ | | | ✓ |
polypectomy | procedure | | | ✓ |
codeine | chemical | | | ✓ |
Note: a “✓” symbol indicates that the corresponding vocabulary has the medical term; otherwise, the term will be broken up into smaller subwords. |
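This difference can be inspected directly with any WordPiece tokenizer, for example via the Hugging Face transformers library (an illustrative snippet: the exact subword splits depend on the vocabulary file that is loaded, and the custom-vocabulary path is hypothetical):

```python
from transformers import BertTokenizer

# General-domain vocabulary (BaseVocab): rare medical terms are split into
# several subword pieces.
general_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(general_tokenizer.tokenize("appendicitis"))  # e.g., ['app', '##end', '##ici', '##tis']

# A domain-specific vocabulary (a custom uncased vocab.txt built from medical
# corpora) would keep such terms as single tokens.
# custom_tokenizer = BertTokenizer.from_pretrained("path/to/custom_vocab_dir")
# print(custom_tokenizer.tokenize("appendicitis"))  # e.g., ['appendicitis']
```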
2.1.3 Pre-trained BERT variants
The standard BERT model has been reported not to perform well on specialized domains, such as biomedical or scientific texts [3, 9]. To overcome this limitation, there are two possible strategies: additional pre-training of an existing pre-trained BERT model on domain-specific corpora, or pre-training from scratch on domain-specific corpora. The main benefit of the former is that its computational cost is lower than that of the latter. The main advantage of the latter, as mentioned, is the availability of a custom vocabulary; the disadvantage, however, is that the pre-trained neural language model may be less adaptable if the number of documents in the specific domain is small.
BERT-Base is pre-trained using English Wikipedia and BooksCorpus [2]. The vocabulary is BaseVocab, and its size is 30K. We evaluated the uncased version of this model for the general domain.
BioBERT is the first BERT model released for the biomedical domain [3]. BioBERT is initialized from BERT-Base and trained using PubMed abstracts. We used BioBERT v1.1, whose vocabulary is based on BaseVocab, for the evaluation.
ClinicalBERT is a clinically oriented BERT model [5]. It is initialized from BioBERT v1.0 and trained for additional steps using MIMIC-III clinical notes [10].
SciBERT leverages unsupervised pre-training on a large multi-domain corpus, which consists of 18% of papers from the computer science domain and 82% from the broad biomedical domain [9]. We evaluated the SciBERT-Base-Uncased that utilizes the original vocabulary called SciVocab.
BlueBERT is published with the BLUE benchmark [4]. In this study, we evaluated BlueBERT-Base (P) and BlueBERT-Base (P + M), both initialized from BERT-Base: the former was pre-trained using only PubMed abstracts, and the latter using PubMed abstracts for 5M steps followed by MIMIC-III clinical notes for 0.2M steps.
Tohoku-BERT is a Japanese BERT model for the general domain released by Tohoku University [11]. It was pre-trained using Japanese Wikipedia, and its vocabulary was obtained by applying BPE to that corpus.
UTH-BERT is a clinical BERT model for Japanese published by the University of Tokyo [12]. It was developed using a large amount of Japanese clinical narrative text, and its vocabulary was built so that words for diseases or findings are segmented into units as large as possible.
2.2 Our proposed method: simultaneous pre-training and amplified vocabulary
If we train a BERT model only on a small medical corpus, we must be aware that overfitting may degrade its performance. We hypothesized that this can be avoided by simultaneously training a BERT model on both general-domain and medical-domain knowledge. This would be achievable by increasing the frequency of MLM pre-training on medical-domain documents relative to general-domain documents and by using negative NSP instances in which a sentence pair is constructed from two random sentences, each drawn from a different document. To increase the number of possible document combinations and to enhance medical-word representations in the vocabulary, we introduce the following two interventions.
Simultaneous pre-training is a technique for efficiently creating pre-training data from a set of corpora of different sizes and for pre-training a neural language model on them, as illustrated in Fig. 3. When pre-training a medical BERT model, the core corpora are the small medical corpora, and the satellite corpora are large general-domain corpora, such as Wikipedia.
In the original implementation, the entire corpus is first divided into smaller text files that fit in memory. The NSP combinations are then determined within each split file, and the duplicate factor defines how many times each sentence is used. This approach has two problems. First, the duplicate factor is applied uniformly to both the core and satellite corpora, so the smaller corpora remain relatively small, and pre-training on the core corpora is therefore less frequent than pre-training on the satellite corpora. Second, the NSP combinations are limited to sentences within each initially split file (see Fig. 3.A).
In our method, both the core and satellite corpora are first divided into smaller documents of the same size and then combined to create pre-training instances. When combining them, we ensured that the documents from the core and satellite corpora were comparable in file size and that the combination patterns were diverse. With this technique, more instances from the core corpora are used than from the satellite corpora, and the two sources are homogeneously mixed (see Fig. 3.B). Consequently, this intervention increases the frequency of MLM pre-training on core-corpus documents compared with the original method, and it also generates a larger number of different document combinations.
As depicted in Fig. 3, core corpora and satellite corpora were combined so that their proportion was equal, and a sufficient number of pre-training instances were created to train a BERT model.
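Under our reading of Fig. 3.B, the data preparation can be sketched as follows (a simplified illustration rather than the exact implementation: the chunking, sampling, and shuffling details are assumptions):

```python
import random

def split_into_chunks(documents, chunk_size):
    """Split a corpus (a list of documents) into chunks of roughly equal size."""
    return [documents[i:i + chunk_size] for i in range(0, len(documents), chunk_size)]

def build_pretraining_shards(core_docs, satellite_docs, chunk_size, n_shards):
    """Mix core (small medical) and satellite (large general) corpora.

    Each shard pairs one core chunk with one satellite chunk in equal
    proportion; because the core corpora are small, their documents recur in
    many shards, which raises their MLM frequency, and shuffling across the
    two sources diversifies the possible NSP document combinations.
    """
    core_chunks = split_into_chunks(core_docs, chunk_size)
    satellite_chunks = split_into_chunks(satellite_docs, chunk_size)
    shards = []
    for _ in range(n_shards):
        shard = random.choice(core_chunks) + random.choice(satellite_chunks)
        random.shuffle(shard)  # homogeneously mix documents from both sources
        shards.append(shard)
    return shards
```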
The amplified vocabulary is a custom vocabulary suited to a small corpus. If we build a vocabulary with BPE without adjusting the sizes of the core and satellite corpora, most words and subwords will be derived from the satellite corpora, which are larger than the core corpora. To solve this problem, we amplified the core corpora so that their size matched that of the satellite corpora. Subsequently, we constructed the uncased vocabulary via BPE using the tokenizers library [13].
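A sketch of how such a vocabulary could be built with the tokenizers library [13] follows (the file names, amplification loop, and vocabulary size are assumptions, and BertWordPieceTokenizer is used here as one possible subword trainer):

```python
import os
from tokenizers import BertWordPieceTokenizer

CORE = "core_medical_corpus.txt"            # small medical corpus (e.g., jpCR or sP)
SATELLITE = "satellite_general_corpus.txt"  # large general corpus (e.g., Wikipedia)

# Amplify the core corpus by duplication until its size matches the satellite
# corpus, so that medical subwords are not crowded out of the vocabulary.
n_copies = max(1, os.path.getsize(SATELLITE) // os.path.getsize(CORE))
with open(CORE, encoding="utf-8") as f:
    core_text = f.read()
with open("core_amplified.txt", "w", encoding="utf-8") as f:
    f.write(core_text * n_copies)

# Train an uncased subword vocabulary on the size-balanced corpora.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["core_amplified.txt", SATELLITE], vocab_size=30000)
tokenizer.save_model(".")  # writes vocab.txt for use with BERT
```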
2.3 Our pre-trained models
We produced the following BERT-Base models to demonstrate our method. The corpora we used for our models are listed in Table 2.
Table 2
List of the text corpora used for our models.
Abbr. | Corpus | Number of words | Size (GB) | Domain |
jpW | Japanese Wikipedia | 550M | 2.6 | (jp) General |
jpCR | Digital clinical references | 18M | 0.1 | (jp) Medical |
W | English Wikipedia | 2,200M | 13 | (en) General |
BC | BooksCorpus | 850M | 5 | (en) General |
sP | Small PubMed abstracts | 30M | 0.2 | (en) Biomedical |
fP | Focused PubMed abstracts | 280M | 1.8 | (en) Biomedical |
oP | Other PubMed abstracts | 2,800M | 18 | (en) Biomedical |
Notes: Japanese corpora are tokenized using MeCab [14]. jp: Japanese; en: English. |
BERT (prop jpCR + jpW: AmpVocab) is a Japanese medical BERT model pre-trained using our method. As a source of medical knowledge, we used “Today’s diagnosis and treatment: premium” (abbreviated as jpCR), a reference for clinicians consisting of 15 digital resources in Japanese published by IGAKU-SHOIN Ltd.; as a source of general-domain knowledge, we used Japanese Wikipedia (jpW). For comparison, four pre-trained models were prepared. Two are publicly available models: Tohoku-BERT and UTH-BERT. The others are BERT (jpW/ jpCR: jpWVocab), which was initialized with Tohoku-BERT and trained for additional steps using jpCR, and BERT (jpCR: CRVocab), which was pre-trained from scratch using only jpCR.
Next, to assess the feasibility of adapting our method to English, we empirically produced a limited corpus of clinically relevant articles from PubMed abstracts. PubMed comprises a large number of citations of biomedical literature from MEDLINE, and its articles therefore mix clinical medicine with the broader life sciences. To simulate a small medical corpus in English, we constructed a medical corpus, denoted as sP, extracted from the PubMed abstracts by using their medical subject headings (MeSH) IDs, which can be converted to the corresponding tree numbers. The heuristic rules used to decide which articles to extract are shown in Table A1.
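Conceptually, the selection can be sketched as follows (the actual heuristic rules are those of Table A1; the tree-number prefixes and data format below are purely illustrative):

```python
def is_clinically_relevant(mesh_tree_numbers, target_prefixes=("C",)):
    """Decide whether an abstract belongs to the small medical corpus (sP).

    `mesh_tree_numbers` are the tree numbers obtained by converting the
    article's MeSH IDs; `target_prefixes` stands in for the rules of
    Table A1 ("C" denotes the Diseases branch of the MeSH tree and is used
    here only as an example).
    """
    return any(tn.startswith(target_prefixes) for tn in mesh_tree_numbers)

# An abstract indexed under a disease-branch tree number is selected.
print(is_clinically_relevant(["C01.925", "E05.318"]))  # True
print(is_clinically_relevant(["E05.318"]))             # False
```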
BERT (prop sP + W + BC: AmpVocab) is an English medical BERT model pre-trained to verify that our method can build a well-performing model from a small medical corpus. We used our sP corpus as the small medical source and BooksCorpus (BC) and English Wikipedia (W) as the general corpora. BERT (sP: BaseVocab) and BERT (W + BC/ sP: BaseVocab) were trained for comparison. The former was pre-trained solely using sP from scratch, and the latter was initialized from BERT-Base and trained using sP for domain-specific adaptation, similar to BioBERT [3].
ouBioBERT (prop fP + oP: AmpVocab) is an enhanced biomedical BERT model pre-trained from scratch using the entire set of PubMed abstracts, in which medical articles, especially those related to human diseases, are amplified using our method. Our approach boosts the amount of training on the target domain within the entire corpus. We investigated whether a BERT model trained via our method, using PubMed abstracts closely related to human diseases (focused PubMed abstracts: fP) as the core corpus and other PubMed abstracts (oP) as the satellite corpus, would achieve better performance on biomedical text-mining tasks than other BERT models. We created fP and oP from the entire set of PubMed abstracts using their MeSH IDs (see Table A1).
To clarify the difference between our pre-trained models and the published models, we refer to the published models as shown in Table 3.
Table 3
List of the names for the published models discussed in this study.
Model | Name |
English | |
BERT-Base | BERT (W + BC: BaseVocab) |
BioBERT | BioBERT (W + BC/ P: BaseVocab) |
clinicalBERT | clinicalBERT (W + BC/ P/ M: BaseVocab) |
SciBERT | SciBERT (Sci: SciVocab) |
BlueBERT (P) | BlueBERT (W + BC/ P: BaseVocab) |
BlueBERT (P + M) | BlueBERT (W + BC/ P/ M: BaseVocab) |
Japanese | |
Tohoku-BERT | BERT (jpW: jpWVocab) |
UTH-BERT | UTH-BERT (EMR: EMRVocab) |
Notes: W: English Wikipedia; BC: BooksCorpus; P: PubMed abstracts; Sci: scientific texts; M: MIMIC-III clinical notes; jpW: Japanese Wikipedia; EMR: electronic medical record of the University of Tokyo Hospital. |
2.4 Task-specific fine-tuning BERT
Given an input token sequence, a pre-trained language model generates an array of vectors as contextual representations. A task-specific prediction layer is then placed on top to produce the final output for the task-specific application. Given task-specific training data, the task-specific model parameters can be trained and the BERT model parameters fine-tuned by gradient descent using backpropagation. Figure 4 shows the general architecture of fine-tuning BERT models for downstream tasks. The input instance is first subjected to task-specific pre-processing and to the addition of special instance markers ([CLS], [SEP], etc.). The transformed input is then tokenized using the vocabulary of the neural language model and fed into it. The sequence of contextual-representation vectors taken from the language model is processed by a Featurizer module and passed to a Predict module to produce the final output for the given task.
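For a sentence-level classification task, this architecture reduces to a small prediction head on top of the [CLS] vector. A minimal sketch with PyTorch and the transformers library follows (the model name, label count, and the choice of the [CLS] vector as the Featurizer are assumptions for illustration):

```python
import torch
from transformers import BertModel, BertTokenizer

class BertClassifier(torch.nn.Module):
    """BERT encoder + Featurizer ([CLS] vector) + Predict (linear layer)."""

    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # Featurizer: take the [CLS] vector
        return self.classifier(cls_vector)            # Predict: task-specific output

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier()
enc = tokenizer("Patient presents with acute abdominal pain.",
                return_tensors="pt", truncation=True)
logits = model(enc["input_ids"], enc["attention_mask"])
# All parameters (BERT and the classifier) are then updated end-to-end with a
# standard optimizer, e.g., torch.optim.AdamW, on the task-specific loss.
```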
We performed three evaluations. First, we studied the performance of the Japanese medical BERT variants, together with several baseline models that are not neural language models, on a medical-document-classification task to confirm that our method can be used in Japanese. Second, we report the BLUE benchmark scores of BERT (prop sP + W + BC: AmpVocab) and publicly available pre-trained BERT models with a single random seed to demonstrate the effectiveness of our method in English. Finally, we executed the BLUE benchmark with five different random seeds and compared the average score of ouBioBERT (prop fP + oP: AmpVocab) with those of BioBERT (W + BC/ P: BaseVocab), BlueBERT (W + BC/ P: BaseVocab), and BlueBERT (W + BC/ P/ M: BaseVocab) to show the potential of our method.