In this section, we describe the dataset, error generation, typo correction, MLM-based candidate word selection, and deep learning-based NER extraction. We examine whether low-quality data containing many typos affects the performance of information extraction. To this end, two processes were performed to extract key information from low-quality data. First, we corrected typos in two datasets, the NCBI-disease dataset and the surgical pathology records. Second, we performed NER after the data were refined through typo correction. Figure 2 illustrates the structure of the overall model. Detailed descriptions of each method are provided in the following subsections.
Dataset
To evaluate the performance of typo correction and of NER after typo correction, we used the NCBI-disease dataset, which is widely used as a benchmark dataset for NER in the medical domain. In addition, EHRs related to lung cancer were provided by Asan Medical Center.
A total of 40,443 diagnosis results were composed of five columns, including the date and time of the prescription, the test code, and the text of the test result. We extracted the type of test, test institution, test location, result, and size of the cancer from the text of the test result. Types of tests include PCNA, needle biopsy, bronchial washing, and pleural fluid, among others. Even when the name of the test written in the text of the test result is the same, there are some differences depending on the author. Figure 3 shows an example of the text of the test results, which is the input data of the model. The first part lists, in order, the name of the organ, location, operation name, histologic diagnosis, tumor size, and lymph node invasion. We excluded lymph node invasion from the scope of information extraction in this work. Table 2 shows the size of the training and test data for each of the two datasets. In the surgical pathology records, one line of the text content of the test result in the Excel file was defined as one sentence and used as the input of the model.
Because the content of the test result is written differently by each author without standardized rules, there are many exceptions that cannot be extracted by a rule-based extraction method. In addition, some parts cannot be extracted in a rule-based manner because of typos introduced while the author is typing. Therefore, in this study we adopted a deep learning-based algorithm to handle the exception cases and typos that rule-based information extraction cannot cover.
Table 2
Size of training and test data for each dataset

| Datasets | Train/Test data | Sentences | Tokens |
| NCBI-disease | Train | 6,347 | 159,670 |
| NCBI-disease | Test | 940 | 24,497 |
| Surgical pathology record | Train | 39,443 | 2,050,125 |
| Surgical pathology record | Test | 1,000 | 49,668 |
Error Generation
Since there is no benchmark dataset for evaluating typo correction studies in the medical field, most studies evaluate by randomly generating typos. Therefore, in this study, as in previous work, typos were defined as the four types shown in Table 3, and these four types were randomly generated. Special characters, numbers, and acronyms were excluded from typo generation.
Table 3
Types of Errors and Examples

| Error type | Example |
| Insertion | differenece/difference |
| Deletion | randm/random |
| Replacement | appand/append |
| Transposition | money/moeny |
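The four error types can be sketched as a small random typo generator. The eligibility filter (skipping special characters, numbers, and acronyms, approximated here as all-uppercase tokens) and the minimum word length are illustrative assumptions, not the exact rules used in the study:

```python
import random
import string

def is_eligible(token: str) -> bool:
    """Skip tokens excluded from typo generation: special characters,
    numbers, and acronyms (approximated here as all-uppercase words)."""
    return token.isalpha() and not token.isupper() and len(token) >= 4

def inject_typo(word: str, error_type: str, rng: random.Random) -> str:
    """Apply one of the four error types from Table 3 at a random position."""
    if error_type == "transposition":          # money -> moeny
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    i = rng.randrange(len(word))
    if error_type == "insertion":              # difference -> differenece
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if error_type == "deletion":               # random -> randm
        return word[:i] + word[i + 1:]
    # replacement: append -> appand (new character always differs)
    new = rng.choice([c for c in string.ascii_lowercase if c != word[i]])
    return word[:i] + new + word[i + 1:]
```

For example, `inject_typo("random", "deletion", random.Random(0))` drops one character of the word, producing a five-character corrupted form.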
Typo Correction
In this study, the candidate group of words to be corrected is selected using the edit distance algorithm, and each typo is corrected by scoring the candidate words in consideration of both the frequency of the words in the dictionary and the context within the sentence. The SymSpell algorithm was used to generate the candidate group of words to be corrected [31]. In general, generating a word candidate group based on edit distance requires four edit operations: delete, transpose, replace, and insert. However, the SymSpell algorithm reduces the amount of computation by using only the delete operation. To optimize the SymSpell algorithm for the medical domain, we trained it on PubMed abstracts (about 25.4 GB of literature, updated in December 2019). In addition, a word dictionary was created that recorded the words in the PubMed abstracts and their frequencies. Words that appeared fewer than 20 times in the entire collection were excluded from the dictionary. As a result, a total of 2,370,526 words were included in the dictionary. Based on this dictionary, word candidate groups were generated using the SymSpell algorithm.
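The delete-only indexing idea behind SymSpell can be illustrated with a minimal pure-Python sketch. The toy dictionary and the omission of SymSpell's final edit-distance verification are simplifications for illustration; the study itself used the SymSpell implementation with the PubMed-derived frequency dictionary:

```python
from collections import defaultdict
from itertools import combinations

def deletes(word: str, max_distance: int = 2) -> set:
    """All strings obtainable by deleting up to max_distance characters."""
    results = {word}
    for d in range(1, max_distance + 1):
        for idx in combinations(range(len(word)), d):
            results.add("".join(c for i, c in enumerate(word) if i not in idx))
    return results

class DeleteOnlyIndex:
    """Sketch of SymSpell's delete-only indexing: dictionary words and
    query words are both reduced to their delete variants, so no
    insert/replace/transpose expansion is needed at lookup time."""

    def __init__(self, dictionary: dict, max_distance: int = 2):
        self.dictionary = dictionary          # word -> corpus frequency
        self.max_distance = max_distance
        self.index = defaultdict(set)
        for word in dictionary:
            for variant in deletes(word, max_distance):
                self.index[variant].add(word)

    def candidates(self, typo: str) -> list:
        """Candidate corrections, most frequent first.
        (The real algorithm also verifies the true edit distance.)"""
        found = set()
        for variant in deletes(typo, self.max_distance):
            found |= self.index.get(variant, set())
        return sorted(found, key=lambda w: -self.dictionary[w])
```

A lookup for the Table 3 example "randm" matches "random" because deleting "o" from the dictionary word and deleting nothing from the typo meet at the same variant string.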
MLM based Candidate Word Selection
In order to find an appropriate word to correct a typo among the generated word candidates, we used a score that combines a frequency-based score and context-based score. The formula is as follows.
$$\mathit{FinalScore} = \lambda \,\mathit{FrequencyScore} + (1-\lambda )\,\mathit{ContextSensitiveScore}$$
Using the MLM method, scores were obtained that take into account the context within the sentence. Specifically, a pre-trained BERT-base MLM model was used. By adding a dense layer to the pre-trained model, we calculated the probability of a specific word appearing at the masked position of the input sentence and used it as the context score. Table 4 shows the structure of the layers added to the pre-trained BERT model.
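The final-score formula can be sketched as follows. The min-max normalization over the candidate set and the example numbers are assumptions for illustration, and the `context_scores` values stand in for the MLM probabilities described above:

```python
def select_best(candidates, dictionary, context_scores, lam=0.5):
    """Combine the two components as in the formula:
    FinalScore = lam * FrequencyScore + (1 - lam) * ContextSensitiveScore.
    Both components are min-max normalized over the candidate set so they
    are on comparable scales (a design choice assumed here, not stated
    in the text)."""
    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in values]

    freqs = normalize([dictionary.get(c, 0) for c in candidates])
    ctxs = normalize([context_scores[c] for c in candidates])
    scored = {c: lam * f + (1 - lam) * x
              for c, f, x in zip(candidates, freqs, ctxs)}
    return max(scored, key=scored.get), scored
```

With the Figure 4 example, a candidate such as "eating" that is both frequent in the dictionary and probable at the masked position receives the highest combined score.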
Table 4
Structure of the candidate word selection model
Layer | Output Shape |
Input | (None, None, 768) |
Dense | (768, 768) |
Layer Normalization | (768,) |
Output | (None, None, 30522) |
Figure 4 shows the process of finding the appropriate corrected word from the generated candidate word group. 'Eting' in Fig. 4(a) should be corrected to 'eating'. Fig. 4(b) shows the generation of candidate words to correct the typo through the optimized SymSpell algorithm. In Fig. 4(c) and (d), the position where the typo appears is masked and replaced with each candidate word, and the probability of that word occupying the position in the sentence is calculated with the MLM model.
Deep Learning based NER Extraction
In order to extract key information from the medical data, we fine-tuned the pre-trained BERT model with a dense layer added on top. The hyperparameters used for fine-tuning are shown in Table 5. The output value \({y}_{i}\) of the model is the probability that the input value \({x}_{i}\) belongs to each of the n tags of the dataset, and it has the form of a (1, n) vector. The softmax function is used to learn the probability assigned to each of the n tags. When \({a}_{k}\) is the logit of the \(k\)th tag among the n tags, the probability that the \(k\)th tag is correct is \({y}_{k}\). The softmax function is as follows:
$${y}_{k}=\frac{\exp({a}_{k})}{\sum _{i=1}^{n}\exp({a}_{i})}$$
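The softmax formula translates directly into code; subtracting max(a) before exponentiating does not change the result but avoids numerical overflow:

```python
import math

def softmax(a):
    """Softmax over a list of logits: y_k = exp(a_k) / sum_i exp(a_i).
    Shifting by max(a) keeps exp() in a safe numeric range."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are non-negative and sum to 1, so the largest entry can be read off as the predicted tag probability.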
As output, each word is tagged as one of [B-Disease, I-Disease, O] for the NCBI-disease dataset, and as one of [B-ORGAN, I-ORGAN, B-LOCATION, I-LOCATION, B-OPNAME, I-OPNAME, B-HISTOLOGIC DIAGNOSIS, I-HISTOLOGIC DIAGNOSIS, B-TUMOR_SIZE] for the surgical pathology records. In this way, key information such as the organ, location, operation name, histologic diagnosis, and tumor size can be extracted from the surgical pathology records.
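Turning a predicted tag sequence back into entities is a standard BIO-decoding step. This sketch assumes pre-tokenized input and simple (type, text) output, which the paper does not specify:

```python
def decode_bio(tokens, tags):
    """Group tokens tagged with the BIO scheme into (entity_type, text)
    spans: B- starts a new entity, I- continues one of the same type,
    anything else closes the current entity."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(toks)) for etype, toks in entities]
```

For a tagged NCBI-disease sentence this yields the disease mentions; for the surgical pathology records it yields the organ, location, operation name, histologic diagnosis, and tumor-size spans.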
Table 5
Hyperparameters used for fine-tuning

| Hyperparameter | Value |
| learning rate | 3e-5 |
| epochs | 3 |
| max sequence length | 178 |
| batch size | 16 |
| optimizer | Adam |
| activation function | Softmax |