A New Modeling Method Based on Candidate Windows for Clinical Concept Extraction

Background: Recently, how to structuralize electronic medical records (EMRs) has attracted considerable attention from researchers. Extracting clinical concepts from EMRs is a critical part of EMR structuralization, and its performance directly affects the downstream tasks that build on it. We propose a new modeling method based on candidate window classification, which differs from mainstream sequence labeling models: it improves the performance of clinical concept extraction under the strict standard by considering the overall semantics of a token sequence instead of the semantics of each token. We call this model the slide window model. Method: In this paper, we comprehensively study the performance of the slide window model on clinical concept extraction tasks. We model clinical concept extraction as the classification of each candidate window extracted by a sliding window. The proposed model consists of four parts. First, a pre-trained language model generates context-sensitive token representations. Second, a convolutional neural network (CNN) generates the representation vectors of all candidate windows in a sentence. Third, a Softmax classifier classifies every candidate window to obtain a probability distribution over concept types. Finally, the knapsack algorithm is used as a post-processing step to maximize the sum of the scores of disjoint clinical concepts and filter the clinical concepts. Results: Experiments show that the slide window model achieves the best micro-average F1 score (81.22%) on the corpus of the 2012 i2b2 NLP challenge and achieves an 89.25% F1 score on the 2010 i2b2 NLP challenge under the strict standard. Furthermore, with the same pre-trained language model, our approach consistently outperforms the BiLSTM-CRF model and the softmax classifier.
Conclusions: The slide window model demonstrates a new modeling method for solving clinical concept extraction tasks. It models clinical concept extraction as a candidate window classification problem and extracts clinical concepts by considering the semantics of the entire candidate window. Experiments show that considering the overall semantics of the candidate window can improve the performance of clinical concept extraction tasks under the strict standard.


Background
Electronic medical records (EMRs) contain substantial clinical data about patients during hospitalization, such as symptoms, tests, and treatments, and are the largest source of empirical data in medical research. Most of this information exists in the form of natural language, and such unstructured data is hard to use directly. With the rapid development of EMR systems, how to effectively extract structured information from EMRs written in natural language has become a research hotspot.
Clinical concept extraction from free text is a crucial part of EMR structuralization. Clinical concepts include both named entities, such as problems, tests, treatments, and clinical departments, and events relevant to the patient's clinical timeline, such as admission, transfer between departments, etc. [1] Clinical concept extraction is a precursor to downstream tasks such as relation extraction [2] and co-reference resolution [3]. Multiple clinical corpora for different tasks, such as the 2010 i2b2 and 2012 i2b2 NLP challenges [1,4], contain clinical concept extraction as a subtask.
The mainstream solution for clinical concept extraction is to model the task as a sequence labeling problem [5]. A common practice is to employ a specific set of labels (such as 'BIO', 'BEMO', etc.) to label each token in the sentence separately, and then extract the corresponding clinical concepts from the sentence according to the labeling results. A 'BIO' labeling case is shown below: The/O patient/O coughs/B-PROBLEM with/I-PROBLEM fever/I-PROBLEM ./O. Here the 'B-PROBLEM' tag represents the starting token of a 'PROBLEM' entity, the 'I-PROBLEM' tag represents a successor token of the 'PROBLEM' entity, and 'O' represents a non-entity token. One problem with this modeling method is that a wrong prediction for any single token label leads to a wrong extraction of the entire clinical concept. For example, in the above case, 'coughs with fever' should be extracted as one complete clinical concept. However, if the label of 'with' is incorrectly predicted as 'O', the extracted result becomes two clinical concepts, 'coughs' and 'fever'. Please note that the label of 'fever' is changed to 'B-PROBLEM' to keep the label sequence legal.
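To make this failure mode concrete, the following minimal Python sketch (the function name and example labels are ours, not from any specific library) decodes BIO labels into concept spans and shows how a single mislabeled token splits one concept into two:

```python
def bio_to_spans(tokens, labels):
    """Decode BIO labels into (start, end, type) concept spans (end inclusive)."""
    spans, start, ctype = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if start is not None:
                spans.append((start, i - 1, ctype))
            start, ctype = i, label[2:]
        elif label.startswith("I-") and start is not None and label[2:] == ctype:
            continue  # the current concept keeps growing
        else:  # 'O' (or an inconsistent 'I-') ends the current concept
            if start is not None:
                spans.append((start, i - 1, ctype))
            start, ctype = None, None
    if start is not None:
        spans.append((start, len(labels) - 1, ctype))
    return spans

tokens = ["The", "patient", "coughs", "with", "fever", "."]
gold = ["O", "O", "B-PROBLEM", "I-PROBLEM", "I-PROBLEM", "O"]
bad  = ["O", "O", "B-PROBLEM", "O", "B-PROBLEM", "O"]  # 'with' mispredicted
print(bio_to_spans(tokens, gold))  # [(2, 4, 'PROBLEM')]: 'coughs with fever'
print(bio_to_spans(tokens, bad))   # split into 'coughs' and 'fever'
```

A single wrong label on 'with' thus destroys the whole concept, even though five of the six tokens were labeled correctly.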
We call this type of extraction error a boundary segmentation error. A boundary segmentation error is counted as correct under the relaxed standard (which only requires an extracted clinical concept to intersect the correct one) but not under the strict standard (which requires the boundaries of the extracted clinical concept to match the correct one exactly). Therefore, we can estimate the proportion of boundary segmentation errors in clinical concept extraction by observing the performance difference of a model under the relaxed and strict standards. Table 1 shows the performance of two sequence labeling models on the 2010 i2b2 and 2012 i2b2 tasks.
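The two standards can be stated as predicate checks on (start, end, type) triples. This illustrative sketch (the function names are ours) shows why a boundary segmentation error passes the relaxed standard but fails the strict one:

```python
def strict_match(pred, gold):
    """Strict standard: span boundaries and type must agree exactly."""
    return pred == gold

def relaxed_match(pred, gold):
    """Relaxed standard: same type and the two spans intersect."""
    (ps, pe, pt), (gs, ge, gt) = pred, gold
    return pt == gt and ps <= ge and gs <= pe

gold = (2, 4, "PROBLEM")          # 'coughs with fever'
pred = (2, 2, "PROBLEM")          # boundary segmentation error: only 'coughs'
print(relaxed_match(pred, gold))  # True  (counted correct under relaxed)
print(strict_match(pred, gold))   # False (rejected under strict)
```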
Here, 'BERT base +Softmax' and 'BERT base +BiLSTM-CRF' respectively denote the softmax classifier [6] and the BiLSTM-CRF model [7], both using a base-sized Bidirectional Encoder Representations from Transformers (BERT) model [6] as the embedding layer. As can be seen from Table 1, the gap between the relaxed and strict standards indicates that boundary segmentation errors account for a considerable share of extraction errors.

Unfortunately, predicting every token accurately is hard. The main reason is that some tokens have an uneven label distribution in the training set. For example, in the training set of the 2012 i2b2 task, 'with' appears 1392 times in total, of which only 17 occurrences are part of a 'problem' entity (labeled I-PROBLEM or B-PROBLEM), accounting for about 1.22% of its occurrences. This uneven label distribution makes it hard for a sequence labeling model to correctly predict the label of 'with' as 'I-PROBLEM' or 'B-PROBLEM'. Similar tokens include 'and', 'of', '/', and so on. Figure 1 shows the label distributions of these tokens in the 2012 i2b2 training set.

To improve the performance of clinical concept extraction under the strict standard, we propose a modeling method different from sequence labeling. The clinical concept extraction task is modeled as the problem of classifying all possible token sequences. A token sequence to be classified is called a candidate window. For example, 'The patient', 'coughs', and 'coughs with fever' are candidate windows. We call this model the slide window model (SW). The slide window model classifies candidate windows instead of predicting a label for each token of a clinical concept, which avoids the boundary segmentation errors caused by wrong predictions of individual token labels and improves performance under the strict standard. Our main contributions include:
• We model the clinical concept extraction task as a candidate window classification problem for the first time.
• We propose a slide window model that first proposes possible segmentations and then measures how likely each segmentation is to be extracted, using features extracted by a CNN.
• We achieve state-of-the-art performance on the 2012 i2b2 task.

Clinical concept extraction
Clinical concept extraction aims to extract clinical concepts (e.g., problems, tests, and treatments) from EMRs. Solutions to the clinical concept extraction task can be roughly divided into two categories. The first category consists of feature-based approaches, which extract clinical concepts with machine learning models (e.g., HMM, CRF) using manually designed clinical domain knowledge as features [8,9]. The other category mainly contains neural network-based approaches, which generally model clinical concept extraction as a sequence labeling problem and extract clinical concepts by predicting a label for each token. To date, the most mainstream model for clinical concept extraction is the bidirectional Long Short-Term Memory with Conditional Random Field (BiLSTM-CRF) model [10][11][12]. The BiLSTM-CRF model captures forward and backward context-sensitive information in the sentence with its BiLSTM layer and models the correlations of the label sequence with its CRF layer, decoded by the Viterbi algorithm. As pre-trained language models have made significant advances in the NLP field [6,13], more researchers have applied pre-trained context-sensitive word embeddings to medical NLP. For instance, ELMo has shown excellent performance in clinical concept extraction [14]. Some recent studies have attempted to pre-train medical-specific language models on unlabeled medical texts [15][16][17][18] to improve the performance of the BiLSTM-CRF model on medical NLP tasks. These studies demonstrate the enormous potential of domain-specific pre-training in medical NLP tasks.

Method
We solve the clinical concept extraction problem in two stages. The first stage is called candidate concept generation, in which we classify candidate windows into different concept types and calculate the classification confidence using features of candidate windows extracted by a CNN. If the confidence of a candidate window exceeds a threshold, it is added to the candidate concept list. Here, a candidate window is a sequence of tokens in a sentence whose length does not exceed a maximum window length. The second stage is called concept score maximization. Since two candidate concepts generated in the first stage may overlap, we post-process the candidate concept list to obtain a final, overlap-free set of extracted clinical concepts. In this paper, we use the knapsack algorithm as the post-processing method: it takes the classification confidence of each candidate concept as its concept score and optimizes the final clinical concept set of a sentence by maximizing the sum of concept scores. The details of each stage are described below.

Candidate concept generation
The overall architecture of candidate concept generation is shown in Figure 2, which primarily involves the following three components: i) Context-sensitive token representation, which generates a semantic representation of each token of a sentence using bidirectional encoder representations from Transformers (BERT). ii) CNN-based representation of candidate windows, which convolves all token representations in a candidate window to obtain a feature representation of each candidate window. iii) Candidate window classification, which predicts the concept type of each candidate window and measures the classification confidence from its feature representation. Details are given in the following sections.

Context-sensitive token representation
In natural language, the semantics of a token vary with its context. Recent studies show that pre-trained language models that consider contextual information can represent the semantic information of each token more accurately. In this paper, we use bidirectional encoder representations from Transformers (BERT) to generate a context-sensitive embedding of each token. Inspired by the ELMo model, we concatenate the context-sensitive embedding (blue rectangles in Figure 2) from the last layer of the BERT model and the token embedding (green rectangles in Figure 2) from the embedding layer of BERT as the input to the downstream task. This concatenation retains the semantic information of tokens at different levels, which provides more information for downstream tasks.
To compare with the SOTA results on the 2010 i2b2 and 2012 i2b2 tasks, we followed Si 2019 et al. [18] and used the MIMIC III database to further pre-train the BERT model on the clinical domain. The training parameters are set according to the authors' detailed instructions. MIMIC III [19] is a public database of intensive care unit (ICU) patients.

CNN-based representation of candidate window
In this component, we apply a convolutional neural network (CNN) to extract the features of each candidate window. If the length of candidate windows is not limited, their number grows as O(N²), where N is the number of tokens in the sentence, which is too large. We therefore set a maximum window length and select all token sequences no longer than it as candidate windows. This not only reduces the number of candidate windows but also ensures that the model does not miss any clinical concept within the maximum window length. The maximum window length is set according to the statistics of concept lengths in the data set.
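Under a maximum window length M, the enumeration shrinks from N(N+1)/2 unrestricted windows to roughly N·M. A minimal sketch (the function name is ours):

```python
def candidate_windows(tokens, max_len):
    """All token sub-sequences of length <= max_len, as (start, end) inclusive."""
    n = len(tokens)
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len, n))]

tokens = ["The", "patient", "coughs", "with", "fever", "."]
wins = candidate_windows(tokens, 3)
# without the limit there would be N*(N+1)/2 = 21 windows; the cap keeps 15
print(len(wins))
```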
One straightforward way to extract the features of each candidate window with CNNs is to define K separate CNNs, one per candidate window, where K is the total number of candidate windows in the sentence. Generally K is large, about O(NM), where M is the maximum window length; since the number of candidate windows is extensive, defining a CNN for each candidate window separately causes unacceptable space overhead. Another direct idea is a two-stage approach: after obtaining the context-sensitive token representations, run a CNN over each candidate window once. This method is undoubtedly very time-consuming, because it requires K convolution operations per sentence. In this paper, we propose a new method that extracts the feature vectors of all candidate windows in the sentence with only one pass of the CNN.

Figure 3 Performing a 1D convolution of length 2 on the candidate window 'coughs with fever'

Figure 3 shows the process of convolving the candidate window 'coughs with fever' with a 1-dimensional convolution operator of length 2. Generally, only the final output of the CNN (the orange region) is used as the feature vector of the candidate window, while the outputs of the lower convolution layers (the yellow region) are ignored. However, as Figure 3 shows, the feature vectors generated by the lower convolution layers contain the feature information of shorter candidate windows. For example, the feature vector represented by the yellow region in Figure 3 can be regarded as the feature representation of the candidate window 'coughs with'.
Based on these observations, we use feature vectors obtained from different convolutional layers to represent the feature information of candidate windows of different lengths and achieve the purpose of extracting all candidate window feature vectors, using CNN only once.
As shown in Figure 4, the context-sensitive token representations act as the feature representations of all candidate windows of length 1. In the first convolutional layer, we convolve the input with a 1D convolution operator of length 2. The outputs of the first convolutional layer act as the representations of all candidate windows of length 2, and so on. Since the vector representations of different layers may have different dimensions during convolution, we arrange these representations in order and project them into a vector space of the same dimension to serve as the final feature representation of each candidate window.
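The layered scheme above can be sketched as follows. This is an illustrative NumPy re-implementation under our own simplifying assumptions (random, untrained length-2 convolution weights, ReLU activation, no final projection layer), not the authors' exact network:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_pyramid(token_reps, max_len, weights=None, biases=None):
    """Stack length-2 1D convolutions so that the (k-1)-th layer yields one
    feature vector per candidate window of length k.
    token_reps: (N, d) context-sensitive token representations.
    Returns a dict: window length -> (N - length + 1, d) feature matrix."""
    n, d = token_reps.shape
    if weights is None:  # hypothetical random parameters, for illustration only
        weights = [rng.standard_normal((2 * d, d)) * 0.01 for _ in range(max_len - 1)]
        biases = [np.zeros(d) for _ in range(max_len - 1)]
    reps = {1: token_reps}  # length-1 windows = the token representations
    h = token_reps
    for k, (w, b) in enumerate(zip(weights, biases), start=2):
        # each output position fuses two adjacent vectors of the layer below,
        # so position i of layer k-1 summarizes the window [i, i + k - 1]
        pairs = np.concatenate([h[:-1], h[1:]], axis=1)  # (N - k + 1, 2d)
        h = np.maximum(pairs @ w + b, 0.0)               # ReLU activation
        reps[k] = h
    return reps

reps = conv_pyramid(rng.standard_normal((6, 8)), max_len=3)
print({k: v.shape for k, v in reps.items()})  # {1: (6, 8), 2: (5, 8), 3: (4, 8)}
```

One forward pass thus produces a representation for every candidate window up to the maximum length, which is the source of the time savings described above.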

Candidate window classification
In this component, the candidate concept list is generated. First, we classify each candidate window as a type of clinical concept or as "not a clinical concept" using a Softmax classifier. Then, we add every candidate window that is predicted as a clinical concept with a confidence exceeding the threshold to the candidate concept list. The threshold is set manually based on experience; in this paper, we simply set it to 0.5.
We use a one-hot vector to represent the type of a candidate window and cross-entropy loss to measure the correctness of the classification. A window classified as a clinical concept is called a positive case, and a window that is not a clinical concept is called a negative case. We notice that positive cases are very sparse among all candidate windows. For example, in the i2b2 data sets, each sentence contains no more than ten positive cases, while the number of candidate windows it generates is much larger. For instance, in a slide window model with a maximum sequence length of 128 and a maximum window length of 10, the number of candidate windows can be as high as 1,650. To address this data imbalance, we introduce a weighting factor into the loss function to increase the weight of the positive cases, as shown in Equation 1.
L = α · l_p + (1 − α) · l_n    (1)

where l_p is the mean loss over all positive examples, l_n is the mean loss over all negative examples, α is the weighting factor that increases the weight of the positive cases, and L is the weighted loss.
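A minimal sketch of this weighted loss, assuming the form L = α·l_p + (1 − α)·l_n and treating class 0 as "not a clinical concept" (both assumptions are ours; the function name is hypothetical):

```python
import numpy as np

def weighted_window_loss(probs, labels, alpha=0.6):
    """Class-weighted cross-entropy over candidate windows (sketch of Eq. 1).
    probs: (K, C) softmax outputs; labels: (K,) ints, class 0 = not a concept.
    alpha up-weights the sparse positive windows."""
    ce = -np.log(probs[np.arange(len(labels)), labels])  # per-window CE loss
    pos, neg = labels != 0, labels == 0
    l_p = ce[pos].mean() if pos.any() else 0.0  # mean loss of positive cases
    l_n = ce[neg].mean() if neg.any() else 0.0  # mean loss of negative cases
    return alpha * l_p + (1 - alpha) * l_n

# three candidate windows: two negatives, one positive of concept class 1
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 1, 0])
loss = weighted_window_loss(probs, labels, alpha=0.6)
```

With α = 0.6, as used in the experiments, the single positive window contributes more to L than the two negatives combined would under uniform averaging.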
Concept score maximization

Since the generated candidate concepts may conflict, for example when the boundaries of two candidate concepts overlap, we need to select a conflict-free set of candidate concepts from the candidate concept list. To measure the quality of different candidate concepts, we define a score for each of them, and our goal is to find a set of candidate concepts with the largest total concept score. We call this process concept score maximization.

Algorithm 1 Concept Score Maximization Based Knapsack Algorithm
Input: the concept scores F(i, j); sentence length N
Output: the set of selected clinical concepts ConceptSet
1: S(0) = 0
2: for n = 1 → N do
3:     S(n) = max_{0 ≤ i < n} (S(i) + F(i + 1, n))
4:     Pre(n) = the value of i that maximizes S(n)
5: t = N
6: while t != 0 do
7:     if CandidateWindow(Pre(t) + 1, t) is a concept then
8:         add CandidateWindow(Pre(t) + 1, t) into ConceptSet
9:     t = Pre(t)
10: return ConceptSet

In this paper, we take the classification confidence of each candidate concept as its concept score and use the knapsack algorithm to find the set of concepts with the highest total score. Equation 2 shows the state transition equation of the knapsack algorithm.
S(n) = max_{0 ≤ i < n} (S(i) + F(i + 1, n))    (2)

where S(n) is the largest total concept score over the first n tokens, and F(i, j) is the concept score of the candidate concept starting at the ith token and ending at the jth token; if a token sequence is not a candidate concept, its score is 0. Each time S(n) is updated, we record the maximizing value of i as Pre(n), which is used to recover the last concept E(i + 1, n) and the previous split point. The specific procedure is shown in Algorithm 1.
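The dynamic program and its backtracking step can be sketched as follows; token indices are 1-based as in the text, and the dictionary-based score table is our own simplification of F(i, j):

```python
def maximize_concept_score(scores, n):
    """Select non-overlapping concepts maximizing the total concept score.
    scores: dict mapping (i, j) [1-based, inclusive] -> confidence of the
    candidate concept spanning tokens i..j; missing windows score 0."""
    S = [0.0] * (n + 1)   # S[t]: best total score over the first t tokens
    pre = [0] * (n + 1)   # pre[t]: split point that achieved S[t]
    for t in range(1, n + 1):
        # state transition: S(t) = max over i < t of S(i) + F(i + 1, t)
        best_i = max(range(t), key=lambda i: S[i] + scores.get((i + 1, t), 0.0))
        S[t] = S[best_i] + scores.get((best_i + 1, t), 0.0)
        pre[t] = best_i
    concepts, t = [], n
    while t != 0:  # backtrack through pre[] to recover the chosen concepts
        if (pre[t] + 1, t) in scores:
            concepts.append((pre[t] + 1, t))
        t = pre[t]
    return list(reversed(concepts))

# two overlapping candidates: 'coughs with fever' (3..5) beats 'coughs' (3..3)
scores = {(3, 5): 0.9, (3, 3): 0.6, (1, 2): 0.7}
print(maximize_concept_score(scores, 6))  # [(1, 2), (3, 5)]
```

Because overlapping candidates compete for the same token positions, the higher-confidence span wins and the conflict is resolved globally rather than greedily.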

Dataset
Our experiments are performed on two widely studied publicly available datasets, 2010 i2b2 and 2012 i2b2. The 2010 i2b2 challenge data contains 170 training and 256 test EMRs with three clinical concept types: PROBLEM, TEST, and TREATMENT. The 2012 i2b2 challenge data contains 190 training and 120 test discharge summaries with six clinical concept types: PROBLEM, TEST, TREATMENT, EVIDENTIAL, CLINICAL DEPARTMENT, and OCCURRENCE. The pre-training database is MIMIC III [19], a public database consisting of almost 2 million clinical notes.

Baseline
In this paper, we conducted two experiments. To illustrate the performance improvement of the proposed method over sequence labeling-based methods under the strict standard, we selected two currently mainstream sequence labeling methods for comparison in the first experiment. One is the BiLSTM-CRF model, a widely used sequence labeling model whose structure follows Lample 2016 et al. [7]. The other is the method proposed by Google in 2018 that uses only a softmax classifier on the token representation vectors generated by the pre-trained language model [6]. In Table 2, we use 'BiLSTM-CRF', 'Softmax', and 'SW' to denote the BiLSTM-CRF model, the softmax classifier, and the slide window model, respectively.
In the second experiment, we compare the performance of the proposed method with the current SOTA results. The SOTA results on 2012 i2b2 and 2010 i2b2 were obtained by Si 2019 et al. [18], whose work further trained pre-trained language models on the MIMIC III database. For comparison, we followed their work and used the MIMIC III database to pre-train domain models. In Table 3, we use 'SOTA' and 'SW' to represent the SOTA results and our results, respectively.

Pre-training
The medical-specific model is initialized with base-sized BERT (BERT base) and pre-trained on the MIMIC III database [19]; the resulting model is denoted BERT base-mimic. Unless otherwise specified, we follow Google's original detailed instructions to set the pre-training parameters. The vocabulary of 30522 WordPiece tokens used in BERT base is adopted, and all words are lowercased, as is standard practice. We performed 700,000 pre-training steps and took the latest preserved checkpoint as BERT base-mimic. These settings are the same as Si 2019 et al. [18]. The maximum sequence length of the BERT model is set to 128; sentences exceeding this length are truncated with BERT's default tokenization method.

Experiment
In the first experiment, a base-sized BERT model was used in all three models (the BiLSTM-CRF model, the softmax classifier, and the slide window model) to generate context-sensitive token representations, in order to eliminate performance differences caused by different token embeddings. In addition, to limit differences in expressive power caused by different representation vector lengths, the length of the representation vectors output by all three models is set to 768, consistent with the output of BERT base. The implementations of the BiLSTM-CRF model and the softmax classifier follow Lample 2016 et al. [7] and Google 2018 [6], respectively. For the hyperparameters of the slide window model, we set the maximum window length to 10 based on statistics of the 2010 i2b2 and 2012 i2b2 training sets; this window length covers more than 98.8% of clinical concepts. Figure 5 shows the number and proportion of clinical concepts covered by different maximum window lengths in the 2012 i2b2 task. Meanwhile, we set the weighting factor in the loss function to 0.6 and the threshold for generating a candidate concept to 0.5 based on our experience. The convolutional neural network in the slide window model uses 1-dimensional convolution operators of length 2. For the activation function of the convolution operator, we tried Sigmoid and ReLU and achieved similar performance.
In the second experiment, we used the results reported by Si 2019 et al. [18] as the baseline, which is the state of the art on the 2010 i2b2 and 2012 i2b2 corpora so far. The baseline system used the external MIMIC III database to further pre-train the BERT model and then used a BiLSTM-CRF model to label the tokens. Our reproduction of this system is shown in Table 3 and represented by 'BiLSTM-CRF'.

Evaluation
We evaluated the overall performance of the models under the strict standard. The performance metrics are precision, recall, and F1 score, with the F1 score as the final evaluation criterion.
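Under the strict standard, a prediction counts only on an exact (start, end, type) match against a gold concept. A minimal sketch of the metric computation (the function name is ours):

```python
def strict_prf(pred_concepts, gold_concepts):
    """Precision, recall, and F1 under the strict standard: a prediction is a
    true positive only if its (start, end, type) triple exactly matches gold."""
    pred, gold = set(pred_concepts), set(gold_concepts)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(2, 4, "PROBLEM"), (7, 7, "TEST")]
pred = [(2, 2, "PROBLEM"), (7, 7, "TEST")]  # one boundary segmentation error
print(strict_prf(pred, gold))  # (0.5, 0.5, 0.5)
```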
The training data is split for ten-fold cross-validation. We train the model on the training folds and monitor training with the development set. Finally, we report the performance of the model on the test set.
A checkpoint of the model is saved every 1000 steps during training. The one with the best performance among the last five saved checkpoints is selected as the final result.

Table 2 shows the performance of the different models on the two clinical concept extraction tasks, evaluated by F1 score. In general, the slide window model performs better than the BiLSTM-CRF model and the Softmax model under the strict standard when the same pre-trained language model is used.

Table 3 shows the performance comparison of our proposed slide window model with the SOTA results on the 2012 i2b2 and 2010 i2b2 corpora. For the 2012 i2b2 corpus, the best performance under the strict standard is achieved by the slide window model with an F1 score of 81.22%, improving by 0.88% over the previous SOTA result of 80.34% achieved by Si 2019 et al. On the other hand, the performance of our model on 2010 i2b2 is slightly worse than the current SOTA result, with F1 scores differing by 0.3%. However, compared with our reproduction of the SOTA system, we still improve the F1 score by 1.07%.

To further analyze the impact of different categories of clinical concepts on the performance of the slide window model, we separately calculated its performance on each concept type in the 2012 i2b2 task. The experimental setup was the same as in the first experiment, and the F1 score was used as the final evaluation criterion. The experimental results are shown in Table 4.

Discussion
In this study, we investigated the performance of the slide window model on the clinical concept extraction task. Compared with the BiLSTM-CRF model, the slide window model improves the F1 score on both the 2010 i2b2 and 2012 i2b2 corpora under the strict standard. However, we noticed that the improvement of the slide window model over the BiLSTM-CRF model differs between the 2010 i2b2 and 2012 i2b2 tasks. This is mainly because the two tasks include different types of clinical concepts. As can be seen from Table 4, compared with the two sequence labeling-based methods, the slide window model improves performance on all categories of clinical concepts, but the magnitude of the improvement differs; the 'OCCURRENCE' category improves the most.
Through further analysis, we found that, compared to sequence labeling models, the slide window model yields relatively small improvements on well-standardized clinical concepts, such as clinical departments. Such clinical concepts have a more uniform structure and higher similarity between named entities, so sequence labeling models learn them more easily and correctly predict the token labels within them. Conversely, for clinical concepts with a relatively vague structure and sparsely distributed named entities, such as occurrences, the slide window model achieves much better performance than sequence labeling models.

Conclusion
In this paper, we comprehensively investigate the performance of the slide window model on clinical concept extraction. Experiments on the 2010 i2b2 and 2012 i2b2 corpora show that 1) the proposed modeling method based on candidate window classification outperforms models based on sequence labeling; 2) given the same pre-trained language model, the slide window model outperforms the current mainstream clinical concept extraction methods (the BiLSTM-CRF model and the softmax classifier); and 3) considering the overall semantics of clinical concepts instead of the semantics of each token effectively improves the performance of clinical concept extraction under the strict standard.

List of abbreviations
SW: slide window model; CNN: convolutional neural network; ICU: intensive care unit; NLP: natural language processing; BERT: bidirectional encoder representations from transformers; BiLSTM-CRF: bidirectional long short-term memory with conditional random field