Drug Knowledge Extraction Framework with Entity Pair Calibration for Chinese Drug Instructions

Abstract
Existing pharmaceutical information extraction research often focuses on standalone entity or relation identification tasks over drug instructions, and a holistic solution for drug knowledge extraction is lacking. Moreover, current methods perform poorly in extracting fine-grained interaction relations from drug instructions. To solve these problems, this paper proposes an information extraction framework for drug instructions. The framework uses deep learning models built on fine-tuned pre-training models for entity recognition and relation extraction. In addition, it incorporates a novel entity pair calibration process to improve the performance of fine-grained relation extraction. The framework is evaluated on more than 60k Chinese drug description sentences from 4000 drug instructions. Empirical results show that the framework can successfully identify drug-related entities (F1 >= 0.95) and their relations (F1 >= 0.83) in this realistic dataset, and that entity pair calibration plays an important role (~5% F1 score improvement) in extracting fine-grained relations.


Keywords: Information Extraction; Entity Pair Calibration; Pre-Training Model; Chinese Drug Instruction

Introduction
Drug instructions contain a wealth of drug knowledge, which can provide decision support for clinical diagnosis, prescription and healthcare management. However, the textual descriptions in drug instructions usually take the form of long, complicated sentences, which are difficult for humans or machines to process. Existing information extraction methods in the field of medicine include drug and disease entity recognition [1][2][3], drug-disease relation extraction [4], etc. These methods mainly focus on specific tasks and do not present a complete framework for drug knowledge extraction. We argue that a complete drug knowledge extraction framework is crucial for drug knowledge construction and for knowledge-based medical and pharmaceutical applications. The framework should take original drug instruction documents as input and generate the relevant knowledge graphs from them. Moreover, the relations extracted from drug instructions by current approaches are typically coarse-grained. Most existing work identifies only the existence or polarity of drug-drug interactions, which is insufficient in clinical practice. For example, DrugBank provides drug interaction knowledge only as text descriptions stored as data properties (strings). It does not provide word- or phrase-level semantics for detailed drug interaction knowledge, such as the interaction mechanism, interaction result or clinical suggestion, which prohibits fine-grained reasoning over the drug knowledge.
To overcome the limitations of current approaches, this paper proposes a drug information extraction framework based on natural language processing techniques and deep learning models. It starts by automatically gathering the textual descriptions of drug instructions from public sources. After data gathering and cleaning, it applies named entity recognition with natural language processing tools and deep learning models to identify 4 categories of entities: drugs, diseases, body parts and symptoms. Named entity recognition yields a set of sentences, each with a list of relevant entities, to which an entity pair calibration (EPC) step is then applied. The goal of EPC is to distinguish between the primary entity (the drug that the instruction is intended for) and the secondary entity (the drug mentioned in a sentence of the instruction and potentially associated with the primary drug). This step reduces the noise for subsequent drug relation extraction and facilitates accurate extraction of fine-grained relations. Finally, we use a combination of BERT(wwm) and BiGRU-ATT to extract the relations between entity pairs.
The main contribution of this paper is two-fold: 1) we propose a pre-training-based drug information extraction framework with a novel entity pair calibration process to extract drug-related entities and fine-grained relations from drug instructions; 2) we evaluate the framework in an empirical study over a realistic dataset that contains over 60k sentences from 4000 drug instructions.
The remainder of the paper is organized as follows: Section 2 discusses the related work and the state-of-the-art with regard to drug information extraction; Section 3 presents the drug information extraction framework proposed in this paper, including the architectural design of the framework as well as specific neural models used in each module in the framework; Section 4 elaborates on the details of the entity pair calibration mechanism; Section 5 demonstrates and analyzes the empirical results before Section 6 concludes.

Related work
Since Rau et al. [5] first proposed the task of named entity recognition in 1991, named entity recognition and relation extraction have developed rapidly. Early work relied on rules and traditional machine learning. For example, Fang Xiaoshan et al. [6] proposed a rule-based named entity recognition method, but it relies heavily on predefined high-frequency rules, so the overall information extraction process is not sufficiently automatic. There are also neural network-based methods. Chen Chen et al. [7] applied a BiLSTM-CRF model with an attention mechanism to named entity recognition in electronic medical records, which improved the performance of the BiLSTM-CRF model to a certain extent, but it covers only one task of the whole information extraction process. Gong Lejun et al. [8] used a drug entity relation extraction method based on GRU and CNN, which mainly extracted adverse drug reactions (ADRs).
On the DDIExtraction 2013 task, the comprehensive evaluation score was 75%. However, a complete information extraction process was lacking, and the classification of the main entity and sub-entity was not well explained. Wang Yongchao [9] proposed an entity-relationship extraction method based on a pointer network, but this pointer annotation scheme does not pay enough attention to the continuity of entities, which increases the probability of generating illegal sequence annotations in the entity recognition stage and results in low accuracy. Medical text contains a large number of nested entity names among diseases and drugs, so this method produces many wrong entity names when applied to the medical field. The methods above each address only one task rather than a complete information extraction process, and the granularity of the extracted entity relationships is insufficient, which prevents their use in practice.
In recent years, the attention mechanism has gained interest in natural language processing. Vaswani et al. [10] proposed the Transformer architecture in 2017, which dispenses with traditional neural network structures such as CNNs and RNNs and relies only on the attention mechanism. The Bidirectional Encoder Representations from Transformers (BERT) [11] pre-training model is built on the Transformer and is pre-trained with the masked language model (MLM) and next sentence prediction (NSP) objectives. However, the original BERT model was designed mainly around English text. For Chinese corpora, its character-based masking loses some semantics, because a Chinese word often consists of several characters. To cope with this feature of Chinese text, the whole word masking method of BERT-wwm [12] was proposed, which is more effective for Chinese.
Instead of the conventional character-based masking, BERT-wwm masks whole words, each of which may consist of multiple characters. To identify whole words, a word segmentation tool first segments the text, and then each selected word is marked and masked as a whole. Together with several other optimizations, this allows BERT-wwm to outperform the original BERT on the public dataset CMRC 2018. In this paper, we use BERT-wwm as the base model for the drug information extraction framework.
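The difference between character-based masking and whole word masking can be illustrated with a minimal sketch. It assumes that the sentence has already been segmented into words by an external tool, and it simplifies the masking rule to a single random draw with a [MASK] replacement. The function name mask_tokens and its parameters are hypothetical and are not taken from BERT-wwm itself.

```python
import random

def mask_tokens(words, mask_prob=0.15, whole_word=True, seed=0):
    """Return the sentence as a list of characters, some replaced by [MASK].

    words: a sentence already segmented into words (each word is a string
    of one or more characters). The segmentation is assumed to come from
    an external word segmentation tool.
    """
    rng = random.Random(seed)
    out = []
    for word in words:
        chars = list(word)
        if whole_word:
            # One draw per word: all characters of the word are masked together.
            if rng.random() < mask_prob:
                chars = ["[MASK]"] * len(chars)
        else:
            # One draw per character: a word may end up only partly masked.
            chars = ["[MASK]" if rng.random() < mask_prob else c for c in chars]
        out.extend(chars)
    return out

# Hypothetical segmented sentence fragment.
words = ["双氯芬酸钠", "可", "升高", "血药浓度"]
print(mask_tokens(words, mask_prob=0.5, whole_word=True, seed=1))
```

With whole_word=True, a multi-character drug name is either kept intact or hidden completely, so the model has to predict the whole word from its context.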

Drug information extraction framework with pre-training
The drug information extraction framework proposed in this paper uses pre-training models as a key method to label sentences as well as the entities and relations in them. In the following, we first present the architectural design of the framework and introduce the functions of its modules. Then we describe the pre-training-based models used in the relevant modules.

Architectural design
The framework has four main modules: data collection and cleaning, named entity recognition, entity pair calibration and entity relation extraction, as depicted in Figure 1. Data collection is an automatic process that gathers drug instructions and organizes them into semi-structured text inputs. Entity recognition and entity pair calibration are semi-supervised learning modules that identify valid entity pairs for the relation extraction module, which is a supervised learning module.

Data collection and cleaning
The data collection module automatically collects drug instructions, disease diagnoses, treatment guidelines and other medical text data from public sources on the Internet. We carry out this operation regularly with a careful selection of authoritative data sources, in order to provide rich and reliable data for drug information extraction. The collected and cleaned data is stored as spreadsheets. All textual descriptions of a drug instruction are retained and categorized into columns using the section titles that appear in the instructions, e.g., product name, primary chemical component, producer, drug reaction, drug interaction, etc.

Named entity recognition
The Named Entity Recognition (NER) module is mainly responsible for identifying four kinds of named entities in the text, namely "drugs", "diseases", "symptoms" and "body parts", thus preparing datasets for subsequent relation extraction. This module uses an entity name dictionary as a starting point and applies a BERT(wwm)-BiLSTM-CRF deep learning model (explained in Section 3.2) to learn new names. The entire procedure consists of 3 steps. Firstly, the original sentences in the instruction are converted to the format required by the training dataset of the named entity recognition model. Secondly, the sentences are annotated by longest string matching against the limited set of dictionary names collected manually in the early stage; longest string matching effectively avoids wrong annotation of nested entity names in the dictionary. Thirdly, the deep learning model is trained on this dataset to generalize to new entity names. The newly recognized names are added to the dictionary, so that the training dataset of the model can be updated regularly, which yields a self-improving solution.
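The dictionary annotation in the second step can be sketched as follows. This is a minimal illustration that assumes forward longest matching over characters and a BIO tagging scheme. The function name annotate_longest_match, the dictionary contents and the example sentence are hypothetical.

```python
def annotate_longest_match(text, dictionary):
    """Tag each character with BIO labels using forward longest matching.

    dictionary maps an entity name to its type, e.g. {"维生素A": "DRUG"}.
    """
    tags = ["O"] * len(text)
    max_len = max((len(name) for name in dictionary), default=0)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first so that nested names are not split.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                label = dictionary[candidate]
                tags[i] = "B-" + label
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + label
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return tags

# "维生素" is nested inside "维生素A"; the longer name wins.
dictionary = {"维生素": "DRUG", "维生素A": "DRUG"}
print(annotate_longest_match("不宜与维生素A同服", dictionary))
```

Because the longest candidate is tried first, the character "A" is tagged as part of the drug name instead of being left outside a shorter match.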

Entity pair calibration
Once drug and disease entities have been extracted from sentences, we select sentences with n (n >= 2) entities to further investigate entity relations. Since we aim to learn binary relations, those n entities may produce C(n,2) entity pairs. However, not all of those entity pairs are legitimate for relation extraction. For example, the same drug or disease may appear under different names at different locations in the same sentence, or there may be a type-of/sub-class-of relation between two entities. In these cases, the entity pairs cannot enter the drug/disease interaction extraction module without causing noise. To reduce the noisy data caused by wrong entity pairs, the Entity Pair Calibration (EPC) module is used to identify the primary entity (the main drug described by the drug instruction, i.e., the intended object of the instruction) and the secondary entities (the other drugs/diseases described in the sentence). In this paper, we use a modified word embedding method that splices the entity names with the original sentences and masks the entity names in the sentences. Finally, we feed the masked text into the BERT(wwm)-BiGRU-ATT model (explained in Section 3.2.3) to train the entity pair calibration module, so that it can identify the categories of entities in the text and remove entity pairs with unintended type-of/sub-class-of relations.
We provide the details of the EPC process in Section 4.

Relation extraction
The Relation Extraction (RE) module extracts the relation between the entities of each pair delivered, together with its text, by the entity pair calibration module. It is mainly realized by the BERT(wwm)-BiGRU-ATT model (explained in Section 3.2.3). We distinguish between 2 main relations, drug-disease and drug-drug, based on the type of the secondary entity in the entity pair. The relation between "drugs" and "diseases" is further divided into positive correlation and negative correlation: "positive correlation" means that the drug can treat or alleviate the disease, and "negative correlation" means that the drug can cause or aggravate the disease. The relations between drugs are first divided into exist-interaction and non-interaction (sometimes a sentence states that there is no known interaction between two drugs). Then, for the sentences describing drug pairs that do have interactions, we further extract 3 types of relations: interaction mechanism, clinical guidance and interaction result. Each drug-drug interaction relation type can be categorized into more fine-grained labels. For example, "clinical guidance" can be further divided into use
From the text semantics after "Mask", we can see that the text semantics after "WWM"

BERT(wwm)-BiLSTM-CRF for Named Entity Recognition
The BERT(wwm)-BiLSTM-CRF model is the deep learning model of the NER module. BiLSTM is a bidirectional recurrent neural network that can better capture contextual semantic dependencies in text. The BiLSTM network outputs a prediction score for each tag at each position of the sequence, and these scores are used as the input of the CRF layer. The CRF layer learns the constraint rules of sequence labeling from the training data, thus greatly reducing the probability of predicting illegal label sequences.
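To make the notion of an illegal label sequence concrete, the following sketch checks one such constraint by hand, assuming a BIO tagging scheme: an "I-X" tag may only follow "B-X" or "I-X" of the same type. A CRF layer learns constraints of this kind as transition scores from the training data. The hand-written function is_valid_bio is a hypothetical illustration and is not part of the model.

```python
def is_valid_bio(tags):
    """Check that every I-X tag continues a B-X or I-X tag of the same type."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in ("B-" + entity_type, "I-" + entity_type):
                return False
        previous = tag
    return True

print(is_valid_bio(["B-DRUG", "I-DRUG", "O"]))  # a well-formed sequence
print(is_valid_bio(["O", "I-DRUG"]))            # I- without a preceding B-
```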
The overall model architecture is shown in Figure 4. The input text is first transformed into dynamic semantic vectors by BERT(wwm). These vectors are then processed by the BiLSTM network and the constraint rules of the CRF layer, and the label corresponding to each token is finally output. "B-DRUG" marks the beginning of an entity, "I-DRUG" marks the middle and end of an entity, and "O" marks the part that is not an entity.
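Once the model has produced one tag per character, the entity names can be read off the tag sequence. The sketch below shows one way to do this under the B-/I-/O scheme described above. The function name decode_entities and the example sentence are hypothetical.

```python
def decode_entities(chars, tags):
    """Collect (entity_text, entity_type) spans from character-level BIO tags."""
    entities = []
    current_chars, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities

chars = list("服用双氯芬酸钠后")
tags = ["O", "O", "B-DRUG", "I-DRUG", "I-DRUG", "I-DRUG", "I-DRUG", "O"]
print(decode_entities(chars, tags))
```

Names decoded in this way that are missing from the dictionary are the candidates for dictionary expansion.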

BERT(wwm)-BiGRU-ATT based Entity Pair Calibration and Relation Extraction
The BERT(wwm)-BiGRU-ATT model is used in the EPC and RE modules. Different word embeddings can be used for different tasks. In this model, BERT(wwm) is again used as the upstream encoder to encode the text into intermediate word vectors, and the downstream part applies a bidirectional GRU network and an attention mechanism to these vectors. The GRU network is a variant of the LSTM network that can alleviate the problems of long-term dependence and vanishing gradients. Compared with the LSTM network, the GRU network has higher computational efficiency and occupies less memory. Therefore, considering the heavy computational workload of the EPC and RE tasks, the bidirectional GRU network is used. The model ends with a final attention layer, as shown in Figure 5.  For a human reader, the two words "升高" (increase) and "不可" (cannot) in such a text determine the relation between the two entities. Similarly, through the attention calculation, these two words are assigned greater attention scores, which indicates that the appearance and position of these two words are important for identifying the relation between the entities.
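How an attention layer turns per-token vectors into weights can be sketched in a few lines. The paper does not give the scoring formula, so the sketch assumes dot-product scoring against a single query vector followed by a softmax. The function name attention_pool and the toy vectors are hypothetical.

```python
import math

def attention_pool(hidden_states, query):
    """Softmax attention over token vectors.

    hidden_states: list of token vectors (e.g. BiGRU outputs), each a list of floats.
    query: a vector of the same size (learned in a real model, supplied by hand here).
    Returns (weights, pooled_vector).
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, pooled

# Three toy token vectors; the third aligns best with the query.
weights, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], [1.0, 0.0])
print(weights, pooled)
```

Tokens whose vectors align with the query receive larger weights, which corresponds to decisive words receiving greater attention scores.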

Entity pair calibration
In drug instructions, there are often multiple expressions of the same entity name, and there are inclusion relationships (for example, one entity is a component of another entity), sub-class and type-of relationships among different entities. There should be no interaction between such entity pairs. When this kind of entity pair is directly used as the input of the subsequent relation extraction task, it will cause a lot of noise.
Example text: "When combined with voriconazole, the blood concentration of sirolimus may increase significantly, so the two drugs cannot be used at the same time."
Figure 7, text 1: "The combination of quinolones and theophylline may increase the serum theophylline level; therefore, the dosage of theophylline should be reduced when balofloxacin tablets are combined with theophylline."
Figure 7, text 2: "Antacids can affect the absorption of vitamin A in children's vitamin chewable tablets, so they should not be taken together."
For example, in the two texts in Figure 7, the first text involves three entity names: 4-quinolones, theophylline and balofloxacin tablets. Among the three entity names, 4-quinolones is the drug type of balofloxacin tablets. Similarly, in the second text, vitamin A is a component of the multivitamin chewable tablets. In both cases, such entity pairs are noise.
A naive approach simply extracts all entities from the text and applies a Cartesian product to produce all entity pairs, which results in very noisy input for the RE module. A natural improvement is to apply a filter based on a priori knowledge after the Cartesian product to eliminate illegal entity pairs, as in Algorithm 1. However, this approach imposes the strong assumption that an accurate and complete knowledge base describing the type-of, sub-class-of and composition relations among entities is available. In order to avoid using such wrong entity pairs, we propose to check each entity pair after named entity recognition and before relation extraction, so as to determine whether the pair is valid. In essence, we employ a deep learning model to label the primary and secondary entities in an entity pair to verify its validity. The primary entity set contains only the drug entity that the instruction is intended for, or its type, super-class or composition. The secondary entity set contains only the entities associated with the primary entity in the instruction. In this way, we can produce correct entity pairs for relation extraction.
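The knowledge-based filter can be sketched as follows. Algorithm 1 is not reproduced in this text, so the sketch only illustrates a filter of that kind: it enumerates all unordered entity pairs and drops those listed in an a priori knowledge base. The function name filter_pairs and the set-of-frozensets representation of the knowledge base are assumptions.

```python
from itertools import combinations

def filter_pairs(entities, related):
    """Keep entity pairs that are not linked in the a priori knowledge base.

    related: a set of frozensets, each holding two names known to stand in a
    synonym, type-of, sub-class-of or component-of relation.
    """
    kept = []
    for a, b in combinations(entities, 2):
        if a == b or frozenset((a, b)) in related:
            continue  # same name or known non-interaction pair: drop as noise
        kept.append((a, b))
    return kept

entities = ["quinolones", "theophylline", "balofloxacin tablets"]
related = {frozenset(("quinolones", "balofloxacin tablets"))}
print(filter_pairs(entities, related))  # the type-of pair is removed
print(filter_pairs(entities, set()))    # empty knowledge base: nothing is removed
```

The second call shows the weakness of this approach: any relation missing from the knowledge base lets the noisy pair through unchanged.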
In order to recognize the entity category from the text, we use a masking mechanism similar to the whole word masking of the BERT(wwm) model. When training the model to identify the primary and secondary entities, we mask them with m '#' symbols, where m is the length of the masked entity. Meanwhile, to retain the information in the entity name, we use the same splicing method as in BERT and concatenate the entity name, followed by a '$' symbol, in front of the masked sentence. In this way, we not only ensure a lossless text embedding but also obtain flexibility in the embedding method. Table 2 depicts an example of EPC training data.

Table 2. Example of EPC training data
Label | Text (masked)
Primary Entity | quinolones$The combination of ########### and theophylline may increase the level of theophylline in serum, so the dose of theophylline should be reduced when balofloxacin tablets are combined with theophylline
Primary Entity | balofloxacin tablets$The combination of quinolones and theophylline may increase the level of theophylline in serum, so the dose of theophylline should be reduced when ############ are combined with theophylline
Secondary Entity | theophylline$The combination of quinolones and ######## may increase the level of ######## in serum, so the dose of theophylline should be reduced when balofloxacin tablets are combined with theophylline
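The construction of one EPC input string can be sketched as follows. The function name build_epc_input is hypothetical, and the sketch assumes that every occurrence of the entity name in the sentence is masked with one '#' per character, which may differ in detail from the rows of Table 2.

```python
def build_epc_input(sentence, entity, mask_char="#", sep="$"):
    """Prefix the entity name and mask its occurrences in the sentence."""
    masked = sentence.replace(entity, mask_char * len(entity))
    return entity + sep + masked

sentence = "quinolones may increase theophylline levels"
print(build_epc_input(sentence, "theophylline"))
```

The prefix keeps the entity name available to the model, while the masked positions tell it which mention is being classified as primary or secondary.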

Empirical results and analysis
In order to evaluate our approach, we test the modules of the framework on realistic datasets. In this section, we first present the experiment dataset and configuration, and then present and analyze the experimental results.

Experiment dataset and configuration
The experimental data of this paper mainly comes from authoritative data sources on the Internet. Starting from a medical entity dictionary, we constructed a medical-domain sequence annotation dataset with 3 million Chinese characters for named entity recognition, which is used to train and validate the NER module. There are four types of entity names: drug name, disease name, body part name and symptom name, as shown in Table 3. For the EPC and RE modules, we prepared 60k Chinese sentences from 4000 drug instructions. EPC labels entities with 2 categories, primary or secondary. RE labels relations among entities with 12 fine-grained drug-drug and drug-disease relations.

Experiment result and analysis
In this section, we discuss the experimental results for the NER, EPC and RE modules.

Named entity recognition
The comparative experimental results of the NER module are shown in Figure 8. In Figure 9, the first column is the sequence-labeled text, the second column is the label assigned by the dictionary, and the third column is the label predicted by the model.
It can be seen that the BERT(wwm)-BiLSTM-CRF model successfully identifies the drug name "双氯芬酸钠" (Diclofenac Sodium) which does not appear in the dictionary. This feature can help expand the original dictionary and facilitate a semi-supervised approach.

Entity Pair Calibration
To evaluate the performance of the model used in EPC, we compare the precision, recall and F1 score for identifying the primary (EN1) and secondary (EN2) entities using 4 different models: DPCNN, BiLSTM-ATT, BERT-BiGRU-ATT and BERTwwm-BiGRU-ATT. The experimental results of EPC are shown in Table 4. The results suggest that DPCNN [14], a neural network without pre-training, performs quite well. It was proposed by Tencent AI Lab in 2017 to capture long-distance text dependencies by continuously deepening a word-level network. It outperforms the BiLSTM-ATT model but does not surpass the pre-training models. Our approach performs the best among the 4 models, achieving F1 values of 98% and 99% for primary and secondary entity identification, respectively.
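For reference, the three metrics can be computed per class as in the sketch below. The function name precision_recall_f1 and the toy label lists are hypothetical and are not taken from the experiments.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: EN1 = primary entity, EN2 = secondary entity.
gold = ["EN1", "EN2", "EN2", "EN1"]
predicted = ["EN1", "EN1", "EN2", "EN1"]
print(precision_recall_f1(gold, predicted, "EN1"))
print(precision_recall_f1(gold, predicted, "EN2"))
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.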

Relation extraction
The experimental results of the RE module w.r.t. 4 coarse-grained relations (polarity of drug-disease relations, existence of drug-drug interactions, clinical guidance and interaction results) are shown in Figures 10-13. As can be seen from Figures 10-13, pre-training models outperform conventional deep learning models. Adding the EPC module performs the best, as we expected. However, using a priori knowledge to eliminate wrong entity pairs performs similarly to not using filters. This demonstrates that the performance of rule-based entity calibration relies heavily on the quality of the a priori knowledge. Figures 14 and 15 show the experimental results for 10 more fine-grained RE tasks as discussed in Section 3.1.4, using the same 4 models.
Figure 14. F1 score for fine-grained "clinical guidance" relation extraction
Figure 15. F1 score for fine-grained "interaction result" relation extraction
The results in Figures 14 and 15 suggest that the performance of fine-grained relation extraction is improved significantly by EPC. More specifically, for the relation "use with caution (UC)", the F1 value is increased by 8.04%, and for the relation "adverse reaction increase (RI)", the F1 value is increased by 9.44%. Compared with the coarse-grained relation extraction results in Figures 10-13 (~2% F1 improvement on average), the EPC module makes a much bigger contribution (5% F1 improvement on average) when extracting more fine-grained and complex relations.

Conclusion
Extracting fine-grained drug knowledge from professional drug instructions is a complicated task for both humans and machines. To address this problem, we propose a holistic information extraction framework that integrates pre-training and deep learning models. The abstract structure of the framework follows the basic design of a pipeline information extraction methodology, which includes data gathering, Named Entity Recognition and Relation Extraction. On top of this, we design a novel procedure called Entity Pair Calibration and place it between the NER and RE tasks to reduce the input noise for drug relation extraction.
The framework is empirically evaluated over more than 60000 sentences from Chinese drug instructions, covering 4000 different drugs. Experiment results demonstrate that the proposed framework can achieve an F1 score of >= 0.95 in NER and >= 0.83 in RE tasks. Moreover, the proposed EPC module can significantly improve the performance of fine-grained relation extraction (F1 score improvement up to 9.44%, 5% on average). Using this framework, we have successfully built a drug interaction knowledge base with over 1,300,000 RDF triples, describing more than 180,000 drug-drug and drug-disease relations.
It is worth mentioning that although we design and evaluate the framework for Chinese drug instructions, it can be easily adapted for other languages by changing the pre-training model into BERT or other language-specific models. Also, the fundamental idea of EPC can be reused for other languages.
Our future work may proceed in 3 directions. Firstly, the pre-training model BERT(wwm) used in this paper can be optimized with grammatical structure and semantic information [15,16]. Secondly, the architecture of the framework is configurable but not dynamically adaptable; we could improve this aspect by monitoring the results and dynamically switching to better models. Also, adding adversarial training [17] in the word embedding stage can improve the robustness of the model. Finally, there is an inherent problem in this domain, namely the imbalance of samples in drug-drug interactions. This problem is not easily solvable with purely learning-based approaches, so we could explore a hybrid of knowledge-based and learning-based approaches to address it [18].