Improving Biomedical Named Entity Recognition with Label Re-correction and Knowledge Distillation
Background: Biomedical named entity recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotated datasets, in particular from the limited knowledge those datasets contain.
Methods: To remedy this issue, we propose a novel Chemical and Disease Named Entity Recognition (CDNER) framework with label re-correction and knowledge distillation strategies, which not only creates large, high-quality datasets but also yields a high-performance entity recognition model. Our framework is motivated by two points: 1) named entity recognition should be considered from the perspectives of both coverage and accuracy; 2) trustworthy annotations should be obtained through iterative correction. First, for coverage, we annotate chemical and disease entities in a large unlabeled dataset with PubTator to generate a weakly labeled dataset. For accuracy, we then filter this dataset using multiple knowledge bases to generate a second dataset. Next, the two datasets are revised by a label re-correction strategy to construct two high-quality datasets, each of which is used to train a separate CDNER model. Finally, we compress the knowledge in the two models into a single model with knowledge distillation.
Results: Experiments on the BioCreative V chemical-disease relation corpus show that knowledge from large datasets significantly improves CDNER performance, leading to new state-of-the-art results.
Conclusions: We propose a framework with label re-correction and knowledge distillation strategies. Comparison results show that the two perspectives of knowledge captured in the two re-corrected datasets are complementary and that both are effective for biomedical named entity recognition.
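The final step of the Methods, compressing two trained models into one, can be illustrated with a minimal sketch. The abstract does not specify how the two models are combined, so everything below is an assumption made for illustration: the two teachers' temperature-softened tag distributions are averaged, and the student is scored with a KL divergence against that average. The function names, the three-tag label set, the logit values, and the temperature are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw tag scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_teachers(teacher_logits, temperature=1.0):
    """Average the softened distributions of several teachers for one token.

    Assumption: equal weighting of teachers; the source does not state this.
    """
    dists = [softmax(logits, temperature) for logits in teacher_logits]
    n_tags = len(dists[0])
    return [sum(d[i] for d in dists) / len(dists) for i in range(n_tags)]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the averaged teacher distribution to the student."""
    target = average_teachers(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)


# Hypothetical scores for one token over the tags [O, B-Chemical, B-Disease].
coverage_teacher = [0.2, 2.5, 0.1]  # model trained on the weakly labeled data
accuracy_teacher = [0.5, 1.8, 0.3]  # model trained on the filtered data
student = [0.1, 1.0, 0.4]

loss = distillation_loss(student, [coverage_teacher, accuracy_teacher])
print(f"distillation loss for this token: {loss:.4f}")
```

In a real training loop this per-token term would typically be summed over all tokens and combined with a supervised loss on gold labels; whether the authors do so is not stated in the abstract.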
Posted 15 Dec, 2020
Received 04 Jan, 2021