Background: Biomedical named entities recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotation datasets, especially the limited knowledge contained in them.
Methods: To remedy the above issue, we propose a novel Chemical and Disease Named Entity Recognition (CDNER) framework with label re-correction and knowledge distillation strategies, which could not only create large and high-quality datasets but also obtain a high-performance entity recognition model. Our framework is inspired by two points: 1) named entity recognition should be considered from the perspective of both coverage and accuracy; 2) trustable annotations should be yielded by iterative correction. Firstly, for coverage, we annotate chemical and disease entities in a large unlabeled dataset by PubTator to generate a weakly labeled dataset. For accuracy, we then filter it by utilizing multiple knowledge bases to generate another dataset. Next, the two datasets are revised by a label re-correction strategy to construct two high-quality datasets, which are used to train two CDNER models, respectively. Finally, we compress the knowledge in the two models into a single model with knowledge distillation.
Results: Experiments on the BioCreative V chemical-disease relation corpus show that knowledge from large datasets significantly improves CDNER performance, leading to new state-of-the-art results.
Conclusions: We propose a framework with label re-correction and knowledge distillation strategies. Comparison results show that the two perspectives of knowledge in the two re-corrected datasets respectively are complementary and both effective for biomedical named entity recognition.
This preprint is available for download as a PDF.