Disease Named Entity Recognition (D-NER) Evaluation

Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP). In the medical domain, NER is a critical phase in most end-to-end systems. In this paper, we investigate the performance of NER for diseases (D-NER). TaggerOne, a widely used tool for NER in the biomedical domain, was evaluated on 52 cardiovascular-related clinical case reports against hand annotation of diseases. Different training sets were used to evaluate its performance.


Introduction
Natural Language Processing (NLP) is a research area that has gained significant recent interest [1]. Expansion in the volume of unstructured free text has created a strong need for automated methods to identify, classify, normalize, and annotate these unstructured data into semantically meaningful structured data for knowledge generation. Traditional NLP methodologies have been rule-based or heuristic-based, encoding the linguistic structure of English along with domain-specific semantic relations into algorithms that identify named entities. More recent machine-learning-based methods have attempted to broaden generality across topics, with techniques that can apply to a wide variety of text types: the methods themselves are agnostic to the domain, and only the training set is specific to a knowledge domain. Recent work [2] has demonstrated that combining these two approaches, by encoding domain-specific semantic information via a well-chosen training set, can yield significant performance improvements. Thus, the role of domain-specific NLP models is a valuable but poorly characterized area, particularly as it applies to biomedical texts and clinical case reports for cardiovascular diseases.
The structuring of biomedical texts has been a growing area of interest, expanding in parallel with the more general growth of unstructured free text. PubMed, the central repository of biomedical texts, has been growing exponentially, and an increasingly major challenge is organizing this vast corpus of knowledge for easier access and knowledge generation. NLP research on biomedical data is unusually challenging compared to other text areas due to a paucity of well-annotated gold standards: understanding biomedical or clinical texts requires specific education, which precludes crowdsourcing and explains the lack of large extant gold-standard corpora. Thus, the investigation of NLP approaches for biomedical texts is a research area of specific interest.
Recent approaches to organizing PubMed using named entity recognition and normalization, namely PubTator [3,4,5,6], have been applied to biomedical texts, but no in-depth analysis of their performance exists in the literature. DNorm [7,8,9,10], the technology behind PubTator, uses conditional random fields for the normalization of disease names. While these methodologies have demonstrated high statistical performance metrics, the causes and characteristics of their errors have been less well described. Even less is known about the application of these methods to cardiovascular clinical texts such as clinical case reports. Investigating the performance of NER algorithms on cardiovascular clinical texts will inform future research on areas of improvement. This paper focuses on the specific types of errors that PubTator, disease name normalization, and conditional random fields generate in the context of cardiovascular clinical case reports.

Methodology
The framework of our evaluation is shown in Figure 1 (Framework to Evaluate the Performance of Disease NER). PubMed Central was queried using the term "heart failure". The results were limited to full-text clinical case reports in English, with no restrictions on publication date or journal, and 52 reports were randomly selected. On these 52 reports, automatic annotation was first performed using PubTator Central, and hand annotation of disease names (the "gold standard") was performed by a single individual. PubTator annotations are stored in XML, while the gold-standard annotations are stored in BRAT format. Disease annotations were extracted from the PubTator XML using regular expressions and aligned to the gold-standard annotations. Missing annotations and false positive annotations were tabulated, and an F1 score was computed for each report. To provide context for the PubTator Central F1 score, two additional comparisons were run by training TaggerOne on the NCBI Disease Corpus and the BioCreative V Chemical-Disease Relation (BC5CDR) corpus, and F1 scores based on models trained on these datasets were also calculated. A deep dive was then performed on the types of errors observed to look for patterns: the tabulated missed and false positive disease annotations of each of the 52 reports were hand-scrutinized and categorized by type, and potential improvements were proposed based on these results.
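The alignment and per-report scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes annotations have already been reduced to character-offset spans and that alignment uses exact offset matching.

```python
# Sketch of per-report span alignment and F1 scoring.
# Assumption (not from the paper): spans are (start, end) character
# offsets, and a prediction counts as correct only on an exact match.

def report_f1(gold_spans, predicted_spans):
    """Return (precision, recall, F1) for one report."""
    gold = set(gold_spans)
    pred = set(predicted_spans)
    tp = len(gold & pred)   # annotations found by both
    fp = len(pred - gold)   # false positive annotations
    fn = len(gold - pred)   # missed annotations
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 3 gold spans, 2 predictions, 1 exact match
p, r, f = report_f1([(0, 5), (10, 20), (30, 42)], [(0, 5), (50, 60)])
# p = 0.5, r = 1/3, f = 0.4
```

Averaging this per-report F1 over the 52 reports yields the mean scores reported in the results.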

Results
Below are the F1 score distributions of the three models (PubTator Central, NCBI Disease Corpus, BC5CDR) on the 52 reports.
The mean F1 score was 0.42 for PubTator Central, 0.28 for the NCBI Disease Corpus-trained model, and 0.36 for the BC5CDR-trained model. Since PubTator Central outperforms both locally trained models, its underlying model likely benefits from a broader training corpus. The errors were divided into misses and false positives and further categorized by type. The major types of errors were:
1. Named entities that the algorithm believes are diseases but that qualify medically only as symptoms, for example "pain", "dyspnea", "jaundice", "death".
2. Span errors, where the NER algorithm does not capture the full extent of the term, for example incomplete terms such as "kidney injury", "Barre syndrome", "coagulation", "ST-elevation".
3. Anatomical terms that the algorithm incorrectly identifies as diseases, such as "anastomosis", "patent ductus arteriosus".
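The tabulation of errors by kind (miss vs. false positive) and by category could be organized as below. The records and counts here are purely illustrative; the category names follow the types listed above.

```python
# Hypothetical tabulation of categorized errors. The error records
# are illustrative examples, not data from the actual 52 reports.
from collections import Counter

# Each record: (error_kind, error_category)
errors = [
    ("false_positive", "symptom"),     # e.g. "dyspnea"
    ("false_positive", "anatomical"),  # e.g. "patent ductus arteriosus"
    ("miss", "span"),                  # e.g. partial "Barre syndrome"
    ("false_positive", "symptom"),     # e.g. "pain"
]

by_kind = Counter(kind for kind, _ in errors)
by_category = Counter(cat for _, cat in errors)
# by_kind:     {"false_positive": 3, "miss": 1}
# by_category: {"symptom": 2, "anatomical": 1, "span": 1}
```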

Discussion
The aim of this paper is to examine the performance of NER algorithms on cardiovascular clinical case reports. The evaluation of TaggerOne trained on various corpora against a hand-annotated gold standard shows that there is still room for performance improvement for NER on clinical texts. The results also point to specific areas of improvement in acronym resolution and incorrect term spanning. In particular, it is important to note that the training sets used were not based on texts from clinical case reports, nor were they focused on cardiovascular disease. The NCBI Disease Corpus collects disease names from biomedical research articles and their abstracts, while the BC5CDR corpus is a selection of disease and chemical names pulled from basic science research articles [4,11]. Neither corpus uses clinical case reports, and neither specializes in cardiovascular diseases.
Thus, it is unsurprising that ambiguities in biomedical naming, such as acronym resolution, fail noticeably on the test set. The superior performance of PubTator Central, however, suggests that merely improving and expanding the training corpus can achieve significant performance gains without overhauling the underlying algorithm. It is thus likely that a model trained specifically on cardiovascular clinical case reports would show significant improvement over baseline. Clinical case reports possess a particular structure and writing style, such as heavy use of acronyms, and training the model on these more representative texts should reduce span errors and acronym-resolution errors, two of the biggest contributors to error [12]. This project demonstrates the impact of a well-constructed training corpus, as well as the continuing need for hand annotation. Some of the errors, namely vagueness and the misattribution of symptoms as diseases, rely on meta-clinical knowledge that is challenging to identify from text in an unsupervised way. From a clinical perspective, symptoms and diseases form a continuum, with stereotyped syndromes with physiologically coherent etiologies (such as infective endocarditis) on the disease end, and generalized, ambiguous physiological responses to a disturbance (such as tachycardia) on the symptom end. Two approaches to identifying where the line between symptom and disease lies are hypothesized [13,14,15]. The first approach is supervised: by constructing a consistent hand-annotated corpus of identified disease names, TaggerOne might learn to distinguish diseases from symptoms. Expanding the NCBI Disease Corpus with cardiovascular clinical case reports should yield improvements.
However, hand annotation is challenging and does not scale, so it might be possible to develop a weakly supervised method that pre-processes the training set, bootstrapping from a smaller hand-annotated corpus, or another corpus such as the NCBI Disease Corpus, onto a larger unannotated corpus. These methods have not been tested and are potential future directions. Room for improvement notwithstanding, these models demonstrate the potential for creating structure in otherwise free text. As disease terms form a small minority of the words in a text, creating a highly unbalanced training set, an F1 score of up to 0.5 shows a clear ability to distinguish named entities from non-entities. Named entity recognition is an important step in interpreting clinical case reports and other biomedical texts and in generating a semantic structure for them [1,16,17,18]. This proof of concept demonstrates the state of the art of NER technology as applied to the most representative clinical texts currently available. Using NER to semantically structure text can also be applied to electronic medical records, which currently consist largely of unstructured text. Identifying structure in unstructured medical record text can open many new opportunities in clinical informatics areas such as clinical decision support, patient cohort identification, and disease nosology [19,20].
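The weakly supervised bootstrapping idea above can be sketched as a simple dictionary projection: surface forms harvested from a small annotated seed corpus are matched against unannotated text to generate silver-standard annotations. This is a minimal, purely illustrative sketch; the function names and the exact-string matching strategy are assumptions, and a real system would need normalization and disambiguation.

```python
# Illustrative sketch of weak supervision by lexicon projection.
# Function names and matching strategy are assumptions for
# illustration, not a tested method from this paper.
import re

def build_lexicon(annotated_mentions):
    """Collect lowercased surface forms of disease mentions
    from a small hand-annotated seed corpus."""
    return {m.lower() for m in annotated_mentions}

def silver_annotate(text, lexicon):
    """Mark lexicon matches in unannotated text as candidate
    disease spans: (start, end, matched_term)."""
    spans = []
    lowered = text.lower()
    for term in lexicon:
        for match in re.finditer(re.escape(term), lowered):
            spans.append((match.start(), match.end(), term))
    return sorted(spans)

lexicon = build_lexicon(["heart failure", "infective endocarditis"])
spans = silver_annotate("A case of heart failure after surgery.", lexicon)
# spans == [(10, 23, "heart failure")]
```

The resulting silver annotations could then serve as additional training data for TaggerOne, at the cost of the noise inherent in exact dictionary matching.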

Conclusion
While deep learning models perform better in many cases, classical ML approaches remain attractive for many tasks, especially those for which deep learning demands additional resources such as pre-trained word representations, large memory, and long training times.
In this paper, we evaluated TaggerOne, a classical machine-learning tool, for disease named entity recognition on cardiovascular clinical case reports.