Annotation Data and Procedure
The 5151 Online Health Care Network is an online service through which users consult doctors from 41 medical specialties. We collected 82,508 user consultation records from the 5151 online health care web application. Because the content of the consultation is our main focus, only health-related advice was selected.
We randomly selected 100 health descriptions, which an interdisciplinary team of physicians and two information scientists reviewed in order to develop an annotation guideline. After reviewing the 100 narratives, the team arrived at guiding principles agreed to by all members.
Based on the annotation guideline, five annotators, all students with medical backgrounds, each marked 100 online health consultation posts. We selected posts of no more than 150 words as the text to be labeled. The collection of 100 posts contains approximately 4,706 words, with an average of 47 words per post (SD 21.1).
Labeling was done using our Semi-Automatic Annotating System (SAAS). The 100 labeled online health consultations were used as training and test data for medical entity identification, as described below. Each annotator labeled the data separately, and disagreements were resolved by SB (a doctor). We also report Cohen’s kappa, a well-known statistic used to assess the reliability of a fixed number of evaluators when categorizing or classifying multiple items [19]. We use this data set and the previously constructed medical entity recognition model to capture all named entities.
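As an illustration of the agreement statistic, the following is a minimal sketch of Cohen's kappa for two annotators who labeled the same items. The function name and the toy labels are hypothetical and are not taken from the study data; with five annotators, one possible (assumed) use is to compute the value for each pair of annotators.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical, for illustration only).
annotator_1 = ["Symptom", "Disease", "Symptom", "Organ", "Symptom", "Disease"]
annotator_2 = ["Symptom", "Disease", "Disease", "Organ", "Symptom", "Symptom"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the two toy annotators agree on four of six items, and the chance-corrected value is lower than the raw agreement of 0.667.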
Preparation
In this section, we briefly introduce the Medical Subject Headings (MeSH) terms, which serve as our symptom and disease dictionary resource, and the tool used to segment sentences.
Medical Subject Headings
We collected MeSH terms from the disease categories and assigned the terms whose tree number starts with C23 to the symptom dictionary. We chose this prefix because the doctor determined that terms whose tree number starts with C23 are symptoms. We cleaned the symptom dictionary by removing extra punctuation and duplicate symptoms, which left 375 symptoms. Examples from the dictionary include “頭昏眼花” (Dizziness), “頭痛” (Headache), “流膿” (Suppuration) and other health-related symptoms.
We likewise assigned the MeSH terms from the disease categories whose tree number comes before C23 to the disease dictionary, because the doctor determined that terms before tree number C23 are diseases. We cleaned the disease dictionary by removing extra punctuation and duplicate diseases, which left 3723 diseases. Examples from the disease dictionary include “手足口病” (Hand, Foot and Mouth Disease), “細菌感染” (Bacterial Infections) and “十二指腸潰瘍” (Duodenal Ulcer).
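The dictionary construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of (term, tree number) pairs, the punctuation set is hypothetical because the exact cleanup rules are not given, and the tree numbers in the sample are placeholders rather than real MeSH values.

```python
import re

# Hypothetical punctuation set; the exact cleanup rules of the study are not given.
_PUNCT = re.compile(r"[,.;:!?()\[\]、，。；：！？（）「」]")


def clean_term(term):
    """Remove extra punctuation and surrounding whitespace from a term."""
    return _PUNCT.sub("", term).strip()


def split_mesh_terms(entries):
    """Split (term, tree_number) pairs into symptom and disease lists.

    Terms under C23 go to the symptom list, terms under a C category
    before C23 go to the disease list, and everything else is skipped.
    """
    symptoms, diseases = {}, {}  # dicts keep insertion order and drop duplicates
    for term, tree_number in entries:
        top = tree_number.split(".")[0]
        if not (top.startswith("C") and top[1:].isdigit()):
            continue
        cleaned = clean_term(term)
        if not cleaned:
            continue
        category = int(top[1:])
        if category == 23:
            symptoms[cleaned] = None
        elif category < 23:
            diseases[cleaned] = None
    return list(symptoms), list(diseases)


# Sample entries; the tree numbers are placeholders, not real MeSH values.
sample = [
    ("頭痛", "C23.001"),
    ("頭痛。", "C23.001"),       # duplicate once punctuation is removed
    ("十二指腸潰瘍", "C06.001"),
    ("細菌感染", "C01.001"),
    ("其他", "D02.001"),         # outside the disease categories, skipped
]
symptom_dict, disease_dict = split_mesh_terms(sample)
print(symptom_dict, disease_dict)
```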
Tools: CKIP Chinese Word Segmentation
We employed the Chinese word segmentation system developed by CKIP (Chinese Knowledge and Information Processing Group), which offers optional new-word recognition and part-of-speech (POS) tagging [20]. For example, the input in Fig. 1 is “醫師您好, 我最近經常頭痛, 晚上睡不好一直上廁所, 不知道是怎麼了?”, which translates to “Hello doctor, I often have headaches recently, I can’t sleep well at night and go to the toilet all the time, I don’t know what’s wrong?” After segmentation, the output is a sequence of words and POS tags: 醫生(Na) 您(Nh) 好(VH) ,(COMMACATEGORY) 我(Nh) 最近(Nd) 經常(D) 頭痛(VH) ,(COMMACATEGORY) 晚上(Nd) 睡(VA) 不(D) 好(VH) 一直(D) 上廁所(VA) ,(COMMACATEGORY) 不(D) 知道(VK) 怎麼(VH) 了(T) ?(QUESTIONCATEGORY) (Fig. 1). We use the POS tag as the basis for extraction. If the POS tag is “VH”, the word may be a symptom entity such as a headache or a stomachache.
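To make the use of POS tags concrete, the sketch below parses a tagged sequence written in the word(POS) form of the example above and collects the words tagged VH. The string format mirrors the example in the text and is an assumption; it is not necessarily the raw output format of the CKIP system, and the helper name is hypothetical.

```python
import re

# One token looks like word(POS); punctuation carries tags such as COMMACATEGORY.
TOKEN = re.compile(r"(\S+?)\(([A-Za-z0-9_]+)\)")


def parse_tagged(text):
    """Return a list of (word, pos) pairs from a 'word(POS) word(POS) ...' string."""
    return TOKEN.findall(text)


tagged = (
    "醫生(Na) 您(Nh) 好(VH) ,(COMMACATEGORY) 我(Nh) 最近(Nd) 經常(D) 頭痛(VH) "
    ",(COMMACATEGORY) 晚上(Nd) 睡(VA) 不(D) 好(VH) 一直(D) 上廁所(VA) "
    ",(COMMACATEGORY) 不(D) 知道(VK) 怎麼(VH) 了(T) ?(QUESTIONCATEGORY)"
)
pairs = parse_tagged(tagged)

# Words tagged VH are only possible symptom entities; a later step must confirm them.
vh_words = [word for word, pos in pairs if pos == "VH"]
print(vh_words)
```

Not every VH word is a symptom (for example 好), which is why the tag only nominates candidates and a dictionary decision is still needed.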
Medical Entity Recognition
We now describe how to identify ten entity types in medical texts: Symptom, Disease, Health Information, Department, Treatment, Examination, Medication, Organ, Time and Abbreviation. Medical entity recognition is divided into three processes: candidate entity generation, medical entity decision and semi-automatic medical entity extraction. Table 3 shows the definition of each medical entity.
Table 3
Medical Entity Definition
Type | Definition | Total count
Health Information | User personal information | 823
Symptom | A physical or mental feature which is regarded as indicating a condition of disease | 854
Disease | A disorder of structure or function in a human | 3771
Organ | Brain, Teeth, Wrist | 439
Examination | Inspection or investigation, especially as a means of diagnosing disease | 1222
Treatment | The combating of a disease or disorder | 2124
Medication | Medicine | 30474
Time | Time information | 103
Department | Digestive system | 59
Abbreviation | A shortened form of a word or phrase | 306
According to corpus analysis, we found:
(1) Long-word medical entities are usually segmented into several fragments by the natural language processing tool.
(2) Medical entities in the texts follow some recurring patterns.
Medical Entity Generation Model
In this section, we introduce the POS patterns of the candidates, which are based on dictionary analysis. Table 4 shows how the patient's chief complaint text is segmented by the CKIP POS tagger. The output is a set of word and part-of-speech pairs. Medical entity candidates are generated according to the POS patterns of the different entity types. We then keep only the candidates whose length is greater than 1. Finally, the dictionaries mentioned above are used to determine whether a candidate is a medical entity.
Table 4. Algorithm – Medical Entity Generation
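The following sketch illustrates the generation and decision steps described in this section; it is not the algorithm of Table 4 itself. The POS patterns, the POS tags in the sample, and the function names are hypothetical, because the patterns used in the study are not listed here. Joining adjacent words that match a multi-tag pattern is one assumed way to recover long entities that the segmenter split into fragments.

```python
# Hypothetical POS patterns per entity type (illustration only).
POS_PATTERNS = {
    "Symptom": [("VH",)],
    "Disease": [("Na",), ("Na", "Na")],
}


def generate_candidates(tagged_words, patterns):
    """Return (entity_type, text) pairs whose POS sequence matches a pattern."""
    tags = [pos for _, pos in tagged_words]
    candidates = []
    for entity_type, type_patterns in patterns.items():
        for pattern in type_patterns:
            width = len(pattern)
            for start in range(len(tags) - width + 1):
                if tuple(tags[start:start + width]) != pattern:
                    continue
                # Join adjacent fragments into one candidate string.
                text = "".join(w for w, _ in tagged_words[start:start + width])
                if len(text) > 1:  # keep only candidates longer than one character
                    candidates.append((entity_type, text))
    return candidates


def decide_entities(candidates, dictionaries):
    """Keep the candidates found in the dictionary of their entity type."""
    return [(t, c) for t, c in candidates if c in dictionaries.get(t, set())]


# Sample input; the POS tags are assumed for illustration.
tagged_words = [
    ("我", "Nh"), ("經常", "D"), ("頭痛", "VH"),
    ("十二指腸", "Na"), ("潰瘍", "Na"), ("好", "VH"),
]
dictionaries = {
    "Symptom": {"頭痛", "頭昏眼花"},
    "Disease": {"十二指腸潰瘍", "細菌感染"},
}
candidates = generate_candidates(tagged_words, POS_PATTERNS)
entities = decide_entities(candidates, dictionaries)
print(entities)
```

In this toy input the single-character word 好 is removed by the length filter, and the two fragments 十二指腸 and 潰瘍 are only accepted once they are joined into a dictionary entry.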
Semi-automatic Medical Entity Extraction System
In this section, we introduce a Semi-Automatic Annotating System (SAAS) that identifies medical entities in text entered by the user. Entities that are not successfully identified are corrected immediately through semi-automatic annotation, which can further improve the success rate of medical entity recognition. For medical researchers, SAAS can reduce labeling time and assist entity labeling by presenting the marked results through a visual interface.
Our SAAS is composed of a web interface, a CKIP-based medical entity recognition unit, and a database. First, the CKIP-based medical entity recognition unit loads the raw data, generates automatically annotated data, and outputs the result to the web interface. Five operators with medical backgrounds then reviewed the generated result and corrected it in the web interface. Both the automatically annotated data and the operators' corrected data are stored in the database for further analysis. We also recorded the time spent on semi-automatic annotation and compared it with the time spent on purely manual annotation.
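One way to picture what is stored for each reviewed text is sketched below. The record layout, field names, and values are hypothetical and only illustrate keeping the automatic and corrected annotations side by side together with the time spent; they do not describe the actual database schema or any measured result.

```python
from dataclasses import dataclass


@dataclass
class AnnotationRecord:
    """One reviewed text: automatic output, operator correction, and time spent."""
    text_id: str
    operator: str
    auto_entities: set        # (entity_type, surface) pairs from the recognition unit
    corrected_entities: set   # pairs after the operator's review
    seconds_spent: float

    def corrections(self):
        """Return (added, removed) entity pairs relative to the automatic output."""
        added = self.corrected_entities - self.auto_entities
        removed = self.auto_entities - self.corrected_entities
        return added, removed


def mean_seconds(records):
    """Average annotation time over a non-empty list of records."""
    return sum(r.seconds_spent for r in records) / len(records)


# Made-up example values, for illustration only.
record = AnnotationRecord(
    text_id="post-001",
    operator="operator-1",
    auto_entities={("Symptom", "頭暈")},
    corrected_entities={("Symptom", "頭暈"), ("Department", "耳鼻喉科")},
    seconds_spent=42.0,
)
added, removed = record.corrections()
print(added, removed, mean_seconds([record]))
```

Keeping both versions makes it possible to count how often operators had to add or remove an entity, and the same time field could be filled for a purely manual run to compare the two settings.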
Figure 2 shows the web interface of our semi-automatic annotation system, which is divided into two parts. The upper part is an input box in which the medical text to be marked is submitted. The lower part is the table of medical entity recognition results generated by the system. The ten entity types are labeled in different colors; for example, Health Information is marked in red and Symptom in sky blue. The result table is editable in the web interface, so users can manually correct entries and add unrecognized entities.
Table 5 shows the Chinese text and its English translation from part A of Fig. 3. Table 6 shows the extracted medical entities. Nine of the ten medical entity types were extracted; only Abbreviation was not.
Table 5
Chinese-English comparison/translation of medical texts
| Chinese | English
Medical Texts | 我是一位20歲的女性, 在當工程師, 您好, 大約距今大概一個月前, 我發現自己開始頭暈, 大多發生在起床、躺下、抬頭、低頭的時候, 刷牙時腦部晃動也會暈, 極少數發生在轉頭或走路時。 | Hello, I am a 20-year-old woman working as an engineer. About a month ago, I noticed that I had started to feel dizzy, mostly when getting up, lying down, looking up, or lowering my head. I also get dizzy when my head shakes while brushing my teeth, and very occasionally when turning my head or walking.
| 平常坐著時, 頭暈的情況雖然不明顯, 但就是覺得頭不舒服, 胃管插入, 影響到工作, 用了喜達諾注射液。 | Although the dizziness is not obvious when I am sitting normally, my head still feels uncomfortable. A stomach tube was inserted, it affects my work, and I used Starno injection.
| 我一開始先到小診所看診, 由於我去年曾因缺鐵性貧血而頭暈, 就先做了抽血檢查, 但檢查結果顯示血紅素沒問題。診所的醫師便建議我到大醫院檢查, 暫且幫我轉診到血液腫瘤科。 | I first went to a small clinic. Because I had been dizzy from iron deficiency anemia last year, I had a blood test first, but the results showed that my hemoglobin was fine. The doctor at the clinic then suggested that I be examined at a large hospital and, for the time being, referred me to the hematology and oncology department.
| 血液腫瘤科也幫我做了抽血檢查, 但這次主要是看血液中鐵的含量, 結果檢查結果出來也是正常, 醫師認為這不太像是血液的問題, 建議我改看耳鼻喉科。 | The hematology and oncology department also gave me a blood test, this time mainly to check the amount of iron in the blood, and the results were again normal. The doctor thought that this did not look like a blood problem and suggested that I see the otolaryngology department instead.
Table 6. Medical entity term labeled result