Extracting Disease-Specific Clinical Experiences from Ancient Literature of Traditional Chinese Medicine with Deep Learning

Background: The ancient literature of Traditional Chinese Medicine (TCM) contains a massive body of clinical experiences, which are an important ingredient of TCM knowledge and remain valuable for modern TCM clinical practice. However, it is difficult for TCM professionals to acquire these experiences due to their massive volume and broad dispersion across the literature. Furthermore, the characteristics that distinguish ancient Chinese from the modern language pose additional challenges for analyzing the literature, whether the analysis is performed manually or automatically with a software toolkit.
Methods: To overcome these challenges, we formalize a novel information extraction task for the ancient literature of TCM, in which the entities to be extracted are Disease-Specific Clinical Experiences (DSCEs) occurring in the literature. For this purpose, we collected two corpora from ancient TCM literature and manually annotated them with DSCE occurrence information for the diseases pregnant abdominalgia and colporrhagia (妊娠腹痛及下血) and jaundice (黄疸), respectively. We further propose a deep learning and CRF-based algorithmic framework built on character encoding of ancient Chinese, thus avoiding the special difficulty of word segmentation for ancient Chinese texts. We investigate the framework with different methods for contextual encoding of the characters in a sentence, including CNN, Bi-LSTM and BERT, and with diverse approaches for aggregating contextual character information into a sentence encoding, such as max-pooling and the attention mechanism. All the encoded sentences in a section of the literature are then passed through a Bi-LSTM-based sequence labelling model with CRF inference on top to obtain an optimal label sequence for the sentences in the section.
Results: We conduct a series of experiments on the two corpora to verify the effectiveness of our framework for the task, evaluating it with different metrics at two granularities of labelling, namely accuracy/F1-value for sentence-level labelling and precision/recall/F1-value for correct recognition of whole DSCEs.
Conclusion: The experimental results demonstrate that the deep learning and CRF-based framework with character encoding of ancient Chinese achieves an accuracy of 80.40% ± 1.64% and an F1-value of 76.73% ± 1.59% for sentence labelling, while for recognition of whole DSCEs it obtains a recall of 44.97% ± 2.16% and a precision of 51.13% ± 2.64%, meaning that the framework is a promising baseline for further development of this novel information extraction task for TCM. The experiments verify the usefulness of our framework, but also demonstrate the difficulties specific to the task, especially in determining the exact span of a DSCE and in dealing with the sparsity of DSCE occurrences, which are the directions of our subsequent studies on this new information extraction task.


Background
Traditional Chinese Medicine (TCM) has protected Chinese people from diseases for several thousand years. Over such a long period of practice, massive experiences in diagnosis and treatment have been distilled, recorded in the ancient literature of TCM, and turned into an important ingredient of TCM knowledge. These experiences are also among the most critical reference resources for modern TCM clinical practice and research. Even the achievement for which Youyou Tu received the 2015 Nobel Prize in medicine, the artemisinin therapy against malaria, was inspired by an ancient clinical experience mentioned in A Handbook of Prescriptions for Emergencies (肘后备急方), namely "A handful of Qinghao immersed in two liters of water, wring out the juice and drink it all." (青蒿一握,以水二升渍,绞取汁,尽服之) [1].
However, extracting the experiences manually from the ancient literature of TCM is time-consuming and labor-intensive due to the massive volume of the literature. According to [2], there are more than 10,000 distinct ancient TCM books in China with more than 37,000 different versions, and the experiences are scattered across almost all of them. Furthermore, the literature is written in ancient Chinese, a language different from the modern one, which leads to additional difficulties for analysis and understanding of the literature, whether performed manually or automatically with off-the-shelf software toolkits such as information retrieval systems. An information retrieval system always returns a list of documents, but the experiences needed for modern TCM research and clinical practice are hidden within various ancient books, where an experience constitutes just a text span in a section of a book. Moreover, existing information retrieval systems for Chinese texts rely heavily on word segmentation as a foundational operation, but in ancient Chinese the boundary between word and character is very vague, which obviously hinders their direct application to extracting the experiences from the ancient literature of TCM.
The importance of the ancient literature for modern TCM research and clinical practice was realized early on, but only recently have some researchers begun to resort to text mining and information extraction techniques to overcome the challenges mentioned above. These existing studies have focused mostly on semantic annotation [3], syndrome differentiation [4], clinical record classification [5,6] and knowledge graph construction [7], to name a few. However, to the best of our knowledge, no study focuses specifically on extracting clinical experiences from the literature of TCM, which are critical for TCM research and clinical practice, as mentioned above.
To bridge this gap in the extraction of TCM clinical experiences, in this paper we propose a deep learning and CRF-based algorithmic framework to extract Disease-Specific Clinical Experiences (DSCEs) from ancient TCM literature. The framework treats the task as sentence sequence labelling [8] and builds on character encoding of ancient Chinese, thus avoiding the special difficulty of word segmentation for ancient Chinese texts. We investigate the framework with different methods for contextual encoding of the characters in a sentence, as well as diverse approaches for aggregating contextual character information into a sentence encoding. All the sentences so encoded in a section of the literature are passed through a Bi-LSTM-based sequence labelling model with Conditional Random Fields (CRF) [9] inference on top to obtain an optimal label sequence for the sentences in the section.
To evaluate the performance of the framework for this novel information extraction task (i.e., extracting clinical experiences from the ancient literature of TCM), we collected two corpora from ancient TCM literature and manually annotated them with DSCE occurrence information for the diseases pregnant abdominalgia and colporrhagia (妊娠腹痛及下血) and jaundice (黄疸), respectively. We perform a series of experiments on these corpora to verify the effectiveness of our framework, evaluating it with different metrics at two granularities of labelling, namely accuracy/F1-value for sentence-level labelling and precision/recall/F1-value for correct recognition of whole DSCEs. The experimental results show that the framework with character encoding of ancient Chinese achieves an accuracy of 80.40%±1.64% and an F1-value of 76.73%±1.59% for sentence labelling, while for recognition of whole DSCEs it obtains a recall of 44.97%±2.16% and a precision of 51.13%±2.64%, meaning that the framework is a promising baseline for further development of this novel information extraction task for TCM.
In summary, the major contributions of this work are four-fold: (1) a novel information extraction task is proposed for extracting DSCEs from the ancient literature of TCM; (2) a deep learning and CRF-based algorithmic framework with character encoding of ancient Chinese is devised to solve the task; (3) a series of experiments is conducted to investigate the performance of the framework with alternative techniques in its different modules; (4) the two benchmark corpora and the source code of this paper are released for further development of solutions to the novel information extraction task [1].

Related work
In recent years, many disciplines have witnessed fast growth in exploiting machine learning and text mining technologies to discover knowledge hidden in massive volumes of data. Similarly, efforts have arisen in TCM that utilize machine learning and text mining for discovering knowledge from prescriptions and clinical records, such as treatment rule mining [10,11], medical term extraction [12-14], syndrome differentiation [15], knowledge graph construction [16] and fine-grained entity corpus construction [17]. However, the majority of these studies are devoted to structured data or unstructured text written in modern Chinese, in spite of the importance of the ancient literature for modern TCM research and clinical practice, as mentioned in Section Background.
Fortunately, a couple of pioneering works on knowledge discovery from ancient TCM literature have now appeared. Weng et al. [3] proposed to employ a Hidden Markov Model (HMM) for recognizing medical phrases related to the spleen and stomach in ancient TCM literature, and on that basis developed a knowledge-mining system to support spleen-related research in TCM. Various academic schools of TCM have formed in different regions of China. Nie et al. [4] presented a data mining approach to investigate the characteristics of the Bashu (巴蜀) school by exposing the relationships among diseases, symptoms and herbs, with the ancient literature published by this school as the source data for the mining. Yao et al. [5] investigated automatic classification of clinical records from Classified Medical Records of Distinguished Physicians Continued Two (二续名医类案), a famous book collecting clinical records provided by distinguished TCM physicians from different periods of history. To address the issue, they examined a variety of shallow classification algorithms (such as SVM, MaxEnt and Random Forest) with various kinds of textual features from the literature, including document embeddings enriched with TCM MeSH. In subsequent work, Yao et al. [6] further proposed a BERT-based method for the issue and gained much better performance with the pretrained model. Zhou et al. [7] outlined the construction of a knowledge graph based on the key concepts related to diseases, symptoms, prescriptions and herbs extracted from ancient TCM books, demonstrating its possible applications for TCM research and clinical practice.
Knowledge discovery from ancient TCM literature is still in its infancy and has achieved only limited results, due to the unique characteristics of ancient Chinese relative to the modern language mentioned in Section Background, which often cause existing Natural Language Processing (NLP) tools to fail on these ancient Chinese texts. Therefore, some authors have turned to fundamental tasks in the processing of ancient TCM literature, such as word segmentation and Part-of-Speech (POS) tagging [18].
Put simply, studies on automatic processing of ancient TCM literature are limited, and to the best of our knowledge, there is no research on extraction of DSCEs from the ancient literature of TCM, in spite of their importance for TCM academic research and clinical practice. In order to bridge this gap in the mining of ancient TCM literature, in this paper we formalize a novel information extraction task for ancient TCM literature, where the entities to be extracted are the DSCEs mentioned in the literature, and further propose a deep-learning-based framework to solve the task. We detail the task and the framework in the following section.

Task definition and method
In this section we elaborate on the definition of the task of extracting DSCEs from ancient TCM literature and introduce the framework for solving it, detailing the different methods within it. The solution in general consists of a sequence labelling process with a sequence-to-sequence deep learning architecture as its core component, coupled with a CRF inference level on top to obtain an optimal label sequence for all the sentences in a section of an ancient TCM book. The technical details of the framework are described in the Sequence labelling framework subsection.

Task definition
The new information extraction task we formalize here is to identify text spans in ancient TCM literature that describe experiences for a particular disease, as summarized by ancient TCM doctors and recorded in the literature. We call such text spans DSCEs for brevity of exposition. Figure 1 exemplifies some segments from an ancient TCM book titled Qixiao Liangfang (奇效良方), where the underlined text spans describe experiences for pregnant abdominalgia and colporrhagia (妊娠腹痛及下血). It is obvious from the underlined contents that an experience description for a particular disease almost always contains its pathogenesis, syndrome differentiation, clinical symptoms, etc., which are critical references for modern TCM research and clinical practice. Considering that in most situations an experience description consists of several successive sentences in a section of an ancient TCM book and hardly ever crosses section borders, we propose to treat the task as a sequence labelling problem over the sentence sequence of a section, employing the traditional BIO tagging scheme at the sentence level. Performing sequence labelling directly on a section rather than on a whole book greatly shortens the sequence to be labelled and thus raises the efficiency of the labelling. After the sequence labelling of a section with the BIO scheme, any text span in the section that starts at a sentence tagged 'B' and continues until the first subsequent tag 'B' or 'O' forms a DSCE. Figure 2 shows the top-level flowchart of the DSCE extraction process.
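To make the span-forming rule above concrete, the following sketch recovers DSCE spans from a section's sentence-level BIO tags. The function name and span representation are illustrative, not taken from the released code:

```python
# Sketch of DSCE span recovery from sentence-level BIO tags:
# a span starts at a 'B' sentence and runs until the next 'B' or 'O'.
def extract_dsce_spans(tags):
    """tags: list of 'B'/'I'/'O' tags, one per sentence of a section.
    Returns (start, end) sentence-index pairs (end exclusive) of DSCEs."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == 'B':
            if start is not None:
                spans.append((start, i))  # previous DSCE closed by a new 'B'
            start = i
        elif tag == 'O' and start is not None:
            spans.append((start, i))      # DSCE closed by an 'O'
            start = None
        # an 'I' tag simply continues the current DSCE
    if start is not None:
        spans.append((start, len(tags)))  # DSCE running to the section end
    return spans
```

For instance, the tag sequence O B I I O B B I yields three spans: sentences 1-3, sentence 5, and sentences 6-7.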

Sequence labelling framework
For sequence labelling on a section of an ancient TCM book, we propose a deep learning and CRF-based algorithmic framework with character encoding of ancient Chinese, in order to avoid the special difficulty of word segmentation for ancient Chinese texts. As depicted in Figure 3, our framework consists of four layers, namely an embedding layer, a sentence encoding layer, a sentence contextual encoding layer and a CRF layer. The embedding layer initializes every character in a sentence with its corresponding dense real-valued vector for further processing in the subsequent layers. The sentence encoding layer, taking the dense real-valued vectors of all the characters in a sentence, aims to capture the contextual embeddings of the Basic Feature Units (BFUs) in the sentence, and on that basis aggregates them into an encoding of the sentence. A BFU of a sentence could be a character, a character n-gram, or even the [CLS] token returned by the BERT model, depending on the model used for capturing the contextual embeddings. All the sentences in a section encoded by the sentence encoding layer are passed through the sentence contextual encoding layer to further incorporate contextual information of the section into the sentence encodings. These sentence encodings are then each fed into a multilayer perceptron, which returns a BIO-tag probability distribution for every sentence. Finally, the CRF layer takes the BIO-tag probability distributions of all the sentences in the section and generates an optimal BIO-tag sequence for the sentence sequence of the section. We examine diverse techniques in the layers of the framework and detail them in the subsequent subsections.

Embedding Layer
In recent years, deep learning approaches with word embeddings have become so popular in processing textual data that almost all state-of-the-art algorithms for such processing are exclusively deep-learning-based. However, such approaches always suffer from the issue of out-of-vocabulary (OOV) words due to the skewed distribution of words in textual data. In order to overcome the OOV challenge, approaches that embed sub-words (characters for Chinese) rather than whole words have been proposed, demonstrating improved performance in many text processing applications [6,19,20]. In contrast to processing modern Chinese texts, processing ancient Chinese texts is confronted with another special difficulty, namely the difficulty of word segmentation due to the vague difference between a word and a character. Therefore, embedding individual characters instead of whole words is more reasonable for ancient Chinese texts. Accordingly, in our framework a sentence is regarded as a sequence of characters, and characters, rather than words, are our minimal processing unit. In the embedding layer the characters are embedded with dense real-valued vectors pre-trained specially on ancient Chinese texts of TCM. Concretely, we collected 693 ancient TCM books as the corpus, regarded the characters as the words to be embedded, and trained the character embedding vectors with Google's word2vec tool [2] on the corpus. We call the character embedding obtained in this way TCM-char in the remainder of the paper for brevity. Besides, we also investigate the character embeddings trained in [21] with the BERT model (called Bert-base from now on), as well as the character embeddings formed by concatenating TCM-char and Bert-base, which we call TCM-char-Base.
[2] https://code.google.com/archive/p/word2vec/
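The character-as-word preprocessing for training TCM-char can be sketched as follows. The helper name is illustrative, and the commented-out gensim call is shown only as a hypothetical stand-in for Google's word2vec tool used in the paper:

```python
# Illustrative preprocessing for training character embeddings (TCM-char):
# each character is treated as a "word", so no word segmentation is needed.
def to_char_tokens(sentence):
    """Split an ancient-Chinese sentence into single-character tokens,
    dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]

# A toy two-sentence corpus in the word2vec input format (list of token lists).
sentences = ["青蒿一握", "以水二升渍"]
corpus = [to_char_tokens(s) for s in sentences]

# Hypothetical embedding training with gensim (uncomment if available);
# vector_size=64 matches the TCM-char dimensionality reported in Table 2:
# from gensim.models import Word2Vec
# model = Word2Vec(corpus, vector_size=64, window=5, min_count=1, sg=1)
```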

Sentence Encoding Layer
The task of the sentence encoding layer is to capture the contextual representations of the features in a sentence and to aggregate these representations into a representation of the sentence for subsequent processing. The features could be single characters or character n-grams, depending on the method employed to capture the contextual representations. We call all these diverse features Basic Feature Units (BFUs) for brevity. Formally, suppose S = (c_1, c_2, ..., c_N) is the sentence to be encoded, consisting of N Chinese characters, where c_i is the embedding of its i-th character initialized with TCM-char, Bert-base or TCM-char-Base. The sentence encoding layer learns the representation Z (a real-valued vector) of S through two steps, namely the BFU contextual encoding step and the aggregation step. We detail them below.

BFU Contextual Encoding
We examine three alternative approaches for BFU contextual encoding, namely CNN, Bi-LSTM and BERT, which accordingly result in different particular BFUs: character n-grams for CNN, characters for Bi-LSTM and BERT, and [CLS] for BERT. We elaborate on learning contextual encodings of these BFUs with their corresponding approaches below.
• CNN-based BFU contextual encoding The capability of CNN to capture local linguistic dependencies motivates us to choose it for our character n-gram BFU contextual encoding. Particularly, let g be a convolution kernel with width a, and let c_{i:i+j} denote the concatenation of the successive character embeddings c_i, c_{i+1}, ..., c_{i+j} in S. The BFU contextual embedding generated at position i in S, denoted BFU^g_i, is calculated as

BFU^g_i = f(g · c_{i:i+a-1} + b),    (1)

where b is the bias and f is a nonlinear activation function. With the padding method named SAME, for a convolution kernel g and a sentence of length N, we obtain a sequence of N distinct BFUs contextually encoded as real-valued vectors, i.e.,

BFU^g_1, BFU^g_2, ..., BFU^g_N.    (2)
They are aggregated into the sentence representation in the subsequent aggregation step.
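The SAME-padded convolution over character embeddings can be sketched in numpy for a single kernel. This is an illustrative toy (one kernel, ReLU activation, scalar output per position); a real model uses many learned kernels:

```python
import numpy as np

# Minimal numpy sketch of the CNN-based BFU encoding with SAME padding:
# one kernel g of width a slides over N character embeddings of dimension d,
# yielding one activation per position (N outputs in total).
def cnn_bfu_same(chars, g, b):
    """chars: (N, d) character embeddings; g: (a, d) kernel; b: bias scalar.
    Returns an array of N contextual activations, one per position."""
    N, d = chars.shape
    a = g.shape[0]
    left = (a - 1) // 2
    padded = np.vstack([np.zeros((left, d)),
                        chars,
                        np.zeros((a - 1 - left, d))])  # SAME padding
    out = np.empty(N)
    for i in range(N):
        window = padded[i:i + a]                       # the n-gram at position i
        out[i] = np.maximum(np.sum(window * g) + b, 0.0)  # ReLU activation
    return out
```

Positions near the sentence borders see zero padding, so the output length stays N, matching the BFU sequence described above.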
• Bi-LSTM-based BFU contextual encoding Bi-LSTM is characterized by its long short-term memory, which benefits learning long-distance dependencies in textual data. We thereby examine it in our framework for BFU contextual encoding. For a position i in S, the Bi-LSTM generates a BFU contextual encoding as the concatenation of the forward and backward hidden outputs, i.e.,

BFU_i = [h→_i ; h←_i],    (3)

where h→_i and h←_i denote the forward and backward hidden outputs of the Bi-LSTM at position i.
• BERT-based BFU contextual encoding In recent years BERT [22] appears in almost all state-of-the-art algorithms for NLP tasks. Indeed, owing to its multi-layer bidirectional transformer and the self-attention mechanism [23], BERT has been applied successfully to almost all NLP tasks. We therefore incorporate BERT into our framework and investigate its performance for BFU contextual encoding in the task of DSCE extraction from ancient TCM literature. The BERT for BFU contextual encoding in our framework is depicted in Figure 4, where [CLS] is added as a special character to denote the entire sentence, and c_[CLS] is its initial embedding. Through its multi-layer bidirectional transformers and self-attention, BERT incorporates contextual information inside the sentence and generates a contextually embedded vector for [CLS] as well as for every character in the sentence. We thereby obtain two groups of BFUs from BERT for a sentence. The first group consists of the single contextual embedding of [CLS], denoted BFU^(2) in Figure 4, while the second group contains all the character contextual embeddings in the sentence, i.e., BFU_1, BFU_2, ..., BFU_N.

Aggregation
After capturing the BFU contextual embeddings for the sentence S with any of the alternative approaches above, the aggregation step composes them into a representation of the entire sentence S for subsequent processing and the final sentence labelling. We examine two diverse ways of aggregation, namely max-pooling [24] and attention [25].
For the sentence S with BFU contextual embeddings BFU_1, BFU_2, ..., BFU_N, max-pooling aggregation results in the representation Z of S with Z = max{BFU_1, BFU_2, ..., BFU_N} (element-wise maximum). If the BFU contextual embedding is the single [CLS] embedding of BERT (i.e., the first group of the BFU contextual embeddings), then Z = BFU^(2). For the CNN-based BFU contextual encoding, we employ several convolution kernels in order to capture different semantic aspects of an n-gram, thus obtaining multiple max-pooling aggregations for a sentence: if we employ k convolution kernels, and for the j-th kernel g_j the BFU contextual embeddings of S are BFU^{g_j}_1, ..., BFU^{g_j}_N, the k max-pooling aggregations over these embeddings are concatenated to form the final representation Z.

With the attention mechanism, the BFU contextual embeddings of a sentence are aggregated according to their respective contributions to the sentence, and the contribution strengths are learned automatically. The process of attention-based aggregation is shown in Figure 5: for the sentence S with BFU contextual embeddings BFU_1, BFU_2, ..., BFU_N, their attention weights are first learned with the query vector u_c, and then BFU_1, BFU_2, ..., BFU_N are weighted-summed to generate the representation Z of S. The attention weights are calculated with formulas (4) and (5), and the weighted sum with formula (6):

u_i = tanh(W_c BFU_i + b_c),    (4)
α_i = exp(u_i^T u_c) / Σ_j exp(u_j^T u_c),    (5)
Z = Σ_i α_i BFU_i,    (6)

where W_c is the weight matrix and b_c is the bias. If the BFU contextual embedding is the single [CLS] embedding of BERT, attention aggregation also leads to Z = BFU^(2). For the CNN-based BFU contextual encoding, the attention aggregation is performed for every convolution kernel to obtain a sentence representation for that kernel, and the representations for all k convolution kernels are concatenated to form the final sentence representation Z.
Besides max-pooling and attention, we also examine a simple aggregation method for the Bi-LSTM-based BFU contextual encoding, in which the last BFU encoding represents the sentence, i.e., Z = BFU_N.
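The attention aggregation can be sketched in a few lines of numpy. The parameter arrays W_c, b_c and u_c are placeholders here; in the framework they are learned jointly with the rest of the model:

```python
import numpy as np

# Numpy sketch of attention aggregation over BFU contextual embeddings:
# a query vector u_c scores each BFU, the scores are softmax-normalized,
# and the BFUs are weighted-summed into the sentence representation Z.
def attention_aggregate(bfus, W_c, b_c, u_c):
    """bfus: (N, d); W_c: (d, d); b_c: (d,); u_c: (d,). Returns Z: (d,)."""
    u = np.tanh(bfus @ W_c + b_c)       # hidden scores of the BFUs
    logits = u @ u_c                    # similarity with the query vector
    alpha = np.exp(logits - logits.max())
    alpha = alpha / alpha.sum()         # softmax attention weights
    return alpha @ bfus                 # weighted sum of the BFUs
```

When all BFUs are identical the weights become uniform and Z simply equals each BFU, which is a handy sanity check.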

Sentence Contextual Encoding Layer
In order to further capture contextual information within a section, the sentence contextual encoding layer takes the representations of the section's sentences generated in the sentence encoding layer and updates them with a Bi-LSTM operating on the sentence sequence of the section. Suppose the section D to be sequentially labelled contains the successive sentences S_1, S_2, ..., S_L, with sentence representations Z_1, Z_2, ..., Z_L respectively. The contextual encoding h_i of the sentence S_i in D generated by the Bi-LSTM is

h_i = [h→_i ; h←_i],    (7)

where h→_i and h←_i are the forward and backward hidden outputs of the Bi-LSTM at position i. The resulting h_i is then fed into an MLP that transforms it into a probability distribution r_i over the elements of the BIO label set K = {B, I, O}, and the label distributions of all the sentences in a section are passed through the CRF layer to determine an optimal label sequence for the section. The CRF layer is presented in the next subsection.
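The MLP step that turns a sentence's contextual encoding h_i into the BIO distribution r_i can be sketched as a single affine layer followed by a softmax. The weights here are placeholders for parameters learned during training:

```python
import numpy as np

# Sketch of the MLP mapping a sentence's contextual encoding h to a
# probability distribution r over the BIO labels (B, I, O).
def bio_distribution(h, W, b):
    """h: (d,) sentence contextual encoding; W: (d, 3); b: (3,).
    Returns r: (3,) probabilities over the labels B, I and O."""
    logits = h @ W + b
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

With zero weights the output is the uniform distribution over the three labels, since all logits coincide.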

CRF Layer
For a section D containing the successive sentences S_1, S_2, ..., S_L with corresponding label distributions r_1, r_2, ..., r_L, the CRF layer aims to determine an optimal label sequence y* = (y*_1, y*_2, ..., y*_L) for D by scoring all its possible label sequences. For a label sequence y = (y_1, y_2, ..., y_L), the CRF scores it with

score(D, y) = Σ_{i=1}^{L} r_i[y_i] + Σ_{i=1}^{L-1} A_{y_i, y_{i+1}},    (8)

where r_i[y_i] denotes the probability of S_i being labelled with y_i, and A_{y_i, y_{i+1}} is the probability of y_{i+1} following y_i, learned in the training phase. The scores of all possible label sequences for D are then transformed into a probability distribution over these sequences as in formula (9):

p(y|D) = exp(score(D, y)) / Σ_{y'∈Y} exp(score(D, y')),    (9)

where Y is the set of all possible label sequences for D. We use the classic Viterbi algorithm [26] to choose an optimal label sequence for D, i.e., the label sequence with the maximal probability, as expressed in formula (10):

y* = argmax_{y∈Y} p(y|D).    (10)
All the model parameters in the embedding, sentence encoding, sentence contextual encoding and CRF layers are trained jointly using the negative log-likelihood loss with L2-norm regularization, as shown in formula (11):

Loss = -Σ_{(D,y)} log p(y|D) + λ ||ω||²,    (11)

where ω denotes all the model parameters and λ is a hyperparameter trading off the regularization term against the loss.
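Viterbi decoding over the per-sentence label scores r_i and the transition scores A can be sketched in numpy as follows. This is an illustrative implementation, not the paper's released code:

```python
import numpy as np

# Numpy sketch of Viterbi decoding for the CRF layer: per-sentence label
# scores (emissions) plus transition scores A between adjacent labels;
# returns the label sequence with the maximal total score.
def viterbi(emissions, A):
    """emissions: (L, K) label scores for L sentences; A: (K, K) transitions,
    A[p, c] scoring label c following label p. Returns best label indices."""
    L, K = emissions.shape
    score = emissions[0].copy()              # best score ending in each label
    back = np.zeros((L, K), dtype=int)       # backpointers for path recovery
    for i in range(1, L):
        # cand[p, c] = best score of a path ending with labels (p, c)
        cand = score[:, None] + A + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]             # best final label
    for i in range(L - 1, 0, -1):
        path.append(int(back[i][path[-1]]))  # follow backpointers
    return path[::-1]
```

With zero transition scores the decoding reduces to a per-sentence argmax, while strongly negative off-diagonal transitions can keep the sequence on one label even against the emissions.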

Experiments
In order to evaluate the performance of our framework, with the alternative approaches applied in its different layers and modules, on the DSCE extraction task, we collected two corpora from the ancient literature of TCM and manually annotated them with DSCE occurrence information for the diseases pregnant abdominalgia and colporrhagia (妊娠腹痛及下血) and jaundice (黄疸), respectively. In this section, we introduce these two corpora and their annotations, report the experimental results of our framework on the corpora, and present the implications that can be drawn from the empirical results.

Datasets and evaluation metrics
The first corpus [3], named PAC in the following exposition, is oriented towards extracting DSCEs of pregnant abdominalgia and colporrhagia (妊娠腹痛及下血) from ancient TCM literature. When constructing PAC, the two co-authors of this paper from the TCM domain chose a list of query keywords according to their own experience, including "pregnant (妊)", "pregnant (娠)", "pregnant (孕)" and "fetus (胎)", and retrieved from Chinese Medical Classics (中华医典) [4] the ancient TCM books matching at least one character from the list. Through this retrieval process, they determined 114 books in Chinese Medical Classics (中华医典) mentioning pregnant abdominalgia and colporrhagia (妊娠腹痛及下血). After that, all the sections in these books containing one or more keywords from the list were given to two TCM graduate students, who were asked to annotate the DSCEs of pregnant abdominalgia and colporrhagia (妊娠腹痛及下血) occurring in these sections. All the sections containing at least one DSCE occurrence of pregnant abdominalgia and colporrhagia (妊娠腹痛及下血) were composed into the corpus PAC, ignoring the other sections. The corpus PAC annotated in this manual way was then transformed into the BIO tagging form, as in the example in Figure 6. The other corpus, named JAUNDICE in the following exposition, was constructed and annotated in a similar way, except that it is oriented towards extracting DSCEs of jaundice (黄疸) rather than pregnant abdominalgia and colporrhagia (妊娠腹痛及下血), with the query keywords for this disease being "jaundice (黄疸)", "jaundice (疸)" and "jaundice (瘅)". The two co-authors finally determined 70 books mentioning jaundice (黄疸). Table 1 summarizes the statistics of the two annotated datasets.
[3] The two terms corpus and dataset are used interchangeably in this paper for ease of exposition.
[4] Chinese Medical Classics (中华医典) is a large-scale electronic collection of ancient TCM books with its own special content organization and search functionality.
We evaluate our framework on the annotated datasets at two granularities of labelling, namely sentence-level labelling and DSCE-level labelling. For sentence-level labelling we employ the traditional metrics, including precision, recall, F1-value and accuracy. Specifically, for a tag i from {B, I, O}, the precision P_i, recall R_i and F1-value F1_i are defined respectively in formulas (12), (13) and (14):

P_i = TP_i / (TP_i + FP_i),    (12)
R_i = TP_i / (TP_i + FN_i),    (13)
F1_i = 2 · P_i · R_i / (P_i + R_i),    (14)

where TP_i, FP_i and FN_i denote the numbers of true positives, false positives and false negatives for the tag i, respectively.
We further aggregate the precision, recall and F1-value of the single tags into the corresponding metrics over all the tags in {B, I, O} with the macro-average method. Hereafter, all mentions of precision, recall and F1-value refer to the macro-averaged aggregation over all the tags in {B, I, O}.
Besides precision, recall and F1-value, we also evaluate the sentence-level labelling of the framework with its accuracy Acc, which is calculated with formula (15):

Acc = (number of correctly labelled sentences) / (total number of sentences).    (15)
For DSCE-level labelling we employ precision, recall and F1-value to assess our framework, defined respectively in formulas (16), (17) and (18):

P = (number of correctly recognized DSCEs) / (number of recognized DSCEs),    (16)
R = (number of correctly recognized DSCEs) / (number of annotated DSCEs),    (17)
F1 = 2 · P · R / (P + R).    (18)
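Under an exact-span-match reading of "correct recognition" (our assumption for this sketch), the DSCE-level metrics can be computed as follows, with spans given as (start, end) sentence-index pairs and a hypothetical helper name:

```python
# Sketch of DSCE-level evaluation: a predicted DSCE counts as correct only
# when its span exactly matches an annotated (gold) span; precision, recall
# and F1 then follow from the exact-match count.
def dsce_metrics(gold_spans, pred_spans):
    """gold_spans, pred_spans: lists of (start, end) sentence-index pairs.
    Returns (precision, recall, f1)."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A prediction that overlaps a gold DSCE but misses its exact boundaries scores zero under this criterion, which is why DSCE-level figures are much lower than sentence-level ones.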

Experimental Setup
Considering the different characteristics of the two annotated datasets, in particular their difference in size, we treat them separately in our experiments. Concretely, the dataset PAC is divided randomly into training, validation and testing sets in the ratio 8:1:1, while the dataset JAUNDICE, due to its relatively small size, is split into only two sets, a training set and a testing set, in the ratio 9:1. All the hyperparameters of our framework are tuned on the validation set from PAC, and the resulting hyperparameter settings are applied to the experiments on JAUNDICE. The parameter settings are given in Table 2.
We implement our framework in PyTorch, using the Adam [27] optimizer in the training phase to learn the parameters and incorporating dropout into the sentence encoding and sentence contextual encoding layers to avoid overfitting. We evaluate our framework under a 10-fold cross-validation schema, and report not only the mean of each metric over the 10 folds but also its corresponding 95%-confidence interval.

Table 2 Hyper-parameter settings

  Hyper-parameter name                                      Value
  dimensionality of TCM-char                                64
  dimensionality of Bert-base                               768
  widths of convolution kernels                             3, 4 and 5
  #convolution kernels for a particular width               300
  dimensionality of the Bi-LSTM-based BFU encoding          300
  dimensionality of the sentence contextual encoding        300
  dropout probability                                       0.5
  coefficient λ of the L2 regularization term               0.0001
  learning rate for the CNN-based and
    Bi-LSTM-based BFU encodings                             0.0001
  learning rate for the BERT-based BFU encoding             5e-5

In addition to the aggregation approaches described above, we examine a hybrid approach for the BERT-based BFU contextual encoding, which first aggregates the BFU contextual embeddings BFU_1, ..., BFU_N in two ways and then concatenates the two aggregation results as the aggregated sentence representation. We call this hybrid approach for the BFU contextual encoding and the aggregation Hybrid. The experimental results are given in Table 3.
We can see in Table 3 that, among all the combinations of alternatives in the character embedding, BFU contextual encoding and aggregation, the combination of the Bert-base character embedding, the CNN-based character n-gram BFUs and the max-pooling aggregation is the best under all the performance metrics, e.g., an accuracy of 80.40%±1.64% and an F1-value of 76.73%±1.59%. Furthermore, the results also demonstrate that, for a fixed BFU contextual encoding approach and a fixed aggregation approach, the Bert-base consistently outperforms the other two character embedding alternatives; e.g., the average accuracy of the Bert-base for the Bi-LSTM-based character BFUs and the attention aggregation is 74.80%, about 4.5% higher than that of TCM-char and 3.5% higher than that of TCM-char-Base for the same BFU contextual encoding approach and the same aggregation approach.
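For readers unfamiliar with the two aggregation alternatives compared here, the following minimal NumPy sketch contrasts max-pooling with a simplified attention aggregation over the BFU contextual encodings of one sentence. Both function names are ours, and the `query` vector stands in for the trained attention parameters:

```python
import numpy as np

def max_pool_aggregate(bfu_encodings):
    """Max-pooling aggregation: element-wise maximum over the BFU
    contextual encodings of a sentence (shape: [n_bfus, dim])."""
    return bfu_encodings.max(axis=0)

def attention_aggregate(bfu_encodings, query):
    """Simplified attention aggregation: score each BFU encoding
    against a query vector, normalize with softmax, and return the
    weighted sum of the encodings."""
    scores = bfu_encodings @ query                 # [n_bfus]
    weights = np.exp(scores - scores.max())        # numerically stable softmax
    weights /= weights.sum()
    return weights @ bfu_encodings                 # [dim]
```

With an all-zero query the attention weights are uniform, so the result degenerates to the mean of the BFU encodings.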
With regard to the BFU contextual encoding, the CNN-based approach performs almost as well as the BERT-based approach, and both of them outperform the Bi-LSTM-based BFU contextual encoding, regardless of the approach employed for the subsequent aggregation. The superiority of CNN over Bi-LSTM in the BFU contextual encoding suggests that, although the boundary between words and characters in the ancient Chinese language is very vague and word segmentation of its texts is therefore difficult, the character n-grams captured by CNN can operate as proxies of words. BERT has a similar ability to capture character n-grams for the BFU contextual encoding through its transformer mechanism and character position embeddings. Among the alternative aggregation techniques, attention is useful only for the Bi-LSTM-based BFU contextual encoding, regardless of the character embedding method employed.
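The word-proxy intuition behind the CNN-based BFUs can be made concrete with a simplified NumPy sketch: each convolution window of width w produces one feature vector per character w-gram. The naming is ours; in the actual framework the kernels are learned, and a bias term is omitted here:

```python
import numpy as np

def cnn_ngram_bfus(char_embs, kernels):
    """CNN-based character n-gram BFUs (simplified sketch).
    char_embs: [n_chars, dim] character embeddings of a sentence.
    kernels: list of arrays of shape [w, dim, n_filters], one per
    kernel width w (3, 4 and 5 in Table 2).
    Returns one matrix of BFU vectors per kernel width, where each row
    encodes one character w-gram, i.e. a proxy for a word."""
    dim = char_embs.shape[1]
    outputs = []
    for k in kernels:
        w = k.shape[0]
        # flatten every sliding window of w consecutive characters
        windows = np.stack([char_embs[i:i + w].reshape(-1)
                            for i in range(len(char_embs) - w + 1)])
        # convolution as matrix product, followed by ReLU
        outputs.append(np.maximum(windows @ k.reshape(w * dim, -1), 0))
    return outputs
```

A sentence of 6 characters and a width-3 kernel thus yield 6 - 3 + 1 = 4 overlapping trigram BFUs.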
Due to its better performance, we employ the Bert-base as the only character embedding approach in the subsequent experiments. We first report the experimental results of our framework on the dataset PAC for recognition of the whole DSCEs, which are given in Table 4. They demonstrate that, as in the sentence-level labelling, the CNN-based and BERT-based BFU contextual encodings perform almost equally well in recognizing the whole DSCEs, both outperforming the Bi-LSTM-based BFU contextual encoding. Comparing Table 4 with Table 3, we can see much degraded performance in recognizing the whole DSCEs, which exemplifies the special difficulties of our novel information extraction task. Different from the entities to be extracted in other information extraction tasks, the DSCEs to be extracted for a particular disease often have a large span in an ancient TCM book. Furthermore, in such a book DSCEs always occur sparsely. These phenomena confront our framework with three challenges that remain open in machine learning, namely skew data distribution, data sparsity and long-distance dependency in sequence labelling. Therefore, our information extraction task deserves further and continuous study.
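To make explicit what recognizing a whole DSCE involves on top of sentence-level labelling, a minimal decoder from per-sentence labels to DSCE spans might look as follows. The BIO label scheme and the function name are our illustrative assumptions:

```python
def labels_to_spans(labels):
    """Decode a sentence-level BIO label sequence into DSCE spans,
    given as (start, end) sentence indices, both inclusive."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B":                     # a new DSCE begins here
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif lab == "O":                   # outside any DSCE
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:                  # DSCE running to the end
        spans.append((start, len(labels) - 1))
    return spans
```

A single mislabelled sentence inside a long DSCE splits it into two spans, which illustrates why whole-DSCE recognition degrades so sharply relative to sentence-level accuracy.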
Admittedly, the results in Table 4 are discouraging. However, if we weaken our goal from exact recognition of the DSCEs to recognition of their approximate positions in an ancient TCM book, the results become encouraging, and such positions are also useful for TCM professionals to acquire clinical experiences from ancient TCM literature. Table 5 shows the results of recognizing the whole DSCEs in the weakened situation, where a recognized DSCE is regarded as precise if its span overlaps with a manually annotated DSCE span, and a manually annotated DSCE is regarded as recalled if its span overlaps with a recognized DSCE span. The P_DSCE, R_DSCE and F1_DSCE are calculated according to the formulas (16), (17) and (18) respectively under the weakened definitions of a precise test instance and a recalled test instance. We call such assessment of recognizing the whole DSCEs loose, and the original exact assessment strict.
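Under these weakened definitions, the loose metrics can be sketched as follows; spans are inclusive sentence-index pairs, and the function names are ours:

```python
def overlaps(a, b):
    """Two inclusive sentence-index spans overlap iff they share
    at least one sentence."""
    return a[0] <= b[1] and b[0] <= a[1]

def loose_dsce_prf(predicted, gold):
    """Loose assessment: a predicted DSCE is precise if it overlaps any
    gold span; a gold DSCE is recalled if any prediction overlaps it."""
    p_hits = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    r_hits = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    p = p_hits / len(predicted) if predicted else 0.0
    r = r_hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that, unlike the strict variant, a single prediction overlapping several gold spans can recall all of them at once.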
The experimental data in Table 5 indicate that our framework performs very well in determining the mention positions of the DSCEs in ancient TCM literature, although it has difficulty identifying their exact spans (i.e., their beginning and ending positions). The data also support the conclusion above that the CNN-based and BERT-based BFU contextual encodings perform almost equally well, both outperforming the Bi-LSTM-based BFU contextual encoding.
Experimental results on the dataset JAUNDICE

We further perform experiments of our framework on the dataset JAUNDICE in order to investigate its generalization ability for DSCEs of a particular disease with less annotated training data, as well as the adaptability of its hyper-parameters from one disease to another. As introduced in subsection Datasets and evaluation metrics, the dataset JAUNDICE is much smaller than PAC, and consists of sections with annotated DSCEs for jaundice (黄疸), rather than pregnant abdominalgia and colporrhagia (妊娠腹痛及下血), the annotated disease of PAC. We split the dataset JAUNDICE into a training set and a testing set in the ratio of 9:1, and train our framework with the hyper-parameter settings in Table 2, which are tuned on the dataset PAC. As before, we evaluate our framework on JAUNDICE based on the 10-fold cross-validation schema, and report the mean of a metric over all the 10 folds together with its corresponding 95%-confidence interval. The experimental results of the sentence-level labelling are given in Table 6.
It is obvious from Table 6 that, due to the small size of the dataset JAUNDICE and the hyper-parameters being tuned outside of it, the performance of our framework in sentence-level labelling on JAUNDICE is much poorer than on PAC, but the same tendency holds, i.e., the CNN-based and BERT-based BFU contextual encodings perform almost equally well, both outperforming the Bi-LSTM-based BFU contextual encoding. The corresponding experimental results for recognizing the whole DSCEs on JAUNDICE are shown in Table 7, where we can see a similar phenomenon: our framework performs well in determining the mention positions of the DSCEs in ancient TCM literature, but has difficulty identifying their exact beginning and ending positions.

Discussion
In this paper we define a novel information extraction task for recognizing DSCEs in ancient TCM literature, and propose a framework with alternative techniques in its modules to solve the task. The experimental results shown in Section Experiments demonstrate that the framework is effective in helping TCM professionals find disease-specific clinical experiences in ancient TCM literature. Furthermore, we examine different combinations of the alternative techniques in our framework, and find that the Bert-base character embedding is more appropriate than TCM-char for our task, and that the CNN-based and BERT-based approaches are almost equally powerful, and both more powerful than the Bi-LSTM-based approach, in capturing the features.
Although the results are encouraging, the specificities of our task, namely the long spans of the DSCEs and their sparse occurrences in ancient TCM literature, make the task much more difficult than other information extraction tasks, and it deserves further in-depth research.

Conclusion
We present a novel extraction task where the entities to be extracted are DSCEs in ancient TCM literature, in order to assist TCM professionals in obtaining valuable disease-specific clinical experiences from the massive volume of ancient TCM literature. For the purpose we have collected two corpora from ancient literature of TCM and annotated them manually with DSCE occurrence information for the diseases pregnant abdominalgia and colporrhagia (妊娠腹痛及下血) and jaundice (黄疸) respectively. Furthermore, we propose a framework to solve the task, examining different alternative techniques in the framework to overcome the challenges specific to ancient TCM text processing, such as character embeddings without word segmentation and the attention mechanism on character n-gram encodings. The experiments performed verify the usefulness of our framework, but also demonstrate the difficulties specific to the task, especially determining the exact span of a DSCE and overcoming the sparsity of DSCE occurrences, which are the directions of our subsequent studies on this new information extraction task.

Availability of data and materials

The datasets used and analyzed during the current study are available from https://github.com/153lym/TCM DSCEs.