Research on the diagnosis and treatment of the dominant diseases of traditional Chinese medicine based on machine learning

Background: Insomnia, one of the dominant diseases of traditional Chinese medicine (TCM), has been extensively studied in recent years. To explore novel approaches to research on TCM diagnosis and treatment, this paper presents a machine learning-based strategy for the study of insomnia.

Methods: First, 654 insomnia cases were collected from an experienced TCM doctor as sample data. Second, in light of the characteristics of TCM diagnosis and treatment, the contents of the samples were divided into four parts: the basic information, the four diagnostic methods, the treatment based on syndrome differentiation, and the main prescription. These four parts were then analyzed by three methods: frequency analysis, association rules and hierarchical cluster analysis. Finally, a comprehensive study of all four parts was conducted with a random forest.

Results: The analyses of the four parts revealed some essential connections. With the random forest model, the accuracy of predicting the main prescription from the combination of the four diagnostic methods and the treatment based on syndrome differentiation was 0.85. Furthermore, feature extraction with the random forest showed that the syndrome differentiation of the five zang-organs was the most significant parameter in TCM diagnosis and treatment.

Conclusions: The results indicate that machine learning methods are worth adopting to study the dominant diseases of TCM and to explore the crucial rules of diagnosis and treatment.

information, the four diagnoses of TCM, the treatment based on syndrome differentiation and the main prescription. The diagnosis and treatment of TCM form a whole, from the information collection (including basic information and the four diagnoses) to the treatment based on syndrome differentiation, and then to the establishment of the main prescription. The whole diagnosis and treatment process is not only logical but also indivisible. In the past decades, many efforts have been made to study this process, but most studies have focused on only one part of it. Zhang S et al. and treatment, with the result that their conclusions can hardly be applied in clinical practice [11]. Therefore, for the sake of the reliability and comprehensiveness of the research method, the research in the present paper is carried out following the sequence of TCM diagnosis and treatment, and the process as a whole is discussed at the end.

In recent years, the rapid development of data analysis and artificial intelligence has provided a novel research direction for improving clinical diagnosis and treatment technology. At present, data mining and machine learning methods are widely used in the field of TCM [12].

In the present paper, medical record data of insomnia are selected as the research samples. treatment ideas, the contents of the sample data are divided into four parts: the basic information, the four diagnostic methods, the treatment based on syndrome differentiation and the main prescription.

Each part contains several data items, and the specific data processing steps are described later. In light of the characteristics of the data, the machine learning methods, including frequency analysis, association rules and hierarchical clustering analysis, are adopted to process and mine the data. Finally, the data on TCM diagnosis and treatment ideas from the four diagnoses, the treatment based on syndrome differentiation and the main prescription are discussed as a whole by employing the random forest algorithm. The research strategy designed in this paper is illustrated in Figure 1.

Figure 1. Flowchart of the research strategy designed in this paper.

In the process of coding and classification, the Guidelines for the diagnosis and treatment of  these values are filled with their mean value. The substitute values are specified in Table 2. The processed data set is imported into Python. The data samples are quantified by programming and then analyzed by applying the following machine learning methods.
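The mean-value substitution mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the function name `fill_with_mean` and the sample column are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: the column has no observed values")
    substitute = mean(observed)
    return [substitute if v is None else v for v in values]

# Hypothetical coded column with two missing entries.
column = [2, None, 4, 6, None]
filled = fill_with_mean(column)
print(filled)
```

The substitute is computed from the observed entries only, so the missing entries do not bias the mean.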
Frequency is also known as "count". The data are divided into groups according to preset standards, and the number of individuals in each group is then counted. The relative frequency of a group is the ratio of its frequency to the total number of data.
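As a minimal sketch of this definition, the frequency and relative frequency of each group can be computed as follows. The helper name `frequency_table` and the gender labels are hypothetical and serve only as an illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return {group: (frequency, relative frequency)} for a list of group labels."""
    counts = Counter(values)
    total = len(values)
    return {group: (n, n / total) for group, n in counts.items()}

# Hypothetical gender labels for eight patients.
genders = ["F", "F", "M", "F", "M", "F", "F", "M"]
table = frequency_table(genders)
for group, (n, rel) in sorted(table.items()):
    print(f"{group}: frequency={n}, relative frequency={rel:.3f}")
```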

Association rules
A frequently used method for studying relationship rules among data is association rule mining with the Apriori algorithm [15]. Generally, three indicators, namely support, confidence and lift, are used to evaluate an association rule. Support is defined as the proportion of records in the data set that contain the item set, thus measuring how frequently the item set appears in the original data. For instance, if X and Y are two item sets in the data set, then:

$$\mathrm{Support}(X \rightarrow Y) = P(X \cup Y) = \frac{\mathrm{count}(X \cup Y)}{N}$$

where $X \cup Y$ represents the union of X and Y, and $N$ is the total number of records in the data set.

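Confidence is defined for an association rule. The confidence of X→Y is the proportion of records containing X that also contain Y, and can be expressed as follows:

$$\mathrm{Confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{P(X \cup Y)}{P(X)}$$

Lift reflects the correlation between X and Y in an association rule. As expressed in the following function, the lift is the ratio of the probability that a record contains Y given that it contains X to the overall probability that a record contains Y:

$$\mathrm{Lift}(X \rightarrow Y) = \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cup Y)}{P(X)\,P(Y)}$$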
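A lift greater than 1 indicates a positive correlation between X and Y, and the higher the lift, the stronger the positive correlation; conversely, a lift below 1 indicates a negative correlation. A lift equal to 1 indicates that there is no correlation.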
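The three indicators can be illustrated with a short sketch. The transactions below are hypothetical records built from factor names that appear in this paper (fire, heat, sthenia, asthenia); they are not the study data, and the function names are chosen only for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Support of X and Y together divided by the support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def lift(transactions, x, y):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(transactions, x, y) / support(transactions, y)

# Hypothetical records: each set holds the factors coded for one case.
transactions = [
    {"fire", "heat"},
    {"fire", "heat", "sthenia"},
    {"fire", "sthenia"},
    {"heat"},
    {"asthenia"},
]

print(support(transactions, {"fire", "heat"}))
print(confidence(transactions, {"fire"}, {"heat"}))
print(lift(transactions, {"fire"}, {"heat"}))
```

In this toy data the rule fire → heat has a lift slightly above 1, that is, a weak positive correlation.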

Cluster analysis
At present, cluster analysis is extensively used in the medical field [16]. In general, clustering algorithms can be classified into two categories: hierarchical clustering algorithms and partitional (non-hierarchical) clustering algorithms. In Euclidean space, hierarchical clustering can achieve good results on small-scale data sets. Its basic principle is to build a hierarchical clustering tree by calculating the similarity among data points of different categories and adopting a bottom-up aggregation strategy. Each sample in the data set is initially regarded as a cluster of its own, and the clusters that are closest to each other are then merged step by step until the expected number of clusters is reached.

Assuming that there are clusters $C_i$ and $C_j$, the function can be described as follows:
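A minimal sketch of the bottom-up merging procedure is given here. It assumes average linkage with Euclidean distance as the distance between two clusters; this linkage choice, the function names and the sample points are assumptions made for illustration and are not taken from the study.

```python
from itertools import combinations
from math import dist

def average_linkage(a, b):
    """Mean Euclidean distance over all cross-cluster point pairs."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]  # every sample starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical two-dimensional samples.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, 3)
print(result)
```

This brute-force version recomputes all pairwise cluster distances at every step, which is acceptable only for small data sets such as the one discussed here.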

Random forest
The random forest algorithm, derived from the ensemble learning approach, is composed of multiple decision trees. It is an extension of classification and regression trees. These trees model the response variable through recursive partitioning and jointly predict the final result [17]. The random forest algorithm is commonly employed in data classification and regression [18]. At present, there are three mainstream decision tree algorithms: ID3, C4.5 and CART. In the present paper, the most widely used of them, CART, is selected to build the random forest model. The main functions of this algorithm are described below.

Suppose that there is a training data set D with K classes in total. The Gini index of set D can be expressed as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^{2}$$

where $C_k$ represents the subset of samples in D that belong to class k, and $|C_k|$ and $|D|$ represent the sizes of $C_k$ and D respectively.

In the CART algorithm, assume that feature A is used to split the data. If feature A is a discrete feature, set D can be divided into subsets $D_1$ and $D_2$ according to one possible value a of A, as shown below.

$$D_1 = \{(x, y) \in D \mid A(x) = a\}, \qquad D_2 = D - D_1$$
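As a minimal sketch of these quantities, the Gini index of a label set and the size-weighted Gini index of a binary split on one feature value (the standard CART split criterion) can be computed as follows. The function names, the feature name `pulse` and the four records are hypothetical and serve only as an illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index of a label set: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(rows, labels, feature, value):
    """Size-weighted Gini index of the split D1 = {A == a}, D2 = D - D1."""
    d1 = [y for x, y in zip(rows, labels) if x[feature] == value]
    d2 = [y for x, y in zip(rows, labels) if x[feature] != value]
    total = len(labels)
    # Empty subsets contribute nothing to the weighted sum.
    return sum(len(d) / total * gini(d) for d in (d1, d2) if d)

# Hypothetical records with one discrete feature and a class label each.
rows = [{"pulse": "wiry"}, {"pulse": "wiry"}, {"pulse": "thin"}, {"pulse": "thin"}]
labels = ["liver", "liver", "liver", "spleen"]

print(gini(labels))
print(gini_split(rows, labels, "pulse", "wiry"))
```

A split is useful when its weighted Gini index is lower than the Gini index of the unsplit set, as is the case in this toy example.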
Consequently, the Gini index of set D under the condition of known feature A, Gini(D, A), can be obtained by combining the above functions:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

The Gini index is theoretically similar to entropy.

The basic information mainly consists of the ID, name, clinic time, age and gender of patients.
Since the clinic time is not taken as a factor in the screening criteria during the data screening stage, statistics on it may deviate from the actual situation. The ID and name of patients have no impact on the diagnosis and treatment process. Consequently, this section focuses on the age and gender of patients. Considering that the age and gender data have relatively few categories, we choose frequency analysis for the data processing. According to the box-plot of the age distribution of patients (Figure 2), the average age of patients is 47 and the standard deviation of age is 11. Moreover, the maximum and minimum ages are 79 and 14 respectively.

Figure 2. Box-plot of the age distribution of patients.

it is a process of collecting medical history information for doctors of TCM [19]. "Inspection" refers to the observation of patients' external manifestations, such as tongue appearance, expression, reaction and complexion. "Auscultation and olfaction" is the way in which doctors diagnose diseases by listening and smelling. "Interrogation" is a diagnostic method by which doctors find out the occurrence, development and treatment process of a disease, as well as the past health history, by talking with patients. Furthermore, "palpation" particularly refers to the method that doctors use index

Based on the smallest unit of classification, the method of association rules is applied in this section. Considering that the basic information is also a part of TCM interrogation and may affect the diagnosis and treatment process, the basic information is included in the four diagnostic parts for discussion as well. Since some of the smallest classification units contain too many null values, we use two methods to analyze the association rules for combinations of the smallest units (the combination items are given in brackets), so as to minimize the impact of the null values on the research results. The results are listed in Table 3 and Table 4.

It can be concluded from the above tables that most of the results are mainly related to gender and age, while there is no significant association among the four diagnoses. Judged against actual clinical experience, these results have no remarkable guiding significance for clinical practice. Nevertheless, two research directions can be identified from these results. On the one hand, this research can be explored more deeply by expanding the sample size and data can be used for the epidemiological study of TCM on condition that the sample size is large enough.

Treatment based on syndrome differentiation
Originating from philosophical culture, the treatment based on syndrome differentiation is the core of TCM theory and has gradually developed into a complex theoretical framework, including the yin and yang theory, the five elements, the eight principles, the Qi and blood theory, the organs theory and the meridian system [20].

the above four significant syndrome differentiation points. It is worth mentioning that the organs syndrome differentiation includes heart, liver, spleen, lung, kidney, gall bladder, stomach, small intestine, large intestine, bladder and the triple burner; the syndrome differentiation of asthenia and sthenia consists of asthenia syndrome and sthenia syndrome; the syndrome differentiation of cold and heat is composed of cold syndrome and heat syndrome; and the pathogenic factors include phlegm, fire, blood stasis and asthenia. These 19 syndrome differentiation factors constitute the section on treatment based on syndrome differentiation in the insomnia sample data studied in this paper. To ensure the objectivity of each syndrome differentiation factor, the three TCM doctors are required to gather at least two kinds of medical record information during the classification and coding of the medical record data in order to determine one syndrome differentiation factor. For instance, the medical information "wiry pulse" and "irritability" implies that the organ syndrome differentiation factor is liver; "thin pulse" and "tiredness" imply that the factor of asthenia and sthenia syndrome differentiation is asthenia syndrome; "red tongue" combined with "tidal fever" and "rapid pulse" indicates that the factor of cold and heat syndrome differentiation is heat syndrome; and "slippery pulse" combined with "yellow tongue" and "greasy tongue coating" means that the pathogenic factor is phlegm.

Although each syndrome differentiation factor in each medical record is relatively independent, there is a strong correlation among the factors. Therefore, it is reasonable to select association rules for the analysis. Through a process of trial and error, the confidence threshold is finally set to 0.7, and the results are summarized in Table 5.

As can be seen from the above table, besides the associations that can be obtained from the basic theories, such as those between fire and sthenia syndrome and between fire and heat syndrome, there are also newly found associations, for example: the complex syndrome of heart, liver, spleen, asthenia and sthenia → the heat syndrome; and the fire stasis syndrome → the heart and liver. The following conclusions can be drawn from analyzing the treatment based on syndrome differentiation with association rules. On the one hand, the results can reveal the syndrome differentiation thinking of TCM doctors. On the other hand, once the contents of treatment based on syndrome differentiation have been classified with the above methods, the results reflect to a certain extent the priority direction of syndrome differentiation for insomnia, and thus have guiding significance for clinical practice.

In further studies, more research methods can be adopted to verify the dominant diseases of TCM and explore new syndrome differentiation rules.

Table 6 presents the correspondence between the processed data codes and herbs.

To reduce the amount of calculation and increase the code execution efficiency, all the herbs are replaced with codes, and the codes are then entered into the database.
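The replacement of herbs with codes can be sketched as a simple label encoding. The herb names below are placeholders and the integer codes are generated on the fly; they are assumptions for illustration and do not reproduce the codes of Table 6.

```python
def build_codebook(prescriptions):
    """Assign consecutive integer codes to herbs in order of first appearance."""
    codebook = {}
    for herbs in prescriptions:
        for herb in herbs:
            # setdefault keeps the existing code if the herb was seen before.
            codebook.setdefault(herb, len(codebook) + 1)
    return codebook

def encode(prescriptions, codebook):
    """Replace every herb name with its integer code."""
    return [[codebook[herb] for herb in herbs] for herbs in prescriptions]

# Placeholder prescriptions; real herb names would be used in practice.
prescriptions = [["herb_a", "herb_b"], ["herb_b", "herb_c"]]
codebook = build_codebook(prescriptions)
encoded = encode(prescriptions, codebook)
print(codebook)
print(encoded)
```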

The hierarchical clustering algorithm is well suited to analyzing small sample data sets in Euclidean space and can yield satisfactory results. Since a main prescription is composed of a wide variety of herbs, the hierarchical clustering algorithm is applied to explore the potential classification rules in the TCM data samples. The results of the analysis of the main prescription using hierarchical clustering analysis are shown in Figure 5.

Figure 5. Results of the analysis of the main prescription using hierarchical clustering analysis.

The main prescriptions corresponding to the serial numbers are presented in Table 7, and the herb combinations repeated across all main prescriptions are shown in Table 8.

The above results indicate that the desired outcome can be achieved by adopting the hierarchical clustering algorithm to analyze the main prescriptions. The rapid acquisition of the main prescription of TCM not only benefits the study of the combination rules of TCM, but also lays a solid foundation for the overall study of the diagnosis and treatment of the dominant diseases of TCM. In order to facilitate the further discussion on the overall diagnosis and treatment idea, we code the

Diagnosis and treatment idea
In the discussion above, the four parts of TCM diagnosis and treatment ideas were studied in turn, so as to reveal the internal relationships and the relevant research methods of each part. This section discusses the four parts as a whole. In accordance with the research process designed in the previous section (Figure 1), the random forest algorithm is adopted to establish the model. The data sets collected from the four diagnoses, the treatment based on syndrome differentiation and the main prescriptions of TCM are put into the model for cross-validation, and the corresponding accuracy is obtained. In addition, in order to explore the internal relationships of TCM diagnosis and treatment ideas more deeply, this section is divided into two processes for further discussion. These two processes are illustrated in Figure 6.

Figure 6. Flowchart of the diagnosis and treatment ideas of TCM.
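The cross-validation step can be illustrated by a sketch of how samples are partitioned into folds. The number of folds is not stated in this section, so `k = 5` is an assumption, as is the use of contiguous, unshuffled folds; only the sample count of 654 comes from the paper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 654 cases as in the study; k = 5 is an assumed value.
folds = list(k_fold_indices(654, 5))
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} training samples, {len(test)} test samples")
```

Each sample is used for testing exactly once, and the reported accuracy would be the average over the folds. In practice the samples are usually shuffled before splitting.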

It is worth noting that the five zang-organs, the six fu-organs and the pathogenic factors each contain several syndrome differentiation factors, which occur in arbitrary combinations in the medical record sample data. In addition, in actual outpatient service, the prescriptions made by doctors for patients commonly include at least one main prescription. Therefore, in order to facilitate data processing, the five zang-organs combination, six fu-organs combination, pathogenic factors combination and main prescription combination are coded and loaded into the database. To present the accuracy more intuitively, confusion matrices are used in this paper. The confusion matrix results are shown in Figure 7.

Figure 7. Confusion matrix.

As summarized in Table 9, the random forest model achieves high accuracy in predicting the treatment based on syndrome differentiation from the four diagnostic information. High accuracy is likewise achieved in predicting the main prescription from the combination of the four diagnoses and the treatment based on syndrome differentiation.
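The relationship between a confusion matrix and the accuracy can be sketched as follows. The class labels `P1` to `P3` and the true and predicted values are hypothetical; they do not correspond to the prescriptions or results of this study.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Share of samples on the diagonal, i.e. correctly predicted."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical coded main prescriptions and model predictions.
labels = ["P1", "P2", "P3"]
y_true = ["P1", "P1", "P2", "P2", "P3", "P3", "P3", "P1"]
y_pred = ["P1", "P2", "P2", "P2", "P3", "P3", "P1", "P1"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy(matrix))
```

Off-diagonal entries show which classes are confused with each other, which the single accuracy value cannot reveal.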

In process 2 of this section, the random forest model is applied to extract the feature importance values of all features in the data sets. Since the values obtained from the random forest model are too small to be studied conveniently, they are expanded in the form of

As illustrated in Figure 8, the most significant parameter affecting the judgment results is the syndrome differentiation of the five zang-organs, followed by sleep status, pulse conditions, the syndrome differentiation of asthenia and sthenia and the syndrome differentiation of the six fu-organs.

It can be concluded from the above results that the random forest model can be applied to quickly and accurately verify the correctness of TCM diagnosis and treatment ideas. It is worth mentioning that only one algorithm model is used in this paper, so the study lacks diversity of methods. In further research, a wider variety of algorithm models can be introduced for comparison, so as to further investigate the feasibility of machine learning methods in the research of TCM diagnosis and treatment.

The results indicate that machine learning methods can be effectively applied to deeply mine and analyze the medical record data of the dominant diseases of TCM. The focus of this study is the diagnosis and treatment process of the TCM dominant diseases, which includes acquiring information on the patient's condition through the four diagnostic methods and flexibly applying the syndrome differentiation methods to develop the treatment plan and select the main prescription. The research strategy established in this paper can efficiently filter out unessential diagnosis and treatment information, thus helping TCM doctors to quickly obtain valuable information and crucial rules from a substantial number of medical records.

Furthermore, since the research process, the data collection and the data analysis methods designed in this paper are highly standardized, the research strategy established in this paper can be applied to

Declarations

Ethics approval and consent to participate
Informed consent and a statement on ethics approval were waived because of the retrospective nature of the study and because the analysis used anonymous clinical data.

Consent for publication
Not applicable

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

The codes in Figure 5 correspond to the codes in Table 6. The line represents the main prescription, and the small circle represents the corresponding herb.

Frequency of Category 1 is 573, frequency of Category 2 is 312, frequency of Category 3 is 577.