The Classified Prediction of Coronary Heart Disease Based on Patient Similarity Analysis

Objective: Classifying coronary heart disease (CHD) is important for physicians' clinical decision support. Customizing personalized predictive models for patients requires selecting, from an existing medical database, the patient group that most closely resembles the index patient. In this study, we introduce a new concept: using patient similarity to classify patients with CHD. Materials and methods: We built a structured representation of CHD patients, obtained a multidimensional attribute distance matrix by calculating the multidimensional attribute distance between patient pairs, and used machine learning (ML) models to predict the similarity between patient pairs, so that clinical outcomes for index patients are predicted from matched similar patients. Results: The new measure shows marked improvements over traditional classification measures. LightGBM was the top-performing ML model, and the best model achieved 88.52% accuracy. Conclusion: Medical applications of ML supported by similarity analytics represent a promising way to reduce physician workload and move toward the goal of “precision medicine”.


Introduction
Coronary heart disease (CHD) is a major chronic disease that has emerged as an increasingly prevalent worldwide health concern associated with and physiological [1]. There is a great need to improve the understanding of CHD and its complex factors in order to facilitate prevention, early detection, and improvements in clinical management. A more precise characterization of CHD patient populations can enhance our understanding of CHD pathophysiology. Current clinical definitions classify CHD into 4 subtypes: stable angina pectoris, unstable angina pectoris, ischemic cardiomyopathy, and myocardial infarction. In recent years, many scholars have applied artificial intelligence techniques to CHD with encouraging results [2], so that new patients with CHD can be evaluated and predicted in advance.
To make individualized classification predictions for patients with CHD, a detailed analysis of patient characteristics is required. Early studies on CHD prediction mainly applied machine learning or deep learning methods to patients' own characteristics using probabilistic statistics: from clinical manifestations such as disease symptoms, a statistical model relating the probability of disease occurrence to the frequency of those manifestations is established, and the model is then trained to perform the diagnosis. However, with the development of precision medicine, it has become clear that this prediction method is inconsistent with the physician's actual diagnostic process. Physicians use analogical reasoning in disease diagnosis: they diagnose a new patient from the clinical presentation, laboratory test data, and other information, by analogy with patients who have already been diagnosed. Therefore, to build an effective clinical decision support system, we need to quantitatively measure the distance between two patients, which is the core operation behind patient similarity [3].
Patient similarity learning is a fundamental and important task in the healthcare domain that can improve clinical decision making without additional effort from physicians [4][5][6]. The goal of patient similarity is to learn a meaningful distance over selected clinical concepts (e.g., diagnosis, symptoms, examination tests, family history, past history, drugs, surgery, genes, etc.) that serve as patient characteristics in a specific medical setting. Quantitatively analyzing the distance between concepts makes it possible to dynamically measure the distance between patients and to filter out the groups of patients that are similar to the index patient [7]. In addition, patient similarity analysis can leverage the wealth of information in the electronic medical record (EMR) for patient-specific prediction tasks [8]. Appropriate similarity distances enable a variety of downstream applications in medicine, such as personalized medicine, medical diagnostics, trajectory analysis, and cohort studies [9]. Using patient similarity helps to identify the exact cohort for an index CHD patient on which to train a personalized model. This approach has the potential to provide customized predictions compared to traditional models trained on all CHD patients.
Previous prediction methods ignored the analogical reasoning of physicians and focused too heavily on data without considering empirical knowledge in the prediction process, so they need further improvement.
To address the above issues, we use the EMRs of CHD patients to form a method for representing CHD patients, propose a method for measuring patient similarity distance based on patient attribute information, and examine the performance of the algorithm through systematic experiments.
The main contributions of this paper are as follows:
• We collected and labeled 4910 EMRs with the first diagnosis of "coronary atherosclerotic heart disease" from a tertiary hospital in Xinjiang, consisting of discharge abstracts with comprehensive descriptions of the patients. Based on the relevant literature and clinicians' guidance, we identified 37 patient-related attributes and assigned discrete labels to them;
• We obtained a structured patient representation by vectorizing the labeled data with the CatBoost-Encoder algorithm;
• We represent the distance between patient pairs by a distance metric over multidimensional attributes, analyze the similarity of patient pairs quickly and effectively with the LightGBM model, and verify the effectiveness of the algorithm through experiments.

Related Work
In the medical field, many advances have been made in predictive models for outcome prediction, but these innovations are aimed at the general patient population and are not sufficiently tunable for individual patients. One developing idea is individualized predictive analysis based on patient similarity [10]. "Precision medicine" holds that the clinical outcome of individual patients is determined by their genetic, genomic, physiological and clinical characteristics [11]. Therefore, to deliver the "right treatment", it is important to use all available and relevant data to determine all possible diagnoses and their disease trajectories.
Patient similarity is mainly calculated with traditional measures such as Manhattan distance and Euclidean distance, often with weighting to optimize the calculation. Jia et al. [12] proposed a diagnosis prediction framework based on patient similarity, in which they defined patient similarity as the similarity between two sets of diagnoses. Suo et al. [13] used a convolutional neural network (CNN) to capture locally important information in the EHR, and then fed the learned representations into a modified triplet-loss neural network for training. Chan et al. [14] proposed simSVM, a patient similarity algorithm that takes 14 similarity measures as input and outputs the predicted similarity or dissimilarity; the trained model has high accuracy on the test data.
With the advent of medical informatization, EMRs have become the main vehicle for recording patients' medical procedures. Applying machine learning and data mining algorithms to large numbers of EMRs has therefore become the mainstream trend in CHD prediction, since data mining algorithms allow CHD to be predicted quickly and efficiently. Du et al. [15] constructed a chronic disease prediction model for cardiovascular and cerebrovascular diseases using the random forest algorithm, and developed a chronic disease prediction system. Shao et al. [16] proposed a new hybrid modeling scheme for the prediction of heart disease. They used logistic regression, multivariate adaptive regression splines, and rough sets for feature dimensionality reduction, and a neural network as the prediction model, which obtained effective results in their classification tasks. Compared with traditional prediction methods, data mining algorithms take advantage of large amounts of data, which makes them better suited to real-world disease occurrence and better able to reflect the clinical manifestations of patients.
In precision medicine research, patient representation has become a hotspot, and structuring free-text electronic medical records has long been the focus of many researchers [17,18]. Good results have been achieved in the automatic extraction of information from electronic medical records. Sun et al. [19] summarized the current state of research on information extraction from free-text EMRs, and their review of automated structuring methods also raises new questions: existing methods tend to target only certain types of closed datasets. Even within the specific area of EMR information extraction, the medical record data collected from different hospitals vary widely, because there is no unified writing norm for medical record text across regions and hospitals. Consequently, models that have been trained to high accuracy on one dataset often fail to achieve satisfactory results when tested on other datasets. Therefore, we chose to manually annotate the dataset to ensure the accuracy of subsequent experiments.

Material and methods
The EMR-based similarity platform for patients with CHD can be divided into several parts: data acquisition, patient representation, patient-to-attribute similarity matrix generation, LightGBM-based similarity calculation, and CHD subtype prediction. A schematic view of the method is shown in Figure 1.

EMR Access
In healthcare systems, EMRs provide the basic data support. EMRs include admission records, medical procedure records, surgery records, discharge summaries, etc., which are complex and hard to handle, so we select the discharge summaries as the experimental data in this paper. The reason for this choice is that the discharge summary of a patient with coronary artery disease contains more detailed patient information. The discharge summary consists of many parts: admission status, treatment history, important examination results, discharge status, discharge medical advice, follow-up recommendations, etc., as shown in Figure 2.

Patient representation
The literature [21] points out that the most significant challenge in patient similarity analysis is currently the problem of patient representation, owing to data heterogeneity. We represent each patient as 37 concept vectors and discretize them by degree, but the discrete representation becomes ambiguous after distance calculation. For example, the chest pain range feature pairs (0.2, 0.4) and (0.4, 0.6) yield the same distance, which cannot reflect the difference between them and is therefore not fully captured during model learning. We therefore introduced CatBoost Encode to encode the features in this study.
The biggest advantage of this encoding method is that it handles categorical variables well. Moreover, CatBoost Encode supports missing values while retaining high accuracy, which makes it suitable for use in the medical field.
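To illustrate the idea behind this kind of encoding, the following is a minimal sketch of an ordered target statistic, the mechanism generally associated with CatBoost-style encoders. It is not the implementation used in this study: the function name `ordered_target_encode`, the smoothing parameter `prior_weight`, the toy column, and the binary target are all assumptions made for illustration.

```python
from collections import defaultdict

def ordered_target_encode(categories, targets, prior_weight=1.0):
    """CatBoost-style ordered target statistic for one categorical column.

    Each row is encoded using only the targets of earlier rows that share
    its category, smoothed toward the global target mean (the prior).
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)   # running target sum per category
    counts = defaultdict(int)   # running row count per category
    encoded = []
    for cat, y in zip(categories, targets):
        encoded.append((sums[cat] + prior * prior_weight) / (counts[cat] + prior_weight))
        sums[cat] += y          # update only after encoding the current row
        counts[cat] += 1
    return encoded

# Hypothetical toy column: a discretized chest-pain range and a binary target.
chest_pain = ["0.2", "0.4", "0.2", "0.6", "0.4", "0.2"]
target = [1, 0, 1, 0, 1, 0]
codes = ordered_target_encode(chest_pain, target)
```

Because each row only sees earlier rows, two patients in the same discrete bin can receive different numeric codes, which is one way the encoded values can carry more information than the raw discrete levels.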

Patient Attribute Similarity Calculation
After obtaining the patient representations, we computed a distance metric on the patient pairs. We chose a ratio of 1:100 for constructing the patient group for each index patient. That is, an index patient is randomly paired with 100 matched patients for pairwise distance measurement, yielding a 100 × 37 patient-pair attribute similarity matrix D, in which the first 36 columns are feature distances and the last column is the label distance. We designed different calculations for the feature distances and for the label distance.

Feature distance calculation
From patient representation the patient's characteristics are expressed as ,

Tag distance calculation
We use label consistency as the principle for label distance calculation: the label distance is 1 when the diagnoses of patient a and patient b are consistent and 0 when they are inconsistent, which gives the label distance of patient a and patient b in equation (2):
$$\mathrm{Tag}(T_a, T_b) = \begin{cases} 1, & T_a = T_b \\ 0, & T_a \neq T_b \end{cases} \tag{2}$$
where $T_a$ and $T_b$ denote the labels of patient a and patient b, respectively.
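The label consistency rule can be sketched in a few lines. The function name `tag_distance` and the example pairs are hypothetical; the label strings are the four CHD subtypes named in the Introduction.

```python
def tag_distance(label_a, label_b):
    """Return 1 when both patients share the same diagnosis label, else 0."""
    return 1 if label_a == label_b else 0

# Hypothetical patient pairs, labeled with CHD subtypes.
example_pairs = [
    ("Stable angina pectoris", "Stable angina pectoris"),
    ("Unstable angina pectoris", "Myocardial infarction"),
]
tags = [tag_distance(a, b) for a, b in example_pairs]
```

Here the first pair is consistent and the second is not, so the two pairs would be labeled similar and dissimilar, respectively.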

LightGBM-based patient similarity prediction
After patient representation and patient-pair similarity calculation, the attribute similarity matrix of the patient pairs is obtained and then fed into the prediction model. To compare prediction performance, seven supervised machine learning methods are used in this section to predict patient similarity.
LightGBM (Light Gradient Boosting Machine) is a supervised learning algorithm that implements the Gradient Boosting Decision Tree (GBDT) framework. Its biggest advantage is that it solves the problems GBDT encounters with massive data, so that GBDT can be used better and faster in industrial practice. LightGBM therefore offers faster training, lower memory consumption, better accuracy, and support for parallel learning, which makes it very suitable for large-scale medical data mining. Its main algorithmic principles include the histogram algorithm, a leaf-wise growth strategy with a depth limit, and histogram optimization [22].
Based on the above advantages, using the LightGBM algorithm for the patient similarity prediction task can achieve better results in less time and with fewer computational resources. We also ran experiments comparing different machine learning models on patient similarity prediction; detailed data can be found in the experimental results section.

Experiment Data
For the problem of predicting similarity in patients with CHD, the best data are those that provide a complete description of the patient's condition. We collected discharge summaries from the cardiology department of a hospital in Xinjiang, all from patients whose first diagnosis was coronary atherosclerotic heart disease. The second diagnosis is used as the label for the data, and the data distribution is shown in Table 1.

Experimental procedure
I. Data pre-processing
We constructed an EMR-based attribute representation of CHD patients, including categories, attribute names and their corresponding attribute values. Attributes with more than 30% missing values were removed, and the resulting categories and attributes of patients with CHD are shown in Table 2.
The processing steps for the EMRs are as follows:
1) Use the annotation tool to annotate the attributes in the discharge summary.
2) Extract the values of each attribute of the medical record. There are three main forms of values in the EMR: binary, numeric and textual features. For binary data, we extract binary features from qualitative text descriptions such as "suffering from XXX disease" and "denying XXX disease" in the medical record. For numeric data, we extract the text that quantitatively describes the feature in the medical record; for example, if the blood pressure is xx mmHg, the value is extracted. For text features, we extract the attribute value descriptions corresponding to the criteria given by the doctor.
3) Organize the different attributes of each patient into a table according to Table 2.

II. Model construction and parameter setting
We require a pairwise feature distance metric for patient pairs. We selected 400 index patients from each diagnostic group for patient pair construction, using a 1:100 ratio, with matching patients randomly selected from the total patient pool. This generates a total patient-pair similarity matrix of 160,000 × 37. All random number seeds are set to 2021.
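The pair construction and the resulting row count (4 diagnostic groups × 400 index patients × 100 matches = 160,000) can be sketched as follows. This is an illustrative sketch rather than the code used in the study: `build_pairs`, the synthetic patient ids, the pool size of 500 per group, and the rule that an index patient is never matched with itself are all assumptions.

```python
import random

SEED = 2021              # random seed reported in the paper
INDEX_PER_GROUP = 400    # index patients drawn from each diagnostic group
MATCHES_PER_INDEX = 100  # 1:100 pairing ratio

def build_pairs(patients_by_group, seed=SEED):
    """Return (index_id, matched_id) pairs for every sampled index patient."""
    rng = random.Random(seed)
    pool = [pid for ids in patients_by_group.values() for pid in ids]
    pairs = []
    for ids in patients_by_group.values():
        for index_id in rng.sample(ids, INDEX_PER_GROUP):
            # Assumption: an index patient is never matched with itself.
            candidates = [pid for pid in pool if pid != index_id]
            for matched_id in rng.sample(candidates, MATCHES_PER_INDEX):
                pairs.append((index_id, matched_id))
    return pairs

# Synthetic pool: 4 diagnostic groups with 500 hypothetical patient ids each.
groups = {g: [f"g{g}_p{i}" for i in range(500)] for g in range(4)}
pairs = build_pairs(groups)
n_pairs = len(pairs)  # one matrix row per pair
```

Each pair would then be expanded into one row of 36 feature distances plus one label distance, giving the 160,000 × 37 matrix.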
According to label consistency, we classify patient similarity into similar (1) and dissimilar (0). The patient similarity approach therefore converts the multi-class classification problem into a binary classification problem, which is then predicted with LightGBM.

III. Evaluation Metrics
Accuracy, P (Precision), R (Recall) and F1 (F1-Measure) are used as evaluation metrics for the similarity calculation models. P denotes the proportion of the predicted similar patient pairs that are actually similar, R denotes the ratio of correctly predicted similar patient pairs to the actual similar patient pairs, and the F1 value is the harmonic mean of precision and recall. They are calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
where TP (True Positive) denotes a similar patient pair correctly calculated as similar, TN (True Negative) denotes a dissimilar patient pair correctly calculated as dissimilar, FP (False Positive) denotes a dissimilar patient pair wrongly calculated as similar, and FN (False Negative) denotes a similar patient pair wrongly calculated as dissimilar.
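As a small worked example, the four metrics can be computed directly from the confusion counts. The function name `classification_metrics` and the counts used below are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0  # both precision and recall are zero
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 200 patient pairs.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these counts the accuracy is 170/200 = 0.85, precision is 80/90, recall is 80/100, and F1 is 16/19.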

Patient representation learning impact analysis
In Section 3.3, we proposed the use of CatBoost Encode for vectorized encoding of CHD patients. We evaluate its advantages through comparison experiments against encoding methods proposed by other researchers and against using no encoding.
As control groups we use commonly used categorical feature encodings, namely one-hot encoding, label encoding, and Skip-Gram encoding, as well as no encoding. All tests use LightGBM as the prediction model to compare accuracy. The experimental results are shown in Table 3.
As the table shows, CatBoost Encode far outperforms the other four methods in terms of accuracy, which indicates that CatBoost Encode is well suited for encoding our data.

Distance calculation method impact analysis
Based on the above experimental parameters, we test the effect of the attribute similarity calculation method in the LightGBM model. Following the literature [6], we use the Jaccard distance with one-hot encoding and the Euclidean distance as comparison experiments. Because both distances require the features to be grouped, we group the features according to the categories in Table 2 and use each group as the base vector for the distance metric.
For example, under the Euclidean distance, the similarity of the laboratory tests of patient a and patient b is computed from the Euclidean distance between their laboratory test feature vectors. Under the Jaccard distance, the similarity of risk factors for patient a and patient b is computed from the sets A and B, where A denotes the set of one-hot coded risk factor characteristics of patient a and B denotes the set of one-hot coded risk factor characteristics of patient b.
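The two baseline measures can be sketched with their standard textbook definitions. Whether the comparison experiments used the Jaccard similarity or its complement (the Jaccard distance) is not recoverable from the text, so both are shown; the function names, the feature values, and the risk factor names are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Standard Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets of active one-hot indicators."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Complement of the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)

# Hypothetical laboratory test vectors for patient a and patient b.
lab_a, lab_b = [3.0, 4.0], [0.0, 0.0]
lab_dist = euclidean_distance(lab_a, lab_b)

# Hypothetical risk factor sets (the active one-hot indicators per patient).
risk_a = {"hypertension", "diabetes", "smoking"}
risk_b = {"hypertension", "smoking", "obesity"}
risk_sim = jaccard_similarity(risk_a, risk_b)
```

Both measures collapse a whole feature group into a single number per patient pair, whereas the proposed method keeps one distance per attribute.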
The experimental results are shown in Table 4. The comparison shows that the proposed distance calculation method is better than the distance measures proposed by previous scholars. Our analysis suggests that traditional distance metrics such as the Euclidean distance effectively perform feature fusion: they reduce the features to a small number of dimensions, so the model loses a lot of information in the learning process, leading to a decrease in accuracy.

Predictive Model Analysis
Based on the evaluation metrics, and in order to verify the effectiveness of the LightGBM model in the patient similarity prediction task, we used the calculated pairwise patient similarity matrices as data and, with 10-fold cross-validation, compared its performance against six other models: K-nearest neighbors (KNN), logistic regression (LR), support vector machine (SVM), random forest (RF), CatBoost, and XGBoost. The experimental results of the model comparison are shown in Table 5 and Figure 3. Considering the large scale of medical data, we also compared the running time of the models; the comparison is shown in Table 6.

According to the running time comparison, the LightGBM model has the shortest running time at 15.9 s, KNN is second, and RF is the slowest. In terms of running time, LightGBM is about twice as fast as the other models, which shows that it can deliver accurate patient similarity predictions while returning results more quickly.

Discussion
In this paper, we discussed the use of patient similarity in clinical decision support for CHD. The experiments showed that patient similarity analysis performs well in the classification prediction of CHD. The patient similarity process is modeled on the diagnostic process of clinicians, and making this process intelligent is well aligned with current needs. To a large extent, patient similarity analysis enables personalized treatment for individual patients, eases the heavy workload involved in precision medicine, and makes use of the valuable data in medical databases. However, the method is still limited to the classification of CHD; in future work we can expand the dataset to improve the generalizability of the algorithm and extend its application to other medical fields.

Conclusion
Patient similarity prediction is a fundamental problem in healthcare with a wide range of applications, such as disease prediction, disease evolution and treatment selection. In our study, we apply patient similarity to the CHD classification task: we develop annotation criteria for CHD patients based on EMR discharge summaries and use CatBoost Encode for structured representation. We then obtain patient similarity matrices from multidimensional feature distances, and the LightGBM model learns to predict patient similarity. Experimental results show that our proposed method achieves better results than the baseline methods in the classification and prediction of CHD.

Declarations

Ethics approval and consent to participate
This study has been approved by the Ethics Committee of the First Affiliated Hospital of Xinjiang Medical University. All data were anonymized before use. All methods were carried out in accordance with relevant guidelines and regulations (such as the Declaration of Helsinki).

Consent for publication
Not applicable.

Availability of data and materials
The datasets analyzed during the current study are not publicly available because they contain information that is sensitive to the institution, but they are available from the corresponding author on reasonable request.