Construction of Patient-level Prediction Model for In-hospital Mortality in Congenital Heart Disease Surgery: Regression and Machine Learning Analysis

Background: Prediction of in-hospital death is important for patient management as well as for risk-adjusted evaluation of congenital heart disease (CHD) surgery performance. Using a large database containing 12 years of CHD surgery records, we aimed to establish patient-level in-hospital mortality prediction models.
Methods: Patients with congenital heart disease who underwent surgery at Shanghai Children's Medical Center from January 1, 2006, to December 31, 2017 were included in the study. Each procedure was assigned a complexity score based on a modified Aristotle Score. In-hospital mortality was estimated for each surgical procedure. In-hospital death prediction models incorporating the procedure complexity score and patient-level risk factors were constructed using logistic regression analysis and machine learning methods, and their predictive value was tested.
Results: Among 24,684 patients who underwent CHD operations, there were 595 (2.4%) in-hospital deaths. The logistic regression model had an AUC of 0.864 (95% CI: 0.833-0.895, P < 0.001), with a sensitivity of 0.831 and a specificity of 0.786. The gradient boosting model had an AUC of 0.884 (95% CI: 0.858-0.909, P < 0.001), with a sensitivity and specificity of 0.838 and 0.785, respectively. Feature importance analysis showed that the variable with the greatest impact on the model's predictive performance was the operation score (average score 95.6), followed by Age (days) (95.5), Ultrasound MV (54.6), Ultrasound atrial level (54.5), Palliative operation (45.8), Operation history (38.8), Ultrasound TV2 (32.1), Urgent operation (30.8), Ultrasound ventricular level (30.5), and SpO2 ≤ 90% (30.3).
Conclusions: Models constructed using machine learning and logistic regression that incorporated the procedure complexity score and pre-operative patient-level factors had high accuracy in predicting in-hospital mortality. Operation score and age had the greatest impact on model prediction performance.

quality of life into the adult years [1]. However, there remains significant variability in outcomes according to procedure, patient characteristics, and other associated factors such as the treating hospital.
In-hospital death is an important outcome of CHD surgery. Precise in-hospital death prediction is of great importance for evaluating the quality of hospital surgical treatment, supporting clinical decisions on whether to perform a given procedure, and improving the hospital-patient relationship [6]. Because the complexity of the procedure, as well as patient-level risk factors, are closely related to in-hospital death after CHD surgery, case-mix adjustment is fundamental to any systematic attempt to measure outcomes, compare performance, and sustain a program of continual quality improvement. Two major methods of risk adjustment for CHD surgery, Risk Adjustment in Congenital Heart Surgery-1 (RACHS-1) [7] and the Aristotle Complexity Score [8], have been developed based on projections of risk or complexity that were predominantly subjectively derived. An objective, empirically based index of the statistically estimated risk of in-hospital mortality by procedure, the Society of Thoracic Surgeons-European Association for Cardio-Thoracic Surgery (STAT) Congenital Heart Surgery Mortality score, has also been developed [9]. The feasibility of these risk adjustment scores has been tested in a few studies [10][11][12]. However, studies of mortality risk prediction in the Chinese population are still lacking. In addition, these previously developed systems were mainly aimed at evaluating surgical outcomes across hospitals.
Also, because of limited information, these previous scoring systems concentrated on surgical procedure categories and did not include sufficient individual patient-level risk factors. Therefore, their predictions may lack accuracy for individual patients.
Shanghai Children's Medical Center is a major tertiary hospital with advanced facilities and teams for CHD treatment. A procedure scoring system has been adopted there to stratify the in-hospital death risk of surgical procedures. The aim of the current study was to establish and test a prediction model using this procedure scoring system together with individual patient-level risk factors.

Study design and population
Patients diagnosed with congenital heart disease who underwent surgery at Shanghai Children's Medical Center between January 1, 2006, and December 31, 2017 were included in the current study. For patients with multiple surgery records within 30 days, only the latest record was used, and the previous surgery records were considered operation history. The exclusion criteria were general thoracic surgery not involving the heart, uncertain in-hospital survival records, and surgical procedures performed in fewer than 3 patients. Demographic characteristics and pre-operation test data were extracted from a clinical database. In-hospital mortality was estimated for each surgical procedure. The predictive value of a procedure risk stratification and patient-level risk factors for in-hospital death was assessed with models constructed using logistic regression analysis and machine learning methods. The study was approved by the Ethical Committee of Shanghai Children's Medical Center. Informed consent was waived because the study involved only a retrospective review of previous clinical data.
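The 30-day rule for repeated operations can be sketched as follows. This is a minimal illustration under our own reading of the rule, not the study's code: a surgery is treated as superseded (and becomes operation history) when the same patient has a later surgery within 30 days, and any earlier surgery flags the retained record as having operation history. The record layout `(patient_id, surgery_date)` and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import date


def select_index_surgeries(records, window_days=30):
    """Keep the latest surgery of each <=30-day cluster per patient (assumed rule).

    records: iterable of (patient_id, surgery_date) tuples (hypothetical layout).
    Returns a list of (patient_id, surgery_date, has_operation_history).
    """
    by_patient = defaultdict(list)
    for patient_id, surgery_date in records:
        by_patient[patient_id].append(surgery_date)

    kept = []
    for patient_id, dates in by_patient.items():
        dates.sort()
        for i, current in enumerate(dates):
            later = dates[i + 1] if i + 1 < len(dates) else None
            if later is not None and (later - current).days <= window_days:
                continue  # superseded by a later surgery within the window
            # Any earlier surgery of this patient counts as operation history.
            kept.append((patient_id, current, i > 0))
    return kept


records = [("A", date(2010, 1, 1)), ("A", date(2010, 1, 20)), ("B", date(2011, 5, 5))]
print(select_index_surgeries(records))
```

Other readings of the rule (for example, treating only surgeries inside the same 30-day window as history) would change only the `i > 0` flag.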

End Point
The study end point was in-hospital mortality, defined as death during the same hospitalization as the surgery, regardless of cause.

Data Source And Extraction
A "big data" database was constructed by merging multiple individual-level databases, including the hospital information system (HIS), laboratory information management system (LIS), clinical data repository (CDR), intensive care unit (ICU) database, and surgery record database of the cardiac surgical department at Shanghai Children's Medical Center. These datasets were linked using unique encoded personal identifiers. To merge and standardize data from multiple sources, a semantic transformation process was implemented with an extract-transform-load (ETL) approach. Data extracted from the original sources served as input to semantic transformation rules, created by domain experts, that express the semantics of the data transformation. Data were transformed according to a standardized common data model (CDM).
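As an illustration of rule-based semantic transformation, the sketch below maps fields from several source systems into one record per encoded patient identifier. The source field names, CDM field names, and value codes in `RULES` are hypothetical placeholders; the study's actual rules and CDM are not described.

```python
from collections import defaultdict

# Hypothetical expert-written rules: (source system, source field) -> (CDM field, transform).
RULES = {
    ("LIS", "ALB"): ("serum_albumin_g_per_l", float),
    ("HIS", "SEX"): ("sex", lambda v: {"1": "male", "2": "female"}.get(v, "unknown")),
    ("ICU", "SPO2"): ("spo2_percent", float),
}


def to_cdm(source, row):
    """Apply the transformation rules to one source row; unmapped fields are dropped."""
    out = {}
    for field, value in row.items():
        rule = RULES.get((source, field))
        if rule is None or value is None:
            continue
        target, transform = rule
        out[target] = transform(value)
    return out


def merge_sources(rows):
    """rows: iterable of (source, encoded_patient_id, row_dict).

    Returns one CDM record (a dict of standardized fields) per encoded patient identifier.
    """
    merged = defaultdict(dict)
    for source, patient_key, row in rows:
        merged[patient_key].update(to_cdm(source, row))
    return dict(merged)
```

Keeping the rules as data rather than code lets domain experts review and extend them without touching the merge logic.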
Baseline characteristics, including demographic information, diagnosis, and pre-operation echocardiography and laboratory test results, were extracted from the merged database. Variables with missing values in more than 30% of cases were excluded.

Statistical Analysis And Model Construction
Continuous variables were described using median (range), since none of them was normally distributed. Categorical variables were described using frequency (%). To evaluate the distributional balance between the training and validation sets, the Mann-Whitney U test, chi-squared test, and Fisher's exact test were applied for between-group comparisons, as appropriate.
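For continuous variables, the training-versus-validation comparison can be sketched with a Mann-Whitney U test using mid-ranks and the large-sample normal approximation. This is a minimal stand-alone sketch, not the SAS procedure used in the study; it omits the tie correction to the variance, which makes the p-value slightly conservative when ties are present.

```python
import math


def midranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value
```

With thousands of patients per set, as here, the normal approximation is accurate; an exact test matters only for very small groups.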
The mortality of each procedure was calculated. In addition, procedure-specific mortality rates were estimated using a Bayesian random effects model that adjusted each procedure's mortality rate based on the size of its denominator. Unlike conventional methods, random effects models use data from all procedures in the database when estimating the probability of mortality for any single procedure. This "borrowing of information" across procedures produces estimates with good statistical properties, including smaller standard errors than conventional estimates. The model-based estimate is a weighted average of a procedure's observed mortality rate and the overall average mortality rate for all procedures in the database. The model weights an individual procedure's own data more heavily when the denominator is large enough to be reliable, and weights the overall average mortality rate more heavily when the denominator is too small to support a reliable mortality estimate.
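The exact form of the Bayesian random effects model is not specified here. As an illustration of the weighted-average behaviour described above, the sketch swaps in an empirical-Bayes beta-binomial model: with overall mortality m and a prior strength k (pseudo-cases), each procedure's estimate is (d + m·k)/(n + k) = w·(d/n) + (1 − w)·m with weight w = n/(n + k), so procedures with large denominators keep most of their own rate. The method-of-moments estimate of k is our own assumption.

```python
def shrink_mortality(deaths, cases, k=None):
    """Empirical-Bayes beta-binomial shrinkage of procedure mortality (illustrative).

    deaths[i], cases[i]: deaths and patients for procedure i (cases[i] > 0).
    k: prior strength in pseudo-cases; estimated by method of moments if None.
    Returns (list of shrunken mortality estimates, k used).
    """
    overall = sum(deaths) / sum(cases)
    if k is None:
        rates = [d / n for d, n in zip(deaths, cases)]
        mean_rate = sum(rates) / len(rates)
        observed_var = sum((r - mean_rate) ** 2 for r in rates) / (len(rates) - 1)
        # Expected binomial sampling variance if all procedures shared one rate.
        sampling_var = sum(overall * (1 - overall) / n for n in cases) / len(cases)
        between_var = max(observed_var - sampling_var, 1e-6)
        # A beta prior with mean m and strength k has variance m(1 - m)/(k + 1).
        k = max(overall * (1 - overall) / between_var - 1, 0.0)
    estimates = [(d + overall * k) / (n + k) for d, n in zip(deaths, cases)]
    return estimates, k


estimates, k_used = shrink_mortality([0, 50], [5, 1000], k=20)
print(estimates, k_used)
```

In the example, the 0/5 procedure is pulled up toward the overall rate, while the 50/1000 procedure barely moves.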
A mortality risk stratification was performed by categorizing procedures into groups according to mortality. Categorizations consisting of 2 to 20 categories were evaluated based on the internal homogeneity of the categories and their discrimination as predictors of mortality.
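One way to compare candidate categorizations is sketched below. It assumes contiguous groups of roughly equal numbers of procedures along the sorted estimated risk (the study does not state how boundaries were chosen), uses the c-statistic of the category number as a patient-level predictor of death for discrimination, and uses the widest within-category risk range as a crude, hypothetical homogeneity measure.

```python
def category_auc(deaths, survivors):
    """C-statistic of an ordered category as a predictor of death.

    deaths[g], survivors[g]: patient counts in category g, ordered from low to high risk.
    Ties (death and survivor in the same category) count as 1/2.
    """
    concordant = 0.0
    survivors_below = 0
    for d, s in zip(deaths, survivors):
        concordant += d * (survivors_below + 0.5 * s)
        survivors_below += s
    return concordant / (sum(deaths) * sum(survivors))


def evaluate_partitions(procedures, k_values):
    """procedures: list of (estimated_risk, deaths, cases) tuples.

    Returns {k: (c_statistic, widest within-category risk range)}.
    """
    procs = sorted(procedures)
    n = len(procs)
    results = {}
    for k in k_values:
        if not 1 <= k <= n:
            raise ValueError("k must be between 1 and the number of procedures")
        bounds = [i * n // k for i in range(k + 1)]  # contiguous, non-empty groups
        deaths, survivors, spreads = [], [], []
        for lo, hi in zip(bounds, bounds[1:]):
            group = procs[lo:hi]
            deaths.append(sum(p[1] for p in group))
            survivors.append(sum(p[2] - p[1] for p in group))
            spreads.append(group[-1][0] - group[0][0])
        results[k] = (category_auc(deaths, survivors), max(spreads))
    return results
```

A categorization is attractive when its c-statistic approaches that of the finest grouping while each category still contains enough deaths for stable estimates.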
To construct a patient-level in-hospital death prediction model, we first conducted univariate analysis on the training set to select potential risk factors. Then, still within the training set, we built both a multiple logistic regression model and a gradient boosting decision tree (GBDT) model based on the potential risk factors identified in the univariate analysis. GBDT is an ensemble algorithm for classification and regression that combines multiple tree models through gradient boosting; we developed the machine learning model with the XGBoost toolkit, a highly flexible implementation of the GBDT algorithm. A grid search was used to set the XGBoost model hyper-parameters [13]. Finally, we applied the constructed models (i.e., the logistic regression and XGBoost models) to the validation set and used ROC curve analysis to evaluate their predictive power. Specifically, the AUC with its 95% confidence interval and P value, as well as sensitivity and specificity, were calculated. In addition, the importance of the top 10 risk factors in the XGBoost model's prediction was determined.
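A grid search simply evaluates every combination of candidate hyper-parameter values and keeps the best-scoring one. The sketch below is generic; the grid values are hypothetical (the study does not report them), and the parameter names follow XGBoost's scikit-learn wrapper `XGBClassifier`. The commented scoring function, which uses 5-fold cross-validated AUC on the training set, is an assumption about how candidates might be scored.

```python
import itertools

# Hypothetical candidate values; names follow xgboost.XGBClassifier keyword arguments.
GRID = {"max_depth": [3, 5], "learning_rate": [0.05, 0.1], "n_estimators": [100, 300]}


def grid_search(grid, score_fn):
    """Evaluate score_fn on every parameter combination; return (best_params, best_score).

    On equal scores, the combination encountered first is kept.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Assumed scoring on the training set (requires xgboost and scikit-learn):
# from sklearn.model_selection import cross_val_score
# from xgboost import XGBClassifier
# score_fn = lambda p: cross_val_score(
#     XGBClassifier(**p), X_train, y_train, scoring="roc_auc", cv=5).mean()
```

Scoring candidates on the training set only (for example, by cross-validation) keeps the validation set untouched for the final ROC analysis.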
All statistical tests were two-sided, and P < 0.05 was considered statistically significant. All statistical analyses and the logistic regression were conducted with SAS (version xx?). The XGBoost model was implemented with the Python (version 3.6) library XGBoost (version 0.9).

Characteristics of study population
As shown in Table 1, a total of 24,684 patients who underwent CHD operations were included, among whom there were 595 (2.4%) in-hospital deaths. Demographic and pre-operation characteristics of patients who died or survived are shown in Supplemental Table 1. The most commonly performed procedures included VSD repair, tetralogy repair, and ASD secundum repair. In-hospital mortality varied between procedures from 0 to 77.78%, with 24 procedures having no recorded deaths. For construction of the death risk prediction models, 75% of patients were assigned to the training set and 25% to the validation set. There were no significant differences between the training and validation sets in most demographic and pre-operation characteristics, except for white blood cell count (P = 0.02), basophil count (P = 0.03), and MPV (P = 0.004) (Supplemental Table 2).
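A 75%/25% split can be drawn as sketched below. Because deaths are rare (2.4%), this sketch splits deaths and survivors separately so both sets receive a similar death rate; whether the study stratified by outcome is not stated, so stratification and the fixed seed are assumptions.

```python
import random


def stratified_split(ids, labels, train_fraction=0.75, seed=2018):
    """Split ids into (train, validation), sampling within each outcome class.

    labels: 1 for in-hospital death, 0 for survival. The seed is an arbitrary
    hypothetical value used only for reproducibility.
    """
    rng = random.Random(seed)
    train, validation = [], []
    for outcome in sorted(set(labels)):
        members = [i for i, y in zip(ids, labels) if y == outcome]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train += members[:cut]
        validation += members[cut:]
    return train, validation
```

After the split, the balance checks described in the statistical analysis section can be run on each baseline variable.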

Stratification Of Surgical Procedure
Previous analyses have clearly shown that procedure complexity is among the most important factors in predicting in-hospital death. To further categorize the mortality risk of the various procedures, categorizations consisting of 2 to 20 categories were evaluated based on the internal homogeneity of the categories and their discrimination as predictors of mortality.
Procedures were sorted by increasing estimated risk and partitioned into 6 relatively homogeneous categories (Table 2). Of the 95 surgical procedures included in the analysis, 3 were stratified as level 6, 10 as level 5, 11 as level 4, 19 as level 3, 38 as level 2, and 14 as level 1. In our study, the level 6 procedures were Decannulation, ECMO, and Fontan Takedown, with estimated mortality rates of 71.55%, 64.99%, and 52.15%, respectively, which were higher than those of all other surgical risk levels (Table 2).
protein, serum albumin, A/G, ALT, AST, ALP, total bilirubin, direct bilirubin, creatinine, uric acid, eosinophils/100 white blood cells, monocytes/100 white blood cells, and basophils/100 white blood cells.

Construction of in-hospital death prediction model based on patient level
Patient-level risk factors, including baseline characteristics, pre-operation echocardiography and laboratory test results, and the procedure score, were compared between patients who died and those who survived. Significant differences were identified in numerous factors between the two groups (supplemental Table 2).
Because incorporating numerous pre-operation variables into a regression analysis may reduce the effectiveness of the prediction model, a machine learning approach, which is considered more effective for "big data" analysis, was also used for in-hospital death prediction. The preoperative indicators included basic information, medical history, diagnostic categories, and pre-operation test results.
The data extracted for the study sample contained 124 potential risk factor features. We examined each feature individually and removed 51 features with more than 30% missing data. Each continuous feature was assigned a normal range (covering at least 95% of the samples) and an appropriate transform function, and values beyond this range were set to the boundary values to improve the robustness of the analysis. Binary data were converted to 0 and 1, ranked data were assigned ascending integers, and categorical data were encoded as individual binary features. Missing data were filled with the median value for continuous features and the mode for discrete features.
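The preprocessing steps above can be sketched with the standard library. The assumptions here are ours: the "normal range" is taken as the central 95% (2.5th to 97.5th percentiles) of observed values, the unspecified transform functions are omitted, and missing values are represented as `None`. Binary variables can be handled with `encode_ordinal(column, ["no", "yes"])` or similar.

```python
import statistics


def missing_fraction(column):
    return sum(v is None for v in column) / len(column)


def clean_continuous(column, max_missing=0.30):
    """Return None if too sparse; otherwise clip to the central 95% and impute the median."""
    if missing_fraction(column) > max_missing:
        return None  # feature dropped
    observed = [v for v in column if v is not None]
    cuts = statistics.quantiles(observed, n=40, method="inclusive")
    low, high = cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles
    median = statistics.median(observed)
    return [median if v is None else min(max(v, low), high) for v in column]


def encode_ordinal(column, ordered_levels):
    """Ranked data -> ascending integers; missing values get the modal level."""
    mode = statistics.mode([v for v in column if v is not None])
    return [ordered_levels.index(mode if v is None else v) for v in column]


def one_hot(column, prefix):
    """Categorical data -> one 0/1 indicator per level; missing values get the mode."""
    mode = statistics.mode([v for v in column if v is not None])
    filled = [mode if v is None else v for v in column]
    return {f"{prefix}={level}": [int(v == level) for v in filled]
            for level in sorted(set(filled))}
```

One-hot encoding is what turns a single categorical variable into several binary features.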
Because categorical data were expanded into binary indicator features, the data processing resulted in 158 risk factor features in the final model analysis.
Receiver operating characteristic curves for the prediction models based on logistic regression and gradient boosting are shown in Table 3 and Fig. 1. The AUC of the logistic regression model was 0.864 (95% CI: 0.833-0.895, P < 0.001), with a sensitivity of 0.831 and a specificity of 0.786. The AUC of the gradient boosting model was 0.884 (95% CI: 0.858-0.909, P < 0.001), with a sensitivity and specificity of 0.838 and 0.785, respectively (Table 3). In the feature importance analysis, a higher score indicates a more important role of the feature in the machine learning model's outcome classification. The risk factor features with the greatest impact on the model were the procedure score, age, Ultrasound MV, and Ultrasound atrial level.
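The quantities reported above can be computed from validation-set predictions as sketched below. The method choices are assumptions, since the study does not state them: the AUC is the probability that a death receives a higher predicted risk than a survivor (ties count 1/2), the 95% CI uses the Hanley-McNeil (1982) standard error, the P value is a z test against AUC = 0.5, and sensitivity and specificity are taken at the threshold maximizing the Youden index.

```python
import math


def roc_summary(scores, labels):
    """Return (auc, (ci_low, ci_high), p_value, (sensitivity, specificity, threshold))."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # deaths
    neg = [s for s, y in zip(scores, labels) if y == 0]  # survivors
    n1, n0 = len(pos), len(neg)
    # AUC as the Mann-Whitney probability of correct ordering.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    auc = wins / (n1 * n0)
    # Hanley & McNeil standard error.
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    se = math.sqrt((auc * (1 - auc) + (n1 - 1) * (q1 - auc ** 2)
                    + (n0 - 1) * (q2 - auc ** 2)) / (n1 * n0))
    ci = (max(0.0, auc - 1.96 * se), min(1.0, auc + 1.96 * se))
    p_value = 0.0 if se == 0 else math.erfc(abs((auc - 0.5) / se) / math.sqrt(2))
    # Youden index: threshold maximizing sensitivity + specificity - 1.
    best = None
    for t in sorted(set(scores)):
        sensitivity = sum(p >= t for p in pos) / n1
        specificity = sum(q < t for q in neg) / n0
        if best is None or sensitivity + specificity > best[0] + best[1]:
            best = (sensitivity, specificity, t)
    return auc, ci, p_value, best
```

The pairwise AUC loop is quadratic; for a validation set of about 6,000 patients it remains fast enough, and a rank-based formula gives the same value.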

Discussion
Assessing the risk of cardiac surgery is critical to understanding differences in surgical outcomes, including in-hospital mortality, and to improving patient outcomes. In-hospital mortality risk prediction has important clinical significance for assessing the quality of cardiac surgery and postoperative patient management across institutions.
In our study, in-hospital mortality risk prediction models were constructed by combining surgical risk stratification with patients' pre-operation variables, and the predictive power of the models for postoperative mortality was evaluated. The in-hospital mortality of CHD surgery in our hospital was 2.4%, which is consistent with previously reported results in Western countries [14-16] and much lower than that reported in studies from developing countries [10]. This mortality is also similar to the previously reported overall mortality of CHD patients in China [17]. According to RACHS-1 stratification, the mortality of CHD procedures in different risk categories varied from 0.26% to 62% [12,18], which is consistent with the mortality distribution observed in the current study.
Our study found that the operation score had the greatest impact on the predictive performance of the death risk prediction model. Decannulation, ECMO, and Fontan Takedown, which had the highest surgical risk scores (6 points), also had the highest observed mortality rates: 77.78%, 72.22%, and 55.56%, respectively. Several surgical risk stratification systems have been developed previously, such as the Aristotle score and the Risk Adjustment for Congenital Heart Surgery (RACHS-1) system [19,20]. The Aristotle comprehensive complexity score had better predictive power than the Aristotle basic complexity score and RACHS-1 for in-hospital mortality after surgery for congenital heart disease [21]. Hörer et al. analyzed the predictive power of the ACHS score for mortality risk in adults after congenital heart surgery and found that it had similarly good predictive power for surgical outcomes in 2 European centers [22]. It has been reported that the mortality of CHD procedures varies across RACHS-1 risk categories, with higher mortality in higher risk groups [12,18]. Because of the enormous variability in CHD procedures and patient-level factors, it is difficult to establish a predictive model for mortality. All of the models mentioned above were constructed as risk-adjustment tools for evaluating the performance of CHD surgery. In China, uniform standards for the administration and evaluation of CHD surgery are still lacking.
The Aristotle Score system was adopted with modifications based on our experience. Individual procedure scores, rather than risk stratification levels, were chosen because an individual-level rather than a group-level predictive model was required. As tools for comparing surgical performance across groups, complexity scores and stratifications have some predictive value but are not very accurate. It has been reported that when used for prediction of in-hospital death, the ABC score, RACHS-1 and STS-EACTS usually had an overall C-index from 0.
The current study has certain limitations. As in all studies of CHD surgery risk evaluation, because of the great variation in CHD surgical procedures and the low mortality, the sample size for most individual procedures was small, and the statistical methods used for adjustment may not be sufficient to overcome this influence. In addition, the data used in the current study came from a single center and might not be representative of general practice in China; the center's experience with specific procedures may have introduced bias into the analysis. However, this was the first systematic analysis of CHD procedure risk in China with a large sample size, and the results should, at least in part, reflect the current status of CHD surgical practice in China. Also, because of the limited data source, especially the low mortality, we did not include an external validation cohort in the current study; therefore, the real value of the prediction models needs to be further validated in future studies. Finally, as in previous studies, the recorded in-hospital mortality may not represent all operation-related deaths, which should also include deaths within 30 days after discharge. This needs to be addressed with a more complete data source.

Conclusion
In conclusion, the current study has demonstrated that, by combining the procedure complexity score with pre-operative patient-level factors, models constructed using machine learning and logistic regression had high and similar accuracy in predicting in-hospital mortality. Operation score and age had the greatest impact on model prediction performance. The predictive models may be applied to surgical performance evaluation as well as to pre-operative risk prediction in clinical practice.