In-hospital Mortality Prediction among Patients with Fractures of Pelvis and Acetabulum in Intensive Care Unit: Machine Learning versus Conventional System



Abstract

Background
Fractures of the pelvis and/or acetabulum are associated with a high risk of death worldwide. However, the ability of conventional scoring systems to predict in-hospital mortality remains limited. Here, we hypothesize that machine learning (ML) algorithms could predict mortality better than the traditional scoring system, the Simplified Acute Physiology Score (SAPS) II, for patients with pelvic and acetabular trauma in the intensive care unit (ICU).

Methods
We developed customized mortality prediction models with ML techniques based on MIMIC-III, an open-access, de-identified database consisting of data from more than 25,000 patients who were admitted to the Beth Israel Deaconess Medical Center (BIDMC). We enrolled 307 patients who had an ICD-9 diagnosis of pelvic, acetabular, or combined pelvic and acetabular fractures and an ICU stay of more than 72 hours. ML models, including decision tree, logistic regression, and random forest, were built using the SAPS II features from the first 72 hours after ICU admission, and the traditional first-24-hours features were used to build the respective control models. We compared the models' performance using the area under the receiver-operating characteristic curve (AUROC). A feature importance method was used to visualize the top risk factors for mortality.

Results
All the ML models outperformed the traditional scoring system SAPS II (AUROC = 0.73), and the best-fitted random forest model performed best (AUROC = 0.90). When the evolution of physiological features over time was used rather than 24-hour snapshots, all the ML models performed better than their respective controls. Age remained the most important feature for all classifiers. Age, BUN (minimum value on day 2), and BUN (maximum value on day 3) were the top 3 predictor variables in the optimal random forest experiment model. In the best decision tree model, the top 3 risk factors, in decreasing order of contribution, were age, the lowest systolic blood pressure on day 1, and the same value on day 3.

Conclusion
The results suggest that mortality modeling with ML techniques could improve prediction performance in the context of pelvic and acetabular trauma and potentially support decision-making for orthopedic and ICU practitioners.

Background
The role of big data and artificial intelligence (AI) in medicine has been receiving increasing attention from researchers in both industry and academia. Healthcare giants and interdisciplinary groups at universities have combined big data and AI algorithms to build better health profiles and predictive models around individual patients for improved diagnosis and treatment. For example, Roche and IBM collaborated to predict the early risk of diabetes-related chronic kidney disease (CKD) from real-world data; the Roche/IBM algorithm showed the best predictive performance compared with published ones [1]. The company FDNA created a facial image analysis framework called DeepGestalt, which combines computer vision and deep learning algorithms to predict genetic syndromes and achieved higher accuracy than clinicians in most scenarios [2]. Yet significant challenges remain in the convergence of AI and medicine. Enhanced dialogue and teamwork between the two fields are needed to advance precision medicine.
Although previous comparison studies suggested that machine learning (ML) methods are superior to traditional regression for real-time risk prediction [3][4], it is uncertain whether the same methods can be reproduced with better accuracy in a different patient cohort, let alone in a distinct clinical context. Thus, efforts to study and implement AI in various biomedical settings are far from sufficient.
In terms of early mortality, fractures of the pelvis and acetabulum top the list of orthopedic traumas, because most of them are severe high-energy injuries [5][6]. Several factors make pelvic and acetabular fractures challenging to treat and account for the high risk of death. First, the fracture site is usually surrounded by extensive soft tissue, which means more bleeding and a higher likelihood of accompanying large-area trauma. Second, patients with pelvic and acetabular fractures often suffer from secondary or simultaneous multiple organ injuries, potentially involving the limbs, thoracic and abdominal organs, urinary system, brain, and spine [7][8]. Even for an experienced, highly qualified doctor, judging a patient's risk of death and its influencing factors from subjective experience is fraught with uncertainty.
The prediction of hospital mortality for ICU patients has gone through roughly three eras over the past three decades. The first version of the APACHE (Acute Physiology and Chronic Health Evaluation) scoring model, proposed in 1981, represents the first stage, which relied on the subjective experience of medical experts. The second stage was the era of statistical analysis with logistic regression, and the third is the era of machine learning and big data, which is in full swing today. SAPS and APACHE, each with fixed variables and model formulas, remain the most widely used mortality prediction systems in clinical practice. However, several external validation studies have suggested that the most recent versions of neither SAPS nor APACHE accurately predict the actual probability of death [9]. Clinicians are therefore calling for a data-based auxiliary tool that predicts in-hospital mortality to support better clinical decision-making.
Here, for the first time, we applied popular ML methods and the Medical Information Mart for Intensive Care III (MIMIC-III) database to a representative traumatic disease in orthopedics. To predict mortality among patients with pelvic, acetabular, or combined pelvic and acetabular fractures, we built customized ML models, including decision tree, logistic regression, and random forest, using two sets of physiological features from the Simplified Acute Physiology Score (SAPS) II: the first set was obtained within 72 hours after ICU admission, and the second, used as the control, within the first 24 hours. We compared the ML experiment models with the controls and SAPS II to (1) determine whether models based on the new technique and MIMIC-III can improve mortality prediction in this specific clinical context, and (2) explore whether using the evolution of physiological features over time, rather than traditional feature snapshots from the first 24 hours, yields a better prediction model.

Methods
Approval to access the MIMIC-III database was obtained after completing the required online courses and requirements described in the instruction manual on the official homepage (https://mimic.physionet.org/). The website describes two ways to access the dataset: one is to query the data on BigQuery, a serverless online platform that supports direct SQL queries; the other is to download the database, which is distributed as comma-separated values (CSV) files, and import the data into a database system on a local server. The latter offers more complete functionality with all information available and supports flexible operations and queries without overloading. Therefore, we first set up a virtual machine running the CentOS 7 Linux operating system and installed the relational database PostgreSQL 11 (The PostgreSQL Global Development Group, California, USA) on the remote server. Next, we imported the target data (total uncompressed size 6.2 GB) into PostgreSQL to build the clinical database and used SQL scripts to build the MIMIC-III database with a collection of materialized views. Finally, we conducted data processing, feature selection, model building, training, and prediction evaluation in Python 3.6.3 (Python Software Foundation, Vienna, Austria).

Dataset and Patients
The MIMIC-III dataset is an openly available database developed by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology (MIT). It consists of data from more than 25,000 patients who were admitted to the Beth Israel Deaconess Medical Center (BIDMC) since 2003 and whose records have been de-identified for information safety [10].
Here, we identified patients who were diagnosed with pelvic, acetabular, or combined pelvic and acetabular fractures according to ICD-9 codes and who survived at least 72 hours after ICU admission. All data within the first 72 hours following ICU admission were extracted from the MIMIC-III clinical database (version 1.4).
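The cohort filter described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout and field names (`icd9_codes`, `icu_los_hours`) are hypothetical, and the use of ICD-9 category 808 (fracture of pelvis, which includes the acetabulum) is our assumption, since the exact codes and queries are not listed in the text.

```python
from typing import Dict, List

# Assumption: ICD-9 category 808 covers fractures of the pelvis, including
# the acetabulum; codes are written without the decimal point (e.g. "8080").
PELVIC_ICD9_PREFIX = "808"
MIN_ICU_HOURS = 72.0


def select_cohort(stays: List[Dict]) -> List[Dict]:
    """Keep ICU stays with a pelvic/acetabular fracture code and >= 72 h in ICU."""
    cohort = []
    for stay in stays:
        has_fracture = any(
            code.startswith(PELVIC_ICD9_PREFIX) for code in stay["icd9_codes"]
        )
        if has_fracture and stay["icu_los_hours"] >= MIN_ICU_HOURS:
            cohort.append(stay)
    return cohort


# Toy records (hypothetical, not MIMIC-III data).
stays = [
    {"stay_id": 1, "icd9_codes": ["8080", "486"], "icu_los_hours": 120.0},
    {"stay_id": 2, "icd9_codes": ["8082"], "icu_los_hours": 30.0},
    {"stay_id": 3, "icd9_codes": ["41401"], "icu_los_hours": 200.0},
]
selected_ids = [s["stay_id"] for s in select_cohort(stays)]
```

In this toy example only stay 1 satisfies both criteria; stay 2 is too short and stay 3 has no fracture code.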

SAPS II and Feature Expansion
To date, the SAPS II scoring system is known to discriminate well between potential survivors and non-survivors among ICU patients and remains the most widely used in clinical practice. Here, the SAPS II scores of individual patients were converted to predicted mortality according to the following formula [11]:

$$\mathrm{Logit} = -7.7631 + 0.0737 \times \mathrm{SAPS\ II} + 0.9971 \times \ln(\mathrm{SAPS\ II} + 1) \tag{2}$$

Variables in the SAPS II system contain only physiological parameters (except for age, type of admission, and three underlying diseases) measured within the first 24 hours after ICU admission. However, we hypothesized that model performance could be further improved by extending the observation period of the physiological predictors from the first 24 hours to the first 72 hours in the ICU. Hence, we kept the same variables as in the original SAPS II scoring system but captured their values within 72 hours after ICU admission for the ML models. Another set of values from the first 24 hours was extracted to build the control models. Missing values were replaced with the median of each variable.
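As a worked illustration of the SAPS II conversion, the sketch below computes the logit from the formula given above and then applies the standard logistic transform to obtain a probability. The logistic step is stated here as an assumption about how the logit is turned into a predicted mortality; the function name is ours.

```python
import math


def saps2_mortality_probability(saps2_score: float) -> float:
    """Convert a SAPS II score into a predicted probability of hospital death."""
    logit = -7.7631 + 0.0737 * saps2_score + 0.9971 * math.log(saps2_score + 1)
    # Assumed step: logistic transform of the logit into a probability.
    return 1.0 / (1.0 + math.exp(-logit))


# Predicted mortality for a few illustrative scores.
probabilities = {score: saps2_mortality_probability(score) for score in (20, 40, 60)}
```

Under these assumptions a SAPS II score of 40 maps to a predicted mortality of roughly 25%, and the prediction rises monotonically with the score.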

Establishing Models and Evaluation
Based on this customized variable selection, logistic regression, decision tree, and random forest models were built to predict in-hospital mortality. The candidate samples were randomly divided into two separate subsets, one taken as the training set and the other as the test set, in a 7:3 ratio. All models were built on the training set using 5-fold cross-validation (in each fold, 80% of the training data were used for training and the remaining 20% for validation), and the best-performing models were evaluated on the test set. Sixteen predictor variables from the first 24 hours (the same as the SAPS II system parameters) were used as inputs for the ML control models, while the extended variables from 72 hours were used for the ML experiment models. The outputs were the same as for the SAPS II model, namely the estimated in-hospital survival probability for enrolled patients. Performance was evaluated using the cross-validated area under the receiver-operating characteristic curve (AUROC), with the receiver-operating characteristic (ROC) curves illustrated graphically. The AUROC reflects each model's ability to discriminate between outcomes and its overall predictive performance.
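The evaluation protocol above (a 7:3 split, 5-fold cross-validation on the training part, and AUROC as the metric) can be sketched with the Python standard library alone. The function names and the random seed are illustrative assumptions; the text does not state which library or seed was used, and a real analysis would more likely rely on an established ML package.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split_indices(
    n: int, test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split indices 0..n-1 into training and test sets (7:3 by default)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices: Sequence[int], k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, validation) index pairs; each fold validates once."""
    folds = [list(indices[i::k]) for i in range(k)]
    splits = []
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, folds[i]))
    return splits


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


train_idx, test_idx = train_test_split_indices(307)
cv_splits = k_fold_indices(train_idx, k=5)
example_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With 307 patients this split yields 215 training and 92 test samples, and each cross-validation fold holds out 43 of the 215 training samples for validation.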
To visualize the contribution of the predictor variables in the ML models, we applied feature importance visualization, which refers to a class of techniques that assign scores to input features to indicate the relative importance of each feature when making a prediction. Here, a variable importance measure based on the coefficient of each input variable was used for the logistic regression models, while the change in the Gini index was used for the decision tree and random forest classifiers [12]. Figure 1 shows the workflow of this study.
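To make the Gini-based importance concrete, the following minimal sketch computes the Gini impurity of a set of binary labels and the decrease in impurity produced by one split. Tree-based importance scores accumulate such decreases over all splits that use a given feature; the toy labels and the split (for example, on an age threshold) are hypothetical.

```python
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels (1 = died, 0 = survived)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def gini_decrease(
    parent: Sequence[int], left: Sequence[int], right: Sequence[int]
) -> float:
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical node with 3 deaths among 8 patients, split into two children.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
decrease = gini_decrease(parent, left, right)
```

Here the parent impurity is 0.46875 and the weighted child impurity is 0.1875, so this split would contribute 0.28125 to the importance of the feature it uses.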

Distribution of Patients
In total, 313 patients with an ICD-9 diagnosis of fracture of the pelvis and/or acetabulum were admitted. After excluding patients with outlying values (age > 300), we finally enrolled 307 patients (210 males and 97 females), of whom 84 died (marked as 1, 27.4% mortality) and 223 survived (marked as 0), as shown in Figure 2a. Figure 2b shows the admission types of the 307 patients: 28 patients were admitted for unscheduled surgery (9.1%), 278 for medical reasons (90.6%), and the remaining 1 for scheduled surgery (0.3%). Figure 3 shows the distributions of age and SAPS II score among survivors and non-survivors in the entire patient cohort. The 16 features from the first 24 hours are described as baseline characteristics of the cohort in Table 1.

Chronic disease, n (%) 1 (0.04%) 0 (0)
Note: Scheduled surgical = surgery scheduled ≥24 hours prior; medical = no surgery within 1 week of admission; unscheduled surgical = surgery scheduled ≤24 hours prior. Abbreviations: BP = blood pressure; GCS = Glasgow coma scale; BUN = blood urea nitrogen; WBC = white blood cells.

Comparison among ML Models and SAPS II
The prediction performance of the optimal logistic regression, random forest, and decision tree models and of SAPS II scoring is shown in Table 2. All ML models predicted mortality better than SAPS II (AUROC = 0.73, Hosmer-Lemeshow p < 0.001). Among the ML models, the random forest model combined with the first-72-hours SAPS features achieved the best performance, with an AUROC of up to 0.90, followed by the decision tree experiment model (AUROC = 0.89) and the logistic regression experiment model (AUROC = 0.78). As shown in Table 2, all the experiment groups with first-72-hours variables performed better than their controls; the respective ROC curves are compared in Figure 4. For the logistic regression models, the Hosmer-Lemeshow (HL) p values of both the experiment and control models remained non-significant (>0.05), suggesting good fit.
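For readers unfamiliar with the Hosmer-Lemeshow test reported above, the sketch below shows one common way to compute it: patients are sorted by predicted risk, split into groups, and observed deaths are compared with expected deaths in each group. The use of 10 groups and of g - 2 degrees of freedom is the conventional choice and is an assumption here, as the text does not state the grouping used. The chi-square tail probability uses a closed form that holds only for even degrees of freedom.

```python
import math
from typing import Sequence, Tuple


def chi2_sf_even_df(x: float, df: int) -> float:
    """Chi-square survival function; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(
    labels: Sequence[int], probs: Sequence[float], groups: int = 10
) -> Tuple[float, float]:
    """Return the HL statistic and its p value (df = groups - 2, must be even)."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        mean_p = expected / len(chunk)
        variance = len(chunk) * mean_p * (1.0 - mean_p)
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    return statistic, chi2_sf_even_df(statistic, groups - 2)


# Toy data, perfectly calibrated by construction: in each group of 20
# patients with predicted risk k/20, exactly k die.
labels, probs = [], []
for k in range(1, 11):
    labels += [1] * k + [0] * (20 - k)
    probs += [k / 20] * 20
hl_statistic, hl_p_value = hosmer_lemeshow(labels, probs)
```

A perfectly calibrated model, as in this toy example, gives a statistic near zero and a p value near 1; a small p value, such as the p < 0.001 reported for SAPS II, indicates poor calibration.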

Feature importance
The top 10 features, or risk predictors, of each model are shown in Figure 5. Age, BUN (minimum value on day 2), and BUN (maximum value on day 3) were the top 3 predictor variables in the best-fitted random forest experiment model, while age, BUN (maximum value), and the lowest temperature were the top 3 risk factors for mortality in its control model. In the optimal decision tree model, the top 3 risk factors were age, the lowest systolic blood pressure on day 1, and the same value on day 3. Although the ranking of feature variables varied across models, age remained the most important feature for all classifiers.

Discussion
The purpose of this study was to establish customized models that predict mortality better than the current standard severity scoring system, SAPS II, for patients with a representative orthopedic trauma. All models were trained on data from patients with pelvic and/or acetabular fractures in the MIMIC-III database, a large-scale critical care database from the United States that offers a large amount of data, rich variables, and high data quality. In addition, we combined mainstream ML algorithms with the evolution of physiological features over time to further enhance the performance of each model. A good prediction model should have both satisfactory sensitivity and specificity; AUROC, a comprehensive index of both, was reported here. All the ML models outperformed SAPS II (AUROC of 0.73); the random forest ranked first (AUROC = 0.90), followed by the decision tree model (AUROC = 0.89) and then logistic regression (AUROC = 0.78).

The decision tree method classifies data through a series of rules, which conforms to the way humans ordinarily make decisions. Similarly, a clinician's diagnosis can be regarded as a classification process: the doctor assigns patients to a specific disease group based on knowledge and experience. The results of a decision tree are concise, clear, and easy to understand, and they help to extract the corresponding diagnostic rules. Applying decision trees to the classification and diagnosis of diseases can often improve diagnostic accuracy, which has made them widely used in clinical practice. In this study, the AUROC of the optimal decision tree model was higher than that of the traditional SAPS II system (0.89 vs 0.73), reflecting improved prediction performance.
Although the decision tree achieves the best balance between accuracy and interpretability, a single decision tree often fails to achieve the best possible prediction accuracy. Scholars have therefore developed an algorithm based on the decision tree called random forest. As its name suggests, a random forest is composed of multiple decision trees, each of which is trained relatively independently and has its own prediction and classification capability. Put another way, if we compare each tree in the model to an individual expert, the random forest effectively avoids any one individual's misjudgment and bias by combining the votes of multiple experts according to certain rules. This view helps to explain why a random forest performs better than a single decision tree. Our results are consistent with it: the best-fitted random forest model performed best among all ML models, with an AUROC of 0.90 versus 0.89 for the decision tree. This is also consistent with the results of previous studies. For example, Fernandez-Delgado et al. compared more than 100 model algorithms on 121 datasets and found that the random forest algorithm had the highest prediction accuracy [13]. Pirracchio et al. established a mortality prediction model based on the clinical data of 24,508 ICU patients [14]. Their "Super Learner" model, customized by combining multiple machine learning algorithms, outperformed the disease scoring systems, and the random forest model and the "Super Learner" were comparable in prediction performance (each with an AUROC as high as 0.88). However, model performance inevitably varies across application scenarios. In a single-center clinical study, Mao et al. compared the predictive value of a logistic regression model and two other ML algorithms (decision tree and support vector machine) in early disease warning [15]. The predictive accuracy of logistic regression was higher than that of the other two. Badriyah et al. applied a decision tree model to predict the probability that inpatients in general wards would enter the ICU, and its prediction accuracy was comparable to that of the traditional National Early Warning Score (NEWS) [16]. In some other fields, logistic regression has been more accurate than other machine learning methods. Yet in this study, the AUROC of the logistic regression model with extended variables was 0.78, lower than that of the random forest model.
The results revealed an increase in AUROC for each set of ML models after feature expansion: the physiological parameters included in SAPS II were kept as key factors, but we expanded the features by capturing the measurements obtained within the first 72 hours rather than a 24-hour snapshot after ICU admission. This suggests that the evolution of physiologic variables over time is more predictive of the clinical outcome than the physiologic snapshot of ICU patients on admission, which is the basis of current severity scoring systems. This methodological improvement is consistent with clinical experience: it is not the initial presentation after ICU admission but how a patient responds to treatment that reflects the trend of deterioration or improvement and thus determines the outcome. A similar method was used by Celi et al. [17], who reported that using the evolution of physiologic variables over time improved ML models' performance in predicting mortality among ICU patients with acute kidney injury (AKI). This method of feature selection may set a good example for integrating clinicians' experience with the strengths of AI.
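The feature expansion discussed above can be illustrated with a short sketch that turns timestamped measurements of one variable into per-day minimum and maximum features over the first 72 hours, in the spirit of features such as "BUN (minimum value on day 2)". The input layout, the feature naming scheme, and the sample values are hypothetical; days without measurements are left as `None` so that they can be imputed afterwards (the study replaced missing values with the median).

```python
from typing import Dict, List, Optional, Tuple


def daily_min_max_features(
    name: str, measurements: List[Tuple[float, float]], days: int = 3
) -> Dict[str, Optional[float]]:
    """Build per-day min/max features from (hours since ICU admission, value) pairs."""
    features: Dict[str, Optional[float]] = {}
    for day in range(1, days + 1):
        start, end = 24.0 * (day - 1), 24.0 * day
        values = [v for t, v in measurements if start <= t < end]
        features[f"{name}_min_day{day}"] = min(values) if values else None
        features[f"{name}_max_day{day}"] = max(values) if values else None
    return features


# Hypothetical BUN measurements during the first 72 hours in the ICU.
bun_measurements = [(2.0, 18.0), (20.0, 25.0), (30.0, 31.0), (47.5, 28.0), (60.0, 40.0)]
bun_features = daily_min_max_features("bun", bun_measurements)
```

Restricting `days` to 1 reproduces the 24-hour snapshot used for the control models, so the same function can serve both feature sets.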
Overall, the top 3 factors affecting the mortality of pelvic and/or acetabular fractures were age, systolic BP, and BUN. In our findings, the mean age of the non-survivors was 66.29 years, significantly higher than that of the survivors. Similarly, some studies reported that patients older than 65 with pelvic fractures had higher case fatality despite equivalent measures of injury severity, with a mortality rate of approximately 20 percent [18]. Moreover, both systolic BP and BUN are indicators of extensive bleeding. According to a retrospective cohort study, a substantial number of patients with pelvic injuries died within the first 24 hours of hospital arrival, primarily due to massive hemorrhage [19]. Our finding is consistent with literature [20][21] reporting that bleeding-related factors are reliable predictors of early mortality, which underscores the significance of initial bleeding control in the management of hemorrhagic shock.
Our study has some limitations. First, the patient cohort came from a single center in the United States, and the number of enrolled patients falls short of what is usually considered big data. In addition, this study conducted only internal validation of the models, and only a limited number of ML methods were compared with each other. Further research with external data is needed to validate the models and generalize them to other settings. Furthermore, MIMIC-III is not a specialized database for orthopedic trauma, so relevant detailed information (such as the clinical classification of pelvic and acetabular fractures and patient trauma scores), which could provide important features, is lacking. As ML algorithms and big data are complementary, enhanced dialogue and teamwork between computer scientists, data scientists, and clinicians could play a significant role in the progress of AI in medicine.

Conclusion
To our knowledge, this study is the first to apply ML models to decision support for patients with a representative traumatic disease in orthopedics. We found that customized modeling with ML methods could predict mortality better than the traditional severity scoring system SAPS II in the context of pelvic and/or acetabular fractures, potentially facilitating clinical decision-making for orthopedists and ICU practitioners. While evidence-based medicine has overshadowed empirical therapies, we consider that integrating experience from both AI and medicine represents the future of personalized clinical medicine.