A total of 147054 patients were included in this study; their baseline characteristics are shown in Table 1. The hospital mortality of the whole cohort was 9.3%.
Table 1
Baseline characteristics of the study population
| Overall population (n = 147054) | Alive at hospital discharge (n = 133328) | Dead at hospital discharge (n = 13726) |
Age | 62.8 ± 17.1 | 62.2 ± 17.2 | 69.2 ± 15.0 |
Source of admissionᵃ | | | |
Floor | 17670 (12.0) | 14874 (11.2) | 2796 (20.4) |
OT/Recovery | 23718 (16.1) | 22641 (17.0) | 1077 (7.8) |
AED | 65316 (44.4) | 59250 (44.4) | 6066 (44.2) |
Step-down unitsᵇ | 2030 (1.4) | 1625 (1.2) | 405 (3.0) |
Other hospitals | 2454 (1.6) | 2095 (1.6) | 359 (2.6) |
Other sources | 284 (0.2) | 275 (0.2) | 9 (0.1) |
Pre-ICU LOS (days) | 1.04 ± 4.64 | 0.93 ± 3.73 | 2.10 ± 9.70 |
Temperature (°C) | 36.5 ± 1.0 | 36.5 ± 0.9 | 36.0 ± 1.7 |
Mean BP (mmHg) | 86.4 ± 41.4 | 87.1 ± 40.8 | 79.3 ± 47.0 |
Heart rate (bpm) | 100.7 ± 30.9 | 99.6 ± 30.1 | 111.2 ± 36.3 |
Respiratory rate (per minute) | 25.0 ± 15.0 | 24.6 ± 14.9 | 28.8 ± 15.0 |
On ventilator | 29884 (25.5) | 23778 (22.4) | 6062 (55.6) |
FiO2 (%) | 54.4 ± 26.6 | 51.4 ± 25.5 | 68.2 ± 27.5 |
PaO2 (mmHg) | 122.1 ± 77.7 | 121.8 ± 75.2 | 123.3 ± 88.4 |
PaCO2 (mmHg) | 42.7 ± 13.6 | 42.6 ± 13.0 | 43.0 ± 15.9 |
pH | 7.36 ± 0.11 | 7.37 ± 0.1 | 7.30 ± 0.2 |
Serum sodium (mmol/L) | 138.0 ± 5.7 | 137.9 ± 5.5 | 138.5 ± 7.4 |
Urine output (mL / 24 hours) | 1787.8 ± 1539.2 | 1838.7 ± 1523.6 | 1267.8 ± 1600.7 |
Serum creatinine (mg/dL) | 1.59 ± 1.86 | 1.52 ± 1.84 | 2.19 ± 1.93 |
Urea nitrogen (mg/dL) | 26.7 ± 22.7 | 25.4 ± 21.8 | 38.9 ± 27.4 |
Serum glucose (mg/dL) | 165.6 ± 102.8 | 163.9 ± 101.6 | 181.6 ± 112.8 |
Albumin (g/dL) | 2.89 ± 0.72 | 2.95 ± 0.70 | 2.46 ± 0.74 |
Bilirubin (mg/dL) | 1.21 ± 2.28 | 1.09 ± 1.91 | 2.02 ± 3.89 |
Hematocrit (%) | 32.7 ± 7.0 | 32.9 ± 6.9 | 31.0 ± 7.9 |
White blood cell (×10⁹ cells/L) | 12.5 ± 8.6 | 12.1 ± 7.8 | 15.9 ± 13.6 |
Glasgow coma scale | | | |
Eye | 3.5 ± 1.0 | 3.5 ± 0.9 | 2.6 ± 1.4 |
Verbal | 4.0 ± 1.6 | 4.2 ± 1.5 | 2.8 ± 1.8 |
Motor | 5.5 ± 1.3 | 5.6 ± 1.1 | 4.2 ± 2.2 |
Could not assess GCS before medications | 1535 (1.3) | 1138 (1.1) | 397 (3.6) |
Comorbidities | | | |
On dialysis | 4213 (3.6) | 3650 (3.4) | 563 (5.2) |
Cirrhosis | 2035 (1.7) | 1701 (1.6) | 334 (3.1) |
Hepatic failure | 1735 (1.5) | 1464 (1.4) | 271 (2.5) |
Metastatic cancer | 2568 (2.2) | 2113 (2.0) | 455 (4.2) |
Lymphoma | 515 (0.4) | 435 (0.4) | 80 (0.7) |
Leukemia | 884 (0.8) | 696 (0.7) | 188 (1.7) |
Immunosuppression | 3454 (2.9) | 2885 (2.7) | 569 (5.2) |
AIDS | 137 (0.1) | 113 (0.1) | 24 (0.2) |
Received thrombolysis | 1834 (1.6) | 1673 (1.6) | 161 (1.5) |
Emergency operation | 4273 (3.6) | 3733 (3.5) | 540 (5.0) |
OT, operating theatre; AED, accident and emergency department; ICU, intensive care unit; LOS, length of stay; GCS, Glasgow coma scale; AIDS, acquired immunodeficiency syndrome
Data are presented as mean ± standard deviation or n (%)
The number of patients with data available for each variable, p-values, and confidence intervals are shown in Table 7 in Appendix 6.4
ᵃ The percentages of admission source did not add up to 100% because of missing data
ᵇ High-dependency units were counted as step-down units in the Hong Kong dataset
Discrimination
In the whole cohort, the AUPRC was 0.57 for the XGBoost algorithm and 0.49 for the APACHE IV (Fig. 1). In the eICU and Hong Kong datasets individually, the XGBoost algorithm also had a higher AUPRC than the APACHE IV score (eICU: 0.55 vs. 0.45; Hong Kong dataset: 0.71 vs. 0.66) (Additional file 1: Table S2). In the combined data test set, the XGBoost algorithm reached an AUROC of 0.90, compared with 0.87 for the APACHE IV (Additional file 1: Table S3).
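The two discrimination metrics reported above can be illustrated with a minimal standard-library sketch. It is not the study's code: the source does not state how the AUPRC was estimated, so average precision is used here as one common estimator, and the function names and the four-patient toy data are hypothetical.

```python
def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, y_score):
    """Average precision: mean of the precision at each true positive.

    Assumption: ties in y_score are broken by input order, which is
    adequate for a sketch but not for production use.
    """
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


# Hypothetical toy data: 1 = died in hospital, 0 = survived.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
toy_auroc = auroc(labels, scores)
toy_ap = average_precision(labels, scores)
```

Because average precision depends on the outcome prevalence while the AUROC does not, the two metrics can differ widely on the same predictions when deaths are a small share of the cohort.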
Calibration
Figure 2 showed the calibration plot of the whole cohort. The XGBoost curve lay closer to the diagonal reference line than the APACHE IV curve, suggesting better calibration. The calibration plots of the individual datasets were shown in Additional file 1: Table S4. Although the Hosmer-Lemeshow chi-square was 22.31 (p = 0.004), the p-value should be interpreted with caution. A statistically significant Hosmer-Lemeshow test does not necessarily mean that the model fits poorly, because the test, initially developed using a smaller dataset, becomes overpowered when it is applied to a large sample (19, 20). Rather, across the individual risk deciles (Table 2), the difference between observed and predicted mortality ranged from −0.73% to 1.17% only. Table 3 tabulated the SMR, the ratio of observed to predicted mortality, for selected disease groups. The XGBoost algorithm demonstrated an SMR ranging from 0.70 to 1.28. The APACHE IV tended to overestimate mortality, resulting in SMRs ranging from 0.44 to 0.82.
Table 2
Observed and predicted hospital mortality rates of the XGBoost model across risk deciles within the test set (n = 29957)
Risk decileᵃ | Observed Deaths No. (%) | Predicted Deaths No. (%) | Difference % |
1 | 2 (0.07) | 8 (0.27) | -0.20 |
2 | 7 (0.23) | 17 (0.57) | -0.33 |
3 | 21 (0.70) | 28 (0.93) | -0.23 |
4 | 37 (1.23) | 43 (1.44) | -0.20 |
5 | 68 (2.27) | 65 (2.17) | 0.10 |
6 | 78 (2.60) | 100 (3.34) | -0.73 |
7 | 178 (5.94) | 160 (5.64) | 0.30 |
8 | 309 (10.31) | 274 (9.15) | 1.17 |
9 | 554 (18.49) | 534 (17.82) | 0.67 |
10 | 1571 (52.44) | 1572 (52.47) | -0.03 |
ᵃ Risk decile: population sorted by increasing predicted risk and then split into deciles. Hosmer-Lemeshow Chi-square = 22.31, df = 8, p-value = 0.004
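As a hedged illustration of how a Hosmer-Lemeshow statistic and its p-value relate, the sketch below uses only the Python standard library. The function names are hypothetical and this is not the study's code. The tail-probability formula is the closed form that holds for an even number of degrees of freedom, which covers the df = 8 reported here. Feeding the rounded counts from Table 2 into the statistic would not be expected to reproduce 22.31 exactly, since the predicted deaths are rounded to whole numbers.

```python
from math import exp, factorial


def hosmer_lemeshow(observed, expected, sizes):
    """Chi-square statistic from per-group observed and expected deaths.

    Each group contributes (O - E)^2 / (E * (1 - E / n)), which combines
    the terms for deaths and survivors in that group.
    """
    statistic = 0.0
    for o, e, n in zip(observed, expected, sizes):
        statistic += (o - e) ** 2 / (e * (1 - e / n))
    return statistic


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


# The reported chi-square of 22.31 with df = 8 gives p of about 0.004.
p_value = chi2_sf_even_df(22.31, 8)
```

The large-sample concern raised in the text is visible in the formula: with group proportions held fixed, each group's contribution scales with the group size n, so even sub-percentage-point gaps between observed and predicted mortality can produce a significant result in a cohort of this size.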
Table 3
Standardized mortality ratio for selected disease groups in the test set
Disease group | n (eICU) | n (Hong Kong dataset) | Observed mortality, % | XGBoost predicted mortality, % | XGBoost SMR | APACHE IV predicted mortality, % | APACHE IV SMR |
Sepsis (non-urinary tract) | 2717 | 373 | 19.9 | 19.5 | 1.02 | 25.0 | 0.79 |
Cardiac arrest | 678 | 77 | 51.8 | 51.7 | 1.00 | 64.6 | 0.80 |
Emphysema/bronchitis | 629 | 16 | 5.4 | 7.7 | 0.70 | 12.3 | 0.44 |
Thoracotomy of lung, neoplasm | 138 | 17 | 2.6 | 2.5 | 1.05 | 5.3 | 0.49 |
Aortic aneurysm, elective repair | 158 | 76 | 3.9 | 4.0 | 0.99 | 7.9 | 0.50 |
Stroke | 1070 | 35 | 10.6 | 9.2 | 1.15 | 19.0 | 0.56 |
Hepatic failure | 123 | 6 | 17.8 | 19.9 | 0.89 | 27.9 | 0.64 |
Respiratory arrest | 364 | 2 | 21.9 | 17.1 | 1.28 | 26.6 | 0.82 |
Table 4 summarized the discrimination and calibration of the XGBoost model and the APACHE IV, showing the superior performance of the former.
Table 4
Comparison of discrimination and calibration of the XGBoost model and APACHE IV model when applied to the test set
| XGBoost | APACHE IV |
Observed mortality rate (%) | 9.44 | 9.44 |
Expected mortality rate (%) | 9.37 | 13.83 |
SMR | 1.01 | 0.68 |
AUPRC | 0.57 | 0.49 |
AUROC | 0.90 | 0.87 |
Hosmer-Lemeshow Chi-square | 22.31 (p = 0.004) | 711.73 (p < 0.001) |
SMR, standardized mortality ratio; AUPRC, area under the precision-recall curve; AUROC, area under the receiver-operating characteristic curve |
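The SMR values in Table 4 follow directly from the observed and expected mortality rates. A minimal sketch of that arithmetic, with a hypothetical function name, is:

```python
def standardized_mortality_ratio(observed_rate, predicted_rate):
    """Observed mortality divided by predicted mortality (same units)."""
    if predicted_rate <= 0:
        raise ValueError("predicted_rate must be positive")
    return observed_rate / predicted_rate


# Rates (%) as reported in Table 4 for the test set.
observed = 9.44
smr_xgboost = standardized_mortality_ratio(observed, 9.37)
smr_apache = standardized_mortality_ratio(observed, 13.83)
print(round(smr_xgboost, 2), round(smr_apache, 2))  # 1.01 0.68
```

An SMR below 1 means the model predicted more deaths than were observed, which is why the APACHE IV value of 0.68 indicates overestimation of mortality.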
Variable Importance
Figure 3 showed the SHAP variable importance plot. It was made up of individual dots, each representing one observation in the training data. The variables were listed in descending order of importance. The x position of each dot reflected the impact of that variable on the prediction: a positive SHAP value pushed the prediction towards higher mortality, and a negative value towards lower mortality. The color of each dot represented the value of that variable for that observation. For example, older age (shown in red) was positively associated with mortality (the dots lay on the right side of the axis), whereas patients who were not on ventilators (shown in blue, as this was encoded as 0 in the data, compared with 1 for patients on ventilators) were associated with a lower risk of mortality (left side of the axis). Overall, age contributed most to predicting hospital mortality in the XGBoost model, followed by other factors such as heart rate, whether the patient was on a ventilator, bilirubin level, and whether the patient suffered from sepsis (non-urinary tract).
Furthermore, the SHAP variable importance plot visualized the data in an intuitive way. For example, on the right side of the plot (SHAP value > 0.0), both a high heart rate and a low heart rate were positively associated with mortality, but the effect of a low heart rate was stronger than that of a high heart rate. For an individual variable such as heart rate, the dependence plot generated by the SHAP model illustrated this U-shaped relationship (Fig. 4). The SHAP value was lowest for heart rates from about 50 to 100 beats per minute. The effect of severe bradycardia on mortality prediction was greater than that of tachycardia. The effect of interactions with other variables was also shown. For example, among patients with a heart rate of 150 bpm, younger patients had much lower SHAP values than older patients. Dependence plots between SHAP values and other independent variables, and the interactions among these variables, could also be plotted and studied from the system if needed.
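To make the meaning of a SHAP value concrete, the following sketch computes exact Shapley values by brute force for a toy risk score. Everything in it is an assumption for illustration: the `toy_risk` function, its coefficients, and the four background patients are hypothetical and unrelated to the study's XGBoost model, and the brute-force enumeration stands in for the much faster tree-specific algorithms that SHAP tooling typically uses for gradient-boosted trees.

```python
from itertools import combinations
from math import factorial


def toy_risk(age, heart_rate, on_ventilator):
    """Hypothetical risk score, NOT the study's model (U-shaped in heart rate)."""
    return 0.01 * age + 0.0002 * (heart_rate - 75) ** 2 + 0.5 * on_ventilator


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, relative to the background rows."""
    n = len(x)

    def value(subset):
        # Mean output when features in `subset` are fixed to the patient's
        # values and the remaining features come from each background row.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += f(*mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


# Hypothetical background patients: (age, heart rate, on ventilator).
background = [(50, 70, 0), (60, 90, 0), (70, 110, 1), (80, 60, 1)]
patient = (85, 150, 1)
phi = shapley_values(toy_risk, patient, background)
baseline = sum(toy_risk(*row) for row in background) / len(background)
```

The values in `phi` sum to the difference between this patient's score and the background average, so each entry can be read as how far that variable moved the prediction away from the baseline. A positive entry corresponds to a dot on the right side of the axis in a plot such as Figure 3.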