This single-center study aimed to validate APACHE II, APACHE III and SAPS II for mortality prediction in a 10-bed ICU in Poland. We found that although all the scores were statistically acceptable in predicting mortality, their value for 12-month prognostication proved limited from a clinical point of view.
We found that the in-hospital ICU mortality rate was 35.6%, which was relatively high compared with international data, but lower than the value observed in the Silesia region (43.7%) (24). The higher mortality in Polish ICUs compared with other European countries (25), which has been under debate in recent years, is probably due to differences in patient populations, indications for ICU admission, the availability of ICU beds and the organization of end-of-life care in Poland. It may also reflect the skeptical attitude of some practitioners toward guidelines on futile therapy (26,27) and official ICU admission criteria (28). Although patients admitted to Polish ICUs are more often at higher risk of death than those in other countries, ICU mortality observed in the Silesian Registry of Intensive Care Units was lower than that predicted by the APACHE II score (29).
In our study, APACHE II, APACHE III and SAPS II scores, and the corresponding predicted ICU mortality, were as follows: 19 points (IQR 12-24) (mortality rate of 25.8%; IQR 12.1-46); 67 points (IQR 36.5-88) (mortality rate of 18.5%; IQR 3.8-41.8); and 44 points (IQR 27-56) (mortality rate of 34.8%; IQR 7.9-59.8), respectively. APACHE II and SAPS II had comparable observed-to-expected mortality ratios, close to 1.0. For APACHE III, the ratio was surprisingly high and reached 1.38, i.e. observed mortality exceeded predicted mortality, whereas the scores usually overestimate mortality (30). The cause of this phenomenon appears to be complex, and may result from substantial differences between the patient population in our unit (mixed admissions, including post-operative cases as the first priority) and the target populations for which these prognostic models were developed. Medical patients were confirmed to have higher mortality than surgical patients, which is in line with previous research on this issue (31).
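The observed-to-expected mortality ratio discussed above can be illustrated with a minimal sketch. The function name `observed_to_expected` and the toy cohort are hypothetical and are not study data; the sketch assumes that expected deaths are obtained by summing the per-patient predicted probabilities of death.

```python
def observed_to_expected(died, predicted_risk):
    """Observed deaths divided by the sum of predicted death probabilities."""
    if len(died) != len(predicted_risk):
        raise ValueError("inputs must have the same length")
    observed = sum(died)            # number of deaths actually seen
    expected = sum(predicted_risk)  # deaths the score predicts for the cohort
    if expected <= 0:
        raise ValueError("expected deaths must be positive")
    return observed / expected


# Hypothetical toy cohort (NOT study data): 1 = died, 0 = survived.
died = [1, 0, 1, 0, 0, 1, 0, 0]
predicted_risk = [0.6, 0.1, 0.5, 0.2, 0.1, 0.3, 0.1, 0.1]

oe_ratio = observed_to_expected(died, predicted_risk)
print(f"O/E ratio: {oe_ratio:.2f}")
```

A ratio above 1.0 means that more patients died than the score predicted (underestimation), while a ratio below 1.0 means the score overestimated mortality.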
The reliability of the data collected is important because poor source data quality, as well as the number and type of missing physiological variables, can influence mortality predictions. In the original APACHE II study, variables were missing in 13% of cases (32). In our data series, a total of 14% of variables were missing across all three scores, which should be taken into account in data interpretation. The process of data collection is burdened with a high risk of bias. In the case of APACHE II scores, it was observed that the main causes of data errors are inconsistent choices between the highest and lowest values and problems with GCS score determination in sedated patients (32). We used the pre-sedation GCS in sedated patients if available, and data were always verified independently by two members of the study team.
Two main objective criteria are used to evaluate the performance of prognostic scales: calibration and discrimination. Discrimination refers to the ability of a prognostic score to classify patients as survivors or non-survivors and is measured by ROC curves (i.e. AUC and 95%CI). Calibration refers to how closely the estimated probabilities of mortality agree with the observed mortality; it is of great importance for clinical trials or comparison of care between ICUs, and is depicted graphically or assessed using goodness-of-fit tests. Discrimination in our study was acceptable: all three investigated scores predicted in-hospital mortality with an AUC of almost 0.8, with no statistically significant differences between them. In terms of post-discharge mortality prediction, the diagnostic accuracy of the scores was also acceptable in terms of AUCs (i.e. >0.7) but was of borderline clinical relevance (the AUC was closer to 0.5, which indicates no discrimination, than to 1.0, which indicates a perfectly accurate test). However, it is vital to note that the AUC by itself lacks clinical interpretability, as it does not reflect clinical relevance. Because an AUC measures performance over all thresholds (cut-offs) for the scores, it includes both clinically relevant and clinically illogical ones. Therefore, clinical interpretation of AUCs remains difficult (33).
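To make the notion of discrimination concrete, the following minimal sketch computes an AUC directly from its probabilistic definition: the chance that a randomly chosen non-survivor has a higher score than a randomly chosen survivor, with ties counted as one half. The function name `auc` and the toy values are hypothetical and are not study data; this is an illustration rather than the procedure used in the analysis.

```python
def auc(scores, died):
    """AUC as the share of (non-survivor, survivor) pairs ranked correctly."""
    non_survivors = [s for s, d in zip(scores, died) if d == 1]
    survivors = [s for s, d in zip(scores, died) if d == 0]
    if not non_survivors or not survivors:
        raise ValueError("need at least one survivor and one non-survivor")
    wins = 0.0
    for dead_score in non_survivors:
        for alive_score in survivors:
            if dead_score > alive_score:
                wins += 1.0   # pair ranked correctly
            elif dead_score == alive_score:
                wins += 0.5   # tie counts as half
    return wins / (len(non_survivors) * len(survivors))


# Hypothetical toy scores (NOT study data): 1 = died, 0 = survived.
scores = [30, 12, 25, 18, 9, 22, 18, 15]
died = [1, 0, 1, 0, 0, 1, 1, 0]

toy_auc = auc(scores, died)
print(f"AUC: {toy_auc:.3f}")
```

Because every possible pair is compared, the result summarizes performance over all cut-offs at once, which is why a single AUC value says little about any one clinically chosen threshold.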
Our observations are consistent with previous studies demonstrating the high accuracy of the scores in short-term prognostication (31,34–36). Although all the scores had comparable AUCs, APACHE II and SAPS II seemed to perform better from a clinical point of view, as their observed-to-expected mortality ratios were 1.12 and 0.96 compared with 1.38 for APACHE III. In a study by Beck et al., who validated the same prognostic models in 16,646 adult ICU patients in the southern UK, although similarly good discrimination was reported for all three scales, calibration was imperfect (31). The APACHE II score was more reliable than SAPS II and APACHE III in ICU patients in a study by Gilani et al. (35). Similar findings come from a study by Khwannimit et al., who compared SAPS II and APACHE II. Although the latter model performed better in Thai ICU patients, the calibration of both scores was also poor. In contrast, Sungurtekin et al. reported better prognostic accuracy for SAPS II than APACHE II in organophosphate-poisoned ICU patients (37). Another study by Godinjak et al. demonstrated comparably high diagnostic accuracy of APACHE II and SAPS II (36).
Calibration of our scores was good in terms of chi-squared and ‘p’ values. However, as the application of the Hosmer-Lemeshow test has recently been criticized (38), we also drew calibration curves to visualize the goodness-of-fit. While the high rate of events (i.e. deaths) despite the small sample size is a strength of our study for the whole cohort, the calculations of predicted mortality performed in subgroups of patients were rather underpowered. On the one hand, this drawback encourages us to extend this prospective analysis to a larger group of patients. On the other hand, it must be remembered that the population of critically ill subjects changes over time and, therefore, diagnostic accuracy parameters can change dynamically (39). Differences in the performance of scores may result from variation in the case mix, standards, the structure and organization of medical care, as well as lifestyles and genetic differences between populations (7). Therefore, despite numerous studies performed so far on this subject, there is still a need to validate these prognostic models using data from independent samples from different ICUs in different countries, or even regions, at repeated time intervals.
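As an illustration of the calibration assessment mentioned above, the sketch below computes a Hosmer-Lemeshow-type chi-squared statistic by sorting patients by predicted risk, splitting them into equally sized groups, and comparing observed with expected deaths in each group. The function name `hosmer_lemeshow`, the grouping rule and the toy data are assumptions made for illustration only and are not taken from the study; the p-value is deliberately omitted because it requires a chi-squared distribution function (conventionally with the number of groups minus 2 degrees of freedom).

```python
def hosmer_lemeshow(died, predicted_risk, groups=10):
    """Return (statistic, table); table rows are (n, observed, expected)."""
    pairs = sorted(zip(predicted_risk, died))  # order patients by predicted risk
    total = len(pairs)
    if groups < 1 or groups > total:
        raise ValueError("groups must be between 1 and the number of patients")
    statistic = 0.0
    table = []
    for g in range(groups):
        chunk = pairs[g * total // groups:(g + 1) * total // groups]
        n = len(chunk)
        observed = sum(d for _, d in chunk)   # deaths seen in this risk group
        expected = sum(p for p, _ in chunk)   # deaths predicted in this group
        table.append((n, observed, expected))
        # Groups with expected deaths of 0 or n carry no information; skip them.
        if 0 < expected < n:
            statistic += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return statistic, table


# Hypothetical toy cohort (NOT study data): two risk strata of five patients.
predicted_risk = [0.2] * 5 + [0.8] * 5
died = [0] * 5 + [1] * 5

hl_statistic, hl_table = hosmer_lemeshow(died, predicted_risk, groups=2)
print(f"Hosmer-Lemeshow statistic: {hl_statistic:.2f}")
```

The per-group rows in `hl_table` are the same quantities a calibration curve plots (observed versus expected mortality per risk group), which is why the curve is a useful visual complement to the single test statistic.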
Although we found some differences in the values of AUCs between surgical and medical patients, previous investigations have confirmed that surgical patients generally have a better survival prognosis than medical ICU patients (6,34). The explanation is quite simple: in surgical patients, the reason for ICU admission is mostly an unstable condition resulting from a long and extensive surgical procedure, rather than a poor general condition prior to surgery or comorbidities.
While all three investigated scores predicted 12-month post-discharge mortality in a statistically significant way, their diagnostic accuracy was much lower than for in-hospital mortality (AUC of ~0.7). In a study by Angus et al. (19), the APACHE II score was also predictive of 1-year mortality (AUC of 0.671) in patients undergoing liver transplants. In contrast, a study by Lee et al. reported no relation between the scores calculated on admission and post-discharge mortality (40). Lower diagnostic accuracy in predicting long-term mortality could be due to various reasons. The scores are calculated during the first 24 h following admission, using the worst results. The treatment implemented during the ICU stay, possible complications, and the quality of follow-up care and rehabilitation all influence the patient’s outcome and can change the results provided by the scoring systems. Lee et al. found that the discharge APACHE II score was a good predictor of post-ICU mortality and readmission (40). Therefore, to estimate long-term prognosis, it would be more reasonable to focus on scores calculated at the patient’s discharge from the ICU. Because currently available tools were not originally designed for such an application, further studies should be conducted to create scores that estimate long-term prognosis. In this context, one ought to bear in mind that proper screening and accurate identification of patients who remain at risk after successful discharge from the ICU may be of great importance in order to avoid ICU readmissions, further deterioration of quality of life and higher post-discharge mortality.
The present study has some limitations. Those related to validation have been described above. However, one ought also to remember that, because this was a single-center study, bias may arise from the heterogeneous population and relatively small sample size. The final scores may be affected by the confounding effect of the data selection process and the calculation of Glasgow Coma Scale results. The follow-up period in our study was limited to 12 months after the date of ICU admission. Finally, we did not include the SOFA score in our analysis. However, as this particular scoring system was primarily created for prognostication among septic patients, it seems less comprehensive in the mixed ICU setting than APACHE or SAPS (41).