In this paper we describe the development of a well-performing machine learning model to support ICU discharge decisions using routinely collected ICU data. By applying feature importance techniques and displaying the predicted risk of readmission and mortality throughout the admission, application of our model as a bedside decision support tool seems feasible.
Several attempts have been made previously to develop prediction models that prevent untimely discharge from the ICU for general adult intensive care patients [10, 16–19, 21, 23–26, 37]. Earlier models used logistic regression with only a few parameters, whereas newer models use more advanced machine learning algorithms (Additional file 1: Table S8). Our Gradient Boosting machine learning model represents an improvement over the models reported in the literature in terms of ROC AUC and outperforms the purpose-built SWIFT score when validated on our own data set. The improvement in performance is modest, and by choosing a time window of 7 days post-discharge we included more patients who died after discharge, for whom prediction is an easier task [25]. However, given the large and increasing number of ICU admissions worldwide, even a modest reduction in adverse outcomes after discharge may have a significant impact for patients and society.
Our paper has several strengths. Firstly, compared to the current literature, we performed more extensive feature engineering. Unsurprisingly, this allowed the logistic regression model to perform very similarly to the Gradient Boosting model. Extending the feature engineering with, for example, log or inverse transformations and interactions between individual features could further narrow the gap between the logistic regression model and the Gradient Boosting model.
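As a minimal sketch of what such an extension could look like, assuming a scikit-learn-style workflow rather than the pipeline actually used in the study, the following adds log and inverse transforms plus pairwise interaction terms in front of a logistic regression; the synthetic data are a hypothetical stand-in for routinely collected ICU features.

```python
# Illustrative sketch (not the study's actual pipeline): extend the feature
# matrix with log and inverse transforms plus pairwise interactions before
# fitting a logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, StandardScaler

def expand(X):
    """Append log(1 + x) and 1 / (1 + x) transforms (for non-negative features)."""
    return np.hstack([X, np.log1p(X), 1.0 / (1.0 + X)])

model = make_pipeline(
    FunctionTransformer(expand),                       # log / inverse transforms
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Hypothetical non-negative stand-in data with ~5% positives
X, y = make_classification(n_samples=5_000, n_features=15, weights=[0.95], random_state=0)
X = np.abs(X)
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```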
Secondly, and perhaps more importantly, formal evaluation or bedside implementation is currently lacking for most predictive models [38]. Based on our previous experience, developing a bedside decision support tool requires designing the model, pipeline and software with clinical implementation in mind [39, 40]. For predictive models, this involves close collaboration between intensivists and data scientists: extensive feature engineering focused on features that are available in real time, innovative approaches to interpretability, actionable insights and feature importance, and extensive performance evaluations and impact analyses.
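As one illustration of feature importance in this context, permutation importance (an assumption on our part, not necessarily the technique used in the study) ranks features by how much shuffling each one degrades a chosen metric:

```python
# Sketch: permutation importance as one possible interpretability approach
# (illustrative only; data and model are hypothetical stand-ins).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
# Report the five features whose shuffling hurts ROC AUC the most
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: drop in ROC AUC = {result.importances_mean[i]:.4f}")
```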
Our paper also has some limitations. Firstly, performance was measured using internal cross-validation only, not on a separate held-out test set. We specifically chose this approach because of the low incidence of adverse outcomes in our dataset; setting aside a separate test set would have further reduced the power during model development. Since many published models show moderate to good performance only on the data they were trained on, our next step towards implementation is to validate the model on our current electronic health records and on data from other hospitals.
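A minimal sketch of such internal validation, again assuming a scikit-learn workflow with hypothetical stand-in data: stratified folds keep the scarce positive cases proportionally represented in every split, which matters at a ~5% event rate.

```python
# Sketch: stratified cross-validation so each fold preserves the low
# event rate; data and model are hypothetical stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5_000, n_features=20,
                           weights=[0.95], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X, y, cv=cv, scoring="roc_auc")
print(f"cross-validated ROC AUC: {scores.mean():.3f} (folds: {np.round(scores, 3)})")
```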
Secondly, in our Dutch setting, where ICU capacity is strained, we specifically chose to target readmissions and mortality up to 7 days after ICU discharge, rather than the quality indicator of ICU readmission within 2 days, in order to include patients who suffer from complications that typically occur later, such as respiratory failure or sepsis. The significant drop in performance when the model was retrained on the 2-day outcome does show that predicting early readmissions is a more difficult task.
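To make the windowed outcome definition concrete, a small sketch using pandas; the table and column names (`discharge_time`, `readmission_time`, `death_time`) are hypothetical, not the study's actual schema:

```python
# Sketch: deriving the combined outcome label for a configurable
# post-discharge window; column names are hypothetical.
import pandas as pd

def combined_outcome(df: pd.DataFrame, window_days: int = 7) -> pd.Series:
    """1 if readmission or death occurred within `window_days` of ICU discharge."""
    delta = pd.Timedelta(days=window_days)
    # Comparisons against NaT (no event recorded) evaluate to False
    readmitted = (df["readmission_time"] - df["discharge_time"]) <= delta
    died = (df["death_time"] - df["discharge_time"]) <= delta
    return (readmitted | died).astype(int)

# Retraining on the 2-day quality indicator is then just a parameter change:
# y7 = combined_outcome(admissions, window_days=7)
# y2 = combined_outcome(admissions, window_days=2)
```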
Thirdly, predicting and possibly preventing readmissions may not influence outcomes in some health care systems. In fact, even unexpected ICU readmissions do not unequivocally lead to an increase in hospital mortality, as shown in a prospective study [41]. In addition, because we used only routinely collected ICU data, some data that might be predictive of our endpoint, such as pre-hospital status or detailed reasons for (re)admission, could not be used to further improve performance.
For imbalanced datasets, the area under the ROC curve, which is often the only metric reported, may not be a good indicator of model performance. The reason is that the ROC curve is influenced by the large number of true negatives, which are clinically less relevant since these are patients who do well. The false positive rate (1 – specificity), which forms the x-axis of the ROC curve, is calculated as false positives / (false positives + true negatives). With our high number of true negatives, the false positive rate stays low, pushing the curve upwards and to the left, thereby inflating the area under the curve and giving an overly optimistic view of model performance [30, 31]. The PR curve does not suffer from this limitation, since precision, also known as the positive predictive value, equals true positives / (true positives + false positives) and is therefore not influenced by the large number of true negatives. For a perfect model, the area under the PR curve equals 1, as for the ROC curve, but whereas the baseline AUC is fixed at 0.5 for the ROC curve, for the PR curve the baseline AUC equals the proportion of positives, in our case 0.053 (5.3% combined outcome of readmission and mortality). The PR curve (Fig. 1b) shows that the area under the curve (0.198) is much better than this baseline of 0.053, but also that there is still room for improvement, even for the complex models we describe.

Unfortunately, reporting of PR curves is rare in medicine [30] and they are also unavailable for previously developed models targeted at prevention of untimely ICU discharge, which makes rigorous comparisons with our model cumbersome. Given the rapid progression of the field, it is likely that many more classification algorithms will be published for intensive care medicine. As imbalanced datasets are the norm in this setting, it is essential that both ROC and PR curves are reported in future machine learning studies.
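The difference between the two metrics is easy to demonstrate. The sketch below, on synthetic data with roughly 5% positives (a hypothetical stand-in for our cohort, not our actual data), computes both; note how the PR baseline equals the prevalence while the ROC baseline stays at 0.5:

```python
# Sketch: ROC AUC vs PR AUC on an imbalanced problem; with ~5% positives
# the PR chance level equals the prevalence, while the ROC chance level is 0.5.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=30,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(f"ROC AUC:     {roc_auc_score(y_te, proba):.3f}  (chance level: 0.500)")
print(f"PR AUC (AP): {average_precision_score(y_te, proba):.3f}  (chance level: {y_te.mean():.3f})")
```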
Good model performance alone is insufficient for a useful bedside tool in the ICU. Whether a model will be adopted in clinical practice also depends heavily on ease of use and on the trade-off between the cost associated with a readmission (mortality, length of stay) and the cost of an unnecessarily prolonged stay (length of stay, cancelled elective surgery or denied admissions). For the distinct groups of patients that we identified earlier using readmission probability-time curves, our explorative impact analysis showcases that using these curves may improve discharge management by preventing readmissions and deaths from premature ICU discharge at the cost of only a small increase in total length of stay. Furthermore, if a fraction of the patients who do not seem to improve during the last days of the admission (group 2b in Fig. 5) were discharged earlier as a consequence of using our model, the impact could include both a reduction in total length of stay and a reduction in readmission rate. The promising results of our explorative analysis of clinical impact and the vast potential benefit for our future critically ill patients have prompted us to proceed with validation and implementation of our model at the bedside in Amsterdam UMC.
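As a purely hypothetical rendering of how such probability-time curves might be surfaced at the bedside (not the tool described here), one could re-score a fitted model at fixed intervals during the admission and plot the trajectory; the `threshold` parameter is an illustrative placeholder, not a validated cut-off:

```python
# Hypothetical sketch: plot a per-patient risk trajectory by re-scoring a
# fitted model on features aggregated up to each timepoint.
import matplotlib.pyplot as plt

def plot_risk_trajectory(model, snapshots, times, threshold=0.2):
    """snapshots: (n_timepoints, n_features) array for one patient, one row
    per scoring time; times: hours since ICU admission."""
    risk = model.predict_proba(snapshots)[:, 1]
    plt.plot(times, risk, marker="o", label="predicted risk")
    plt.axhline(threshold, linestyle="--", label="illustrative discharge threshold")
    plt.xlabel("hours since ICU admission")
    plt.ylabel("predicted readmission/mortality risk")
    plt.legend()
    plt.show()
```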