In this retrospective study, we developed and validated an algorithm that identifies patients at risk of ARF requiring advanced respiratory support. The algorithm continually produced risk scores at 3-hour intervals, based on updated patient EHR data, to predict the risk of ARF within the next 24 hours. An alert was generated only once per patient, when the risk score first exceeded a predefined threshold prior to ARF onset. The MLA demonstrated superior performance, with higher sensitivity and specificity than the SpO2 and MEWS comparators, when using time-sensitive ROC evaluation methods. Importantly, the MLA generated a greater number of early alerts before ARF onset at a lower false positive rate, reducing the burden of false alarms.
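The single-alert logic described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, score values, and the threshold are hypothetical, not the study's actual model or operating point.

```python
def first_alert(risk_scores, threshold=0.5):
    """Return the index (3-hour step) at which the first alert fires, else None.

    risk_scores: scores produced every 3 hours from updated EHR data.
    An alert is generated only once, when the score first exceeds the
    threshold; subsequent scores do not trigger further alerts.
    """
    for step, score in enumerate(risk_scores):
        if score > threshold:
            return step  # single alert per patient
    return None

# Illustrative trajectory: scores rise as the patient decompensates.
scores = [0.12, 0.18, 0.31, 0.47, 0.62, 0.71]
print(first_alert(scores))  # → 4 (first score above 0.5)
```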
An easily implementable, automatic clinical decision support tool that does not require physician input or interrupt clinical workflow may supplement clinical assessment in hospitals lacking adequate staffing or resources. The continually running algorithm, which analyzes new patient data every three hours to predict ARF risk within a 24-hour prediction window, would provide clinically relevant alerts to health care providers. The advanced respiratory support treatments in the MLA's definition of ARF include NIV procedures and high-flow nasal cannula, allowing identification of patients early in their trajectory of decompensation and enabling prophylactic measures. This would also permit earlier consideration of appropriate ventilation escalation, including techniques that are protective against lung injury (45–48).
The MLA performed with higher sensitivity and specificity on the temporal testing and external validation sets (tAUROC of 0.858/0.883, respectively) than SpO2 (0.771/0.810) and the MEWS score (0.676/0.774). Other studies have created algorithms predicting ARF onset (30, 49); however, direct comparisons are difficult due to differing methodologies and evaluation metrics. Wong et al. (30) developed an XGBoost ML model with a similar definition of ARF that provided a single alert with a 3-hour prediction window from the majority vote of 8 predictions over the span of 8 hours. In comparison, we generated risk scores with a 24-hour prediction window at 3-hour intervals from updated EHR data until a risk score exceeded a specific threshold and an alert was generated. Although the algorithm from Wong et al. outperformed MEWS with an AUROC of 0.85 versus 0.57, it is a complex algorithm with 70 input variables, many of which may not be immediately available in many healthcare systems, limiting its applicability. Kim et al. (49) designed an MLA in which acute respiratory failure was defined as endotracheal intubation, making risk predictions within a window of 6 hours to 1 hour prior to onset using nine features. While they also achieved high AUROC values, their definition of ARF included only endotracheal intubation, and their prediction window was shorter.
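The AUROC figures compared above all reduce to the same rank statistic: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal pure-Python sketch (the scores below are illustrative, not study data, and the study's time-sensitive tAUROC variant is not reproduced here):

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a positive case outranks a negative one.

    Equivalent to the Mann-Whitney U statistic normalized by the number of
    positive/negative pairs; tied scores count as half a win.
    """
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative scores: patients who developed ARF vs. those who did not.
print(auroc([0.9, 0.8, 0.6], [0.7, 0.4, 0.3]))  # → 0.888... (8 of 9 pairs ranked correctly)
```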
The features that contributed most to the model's performance were age, SpO2, BUN, pH, and respiratory rate for both the temporal testing and external validation sets. Age is known to contribute to ARF due to age-related structural changes of the respiratory system (13, 50). BUN is a marker of renal function (51), and its importance may reflect the growing body of evidence that lung-kidney interactions are involved in the renal consequences of ARF (52). SpO2 has previously been identified in the literature (39, 40) and in discussions with clinicians as critical for monitoring patient respiratory status. PaO2 was of lower importance to MLA predictions in both data sets. One reason may be that SpO2 is a non-invasive measurement that is typically monitored continually, with frequent EHR updates. In contrast, PaO2 is a laboratory test with a high degree of data missingness that may be ordered only after the clinician already suspects a decline in patient respiratory status. Although pH had high data missingness similar to PaO2, pH appears to be more informative for the model than PaO2: any measured pH value resulted in a positive SHAP value for predictions, while a missing value resulted in a slightly negative one. COVID-19-positive status was also of high importance in the MLA prediction of ARF requiring respiratory support, as may be expected (53–55).
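The SHAP attributions discussed above are Shapley values: each feature's contribution averaged over all orderings in which features could be revealed to the model. The exact computation can be shown on a toy value function; the three feature names and weights below are purely illustrative, and the study itself applied SHAP to the full trained model rather than anything this simple.

```python
from itertools import combinations
from math import factorial

def shapley(features, value):
    """Exact Shapley values for a small feature set.

    value(subset) -> model output when only the features in `subset` are
    known (the rest treated as missing). Feasible only for a handful of
    features; SHAP libraries approximate this for real models.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Weight of this coalition in the Shapley average.
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(set(S) | {f}) - value(set(S)))
        phi[f] = total
    return phi

# Toy additive value function: each known feature adds a fixed amount of risk.
weights = {"pH": 0.10, "SpO2": 0.25, "age": 0.05}  # illustrative, not study values
phi = shapley(list(weights), lambda S: sum(weights[f] for f in S))
# For an additive value function, each feature's Shapley value equals its weight.
print(phi)
```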
An important aspect of designing clinical decision support systems is the clinical relevance and timing of alerts. Based on the literature and discussions with clinician experts, early alerts prior to ARF onset (30), and especially within 24 hours of onset, are considered most useful (Additional File 2). With early alerts, health care providers may implement prophylactic or preventive therapies to mitigate ARF onset. Therefore, we evaluated algorithm performance with respect to the time of ARF onset. The MLA achieved a similar or lower false positive rate than the comparators. While SpO2 achieved a false positive rate (13.4%) similar to that of the MLA (13.6%), the MLA identified far more true positive cases before ARF onset, and the MEWS comparator produced more than 2.5 times as many false positives as the algorithm.
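The onset-anchored evaluation described above can be sketched at the alert level. This is a simplified illustration under stated assumptions: one alert per patient, a true positive is an alert raised strictly before ARF onset, and a false positive is an alert in a patient who never develops ARF; the patient IDs and times are invented.

```python
def alert_metrics(alerts, onsets):
    """Classify single per-patient alerts relative to ARF onset.

    alerts: {patient_id: alert time in hours, or None if no alert}
    onsets: {patient_id: ARF onset time in hours, or None if no ARF}
    Alerts at or after onset are not counted as early detections.
    Returns (true positives, false positive rate among non-ARF patients).
    """
    tp = fp = 0
    for pid, alert_t in alerts.items():
        if alert_t is None:
            continue
        onset_t = onsets.get(pid)
        if onset_t is not None and alert_t < onset_t:
            tp += 1
        elif onset_t is None:
            fp += 1
    negatives = sum(1 for t in onsets.values() if t is None)
    return tp, (fp / negatives if negatives else 0.0)

# Illustrative four-patient cohort.
alerts = {"a": 6, "b": None, "c": 3, "d": 12}
onsets = {"a": 20, "b": None, "c": None, "d": 30}
print(alert_metrics(alerts, onsets))  # → (2, 0.5)
```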
There are several strengths to this analysis. The patient data used in this study were obtained from multiple sites throughout the United States and represent a heterogeneous patient population. The analysis evaluates the model on both a temporally distinct set and an external validation set, demonstrating generalizability. This study is limited by its retrospective nature and the lack of prospective validation to ascertain clinician response to alerts in practice. We cannot predict how the MLA may perform in different populations or settings. Finally, the MEWS score was calculated without the AVPU assessment, as it was not available in the dataset; therefore, MEWS performance may be underestimated in this study.
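The AVPU-free MEWS calculation noted in the limitations can be sketched as follows. The scoring bands below follow the commonly published MEWS chart rather than anything specified in this study, and thresholds vary by institution, so this is an illustrative sketch, not the study's exact implementation.

```python
def mews_no_avpu(sbp, hr, rr, temp_c):
    """Modified Early Warning Score without the AVPU component.

    Bands follow commonly published MEWS charts (verify against local
    protocol); AVPU (0-3 points) is omitted, so totals are bounded lower
    than full MEWS, which may understate risk for obtunded patients.
    """
    score = 0
    # Systolic blood pressure (mmHg)
    if sbp <= 70: score += 3
    elif sbp <= 80: score += 2
    elif sbp <= 100: score += 1
    elif sbp >= 200: score += 2
    # Heart rate (beats/min)
    if hr < 40: score += 2
    elif hr <= 50: score += 1
    elif hr <= 100: pass
    elif hr <= 110: score += 1
    elif hr <= 129: score += 2
    else: score += 3
    # Respiratory rate (breaths/min)
    if rr < 9: score += 2
    elif rr <= 14: pass
    elif rr <= 20: score += 1
    elif rr <= 29: score += 2
    else: score += 3
    # Temperature (degrees Celsius)
    if temp_c < 35: score += 2
    elif temp_c >= 38.5: score += 2
    return score

print(mews_no_avpu(sbp=85, hr=115, rr=24, temp_c=38.6))  # → 7 (1 + 2 + 2 + 2)
```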