Predictors of Mortality Using Machine Learning Decision Tree Algorithm in Critically Ill Adult Patients with COVID-19 Admitted to the ICU


Background: The Coronavirus Disease 2019 (COVID-19), caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), is a major cause of intensive care unit (ICU) admissions globally. Data on the epidemiology, characteristics, and outcomes of the disease from different regions and populations have shown considerable variation. However, a limited number of reports have addressed predictors of mortality using machine learning methods. Herein, we aimed to describe the association of a predefined set of variables with 28-day ICU outcome among adult COVID-19 patients admitted to the ICU using a machine learning decision tree (DT) algorithm.
Methods: This was a prospective/retrospective, multicenter cohort study from 14 hospitals in Saudi Arabia. We included critically ill COVID-19 patients admitted to the ICU between March 1, 2020, and October 31, 2020. The primary outcome was 28-day ICU mortality. Secondary outcomes were 90-day mortality and ICU length of stay. The predictors of mortality were identified using two predictive models: conventional logistic regression and DT analysis.
Results: A total of 1468 critically ill COVID-19 patients were included. The mean age was 55.9 (SD ±15.1) years, and 74% of the patients were male. The 28-day ICU mortality was 540 (36.8%), while 90-day mortality was 600 (40.9%). The multivariable logistic regression model demonstrated that the PaO2/FiO2 ratio on ICU admission and the need for intubation or vasopressors could strongly predict 28-day ICU mortality. The DT algorithm identified five variables (need for intubation, need for vasopressors, age, gender, and PaO2/FiO2 ratio) arranged in an algorithmic fashion to predict 28-day ICU outcome.
Conclusion: Five clinical predictors of 28-day ICU outcome were identified using DT algorithmic analysis of COVID-19 patients admitted to the ICU. The findings of this DT analysis may be used in the ICU for early identification of critically ill COVID-19 patients who are at high risk of 28-day mortality.


Background
The Coronavirus disease 2019 (COVID-19), caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), was first discovered in Wuhan city in late 2019 [1]. The World Health Organization (WHO) declared the disease a pandemic on March 11, 2020 [2]. Since then, many investigators have reported data on the epidemiology, characteristics, and outcomes of the disease from different regions and populations, with variable results [3][4][5][6][7][8]. However, a limited number of reports have addressed predictors of mortality using machine learning methods [9]. Herein, we conducted this study to evaluate predictors of 28-day ICU mortality based on a predefined set of variables among COVID-19 adults admitted to the intensive care unit (ICU) using a machine learning decision tree (DT) algorithm.

Study design
This was a prospective/retrospective, multicenter national study. We included COVID-19 patients admitted to the ICU between March 1, 2020, and October 31, 2020. Institutional review board (IRB) approvals were obtained from the Central Institutional Review Board at the Saudi Ministry of Health and the Ethical Boards of each participating center.

Setting
The study was conducted in 14 participating centers across the Kingdom of Saudi Arabia. The participating ICUs were in accredited governmental and non-governmental tertiary hospitals. The multidisciplinary treating team included critical care physicians (consultants, specialists, and residents), registered ICU nurses, respiratory therapists, clinical pharmacists, and other ICU care providers who practiced according to national and international published protocols and guidelines. In addition, during the time of the COVID-19 surge, non-ICU physicians from different specialties joined the critical care team under the supervision of intensivists after receiving basic ICU management training.

Patients
We included adult patients admitted to the ICU of participating hospitals with COVID-19 confirmed by detection of SARS-CoV-2 using real-time polymerase chain reaction (RT-PCR) in nasopharyngeal swabs or tracheal aspirate specimens. Immunocompromised status was defined as solid organ malignancy, leukemia, current use of steroids (prednisone >7mg daily for >2 weeks), post organ transplantation at any time, or rheumatological disease on immunomodulators (such as azathioprine, methotrexate, infliximab, mycophenolate mofetil, or others). Infection was defined by a positive culture in blood or tracheal aspirate.

Data collection
Data were collected using the Research Electronic Data Capture (REDCap, Vanderbilt University, Nashville, TN) [10]. The collected data included patients' demographics, comorbidities, signs and symptoms of COVID-19 illness, laboratory abnormalities, mechanical ventilation (MV) utilization, adjunctive interventions, medications, complications, and outcomes.

Outcomes measures
The primary outcome was 28-day ICU mortality. Secondary outcomes were 90-day ICU mortality and hospital length of stay.

General analysis
Patients' characteristics were summarized using frequencies and percentages for categorical variables and as medians and interquartile ranges (IQR) or mean and standard deviation (SD) for continuous variables. Chi-square or Fisher's exact tests compared categorical variables. The Wilcoxon rank-sum test was used for continuous variables. We constructed Kaplan-Meier curves to assess cumulative mortality during the initial 60 days from ICU admission.
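The Kaplan-Meier estimate of cumulative mortality mentioned above can be illustrated with a minimal sketch. The function name `kaplan_meier` and the toy follow-up data are assumptions made for illustration only; they are not the study's code or data, and the study's curves were produced in R.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death occurred.

    times  : follow-up time in days for each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every patient leaves the risk set at their time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data (not from the study)
times = [5, 10, 10, 20, 30, 60, 60, 60]
events = [1, 1, 0, 1, 0, 0, 0, 0]
for t, s in kaplan_meier(times, events):
    print(f"day {t}: cumulative mortality = {1 - s:.3f}")
```

Cumulative mortality at a given day is one minus the estimated survival probability, which is how a curve of this kind can be read at day 60.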
Risk factors for mortality in the first 28 days of ICU stay were evaluated in the whole cohort using univariate and multivariable logistic regression analyses. Variables included in the multivariable logistic regression model were identified based on literature review and comprised demographics, comorbid conditions, laboratory data on ICU admission, the respiratory component of the Sequential Organ Failure Assessment (SOFA) score [11], and the need for intubation or vasopressors. In the regression analysis, the need for intubation and the need for vasopressors referred to the first five days of ICU admission, and the respiratory component of the SOFA score on ICU admission was classified into two groups (category 4 was PaO2/FiO2 <100 with respiratory support, while categories 0-3 were PaO2/FiO2 >100). Continuous variables were categorized using cut-off points based on either previous literature or optimal cut-off points identified statistically using the cutpointr package in R by maximizing the Youden index, which determines the split point between survivors and non-survivors. The logistic regression results were reported as odds ratios (OR) with 95% confidence intervals (95% CI). All statistical tests were two-tailed, and p values < 0.05 were considered significant. All statistical analyses were performed using R software.
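To make the cut-off selection concrete, the following is a minimal sketch of choosing a threshold by maximizing the Youden index, J = sensitivity + specificity - 1. It is not the R package's implementation; the function name `youden_cutoff`, the variable names, and the toy values are hypothetical, and the sketch assumes that higher values indicate higher risk and that both outcome classes are present.

```python
def youden_cutoff(values, died):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A patient is predicted to die when value >= cutoff. This assumes
    higher values mean higher risk (negate the values otherwise) and
    that there is at least one death and one survivor.
    """
    positives = sum(died)
    negatives = len(died) - positives
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, d in zip(values, died) if d == 1 and v >= cutoff)
        tn = sum(1 for v, d in zip(values, died) if d == 0 and v < cutoff)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # ties keep the lowest cutoff
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical toy data (not from the study): a laboratory ratio and outcome
ratio = [2.1, 3.4, 4.0, 5.2, 6.8, 7.5, 9.1, 12.3]
died = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j = youden_cutoff(ratio, died)
print(f"optimal cutoff = {cutoff}, Youden J = {j:.2f}")
```

Scanning every observed value as a candidate threshold is quadratic in the number of patients, which is acceptable for a sketch; a production implementation would sort once and update the counts incrementally.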

Decision Tree (DT) analysis
Machine learning DT analysis was used to identify characteristics of patients with COVID-19 that were predictive of 28-day ICU outcome. The model was generated using the standard settings in the open-source software library Waikato Environment for Knowledge Analysis (WEKA, University of Waikato) [12] with the C4.5 classification algorithm (J48), setting 20 cases as the minimum number of cases at each leaf (end of a branch). The C4.5 classifier uses the information gain ratio as its split criterion to reduce bias towards multivalued attributes [13]. We used accuracy as a general measure of classifier performance; it is one of the most commonly used performance measures and represents the overall correctness of the algorithm. The Area Under the Receiver Operating Characteristic Curve (AUROC) was also used to evaluate the performance of the DT model; it summarizes the predictive performance of the model across different thresholds, with sensitivity (true positive rate, TPR) plotted against 1-specificity (false positive rate) [14]. TPR is the number of positive cases correctly identified by the algorithm divided by the total number of positive cases (true positives + false negatives). To assess the model's generalizability and avoid overfitting, 10-fold cross-validation was applied to generate the accuracy, the AUROC, and its confidence intervals.
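The gain ratio split criterion described above can be sketched in a few lines. This is an illustrative re-implementation of the textbook definition (information gain divided by split information), not WEKA's J48 code; the function names and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(attribute, labels):
    """Information gain of splitting on `attribute`, divided by split info."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    # Weighted entropy of the outcome within each attribute value
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute itself; it grows
    # with the number of distinct values and so penalizes such attributes
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data (not from the study)
outcome = [1, 1, 1, 0, 0, 0, 0, 1]      # 1 = died, 0 = survived
intubated = [1, 1, 1, 1, 0, 0, 0, 0]    # binary attribute
patient_id = list(range(8))             # one distinct value per patient

print(f"gain ratio, intubated:  {gain_ratio(intubated, outcome):.3f}")
print(f"gain ratio, patient_id: {gain_ratio(patient_id, outcome):.3f}")
```

In this toy example the identifier attribute has a raw information gain of 1 bit because it separates every patient, yet dividing by its split information of 3 bits shrinks its score to one third, which illustrates how the ratio reduces (without fully removing) the bias towards attributes with many values.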

Patient characteristics and ICU admission data
During the study period, 1468 patients were admitted to the ICUs of the 14 participating hospitals. Table 1 describes the patients' demographics and data in the first 24 hours of ICU admission among the 28-day ICU survivors vs. non-survivors. The mean age was 55.9 (SD ±15.1) years, 74% of the patients were male, and 69 patients (4.8%) were healthcare workers. Hypertension, ischemic heart disease, and smoking were significantly associated with higher 28-day ICU mortality (P value 0.0187, 0.0016, and 0.0333, respectively). The SOFA score, 7 (IQR 4-10), was significantly higher in patients who died within the first 28 days of ICU admission. Survivors had a higher PaO2/FiO2 ratio compared to non-survivors at [...]

The Kaplan-Meier curve for COVID-19 cumulative incidence of mortality showed 40% mortality at day 60 of ICU admission (Figure 1).

[...] (Table 4). The stepwise logistic regression analysis retained eight variables: age group, gender, diabetes mellitus (DM), ischemic heart disease (IHD), the respiratory component of the SOFA score (category 4), need for intubation, need for vasopressors, and neutrophil-lymphocyte (NL) ratio (Figure 3).

Results of the decision tree analysis
Five variables were identified and allocated the patients into the final binary outcome (survival versus mortality). These variables were the need for intubation, the need for vasopressors, gender, PaO2/FiO2 ratio on ICU admission, and age category. The constructed DT assigned the root node (start of the tree, or first splitting criterion) to the need for intubation. The tree then continued to grow and assign patients to their respective groups using the other four variables sequentially. The DT model's ability to correctly assign patients to their respective groups (model discrimination), assessed using the AUROC, was 75.42% (95% CI 74.84-78.95). The DT model accuracy was 73.1% (number of patients retained in the model: 1043 out of 1468) (Figure 2).

Discussion
We utilized the DT analysis and identified five features that are predictive of 28-day ICU outcomes. These features are the need for intubation, the need for vasopressors, age, gender, and PaO2/FiO2 ratio.
The COVID-19 pandemic overwhelmed health care systems, leading to constraints on medical resources, mainly critical care unit capacity, and a shortage of mechanical ventilators [15][16][17]. Many hospitals utilized machine learning-based analyses combining clinical, radiological, and laboratory data for the prognostication and rapid risk stratification of PCR-confirmed COVID-19 patients [18][19][20]. The severity of illness of patients admitted to the ICU has been evaluated using general scoring methods such as the Acute Physiology and Chronic Health Evaluation (APACHE) II and IV [21,22], the Simplified Acute Physiology Score (SAPS) [23], and the SOFA score [11], or COVID-19-specific scores such as the 4C mortality score [24]. The APACHE-II score, the most commonly utilized score on ICU admission, is subjective, time-consuming, and depends mainly on laboratory indicators that are not comprehensive enough to predict the outcome of the newly emerged COVID-19 [25]. Standard logistic regression analysis is useful in predicting outcomes of interest; however, it does not model nonlinear relationships in multidimensional data [26].
Machine learning models have been increasingly applied in the medical field, especially in the prediction of cancer outcomes [27][28][29]. Random Forest classifiers, decision trees, and artificial neural networks (ANNs) specifically were among the earliest techniques used in medical research [30,31]. DT analysis is an effective classifier that has been applied in many domains [32,33] and is considered an intuitive nonlinear approach that can automatically detect independent variables that predict outcomes and the interactions between these variables. Moreover, it allows an easy-to-understand visual representation of the relationships between the variables and the primary outcome [34]. The multivariable stepwise logistic regression and DT analyses in our study were built using the same predefined set of variables. Nevertheless, our study demonstrates the advantage of DT analysis in providing prediction in an algorithmic fashion rather than merely identifying associations between the variables and the outcome, as most regression models do [35,36]. In this context, DT algorithms are intuitive and easy to understand and explain, while producing simple rules that resemble human reasoning.
Predictors of mortality in COVID-19 have been widely reported in many studies with different settings and designs, including mainly laboratory and radiological variables [37,38]. Limited reports, however, have studied clinical variables on ICU admission as predictors of mortality that may facilitate the early identification of critically ill COVID-19 patients at higher risk of 28-day mortality [39]. A meta-analysis by Pengqiang Du et al. (2021), which addressed predictors of mortality using classic logistic regression analysis, showed that advanced age, male gender, and comorbidities of chronic respiratory disease, DM, hypertension, and chronic kidney or cardiovascular diseases were associated with severe illness or death in COVID-19 patients [40].
On the other hand, studies that reported predictors of mortality utilizing DT analysis in critically ill COVID-19 patients were very limited [41][42][43]. One of these analyses, by Qiao Yang et al. (2021), showed a rapid, simple, and easy-to-interpret decision tree model built on three biochemical markers on ICU admission (LDH, NLR, and CRP) with a high true positive rate (sensitivity) for predicting death in severe COVID-19 disease [41].
Strengths of the study include its multicenter nature, which improves generalizability. In addition, unlike the earlier reported experience from the Middle East [44], the 28-day ICU mortality of 36.8% in this cohort was comparable to reported experiences during the pandemic [45][46][47]. Our study has limitations, including the low number of patients retained in the multivariable logistic regression model (409 out of 1468 (27.82%)). However, this low number reflects the clinical variables used in the model, such as the need for intubation and vasopressors, which applied to only a limited number of critically ill patients on ICU admission.

Conclusion
Five clinical predictors of 28-day ICU outcome were identified using DT algorithmic analysis of COVID-19 patients admitted to the ICU. DT analysis may be used in the ICU for early identification of critically ill COVID-19 patients who are at high risk of 28-day mortality. Further studies are required to validate these results.

Figure 1 Kaplan-Meier curve for COVID-19 cumulative incidence of mortality.

The need for intubation during the first five days of ICU admission was the first splitting variable (root node); in those who required intubation, gender was the next splitting step. For male patients, a high probability of mortality is anticipated, with 74% accuracy. For female patients, a PaO2/FiO2 ratio <100 indicated a high probability of mortality, with an accuracy of 72%. On the other hand, for those who required neither intubation nor vasopressors during the first five days, age had a marked effect on the outcome (e.g., a high probability of survival, with 88% accuracy, in the age group less than 40).
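The branches spelled out in the decision tree description above can be written as explicit rules. The sketch below encodes only the branches stated in that description and returns `None` wherever the description is silent; the function name, argument names, and return labels are hypothetical, and this is an illustration of how such a tree reads as code, not a validated clinical tool or the fitted model itself.

```python
def predict_28_day_outcome(intubated, vasopressors, male, pf_ratio, age):
    """Return 'mortality', 'survival', or None for branches not described.

    intubated, vasopressors : needed during the first five ICU days (bool)
    male                    : True for male patients (bool)
    pf_ratio                : PaO2/FiO2 ratio on ICU admission
    age                     : age in years
    """
    if intubated:
        if male:
            return "mortality"      # described with 74% accuracy
        if pf_ratio < 100:
            return "mortality"      # described with 72% accuracy
        return None                 # female, PaO2/FiO2 >= 100: not described
    if not vasopressors:
        if age < 40:
            return "survival"       # described with 88% accuracy
        return None                 # older age groups: not described
    return None                     # vasopressors without intubation: not described


# Hypothetical example patients (not from the study)
print(predict_28_day_outcome(True, True, True, 150, 60))
print(predict_28_day_outcome(False, False, False, 250, 35))
```

Returning `None` for the unstated branches keeps the sketch from implying predictions that the description does not support.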