A Time-Incorporated SOFA-Based Explainable Machine Learning Model for Mortality Prediction in Critically Ill Patients

Background: Organ dysfunction (OD) assessment is essential in intensive care units (ICUs). However, no OD scoring system has so far considered the duration of OD, which is clinically relevant. This study aimed to develop and validate an ICU mortality prediction model based on the Sequential Organ Failure Assessment (SOFA) score, incorporating the time dimension with machine learning methods. Methods: Data from the eICU Collaborative Research Database and Medical Information Mart for Intensive Care (MIMIC) -III were mixed for model development, and the MIMIC-IV dataset and Nanjing Jinling Hospital Surgery ICU (SICU-JL) dataset were used for external testing. Adult patients in the ICUs for more than 72 hours were deemed eligible. The total SOFA score and individual scores were calculated every 12 hours for the rst three days of ICU admission. Time-dimensional variables were derived from the consecutively recorded SOFA scores and individual scores for each organ. A modied SOFA model incorporating the time dimension (T-SOFA) was stepwise constructed to predict ICU mortality using multiple machine learning algorithms. The predictive performance was assessed with the area under the receiver operating characteristic curves (AUROC). Also, we utilized the SHapley Additive exPlanations (SHAP) algorithm for data visualization and model explainability. Results: We extracted a total of 66,709 ICU patients from the mixed datasets for model development and 15,423 patients for validation. The T-SOFA M3 that incorporated the time dimension features and age, using the XGBoost algorithm, signicantly outperformed the original SOFA scores (AUROC 0.800 95% CI [0.787-0.813] vs. 0.693 95% CI [0.678-0.709], p<0.01) in the validation set. Good prediction performance was maintained for the T-SOFA M3 in test Set A and test Set B, with AUROC of 0.803, 95% CI[0.791-0.815], and 0.830, 95%CI [0.789-0.870], respectively. Signicant contributors demonstrated by the SHAP analysis included total SOFA score, Respiration-score, CNS-score, age, Cardiovascular-score,


Background
Organ injury or dysfunction caused by a primary illness is a common reason for intensive care unit (ICU) admission [1]. Although the original diseases of critically ill patients vary, organ dysfunction(OD) is associated with substantial morbidity and mortality [2]. Therefore, OD assessment is a routine in ICUs to determine the impact of the underlying disease and the effects of therapies.
OD assessment system has evolved from the Multiple Organ Dysfunction Score (MODS) to the sequential organ failure assessment (SOFA) score, which was extended from sepsis to all critically ill patients after the establishment of the Sepsis-3 criteria [3][4][5]. Theoretically, OD evaluation should be conducted along three dimensions: the severity of each OD, the number of organ systems involved, and the duration of each OD. OD is a dynamic process rather than a static event [4], so a temporal assessment can assist clinicians in understanding the progression of the disease and potentially facilitating decision-making.
Unfortunately, to date, none of the existing OD scoring systems have taken the time dimension into account.
The assessment of time dimension is complicated by the fact that the duration of OD could be different from one system to another, and changes in OD during ICU stay are multidirectional. Previous studies proposed concepts such as ΔSOFA to re ect the direction and magnitude of variation, but information on the duration of each OD remains unaccountable [6]. Arti cial intelligence algorithms have recently had a wide range of applications in healthcare practice [7], where machine learning or deep learning algorithms could process time-series data to improve prediction accuracy by extracting potentially valuable time features [8]. However, the cost is that the prediction process became a "black box" (unexplainable),sacri cing medical interpretability.
This study aimed to develop and validate a novel ICU mortality prediction model based on the SOFA score by incorporating time dimension through a machine learning algorithm. Furthermore, the impact of time dimension on ICU mortality was further clari ed by interpretable algorithms.

Data sources
The study data was extracted from three real and medical care and mortality (including in-hospital and after-discharge) [9]. MIMIC-IV is an updated version of the MIMIC database, containing data collected between 2008 and 2019 [10]. We selected patients admitted between 2013 and 2019 from the MIMIC-IV database to distinguish it from the MIMIC-III database, thereby making it an external dataset. The eICU-CRD is another public database consisting of more than 200,000 admissions to ICUs from 208 hospitals across the US between 2014 and 2015, which provided a degree of generalizability to the training model due to its comprehensive geographic coverage and a more broad range of critical illnesses [11]. The SICU-JL database contained 2,155 electronic health records from the Surgical ICU of Jinling Hospital, Nanjing University, between 2018 to 2020. The following inclusion criteria were used for the present study: 1. adult critically ill patients aged ≥ 18 years; 2. ICU stay at least 72 hours. The criterion-compliant data from the MIMIC-III and the eICU-CRD were mixed and randomly divided in a ratio of 8:2, where 80% of the selected admissions were used as a training cohort. The remaining 20% were served as an internal validation cohort to verify the predictive performance and optimize hyperparameter settings to prevent over-tting. The data from the MIMIC-IV database was used as an external test set A, and data from the SICU-JL dataset served as external test set B to ensure the generalizability of the model. We had obtained proof of completion of the course on human research and signed a data use agreement on PhysioNet, and approved by MIT's Institutional Review Board. Data collection from the SICU-JL dataset was approved by the institutional review board of the Jinling hospital. The study was conducted in accordance with the tenets of the Declaration of Helsinki. Informed consent was waived by the IRB-JL in light of the retrospective nature.

Data extraction and preprocessing
Demographic information (e.g., age, gender, ethnicity, etc.), clinical data related to SOFA score (e.g., Oxygen partial pressure, mean arterial pressure, Glasgow Coma Scale (GCS) score, etc.), and the outcomes of patients were extracted from the dataset. The raw clinical data required to calculate the SOFA score were collected every 12 hours after ICU admission, with T 1 for the rst 12 hours at admission and T 6 at 72 hours after ICU entry. The total SOFA score and the sub-scores were calculated simultaneously, thereby resulting in six consecutive time points.

De nition of time dimension variables
We de ned several novel variables to re ect the temporary change of OD as follows( Fig. 1): 1)The time point when the total SOFA score or a sub-score ≥ 2 was counted as an Organ dysFunction unit Time (OFuT) out of the six time points. We de ned the sum of OFuT multiplied by the mean value of the total SOFA score or the corresponding sub-scores as the OFT Index (OFTI). 2) The difference in the total SOFA score or a sub-score between two adjacent time points (△SOFA=SOFAT x+1 -SOFAT x ,x [1, 2, 3, 4, 5]) ≥ 0 was recorded as an Organ dysfunction Unalleviated unit Time (OUuT), and the summed OUuT multiplied by the mean value of the total SOFA score or the corresponding sub-scores was called the OUT Index (OUTI). 3) The time point when △SOFA>0 was de ned as an Organ dysfunction Aggravated unit Time (OAuT), and the summed OAuT multiplied by the mean value of the total SOFA score or the corresponding sub-scores was de ned as an OAT Index (OATI). 4) With the sub-score value (0-4 points) or the total score value (0-24 points) as the y-axis and time (0-3 days) as the x-axis, the study time points were connected, constituting the SOFA curves. The area under the SOFA curves was described as the SOFA-Time Area Under the Curve (STAUC), which could re ect the cumulative changes of OD over the 72 hours.
Conventional time series features of total SOFA score and sub-scores (e.g., Mean, maximum, minimum, variance) were also included for analysis [12].

Prediction model development, testing, and explanation
We used a supervised classi cation learning algorithm to construct a prediction model for ICU mortality based on SOFA score. The maximum SOFA score during the rst 72 hours was included as the original SOFA as a baseline reference [13]. Then we progressively incorporate related variables into the prediction model. First, the total SOFA score and sub-scores at each time point were included. The model was trained using four machine learning algorithms, eXtreme Gradient Boosting (XGBoost) [14], Support Vector Machines (SVM) [15], Random Forest (RF)[16], and Logistic regression to obtain the T-SOFA M1. In ∈ the second step, the time dimension was taken into account by incorporating the abovementioned temporary variables to train a prediction model called T-SOFA M2. In the third step, given that age is a well-proven risk factor for all-cause mortality in ICUs [17], we added age to M2, thereby creating T-SOFA M3. All training models would be performed in the internal validation set to prevent over-tting. The performance of T-SOFA were further compared with the original SOFA and tested using datasets with public database and local database.
We used a uni ed framework, SHAP (SHapley Additive exPlanations), to obtain the impact of each variable on the designated outcome (ICU mortality), which could interpret the contribution of different organ systems and time features to the prediction model. Variables with an impact value≥0.1, as indicated by the SHAP algorithm, were considered signi cant for ICU mortality.

Missing data preprocessing
Most of the variables involved in this study were time-series data. Missing data were imputed by the last observation carried forward (LOCF) [18]. If an original variable required for SOFA score calculation is missing and cannot be imputed by proximity, it is lled using the mean value of the variables in the data set at the corresponding time point, when the missing data are normally distributed; otherwise, the median is used.

Statistical analysis
Differences in clinical characteristics between the training set, internal validation set, and external test sets were compared using the ANOVA or Kruskal-Wallis test, and categorical variables were compared using Pearson's chi-squared test. Comparisons of differences in temporary variables between survivors and non-survivors were made using a two independent sample t-test. The primary outcome was all-cause mortality in the ICUs, and the predictive performance was evaluated by the area under the receiver operating characteristic curves (AUROC). Differences in predictive performance between models were compared with a non-parametric DeLong test with a two-sided p<0.05 criterion for statistical signi cance [19]. Statistical analyses were performed using Stata 15.0 software, and the machine learning algorithm modeling was implemented in Python 3.7 and R 3.6.3

Results
A total of 66,709 patients from the MIMIC-III database and the eICU-CRD were used to develop the mortality prediction model. Then 80% of the patients were randomly allocated for training and 20% for internal validation. 13,929 patients from the MIMIC-IV database and 1,494 patients from the SICU-JL database served as the external test set A and test set B, respectively (Additional le 1. Figure S1.). The modeling cohort consisting of both the training and internal validation sets showed signi cant differences from the external test cohort in terms of baseline clinical characteristics of admissions and outcomes ( Table 1). The patients from both test sets had ICU mortality around 9.7% (test set A:1358/13929, test set B:145/1494), which were higher than those in the training and internal validation sets (9% [4785/53367] and 9% [1206/13342], respectively). Almost all the SOFA-related temporary variables and age differed statistically between survivors and non-survivors (Additional le 1. Table S1).

Prediction performance assessment
In the validation set, the original SOFA score showed an AUROC of 0.693 (95%CI 0.678-0.709) in predicting ICU mortality. The XGBoost algorithm outperformed the others (Table 2. and Additional le 1. Table S2)

Clinical interpretability of prediction models
Among all the variables, the signi cant contributors to ICU mortality appeared to be the total SOFA score, the respiration score, the CNS score, the age, the cardiovascular score, and the SOFA OUTI, with variable important for the projection(VIP) value of 0.2861, 0.2836, 0.2594, 0.2548, 0.1437 and 0.125, respectively (Fig. 3A). The SHAP values for each feature were visualized using a summary plot (Fig. 3B). Additionally, the prediction model provided individualized mortality risk factor analysis of instances through the SHAP algorithm (Additional le 1. Figure S2).

Discussion
In this study, a SOFA-based, time-incorporated mortality risk prediction model(T-SOFA M3) for critically ill patients was developed and validated. We de ned several novel temporary variables to describe the dynamic change of OD quantitively from different perspectives. The results showed that a prediction model incorporating time dimension and age using machine learning algorithms had a signi cantly better predictive performance for ICU mortality than the original SOFA.
The SHAP algorithm was applied to evaluate the contribution of the time dimension. The results showed that various OD types have different impacts with the respiration system, the CNS system, and the cardiovascular system ranking the top three, consistent with what are used in the popular qSOFA score [5,20]. Among all the temporary variables, the impacts of SOFA OUTI, which re ects the duration of persistent(unalleviated) OD, ranks highest, even higher than that of renal dysfunction, indicating the importance of the time dimension. The better interpretability provided by the SHAP model can potentially further guide treatment and facilitate clinical decision-making. Furthermore, our model derived from the impacts of each variable could be used for individual prediction, which would be of clinical use for risk strati cation in ICU settings, which was impossible with previously proposed modi ed SOFA models[6, 13,[21][22][23].
Although efforts to develop new predictive models based on the SOFA score were repeatedly presented in the literature [6,13,20,[24][25][26][27], the T-SOFA M3 has several unique advantages. Firstly, the training set was obtained from the mixed datasets (MIMIC-III and eICU-CRD) with a large sample size, which may compensate for the remarkable heterogeneity among ICUs in the US. Secondly, the model was externally tested using an open-source multi-center dataset and a local single-center dataset, respectively, demonstrating excellent predictive performance and geographical generalizability. More, we collected total SOFA scores and individual scores every 12 hours within 72 hours of ICU admission, which was relatively infrequent and therefore doable in real practice, implicating its potential for future clinical use.
In recent years, the application of the SOFA score has in a broader range of diseases. Although the SOFA score was not initially developed to predict mortality but to evaluate comorbidities [4], it has become a well-established tool for prognosis prediction [28]. As a consequence, a lot of modi ed SOFA models have emerged to overcome the limitations of the SOFA score [29]. The modi ed SOFA systems incorporating the SOFA score derivatives like ΔSOFA, MAX SOFA, Mean SOFA, etc., could signi cantly outperform SOFA at admission, suggesting that characteristics representing changes in OD over time could help improve the performance.
Lilian Minne et al. systematically reviewed the performance of SOFA-based models for predicting mortality in critically ill patients and concluded that MAX SOFA had the best predictive performance (AUROC range = 0.792 to 0.922), while ΔSOFA showed variable performance across studies (AUROC range = 0.51 to 0.828) according to the de nitions [13] [29]. Another alternative option was to optimize the SOFA by replacing individual components with more clinically reliable ones. Previous studies have attempted to use the Richmond Agitation-Sedation Scale (RASS) score as a substitution as the Glasgow Coma Scale (GCS) score tends to be inaccurately recorded. However, they failed to nd the superiority of RASS score modi ed SOFA over the original version [30]. Adding other systems or variables to the original SOFA, such as the gastrointestinal system, was also considered and tested. Yehudit Aperstein et al. proposed the Novel Gastrointestinal Dysfunction Assessment Tool (Resting energy expenditure daily balance, Gastric residual volume, vomiting, and Bowel movements) as the seventh organ dysfunction assessment criterion of the modi ed SOFA and demonstrated improved prediction accuracy [21]. However, their modi ed SOFA score did not gain much popularity due to the subjective and inaccurate nature of assessing gastrointestinal dysfunction [31]. In the T-SOFA M3, age was the only additional variable apart from the original SOFA-required data since it was readily available and there was a strong correlation between age and mortality [32,33].
There are some limitations to our study. First of all, both the training and validation sets were extracted from retrospective data, which inevitably contained missing data and recording errors. Although multiple efforts were made to minimize the potential bias, there was still some data distortion. Therefore, prospective studies are necessary to further validate the discriminative ability of the model. Moreover, our model requires consecutive 72-hour SOFA data, which denies its use in the "very early" ( 24 hours) stage of ICU admission.

Conclusions
We developed and validated a SOFA-based time-incorporated mortality prediction model by machine learning algorithms, with satisfactory discrimination and medical interpretability.    Receiver operating characteristic (ROC) curves for the machine learning models and the original SOFA score in predicting ICU mortality in different datasets: A. Training set; B. Validation set; C. Test set A; D. Test set B