Data source
Data for this study were obtained from the MIMIC-IV 1.0 database, the eICU-CRD 2.0, and the AmsterdamUMCdb 1.0.2. MIMIC-IV, an updated version of MIMIC-III, comprises data from over 40,000 patients admitted to intensive care units (ICUs) at the Beth Israel Deaconess Medical Center (BIDMC) [20]. The eICU-CRD contains data from multiple ICUs, covering over 200,000 patients admitted in 2014 and 2015 [21]. The AmsterdamUMCdb contains approximately 1 billion clinical data points from 23,106 admissions of 20,109 patients [22].
Data collection and release were approved by the institutional review board of the Massachusetts Institute of Technology (no. 0403000206) and complied with the Health Insurance Portability and Accountability Act (HIPAA).
Participants
This study included participants aged 18 years or older from MIMIC-IV 1.0, eICU-CRD 2.0, and AmsterdamUMCdb 1.0.2. Eligibility for inclusion was based on the following criteria: 1) documented or suspected infection and a Sequential Organ Failure Assessment (SOFA) score of ≥ 2, according to the Sepsis-3.0 standards [3], in the first 24 hours of ICU admission; 2) a documented peripheral complete blood count within the first 24 hours of ICU admission.
Exclusion criteria were: 1) an ICU stay of fewer than 24 hours; 2) HIV infection, cancer, metastatic tumors, or rheumatic diseases; 3) for patients with multiple hospitalizations, any ICU admission other than the first; 4) total cholesterol, triglyceride, high-density lipoprotein (HDL), and low-density lipoprotein (LDL) were not documented in the first 24 hours.
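The inclusion and exclusion criteria above can be read as a single eligibility filter. The following is a minimal Python sketch of that logic, not the SQL or R code used in the study; the `IcuStay` record and all of its field names are hypothetical and do not mirror the schema of any of the three databases.

```python
from dataclasses import dataclass, replace


@dataclass
class IcuStay:
    # Hypothetical record; every field name here is an assumption for illustration.
    age: int
    suspected_infection: bool   # documented or suspected infection
    sofa_24h: int               # SOFA score in the first 24 h of ICU admission
    has_cbc_24h: bool           # complete blood count documented in the first 24 h
    icu_hours: float            # length of ICU stay in hours
    excluded_comorbidity: bool  # HIV, cancer, metastatic tumor, or rheumatic disease
    admission_order: int        # 1 = first ICU admission of this patient
    lipids_documented_24h: bool # lipid panel documented per exclusion criterion 4


def is_eligible(stay: IcuStay) -> bool:
    """Apply the inclusion criteria, then reject on any exclusion criterion."""
    included = (
        stay.age >= 18
        and stay.suspected_infection
        and stay.sofa_24h >= 2
        and stay.has_cbc_24h
    )
    excluded = (
        stay.icu_hours < 24
        or stay.excluded_comorbidity
        or stay.admission_order != 1
        or not stay.lipids_documented_24h
    )
    return included and not excluded


# Usage: one eligible stay and two variants that each fail a single criterion.
example = IcuStay(age=64, suspected_infection=True, sofa_24h=5, has_cbc_24h=True,
                  icu_hours=72.0, excluded_comorbidity=False, admission_order=1,
                  lipids_documented_24h=True)
short_stay = replace(example, icu_hours=12.0)
readmission = replace(example, admission_order=2)
```

Keeping inclusion and exclusion as two separate boolean expressions mirrors the way the criteria are stated and makes it easy to count how many stays each rule removes.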
Data extraction and handling of missing data and outliers
The following clinical information was extracted using Structured Query Language (SQL) statements: 1) laboratory blood and biochemical tests within the first 24 hours: white blood cell (WBC) count, platelets, neutrophil count, lymphocyte count, monocyte count, total cholesterol, HDL, LDL, and blood glucose; 2) demographics and vital signs within the first 24 hours: age, sex, heart rate, systolic blood pressure, diastolic blood pressure, temperature (℃), and respiratory rate; 3) blood gas analysis within the first 24 hours: arterial partial pressure of oxygen (PaO2) and arterial partial pressure of carbon dioxide (PaCO2); 4) ICU details: length of ICU stay and in-hospital survival status; 5) comorbidities and treatment modalities: myocardial infarction, congestive heart failure, chronic pulmonary disease, liver disease, renal disease, mechanical ventilation, and dialysis. When a variable was recorded multiple times within the first 24 hours of ICU admission, the value associated with the greatest severity of illness was used. The neutrophil-to-lymphocyte ratio (NLR) was computed as neutrophils divided by lymphocytes, the lymphocyte-to-monocyte ratio (LMR) as lymphocytes divided by monocytes, the platelet-to-lymphocyte ratio (PLR) as platelets divided by lymphocytes, the monocyte-to-HDL ratio (MHR) as monocytes divided by HDL, and the neutrophil-to-HDL ratio (NHR) as neutrophils divided by HDL.
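The five derived ratios follow directly from their definitions. The sketch below is an illustrative Python version, not the study's own code; the function names are hypothetical, and returning `None` for a missing or zero denominator is an assumption, since the text does not say how such cases were handled. Units are whatever the source measurements use.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def inflammatory_ratios(neutrophils, lymphocytes, monocytes, platelets, hdl):
    """Compute NLR, LMR, PLR, MHR, and NHR as defined in the text."""
    return {
        "NLR": safe_ratio(neutrophils, lymphocytes),  # neutrophils / lymphocytes
        "LMR": safe_ratio(lymphocytes, monocytes),    # lymphocytes / monocytes
        "PLR": safe_ratio(platelets, lymphocytes),    # platelets / lymphocytes
        "MHR": safe_ratio(monocytes, hdl),            # monocytes / HDL
        "NHR": safe_ratio(neutrophils, hdl),          # neutrophils / HDL
    }


# Usage with made-up values (not patient data).
ratios = inflammatory_ratios(neutrophils=8.0, lymphocytes=1.0,
                             monocytes=0.5, platelets=200.0, hdl=40.0)
```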
Variables with more than 30% missing values, including the PaO2/FiO2 ratio, FiO2, lactate, SpO2, PaCO2, PaO2, pH, and LDL, were excluded from the analysis (Figure S1). The remaining 45 candidate predictors measured at ICU admission were selected for further analysis. Missing values in the selected variables were imputed by multiple imputation with predictive mean matching (pmm) using the "mice" package [23]. Outliers were detected with a random forest method (Figure S2) and replaced by pmm using the outForest R package [24,25].
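The 30% missingness rule can be illustrated with a short Python sketch. It covers only the variable-exclusion step, not the imputation or outlier handling performed with mice and outForest; the column layout (a dictionary of lists with `None` for missing values) and the function names are assumptions made for the example.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None); an empty column counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def drop_sparse_variables(columns, threshold=0.30):
    """Split columns into those kept and those dropped for exceeding the missingness threshold."""
    kept, dropped = {}, []
    for name, values in columns.items():
        if missing_fraction(values) > threshold:
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped


# Usage with made-up values: lactate is 75% missing, HDL is 25% missing.
columns = {
    "lactate": [None, None, 1.2, None],
    "hdl": [40.0, 35.0, None, 50.0],
    "heart_rate": [88, 102, 95, 110],
}
kept, dropped = drop_sparse_variables(columns)
```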
Statistical analysis
To begin, we completed the Data or Specimen Study course of the Collaborative Institutional Training Initiative (CITI) program (Record ID: 9303810). Subsequently, we applied for access to both PhysioNet-hosted databases (MIMIC-IV and eICU-CRD) by creating an account on PhysioNet (https://physionet.org) and signing the PhysioNet Clinical Database Restricted Data Use Agreement. We then used SQL statements to extract the required clinical information.
All analyses were carried out using R 4.0.5. Continuous variables were presented as the mean ± SD or median (interquartile range) and compared using Student's t-test for normally distributed variables or the Mann-Whitney U test for non-normally distributed variables. Categorical variables were expressed as proportions and analyzed using the Chi-square or Fisher's exact test.
Lasso regularization was employed for variable selection, retaining pertinent variables and shrinking the coefficients of the others to zero, which reduces model complexity and mitigates the risk of overfitting [26,27]. A key advantage of this approach is that the resulting sparse model is easier to interpret. Ten-fold cross-validation with the "glmnet" package was used to estimate the optimal penalty parameter (lambda) and the beta coefficients of the selected variables in the training cohort [28]. Cross-validation was used to make model selection and parameter estimation more robust.
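Why the lasso penalty discards variables can be seen in a simplified setting. The Python sketch below assumes squared-error loss and an orthonormal design, in which case the lasso estimate is the soft-thresholded least-squares coefficient; this is a textbook special case chosen for illustration, not the penalized logistic model fitted with glmnet in the study, and the coefficient values and function names are made up.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    """Lasso solution under an orthonormal design: soft-threshold each OLS coefficient."""
    return {name: soft_threshold(beta, lam) for name, beta in ols_coefs.items()}


def selected_variables(coefs):
    """Names of variables whose penalized coefficient is non-zero."""
    return [name for name, beta in coefs.items() if beta != 0.0]


# Usage with made-up least-squares coefficients and penalty lambda = 0.3.
ols = {"NLR": 0.8, "PLR": -0.2, "MHR": -0.9, "age": 0.1}
penalized = lasso_orthonormal(ols, lam=0.3)
chosen = selected_variables(penalized)
```

Larger values of lambda zero out more coefficients, which is why the penalty is tuned by cross-validation rather than fixed in advance.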
Seven machine learning algorithms, namely eXtreme Gradient Boosting (XGBoost), logistic regression (LR), random forest (RF), support vector machine (SVM), K Nearest Neighbor (KNN), Naive Bayes, and Decision Tree (DT), were used to build the predictive models. Model discrimination was evaluated using the area under the receiver operating characteristic curve (AUC-ROC). To assess practical utility and potential clinical impact, decision curve analysis (DCA) quantified the net benefit across varying threshold probabilities [29]. Spearman correlation analysis examined the associations among the continuous predictor variables. Restricted cubic splines (RCS) with knots at the 5th, 35th, 65th, and 95th percentiles were used to explore potential non-linear relationships between continuous risk factors and in-hospital mortality, using the Regression Modeling Strategies (rms) package in R [30]. Multivariate adjustment in the RCS analyses controlled for the effects of covariates to give a more accurate estimate of the relationship between each independent variable and in-hospital mortality. Collectively, these techniques were chosen to support robust and reliable results.
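The two evaluation metrics can be written out in a few lines. The Python sketch below is illustrative rather than the study's R code: it computes AUC-ROC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (ties count one half), and the DCA net benefit at a threshold probability pt as TP/n − (FP/n) × pt/(1 − pt). The function names, the rule of treating a patient as positive when the predicted risk is at or above the threshold, and the example values are assumptions.

```python
def auc_roc(y_true, y_score):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def net_benefit(y_true, y_score, threshold):
    """Net benefit of acting on predicted risks >= threshold, with 0 < threshold < 1."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Usage with made-up outcomes (1 = died in hospital) and predicted risks.
outcomes = [1, 1, 0, 0]
risks = [0.9, 0.4, 0.6, 0.1]
auc = auc_roc(outcomes, risks)
nb_low = net_benefit(outcomes, risks, threshold=0.2)
nb_mid = net_benefit(outcomes, risks, threshold=0.5)
```

A decision curve is obtained by evaluating the net benefit over a grid of thresholds and comparing it with the treat-all and treat-none strategies.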