Study selection
A total of 422 articles were identified through four databases: Cochrane (n = 12), Embase (n = 150), PubMed (n = 74), and Web of Science (n = 186). After removing 15 duplicates and excluding ineligible records flagged by automation tools (n = 20), we screened 387 articles. Ultimately, 23 articles met the inclusion criteria and were included in our study [2, 5–26]. Figure 1 displays the PRISMA flow diagram illustrating the study selection process. Selection was conducted independently by two reviewers (Zhenyu Yang and Xiaoju Cui), and any discrepancies were resolved by a third reviewer.
Characteristics of included studies
A total of 1,287,160 individuals were included in this study, of whom 167,338 were in the validation set. All included articles were published within the past five years, reflecting growing interest in using machine learning for sepsis prediction. Our review identified 81 prognostic models: 5 based on deep learning, 4 on InSight, 10 on logistic regression, 6 on multilayer perceptron, 8 on neural networks, 8 on support vector machines, 14 on XGBoost, 15 on random forest, and 11 on SOFA. Detailed characteristics of the included studies can be found in Table 1 and Table 2.
Quality assessment
The quality assessment was conducted independently by two reviewers (Zhenyu Yang and Xiaoju Cui), and any discrepancies were resolved by a third reviewer. The results are presented in the risk-of-bias plot (Fig. 2). Two of the 23 studies (8.7%) were deemed to have a high risk of bias in the participants domain, 13 (56.5%) in the analysis domain, and two (8.7%) in the outcome domain; no study was deemed to have a high risk of bias in the predictors domain. The high risk of bias in the analysis domain was mainly attributable to inadequate sample size, insufficient events per variable (EPV), improper handling of missing data, or failure to report how missing data were handled. The PRISMA checklist is provided in Supplementary File 1.
Predictors
Age, creatinine, and sodium were the most frequently used predictors (n = 12), followed by blood pressure and platelets (n = 11). The remaining predictors, in descending order of frequency, were heart rate, lactate, and temperature (n = 9); WBC count (n = 8); respiratory rate and SOFA score (n = 7); glucose, hemoglobin, MCHC, and PaO2 (n = 6); GCS, ICU LOS, lymphocyte count, and PaCO2 (n = 5); and BUN, cancer, and gender (n = 4). These results are presented in Fig. 3.
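For transparency, a frequency ranking of this kind can be reproduced by tallying each included study's predictor list. The sketch below is minimal and uses an illustrative study-to-predictor mapping, not the full extraction from Tables 1 and 2:

```python
from collections import Counter

# Illustrative extract: each included study contributes the set of
# predictors used by its model(s); the real mapping comes from Tables 1-2.
study_predictors = {
    "study_01": ["age", "creatinine", "sodium", "heart rate"],
    "study_02": ["age", "blood pressure", "platelets", "lactate"],
    "study_03": ["creatinine", "sodium", "temperature", "WBC count"],
}

# Count in how many studies each predictor appears.
counts = Counter(p for preds in study_predictors.values() for p in preds)

# Rank predictors in descending order of frequency (cf. Fig. 3).
for predictor, n in counts.most_common():
    print(f"{predictor}: n = {n}")
```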
Training set and test set accuracy
In the training set, random forest was the most frequently applied machine learning model (n = 9), with a pooled accuracy of 0.911 (95% CI 0.485–0.991), while XGBoost showed the best predictive performance (n = 6), with a pooled accuracy of 0.970 (95% CI 0.487–0.997). In the test set, random forest was again the most frequently applied model (n = 7), with a pooled accuracy of 0.795 (95% CI 0.638–0.895), while deep learning showed the best predictive performance (n = 3), with a pooled accuracy of 0.830 (95% CI 0.814–0.845). These results are presented in Figs. 4–8.
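The exact pooling procedure is not restated here; one common approach for pooled proportions such as accuracy is a DerSimonian–Laird random-effects model on the logit scale. The sketch below assumes that method and uses illustrative accuracies and sample sizes, not the extracted study data:

```python
import math

def pool_accuracies(accs, ns):
    """DerSimonian-Laird random-effects pooling of proportions on the
    logit scale; returns the pooled estimate with a 95% CI."""
    # Logit-transform each accuracy; approximate variance of a logit proportion.
    y = [math.log(a / (1 - a)) for a in accs]
    v = [1 / (n * a * (1 - a)) for a, n in zip(accs, ns)]
    w = [1 / vi for vi in v]
    # Fixed-effect estimate and Cochran's Q for heterogeneity.
    y_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance
    # Random-effects weights and pooled logit estimate.
    w_re = [1 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    lo, hi = y_re - 1.96 * se, y_re + 1.96 * se
    inv = lambda x: 1 / (1 + math.exp(-x))  # back-transform to [0, 1]
    return inv(y_re), inv(lo), inv(hi)

# Illustrative accuracies and sample sizes from hypothetical studies.
print(pool_accuracies([0.91, 0.88, 0.95, 0.72], [500, 800, 300, 1200]))
```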
Training set and test set c-index
Regarding the c-index, XGBoost was the most frequently utilized model in the training set (7 studies), with a pooled c-index of 0.83 (95% CI 0.83–0.84), while InSight exhibited the best performance (2 studies), with a pooled c-index of 0.91 (95% CI 0.90–0.93). In the test set, random forest was the most frequently employed model (5 studies), with a pooled c-index of 0.83 (95% CI 0.82–0.83), and its performance was comparable to that of XGBoost (3 studies; pooled c-index 0.83, 95% CI 0.82–0.84). Detailed results are presented in Figs. 9–13, and the overall results are provided in Supplementary File 3.
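For a binary outcome such as sepsis onset, the c-index is equivalent to the area under the ROC curve: the probability that a randomly chosen septic patient receives a higher predicted risk than a randomly chosen non-septic one. A minimal pairwise implementation on illustrative data (not taken from the included studies) is shown below:

```python
def c_index(y_true, y_score):
    """C-index for a binary outcome: fraction of positive/negative pairs
    in which the positive case gets the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return concordant / (len(pos) * len(neg))

# Illustrative predicted sepsis risks: 1 = developed sepsis, 0 = did not.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.90, 0.20, 0.65, 0.40, 0.70, 0.85]
print(c_index(y_true, y_score))  # 8 of 9 pairs concordant, ~0.89
```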