Machine Learning Algorithms to Predict the Mortality of Carbapenem-Resistant Klebsiella pneumoniae Bacteremia


Purpose To establish models that predict 14-day mortality of Carbapenem-Resistant Klebsiella pneumoniae (CRKP) bacteremia using machine learning.
Materials and Methods This is a single-center retrospective study. We collect the relevant clinical information of all patients with CRKP bacteremia over the past 5 years from the local database. Data analysis and verification are carried out with multiple logistic regression, decision tree, random forest, support vector machine (SVM), and XGBoost.
Result This study includes 187 patients with 40 related variables. In multiple logistic regression, acute renal injury (P=0.003), APACHE II score (P=0.036), immunodeficiency (P=0.025), severe thrombocytopenia (P=0.025) and septic shock (P=0.044) are high-risk factors for 14-day mortality of CRKP bloodstream infection. According to the importance of these parameters, a risk score is established to predict the survival rate of CRKP bacteremia. The analysis of the five models, with a 70% training set and a 30% test set, shows that the overall performance of random forest (AUROC=0.953, precision=91.85%) is slightly better than that of XGBoost (AUROC=0.912, precision=86.41%) and SVM (AUROC=0.936, precision=79.89%) in predicting 14-day mortality of CRKP bacteremia. The multiple logistic regression model (AUROC=0.825, precision=81.52%) ranks next, and the decision tree model (AUROC=0.712, precision=79.89%) performs least well.
Conclusion Machine learning performs better than multiple logistic regression in predicting 14-day mortality of CRKP bacteremia. Acute renal injury, severe thrombocytopenia, and septic shock are high-risk factors for mortality in CRKP bacteremia.

largest study in China shows that the mortality rate of CRKP bacteremia is 32.9% 3 , with reported rates worldwide ranging from 17% to 50% 4-6 .
Risk factors for death from CRKP bloodstream infection include prolonged hospitalization, prior use of carbapenems and glucocorticoids, invasive procedures, septic shock, and inappropriate empirical treatment 4,7,8 . Identifying high-risk factors helps predict prognosis, gauge disease severity, and prompt earlier treatment. For patients with high-risk factors for death, we may prefer higher-level antibiotics, especially as the number of new antibiotics increases 9 . However, the results of risk factor analyses are highly heterogeneous, because each retrospective study reflects its own regional epidemiology. Patient populations, medication practices, and treatment concepts for CRKP bloodstream infection differ, especially between Western and Asian countries 5,6,10,11 . It is therefore difficult to apply risk factors identified abroad to Chinese patients or to adopt CRKP-related scores developed mainly in Western countries. Establishing our own score for CRKP bacteremia mortality may be more meaningful, which is one of the purposes of this study. A risk prediction model incorporates the weights of the different risk factors and is more intuitive and practical. Few studies have addressed high-risk prediction models for CRKP infection, and previous ones used multiple logistic regression 6,[12][13][14] . However, some studies have shown that machine learning outperforms multiple logistic regression in predicting gram-negative bacteria and CRE colonization [13][14][15] .

4.
Statistical methods: all statistics are done in R and RStudio. Packages include "VIM", "psych", "regplot", "MASS", "vcd", "rpart", "rpart.plot", "randomForest", "e1071", "xgboost" and "rattle".
Table 1
Then we incorporate these variables into the multiple logistic regression model and establish a preliminary score to predict the survival rate of CRKP bloodstream infection according to the contribution of each parameter. As seen in Table 2, patients with scores higher than 20 are more likely to survive in our scoring system.
best in terms of accuracy. SVM has high specificity but low sensitivity (see Table 3). Overall, random forest, XGBoost, and SVM are better at predicting 14-day mortality of CRKP bloodstream infection.
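The sensitivity, specificity, accuracy, and likelihood ratios reported for each model can all be derived from a 2x2 confusion table on the test set. The following is a minimal Python sketch of that arithmetic, not the R code used in this study. The names `ConfusionCounts`, `count_outcomes`, and `diagnostic_metrics` are hypothetical, the labels are assumed to be coded 1 for death within 14 days and 0 for survival, and the toy labels at the end are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # died within 14 days, model predicted death
    fp: int  # survived, model predicted death
    tn: int  # survived, model predicted survival
    fn: int  # died within 14 days, model predicted survival


def count_outcomes(y_true, y_pred):
    """Tally the 2x2 table. Assumes labels are 1 (death) or 0 (survival)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def diagnostic_metrics(c):
    """Compute test-set metrics. Assumes both outcomes occur at least once."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    # Likelihood ratios are unbounded when the denominator is zero.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


if __name__ == "__main__":
    # Invented toy labels, not data from this study.
    truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
    for name, value in diagnostic_metrics(count_outcomes(truth, predicted)).items():
        print(f"{name}: {value:.3f}")
```

A model with high specificity but low sensitivity, as described for SVM, would show few false positives (`fp`) but many false negatives (`fn`) in this table.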

Discussion
In this study, several machine learning methods are used to construct a 14-day mortality prediction model for CRKP bloodstream infection. With a limited sample size, random forest performs better than XGBoost and SVM, and generally better than the multiple logistic regression model. A 14-day mortality prediction model is established because CRKP bloodstream infection deteriorates rapidly into severe multiple organ dysfunction syndrome, resulting in death. A relatively short-term risk prediction model can help clinicians identify high-risk groups when bloodstream infection occurs, and guide the use of more active monitoring measures and anti-infection regimens 16 .
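The models in this study are compared mainly by AUROC, which can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it by direct pairwise comparison (the Mann-Whitney formulation) in Python. It is an illustration under stated assumptions, not the study's R implementation: the function name `auroc` is hypothetical, labels are assumed to be 1 for death within 14 days and 0 for survival, and the scores in the example are invented.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (death, survivor) pairs ranked correctly.

    Ties between a death and a survivor count as half a correct pair.
    """
    deaths = [s for t, s in zip(y_true, scores) if t == 1]
    survivors = [s for t, s in zip(y_true, scores) if t == 0]
    if not deaths or not survivors:
        raise ValueError("AUROC needs at least one death and one survivor")

    correct = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                correct += 1.0
            elif d == s:
                correct += 0.5
    return correct / (len(deaths) * len(survivors))


if __name__ == "__main__":
    # Invented predicted risks for four patients, not data from this study.
    labels = [1, 1, 0, 0]
    predicted_risk = [0.9, 0.4, 0.5, 0.1]
    print(f"AUROC = {auroc(labels, predicted_risk):.3f}")
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a test set of this size; a rank-based computation would be preferable for much larger cohorts.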
Recently, artificial intelligence has been used to predict the incidence of hospitalization and colonization of CRO 17 . Although no previous study has used machine learning to predict mortality from CRKP bloodstream infection, this study partially demonstrates the superiority of machine learning algorithms 18,19 .
There are few reports of severe thrombocytopenia, but it is consistent with the clinical manifestations of CRKP bloodstream infection, including earlier digestive tract bleeding and rapid-onset thrombocytopenia rather than a decline due to myelosuppression. We found that three independent variables, septic shock, acute renal insufficiency, and severe thrombocytopenia, are present in different models and occupy an important position in each.
Table 3: Prediction effect analysis of five prediction models, including AUROC, accuracy, sensitivity, specificity, positive likelihood ratio and negative likelihood ratio.
Figure 1 The flow chart of the study.

Figure 2
The decision tree model with two branches of severe thrombocytopenia and acute kidney injury
AUROC curves of five models. Different colors represent different models.