In this study, several machine learning methods were used to construct a 14-day mortality prediction model for CRKP bloodstream infection. With a limited sample size, random forest performed better than XGBoost and SVM, and was generally better than the multiple logistic regression models. We chose a 14-day horizon because CRKP bloodstream infection deteriorates rapidly into severe multiple organ dysfunction syndrome, resulting in death. A relatively short-term risk prediction model can help clinicians identify high-risk patients when bloodstream infection occurs, and guide the use of more active monitoring measures and anti-infection regimens16.
Recently, artificial intelligence has been used to predict the incidence of hospitalization and colonization of CRO17. Although, to our knowledge, no study has used machine learning to predict mortality from CRKP bloodstream infection, this study provides partial evidence for the advantage of machine learning algorithms18,19. The algorithms we chose (random forest, SVM and XGBoost) are well established and widely used. With small samples, random forest grows an ensemble of decision trees, ranks the important variables by their Gini score, and yields an effective model. SVM and XGBoost rely on more complex mechanisms to obtain more accurate predictions, and can fully demonstrate their performance when a large amount of data is available. Comparative studies of other clinical models have reached similar conclusions20-22. In our results, the models had almost the same accuracy: XGBoost performed similarly to random forest, and SVM had higher specificity. Multivariate logistic regression did not predict as well as the three machine learning algorithms, but it is more intuitive and easier to understand. In this study, the risk score based on multivariate logistic regression could predict 14-day survival for 82.5% of patients using five independent and important factors. The decision tree model gave an unexpected result: because it included too few branch points, its predictive performance was not ideal.
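As an illustration of how a Gini score ranks variables, the following minimal sketch computes the Gini impurity of an outcome and the impurity decrease obtained by splitting on a single binary variable. It is not the code used in this study: the toy data and the variable names (`died`, `septic_shock`, `unrelated_flag`) are hypothetical, and a real random forest averages such decreases over many splits and many trees rather than evaluating one split.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_decrease(feature, labels):
    """Impurity decrease from splitting the labels on a binary (0/1) feature."""
    n = len(labels)
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted

# Hypothetical toy data (8 patients), for illustration only.
died           = [1, 1, 1, 0, 0, 0, 0, 0]
septic_shock   = [1, 1, 1, 0, 0, 0, 0, 1]  # closely tracks the outcome
unrelated_flag = [1, 0, 1, 0, 1, 0, 1, 0]  # barely related to the outcome

for name, feature in [("septic_shock", septic_shock),
                      ("unrelated_flag", unrelated_flag)]:
    print(name, round(gini_decrease(feature, died), 5))
```

In this toy example the variable that tracks the outcome produces the larger impurity decrease, which is the sense in which a Gini-based score marks a variable as important.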
In this study, the 14-day mortality rate of CRKP bloodstream infection was 21.4%, which was lower than the previously published epidemiological figure of 31.1% in China3. This may be related to the recent regulation of the use of carbapenems and tigecycline, and to the approval of polymyxin B and ceftazidime-avibactam in the domestic market. Greater clinical experience with CRKP bloodstream infection and earlier use of appropriate antibiotics may also be associated with lower mortality23. The multiple logistic regression model included five major risk factors: age, septic shock, acute kidney injury, severe thrombocytopenia, and immunodeficiency. Septic shock and acute kidney injury are often mentioned in previous retrospective studies3,6,7,11,24. The best-known work to date is the INCREMENT study, which grouped numerical variables using a classification decision tree and built its prediction model with hierarchical logistic regression. It found that five indicators can predict CRKP mortality: severe infection/septic shock, Pitt score more than 6, Charlson score more than 2, bloodstream infection of non-biliary tract origin, and inappropriate treatment6. Severe thrombocytopenia has rarely been reported, but it is consistent with the clinical manifestations of CRKP bloodstream infection, including early digestive tract bleeding and a rapid fall in platelet count rather than a decline due to myelosuppression. We found that three independent variables (septic shock, acute kidney injury, and severe thrombocytopenia) were present in the different models and occupied an important position in each of them.
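To make the structure of a logistic regression risk model with these five factors concrete, the sketch below converts binary risk-factor flags into a predicted 14-day mortality probability. It is only an illustration under stated assumptions: the intercept and coefficients are hypothetical placeholders and not the values fitted in this study, age is assumed to enter as a binary `older_age` flag, and the names `INTERCEPT`, `LOG_ODDS` and `predicted_mortality` are invented for the example.

```python
import math

# HYPOTHETICAL values for illustration only; these are NOT the fitted
# coefficients of this study.
INTERCEPT = -3.0
LOG_ODDS = {
    "older_age": 0.6,                # assumption: age coded as a 0/1 flag
    "septic_shock": 1.5,
    "acute_kidney_injury": 1.0,
    "severe_thrombocytopenia": 1.2,
    "immunodeficiency": 0.8,
}

def predicted_mortality(patient):
    """Return P(death within 14 days) from binary risk-factor flags (0 or 1).

    Missing factors are treated as absent (0).
    """
    linear = INTERCEPT + sum(beta * patient.get(name, 0)
                             for name, beta in LOG_ODDS.items())
    return 1.0 / (1.0 + math.exp(-linear))  # logistic (sigmoid) function

low_risk = {}  # no risk factors present
high_risk = {"septic_shock": 1, "acute_kidney_injury": 1,
             "severe_thrombocytopenia": 1}

print(round(predicted_mortality(low_risk), 3),
      round(predicted_mortality(high_risk), 3))
```

Because every coefficient is positive in this sketch, each additional risk factor raises the predicted probability. This additive structure on the log-odds scale is what makes a regression-based risk score easy to read.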
Using a local database in this study has both advantages and disadvantages. Different databases cover different numbers of dimensions. In critical care, a large number of articles have been published using the public database MIMIC, but those data may not apply to the Chinese population25,26. CRKP bloodstream infection has been increasing gradually in recent years, but the epidemic trend in China is quite different from that in other countries. MIMIC contains very few CRKP samples, so this kind of study cannot be carried out with it. Our departmental database was established two years ago; it integrates the data stored in the hospital's various data centers over the past 5 years and has been used for data mining on sepsis and acute kidney injury. The local database provides samples of high accuracy, a wide selection of independent variables, and flexible and complete extraction of those variables. However, it may also have shortcomings, such as data bias, limited generalizability, small sample size, and model instability. The models developed in recent years have no clear visualization tools, which makes it difficult to understand and explain the logic of the model. We considered this issue in the data modeling phase and are actively working with other centers on external validation of the model, which will be the next phase of our team's research.