Early Prediction of Carbapenem-Resistant Gram-Negative Bacteria Carriage in the Intensive Care Unit Using Machine Learning

Background The prevention and control of carbapenem-resistance gram-negative bacteria (CR-GNB) is a difficulty and a focus for clinicians in the intensive care unit (ICU). This study constructs a CR-GNB carriage prediction model in order to predict CR-GNB incidence within one week. Methods The database comprises nearly 10,000 patients. The model is constructed using multivariate logistic regression and three machine learning algorithms. We then choose the optimal model and verify its accuracy by making daily predictions and recording the occurrence of CR-GNB for all patients admitted over 4 months. Results There are 1385 patients with positive CR-GNB cultures and 1535 negative patients in this study. Forty-five variables show statistically significant differences. We include 17 variables in the multivariate logistic regression model and build the three machine learning models on all variables. In terms of accuracy and the area under the receiver operating characteristic (AUROC) curve, the random forest is better than XGBoost and the multivariate logistic regression model, which are in turn better than the decision tree model (accuracy: 84% > 82% > 81% > 72%; AUROC: 0.9089 > 0.8947 ≈ 0.8987 > 0.7845). In the 4-month prospective study, 81 cases were predicted to be positive in CR-GNB culture within 7 days and 146 cases were predicted to be negative; after observation, 86 cases were positive and 120 cases were negative, with an overall accuracy of 84% and AUROC of 91.98%.

with a full range of diseases. The general ICU database uses the MIMIC database structure to collect basic patient information, medical orders, imaging examinations, laboratory tests, and nursing and physician documents. The database comprises nearly 10,000 ICU patients, and the data are updated daily.
Model Establishment
1) Participants: all patients in the general ICU database. Patients with a positive culture of CR-GNB are defined as CR-GNB carriage. Exclusion criteria: i) patients with a hospital stay of less than a week; ii) patients with a positive culture of CR-GNB within 48 hours of admission; iii) patients with only one positive culture result that is considered contamination during hospitalization. See Fig. 1 for the flow chart.
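The three exclusion criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Patient` record, its field names, and the `judged_contamination` flag (standing in for the clinicians' contamination judgment) are hypothetical and are not part of the study database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Patient:
    """Hypothetical patient record; field names are illustrative only."""
    admitted: datetime
    discharged: datetime
    positive_cultures: List[datetime] = field(default_factory=list)
    judged_contamination: bool = False  # assumed flag for clinical judgment


def is_excluded(p: Patient) -> bool:
    """Apply exclusion criteria i) to iii) as described in the text."""
    # i) hospital stay of less than a week
    if p.discharged - p.admitted < timedelta(days=7):
        return True
    # ii) a positive CR-GNB culture within 48 hours of admission
    if any(t - p.admitted <= timedelta(hours=48) for t in p.positive_cultures):
        return True
    # iii) a single positive result that is considered contamination
    if len(p.positive_cultures) == 1 and p.judged_contamination:
        return True
    return False
```

Keeping the contamination decision as an explicit input reflects that criterion iii) depends on clinical judgment rather than on a rule that can be derived from timestamps.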
2) Variable selection: demographic data, vital signs, underlying and primary diseases, important test indicators, operation histories, and antibiotic use records in the prior month. The time range of variable collection is determined according to the time of the positive CR-GNB culture: i) for patients with a positive CR-GNB culture within a week of admission, the collection period is limited to the first 48 hours of admission; ii) for patients with a positive culture more than a week after admission, the collection period runs from admission to one week before the positive culture; iii) according to our previous investigation, 87% of patients would carry CR-GNB within 2 weeks of admission, so the collection period for negative patients is the first week of admission. We select 65 variables for our study, as detailed in Table 1. For some parameters, the maximum or minimum subscript denotes the maximum or minimum value within the specified collection period, chosen according to clinical significance.
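The three collection-window rules can be written as one small function. This is a minimal sketch, assuming timestamps for admission and for the first positive culture are available; the function name and the treatment of a culture at exactly 7 days as "within a week" are assumptions, not details given in the study.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def collection_window(
    admitted: datetime, first_positive: Optional[datetime] = None
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the variable collection period."""
    if first_positive is None:
        # iii) negative patients: first week of admission
        return admitted, admitted + timedelta(days=7)
    if first_positive - admitted <= timedelta(days=7):
        # i) positive within a week: first 48 hours of admission
        # (exactly 7 days is treated as "within a week" by assumption)
        return admitted, admitted + timedelta(hours=48)
    # ii) positive after more than a week: admission to 7 days before positive
    return admitted, first_positive - timedelta(days=7)
```

For example, a patient admitted on 1 January with a first positive culture on 21 January would have variables collected from 1 January to 14 January.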
3) Statistical methods: we use MySQL and Navicat to complete data collection with structured query language (SQL). Data analysis is accomplished in R and RStudio with related packages. Multivariate logistic regression and three machine learning algorithms (decision tree, random forest, and XGBoost) are selected to establish the models. The main R functions and packages include "glm", "rpart", "randomForest", "xgboost" and "rattle". We delete independent variables with more than 50% missing values, and the remaining missing values are handled by multiple imputation. In univariate analysis, numerical variables are compared with the independent-samples t-test and dichotomous variables with the chi-square test; a P value less than 0.05 is considered statistically significant. All samples are divided into a 70% training set, a 15% validation set, and a 15% test set. Multivariate logistic regression uses backward stepwise selection to adjust the parameters. For the random forest, five hundred trees are constructed and an exhaustive search is used to adjust the parameters, and variable importance is presented with the corresponding visualization package. The booster-specific and task parameters of the XGBoost linear booster are adjusted according to the performance of the model. The evaluation parameters include sensitivity, specificity, positive predictive value, negative predictive value, and the area under the receiver operating characteristic (AUROC) curve.
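To make the 70%/15%/15% split and the evaluation parameters concrete, the following Python sketch shows one way to compute them. It is not the authors' R code: the function names, the random seed, and the confusion-matrix counts in the usage note are hypothetical, and the split is a plain random shuffle because the text does not say whether stratification was used.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_70_15_15(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Randomly split items into 70% training, 15% validation, 15% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 15 // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # remainder goes to the test set
    )


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```

With the 2920 patients in this study (1385 positive and 1535 negative), `split_70_15_15(range(2920))` yields sets of 2044, 438 and 438 samples. With illustrative counts of 30 true positives, 5 false positives, 10 false negatives and 55 true negatives, `confusion_metrics` gives a sensitivity of 0.75 and an accuracy of 0.85.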
2) Protocol: i) determination of the optimal prediction model through model evaluation; ii) application of the model to all patients admitted to the ICU to make daily predictions; iii) termination of prediction under the following conditions: a) when the prediction result suggests high risk (CR-GNB carriage will become positive within 7 days), prediction stops and the patient enters the 7-day observation period; b) when the patient leaves the ICU ward, including transfer, discharge or death; iv) comparison of the CR-GNB carriage observed within 7 days with the predicted results, calculation of the relevant evaluation indexes, and evaluation of model performance; v) all predicted results are concealed from clinicians, and this study does not interfere with clinical decision-making.
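The daily prediction loop and its two stopping conditions can be sketched as follows. This is an illustration only: the per-day feature dictionaries, the `predict_high_risk` callback (standing in for the chosen model), the function names, and the window boundaries (strictly after the flag day, up to and including day 7) are all assumptions.

```python
from typing import Callable, Dict, List, Optional


def first_high_risk_day(
    daily_features: List[Dict],
    predict_high_risk: Callable[[Dict], bool],
) -> Optional[int]:
    """Predict once per ICU day; stop at the first high-risk prediction.

    daily_features holds one entry per day in the ICU, so the list ending
    corresponds to the patient leaving the ward (condition b). Returns the
    0-based day of the high-risk flag (condition a), or None.
    """
    for day, features in enumerate(daily_features):
        if predict_high_risk(features):
            return day  # stop predicting; the observation period starts
    return None


def carriage_within_window(
    flag_day: int, positive_day: Optional[int], window: int = 7
) -> bool:
    """True if a positive culture falls within `window` days after the flag."""
    return positive_day is not None and flag_day < positive_day <= flag_day + window
```

For a patient flagged on day 2, a positive culture on day 9 counts as carriage within the observation period, while one on day 10 does not.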

Results:
Prediction models
A total of 3303 patients are included in the study, of which 1768 patients are culture positive. After applying the exclusion criteria, we include 1385 CR-GNB carriage patients and 1535 negative patients. Among the positive CR-GNB carriages, carbapenem-resistant Acinetobacter baumannii accounts for 59.16%, carbapenem-resistant Klebsiella pneumoniae for 21.21%, and carbapenem-resistant Pseudomonas aeruginosa for 18.25%. In terms of the distribution of culture sites, sputum culture accounts for 80.95%. As for the types of diseases, the center is dominated by multiple injuries, especially severe craniocerebral trauma, followed by various infectious diseases and cardio-cerebrovascular accidents. See Table 5 for details.
In univariate analysis, 45 variables show statistically significant differences (Table 1). In the multivariate logistic regression model, a stable model is obtained by incorporating 17 variables (Table 2). These risk factors include gender, invasive catheterization, mechanical ventilation and hospital residence history over the past month; vital signs including systolic blood pressure, respiratory rate and Glasgow Coma Scale; and laboratory indicators including white blood cell count, hematocrit, C-reactive protein, direct bilirubin, total protein, and fibrinogen. The use of cephalosporins and carbapenems is also a high-risk factor. Among the models established by the three common machine learning methods, the random forest presents the important characteristics of the model parameters (Fig. 2), in which hospital residence history over the past month, total protein and respiratory rate are the main risk factors, overlapping considerably with the multivariate logistic regression model. Model performance is shown in Table 3, Fig. 3 and Fig. 4. According to the performance of the models, we finally choose the random forest as the optimal prediction model for the prospective study.

Prospective Study
In the 4-month prospective study, a total of 673 patients are treated in our center, and the total number of prediction periods made by the model is 4553. Among them, 431 patients with a hospital stay of less than 7 days and 36 patients with a positive culture within 48 hours of admission are excluded. Finally, 81 cases are predicted to be positive and 146 cases negative. There are 86 positive cases and 120 negative cases after 7 days of observation. The sensitivity is 72.28%, lower than on the test set, but the specificity is as high as 93.65%, better than on the test set. Our model is therefore computationally efficient and achieves a high accuracy of 84% and an AUROC of 91.98%, equivalent to the accuracy on the validation set and the test set. In the verification stage, the proportion of sputum samples was still the largest, and the proportion of Pseudomonas aeruginosa was the highest; more detail can be seen in Tables 4 and 5.

Discussion:
In this study, we build a prediction model of CR-GNB carriage within a week of admission by machine learning. We further verify the accuracy of the model through 4 months of prospective, consecutive predictions.
With the worldwide spread of CR-GNB in the ICU, clinicians invest a lot of time and resources in nosocomial prevention and control measures, including colonization surveillance, contact isolation, hand hygiene, antibiotic control and so on 2,6 . However, in the face of heavy clinical work, the implementation rate of these measures has been criticized. The normalization of nosocomial prevention and control measures is not only a matter of medical staff compliance but also a problem of cost and benefit. Controlling the source of infection, cutting off the route of transmission, and protecting the vulnerable population are the three classical pathways. Current research has found more and more dormant sources of infection, including the fiberoptic bronchoscope and the ICU flume 17,18 .
The route of transmission can also be analyzed in more ways with the assistance of next-generation sequencing 19 . Many studies focus on protecting susceptible people because of its simplicity and effectiveness 5,20,21 . Some studies pay attention to the identification of high-risk factors, which are determined by building models 21-24 . Thomas and colleagues predicted MDR infection with a multivariate logistic regression model built on a public database and found that the main risk factors for MDR infection were heavy previous use of antibiotics and the site and degree of infection in the previous three months 25 . The researchers included a total of 120,000 samples, but there were few variables, and the models could only vaguely indicate which patients were likely to develop infections. Katherine and colleagues used a multivariate logistic regression model and a decision tree model to assess the risk of extended-spectrum β-lactamase (ESBL) bacteremia using data from 1288 cases of gram-negative bacteremia. A total of 14 variables were included. The decision tree and the multivariate logistic regression model performed similarly in predicting ESBL infection 21 . However, the sample size and variables were limited, and the imbalance between the number of positive cases (15%) and control cases (85%) restricted the performance of machine learning. Wang et al used 512 patients to predict MDR carriage and found that male sex, high CRP levels, and high Pitt scores were high-risk predictors, and a line chart was used to predict the occurrence of MDR 24 . In our study, invasive procedures, including endotracheal intubation, intravenous catheterization and drainage tubes, and hospital residence history over the past month are high-risk factors for CR-GNB carriage, which have been identified in other studies 9,22 .
According to the history of residence in other hospitals, our center adopts preemptive isolation and active surveillance, which can reduce the incidence of carbapenem-resistant Enterobacteriaceae (CRE) 26 . However, these studies only indicate which patients may develop MDR colonization or infection; the exact time remains unknown. As a result, nosocomial infection prevention and control measures face the problems of when to implement and when to remove them, as well as cost-effectiveness and clinical burnout. Therefore, our center developed a one-week CR-GNB prediction model for ICU patients, which aims to support more targeted prevention, including single isolation in advance, special management, and other measures. At present, this study has completed the verification of clinical applicability, and a subsequent randomized controlled clinical trial will be conducted to verify its clinical practicability.
We set one week as the forecast period for several reasons. First, a prediction period that is too short or too long will affect the performance of the prediction model. Second, ICU hospitalization is a high-risk factor for CR-GNB: the longer the ICU stay, the higher the incidence of CR-GNB. The average length of stay in our center is about 6 days, and most patients hospitalized for less than a week are postoperative patients who are not the beneficiaries of this study. Lastly, according to the pre-experimental results, the peak time of CR-GNB carriage in our center was 5 days, after which it decreased slowly. Using one week as the prediction node can balance the comparability between the trial group and the control group. Besides, sputum samples were still dominant in the colonization and infection of CR-GNB in this study, which was related to the types of diseases, including severe craniocerebral trauma and community-acquired pneumonia, that required long-term mechanical ventilation. These patients send sputum samples for examination more often than samples from other sites and stay longer than general postoperative patients. Therefore, they are more likely to be included, which may cause bias in the study.
This study uses the central database, which has both advantages and disadvantages. On the one hand, the local database has diversified and integrated variables, which is difficult to achieve in a public database. The central database is updated in real time and makes continuous prediction feasible. Also, the real-time updated database allows the model to learn iteratively, incorporating new data and keeping pace with the times. On the other hand, a public database has more data and a more complete range of diseases than a single-center database, and is supported by stable models and multi-center research. However, its problems are also prominent: the variables are fixed and may be reduced in number to ensure completeness, and updates are slow, so timely prediction cannot be achieved. The sample size of this study is limited, and as single-center research, follow-up promotion will take a long time.

Conclusion:
The machine learning prediction model can predict the occurrence of CR-GNB carriage within one week, with an accuracy of 84%. This model can identify groups at high risk of CR-GNB carriage in real time and help guide medical staff to take more targeted nosocomial prevention and control measures.
Abbreviations
CR-GNB: carbapenem-resistance gram-negative bacteria; AUROC: area under the receiver operating characteristic; MDRB: multidrug-resistant bacteria; ICU: intensive care unit

Declarations
Ethics approval and consent to participate: Our research has been approved by the ethics committee of the Second Affiliated Hospital of Zhejiang University School of Medicine.
Consent for publication:

Figure 1
Flow chart of study. CR-GNB: carbapenem-resistance gram-negative bacteria.

Figure 2
Important characteristics of model parameters by random forest, with all variables included.

Figure 3
Area under the receiver operating characteristic curve on the validation set for the three machine learning models and multivariate logistic regression.

Figure 4
Area under the receiver operating characteristic curve on the test set for the three machine learning models and multivariate logistic regression.