Settings and study design
The study was conducted at Hôpital Européen Marseille, a French general hospital located in the busy city center of a city of nearly 900,000 inhabitants in southeastern France. The hospital has been fully paperless since its opening in 2013, and all inpatient and outpatient information is stored electronically in a single comprehensive electronic health record (EHR) called QCare ICU Manager (Health Information Management GmbH, Bad Homburg, Germany).
The study was designed as a monocentric retrospective analysis of positive bacterial culture and susceptibility data (antibiograms) performed at Hôpital Européen Marseille between January 2014 and December 2020.
Using covariates available in the laboratory database (specimen origin, type of ward, previous multidrug-resistant [MDR] bacteria carriage and sample date), we first designed different frequentist, Bayesian and state-of-the-art machine learning models predicting antibiotic susceptibility (Supplementary Table S2), for each stage preceding antibiogram results: (1) “sampling” stage (specific ecology for a body site); (2) “direct” stage, after Gram stain examination of the sample when available, at day 0 or 1; (3) “culture” stage, after macroscopic and Gram stain examination of positive cultures, usually from day 1 to 3; and (4) “species” stage, after bacterial identification of positive cultures, usually at day 2 or 3. We trained models on the 2014–2019 dataset (training + test), and used them to predict the susceptibility probability to each of 22 single antibiotics and 25 antibiotic combinations for isolates of the 2020 dataset (validation).
Data
We extracted antibiograms from the microbiology electronic records of the hospital's accredited bacteriology laboratory between January 2014 and December 2020. Specimen culturing followed the latest guidelines of French microbiology learned societies 17. Bacterial identification was performed using matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-TOF MS) with VITEK® MS (bioMérieux France, Craponne, France). Antibiotic susceptibility testing was performed using the VITEK® 2 automated system (bioMérieux France) or diffusion techniques on agar plates when relevant, and SIRxpert Master® software (i2a, Montpellier, France), based on the current French and European guidelines (CA-SFM EUCAST) 18.
Covariate definition and data preparation
Based on the metadata and results of each antibiogram, we characterized: (1) the type of ward (e.g. emergency room, intensive care, medicine, surgery, day care unit…); (2) the body site origin of the specimen (e.g. blood or intravenous catheter, urine, lower respiratory, joint/bone, digestive, genital, cerebrospinal…); (3) the history of previous multidrug-resistant (MDR) bacteria carriage for the same patient within the past three months (i.e. the usual threshold for risk factors of multiresistant bacterial infections) 19; (4) the clinical relevance of bacterial isolates considering the body site (i.e. relevant, or likely contaminant such as a unique blood culture positive for common bacteria of the skin flora like Staphylococcus epidermidis) 20; and (5) duplicates (i.e. similar isolates and antibiograms within the past two days). To provide antibiotic susceptibility predictions based on information available at the “direct” and “culture” stages preceding species identification, usually at day 0 to 2, we also characterized: (6) the typical Gram stain features on direct examination of the sample (e.g. Gram-positive cocci, Gram-negative rods…); and (7) the typical macroscopic and Gram stain features of positive cultures (e.g. Staphylococcus-like Gram-positive cocci, Enterobacteriaceae-like Gram-negative bacteria, non-fermentative Gram-negative rods…) (Supplementary Table S3).
Finally, we interpreted (8) the susceptibility to 22 single antibiotics (traditional antibiogram) and 25 antibiotic combinations (combination antibiogram) of clinical interest (Table 3). Isolates with intermediate susceptibility were classified as resistant to the antibiotic. For strains whose susceptibility results were not available for a specific antibiotic, we interpreted susceptibility based on recommended expert rules on intrinsic and cross-resistance 18,21.
Table 3
– List of analyzed single antibiotics and antibiotic combinations of clinical interest
| Single antibiotics | Antibiotic combinations |
| --- | --- |
| 1. Amoxicillin | 1. Amox-Clav + Amikacin |
| 2. Amox-Clav | 2. Pip-Tazo + Amikacin |
| 3. Oxacillin / Cefazolin | 3. CTX / CRO + Amikacin |
| 4. Pip-Tazo | 4. Ceftazidime + Amikacin |
| 5. Cefotaxime / Ceftriaxone | 5. Cefepime + Amikacin |
| 6. Ceftazidime | 6. Aztreonam + Amikacin |
| 7. Cefepime | 7. Imipenem + Amikacin |
| 8. Aztreonam | 8. Meropenem + Amikacin |
| 9. Imipenem | 9. Amox-Clav + Gentamicin |
| 10. Meropenem | 10. Pip-Tazo + Gentamicin |
| 11. Ertapenem | 11. Imipenem + Gentamicin |
| 12. Amikacin | 12. Vancomycin + Gentamicin |
| 13. Gentamicin | 13. Amox-Clav + Ciprofloxacin |
| 14. Ciprofloxacin | 14. Pip-Tazo + Ciprofloxacin |
| 15. Levofloxacin | 15. CTX / CRO + Ciprofloxacin |
| 16. TMP-SMX | 16. Ceftazidime + Ciprofloxacin |
| 17. Vancomycin | 17. Cefepime + Ciprofloxacin |
| 18. Rifampicin | 18. Aztreonam + Ciprofloxacin |
| 19. Clindamycin | 19. Imipenem + Ciprofloxacin |
| 20. Macrolides | 20. Meropenem + Ciprofloxacin |
| 21. Linezolid | 21. Pip-Tazo + Vancomycin |
| 22. Metronidazole | 22. Cefepime + Vancomycin |
|  | 23. Meropenem + Vancomycin |
|  | 24. CTX / CRO + Metronidazole |
|  | 25. Cefepime + Metronidazole |

Amox-Clav, Amoxicillin-Clavulanate (Augmentin*); Pip-Tazo, Piperacillin-Tazobactam (Tazocilline*); TMP-SMX, Trimethoprim-Sulfamethoxazole (Bactrim*); CTX / CRO, Cefotaxime / Ceftriaxone.
After removing likely contaminants, duplicates and MDR bacteria carriage screening specimens (rectal and nose swabs), we estimated mean antibiotic susceptibility rates and constructed mean antibiograms for single antibiotics (traditional antibiograms) or antibiotic combinations (combined antibiograms) 22, for each category of isolates: bacterial species; bacterial type on direct examination; bacterial type on culture; specimen type; ward type; and past history of MDR carriage. To illustrate the susceptibility differences between categories of isolates, these mean antibiograms were plotted. Root-mean-square deviations (RMSD) were also estimated for each category, using the mean susceptibility rates of all isolates as a reference.
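As an illustration, the RMSD of one category against the overall reference can be computed as below. The mean antibiograms here are hypothetical fractions, not the study's actual estimates:

```python
from math import sqrt

def rmsd(category_rates, overall_rates):
    """Root-mean-square deviation between a category's mean susceptibility
    rates and the overall mean rates, over the antibiotics present in both."""
    shared = [ab for ab in category_rates if ab in overall_rates]
    if not shared:
        raise ValueError("no antibiotic in common")
    return sqrt(sum((category_rates[ab] - overall_rates[ab]) ** 2
                    for ab in shared) / len(shared))

# Hypothetical mean antibiograms (fraction of susceptible isolates)
overall = {"Amoxicillin": 0.55, "Gentamicin": 0.90, "Ciprofloxacin": 0.75}
icu     = {"Amoxicillin": 0.35, "Gentamicin": 0.80, "Ciprofloxacin": 0.55}
deviation = rmsd(icu, overall)  # large value = ecology far from the average
```

A category with susceptibility rates close to the hospital-wide average thus gets an RMSD near zero, while a ward or species with a markedly different ecology stands out.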
We converted all categorical variables into numerical ones, and then split the dataset into three parts: training, test and validation. The training and test datasets comprised 80% and 20%, respectively, of the data recorded between 2014 and 2019, whereas the validation dataset comprised all the data gathered in 2020. For machine learning models, the sample date was treated as a distinct variable in order to account for the variation of antibiotic resistance over time.
In the following sections, we describe the details of each model.
Frequentist inference models
For each antibiotic and each stage preceding antibiogram results, we first trained very simple frequentist inference models (FRQ) using the last year of the training + test dataset (2019) and validated them using the validation dataset (2020). These frequentist models directly estimated the posterior susceptibility probability as the proportion of susceptible isolates among isolates with similar features in 2019 (Eq. 1, example for the “species” stage).
Eq. 1:\({P(S\mid Species, Specimen, Ward, MDR)}_{2019}=\frac{{N(S\mid Species,Specimen,Ward,MDR)}_{2019}}{{N(Species,Specimen, Ward,MDR)}_{2019}}\)
where \({P\left(S\mid Species,Specimen,Ward,MDR\right)}_{2019}\) is the probability of susceptibility \(P\left(S\right)\) to a given antibiotic group or combination, within the last year of the training + test dataset (2019), given the \(Species\) of bacteria grown from positive cultures, the \(Specimen\) origin, the type of hospital \(Ward\), and the previous history of \(MDR\) carriage (Supplementary Table S2). When no similar situation was available in the training + test dataset, we attributed a susceptibility probability of 0.5, so that these isolates were not excluded from AUC estimations. With this kind of “intention-to-treat” analysis, AUCs were thus penalized by prediction failures, whereas a “per-protocol” analysis excluding prediction failures would have artificially inflated AUCs.
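A minimal sketch of the Eq. 1 estimate and its 0.5 fallback, using hypothetical isolate records and field names rather than the actual laboratory schema:

```python
from collections import defaultdict

def train_frequentist(isolates):
    """Count susceptible / total isolates per feature combination (Eq. 1)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_susceptible, n_total]
    for iso in isolates:
        key = (iso["species"], iso["specimen"], iso["ward"], iso["mdr"])
        counts[key][0] += iso["susceptible"]
        counts[key][1] += 1
    return counts

def predict_frequentist(counts, species, specimen, ward, mdr):
    """Posterior P(S | features); fall back to 0.5 when the combination was
    never observed (counted as a prediction failure in the analysis)."""
    n_s, n = counts.get((species, specimen, ward, mdr), (0, 0))
    return n_s / n if n else 0.5

# Hypothetical 2019 isolates: 3 susceptible, 1 resistant
data = [
    {"species": "E. coli", "specimen": "urine", "ward": "medicine",
     "mdr": False, "susceptible": s}
    for s in (1, 1, 1, 0)
]
counts = train_frequentist(data)
p_seen   = predict_frequentist(counts, "E. coli", "urine", "medicine", False)
p_unseen = predict_frequentist(counts, "P. aeruginosa", "blood", "ICU", False)
```

The unseen combination returns 0.5 instead of raising, so those isolates stay in the AUC computation, as described above.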
Bayesian inference models
Using the whole training + test dataset (2014–2019), we trained one Bayesian inference model (BAY) per antibiotic and stage and validated them using the validation dataset (2020). We first estimated likelihood ratios, \(LRT\), from prior and posterior susceptibility probabilities, after converting probabilities \(P\left(S\right)\) into odds \(O\left(S\right)\) (Eq. 2 and Eq. 3) 23. For all combinations of isolate features, posterior probabilities were then approximated using prior probabilities and the corresponding likelihood ratios (Eq. 4), and conversion from odds \(O\left(S\right)\) back to probabilities \(P\left(S\right)\) (Eq. 5).
Eq. 2:\(O\left(S\right)=\frac{P\left(S\right)}{1-P\left(S\right)}\)
Eq. 3:\({LRT}_{Species}=\frac{O\left(S\mid Species\right)}{O\left(S\right)}\)
Eq. 4:\({O\left(S\mid Species,Specimen,Ward,MDR\right)}_{2019}={O\left(S\right)}_{2014-19}\times {LRT}_{2019}\times {LRT}_{Species}\times {LRT}_{Specimen}\times {LRT}_{Ward}\times {LRT}_{MDR}\)
Eq. 5:\(P\left(S\right)=\frac{O\left(S\right)}{1+O\left(S\right)}\)
where \({O\left(S\right)}_{2014-19}\) is the prior odds of susceptibility \(S\) to a given antibiotic group or combination within the whole training + test dataset (2014–2019), \({LRT}_{2019}\) is the likelihood ratio for the last year of the training + test dataset (2019), and \({LRT}_{Species}\), \({LRT}_{Specimen}\), \({LRT}_{Ward}\) and \({LRT}_{MDR}\) are the likelihood ratios for the \(Species\), the \(Specimen\), the hospital \(Ward\), and the previous \(MDR\) carriage, respectively (Supplementary Table S2). Models were implemented using Python v3.6.
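The Eq. 2–5 pipeline can be sketched as follows, assuming illustrative made-up susceptibility rates rather than the study's actual estimates:

```python
def odds(p):
    """Eq. 2: convert a probability into odds."""
    return p / (1 - p)

def prob(o):
    """Eq. 5: convert odds back into a probability."""
    return o / (1 + o)

def lrt(p_conditional, p_prior):
    """Eq. 3: likelihood ratio of a feature, O(S | feature) / O(S)."""
    return odds(p_conditional) / odds(p_prior)

# Hypothetical prior and feature-conditional susceptibility rates
p_prior  = 0.60                   # P(S) over the whole 2014-2019 dataset
lrt_2019 = lrt(0.55, p_prior)     # temporal trend (last year)
lrt_spec = lrt(0.70, p_prior)     # given the species
lrt_samp = lrt(0.65, p_prior)     # given the specimen origin
lrt_ward = lrt(0.50, p_prior)     # given the hospital ward
lrt_mdr  = lrt(0.30, p_prior)     # given previous MDR carriage

# Eq. 4: multiply the prior odds by all likelihood ratios, then convert back
posterior = prob(odds(p_prior) * lrt_2019 * lrt_spec
                 * lrt_samp * lrt_ward * lrt_mdr)  # ≈ 0.31 here
```

Each likelihood ratio shifts the prior odds up or down; here the (made-up) MDR carriage pulls the posterior well below the 0.60 prior.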
Machine learning models
We trained and tested state-of-the-art supervised machine learning models using the training + test dataset (2014–2019) and validated them using the validation dataset (2020): logistic regression (LR), as a baseline; ensemble models (AdaBoost (ADA), Gradient Boosting (GBS), Extreme Gradient Boosting (XGB), Bagging (BAG) 24, and Random Forest (RF) 25); and neural networks (NN). Support-vector machine (SVM) models were discarded during the exploratory analysis because they performed poorly, which was not surprising given that our dataset included only a few categorical features.
Logistic regression is a generalized linear model that computes a weighted sum of the input features (plus a bias term) in order to estimate the probability that an instance belongs to a particular class.
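For illustration, this weighted-sum-plus-sigmoid computation can be written in a few lines of pure Python, with hypothetical coefficients (the study fitted LR with scikit-learn):

```python
from math import exp

def logistic_predict(weights, bias, features):
    """Weighted sum of the input features plus a bias term, squashed
    through the logistic (sigmoid) function into a class probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + exp(-z))

# Hypothetical coefficients for three numerically encoded covariates
p = logistic_predict([0.8, -1.2, 0.3], bias=-0.1, features=[1, 0, 1])
```

Positive weights push the susceptibility probability above 0.5, negative ones below; the exponentiated weights are the odds ratios discussed later under model interpretability.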
Ensemble models are a group of learning methods that combine several weak learners into a strong learner, providing a decision based on a majority vote of the whole ensemble. We chose models using decision tree learners given their suitability for categorical variables. We first considered three boosting methods, which train predictors sequentially so that each predictor aims to correct its predecessor: AdaBoost (ADA) 26 predictors focus specifically on cases that previous learners have underfit, and thus progressively become better at classifying difficult cases; Gradient Boosting (GBS) 27 tries to fit each new predictor to the residual errors made by the previous one; and Extreme Gradient Boosting (XGB) 28 uses a more regularized model formalization than GBS to control over-fitting, which generally improves performance. Bagging (BAG) 24 is another ensemble learning approach, which applies the same training model for every predictor and trains each predictor on a different random subset of the training set. Random Forest (RF) 25, an example of this type of model, is considered an improvement on bagging that changes the way the sub-trees are learned, so that predictions resulting from the subtrees are less correlated.
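A toy sketch of the bagging-plus-majority-vote idea, assuming one-level decision stumps as weak learners and a made-up encoded dataset (the study used scikit-learn's implementations, not this code):

```python
import random

def train_stump(sample):
    """One-level decision tree: best single-feature threshold split."""
    best = None
    n_feat = len(sample[0][0])
    for f in range(n_feat):
        for t in sorted({x[f] for x, _ in sample}):
            for side in (0, 1):  # class predicted when x[f] <= t
                err = sum((side if x[f] <= t else 1 - side) != y
                          for x, y in sample)
                if best is None or err < best[0]:
                    best = (err, f, t, side)
    _, f, t, side = best
    return lambda x: side if x[f] <= t else 1 - side

def bagging_fit(data, n_estimators=25, seed=0):
    """Train each stump on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data])
            for _ in range(n_estimators)]

def bagging_predict(stumps, x):
    """Majority vote over the whole ensemble."""
    votes = sum(s(x) for s in stumps)
    return 1 if votes * 2 >= len(stumps) else 0

# Toy encoded data: susceptible (1) exactly when the first feature is 0
data = [([0, 1], 1), ([0, 0], 1), ([1, 1], 0), ([1, 0], 0)] * 5
model = bagging_fit(data)
```

Random Forest adds one twist to this scheme: each split only considers a random subset of features, which decorrelates the trees further.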
Finally, artificial neural networks (NN) 29 are based on a collection of connected unit functions, or nodes, called artificial neurons. In this contribution we implemented one dense neural network for each stage. The NN-based models are composed of one input layer, three hidden dense layers, one batch normalization layer before each hidden dense layer, one dropout layer after each hidden dense layer to reduce overfitting, and one output layer with one node per antibiotic or antibiotic combination. The number of nodes per hidden layer ranges between 50 and 150, depending on the position of the hidden layer and the stage. The number of nodes in each layer, the number of layers and the dropout rate were chosen using a grid search procedure with cross-validation. The activation functions are set to ReLU (Rectified Linear Unit) for the intermediate layers and to sigmoid for the final layer.
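An inference-only sketch of this layer stack in pure Python, with hypothetical tiny weight matrices (the actual models were built with Keras; batch normalization and dropout are omitted here because they act at training time):

```python
from math import exp

def relu(v):
    """Rectified Linear Unit, applied element-wise to a layer's outputs."""
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1 / (1 + exp(-x))

def dense(weights, bias, v):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [b + sum(w * x for w, x in zip(row, v))
            for row, b in zip(weights, bias)]

def forward(x, hidden_layers, out_w, out_b):
    """ReLU on the hidden dense layers, sigmoid on each output node
    (one node per antibiotic or antibiotic combination)."""
    for w, b in hidden_layers:
        x = relu(dense(w, b, x))
    return [sigmoid(z) for z in dense(out_w, out_b, x)]

# Hypothetical tiny network: 2 inputs -> 3 hidden nodes -> 2 outputs
hidden = [([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])]
probs = forward([1.0, 0.5], hidden,
                [[0.6, -0.4, 0.2], [0.3, 0.3, 0.3]], [0.0, -0.2])
```

The sigmoid output layer is what lets every node be read directly as a susceptibility probability for one antibiotic.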
To train the different ML models, we split the 2014–2019 dataset into 80% for training and 20% for testing. For each of the considered models, we implemented one classifier per stage with one output per antibiotic. The cofactors or features included in each model are summarized in Supplementary Table S2. For stage 4 (species), we also included the culture type, as this cofactor markedly improved predictions for rare situations, probably because the antibiotic susceptibility of rare species is generally close to that of species from the same culture type (for instance, non-fermentative Gram-negative rods). However, we chose not to include the direct type as a cofactor within stage 3 and 4 models, as such types (for instance, Gram-positive cocci) include numerous species with contrasting susceptibility patterns.
Models were implemented using Python v3.6 and the Scikit-Learn Python library 30. For the neural networks, we used the Keras library and KerasClassifier, a scikit-learn API wrapper 31. To ensure model generalization, all models were trained using a 10-fold cross-validation procedure over the training dataset, then tested using the test dataset and validated using the validation dataset; the displayed results are those obtained with the validation dataset. To select the hyperparameters of each model, we implemented a grid search procedure with cross-validation.
Model comparisons
We then compared the prediction performance of frequentist, Bayesian and machine learning models by conducting a receiver operating characteristic (ROC) analysis on the validation dataset, with antibiotic susceptibility probabilities as classifier scores and antibiograms as binary outcomes. We estimated the area under the ROC curve (AUC) for each antibiotic and antibiotic combination at each stage, and the mean AUC for each model at each stage. We plotted mean ROC curves as well as AUC distributions. We then compared models by their mean AUC and by their prediction failure rate (for FRQ models): (1) globally; and (2) for the least frequent situations only (5th percentile of the number of occurrences in the 2014–2019 dataset, e.g. a rare species in a rare specimen in a rare ward). Finally, we chose the best model family.
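The AUC can be estimated without tracing the curve, through the equivalent Mann-Whitney formulation; a sketch with made-up scores and outcomes:

```python
def auc(probs, outcomes):
    """Area under the ROC curve via the Mann-Whitney U statistic: the
    probability that a susceptible isolate (outcome 1) receives a higher
    predicted score than a resistant one (outcome 0); ties count one half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities vs observed antibiogram (1 = S)
score = auc([0.9, 0.8, 0.4, 0.6, 0.3], [1, 1, 0, 0, 1])
```

An AUC of 0.5 corresponds to a non-informative model, which is why attributing 0.5 to prediction failures (as in the FRQ models) penalizes rather than inflates the estimate.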
Model interpretability
Understanding the predictions of models is key in any application for health purposes. This was not an issue for our FRQ and BAY models, which reflect the mean antibiograms shown in Fig. 1 through the direct combination of features (FRQ) or feature-specific likelihood ratios (BAY). In LR, the influence of covariates is captured by odds ratios. RF models also allow feature interpretation by providing the degree of contribution of each variable to the final decision. Interpretation becomes more challenging with methods that are not directly interpretable and are often considered “black boxes”, especially neural networks. For such cases, we used SHapley Additive exPlanations (SHAP), as done in other machine learning implementations for health-purpose predictions 33,34. SHAP is based on the Shapley value, a game-theoretic approach that calculates the average marginal contribution of a feature value across all possible coalitions. We applied this SHAP analysis, using the SHAP Python package 32, to the machine learning models chosen after comparing prediction performances, and obtained the relative contribution of each cofactor, which was plotted for each stage of the identification process. The contribution of each cofactor can then be examined and visualized. As an illustration of how cofactors can influence the prediction of antibiotic susceptibility, we then plotted SHAP values at the individual level using a didactic scenario. Using the same scenario, we finally compared NN predictions and covariate-specific SHAP values with the corresponding BAY predictions and likelihood ratios.
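A minimal sketch of the Shapley idea behind SHAP, using a hypothetical additive model so the exact values are easy to verify (the study used the SHAP package, not this brute-force enumeration):

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution of each feature
    over all orderings, building the coalition one feature at a time."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            contrib[f] += value(coalition | {f}) - value(coalition)
            coalition = coalition | {f}
    return {f: c / len(orderings) for f, c in contrib.items()}

# Hypothetical model output as a function of which cofactors are "known";
# in this toy scenario MDR carriage shifts the prediction most, ward least.
base = 0.6
effects = {"species": -0.10, "specimen": -0.05, "ward": 0.01, "mdr": -0.25}

def value(coalition):
    return base + sum(effects[f] for f in coalition)

phi = shapley_values(list(effects), value)
```

Because this toy model is additive, each cofactor's Shapley value equals its individual effect, and the values sum to the gap between the full prediction and the base rate: the same additivity property SHAP guarantees for real models.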
Ethics
This study exclusively used electronic health records routinely collected by the bacteriology lab of Hôpital Européen. All data were pseudonymized. Authors received authorization #DR-2020-047 of the French national commission for data protection (CNIL, Commission nationale de l'informatique et des libertés).