Data processing
For all models, data were extracted from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC)-III version 1.3 dataset, collected from the ICU at Beth Israel Deaconess Medical Center in Boston, Massachusetts [29]. MIMIC-III contains EHR data (including lab results) and clinical notes on over 40,000 individual patient encounters. All MIMIC-III data were passively extracted from the patient EHR, de-identified, and collected in compliance with the Health Insurance Portability and Accountability Act.
Intubation task
Data were included from encounters of patients aged 18 years or older, with a minimum of one observation of each of the following vital signs and lab tests: diastolic blood pressure (DiasBP), creatinine, Glasgow Coma Scale (GCS), heart rate (HR), oxygen saturation (SpO2), platelet count, respiratory rate (RR), systolic blood pressure (SysBP), temperature, hematocrit, and white blood cell count (WBC). Hematocrit has been shown to improve other pneumonia-related predictions [30]. Community-acquired pneumonia patients were identified by the presence of a pneumonia diagnosis at admission and were excluded. All encounters for this task were required to involve at least one period of invasive mechanical ventilation. As VAP is defined as pneumonia developing after 48 hours following intubation, encounters were required to last at least 48 hours after intubation. All mechanically ventilated patients in this dataset met this requirement. ML models were compared against the CURB-65 [19, 20], VAP PIRO [15], and CPIS scoring systems [17]. To facilitate the comparison with CURB-65, we required encounters to include at least one measurement of blood urea nitrogen (BUN). These exclusion steps are summarized in Figure 1. For this task, the windows of data used to generate predictions were calculated backwards from the 48th hour following the initiation of ventilation. That is, a 12-hour intubation task model used the 12 hours of data up to and including 48 hours after initiation of mechanical ventilation, or hours 37 to 48 after the initiation of mechanical ventilation. All windows for this intubation task included data from an identical number of patients.
Admission task
Identical exclusion criteria were applied for the admission task as were applied to the intubation task, with the exception of the initiation of mechanical ventilation requirement, which was not applied. For this task, the windows of data used to generate predictions were calculated forward from the time of ICU admission. For example, a 12-hour admission task model used the first 12 hours of data after a patient was admitted to the ICU, after which point a VAP risk prediction was generated. Patients were required to have a length of stay as long as or longer than the prediction window being examined; the number of patients included in the experiments therefore varied by prediction window (Figure 1).
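The two windowing schemes described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; it assumes hourly-binned data indexed by hours since ICU admission (or since the start of ventilation, for the intubation task).

```python
# Illustrative sketch of the two prediction-window schemes.

def intubation_window(vent_start_hour: int, window_len: int) -> range:
    """Window counted backwards from the 48th hour after intubation.

    E.g., a 12-hour window covers hours 37..48 after ventilation starts.
    """
    end = vent_start_hour + 48                    # 48th hour post-intubation
    return range(end - window_len + 1, end + 1)

def admission_window(window_len: int) -> range:
    """Window counted forward from ICU admission (hours 1..window_len)."""
    return range(1, window_len + 1)
```

For example, `intubation_window(0, 12)` yields hours 37 through 48, matching the 12-hour intubation task model described above.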
For both tasks, we extracted patient baseline and time-varying clinical measurements for each encounter. Baseline data included age and a Boolean value for the presence of any relevant comorbidities or symptoms at the time of admission (bacteremia, cirrhosis, congestive heart failure, fever, intracranial hemorrhage, renal failure, respiratory distress, respiratory failure, sepsis, subarachnoid hemorrhage, and shortness of breath). We additionally included an indicator for acute respiratory distress syndrome (ARDS), as pneumonia is associated with ARDS [31]. Time-varying clinical measurements included the required vital signs and laboratory tests, as well as urine output (evaluated as the number of urine measurements over the duration of the stay) and blood culture information (evaluated as whether any test was ordered during the relevant window, and as the total test count during the window). We further included the hour of the initiation of mechanical ventilation and the number of accumulated mechanical ventilation hours at the time of prediction.
Raw measurements were binned into one-hour intervals and averaged within bins to produce a single representative value for each hour. Missing values were imputed with median values computed using only the training set, so that no information from the hold-out test set could influence the imputation. We calculated six summary statistics (minimum, maximum, median, first, last, and average) of each vital sign and laboratory test over a variable-length window (Table 1). Specifically, for each window length k, we calculated the statistics over the k hours preceding and including the 48th hour after the initiation of mechanical ventilation (for the intubation task) or over the first k hours after admission (for the admission task). Age, the number of total urine output events, and the number of blood culture tests were kept in their raw form. Boolean indicators were added for the presence of antibiotics, sputum labs, blood culture labs, the comorbidities and symptoms listed above, and ARDS (Table 1). All variables were then concatenated into one vector for each encounter.
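The featurization pipeline above can be sketched as below. This is a hedged illustration, not the authors' implementation: the column names (`hour`, `hr`) and the single-encounter DataFrame layout are assumptions, and the training-set medians are assumed to be precomputed.

```python
import pandas as pd

def featurize(df: pd.DataFrame, train_medians: pd.Series) -> pd.Series:
    """Hourly binning, train-only median imputation, six summary stats.

    df: one encounter's raw measurements with an integer 'hour' column
        plus one column per vital sign or lab (illustrative names).
    train_medians: per-variable medians computed on the training set only.
    """
    hourly = df.groupby("hour").mean()      # average within 1-hour bins
    hourly = hourly.fillna(train_medians)   # impute from training medians only
    stats = {}
    for col in hourly.columns:
        s = hourly[col]
        stats.update({
            f"{col}_min": s.min(), f"{col}_max": s.max(),
            f"{col}_median": s.median(), f"{col}_first": s.iloc[0],
            f"{col}_last": s.iloc[-1], f"{col}_mean": s.mean(),
        })
    return pd.Series(stats)                 # one flat feature vector
```

The returned vector would then be concatenated with the raw-form variables (age, urine output events, blood culture counts) and the Boolean indicators.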
Table 1: Data included as input to the algorithm

Required vitals and labs
- Systolic BP
- Diastolic BP
- HR
- Respiratory Rate
- Temperature
- Hematocrit
- SpO2
- BUN
- GCS
- Platelet Count
- WBC
- Creatinine

Boolean Indicators
- Antibiotics
- Sputum labs
- Blood culture labs
- Any of cirrhosis, congestive heart failure, fever, bacteremia, intracranial hemorrhage, renal failure, respiratory distress, respiratory failure, sepsis, subarachnoid hemorrhage, shortness of breath
- Acute respiratory distress syndrome (ARDS)

Optional Measures
- Age
- Total urine output events
- Number of blood culture tests
- Number of sputum tests
- Number of MV hours
Gold Standard
The International Classification of Diseases, Ninth Revision (ICD-9) code 997.31 was the gold standard definition of VAP. Literature assessing the accuracy of ICD codes for VAP identification remains limited [32]. However, studies have suggested that, while the sensitivity of administrative coding may be only moderate for VAP identification, specificity and negative predictive value (NPV) are quite high [33, 34].
Machine Learning Methods and Comparators
For each prediction task and each window length, we trained and tested five ML models: logistic regression, multilayer perceptron, random forest, support vector machine, and gradient boosted trees. Given the absence of prior literature on VAP prediction, we favored variety in our choice of ML methods. The logistic regression and support vector machine models were chosen as representative linear models, and the random forest and gradient boosted trees models as representative ensemble learning and tree-based methods. The multilayer perceptron model was included in lieu of neural network models with more layers, as there were too few training examples to train such models effectively. Except for the gradient boosted trees model, which was created using the XGBoost Python package, the models were implemented using the scikit-learn Python package.
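The five model families named above could be instantiated as follows. The hyperparameter values shown are illustrative defaults, not the tuned settings, and the fallback for a missing XGBoost installation is an assumption added for portability.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

try:
    from xgboost import XGBClassifier
    gbt = XGBClassifier(eval_metric="logloss")
except ImportError:  # fallback assumption if XGBoost is unavailable
    from sklearn.ensemble import GradientBoostingClassifier
    gbt = GradientBoostingClassifier()

# One instance per model family described in the text.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "multilayer_perceptron": MLPClassifier(hidden_layer_sizes=(64,)),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "support_vector_machine": SVC(probability=True),  # probabilities for AUROC
    "gradient_boosted_trees": gbt,
}
```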
We compared the performance of the machine learning models to the CURB-65, VAP PIRO, and CPIS scores for evaluating pneumonia severity. CPIS performance was estimated from the literature [14], as it could not be calculated in our dataset. We implemented CURB-65 and VAP PIRO in our dataset [18, 19]. CURB-65 values were calculated for each hour as the number of the following criteria met: BUN > 19 mg/dL, respiratory rate ≥ 30 breaths/min, systolic BP < 90 mmHg or diastolic BP ≤ 60 mmHg, and age ≥ 65 years. We tried several variations of assigning a CURB-65 score to a temporal window, including its maximum, average, and last value over the window. As the results were similar in each case, we report its average over the window. PIRO is a four-variable score based on predisposition, insult, response, and organ dysfunction. The score is calculated by assigning one point for each of four criteria: a relevant comorbidity (chronic obstructive pulmonary disease, immunocompromise, heart failure, cirrhosis, or chronic renal failure), bacteremia, systolic BP < 90 mmHg, and ARDS.
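The hourly CURB-65 calculation and window averaging described above can be sketched as follows; this is an illustrative reconstruction with assumed variable names, not the authors' code.

```python
# One point per criterion met in a given hour (four-criterion version
# used in the text; BUN in mg/dL, pressures in mmHg, age in years).

def curb65_hour(bun, rr, sysbp, diasbp, age):
    return sum([
        bun > 19,                  # blood urea nitrogen > 19 mg/dL
        rr >= 30,                  # respiratory rate >= 30 breaths/min
        sysbp < 90 or diasbp <= 60,
        age >= 65,
    ])

def curb65_window(hours):
    """Average hourly score over the window, as reported in the text."""
    scores = [curb65_hour(**h) for h in hours]
    return sum(scores) / len(scores)
```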
The data were partitioned uniformly at random into a set for training and hyperparameter tuning (90%) and a 10% hold-out test set, against which all trained models were evaluated for final performance metrics. For each task and window length k, each model was trained using four-fold grid search cross-validation on the 90% training set. After searching the space of hyperparameter values, the hyperparameters that produced the best cross-validation AUROC were chosen. Each model was then tested on the 10% hold-out test set. Feature importance was measured through Shapley additive explanation (SHAP) values to assess similarities and differences in the features used to generate predictions across model types.
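The evaluation protocol can be sketched with scikit-learn as below. The synthetic data and the hyperparameter grid are illustrative assumptions; the 90/10 split, four-fold cross-validation, and AUROC scoring follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the concatenated feature vectors and VAP labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# 90% train/tune, 10% hold-out test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)

# Four-fold grid search on the training set, scored by AUROC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=4, scoring="roc_auc",
)
search.fit(X_tr, y_tr)

# Final evaluation of the refit best model on the hold-out test set.
test_auroc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
```

SHAP values for the fitted model could then be computed on the hold-out set (e.g., with the `shap` package's tree explainer) to compare feature importance across model types.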