2.1 VAP annotation
The first step in building our model is to curate a dataset of MV episodes with ground-truth labels for VAP and no-VAP events. For prediction tasks, we need to mark the onset of VAP events to enable the development of models that estimate VAP risk ahead of the VAP onset time (i.e., across a prediction gap). While there are some necessary criteria for diagnosing VAP, no clinical definition provides sufficient, sensitive, and specific criteria for identifying VAP events [11]. The absence of ground truth for VAP events and, more importantly, for their timing poses a significant challenge to the development of predictive models.
A common approach to identifying adverse events or diagnoses in EMRs is to use ICD codes. This approach does not provide the onset time of the events, as ICD codes are typically assigned only at discharge. Even if prediction is limited to a one-shot estimate at a pre-set time (e.g., 48hrs after intubation [8]), the prediction time might overlap with the VAP window (the temporal window in which VAP is present and has already been clinically observed). More importantly, ICD codes are not a reliable indicator of adverse events during patient stays; in particular, ICD codes used to chart VAP diagnoses have a positive predictive value of 0.57, with a low sensitivity ranging from 0.18 to 0.35 [12]. On the other hand, the clinical approach to diagnosing VAP requires at least two serial chest radiographs with new or progressive and persistent infiltrates, accompanied by at least two clinical parameters that all require instrumental diagnostic tests (e.g., pulse oximetry, blood and sputum tests, arterial blood gas tests) [5]. Some of these tests are costly and labor intensive and therefore are not widely or frequently available across healthcare settings; hence, their absence could lower the sensitivity of VAP diagnosis. Additionally, chest radiographs are not consistently used and, when available, suffer from high interrater disagreement on radiologic observations [13]. The subjective nature of radiology assessments is one of the main reasons the CDC introduced a new guideline for tracking complications during invasive MV, collectively referred to as ventilator-associated events (VAE) [11].
Proposed EMR criteria for VAP annotation
The IDSA/ATS[1] guidelines define hospital-acquired pneumonia (HAP) as a pneumonia that occurs 48hrs or more after admission to the hospital in patients who did not appear to be incubating it at the time of admission, thereby ruling out community-acquired pneumonia (CAP). Both the CDC[2] and the IDSA/ATS guidelines define VAP as a HAP that arises more than 48-72hrs after endotracheal intubation [14], [5].
We present a set of criteria for VAP annotation (Figure 1) that consolidates clinical guidelines (IDSA/ATS for hospital-acquired infections (HAIs) and VAP [14], NHSN[3] pneumonia [5], and CDC VAE guidelines [11]):
(a) For adult (≥18y) patients on invasive mechanical ventilation, mark suspicions of nosocomial respiratory infection as events comprising 1) administration of a new antibiotic agent temporally contiguous with 2) an ordered microbiological test (culture):
- Within the temporal window starting at 48hrs after intubation and extending to 72hrs after extubation, search for antibiotic agents from the list of eligible agents and administration routes for VAEs [11].
- Ensure that the detected agents are new; an agent is new if it is initiated ≥48hrs after intubation and was not administered in the 2 days preceding its current start date.
- Search for cultures within the temporal window of 72hrs prior to and 24hrs after the detected new antibiotic agent.
- Limit the culture types to those obtained from the respiratory tract [11] (e.g., sputum, endotracheal aspirate, bronchoalveolar lavage, lung tissue, protected specimen brush) and blood cultures [15].
- Mark the onset of a suspicion of infection as the earlier of the new antibiotic administration time and its accompanying culture order time.
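As an illustration, the antibiotic-culture pairing in (a) can be sketched as follows (a minimal sketch; the function and argument names are hypothetical, and the eligible-agent and culture-type filters from the bullet points above are assumed to have been applied upstream):

```python
from datetime import datetime, timedelta

def find_suspicion_onsets(intubation, extubation, antibiotics, cultures):
    """Pair new antibiotic starts with temporally contiguous culture
    orders, per criteria (a). `antibiotics` is a list of
    (agent, start_time) tuples; `cultures` is a list of order times.
    Agent-eligibility and culture-type filtering are assumed done."""
    win_start = intubation + timedelta(hours=48)   # 48hrs after intubation
    win_end = extubation + timedelta(hours=72)     # 72hrs after extubation
    onsets = []
    for agent, start in antibiotics:
        if not (win_start <= start <= win_end):
            continue
        # "New" agent: the same agent was not administered in the
        # 2 days preceding this start date.
        prior = [s for a, s in antibiotics
                 if a == agent and timedelta(0) < start - s <= timedelta(days=2)]
        if prior:
            continue
        # Cultures within 72hrs before to 24hrs after the antibiotic start.
        paired = [c for c in cultures
                  if start - timedelta(hours=72) <= c <= start + timedelta(hours=24)]
        if paired:
            # Suspicion onset: earlier of antibiotic start and culture order.
            onsets.append(min([start] + paired))
    return onsets
```

Under this sketch, a new agent started 84hrs post-intubation with a culture ordered 12hrs earlier yields the culture-order time as the suspicion onset.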
(b) Mark the VAP events from within the suspected infections identified in (a) as those satisfying the following criteria:
- Positive culture results;
- No indication that the suspected infection is associated with community-acquired pneumonia (CAP):
  - no indication of pneumonia in the admission diagnosis;
  - no indication of CAP in the discharge ICD codes;
- Onset of clinical suspicion ≥48hrs after intubation, excluding early infections.
The use of antibiotic administration temporally adjacent to culture orders has previously been reported as the criterion for marking suspicion of infection, notably in the Sepsis-3 definition [16]. Patients with suspected infections as defined in Sepsis-3 were found to have a worse composite outcome, defined as in-hospital mortality and/or ICU length of stay ≥3 days (46.30% vs 19.90%) [17]. Specific to VAP, antibiotic administration at least 48hrs after intubation has previously been used as the criterion to identify patients with presumed VAP in EMRs [18], [19]. The latest IDSA/ATS guidelines [14] also recommend initiating empiric antibiotic treatment as soon as VAP is suspected, accompanied by a culture order to help tailor the antibiotic treatment to the identified pathogens. In another study, the presence of an antibiotic administration and the ordering of a sputum culture were among the features most salient to VAP. Finally, in a recent systematic review, 73% (32/44) of studies used microbiological confirmation of infection as the sole reference standard for VAP diagnosis [20].
Given the lack of a standard diagnosis for VAP, the VAP cases identified in our study using the proposed criteria can be regarded as presumed VAP cases. The concept of presumed infection has previously been used and protocolized in other infection surveillance areas, including presumed serious infection (PSI) in sepsis surveillance systems, defined as at least one blood culture (regardless of the result) temporally contiguous with the start of a new antibiotic treatment that is continued for 4 consecutive days [21]. Computational models for PSI prediction and the importance of such prediction for early interventions have been previously reported (e.g., [22]).
2.2 Prediction model design
Data
The patient data used in this study are derived from the EMR data in the Philips eICU Research Institute (eRI) [23]. The eRI dataset includes EMR data from intensive care units (ICUs) in 459 United States hospitals of various sizes and teaching statuses and was collected over 12 years, from 2004 to 2016. Among the eRI hospitals, 292 have mechanically ventilated (invasively ventilated) patients. We further excluded hospitals that do not report culture data, bringing the number of hospitals to 118. Additionally, for the remaining hospitals, we excluded the years with no culture or antibiotic prescription data.
The methods of data acquisition, validation, aggregation, security procedures, and a description of the ICUs in eRI are reported in [23]. All analyses were conducted after removal of patient and institutional identifying information under a waiver of the requirement to informed consent by the appropriate institutional review boards from participating institutions that contributed to the eRI database. Furthermore, the study and the use of eRI in this work have been approved by the Internal Committee of Biomedical Experiments at Philips.
Sample definition
Samples are defined following a one-shot prediction approach in which one sample is defined per mechanical ventilation episode. The sample definition is governed by three parameters: 1) reference point, 2) prediction gap, and 3) observation window. For MV episodes with a VAP event, the reference point is the time of the event. For each MV episode without VAP events (control patients), we select a random timepoint in the episode, starting at 48hrs post-intubation, as the reference point, with the constraint that the distribution quartiles of the resulting control reference points match those of the VAP samples.
A prediction gap defines how early (in hours) the model is predicting an impending VAP event (see the prediction gap illustration in Figure 1 and 5). In this work, we set the prediction gap to 24hrs prior to the reference point. An observation window (ObW) immediately precedes the prediction gap and is the temporal interval over which we observe the patient and compute an abstract feature-based representation of their instantaneous status based on their EMR entries.
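The relationship between the three sample parameters can be made concrete with a small sketch (function and parameter names are illustrative; the 24hr defaults match the settings used in this work):

```python
from datetime import datetime, timedelta

def observation_interval(reference_point, gap_hr=24, obw_hr=24):
    """Return the (start, end) of the observation window: the ObW
    immediately precedes the prediction gap, which ends at the
    reference point (VAP onset, or a matched random time for controls)."""
    obw_end = reference_point - timedelta(hours=gap_hr)
    obw_start = obw_end - timedelta(hours=obw_hr)
    return obw_start, obw_end
```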
Feature engineering
The features used to compute a patient's status in an observation window include: 1) demographics, 2) vitals, 3) labs, 4) ventilator settings, and 5) derived features including VAE-related features (Table 1). We did not use admission diagnoses as features due to discrepancies between hospitals in how they chart these diagnoses. The VAE features are inspired by the CDC VAE guidelines [11]. These features capture the counts of stability and oxygen worsening events normalized by the temporal distance (in hours) of the observation window from the intubation time. The stability count (“stable_n”) captures the normalized number of times from intubation where the daily minimum of PEEP or FiO2 remains unchanged or decreases over two consecutive days. The oxygen worsening feature (“OW_n”) captures the normalized number of times from intubation where the daily minimum of PEEP or FiO2 increases over the span of two consecutive days by 3 cmH2O and 20% for PEEP and FiO2, respectively. Finally, when available for computation, we include 2 additional features: body mass index (BMI) and PaO2/FiO2 (pf-ratio).
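The two VAE-inspired counts can be sketched as follows (a hedged sketch: the text does not specify whether PEEP and FiO2 day pairs are counted jointly or separately, so counting them separately is an assumption, and all names are illustrative):

```python
def vae_counts(daily_min_peep, daily_min_fio2, hours_since_intubation):
    """Counts of stability and oxygen-worsening day pairs since
    intubation, normalized by the distance (hours) of the observation
    window from intubation. Thresholds follow the text: +3 cmH2O for
    PEEP and +20% for FiO2 over two consecutive days."""
    stable = worsening = 0
    for series, threshold in ((daily_min_peep, 3), (daily_min_fio2, 20)):
        for prev, curr in zip(series, series[1:]):
            if curr <= prev:                 # unchanged or decreased: stable
                stable += 1
            elif curr - prev >= threshold:   # increased past the threshold
                worsening += 1
    return stable / hours_since_intubation, worsening / hours_since_intubation
```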
Different features have different temporal profiles in EMRs; some are available routinely at fixed intervals, others are measured intermittently, and yet others are rare or completely missing for some patients. This is caused by differences in patient needs and ICU journeys as well as differences in healthcare processes in different hospitals. We define feature-specific observation windows to account for differences in measurement frequencies (column “ObW” in Table 1). Vitals and labs that fell outside their standard reference ranges were excluded. Standard reference ranges were defined by clinical experts and based on the published laboratory test reference ranges from the American Board of Internal Medicine. We used a set of descriptive statistics (average, minimum, maximum, last observed value) for each vital, lab, or ventilator setting computed over their corresponding ObW as features.
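Per feature, the summarization over its observation window might look like the sketch below (names and the numeric-hour timestamps are illustrative, and treating out-of-range values as simply dropped is our reading of the exclusion rule):

```python
def summarize_feature(timestamped_values, obw_start, obw_end, lo, hi):
    """Summarize one vital/lab/setting over its feature-specific ObW:
    keep measurements inside the window and the [lo, hi] reference
    range, then emit (mean, min, max, last observed value).
    Returns None when no valid measurement remains."""
    vals = [(t, v) for t, v in timestamped_values
            if obw_start <= t <= obw_end and lo <= v <= hi]
    if not vals:
        return None
    vs = [v for _, v in vals]
    last = max(vals)[1]  # value carried by the latest timestamp
    return sum(vs) / len(vs), min(vs), max(vs), last
```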
Table 1
Features used for VAP prediction modeling.
| Feature category | Feature name | ObW (hr) | MWS (hr) |
| --- | --- | --- | --- |
| Demographics | Age | - | - |
| | Gender | - | - |
| | Weight | - | - |
| | Height | - | - |
| Vitals | HR | 12 | 4 |
| | RR | 12 | 4 |
| | Systolic BP | 12 | 6 |
| | Diastolic BP | 12 | 6 |
| | Mean BP | 12 | 6 |
| | Temperature | 12 | 4 |
| | SpO2 | 12 | 4 |
| | EtCO2 | 12 | 24 |
| Ventilator setting | PEEP | 24 | 24 |
| | FiO2 | 24 | 24 |
| | Tidal volume | 24 | 48 |
| | TV/Kg IBW | 24 | 48 |
| VAE | stable_n | 24 | - |
| | OW_n | 24 | - |
| Labs | Hematocrit | 24 | 24 |
| | Hemoglobin | 24 | 24 |
| | Glucose | 24 | 24 |
| | Lactate | 24 | 48 |
| | Phosphate | 24 | 48 |
| | Magnesium | 24 | 48 |
| | PaO2 | 24 | 48 |
| | pH | 24 | 24 |
| | PaCO2 | 24 | 48 |
| | BaseExcess | 24 | 48 |
| | AnionGap | 24 | 48 |
| | Creatinine | 24 | 24 |
| | Bilirubin | 24 | 48 |
| | PTT | 24 | 48 |
| | Potassium | 24 | 24 |
| | Sodium | 24 | 24 |
| | ALP | 24 | 48 |
| | AST | 24 | 48 |
| | ALT | 24 | 48 |
| | Chloride | 24 | 24 |
| | Calcium | 24 | 24 |
| | TotalCO2 | 24 | 24 |
| | INR | 24 | 48 |
| | Albumin | 24 | 24 |
| | Platelets | 24 | 24 |
| | BUN | 24 | 24 |
| | WBC | 24 | 24 |
ObW: observation window, MWS: backward search window for missing features, BMI: body-mass index, HR: heart rate, RR: respiratory rate, BP: blood pressure, EtCO2: end-tidal carbon dioxide, PEEP: baseline corrected positive end-expiratory pressure where baseline is defined as the initial value of PEEP at the intubation time, FiO2: baseline corrected fraction of inspired oxygen where baseline is defined as the initial value of FiO2 at the intubation time, OW: oxygen worsening, PaO2: partial pressure of oxygen, pH: potential hydrogen, PaCO2: partial pressure of carbon dioxide, PTT: partial thromboplastin time, ALP: alkaline phosphatase, ALT: alanine transaminase, AST: aspartate aminotransferase, INR: international normalized ratio, BUN: blood urea nitrogen, WBC: white blood cell count, TV/Kg IBW: Mean tidal volume per kilogram ideal body weight.
When a feature is missing in an observation window, we search the temporal interval immediately preceding the observation window and use the most recent measurement for the feature if it exists in that interval. Due to the dynamically evolving status of mechanically ventilated patients, in the search for missing features, we assume that observations too far away from the prediction time are invalid. As such, we define an upper limit (in hours) for the backward search for missing features to ensure clinical and physiological relevance to the current patient status. Furthermore, we make the upper-limit feature-specific to account for differences in measurement frequencies, e.g., vitals are more frequently charted as compared to labs (column “MWS” in Table 1).
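The backward search can be sketched as follows (names are illustrative; timestamps are hours relative to intubation):

```python
def fill_missing(timestamped_values, obw_start, mws_hr):
    """When a feature is absent from its ObW, return the most recent
    measurement from the MWS-hour interval immediately preceding the
    window, or None if no such measurement exists."""
    candidates = [(t, v) for t, v in timestamped_values
                  if obw_start - mws_hr <= t < obw_start]
    if not candidates:
        return None
    return max(candidates)[1]  # value of the most recent measurement
```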
Risk prediction model
The risk prediction model predicts the risk of an impending VAP within the next 24hrs. We trained an ensemble of decision trees using the XGBoost gradient boosting algorithm [24]. In our experiments, the XGBoost model surpassed other classifiers (random forest, logistic regression, AdaBoost, KNN) in performance and risk-score interpretability as well as in the ability to handle correlated features. The XGBoost parameters were empirically selected through cross-validation experiments optimizing the ROC AUC on an internal validation set: 500 estimators with early stopping set at 100 rounds, a maximum depth of 3, and a learning rate of 0.05. In each boosting round, 90% of training samples and 50% of features were used to mitigate overfitting. We used SHAP (SHapley Additive exPlanations [25]) values to quantify the local and global impact of every variable on VAP risk.
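The reported hyperparameters map onto the common XGBoost scikit-learn wrapper roughly as follows (a config sketch: the argument names assume that API, and the binary objective is our assumption, not stated in the text):

```python
# Hyperparameters reported above, expressed against the XGBoost
# scikit-learn wrapper (argument names assume that API).
xgb_params = dict(
    n_estimators=500,             # boosting rounds
    early_stopping_rounds=100,    # early stopping
    max_depth=3,
    learning_rate=0.05,
    subsample=0.9,                # 90% of training samples per round
    colsample_bytree=0.5,         # 50% of features per round
    objective="binary:logistic",  # assumed binary risk objective
)
# model = xgboost.XGBClassifier(**xgb_params)
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```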
Model evaluation
The model was trained and tested using a hold-out cross-validation experiment in which the data was split at the patient level into 80% training patients and 20% holdout test patients. The experiment was repeated 10 times, where in each run, a model was trained using the training patients and then, the resulting model was tested on the hold-out test patients.
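The repeated patient-level hold-out protocol can be sketched as (function and parameter names are illustrative):

```python
import random

def holdout_splits(patient_ids, n_runs=10, test_frac=0.2, seed=0):
    """Yield (train, test) patient-ID splits: 80/20 at the patient
    level, repeated n_runs times with fresh shuffles, so no patient
    appears in both sides of any one run."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    n_test = int(len(ids) * test_frac)
    for _ in range(n_runs):
        rng.shuffle(ids)
        yield ids[n_test:], ids[:n_test]
```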
Missing data handling
In this work, we use XGBoost, which learns to assign missing cases to the tree branch that optimizes the loss function, which can be regarded as implicit imputation. However, missingness patterns in EMR data might not be random (e.g., laboratory tests are ordered more frequently for sick patients [26], or a measurement might be missing only at some hospitals), and simply imputing them might result in the model learning the missingness pattern as a salient feature rather than the actual pathophysiological feature underpinning the condition. To protect our model from selection biases caused by non-random missingness patterns, we implemented the following:
- A missingness threshold of 30%: a feature is eliminated if it is missing in more than 30% of the training patients.
- Balancing class-specific missingness: it is more likely for negative samples to have missing features [26]. In order to avoid situations where a model associates a missing feature with a particular class, we implement class-specific balancing anchored at the ‘temperature’ measurement to ensure that temperature is present/missing uniformly between the two classes. We select temperature as it is a frequently measured and important vital sign in infection. Ensuring equal class-specific missingness rate for temperature results in more similar rates of missingness in all other features.
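One way to realize the temperature-anchored balancing is sketched below; the exact mechanics (dropping surplus anchor-missing samples from the class with the higher missing rate) are our assumption, as the text does not spell them out, and all names are illustrative:

```python
import random

def balance_anchor_missingness(pos, neg, anchor="temperature", seed=0):
    """Subsample so the anchor feature is missing at (approximately)
    the same rate in both classes. `pos`/`neg` are lists of feature
    dicts; a feature absent from the dict (or None) counts as missing."""
    rng = random.Random(seed)

    def split(group):
        present = [s for s in group if s.get(anchor) is not None]
        missing = [s for s in group if s.get(anchor) is None]
        return present, missing

    groups = [split(pos), split(neg)]
    # Target = the lower of the two class-specific missing rates.
    target = min(len(m) / (len(p) + len(m)) for p, m in groups)
    balanced = []
    for p, m in groups:
        # Keep k missing samples so that k / (len(p) + k) ~= target.
        k = round(target * len(p) / (1 - target)) if target < 1 else len(m)
        balanced.append(p + rng.sample(m, min(k, len(m))))
    return balanced  # [balanced_pos, balanced_neg]
```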
Modeling safeguards
There are differences in care practices at different institutions that could significantly impact the performance of AI models built using EMR datasets, which, in turn, has direct implications for the applicability of the models in different types of hospitals. We implemented safeguards to manage and control, as feasible, the irrelevant sources of variation in the data:
- Balanced hospital representation: intrinsic differences between hospitals (size, teaching status, care practices) lead to unequal amounts of data available for model training from different hospitals. We clip the number of training samples from a hospital to 100 to avoid a situation where a large-size hospital dominates our training samples.
- MV-length matched VAP and control samples: it is critical to ensure that the control class is matched with the VAP class in terms of MV length (i.e., where in the patient journey a sample is extracted from to train the prediction model) to avoid a selection bias where the model learns the episode length as a predictive feature, associating longer episodes with a higher risk of VAP (as seen in [10]), which, albeit true and well known, is a trivial feature that can obscure other salient pathophysiological features from being learned by the model. In this work, we match the distribution quartiles of no-VAP samples to those of VAP samples.
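The per-hospital clipping in the first safeguard amounts to the following sketch (names are illustrative):

```python
import random

def clip_per_hospital(samples_by_hospital, max_per_hospital=100, seed=0):
    """Cap the number of training samples drawn from any one hospital
    so that large hospitals do not dominate the training set.
    `samples_by_hospital` maps hospital_id -> list of samples."""
    rng = random.Random(seed)
    clipped = []
    for hospital_id in sorted(samples_by_hospital):
        group = samples_by_hospital[hospital_id]
        if len(group) > max_per_hospital:
            group = rng.sample(group, max_per_hospital)
        clipped.extend(group)
    return clipped
```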
Hospital phenotyping model
We present a hospital clustering approach based on the frequencies of measured vitals and labs as well as reported ventilator settings to capture differences in healthcare practices for mechanically ventilated patients at different hospitals. Vitals are consistently measured across hospitals; however, their measurement frequency could reflect hospital-specific practices. Labs are important because 1) there are reports on associations between the presence of laboratory tests and patient outcomes [26], 2) a large amount of laboratory test data is available in EMRs, 3) lab results are time-tagged and commonly available in EMR data, whereas other healthcare process variables, such as doctor experience, specialty, and hospital policies, are more difficult to quantify or obtain from EMRs, and 4) the type and frequency of labs vary between hospitals, which, in part, could reflect the healthcare processes in use as well as the available facilities at different hospitals. Finally, since our target cohort is mechanically ventilated, we include reported ventilator settings. We are interested in macro-level differences, and as such we use aggregate measures across all mechanically ventilated patients at a hospital to capture them (Figure 3). Macro-level differences reflect inter-hospital variability in healthcare practices and protocols, whereas micro-level differences are at the patient level; e.g., laboratory tests with short repeat intervals could indicate anomalies or clinical concerns specific to the ongoing patient status.
Given EMR datasets from different hospitals and a target patient population (mechanically ventilated patients), we extract EMR records for all vitals, labs, and ventilator settings during patient stays. We exclude MV episodes shorter than 48hrs and retain only hospitals with at least 10 MV episodes. Next, we compute the time intervals between subsequent measurements of each variable for each patient. It is important to emphasize that we do not use the actual vital, lab, or ventilator setting values; rather, we use the temporal intervals between subsequent measurements. Next, we compute a hospital-level representation for each hospital: for each hospital and measurement pair (e.g., heart rate), we compute statistics descriptive of the distribution of patient-level intervals (25th, 50th, and 75th percentiles) for that measurement at that hospital. We then project the resulting hospital-level representations into a low-dimensional space using uniform manifold approximation and projection (UMAP) [27] to obtain a 2D representation of the hospitals. Using K-means clustering, we then search for clusters of similar hospitals in the resulting 2D UMAP space. We use silhouette coefficients and the elbow method on the sum of squared errors (the squared Euclidean distance of a hospital to its cluster centroid) to find an optimal number of clusters. Larger cluster numbers result in more homogeneous clusters, while fewer clusters have a larger amount of within-cluster variability. The latter enables training cluster-specific predictive models that are more generalizable than those trained on a smaller and more homogeneous cluster.
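The interval-based hospital representation described above can be sketched as follows (the nearest-rank percentile estimator and all names are assumptions; the subsequent UMAP projection and K-means step are omitted):

```python
def hospital_representation(measurement_times):
    """Build one hospital's representation: for each measurement type,
    pool between-measurement intervals across all MV patients at the
    hospital and summarize with the 25th/50th/75th percentiles.
    `measurement_times` maps measurement -> list of per-patient sorted
    timestamp lists (hours)."""
    def percentile(sorted_vals, q):
        # simple nearest-rank percentile (estimator not specified in text)
        idx = min(len(sorted_vals) - 1, int(q / 100 * len(sorted_vals)))
        return sorted_vals[idx]

    rep = {}
    for name, patients in measurement_times.items():
        intervals = sorted(b - a
                           for times in patients
                           for a, b in zip(times, times[1:]))
        if intervals:
            rep[name] = tuple(percentile(intervals, q) for q in (25, 50, 75))
    return rep
```

The per-hospital vectors produced this way would then be stacked across hospitals, embedded into 2D with UMAP, and clustered with K-means.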
The identified clusters were then used to build cluster-specific VAP prediction models and evaluate in-cluster and out-of-cluster prediction performances. The latter constitutes an external validation of our algorithm where a model trained on a particular hospital cluster is validated on other hospital clusters.