Source of data
QEHB is part of University Hospitals Birmingham NHS Foundation Trust, one of the largest teaching hospitals in England. The trust serves more than 2.2 million patients per year, a large proportion of whom are seen at QEHB [16]. Detailed information on all patients admitted to QEHB is recorded within its electronic patient management system, including clinical diagnoses, observations, assessments and laboratory results [15]. Unlike many other trusts in England, QEHB has also recorded drug prescriptions electronically for more than 10 years, making it an invaluable resource for research on antibiotic prescribing.
Development dataset
To develop the predictive models, we will use data from all eligible patients who attended the ED at QEHB between 1st November 2011 and 31st December 2017 (electronic recording of ED diagnosis at QEHB started after a system change at the end of October 2011).
Validation dataset
We will use data collected at QEHB between 1st January 2018 and 31st March 2019 to externally validate the model. Patients who were included in the development dataset due to an earlier attendance will be excluded from the validation dataset. We will additionally undertake external validation of our models in an independent dataset from University College London Hospitals NHS Foundation Trust.
Participants
Inclusion and exclusion criteria
All patients who attended the ED at QEHB within the study period and who had a urine sample submitted for microbiological testing within 24 hours of arrival are eligible for inclusion in the study. A window of 24 hours was chosen to account for discrepancies between when the sample was collected and when it was recorded in the laboratory system (particularly overnight). Patients enter the study at registration in the ED and exit the study on the earliest of the following dates: date of discharge, date of death, date of transfer to a different hospital, or date of urine culture results.
Individuals aged <18 years, pregnant women, patients who were not admitted via the ED, and patients whose urine sample was submitted for culture but was not cultured due to standard laboratory protocols at QEHB (see Outcome section for details) will be excluded from the analysis.
Outcome
The principal outcome of interest is microbiological growth (≥10⁴ colony-forming units/mL). Only urine samples that were eventually cultured will be included in the analysis. Microbiological cultures at QEHB are performed in accordance with standard laboratory procedures (UK Standards for Microbiology Investigations: SMI B41, investigation of urine; SMI B37, investigation of blood cultures (for organisms other than Mycobacterium species)) [17]. The decision whether to culture a urine sample depends on cell count results obtained in the laboratory: only urines with white blood cell counts or bacteria counts above a threshold value were cultured. At the start of the study, the threshold for proceeding to culture was a white cell count >40/µL or a bacteria count >4000/µL. This was adjusted to a white cell count >80/µL or a bacteria count >8000/µL following the introduction of a revised standard operating procedure in the microbiology laboratory in October 2015. Performing cell counts is not possible for urine samples of less than 4 mL or for samples too viscous to pass through the instrument; samples for which cell counts could not be performed are always cultured and included in the analysis. Following standard procedure at QEHB, (heavy) mixed growth in the urine sample will be considered contamination, except where E. coli is present. In addition, samples with <10⁴ colony-forming units/mL will be classified as positive if the same urinary pathogen is identified from a blood culture, implying urosepsis.
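The laboratory's culture-decision rule described above can be sketched as a small function. This is an illustrative reconstruction from the protocol text, not the laboratory's actual code; the function name and parameters are assumptions made for this sketch.

```python
from datetime import date

def proceed_to_culture(wbc_per_ul, bacteria_per_ul, sample_date,
                       volume_ml=None, too_viscous=False):
    """Illustrative reconstruction of QEHB's culture-decision rule.

    Samples on which cell counts cannot be performed (volume < 4 mL or
    too viscous to pass through the instrument) are always cultured.
    """
    if too_viscous or (volume_ml is not None and volume_ml < 4):
        return True
    # Thresholds were doubled by the revised SOP introduced in October 2015
    wbc_thr, bact_thr = (80, 8000) if sample_date >= date(2015, 10, 1) else (40, 4000)
    return wbc_per_ul > wbc_thr or bacteria_per_ul > bact_thr
```

Note how the same cell counts can lead to different culture decisions before and after October 2015, which motivates the secondary analysis restricted to data after 2015.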
Predictors
We will consider a wide range of candidate predictors relating to characteristics of the urine sample, to a patient's clinical presentation at the start of and throughout the hospital stay, and to risk factors encoded in a patient's medical history (Table 1). Candidate predictors were chosen based on clinical experience, the frequency with which variables are measured in the clinical context where the model is likely to be applied, and the existing literature [8].
Sample size
Each year, around 60,000 patients are seen in the ED at QEHB. In 2014, more than 4,500 patients were admitted to QEHB and prescribed an antibiotic. Preliminary analysis suggests that 20% of these prescriptions were for suspected UTI syndromes; we therefore expect ~5,400 admitted patients using data from late 2011 to the end of 2017 (6 years) [19]. Based on clinical experience, we expect a similar number of patients with suspected UTI syndromes to be discharged directly from the ED, resulting in an estimated total training sample of ~10,800 patients. Assuming a prevalence of bacteriuria of 30%, similar to that reported previously by Taylor et al. [8], this would imply >30 events per variable when including all variables defined in Table 1.
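The events-per-variable (EPV) arithmetic behind this estimate can be made explicit as a back-of-envelope check using the figures quoted above:

```python
n_patients = 10_800   # estimated training sample (admitted + discharged)
prevalence = 0.30     # assumed prevalence of bacteriuria (Taylor et al.)

n_events = n_patients * prevalence   # expected number of positive cultures
max_vars_at_30_epv = n_events / 30   # predictors supportable at 30 EPV
print(n_events, max_vars_at_30_epv)
```

With ~3,240 expected events, the 30-EPV rule of thumb supports roughly 100 candidate predictor parameters, consistent with the claim that all variables in Table 1 can be included.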
Statistical analysis methods
Feature engineering and selection
All continuous predictors will be winsorized at the 1st and 99th percentile to account for outliers and normalised to lie within the range (0, 1]. Categorical predictors will be encoded in a full-rank encoding, combining levels with a small number of cases (<5%). Predictors with zero variance will be excluded before analysis. For highly correlated predictors (correlation coefficient > 0.9 using Spearman's rank correlation), one predictor will be removed before analysis based on clinical judgement. Similarly, predictors that are largely missing, and thus cannot be expected to be available when the model is used in practice at QEHB, will be removed from the analysis before fitting the models.
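The winsorize-and-rescale step for a continuous predictor can be sketched as follows (an illustrative Python version; the analysis itself will be carried out in R):

```python
import numpy as np

def winsorize_and_scale(x):
    """Clip a continuous predictor at its 1st/99th percentiles, then
    rescale so values lie in (0, 1] (illustrative sketch).

    Zero-variance predictors are excluded before this step, so the
    1st and 99th percentiles are assumed to differ.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [1, 99])
    x = np.clip(x, lo, hi)
    # a tiny offset keeps the minimum strictly above 0, the maximum maps to 1
    eps = 1e-9
    return (x - lo + eps) / (hi - lo + eps)
```

In deployment, the percentiles estimated on the development data would be stored and re-applied to new patients rather than re-estimated.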
We will consider the use of fractional polynomials (FP) with up to four degrees of freedom (i.e. two FP terms) for each numerical predictor [20, 21]. We will select the optimal number of FP terms using the Akaike Information Criterion. Once the best-fitting FP transformations have been determined, we will consider models with all predictors as well as parsimonious models selected via backwards feature elimination based on Wald statistics and Rubin's rules [22]. Since the large number of possible predictors might limit the model's usability in clinical practice, we will follow Taylor et al. and also consider a minimal model based on age, sex, urinalysis results, and history of UTI [8].
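The AIC-based selection of a first-degree fractional polynomial can be sketched as below. This is a simplified illustration over the conventional FP1 candidate powers only; the protocol also considers two-term FP2 models and will use R's mfp machinery rather than this code.

```python
import numpy as np

FP_POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]  # conventional FP1 candidate set

def _fp(x, p):
    """First-degree FP transform; power 0 denotes log. Assumes x > 0."""
    return np.log(x) if p == 0 else x ** p

def _logit_aic(z, y, iters=25):
    """AIC of a univariable logistic model fit by Newton-Raphson."""
    z = (z - z.mean()) / z.std()            # standardise for numerical stability
    X = np.column_stack([np.ones_like(z), z])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return 2 * 2 - 2 * loglik               # AIC with 2 fitted parameters

def best_fp1(x, y):
    """Return the FP1 power with the lowest AIC (illustrative sketch)."""
    return min(FP_POWERS, key=lambda p: _logit_aic(_fp(x, p), y))
```

Because the number of parameters is the same for every FP1 candidate, minimising AIC here is equivalent to maximising the likelihood; AIC becomes decisive when comparing FP1 against the larger FP2 models.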
Type of model
Baseline model in the ED
We will first develop a multivariable logistic regression model to predict bacterial growth in the urine and/or blood sample at the end of ED attendance. A prediction will be made for each patient based on the fitted value, which will serve as a baseline comparison for all further models considered.
Landmarking models at distinct time points after hospital admission
Additional measurements taken during the first few days in hospital may further improve the predictive power of our risk prediction models. We will develop a set of landmarking logistic regression models [23] that predict the probability of bacterial growth in the ED urine sample at pre-defined times t = {0, 12, 24, 36, 48, 60} hours after the patient has left the ED and been admitted to a hospital ward. To do so, we require a value for each included predictor at time t. Since predictors are measured irregularly throughout the patient's hospital stay, we will first train a multivariate generalized linear mixed model (MGLMM) on all past predictor values up to time t to estimate the most likely value of each predictor at time t (see the missing data section below for details). Values at time t will be estimated using the best linear unbiased predictors from the empirical Bayes posterior distribution of the random effects, conditional on past predictor measurements [23]. The estimated predictor values will then be entered into a logistic regression model that predicts the probability of microbiological growth in the ED sample after the patient has been observed for t hours. As a result, patients might have more than one prediction: one for each time t at which they were still part of the at-risk population. Only patients still admitted and without a culture result at time t will be considered at risk and included in the fitting and evaluation of the logistic regression model for time t.
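The construction of the at-risk population for each landmark can be sketched as a simple filter. The field names below are assumptions for illustration, not the study's actual data schema; times are hours measured from leaving the ED.

```python
LANDMARKS = [0, 12, 24, 36, 48, 60]  # hours after leaving the ED

def at_risk(patients, t):
    """Patients still admitted and without a culture result at landmark t.

    Each such patient contributes one row to the logistic regression
    model fitted for time t (illustrative sketch; `patients` is assumed
    to be a list of dicts with the two hypothetical fields used below).
    """
    return [p for p in patients
            if p["hours_to_discharge"] > t and p["hours_to_culture_result"] > t]
```

A patient discharged (or receiving a culture result) between two landmarks therefore appears in the earlier model's at-risk set but not the later one's, which is what makes the landmark-specific models comparable to how they would be applied prospectively.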
Missing data
In EHR data, information is only recorded when events take place, and we cannot distinguish between cases in which a test or diagnosis was not made and cases in which it was made but not recorded. Consequently, if historical variables such as co-morbidities, procedures, admission records and test results are not recorded (e.g. because they were obtained at another hospital), we will have to assume that these events did not take place. For other variables with missing values that should have been obtained during the current visit (particularly vital signs and laboratory measurements), we will examine the pattern of missingness and impute values where appropriate, depending on the type of prediction model.
Our baseline model is a logistic regression, which requires a non-missing value for each included predictor. We will use multivariate imputation by chained equations (MICE) based on the assumption that data are missing at random, i.e. whether a variable is missing or not only depends on the values of observed variables [24]. Following standard MICE procedures [25], we will include all predictors as well as the prediction outcome in the imputation procedure and impute 5 datasets with 10 iterations per dataset (Table 2). Depending on computational feasibility, we will aim to impute up to 100 datasets for our final model to ensure that we obtain robust imputations. Model training will be performed on the imputed development dataset. However, we cannot use the same imputation procedure to evaluate our models since we expect predictors to also be missing during model deployment. When used in practice, our model must impute any missing data in real-time before making a prediction, but at this point no outcome will be available yet to use in the imputation. This will tend to result in suboptimal imputations when the model is used in practice [25]. To obtain an honest estimate of the performance of our models, we will evaluate them on a second set of imputations that were fit without using the outcome in the imputation procedure, emulating the situation in which the model will ultimately be used [26].
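The two imputation passes described above can be sketched with scikit-learn's `IterativeImputer` (a MICE-style imputer) as a stand-in; the study itself uses the mice package in R, so this is an illustration of the scheme, not the study's code:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_for_training(X, y):
    """Imputation that includes the outcome, as recommended when
    imputing the development data for model fitting."""
    Z = np.column_stack([X, y])
    Z_imp = IterativeImputer(max_iter=10, random_state=0).fit_transform(Z)
    return Z_imp[:, :-1]   # drop the outcome column again

def impute_for_evaluation(X):
    """Imputation without the outcome, emulating deployment, where no
    outcome is available at prediction time."""
    return IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```

Evaluating on the outcome-free imputations yields the honest performance estimate: it reproduces the information actually available when the model makes a real-time prediction.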
For our time-dependent models, the nature of the missing data differs slightly. Values for each predictor might have been recorded never, once, or multiple times before time t, and we are interested in estimating the most likely value at time t. To obtain a good approximation for each predictor, we will fit a separate MGLMM at each landmarking time [23]. Each model will include fixed intercepts and slopes for each predictor and a time-dependent covariate indicating concurrent antibiotic treatment. We will consider correlation structures of varying complexity, with uncorrelated and correlated patient-specific random intercepts and/or slopes for each predictor. If the MGLMM proves intractable, we will consider a simpler last observation carried forward (LOCF) approach to estimate predictor values at time t, or a mixture of LOCF and MGLMM.
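The LOCF fallback is straightforward to state precisely (an illustrative sketch; the `(time, value)` pair structure is an assumption):

```python
def locf(measurements, t):
    """Last observation carried forward: the most recent value recorded
    at or before landmark time t, or None if nothing was recorded yet.

    `measurements` is a list of (time, value) pairs assumed to be
    sorted by time.
    """
    past = [value for (time, value) in measurements if time <= t]
    return past[-1] if past else None
```

Unlike the MGLMM, LOCF ignores measurement trends and the antibiotic-treatment covariate, which is why it is only a fallback when the mixed model is intractable.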
Model validation
Clinical diagnosis of bacterial UTI requires the presence of urinary symptoms in addition to microbiological culture. Bacteriuria in the absence of urinary symptoms (called asymptomatic bacteriuria) should not be treated with antibiotics [2]. Prevalence of asymptomatic bacteriuria differs between patient groups and increases, for example, with age. Whereas a urine sample might be sent for culture in many different patients "just in case", a clinically usable model to confirm or rule out suspected bacterial UTI needs to perform especially well in patients with urinary symptoms. In our main analysis, we will therefore validate our models in the subgroup of patients with a suspected ED diagnosis of lower UTI or pyelonephritis, and our final model will be chosen based on the performance in this group. This group differs from the training population, which will include all patients irrespective of ED diagnosis to increase sample size and provide our model with enough power to learn general relationships. In a secondary analysis, we will also evaluate the performance of our models in patients without an ED diagnosis of UTI, as well as in different age groups, by sex and by outcome (i.e. discharge diagnosis, death, admission to intensive care unit, length of stay). We will further consider training our model using only data from patients with a suspected ED diagnosis of lower UTI or pyelonephritis, to ensure that a heterogeneous training population is not obscuring important relationships in patients with suspected UTI. Finally, we will perform secondary analyses limited to the first visit of each patient and to data after 2015, assessing the impact of repeated patient visits and the impact of increased culture thresholds on our models.
Internal validation
Model performance in each scenario will be assessed via multiple metrics: the area under the receiver operating characteristic curve (AUROC), the Brier score, the area under the precision-recall curve (AUPRC), specificity and negative predictive value (NPV). We will estimate each model's specificity and NPV at a pre-set sensitivity of 95%, which will evaluate the model's ability to be used as a screening tool to rule out bacterial UTI. We will assess how well predicted and observed probabilities correspond within each predicted decile (model calibration) by creating a calibration plot and estimating the calibration slope. An estimated slope > 1 indicates underfitting, whereas a slope < 1 indicates overfitting.
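The fixed-sensitivity evaluation can be sketched as follows: choose the largest classification threshold that still achieves the target sensitivity, then report specificity and NPV at that threshold (an illustrative sketch, not the study's evaluation code):

```python
import numpy as np

def spec_npv_at_sensitivity(y, scores, target_sens=0.95):
    """Specificity and NPV at the largest threshold keeping
    sensitivity >= target_sens (scores >= threshold count as positive)."""
    y, scores = np.asarray(y), np.asarray(scores)
    pos_scores = np.sort(scores[y == 1])
    # number of true positives we may allow to fall below the threshold
    k = int(np.floor((1 - target_sens) * len(pos_scores)))
    thr = pos_scores[k]
    pred = scores >= thr
    tn = np.sum(~pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    specificity = tn / np.sum(y == 0)
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return thr, specificity, npv
```

At 95% sensitivity a high NPV means that a negative model prediction can safely rule out bacteriuria, which is the intended screening use.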
Evaluating the model only on the development dataset or a single validation dataset leads to optimistic estimates of the true model performance (henceforth called the apparent performance) [27]. To obtain a more reliable estimate of model performance, we will draw at least 100 bootstrap samples of the development dataset. Where computation time allows, we will consider up to 1,000 bootstrap samples. All preprocessing and analysis steps, including missing data imputation, estimation of fractional polynomials, feature selection, and model evaluation, will be carried out independently within each bootstrapped sample to avoid any data leakage [28]. The result will be one final model per bootstrapped sample. Evaluating each model on the bootstrap sample in which it was developed provides another estimate of the apparent performance, this time within the bootstrap. To estimate the magnitude of optimism in this bootstrapped apparent performance, we will simultaneously evaluate the bootstrapped model in the original development dataset (called test performance). The difference between test performance and bootstrapped apparent performance will be an estimate of model optimism.
Averaging estimates of the optimism across all bootstrapped samples results in a stable estimate of the optimism [27]. The final, optimism-corrected (“true”) estimate of model performance will then be calculated as:
See formula 1 in the supplementary files.
All metrics used in the model evaluation (AUROC, AUPRC, specificity and NPV) will be adjusted for optimism.
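The bootstrap optimism-correction loop described above can be summarised schematically. Here `fit` stands for the entire modelling pipeline (imputation, FP estimation, feature selection, model fitting) and `score` for a chosen metric; both are placeholders for illustration, not the study's implementation:

```python
import numpy as np

def optimism_corrected(dev_data, fit, score, n_boot=100, seed=0):
    """Harrell-style bootstrap optimism correction: corrected performance
    = apparent performance on the development data
      - mean over bootstraps of (bootstrap apparent - test performance).

    `dev_data` is assumed to be a numpy array whose first axis indexes
    patients; `fit(data)` returns a model, `score(model, data)` a metric.
    """
    rng = np.random.default_rng(seed)
    n = len(dev_data)
    apparent = score(fit(dev_data), dev_data)
    optimisms = []
    for _ in range(n_boot):
        boot = dev_data[rng.integers(0, n, n)]  # sample n rows with replacement
        model = fit(boot)                       # refit the whole pipeline
        # bootstrap apparent minus test performance on the original data
        optimisms.append(score(model, boot) - score(model, dev_data))
    return apparent - np.mean(optimisms)
```

Because the entire pipeline is refitted inside every bootstrap iteration, the optimism estimate reflects variability from every modelling decision, not only the final coefficient estimates.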
External validation
The performance of the model (AUROC, AUPRC, specificity and NPV) in a new dataset will be evaluated using EHRs from patients with suspected UTI who were admitted to QEHB between 1st January 2018 and 31st March 2019. We will summarise average performance and calibration in this temporally independent sample. We will further validate the model in a geographically independent sample of patients from University College London Hospitals NHS Foundation Trust.
All analyses will be performed using the statistical software R [29], including but not necessarily limited to the packages tidyverse [30], tidymodels [31], mice [32], and mfp [33].