Data
In this study we use data extracted from a US electronic health record database, the Optum® de-identified Electronic Health Record Dataset (Optum EHR). This database contains medical records for 93,423,000 patients recorded between 2006 and 2018. The medical record data include clinical information such as prescriptions as prescribed and administered, laboratory results, vital signs, body measurements, diagnoses, procedures, and information derived from clinical notes using natural language processing (NLP).
The use of Optum EHR was reviewed by the New England Institutional Review Board (IRB) and was determined to be exempt from broad IRB approval.
Strategies for developing patient-level prediction models with data containing loss to follow-up
We investigate four simple candidate design choices for dealing with patients lost to follow-up, each with pros and cons (see Table 1).
Table 1
Candidate design choices for dealing with loss to follow-up

Design choice 1: Classification model; exclude all patients lost to follow-up [12,13]
- Pros:
  - The labels are correct, because every patient in the training data is observed for the complete time-at-risk follow-up
- Cons:
  - Reduces the size of the training data (the longer the time-at-risk, the smaller the dataset)
  - If the health outcome is often fatal, we may exclude all or most of the patients who have the outcome
  - May limit model generalizability to healthier patients

Design choice 2: Classification model; include all patients (including those lost to follow-up) [14]
- Pros:
  - Generalizability is not compromised
  - Larger sample size
- Cons:
  - Labels may be incorrect for patients lost to follow-up (this noise may impair the model's ability to learn)

Design choice 3: Classification model; exclude patients lost to follow-up unless they have the outcome prior to loss to follow-up [15]
- Pros:
  - The labels are correct
  - All outcomes are included
  - Outcomes are not lost when the outcome is associated with death
- Cons:
  - Generalizability may be compromised
  - Outcome patients may be sicker, because patients who die within the time-at-risk can be included as outcomes but not as non-outcomes

Design choice 4: Cox model including all patients (including those lost to follow-up) [16]
- Pros:
  - Designed to handle censored patients
- Cons:
  - Not intended for risk prediction; the main purpose is estimating the hazard rate per predictor, and prediction requires the baseline hazard function
  - Predicts survival time (time to event) rather than risk of event
  - Computationally more expensive
We used a least absolute shrinkage and selection operator (LASSO) logistic regression model as the classifier for design choices 1-3. For design choice 4 we used a LASSO Cox regression model [17].
Synthetic data
We created synthetic data in two steps:
Step 1: Create Synthetic data with no right censoring
We created a synthetic dataset based on the following real-world prediction problem: ‘within patients who are pharmaceutically treated for depression, who will experience nausea within 3 years of the initial depression diagnosis?’ We extracted real-world data on predictors, outcomes, and follow-up time from Optum EHR. The extracted data contain 86,360 randomly sampled patients in the target population (we sampled 100,000, but 13,640 patients had nausea prior to index and were excluded), of whom 52,325 (60.5%) lacked the complete 3-year time-at-risk follow-up. To create a dataset with complete follow-up, we trained a prediction model for nausea on this dataset and then applied it to the patients lost to follow-up to impute whether they had the outcome. For each patient lost to follow-up we drew a number from a uniform distribution X ~ U(0,1); if this value was less than or equal to the patient’s predicted risk of experiencing the outcome, the patient was labeled as an outcome patient, otherwise as a non-outcome patient. This resulted in 8,944 patients lost to follow-up being labeled as having the outcome and 43,381 labeled as not having it. For each patient with an imputed outcome, we also selected the outcome date by drawing uniformly at random between their start date and 3 years after it.
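The imputation step above can be sketched in a few lines of Python (a minimal illustration, not the study code; the function name `impute_outcome` and the use of Python's `random` module are our own):

```python
import random

def impute_outcome(predicted_risk, rng=random, follow_up_days=1095):
    """Impute an outcome label for a patient lost to follow-up.

    The patient is labeled as an outcome patient with probability equal
    to their model-predicted risk; if so, an outcome day is also drawn
    uniformly within the 3-year (1095-day) time-at-risk window.
    Returns (label, outcome_day), with outcome_day = None for
    non-outcome patients.
    """
    if rng.random() <= predicted_risk:
        return 1, rng.randint(0, follow_up_days)  # days after index
    return 0, None
```

Drawing against the predicted risk, rather than thresholding it, preserves the overall outcome rate implied by the model across the imputed patients.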
We chose to impute the outcome for patients lost to follow-up, rather than restrict to patients who were not lost to follow-up, because of potential bias. If the patients lost to follow-up were systematically different from the patients not lost to follow-up, then results from an analysis restricted to patients with complete follow-up might not generalize to the whole population.
Step 2: Simulating loss to follow-up
Starting with the synthetic dataset from step 1, in which every patient has complete follow-up, we partition the data into 75% training data and 25% test data. We then simulate loss to follow-up in the training data using either random selection or morbidity-based selection:
i) To simulate random loss to follow-up at a rate of thres% (thres in {10, 20, 30, 40, 50, 70, 90}), we draw a number from a uniform distribution per patient i, X1i ~ U(0,1), and censor patient i if X1i < thres/100 (e.g., if thres = 10, a patient is censored if their randomly drawn number is less than 0.1).
ii) To simulate morbidity-based loss to follow-up at a rate of thres%, we calculate each patient’s baseline Charlson comorbidity index score and find the score at which thres% of patients have an equal or higher score. We then consider all patients with that score or higher to be censored.
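Both selection mechanisms can be sketched as follows (an illustrative sketch with hypothetical function names; note that with the morbidity-based rule, ties in the Charlson score can push the censored fraction slightly above thres%):

```python
import random

def random_censoring(n_patients, thres, rng=random):
    """Selection (i): censor each patient independently if their
    uniform draw falls below thres/100."""
    return [rng.random() < thres / 100 for _ in range(n_patients)]

def morbidity_censoring(charlson_scores, thres):
    """Selection (ii): find the Charlson score at which thres% of
    patients score equal or higher, then censor everyone at or above
    that cut-off (ties may censor slightly more than thres%)."""
    n = len(charlson_scores)
    k = max(1, round(n * thres / 100))  # number of patients targeted
    cutoff = sorted(charlson_scores, reverse=True)[k - 1]
    return [score >= cutoff for score in charlson_scores]
```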
For each patient identified as lost to follow-up, we then simulate the date at which they were lost by picking a date uniformly at random within the 3-year follow-up (1095 days): we draw a number from a uniform distribution, X2j ~ U(0,1), per patient j and set their censoring date to start_datej + floor(1095*X2j), where start_datej is the date patient j entered the target cohort. If a patient’s outcome date falls after their loss to follow-up date, the outcome would not have been observed, so the patient is relabeled as a non-outcome patient. If the outcome date falls before the loss to follow-up date, the outcome would have been observed, so the patient remains labeled as an outcome patient.
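The date simulation and relabeling can be sketched as below (illustrative only; we treat an outcome on the censoring day itself as observed, a boundary case the text leaves unspecified):

```python
import math
import random

FOLLOW_UP_DAYS = 1095  # 3-year time-at-risk

def apply_censoring(outcome_day, rng=random):
    """Draw a censoring day uniformly in [0, FOLLOW_UP_DAYS) and relabel:
    an outcome falling after the censoring day would not have been
    observed, so the patient becomes a non-outcome patient.

    outcome_day: day of the (possibly imputed) outcome after index,
    or None if the patient has no outcome.
    Returns (label, censor_day).
    """
    censor_day = math.floor(FOLLOW_UP_DAYS * rng.random())
    observed = outcome_day is not None and outcome_day <= censor_day
    return (1 if observed else 0), censor_day
```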
We do not simulate loss to follow-up in the 25% test set, as this ‘silver standard’ is used to evaluate the impact of the four design choices for developing patient-level prediction models in data containing loss to follow-up. The creation of the synthetic data is illustrated in Figure 1.
Empirical Study Data
For each design choice we empirically investigate performance on 21 different prediction problems over two follow-up periods (1 year and 3 years after index) using real-world data. In a previous study we developed models to predict 21 different outcomes in a target population of pharmaceutically treated depressed patients [3]; for consistency, we use the same 21 prediction problems here.
The target population of pharmaceutically treated depressed patients is defined as:
- Index rule defining the target population index dates:
- First condition record of major depressive disorder
- Inclusion criteria:
- Antidepressant recorded within 30 days before to 30 days after the target population index date
- No history of psychosis
- No history of dementia
- No history of mania
- >=365 days prior observation in the database
- >=30 days post observation in the database
The 21 outcomes were: gastrointestinal hemorrhage, acute myocardial infarction, stroke, suicide and suicidal ideation, insomnia, diarrhea, nausea, hypothyroidism, constipation, seizure, delirium, alopecia, tinnitus, vertigo, hyponatremia, decreased libido, fracture, hypotension, acute liver injury, and ventricular arrhythmia and sudden cardiac death. All definitions and logic used to define these outcomes are supplied in Supplement A.
Real-world labeled data were extracted from Optum EHR for each prediction problem. The predictors were the presence of medical conditions and drugs recorded prior to index, plus demographics at index. We created a binary indicator variable for every condition and drug that at least one patient in the target population had recorded prior to index. For example, if any patient had a record of type 1 diabetes prior to index, we created the variable ‘type 1 diabetes any time prior’; every patient with a prior type 1 diabetes record has a value of 1 for this variable, and every patient without one has a value of 0. In total we extracted 204,186 variables. We created labels for each patient and time-at-risk (1 year and 3 years): the outcome label was 1 if the patient had the outcome recorded during the time-at-risk following index and 0 otherwise. We then partitioned the labeled data into a 75% training set and a 25% test set. Each of the four design choices was applied independently to each prediction problem, and models were developed using the training data.
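The construction of the ‘any time prior’ binary indicators can be sketched as follows (a toy illustration with a hypothetical function name; the real extraction spans the full concept dictionary and 204,186 variables):

```python
def build_indicator_matrix(patient_histories):
    """Build binary 'any time prior' indicator variables.

    patient_histories: dict mapping patient id -> set of condition/drug
    concept names recorded before index.
    Returns (variables, matrix): the ordered variable list and a dict
    mapping patient id -> 0/1 indicator vector.
    """
    # the variable space is every concept recorded for at least one patient
    variables = sorted({c for history in patient_histories.values() for c in history})
    matrix = {
        pid: [1 if v in history else 0 for v in variables]
        for pid, history in patient_histories.items()
    }
    return variables, matrix
```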
Performance evaluation
We evaluate model performance by calculating the area under the receiver operating characteristic curve (AUROC) on the test data, with and without the patients lost to follow-up. An AUROC of 0.5 is equivalent to random guessing, while an AUROC of 1 corresponds to perfect discrimination (every patient who develops the outcome is assigned a higher risk score than every patient who does not). The Cox regression AUROC was calculated using the exponential of the linear predictor (the sum of the effect parameters multiplied by the covariate values), without the baseline hazard function; since AUROC depends only on the ranking of the scores, the baseline hazard is not needed.
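A minimal sketch of this evaluation, assuming plain Python (function names are ours): AUROC equals the Mann-Whitney probability that a randomly chosen outcome patient outranks a randomly chosen non-outcome patient, and because the exponential is monotone, the Cox model's exp(linear predictor) yields the same ranking, and hence the same AUROC, as the linear predictor itself:

```python
import math

def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney) statistic: the probability
    that an outcome patient is scored higher than a non-outcome patient,
    counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cox_score(coefs, covariates):
    """Relative hazard exp(beta . x), used here in place of absolute risk;
    the baseline hazard is a shared factor that does not affect ranking."""
    return math.exp(sum(b * x for b, x in zip(coefs, covariates)))
```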