Data
In this study we use data extracted from a US electronic health record database, the Optum® de-identified Electronic Health Record Dataset (Optum EHR). This database contains medical records for 93,423,000 patients recorded between 2006 and 2018. The medical record data include clinical information such as prescriptions as prescribed and administered, laboratory results, vital signs, body measurements, diagnoses, procedures, and information derived from clinical notes using natural language processing (NLP).
The use of Optum EHR was reviewed by the New England Institutional Review Board (IRB) and was determined to be exempt from broad IRB approval.
Strategies for developing patient-level prediction models with data containing loss to follow-up
We investigate four simple candidate design choices for dealing with patients lost to follow-up, each with pros and cons (see Table 1).
Table 1
Candidate design choices for dealing with loss to follow-up

Design choice 1: Classification model; exclude all patients lost to follow-up [12,13]
- Pros:
  - The labels are correct, because every patient in the training data is observed for the complete time-at-risk follow-up
- Cons:
  - Reduces the size of the training data (the longer the time-at-risk, the smaller the dataset)
  - If the health outcome is often fatal, we may exclude all or most of the patients who have the outcome
  - May limit model generalizability to healthier patients

Design choice 2: Classification model; include all patients (including those lost to follow-up) [14]
- Pros:
  - Generalizability is not compromised
  - Larger sample size
- Cons:
  - Labels may be incorrect for patients lost to follow-up (this noise may impair the model's ability to learn)

Design choice 3: Classification model; exclude patients lost to follow-up unless they have the outcome prior to loss to follow-up [15]
- Pros:
  - The labels are correct
  - All outcomes are included
  - Outcomes are not lost when the outcome is associated with death
- Cons:
  - Generalizability may be compromised
  - Outcome patients may be sicker, because patients who die within the time-at-risk can be included as outcomes but not as non-outcomes

Design choice 4: Cox model including all patients (including those lost to follow-up) [16]
- Pros:
  - Designed to handle censored patients
- Cons:
  - Not intended for risk prediction; the main purpose is estimating the hazard rate per predictor, and prediction requires the baseline hazard function
  - Predicts survival time (time to event) rather than risk of event
  - Computationally more expensive
We used a least absolute shrinkage and selection operator (LASSO) logistic regression model as the classifier for design choices 1-3. For design choice 4 we used a LASSO Cox regression model [17].
Synthetic data
We created synthetic data in two steps:
Step 1: Create Synthetic data with no right censoring
We created a synthetic dataset based on the following real-world prediction problem: ‘within patients who are pharmaceutically treated for depression, who will experience nausea within 3 years of the initial depression diagnosis?’ We extracted real-world data on predictors, outcomes, and follow-up time from Optum EHR. The extracted data contain 86,360 randomly sampled patients in the target population (we sampled 100,000, but 13,640 patients had nausea prior to index and were excluded), of whom 52,325 (60.5%) lacked the complete 3-year time-at-risk follow-up. To create a dataset with complete follow-up, we trained a prediction model for nausea on this dataset and then applied it to the patients lost to follow-up to impute whether they had the outcome. For each patient lost to follow-up we drew a number from a uniform distribution X ~ U(0,1); if this value was less than or equal to the patient’s predicted risk of experiencing the outcome, the patient was labeled as an outcome patient, otherwise as a non-outcome patient. This resulted in 8,944 patients lost to follow-up being labeled as having the outcome and 43,381 labeled as not having it. For each patient with an imputed outcome, we also selected the outcome date by drawing uniformly at random between their start date and 3 years after it.
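The imputation step above can be sketched in a few lines of Python (a minimal illustration, not the study code; the function name `impute_outcome` and the use of Python's `random` module are our own):

```python
import random

def impute_outcome(predicted_risk, rng=random, follow_up_days=1095):
    """Impute an outcome label for a patient lost to follow-up.

    The patient is labeled as an outcome patient with probability equal
    to their model-predicted risk; if so, an outcome day is also drawn
    uniformly within the 3-year (1095-day) time-at-risk window.
    Returns (label, outcome_day), with outcome_day = None for
    non-outcome patients.
    """
    if rng.random() <= predicted_risk:
        return 1, rng.randint(0, follow_up_days)  # days after index
    return 0, None
```

Drawing against the predicted risk, rather than thresholding it, preserves the overall outcome rate implied by the model across the imputed patients.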
We chose to impute the outcome for patients lost to follow-up, rather than restrict to patients who were not lost to follow-up, because of potential bias. If the patients lost to follow-up were systematically different from the patients not lost to follow-up, then results from an analysis restricted to patients with complete follow-up might not generalize to the whole population.
Step 2: Simulating loss to follow-up
Starting with the synthetic dataset from step 1, in which every patient has complete follow-up, we partition the data into 75% training data and 25% test data. We then simulate loss to follow-up in the training data using either random selection or morbidity-based selection:
i) To simulate random loss to follow-up at a rate of thres% (thres in {10, 20, 30, 40, 50, 70, 90}), we draw a number from a uniform distribution per patient i, X1i ~ U(0,1), and censor patient i if X1i < thres/100 (e.g., if thres = 10, a patient is censored if their randomly drawn number is less than 0.1).
ii) To simulate morbidity-based loss to follow-up at a rate of thres%, we calculate each patient’s baseline Charlson comorbidity index score and find the score at which thres% of patients have an equal or higher score. We then consider all patients with that score or higher to be censored.
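Both selection mechanisms can be sketched as follows (an illustrative sketch with hypothetical function names; note that with the morbidity-based rule, ties in the Charlson score can push the censored fraction slightly above thres%):

```python
import random

def random_censoring(n_patients, thres, rng=random):
    """Selection (i): censor each patient independently if their
    uniform draw falls below thres/100."""
    return [rng.random() < thres / 100 for _ in range(n_patients)]

def morbidity_censoring(charlson_scores, thres):
    """Selection (ii): find the Charlson score at which thres% of
    patients score equal or higher, then censor everyone at or above
    that cut-off (ties may censor slightly more than thres%)."""
    n = len(charlson_scores)
    k = max(1, round(n * thres / 100))  # number of patients targeted
    cutoff = sorted(charlson_scores, reverse=True)[k - 1]
    return [score >= cutoff for score in charlson_scores]
```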
For each patient identified as lost to follow-up, we then simulate the date at which they were lost by picking a date uniformly at random within the 3-year follow-up (1095 days): we draw a number from a uniform distribution, X2j ~ U(0,1), per patient j and set their censoring date to start_datej + floor(1095*X2j), where start_datej is the date patient j entered the target cohort. If a patient’s outcome date falls after their loss to follow-up date, the outcome would not have been observed, so the patient is relabeled as a non-outcome patient. If the outcome date falls before the loss to follow-up date, the outcome would have been observed, so the patient remains labeled as an outcome patient.
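The date simulation and relabeling can be sketched as below (illustrative only; we treat an outcome on the censoring day itself as observed, a boundary case the text leaves unspecified):

```python
import math
import random

FOLLOW_UP_DAYS = 1095  # 3-year time-at-risk

def apply_censoring(outcome_day, rng=random):
    """Draw a censoring day uniformly in [0, FOLLOW_UP_DAYS) and relabel:
    an outcome falling after the censoring day would not have been
    observed, so the patient becomes a non-outcome patient.

    outcome_day: day of the (possibly imputed) outcome after index,
    or None if the patient has no outcome.
    Returns (label, censor_day).
    """
    censor_day = math.floor(FOLLOW_UP_DAYS * rng.random())
    observed = outcome_day is not None and outcome_day <= censor_day
    return (1 if observed else 0), censor_day
```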
We do not simulate loss to follow-up in the 25% test set, as this ‘silver standard’ is used to evaluate the impact of the four design choices for developing patient-level prediction models in data containing loss to follow-up. The creation of the synthetic data is illustrated in Figure 1.
Empirical Study Data
For each design choice we empirically investigate performance on 21 different prediction problems over two follow-up periods (1 year and 3 years after index) using real-world data. In a previous study we developed models to predict 21 different outcomes in a target population of pharmaceutically treated depressed patients [3]; for consistency, we use the same 21 prediction problems here.
The target population of pharmaceutically treated depressed patients is defined as:
- Index rule defining the target population index dates:
- First condition record of major depressive disorder
- Inclusion criteria:
- Antidepressant recorded within 30 days before to 30 days after the target population index date
- No history of psychosis
- No history of dementia
- No history of mania
- >=365 days prior observation in the database
- >=30 days post observation in the database
The 21 outcomes were: gastrointestinal hemorrhage, acute myocardial infarction, stroke, suicide and suicidal ideation, insomnia, diarrhea, nausea, hypothyroidism, constipation, seizure, delirium, alopecia, tinnitus, vertigo, hyponatremia, decreased libido, fracture, hypotension, acute liver injury, and ventricular arrhythmia and sudden cardiac death. All definitions and logic used to define these outcomes are supplied in Supplement A.
Real-world labeled data were extracted from Optum EHR for each prediction problem. The predictors were the presence of medical conditions and drugs recorded prior to index, plus demographics at index. We created a binary indicator variable for every condition and drug that at least one patient in the target population had recorded prior to index. For example, if any patient had a record of type 1 diabetes prior to index, we created the variable ‘type 1 diabetes any time prior’; every patient with a prior type 1 diabetes record has a value of 1 for this variable, and every patient without one has a value of 0. In total we extracted 204,186 variables. We created labels for each patient and time-at-risk (1 year and 3 years): the outcome label was 1 if the patient had the outcome recorded during the time-at-risk following index and 0 otherwise. We then partitioned the labeled data into a 75% training set and a 25% test set. Each of the four design choices was applied independently to each prediction problem, and models were developed using the training data.
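The construction of the ‘any time prior’ binary indicators can be sketched as follows (a toy illustration with a hypothetical function name; the real extraction spans the full concept dictionary and 204,186 variables):

```python
def build_indicator_matrix(patient_histories):
    """Build binary 'any time prior' indicator variables.

    patient_histories: dict mapping patient id -> set of condition/drug
    concept names recorded before index.
    Returns (variables, matrix): the ordered variable list and a dict
    mapping patient id -> 0/1 indicator vector.
    """
    # the variable space is every concept recorded for at least one patient
    variables = sorted({c for history in patient_histories.values() for c in history})
    matrix = {
        pid: [1 if v in history else 0 for v in variables]
        for pid, history in patient_histories.items()
    }
    return variables, matrix
```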
Performance evaluation
We evaluate model performance by calculating the area under the receiver operating characteristic curve (AUROC) on the test data, with and without the patients lost to follow-up. An AUROC of 0.5 is equivalent to random guessing, while an AUROC of 1 corresponds to perfect discrimination (every patient who develops the outcome is assigned a higher risk score than every patient who does not). The Cox regression AUROC was calculated using the exponential of the linear predictor (the sum of the effect parameters multiplied by the covariate values), without the baseline hazard function; since AUROC depends only on the ranking of the scores, the baseline hazard is not needed.
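A minimal sketch of this evaluation, assuming plain Python (function names are ours): AUROC equals the Mann-Whitney probability that a randomly chosen outcome patient outranks a randomly chosen non-outcome patient, and because the exponential is monotone, the Cox model's exp(linear predictor) yields the same ranking, and hence the same AUROC, as the linear predictor itself:

```python
import math

def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney) statistic: the probability
    that an outcome patient is scored higher than a non-outcome patient,
    counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cox_score(coefs, covariates):
    """Relative hazard exp(beta . x), used here in place of absolute risk;
    the baseline hazard is a shared factor that does not affect ranking."""
    return math.exp(sum(b * x for b, x in zip(coefs, covariates)))
```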