It is widely known in medicine that prevention is better than cure. Prognostic models that can determine a personalized risk of some future illness could be used to identify high-risk individuals who would benefit from interventions, and making decisions based on personalized risk could improve patient care [1]. Large observational healthcare databases, such as electronic health records or insurance claims data, are potential data sources for developing patient-level prediction models. A recent review of prognostic models for cardiovascular outcomes showed that the number of models being published is increasing over time, but the majority of published models have issues (e.g., methodological details missing from the publication, lack of external validation, and standard performance measures not reported) [2]. This problem is observed across outcomes, with many models failing to adhere to best practices for model development and reporting [2-4]. In addition, many published models have not been widely tested on diverse populations, so they may perform poorly when transported to different patient groups, such as low-income populations [4].

There may be even bigger problems with some prognostic models developed on observational data due to study design. Prognostic models are developed by first creating labelled data consisting of a set of predictors for each patient and a label indicating whether the patient experiences the outcome during the time-at-risk. Machine learning algorithms are then applied to the labelled data to learn associations between the predictors and the outcome label. The idea is that these associations will generalize to new data (often in a clinical setting). It is widely known that machine learning algorithms given junk data will return useless models. The study design (e.g., case-control or cohort) determines the quality of the data and therefore the quality of the model. When a model is only internally validated, it is evaluated using the same database and study design used to develop it. If there are issues in the study design, these are unlikely to be identified by internal validation, but the negative consequences may become apparent on external validation or when the model is used in a real clinical setting.
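As a concrete illustration of this labelled-data setup, the following is a minimal sketch (assuming Python with pandas and scikit-learn; the predictor names and the choice of logistic regression are illustrative and not taken from the cited studies):

```python
# A minimal sketch of labelled data for patient-level prediction: one row per patient,
# baseline predictors measured before the index date, and a binary label indicating
# whether the outcome occurred during the time-at-risk.
import pandas as pd
from sklearn.linear_model import LogisticRegression

labelled = pd.DataFrame({
    "age": [71, 64, 80, 58],             # hypothetical baseline predictor
    "prior_hypertension": [1, 0, 1, 0],  # hypothetical baseline predictor
    "outcome": [1, 0, 1, 0],             # outcome during the time-at-risk
})

X = labelled[["age", "prior_hypertension"]]
y = labelled["outcome"]

# Any supervised learner could be plugged in here; logistic regression is shown for brevity.
model = LogisticRegression().fit(X, y)
personalised_risk = model.predict_proba(X)[:, 1]  # predicted risk per patient
```

The quality of `labelled` is exactly what the study design controls: if the predictors or labels are extracted inappropriately, no learning algorithm can compensate.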
The two most widely implemented study designs for extracting labelled data from observational databases are the cohort design [5] and the case-control design [6,7]. Figure 1 illustrates the differences between the designs. In Figure 1, Part A shows a set of patients and their medical timelines from birth to death; healthcare databases often only capture a section of a patient’s medical observations. An index point in time is required when developing prediction models using observational data, where data prior to the index are used to construct predictors and data after the index are used for labelling. Part B illustrates that the index for the case-control design is the outcome date, and Part C shows that for the cohort design the index date is when a patient satisfies some specified criteria (e.g., experiences some medical event).
In the cohort design, a group of patients for whom you wish to predict some outcome risk, termed the ‘target population’, enter the cohort at a point in time where they satisfy some entrance criteria [5]. The patients are followed for some time-at-risk period to identify whether they develop the outcome. For example, to predict stroke in patients with atrial fibrillation, the target population would be ‘patients newly diagnosed with atrial fibrillation’ with the index being the date of initial diagnosis, the outcome would be ‘stroke’ and the time-at-risk would be 1 day to 5 years following the index. A patient-level prediction model is then learned by finding baseline differences between members of the target population who experienced the outcome and those who did not. Alternatively, a case-control design [6,7] picks the point in time when a set of patients experience some outcome (cases), then finds other patients (controls) who match the cases on certain criteria (such as age and gender) but have no record of the outcome, each paired with an index date. The design requires the user to specify some time period prior to the outcome event; data from this period are used to learn to discriminate between the outcome patients and the matched patients. For example, to predict stroke in patients with atrial fibrillation, the cases would be patients with stroke and a history of atrial fibrillation, and the controls would be patients with no stroke during a specified time period who have a history of atrial fibrillation and match the cases on certain criteria. The index is the stroke date for the cases and a randomly chosen date for the controls.
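To illustrate how the two designs assign index dates and labels, here is a hedged sketch using a small, hypothetical long-format events table; the event names, dates, and pandas-based extraction are assumptions for illustration and are not taken from the replicated studies.

```python
# A sketch of index-date and label assignment under the two designs,
# using a hypothetical long-format events table (person_id, event, date).
import pandas as pd

events = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3],
    "event": ["atrial_fibrillation", "stroke",
              "atrial_fibrillation",
              "atrial_fibrillation", "stroke"],
    "date": pd.to_datetime(["2011-03-01", "2013-06-10",
                            "2012-01-15",
                            "2010-05-20", "2016-09-01"]),
})

# Cohort design: index = first atrial fibrillation record;
# label = stroke occurring 1 day to 5 years after the index.
af = (events[events["event"] == "atrial_fibrillation"]
      .groupby("person_id")["date"].min()
      .rename("index_date").reset_index())
strokes = (events[events["event"] == "stroke"]
           .rename(columns={"date": "stroke_date"})[["person_id", "stroke_date"]])
cohort = af.merge(strokes, on="person_id", how="left")
in_risk_window = ((cohort["stroke_date"] > cohort["index_date"]) &
                  (cohort["stroke_date"] <= cohort["index_date"] + pd.DateOffset(years=5)))
cohort["outcome"] = in_risk_window.astype(int)  # predictors would be built from data before index_date

# Case-control design: index = stroke date for the cases; controls are patients with no
# stroke record, matched on criteria such as age and sex and assigned an index date.
cases = strokes.rename(columns={"stroke_date": "index_date"}).assign(outcome=1)
```

In the cohort branch there is a single, well-defined index per patient at which the model could later be applied; in the case-control branch the index only exists naturally for patients who already experienced the outcome, which is one of the issues discussed below.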
Case-control designs are known to have numerous issues. It is widely known that they are problematic when the goal is to assess absolute risk [8]. A recent study argued that case-control designs have a temporal bias that impacts their ability to predict the future [9], and it is widely accepted that the design leads to miscalibrated predictions. Researchers have argued that external validation of case-control prognostic models using a cohort design is essential [10]. When researchers have access to electronic health records or other longitudinal healthcare datasets, they can choose which design to use. Unfortunately, prediction models developed using the case-control design are still being published even when the researchers could have used a cohort design [11-13]. If the case-control design results in researchers extracting inappropriate labelled data, then models developed using this design may be clinically invalid even though they appear to perform well during model development (i.e., on the test set).
In this paper we empirically investigate various theoretical issues that can occur when using the case-control design to develop prediction models using observational databases. We provide examples to show that the case-control design can be avoided when a researcher has access to observational data, since any prediction problem can be properly translated into a cohort design. We replicate two published patient-level prediction studies that employed a case-control design and show that a cohort design could have been used to achieve equivalent or better discrimination and better calibration.
Issues with case-control design using observational data
Table 1 highlights that the case-control design may be problematic due to potential selection bias and the lack of a well-defined point in time at which to apply the model. These issues can be seen in Figure 1. There are no well-defined criteria indicating when the case-control model should be applied clinically, whereas the cohort design model is applicable when the target population index criteria are satisfied. Under the case-control design, the controls could be much healthier patients than the cases. In addition, the case-control design often has an incorrect matching ratio and controls are under-sampled. This is likely to impact performance metrics such as the area under the precision-recall curve and calibration, and may lead to optimistic internal validation performance. When using prognostic models for decision making, it is important that a model’s predicted risks are correct (i.e., if the model tells ten people they have a 10% risk, then, on average, one of them should experience the outcome). If a model overestimates risk, then interventions may be given to people unnecessarily. If a model underestimates risk, then a patient who could benefit from an intervention may be missed. Over- or under-sampling outcomes often leads to models that are miscalibrated for the clinical setting in which they will be implemented; this is a key issue with the case-control design and is illustrated in the sketch following Table 1.
Table 1 The potential issues with the case-control design

| Issue | Description | Cohort | Case-control |
| --- | --- | --- | --- |
| Subjective data extraction methodology choices | The design requires subjective methodology choices that may differ between researchers | Not if problem is well defined with specified target population, outcome and time-at-risk | Yes – matching choice can differ (e.g., matching criteria, matching ratio, whether to remove unmatched cases) |
| Selection bias | Data used to train model may not be representative of target population | No | Potentially due to poor matching design |
| Covariate issue / protopathic bias [12] | Includes problematic covariates that are actually precursors of the outcome (e.g., symptoms/tests of outcome) | Potentially if the target cohort start date is poorly designed. Easily solved by improving target cohort criteria or adding a gap between index and time-at-risk (e.g., predict outcome 60 days to 365 days after index) | Potentially an issue if using data around outcome record (e.g., 1 day before). Can be difficult to solve |
| Performance metric bias | Optimistic performance reported due to under-sampling non-outcomes | No | Potentially if matching ratio not representative of true outcome ratio (e.g., precision will be higher in data with a more balanced case-control ratio compared to actual data) |
| Miscalibration issue | The predicted risk does not match the true risk | Yes (moderate chance) – if the outcome proportion changes over time or the machine learning model does not calibrate well | Yes (high chance) – if the outcome proportion is not representative due to over/under-sampling or the machine learning model does not calibrate well |
| Ill-defined time to apply model | No clear point in time for clinical implementation of model (where the performance has been assessed) | No – index well defined by target cohort criteria | Yes – no clear index |
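The miscalibration issue can be demonstrated with a short simulation. The sketch below uses synthetic data (not the studies replicated in this paper): the true outcome proportion is a few percent, but the model is trained on 1:1 matched data and then applied to the full population.

```python
# A hedged simulation of how 1:1 case-control matching distorts calibration:
# the model learns from data in which ~50% of patients have the outcome,
# while the true outcome proportion in the population is only a few percent.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                            # a single synthetic predictor
true_risk = 1 / (1 + np.exp(-(-4.0 + 0.8 * x)))   # true outcome probability (~2-3% on average)
y = rng.binomial(1, true_risk)

# 1:1 matching: keep all cases and an equal number of randomly sampled controls.
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
matched = np.concatenate([cases, controls])

model = LogisticRegression().fit(x[matched].reshape(-1, 1), y[matched])
predicted = model.predict_proba(x.reshape(-1, 1))[:, 1]

print(f"True outcome proportion:             {y.mean():.3f}")
print(f"Mean predicted risk (matched model): {predicted.mean():.3f}")
# The matched-data model assigns an average risk far above the true proportion,
# i.e., it is miscalibrated for the population it would be applied to clinically.
```

Discrimination can look similar in both settings, which is why internal validation of a case-control model can appear reassuring while the predicted risks themselves are unusable for decision making without recalibration.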
Defining any prediction problem as a cohort design
We assert that any prediction problem, including those previously evaluated using case-control designs, can be appropriately implemented within a cohort design. In general, a cohort design will consist of a target population (the patients you want to predict the outcome for) and an index event corresponding to the point in time at which you want to predict the outcome. We present the different types of prediction problems and provide example inclusion criteria and index dates for defining each problem as a cohort design (see Table 2); a minimal sketch of such a problem specification follows the table.
Table 2 Different types of prediction problems and examples of how they fit the cohort design

| Prediction type | Target population | Outcome | Example target cohort inclusion criteria | Example index |
| --- | --- | --- | --- | --- |
| Disease onset | General population | Disease (e.g., depression) | A visit (outpatient or inpatient) during 2010, >365 days observation in database, age >= 18, no prior illness | First valid visit in 2010 |
| Disease progression | Early stage disease patients | Advanced stage disease | Diagnosed with disease, >365 days observation in database | Initial disease record date |
| Treatment choice | Patients dispensed treatment 1 or 2 | Treatment 1 | Dispensed treatment 1 or 2, >365 days observation in database | First recorded date of treatment 1 or 2 |
| Treatment response | Patients dispensed a treatment | Desired effect (e.g., disease cured) | Dispensed treatment at adequate therapeutic level, >365 days observation in database | First recorded date of treatment |
| Treatment safety | Patients dispensed a treatment | An adverse event | Dispensed treatment, >365 days observation in database | First recorded date of treatment |
| Treatment adherence | Patients dispensed a treatment | >X% days covered during some follow-up | Dispensed treatment, >365 days observation in database | First recorded date of treatment |
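To make the mapping in Table 2 concrete, the following is a minimal sketch of how one of the rows (disease onset) could be written down as a cohort design specification; the field names, the dataclass, and the illustrative time-at-risk of 1 to 365 days are assumptions rather than part of the table.

```python
# A hedged sketch of a cohort-design problem specification; field names are
# illustrative and not tied to any particular prediction software package.
from dataclasses import dataclass

@dataclass
class CohortPredictionProblem:
    target_criteria: str                 # who enters the target cohort
    index_rule: str                      # the well-defined point in time at which the model is applied
    outcome: str                         # what is being predicted
    time_at_risk_days: tuple[int, int]   # (start, end) in days relative to the index date

# Disease onset row from Table 2, with depression as the illustrative disease.
depression_onset = CohortPredictionProblem(
    target_criteria=("a visit (outpatient or inpatient) during 2010, >365 days prior "
                     "observation in the database, age >= 18, no prior depression"),
    index_rule="first valid visit in 2010",
    outcome="depression",
    time_at_risk_days=(1, 365),          # assumed for illustration; not specified in Table 2
)
```

Every element of the specification can be checked before any data are extracted, and the index rule gives the single point in time at which the model would later be applied clinically.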