This study found that, at sample sizes typically used for developing risk models (e.g. in the CVD domain, the pooled cohort equations(9) and ASSIGN(19) were based on approximately 10 000 individuals or fewer), there is substantial instability in risk estimates attributable to sampling error. Furthermore, when the analysis was restricted to models with high discrimination or good calibration, high levels of instability remained.
This variability in individual risk is especially relevant when a model is used to make clinical decisions based on whether a risk score lies above or below a fixed threshold (a common use of risk prediction models). From an individual’s and a clinician’s perspective, it is unsatisfactory that a different treatment decision may be made depending on the model used. However, this is also an issue at the population level. Consider statin therapy in the UK. Initiating statins in patients who have a 10-year risk of CVD > 10% has been shown to be cost-effective.(24) This intervention becomes more cost-effective the better the performance (calibration and discrimination) of the model used to calculate the risk scores. Sample size is strongly correlated with model performance: a small sample size will likely lead to a poorly performing model and fewer events prevented. However, it is difficult to assess when increasing sample size will improve model performance, given that performance is affected by many other factors (prevalence of the outcome, inclusion of important predictors, and strength of association between predictors and outcome). Sample size affects model performance through the precision of the coefficients, and imprecise estimates will cause the risks of fixed subgroups in the population to be miscalculated (the central theme of this paper). Therefore, if the coefficients are precise and the risk estimates are stable, one is unlikely to be able to improve model performance by increasing the sample size, unless doing so allows more predictors to be incorporated. The stability of risk scores (and ultimately the precision of the coefficients) could therefore be used as a proxy for whether increasing the sample size will improve model performance. When N = 10 000 we see levels of instability indicating that the performance of the model could be improved by increasing the sample size, resulting in fewer CVD events.
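To illustrate the mechanism, the following minimal sketch repeatedly draws development samples of increasing size from a simulated population and reports the spread of the estimated risk for a fixed covariate profile. All coefficients, sample sizes and the profile itself are illustrative assumptions, not values from this study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical "population" with known coefficients; values are
# illustrative assumptions only, not the paper's data.
N_POP = 500_000
X_pop = rng.normal(size=(N_POP, 2))
true_lp = -2.5 + X_pop @ np.array([0.5, 0.3])
y_pop = rng.binomial(1, 1 / (1 + np.exp(-true_lp)))

# A fixed covariate profile (a "subgroup") whose risk is re-estimated
# from repeated development samples of varying size.
profile = np.array([1.0, 1.0, 1.0])   # intercept plus two predictor values

for n in (1_000, 10_000, 100_000):
    risks = []
    for _ in range(200):
        # Sample a development cohort without replacement, fit a
        # logistic model, and predict the profile's risk.
        idx = rng.choice(N_POP, size=n, replace=False)
        res = sm.Logit(y_pop[idx], sm.add_constant(X_pop[idx])).fit(disp=0)
        risks.append(1 / (1 + np.exp(-(profile @ res.params))))
    lo, hi = np.percentile(risks, [2.5, 97.5])
    print(f"n = {n:>7}: 2.5-97.5 percentile range of risk = [{lo:.3f}, {hi:.3f}]")
```

As the sample size grows, the coefficient estimates become more precise and the percentile range of the subgroup's estimated risk narrows, which is the sense in which stability can act as a proxy for whether more data would help.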
At the sample size suggested by Riley et al.,(15) the instability in risk is even greater and these issues are heightened. However, no CVD risk prediction models used in practice have such small sample sizes, so the implications are more general. There are often ample data with which to produce CVD risk prediction models; however, this may not be the case for other disease areas, where the outcomes are not well recorded in routinely collected datasets. In this scenario one may have to actively recruit patients into a cohort, and the work by Riley et al.(15) could be used to derive a sample size. We propose that if risk scores from a model are going to be used to drive clinical decision making above or below a fixed threshold, Sect. 6 of Riley et al.,(15) “Potential additional criterion: precise estimates of predictor effects”, should be properly considered. It is imprecise estimates of the predictor effects that lead to instability of risk scores. If this criterion is not met, as is the case for N = Nmin in this paper, risk scores have high levels of instability and models have poorer performance. The number of patients required to ensure stable risk scores will depend on, among other things, the prevalence of the outcome, the number of predictors, and the strength of the association between outcome and predictors, and will therefore vary for each model.
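For readers deriving such a sample size, the sketch below gives two of the Riley et al.(15) criteria for binary outcomes as we recall them: criterion (i), targeting an expected shrinkage factor of at least S, and criterion (iii), a precise estimate of the overall outcome risk. The input values (number of predictor parameters, anticipated Cox-Snell R-squared, prevalence) are illustrative assumptions; criterion (ii) (small optimism in apparent fit) is omitted for brevity, and the pmsampsize package (R/Stata) implements the full set of criteria.

```python
import math

# Illustrative inputs; these are assumptions, not values from the paper.
p = 20          # number of candidate predictor parameters
r2_cs = 0.15    # anticipated Cox-Snell R-squared of the new model
phi = 0.10      # anticipated outcome proportion (prevalence)
S = 0.9         # target expected shrinkage factor (Riley et al. default)
delta = 0.05    # target margin of error for the overall risk estimate

# Criterion (i): expected shrinkage of predictor effects >= S.
n_shrinkage = p / ((S - 1) * math.log(1 - r2_cs / S))

# Criterion (iii): estimate the overall outcome risk to within +/- delta.
n_overall = (1.96 / delta) ** 2 * phi * (1 - phi)

print(f"criterion (i):   n >= {math.ceil(n_shrinkage)}")
print(f"criterion (iii): n >= {math.ceil(n_overall)}")
print(f"minimum n (these criteria only): {math.ceil(max(n_shrinkage, n_overall))}")
```

Note that these criteria alone do not guarantee precise individual predictor effects, which is why the additional criterion in Sect. 6 of Riley et al.(15), and the stability check described next, remain important.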
In practice, to ascertain whether a given development cohort has a sufficient sample size, the process undertaken in this manuscript could be replicated using bootstrap resampling. Instead of sampling the population without replacement (not possible in practice), sampling the development cohort with replacement (i.e. bootstrapping) can replicate the process and yield a similar range of risks for each patient. The stability of the risk scores could then be assessed, and a decision made on whether more patients should be recruited. One proposal for using this information to determine a sufficient sample size would be to require that the bootstrapped 2.5–97.5 percentile range for every patient is smaller than x% of their estimated risk. Another would be to require that, for patients whose estimated risks are a certain distance away from a treatment threshold, there is less than an x% chance of deriving a risk on the other side of that threshold on resampling.
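A minimal sketch of this bootstrap check follows, assuming a binary outcome modelled with logistic regression (the same logic applies to a Cox model). The simulated cohort, the choice x = 20, the 10% treatment threshold and the 2% distance are all illustrative assumptions, not recommendations.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical development cohort (in practice, use the real data).
n, B = 5_000, 500
X = rng.normal(size=(n, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + X @ np.array([0.6, 0.4])))))
Xc = sm.add_constant(X)

# Point-estimate risks from the model fitted to the full cohort.
risk_hat = sm.Logit(y, Xc).fit(disp=0).predict(Xc)

# Bootstrap: refit on resampled cohorts, predict for the ORIGINAL patients.
boot_risks = np.empty((B, n))
for b in range(B):
    idx = rng.integers(0, n, size=n)           # sample with replacement
    boot_risks[b] = sm.Logit(y[idx], Xc[idx]).fit(disp=0).predict(Xc)

lo, hi = np.percentile(boot_risks, [2.5, 97.5], axis=0)

# Proposal 1: percentile range smaller than x% of each estimated risk.
x_pct = 20.0                                   # illustrative choice of x
ok_range = (hi - lo) < (x_pct / 100) * risk_hat
print(f"patients meeting the {x_pct:.0f}% range criterion: {ok_range.mean():.1%}")

# Proposal 2: for patients a given distance from a treatment threshold,
# the chance of a bootstrap risk crossing the threshold should be < x%.
threshold, dist = 0.10, 0.02                   # illustrative values
near = np.abs(risk_hat - threshold) >= dist
cross = ((boot_risks >= threshold) != (risk_hat >= threshold)).mean(axis=0)
print(f"patients >= {dist:.0%} from threshold with crossing prob < {x_pct:.0f}%: "
      f"{(cross[near] < x_pct / 100).mean():.1%}")
```

If either criterion fails for an unacceptable proportion of patients, this would signal that recruiting additional patients (or reducing the number of predictors) is warranted before the model is used for threshold-based decisions.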
There are some limitations that warrant discussion. The first is that the calibration-in-the-large of the population-derived model was poor. We do not believe this is a problem, as a similar miscalibration-in-the-large is found in QRISK3,(7) despite that model being well calibrated within risk deciles. It is likely caused by incompatible assumptions in how the observed risks (the Kaplan-Meier estimator assumes unconditional independent censoring) and the predicted risks (the Cox model assumes independent censoring only after conditioning on the covariates) are estimated. When looking within risk deciles, the difference in assumptions is not as large, and good calibration was found. Centring these measurements thus allowed us to evaluate whether the instability in risk was being driven by over- and under-predicting models. A second limitation is that one may argue that variation in predicted risk was observed because the proper process for deriving risk prediction models was not followed (i.e. re-selecting variables and non-linear terms within each sample). We did not do this because it would have resulted in different variables and non-linear terms being selected across the models, and we believe this would have increased the variation in risks across the models rather than reduced it. Finally, this study concerned the outcome CVD and used a specific set of variables for prediction. However, the results are likely to be generalizable to other disease areas, as the study evaluated the effects of random variability in sampling.