3.1 Model reporting
The inclusion criteria of our literature search were met by 35 studies, which described a total of 59 prediction models (5, 7–16, 27–50). Table 3 summarizes the reporting of our replicability criteria in the included articles.
Table 3
Reported model criteria for replicability by included studies.
Category | Validation criteria | No. of models (%)
Population settings | Target population definition | 59 (100)
Population settings | Index date | 23 (39)
Population settings | Time-at-risk | 39 (66)
Population settings | Outcome definition | 59 (100)
Statistical analysis settings | Prediction method | 59 (100)
Statistical analysis settings | Predictor definitions | 46 (78)
Statistical analysis settings | Predictor time window | 21 (36)
Statistical analysis settings | Model specifications: Full model | 8 (14)
Statistical analysis settings | Model specifications: Partial model | 19 (32)
Criteria reported by all included studies were the target population definition, the outcome definition, and the prediction method. Various prediction methods were used, including Cox proportional hazards regression, single tests, logistic regression, linear discriminant analysis, competing risk regression, the disease state index, random forests, and support vector machines. The most frequently used prediction methods were Cox proportional hazards regression (13 studies, 21 models) and logistic regression (8 studies, 14 models).
Frequently reported criteria were the time-at-risk (66% of the models) and the predictor definitions (78% of the models). The most commonly reported time-at-risk was between three and five years. Studies that did not explicitly state the time-at-risk, or that predicted over a patient’s full follow-up time, were considered not to report this criterion.
Rarely reported criteria included the index date, the predictor time window, and the full model specifications. Non-demographic predictors were most commonly measured in a time window between one and five years before the index date.
Of the included studies, three reported all nine replicability criteria for a total of seven proposed models (15, 27, 29). The median number of reported criteria across all included models was five.
3.2 Externally validated models
We selected one of the seven fully reported models, Walters’ Dementia Risk Score for persons aged 60–79, for replication and dismissed the other six for various reasons: Walters et al. did not endorse their second model, aimed at persons aged 80–95, for clinical use due to its low discrimination performance (15); four models used detailed education variables (0 to 5 years of primary school, Vocational school certificate, French junior-school diploma, French high school diploma, Graduate studies) that were unavailable in the validation databases (29); and one model was developed on data from a prospective cohort study, Adult Changes in Thought (ACT), which is currently not available in the OMOP CDM format (27). Of the partially reported models, there were two for which the development data and predictors were available in the OMOP CDM, and for which the missing criteria, such as the baseline hazard and the time-at-risk, could be left out or approximated with reservations. We therefore selected the following three models for external validation: (1) Walters’ Dementia Risk Score, which predicts the 5-year risk of a first recorded dementia diagnosis among patients aged 60–79 using a Cox proportional hazards model and was developed on THIN/IMRD (15); (2) Mehta’s RxDx-Dementia Risk Index, which predicts the risk of incident dementia among patients diagnosed with type 2 diabetes mellitus and hypertension using a Cox proportional hazards model developed on CPRD (16); and (3) Nori’s ADRD prediction model, which predicts Alzheimer’s disease and related dementias (ADRD) among patients aged 45 and older using a logistic regression model developed on OptumLabs (13).
Of the externally validated models, the Mehta model did not report the baseline hazard or the time-at-risk, and the Nori model did not report the time-at-risk. Because the authors could not provide the missing information, we used a 5-year time-at-risk, as in the Walters model and many of the other reviewed models.
3.2.1 Walters model
The Walters model reports all replicability criteria defined in Table 1. It was developed on the THIN database and involved several notable modeling decisions. For example, during development, imputation was used for several numeric variables. Of the six imputed variables (smoking, height, total cholesterol, HDL cholesterol, systolic blood pressure, and weight), only smoking remained in the final model. In the replicated model, smoking status is imputed by assuming that patients with neither a “smoker” nor an “ex-smoker” code are “non-smokers”. This highlights a general shortcoming of observational data: the absence of a code does not guarantee the absence of a condition, drug, or, in this case, smoking; the code may simply not have been recorded even though the patient is a smoker or ex-smoker.
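As a minimal sketch of this imputation rule (not the study’s actual code; the in-memory tables, column names, and helper function are illustrative assumptions), the following Python snippet labels every patient without a “smoker” or “ex-smoker” record as a “non-smoker”:

import pandas as pd

# Illustrative records only; column names and categories are assumptions.
all_persons = pd.DataFrame({"person_id": [1, 2, 3, 4]})
smoking_records = pd.DataFrame({
    "person_id": [1, 1, 2, 4],
    "smoking_code": ["smoker", "ex-smoker", "smoker", "ex-smoker"],
})

def impute_smoking_status(persons, records):
    # Keep the most severe recorded status per person ("smoker" outranks "ex-smoker").
    severity = {"smoker": 2, "ex-smoker": 1}
    ranked = records.assign(severity=records["smoking_code"].map(severity))
    best = ranked.sort_values("severity", ascending=False).drop_duplicates("person_id")
    merged = persons.merge(best[["person_id", "smoking_code"]], on="person_id", how="left")
    # Absence of any smoking code is treated as "non-smoker", as assumed above.
    merged["smoking_status"] = merged["smoking_code"].fillna("non-smoker")
    return merged[["person_id", "smoking_status"]]

print(impute_smoking_status(all_persons, smoking_records))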
The Walters model uses a variable called “social deprivation score”, which ranges from 1 to 5 and indicates the level of social deprivation. This variable was established through linkage of the UK postal (zip) code recorded in the patient notes to UK Population Census data. However, this linkage is no longer available, is unlikely to exist in other databases across the world, and re-establishing it may not be possible or feasible.
The index date (and start of follow-up) of the Walters model is the latest of four entry events: (1) 1 January 2000, (2) the date the individual turned 60, (3) one year after new registration with a THIN practice, and (4) one year after the practice met standard criteria for the accurate recording of deaths, consultations, health measurements, and prescribing. Only the index date of a patient turning 60 could be fully replicated. The start of follow-up on 1 January 2000 was not applicable to any of the data sources as it lies too far in the past, and the remaining two entry events are THIN-specific and could not be replicated in other databases, including IMRD. Because the model is intended for patients aged 60–79, we extended the index event to the latest visit in the record of patients aged 60–79. Visits are suitable for defining index dates because they indicate an interaction with a healthcare provider who may be qualified to apply a model and interpret its predictions.
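As an illustration of this replicated entry event, the sketch below derives the index date as the latest visit at which a patient is aged 60–79, using OMOP CDM-style table and column names; the in-memory data frames and the helper function are illustrative assumptions rather than the validation code used in this study.

import pandas as pd

# Illustrative OMOP CDM-style extracts; real validations query full databases.
person = pd.DataFrame({"person_id": [1, 2], "year_of_birth": [1940, 1955]})
visit_occurrence = pd.DataFrame({
    "person_id": [1, 1, 2, 2],
    "visit_start_date": pd.to_datetime(
        ["2005-03-01", "2012-06-15", "2010-01-10", "2020-08-20"]),
})

def latest_visit_index_date(person, visit_occurrence, min_age=60, max_age=79):
    # Approximate age at visit from the year of birth, assumed to be the only birth information available.
    visits = visit_occurrence.merge(person, on="person_id")
    age_at_visit = visits["visit_start_date"].dt.year - visits["year_of_birth"]
    eligible = visits[(age_at_visit >= min_age) & (age_at_visit <= max_age)]
    # The latest eligible visit per person serves as the index date.
    return (eligible.groupby("person_id")["visit_start_date"]
            .max().rename("index_date").reset_index())

print(latest_visit_index_date(person, visit_occurrence))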
The paper mentions that Read codes, a hierarchical coding system that maps onto ICD-10 codes, were used for development. The authors provide the literal names of the predictors, from which we could determine the corresponding codes.
3.2.2 Mehta model
The research paper does not report the full model, which is a Cox proportional hazards model. While the coefficients are reported, the baseline hazard and the time-at-risk are missing. We contacted the authors of this study, but they were unable to provide this information. We were still able to replicate the model for an estimated time-at-risk of 5 years by normalizing the values of \(\theta^{T}X\) to a risk score between 0 and 1, where \(\theta\) and \(X\) are the coefficient vector and feature vector, respectively. However, without the baseline hazard we are unable to assess calibration and report discrimination performance only.
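The original paper does not prescribe the form of this normalization; a minimal formulation, assuming simple min–max scaling of the linear predictor over the patients in the validation cohort, is

\[
\text{score}_i = \frac{\theta^{T}X_i - \min_j \theta^{T}X_j}{\max_j \theta^{T}X_j - \min_j \theta^{T}X_j}.
\]

Because such a monotone transformation preserves the ranking of patients, discrimination measures such as the AUROC are unaffected by the choice of scaling.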
In addition, no source codes or vocabularies are provided for the predictors, so the predictors had to be replicated from their medical terms.
3.2.3 Nori model
The Nori model did not explicitly report the time-at-risk. As with the Mehta model, we are still able to replicate the model for an estimated time-at-risk of 5 years.
The paper provides ICD-9-CM codes for diagnoses and CPT-4 codes for procedures that are used as predictors in the final model. The OMOP-CDM uses CPT-4 as the standard vocabulary for procedures and SNOMED CT for diagnoses. However, a mapping table from ICD-9-CM to SNOMED CT is available. Therefore, we could replicate predictors using exact code definitions for the Nori model.
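A sketch of how such a mapping can be performed with a locally downloaded OMOP vocabulary export (the CONCEPT and CONCEPT_RELATIONSHIP tables and their “Maps to” relationships) is shown below; the file paths and example ICD-9-CM codes are placeholders, and the snippet illustrates the general mechanism rather than the exact tooling used in this study.

import pandas as pd

# Assumes tab-separated OMOP vocabulary exports are available locally;
# the file paths are placeholders.
concept = pd.read_csv("CONCEPT.csv", sep="\t", dtype=str)
concept_relationship = pd.read_csv("CONCEPT_RELATIONSHIP.csv", sep="\t", dtype=str)

def map_icd9cm_to_snomed(icd9_codes):
    # Source concepts: the given ICD-9-CM codes.
    source = concept[(concept["vocabulary_id"] == "ICD9CM")
                     & (concept["concept_code"].isin(icd9_codes))]
    # Follow the 'Maps to' relationships to the standard (SNOMED CT) concepts.
    maps_to = concept_relationship[concept_relationship["relationship_id"] == "Maps to"]
    mapped = source.merge(maps_to, left_on="concept_id", right_on="concept_id_1")
    target = concept.rename(columns=lambda c: "target_" + c)
    result = mapped.merge(target, left_on="concept_id_2", right_on="target_concept_id")
    return result[["concept_code", "concept_name",
                   "target_concept_id", "target_concept_name"]]

# Example ICD-9-CM codes; not necessarily the model's predictor list.
print(map_icd9cm_to_snomed(["331.0", "290.0"]))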
A characteristic of the Nori model is its complex target population definition, with multiple entry events and various observation windows. Replication based on the written definition and the graphical representation of the target population (Fig. 1 in the original paper) was notably more difficult than for the other replicated models (13).
3.2.4 External validation performance
Table 4 provides the discrimination performance and Table 5 the calibration and recalibration performance of the replicated models. Calibration and recalibration in terms of the Eavg were only assessed if the model’s authors provided the baseline risk, for example in the form of an intercept or baseline hazard. In Fig. 2 we present “round-trip” calibration as observed versus expected risks.
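For reference, Eavg summarizes miscalibration as the mean absolute difference between the predicted risks and the observed risks read off a smoothed calibration curve; assuming the usual formulation,

\[
E_{avg} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{p}_i - \hat{p}^{\,c}_i\right|,
\]

where \(\hat{p}_i\) is the predicted risk for patient \(i\) and \(\hat{p}^{\,c}_i\) is the corresponding observed risk estimated from the calibration curve.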
Walters’ Dementia Risk Score performed best during its development on THIN and worst after model replication and validation on CPRD, MDCR, and IMRD. Interestingly, IMRD, which incorporates THIN, is the closest approximation of the development data and still shows a substantial performance deterioration for the round-trip. Figure 2a shows the round-trip calibration of the original Walters model on IMRD, indicating moderate agreement between observed and predicted risk for the entire target population.
Mehta’s RxDx-Dementia Risk Index performed best during development on CPRD and almost equally well in the three primary care databases CPRD, IPCI, and IMRD.
Nori’s ADRD prediction model performed best during development on OptumLabs and almost equally well in the remaining data sources. Interestingly, the round-trip performance on OPEHR was the worst. Figure 2c shows that the model overpredicts the round-trip risk in the CPRD target population.
Almost all models show an improvement in Eavg after recalibration (Table 5). Recalibration of the Mehta model was not assessed because no baseline hazard was provided.
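One standard form of recalibration, given here only as an illustrative example and not necessarily the procedure applied in this study, refits the intercept and slope of the linear predictor on the validation data,

\[
\operatorname{logit}\!\left(\hat{p}^{\,\text{recal}}_i\right) = \alpha + \beta \, LP_i,
\]

where \(LP_i\) is the original linear predictor and \(\alpha\) and \(\beta\) are estimated in the validation database.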
Table 4
Internal and external discrimination performance in AUROC of externally validated models. The round-trip performances for each model are presented in the shaded cells.
* Discrimination AUROC (95% confidence intervals)
Table 5
External calibration and recalibration performance in Eavg of externally validated models. Calibration of Mehta’s RxDx-Dementia Risk Index was not assessed due to missing baseline hazard.
Model | MDCR | IQGER | OPSES | OPEHR | CPRD | IPCI | IMRD
Walters | 0.060 (0.002)* | 0.025 (0.011) | 0.064 (0.011) | 0.057 (0.032) | 0.073 (0.015) | 0.024 (0.011) | 0.065 (0.001)
Mehta | - | - | - | - | - | - | -
Nori | 0.164 (0.001) | 0.142 (0.001) | 0.258 (0.001) | 0.170 (0.001) | 0.198 (0.001) | 0.790 (0.0002) | 0.19 (0.001)
* Calibration Eavg (recalibrated Eavg)