The baseline characteristics of the female development, validation and contemporary cohorts are provided in Table 1; the equivalent table for the male cohort is in additional file 3. Data were missing for ethnicity (57.93% and 58.16% for the female and male cohorts respectively), BMI (31.17% and 46.38%), cholesterol/HDL ratio (61.52% and 64.29%), SBP (18.99% and 40.79%), SBP variability (49.61% and 79.06%) and smoking status (24.82% and 34.83%). Not all of these variables were used to derive risk scores in this paper, but all were included in the imputation process so that imputed values were as accurate as possible.
Table 1: Baseline characteristics of each female cohort

| Variable | Category | Development (n = 1 865 079) | Validation (n = 100 000) | Contemporary (n = 387 557) |
|---|---|---|---|---|
| Outcome | Total CVD events | 82 065 | 4482 | NA |
| | Total follow up (years) | 13 098 449 | 703 471 | NA |
| Age (years) | | 43.07 (15.94) | 43.14 (15.96) | 48.38 (14.43) |
| Systolic blood pressure (mmHg) | | 123.91 (18.28) | 124 (18.22) | 123.97 (15.17) |
| Body mass index (kg/m²) | | 25.6 (5.60) | 25.56 (5.56) | 27.1 (6.31) |
| Cholesterol/high density lipoprotein ratio | | 3.72 (1.20) | 3.72 (1.21) | 3.46 (1.04) |
| Smoking status | Never | 56.04% | 56.15% | 46.05% |
| | Ex | 16.97% | 16.98% | 31.66% |
| | Current | 27.00% | 26.87% | 22.29% |
| Townsend | 1 (least deprived) | 21.96% | 21.96% | 24.95% |
| | 2 | 21.99% | 21.81% | 22.35% |
| | 3 | 21.17% | 21.46% | 21.56% |
| | 4 | 20.46% | 20.36% | 18.70% |
| | 5 (most deprived) | 14.42% | 14.41% | 12.44% |
| Treated hypertension | | 6.18% | 6.19% | 8.45% |
| Family history of CVD | | 15.08% | 15.13% | 20.86% |
| Type 2 diabetes | | 1.16% | 1.19% | 1.15% |

For continuous variables the mean (standard deviation) is reported. There is no follow up (NA) for the contemporary cohort, as individuals enter the cohort on 1st Jan 2016 and follow up in the CPRD extract stopped three months after this.
The distributions of the C statistic, calibration-in-the-large, MAPEpractical and net benefit of the 1000 models for each sample size are given in Table 2. The 97.5th percentile of the C statistic was similar for each sample size, but the 2.5th percentile was lower at smaller sample sizes (0.802 for N = Nmin vs 0.868 for N = 100 000 in the female cohort, and 0.805 vs 0.843 in the male cohort). All C statistics in the 2.5 – 97.5 percentile range were > 0.8. The variation in the calibration-in-the-large decreased as the sample size increased: the width of the 2.5 – 97.5 percentile range was 2.61% (female) and 3.12% (male) for N = Nmin, decreasing to 0.32% (female) and 0.36% (male) for N = 100 000. Note that the calibration-in-the-large is not centred on zero, but we do not believe this affects the validity of the results. QRISK3(28) suffers from a similarly poor calibration-in-the-large, yet is well calibrated within risk deciles. This is discussed further in the limitations section. MAPEpractical improved as the sample size increased: its 2.5 – 97.5 percentile range across models was 1.13% to 2.46% (female) and 1.34% to 2.91% (male) when N = Nmin, and 0.13% to 0.28% (female) and 0.14% to 0.32% (male) when N = 100 000. Net benefit also improved as the sample size increased, with a 2.5 – 97.5 percentile range of 0.017 to 0.021 (female) and 0.024 to 0.029 (male) when N = Nmin, and 0.021 to 0.022 (female) and 0.028 to 0.029 (male) when N = 100 000.
Table 2: Quantiles of C statistics, calibration-in-the-large, MAPEpractical and net benefit of the 1000 models, for each sample size

| Metric | Sample size | Female 2.5% | Female 25% | Female 50% | Female 75% | Female 97.5% | Male 2.5% | Male 25% | Male 50% | Male 75% | Male 97.5% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C statistic | Nmin | 0.802 | 0.852 | 0.857 | 0.861 | 0.864 | 0.805 | 0.827 | 0.831 | 0.835 | 0.839 |
| | Nepv10 | 0.856 | 0.861 | 0.863 | 0.865 | 0.867 | 0.826 | 0.834 | 0.837 | 0.839 | 0.841 |
| | 10 000 | 0.865 | 0.866 | 0.867 | 0.867 | 0.868 | 0.840 | 0.841 | 0.842 | 0.843 | 0.843 |
| | 50 000 | 0.867 | 0.868 | 0.868 | 0.868 | 0.868 | 0.843 | 0.843 | 0.843 | 0.843 | 0.844 |
| | 100 000 | 0.868 | 0.868 | 0.868 | 0.868 | 0.868 | 0.843 | 0.843 | 0.843 | 0.844 | 0.844 |
| Calibration-in-the-large (%) | Nmin | 2.22 | 1.43 | 0.95 | 0.47 | -0.39 | 2.56 | 1.49 | 1.01 | 0.45 | -0.56 |
| | Nepv10 | 1.85 | 1.27 | 0.97 | 0.64 | 0.11 | 2.23 | 1.47 | 1.02 | 0.60 | 0.29 |
| | 10 000 | 1.45 | 1.13 | 0.95 | 0.78 | 0.44 | 1.61 | 1.20 | 1.01 | 0.80 | 0.39 |
| | 50 000 | 1.18 | 1.03 | 0.95 | 0.87 | 0.73 | 1.28 | 1.11 | 1.02 | 0.93 | 0.77 |
| | 100 000 | 1.11 | 1.01 | 0.96 | 0.90 | 0.79 | 1.21 | 1.08 | 1.02 | 0.95 | 0.85 |
| MAPEpractical (%) | Nmin | 1.13 | 1.53 | 1.75 | 2.00 | 2.46 | 1.34 | 1.79 | 2.04 | 2.34 | 2.91 |
| | Nepv10 | 0.76 | 1.03 | 1.20 | 1.36 | 1.74 | 1.00 | 1.36 | 1.57 | 1.78 | 2.26 |
| | 10 000 | 0.42 | 0.55 | 0.63 | 0.73 | 0.90 | 0.48 | 0.63 | 0.73 | 0.85 | 1.04 |
| | 50 000 | 0.19 | 0.25 | 0.28 | 0.32 | 0.40 | 0.21 | 0.29 | 0.33 | 0.37 | 0.45 |
| | 100 000 | 0.13 | 0.17 | 0.20 | 0.22 | 0.28 | 0.14 | 0.20 | 0.23 | 0.26 | 0.32 |
| Net benefit | Nmin | 0.017 | 0.019 | 0.020 | 0.021 | 0.021 | 0.024 | 0.026 | 0.027 | 0.028 | 0.029 |
| | Nepv10 | 0.020 | 0.021 | 0.021 | 0.021 | 0.022 | 0.026 | 0.027 | 0.028 | 0.028 | 0.029 |
| | 10 000 | 0.021 | 0.021 | 0.021 | 0.022 | 0.022 | 0.028 | 0.028 | 0.028 | 0.029 | 0.029 |
| | 50 000 | 0.021 | 0.022 | 0.022 | 0.022 | 0.022 | 0.028 | 0.029 | 0.029 | 0.029 | 0.029 |
| | 100 000 | 0.021 | 0.022 | 0.022 | 0.022 | 0.022 | 0.028 | 0.029 | 0.029 | 0.029 | 0.029 |

Performance metrics of the population derived models were as follows. C statistic: 0.868 (female) and 0.844 (male). Calibration-in-the-large: 0.95% (female) and 1.02% (male). Net benefit: 0.022 (female) and 0.029 (male).
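Each row of Table 2 is a set of quantiles of one metric over the fitted models. A minimal sketch of that summary in Python, assuming the metric values for one sample size are held in a one-dimensional array. The simulated C statistics and the name `summarise_metric` are illustrative assumptions, not values or code from the study:

```python
import numpy as np

# Quantiles reported in Table 2.
QUANTILES = [0.025, 0.25, 0.5, 0.75, 0.975]

def summarise_metric(values):
    """Return the five quantiles and the width of the 2.5-97.5 percentile range."""
    values = np.asarray(values, dtype=float)
    q = np.quantile(values, QUANTILES)
    return {"quantiles": q, "width": float(q[-1] - q[0])}

rng = np.random.default_rng(0)
# Hypothetical C statistics for 1000 models at a small and a large sample size.
small_n = rng.normal(loc=0.85, scale=0.015, size=1000)
large_n = rng.normal(loc=0.868, scale=0.0005, size=1000)

small_summary = summarise_metric(small_n)
large_summary = summarise_metric(large_n)
print(np.round(small_summary["quantiles"], 3), round(small_summary["width"], 3))
print(np.round(large_summary["quantiles"], 3), round(large_summary["width"], 3))
```

The same function applies unchanged to calibration-in-the-large, MAPEpractical or net benefit, since it only orders the values it is given.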
Figure 2 plots the 5 – 95 percentile range in risks for patients across the 1000 models, grouped by population derived risk (female cohort). Specifically, each data point making up the box plots is the 5 – 95 percentile range in risk across the 1000 models for one individual. The box plots are drawn in Tukey’s style(29), where outliers are plotted separately if they lie more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile. Note that these limits on the box plots are distinct from the 5 – 95 percentile range in risk for each individual. The number of patients contributing to each box plot (defined by the population derived risk) is stated at the top of the graph. For N = 100 000, the median 5 – 95 percentile range was 0.77%, 1.60%, 2.42% and 3.22% for patients in the 4–5%, 9–10%, 14–15% and 19–20% risk groups respectively. For N = 50 000, the median percentile range was 1.10%, 2.29%, 3.45% and 4.61% in the respective groups; for N = 10 000 it was 2.49%, 5.23%, 7.92% and 10.59%; for N = Nepv10 it was 4.60%, 9.61%, 14.52% and 19.39%; and for N = Nmin it was 6.79%, 14.41%, 21.89% and 29.21%. For each sample size, the median percentile range of each group increased linearly with the population derived risk of that group. For example, for a sample size of 10 000 the median percentile range was always approximately 50% of the population derived risk, whereas for Nmin it was always approximately 150% of the population derived risk. Results for the male cohort followed a similar pattern, but the level of instability was slightly lower (additional file 3).
[INSERT FIGURE 2]
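The per-patient 5 – 95 percentile range and its median within each risk group can be sketched as follows. This is an illustration on simulated data only: the risk matrix, the multiplicative noise model and the function names are assumptions made for the example, not the study's data or code.

```python
import numpy as np

def percentile_range(risks, lower=5, upper=95):
    """Per-patient width of the lower-upper percentile range across models.

    risks: array of shape (n_models, n_patients) holding predicted risks in %.
    """
    lo, hi = np.percentile(risks, [lower, upper], axis=0)
    return hi - lo

def median_range_by_group(ranges, population_risk, groups):
    """Median range among patients whose population derived risk lies in [low, high)."""
    out = {}
    for low, high in groups:
        mask = (population_risk >= low) & (population_risk < high)
        out[(low, high)] = float(np.median(ranges[mask]))
    return out

rng = np.random.default_rng(1)
n_models, n_patients = 1000, 2000
# Hypothetical population derived risks (%), spread uniformly over 0-25%.
population_risk = rng.uniform(0, 25, size=n_patients)
# Hypothetical model risks: multiplicative noise, so instability grows with risk.
noise = rng.normal(loc=1.0, scale=0.15, size=(n_models, n_patients))
risks = population_risk * noise

ranges = percentile_range(risks)
groups = [(4, 5), (9, 10), (14, 15), (19, 20)]
medians = median_range_by_group(ranges, population_risk, groups)
for (low, high), med in medians.items():
    midpoint = (low + high) / 2
    print(f"{low}-{high}%: median range {med:.2f}%, ratio to midpoint {med / midpoint:.2f}")
```

Taking the midpoint of each risk group as the population derived risk (an assumption), the medians reported for N = 10 000 give ratios of about 2.49/4.5 ≈ 0.55 and 10.59/19.5 ≈ 0.54, in line with the roughly constant proportion described in the text.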
Figure 3 plots the 5 – 95 percentile range in risks for patients across models subsetted by the C statistic of the models (female cohort, N = 10 000). The median 5 – 95 percentile range across models with C statistics in the top third was 2.05%, 4.27%, 6.47% and 8.71% for patients in the respective risk groups. This equates to an 18–19% reduction in the median percentile range when using well discriminating models compared to all models (2.49%, 5.23%, 7.92% and 10.59%). Results for other sample sizes are presented in additional file 3.
[INSERT FIGURE 3]
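The subsetting step can be sketched as below: keep the third of models that perform best on one metric, then recompute the median 5 – 95 percentile range over that subset. The simulated models, the assumed link between a model's noise level and its C statistic, and the helper names are all hypothetical and serve only to make the example run.

```python
import numpy as np

def top_third_mask(metric, higher_is_better=True):
    """Boolean mask selecting the best-performing third of models on one metric."""
    metric = np.asarray(metric, dtype=float)
    if higher_is_better:
        return metric >= np.quantile(metric, 2 / 3)
    return metric <= np.quantile(metric, 1 / 3)

def median_percentile_range(risks, lower=5, upper=95):
    """Median over patients of the per-patient percentile range across models."""
    lo, hi = np.percentile(risks, [lower, upper], axis=0)
    return float(np.median(hi - lo))

rng = np.random.default_rng(2)
n_models, n_patients = 900, 500
true_risk = np.full(n_patients, 9.5)  # hypothetical patients in the 9-10% group
# Hypothetical set-up: each model has its own noise level.
model_sd = rng.uniform(0.5, 3.0, size=n_models)
risks = true_risk + rng.normal(size=(n_models, n_patients)) * model_sd[:, None]
# Assumed link for illustration only: noisier models discriminate worse.
c_statistic = 0.87 - 0.01 * model_sd

all_models = median_percentile_range(risks)
best_models = median_percentile_range(risks[top_third_mask(c_statistic)])
print(f"all models: {all_models:.2f}%, top third by C statistic: {best_models:.2f}%")
```

For a metric where smaller values are better, such as MAPEpractical, the same mask would be built with `higher_is_better=False`.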
Figure 4 plots the 5 – 95 percentile range in risks for patients across models subsetted by the calibration-in-the-large of the models (female cohort, N = 10 000). The median 5 – 95 percentile range across models with the best calibration-in-the-large was 2.29%, 4.78%, 7.26% and 9.77% for the respective risk groups. This equates to an 8–9% reduction in the median percentile range compared to when using all models (2.49%, 5.23%, 7.92% and 10.59%). Results for other sample sizes are presented in additional file 3.
[INSERT FIGURE 4]
Figure 5 plots the 5 – 95 percentile range in risks for patients across models subsetted by the MAPEpractical of the models (female cohort, N = 10 000). The median 5 – 95 percentile range across models with the MAPEpractical in the top third was 1.92%, 4.04%, 6.11% and 8.20% for the respective risk groups. This equates to a 23% reduction in the median percentile range compared to when using all models (2.49%, 5.23%, 7.92% and 10.59%). Results for other sample sizes are presented in additional file 3.
[INSERT FIGURE 5]
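The relative reductions quoted for Figures 3 to 5 can be recomputed from the reported medians (female cohort, N = 10 000). The sketch below uses only the rounded medians given in the text, so the results may differ slightly from reductions calculated on unrounded values; the function name is illustrative.

```python
# Median 5-95 percentile ranges (%) for the four risk groups, as reported in the text.
all_models = [2.49, 5.23, 7.92, 10.59]
subsets = {
    "C statistic (top third)": [2.05, 4.27, 6.47, 8.71],
    "calibration-in-the-large (best)": [2.29, 4.78, 7.26, 9.77],
    "MAPEpractical (top third)": [1.92, 4.04, 6.11, 8.20],
}

def relative_reduction(subset, reference):
    """Percentage reduction of each subset median relative to the all-models median."""
    return [100 * (1 - s / r) for s, r in zip(subset, reference)]

reductions = {name: relative_reduction(vals, all_models) for name, vals in subsets.items()}
for name, red in reductions.items():
    print(name, [round(r, 1) for r in red])
```

Selecting on MAPEpractical gives the largest reduction of the three criteria, followed by the C statistic and then calibration-in-the-large.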
Figure 6 shows the probability that a patient from a given risk group (according to the population derived model) is classified on the opposite side of the 10% threshold by a randomly chosen model (female cohort). For example, when using a sample size of Nmin, 26.91% of patients with a population derived risk of 14–15% would be classified as having a risk below 10%. For N = Nepv10 this would be 16.18%, whereas it is only 2.50% for N = 10 000, 0.01% for N = 50 000 and < 0.01% for N = 100 000.
[INSERT FIGURE 6]
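One way to estimate this probability is as the share of (model, patient) pairs, among patients in a risk group, whose predicted risk falls on the other side of the threshold. The sketch below assumes that reading and uses simulated risks; the noise levels standing in for small and large sample sizes and the function names are hypothetical.

```python
import numpy as np

def prob_opposite_side(risks, population_risk, group, threshold=10.0):
    """Share of (model, patient) pairs classified on the other side of the threshold.

    risks: (n_models, n_patients) predicted risks in %.
    group: (low, high) bounds on the population derived risk, assumed not to
    straddle the threshold.
    """
    low, high = group
    mask = (population_risk >= low) & (population_risk < high)
    group_risks = risks[:, mask]
    if low >= threshold:
        flipped = group_risks < threshold   # high-risk patients classified as low risk
    else:
        flipped = group_risks >= threshold  # low-risk patients classified as high risk
    return float(flipped.mean())

rng = np.random.default_rng(3)
n_models, n_patients = 1000, 2000
# Hypothetical population derived risks (%).
population_risk = rng.uniform(0, 25, size=n_patients)

def simulate(sd):
    """Hypothetical model risks with multiplicative noise of the given spread."""
    return population_risk * rng.normal(1.0, sd, size=(n_models, n_patients))

p_small_n = prob_opposite_side(simulate(0.30), population_risk, (14, 15))
p_large_n = prob_opposite_side(simulate(0.05), population_risk, (14, 15))
print(f"noisy models: {p_small_n:.4f}, stable models: {p_large_n:.4f}")
```

Averaging over all model and patient pairs corresponds to drawing one model at random for each patient in the group.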