Characterizing and Quantifying Performance Heterogeneity in Cardiovascular Risk Prediction Models — A Step Towards Improved Disease Risk Assessment

Prediction models are commonly used to estimate risk for cardiovascular diseases; however, performance may vary substantially across relevant subgroups of the population. Here we investigated the variability of performance and fairness across a variety of subgroups for risk prediction of two common diseases, atherosclerotic cardiovascular disease (ASCVD) and atrial fibrillation (AF). We calculated the Cohorts for Heart and Aging Research in Genomic Epidemiology Atrial Fibrillation (CHARGE-AF) score for AF and the Pooled Cohort Equations (PCE) score for ASCVD in three large data sets: Explorys Life Sciences Dataset (Explorys, n = 21,809,334), Mass General Brigham (MGB, n = 520,868), and the UK Biobank (UKBB, n = 502,521). Our results demonstrate important performance heterogeneity of established cardiovascular risk scores across subpopulations defined by age, sex, and presence of preexisting disease. For example, in CHARGE-AF, discrimination declined with increasing age, with the concordance index falling from 0.72 [95% CI, 0.72 – 0.73] in the youngest (45 – 54y) subgroup to 0.57 [0.56 – 0.58] in the oldest (85 – 90y) subgroup in Explorys. The statistical parity difference (i.e., the difference in the likelihood of being classified as high risk) was considerable between males and females within the 65 – 74y subgroup, with a value of -0.33 [95% CI, -0.33 – -0.33]. We also observed that large segments of the population suffered from both decreased discrimination (i.e., <0.7) and poor calibration (i.e., calibration slope outside of 0.7 – 1.3); for example, all individuals 75 or older in Explorys (17.4%). Our findings highlight the need to characterize and quantify how clinical risk models behave and perform within specific subpopulations so they can be used appropriately to facilitate more accurate and equitable assessment of disease risk.

Variability in standard performance metrics to assess cardiovascular disease (CVD) risk has frequently been reported 6,11 , with findings highlighting that performance varies depending on the type of group, for example, sex groups 1 , racial groups (inside the US 2,3,4 and outside the US 5,8,9 ), and groups with specific clinical factors 7,10 . With the continued growth of large collections of electronic health records accessible for research purposes, it is now possible to more thoroughly explore and better understand performance heterogeneity, considering more refined subgroups.
CVD risk models are commonly used to prioritize individuals for preventive counseling (e.g., weight loss, alcohol cessation) and therapies (e.g., cholesterol-lowering medication). For atherosclerotic CVD (ASCVD), risk estimation using the Pooled Cohort Equations (PCE) is recommended by U.S. guidelines for determining whether individuals without established ASCVD should be considered for cholesterol-lowering therapy 12 . For atrial fibrillation (AF), in which the presence of arrhythmia is associated with an increased risk of stroke and heart failure (HF), risk estimation may also prioritize individuals for screening to detect asymptomatic disease 13,44 . The Cohorts for Heart and Aging Research in Genomic Epidemiology AF (CHARGE-AF) score 14,15 has consistently demonstrated good predictive performance for incident AF risk across multiple community cohorts 16,17 and electronic health record (EHR)-based repositories 18 .
Leveraging three large and distinct datasets, one from a prospective cohort and two from electronic health records, covering millions of individuals, we aimed to robustly characterize CVD risk score performance heterogeneity across multiple subpopulations defined by clinically relevant strata (e.g., age, sex, and presence of relevant diseases at baseline). Specifically, we deployed the CHARGE-AF and PCE scores within 5 subpopulations across each dataset and quantified model performance, including discrimination, calibration, and fairness metrics, assessing for important and consistent patterns of heterogeneity 19 .

Data sources
A high-level summary of our methodology is illustrated in SUPPLEMENTARY FIGURE 1. We analyzed 3 independent data sources: the Explorys Dataset, Mass General Brigham (MGB), and the UK Biobank (UKBB).
The Explorys Dataset is comprised of the healthcare data of over 21 million individuals, pooled from different healthcare systems with distinct EHRs that have been previously used for medical research 20,18,21 . Data were statistically de-identified 22 , standardized, and normalized using common ontologies and made searchable after being uploaded to a Health Insurance Portability and Accountability Act-enabled platform. The data included EHR entries for all patients who were seen between January 1, 1999, and December 31, 2020.
MGB is a large healthcare network serving the New England region of the US. We utilized the Community Care Cohort Project 23 , an EHR dataset comprising over 520,000 individuals who received care at any of the 7 academic and community hospitals in MGB.
The UKBB is a prospective cohort of over 500,000 participants enrolled during 2006-2010 24 . Briefly, approximately 9.2 million individuals aged 40-69 years living within 25 miles of 22 assessment centers in the UK were invited, and 5.4% participated in the baseline assessment. Questionnaires and physical measures were collected at recruitment, and all participants are followed for outcomes through linkage to national health-related datasets.

Cohort construction
To ensure adequate data ascertainment and follow-up, we included in Explorys individuals with at least two outpatient encounters greater than or equal to 2 years apart 25 . Individuals in the MGB dataset had at least one pair of primary care office visits 1-3 years apart. We included all individuals who enrolled in the UKBB study. We excluded all enrolled individuals who decided at a later point to withdraw consent.
In Explorys, the start of follow-up was defined as the first encounter following the second qualifying outpatient encounter. In MGB, the start of follow-up was defined as the second office visit of the earliest qualifying pair. In UKBB, as an enrollment-based resource, start of follow-up was the date of the initial assessment visit. In each dataset, individuals with missing data for AF risk estimation at baseline were excluded. We refer to the AF analysis sets as the "AF Subsets". We defined the ASCVD analysis set analogously, with exclusion of individuals with missing data needed to calculate the PCE score ("ASCVD Subsets"). Full details of the cohort construction for the 3 datasets are shown in SUPPLEMENTARY TABLES I-VI.

Clinical factors
Age, sex, race, and smoking status were defined using EHR fields in Explorys and MGB and were self-reported at the initial assessment visit in UKBB. Height, weight, blood pressure, total cholesterol, and high-density lipoprotein cholesterol values were

Follow-up and outcome definitions
The primary outcomes were 5-year incident AF (for the AF Subsets), and 10-year incident ASCVD (for the ASCVD Subsets). Incident AF was defined using a modified version of a previously validated EHR-based AF ascertainment algorithm (positive predictive value 92%), in which electrocardiographic criteria were not used given the absence of electrocardiogram reports in the Explorys Dataset 26 . Incident ASCVD was defined as a composite of myocardial infarction (MI) and stroke, each defined using previously published sets of diagnosis codes 27 . Outcome definitions are shown in SUPPLEMENTARY TABLE VII.
All models were censored at the earliest of death, last follow-up, or the end of the relevant prediction window (i.e., 5 years for CHARGE-AF and 10 years for the PCE).
Last follow-up was defined as the last office visit or hospital encounter in Explorys, last EHR encounter in MGB, and date of last available linked hospital data in UKBB.
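The censoring rule above can be illustrated with a minimal Python sketch. The function name `follow_up`, its arguments, and the use of 365.25-day years are assumptions made for illustration and are not taken from the study code.

```python
from datetime import date


def follow_up(start, event_date, death_date, last_follow_up, window_years):
    """Return (years_at_risk, event_observed) for one individual.

    Follow-up is censored at the earliest of death, last follow-up, or the
    end of the prediction window (5 years for CHARGE-AF, 10 for the PCE).
    Using 365.25-day years is an assumption of this sketch.
    """
    window_days = window_years * 365.25
    # Candidate censoring times, in days since start; missing dates are skipped.
    censor_days = [window_days]
    for d in (death_date, last_follow_up):
        if d is not None:
            censor_days.append((d - start).days)
    censor = min(censor_days)

    # An event counts only if it occurs on or before the censoring time.
    if event_date is not None:
        event_days = (event_date - start).days
        if event_days <= censor:
            return event_days / 365.25, True
    return censor / 365.25, False
```

For example, an individual with no event and a last encounter well beyond the window contributes exactly `window_years` of follow-up with `event_observed` set to `False`.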

Subgroup types
Per the original design of the PCE, we assessed the 4 sex- and race-specific models within their respective populations (Black women, Black men, White women, White men). All populations were stratified further into 10-year age ranges. These age-based analyses included 6 age strata for CHARGE-AF (45-54, 55-64, 65-74, 75-84, 85-90, and all) and 5 age strata for PCE (40-49, 50-59, 60-69, 70-79, and all). In the AF analyses, we evaluated the following additional subgroups: females, males, Black race, White race, prevalent HF, and prevalent stroke. In the PCE analyses, we also evaluated prevalent HF.
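As a small illustration of the age stratification, the following Python sketch assigns an individual to one of the strata listed above. The constant and function names are hypothetical, and flooring age to completed years is an assumption of the sketch; the "all" stratum is simply the unstratified population and is not encoded here.

```python
# Age strata as listed in the text (inclusive bounds, in completed years).
CHARGE_AF_STRATA = [(45, 54), (55, 64), (65, 74), (75, 84), (85, 90)]
PCE_STRATA = [(40, 49), (50, 59), (60, 69), (70, 79)]


def age_stratum(age, strata):
    """Return the stratum label (e.g. '65-74') for an age, or None if outside all strata."""
    completed_years = int(age)  # assumption: ages are floored to completed years
    for low, high in strata:
        if low <= completed_years <= high:
            return f"{low}-{high}"
    return None
```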

Quantification of model performance
We computed incidence rates for each outcome, reported per 1,000 patient years (1K PY). For each risk score and subgroup, we assessed the association between the risk score and its respective outcome using Cox proportional hazards regression, with 5-year AF as the outcome of interest for CHARGE-AF and 10-year ASCVD as the outcome of interest for PCE. Hazard ratios were scaled by the within-sample standard deviation (SD) of the linear predictor of each score for comparability (Standardized Hazard Ratio [SHR]). Therefore, the SHR reflects the relative increase in event hazard observed with a 1-SD increase in the respective linear predictor. We also assessed the discrimination of each score by calculating Harrell's c-index. We compared calibration slopes, defined as the beta coefficient of a univariable Cox proportional hazards model with the prediction target as the outcome and the linear predictor of the respective risk score as the sole covariate, where an optimally calibrated slope has a value of one 28 .
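To make these definitions concrete, the sketch below computes each quantity in plain Python. It is illustrative only and is not the study's implementation: all function names are hypothetical, the Cox coefficient is fitted by Newton-Raphson with the Breslow treatment of tied event times, pairs with tied times are skipped in the c-index, and the sample SD (n - 1 denominator) is used for the SHR. In practice an established survival analysis package would likely be used instead.

```python
import math
from statistics import stdev


def incidence_rate_per_1000py(events, years):
    """Events per 1,000 patient-years of follow-up."""
    return 1000.0 * sum(events) / sum(years)


def harrell_c_index(times, events, lp):
    """Harrell's c-index; ties in the linear predictor count as 0.5."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if lp[i] > lp[j]:
                    concordant += 1.0
                elif lp[i] == lp[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def cox_slope(times, events, lp, max_iter=50, tol=1e-9):
    """Coefficient of a univariable Cox model with lp as the sole covariate.

    Read as the calibration slope when lp is the score's linear predictor.
    """
    beta = 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for i in range(len(times)):
            if not events[i]:
                continue
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, lp):
                if t_j >= times[i]:  # individual j is still at risk
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean = s1 / s0
            score += lp[i] - mean            # first derivative of log partial likelihood
            info += s2 / s0 - mean * mean    # negative second derivative
        if info <= 0.0:
            raise ValueError("no information to estimate the slope")
        step = score / info
        beta += step
        if abs(step) < tol:
            return beta
    raise RuntimeError("Newton-Raphson did not converge")


def standardized_hazard_ratio(times, events, lp):
    """Hazard ratio per 1-SD increase in the linear predictor."""
    return math.exp(cox_slope(times, events, lp) * stdev(lp))
```

Because the SHR is scaled by the within-sample SD, it is unchanged if the linear predictor is multiplied by a constant, whereas the calibration slope is not; this is why the two quantities are reported separately.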
To further assess potential biases in performance, we calculated fairness measures, including differences in statistical parity, true positive rates, and true negative rates 29 . These analyses focused on subgroups most likely to be affected by potential bias, including age, sex (female and male) and race (Black and White). For these analyses, the CHARGE-AF and PCE scores were converted to event probabilities using their published equations 14,12 . Where fairness metrics required application of binary risk cutoffs (i.e., true positive rate difference and false positive rate difference), we defined high AF risk as estimated 5-year AF risk ≥ 5.0% using CHARGE-AF 30,18 and high ASCVD risk as estimated 10-year ASCVD risk ≥ 7.5% 31,1,2,6 .
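The fairness measures can be sketched as follows. This is an illustrative Python example under several assumptions that are not confirmed by the text: outcomes are treated as binary (censoring is ignored), "high risk" means predicted risk at or above the cutoff, and each difference is taken as the first-named group minus the reference group. The function and argument names are hypothetical.

```python
def _rate(flags):
    """Proportion of True values; raises ZeroDivisionError on an empty list."""
    return sum(flags) / len(flags)


def fairness_differences(risk, outcome, group, threshold, group_a, reference):
    """Differences (group_a minus reference) in three fairness measures.

    risk      : predicted event probabilities
    outcome   : 1 if the event occurred within the prediction window, else 0
    group     : subgroup label for each individual (e.g. "F" or "M")
    threshold : high-risk cutoff (e.g. 0.05 for CHARGE-AF, 0.075 for the PCE)
    """
    def metrics(label):
        idx = [i for i, g in enumerate(group) if g == label]
        high = [risk[i] >= threshold for i in idx]
        # True positive rate: share of individuals with events classified high risk.
        tpr = _rate([h for h, i in zip(high, idx) if outcome[i]])
        # True negative rate: share of individuals without events classified low risk.
        tnr = _rate([not h for h, i in zip(high, idx) if not outcome[i]])
        return _rate(high), tpr, tnr

    a, ref = metrics(group_a), metrics(reference)
    return {
        "statistical_parity_difference": a[0] - ref[0],
        "true_positive_rate_difference": a[1] - ref[1],
        "true_negative_rate_difference": a[2] - ref[2],
    }
```

Under this sign convention, a negative statistical parity difference means that the first-named group is less likely to be classified as high risk than the reference group.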

Results
A summary of baseline characteristics for the three data sets and their associated two distinct outcomes is shown in TABLE 1, including mean (SD) for continuous measurements, percentage for binary attributes, and follow-up durations for each of the six scenarios (i.e., two scores applied to three distinct datasets). For brevity, only the PCE model with the largest cohort (female-White; n = 1,603,450) is described in the sections below; results for all four PCE models are presented in SUPPLEMENTARY TABLE VIII and SUPPLEMENTARY FIGURE 2.

Association between age and incidence of AF and ASCVD
As shown in FIGURE 1A (AF) and FIGURE 1B (ASCVD), the incidence rate increased with age in each dataset. For AF, Explorys and MGB showed similar incidence rates in each age group, whereas UKBB participants had substantially lower AF incidence. Similarly, the ASCVD incidence rate increased with age. The effect of age on ASCVD within each of the four PCE groups is shown in SUPPLEMENTARY TABLE VIII.

Performance heterogeneity of CHARGE-AF
We observed that a variety of subgroups were affected by poor discrimination, poor calibration, or both (SUPPLEMENTARY TABLE X and XI); for example, patients 75 or older had discrimination lower than 0.7 and calibration slope outside the 0.7-1.3 range (17.4% in Explorys, 10.6% in MGB). All patients with prevalent HF also had both measures outside these boundaries (3.7% in Explorys, 1.9% in MGB). SHRs were substantially lower among individuals with prevalent HF and stroke.
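The two boundaries used above (c-index below 0.7 and calibration slope outside 0.7-1.3) can be applied to each subgroup with a small helper such as the following Python sketch. The function name and the returned labels are hypothetical, and treating the slope boundaries as inclusive is an assumption.

```python
def flag_subgroup(c_index, calibration_slope, c_min=0.7, slope_range=(0.7, 1.3)):
    """Classify a subgroup by which performance boundaries it violates."""
    poor_discrimination = c_index < c_min
    low, high = slope_range
    # Assumption: slopes exactly on the boundary count as acceptable.
    poor_calibration = not (low <= calibration_slope <= high)
    if poor_discrimination and poor_calibration:
        return "both"
    if poor_discrimination:
        return "discrimination"
    if poor_calibration:
        return "calibration"
    return "neither"
```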

Biased behaviors for CHARGE-AF
As shown in FIGURE 3A, risk estimates using the CHARGE-AF model were much lower for females than for males, with regard to the population as a whole and particularly in the age groups (65-74 and 75-84); for example, the most biased

Biased behaviors for PCE
As shown in FIGURE 5A, risk estimates using the PCE were much lower for females than for males in the overall population as well as within the intermediate age respectively. Differences in sensitivity on the basis of race decreased with increased age in all 3 datasets, with very little difference observed in the oldest age group (70-79). As shown in FIGURE 5F and again unlike CHARGE-AF, across specific age ranges, specificity was lower for Black individuals than for White individuals; this effect was especially noticeable at intermediate age groups (50-59 and 60-69); for example, specificity difference for the 50-59y subgroup was the greatest compared to the other subgroups in Explorys at -0.246 [95% CI, -0.249 to -0.243].

Discussion
We analyzed three large independent datasets including millions of individuals and identified important patterns of performance heterogeneity across clinically relevant subgroups as indicated by standard performance measures including discrimination, calibration, SHRs, and fairness metrics. Our results build on previous efforts to understand the nature of AF and of ASCVD risk in several key ways. First, we assessed the scores on very large databases, allowing us to perform more granular subgroup analyses. Second, we provide results applicable to 3 resources, allowing us to assess consistency in results across independent datasets. Third, our results provide analyses focused on 2 distinct outcomes, which allows a comparison of performance measures not only across different resources, but also across different conditions. Fourth, our results highlight the magnitude of poor performance affecting a large proportion of the population (discrimination, calibration, or both), especially patients at older ages and with prevalent conditions. Fifth, to our knowledge, our study is the first to report on fairness-related measures for the CHARGE-AF (to predict 5-year incident AF) and PCE (to predict 10-year incident ASCVD) scores to assess possible biases considering sex and race differences.
Patterns of variability were fairly consistent across the CHARGE-AF and PCE models. Importantly, we observed that discrimination and calibration were consistently lower at extremes of age, as well as for individuals with certain prevalent conditions (e.g., HF). Furthermore, we observed evidence of potentially biased performance, with important differences in fairness metrics for sex and race in both scores; for instance, sensitivity was much lower for females than males for both scores in intermediate subgroups, a finding that was consistent in all datasets. Overall, our findings underscore the importance of evaluating prognostic models across the many specific subpopulations in which risk prediction is intended, in order to better understand the accuracy and potential bias of the prognostic information used to drive clinical decisions at the point of care.
Our findings suggest that clinicians utilizing prognostic models should not assume that a given level of performance in the overall population will translate to similar accuracy within a subgroup of the population to which their patient belongs.
Consistent with prior findings suggesting good overall performance of CHARGE-AF 16,17 and the PCE 33,7 across multiple populations, we observed moderate or greater discrimination using each score in our datasets. However, we observed that multiple standard metrics (e.g., discrimination and calibration) vary substantially within subpopulations. Specifically, we observed a consistent pattern of decreasing discrimination and increasing miscalibration for higher age groups. Since the majority of incident CVD occurs among older individuals, our findings suggest that more accurate models for an older population remain a critical unmet need. Future work is needed to assess whether models derived within specific subgroups of clinical importance may lead to better and more consistent model performance across important subsets of the population. In addition to variation across standard model metrics, our findings also suggest that common prognostic models may have biased performance across strata of sex and race. Use of the CHARGE-AF score led to lower sensitivity and greater specificity among women, as well as for Black individuals.
Although use of the PCE also led to lower sensitivity and greater specificity among women, it demonstrated the opposite pattern (greater sensitivity and lower specificity) among Black individuals. It is notable that these differences existed despite the fact that the PCE has dedicated models stratified on the basis of race and sex (i.e., it is based on 4 distinct equations). Since PCE model predictions were generally better calibrated among White individuals (as shown in SUPPLEMENTARY FIGURE 2B), our findings suggest that model derivation in populations having greater representation of women and Black individuals may lead to more accurate and generalizable models with less bias.
Of the 3 databases we analyzed, 2 were EHR-based (Explorys and MGB) and the other (UKBB) was a prospective cohort study. While we did identify a strong consistency between MGB and Explorys, patterns identified in the UKBB were not as consistent in all scenarios with the EHR databases. To make more accurate comparisons, additional studies are required to account for differences in EHR resources compared to enrollment-based resources. Individuals appearing in EHR resources are typically associated with higher prevalence of comorbid conditions (as highlighted in TABLE 1). Furthermore, EHR resources contain data entries collected for archiving and retrieval purposes; by contrast, prospective resources are based on systematic data collection mechanisms and are thus susceptible to selection biases.
Our study has several limitations. First, mirroring definitions of race for CHARGE-AF and the PCE, we classified race as White and Black, which limits our ability to assess for more granular effects of race on model behavior and performance. Second, we were unable to assess the effects of socioeconomic deprivation 37,38,39 given the lack of available data in Explorys and MGB. Third, although we analyzed data from large datasets representing very different settings (i.e., two EHR-based datasets and a prospective cohort study), the majority of individuals across the datasets were White.
Inclusion of data sources comprising larger proportions of Black individuals may have allowed us to examine heterogeneity with greater precision. Fourth, cause of death was not available in any of the 3 datasets, affecting calculations of incident ASCVD and AF measures (we considered in our analyses all death causes, not just CVD-related). Fifth, although our findings provide important evidence of performance heterogeneity and potential bias in commonly used risk estimators, we did not explore methods to mitigate these biases. Sixth, we have not applied recently proposed fairness metrics that assess individual fairness (rather than assessing bias at the population level) 42,43 .
There are several potential strategies to mitigate the important heterogeneity in performance we characterized and quantified in the current study. One strategy is to adjust models according to empirically observed patterns of bias, for example through recalibration, which has been previously proposed as a potential method to reduce bias and minimize, in particular, decisions related to the overtreatment of healthy individuals 5,34 . Another potential approach is to reweight existing models 36,40,41 within each subgroup of the population, resulting in distinct weights for each subgroup of interest. Yet another strategy is to create new larger models that include certain variables (e.g., socioeconomic deprivation) 35,5 that may offer more consistent prognostic value across subgroups, as well as variables defined to greater precision (e.g., more precise quantification of self-reported race(s)). Applying mitigation and individual-level fairness assessment techniques is outside the scope of the current study and is the subject of planned future work.

In summary, we identified evidence of important performance heterogeneity and bias in two cardiovascular risk scores, CHARGE-AF and the PCE. We observed consistent patterns across three large and contrasting populations totaling millions of individuals, including consistently worse risk discrimination among older individuals and substantial miscalibration at extremes of age. We also observed that use of common score thresholds may lead to notable biases on the basis of sex and race. Our study, which characterizes and quantifies the performance heterogeneity and bias in clinical risk models, is just an initial step toward improved disease assessment. These results can help inform clinicians on when it may be appropriate to use and not use a particular risk model for an individual patient. They can also inform the important next step: the development of risk models that are more robust to differences across clinical settings and patient characteristics, to facilitate more accurate and equitable risk estimation to guide improved clinical decisions. A major challenge, however, may still remain: even if much more robust models are developed, care systems that rely extensively on existing simple models must be convinced that the new models are not only substantially more robust but also easy to use and interpretable.

Data Availability
The institutional review boards of Mass General Brigham (MGB) and IBM approved this study and its methods, including the EHR cohort assembly using the Explorys Dataset, data extraction, and analyses. MGB data contain potentially identifying information and may not be shared publicly. Explorys data can be made available through a commercial license (for details see: https://www.ibm.com/downloads/cas/4P0QB9JN). We are indebted to the UKBB and its participants who provided biological samples and data for this analysis (UKBB Applications #7089 and #50658). All UKBB participants provided written informed consent. The UK Biobank was approved by the UK Biobank Research Ethics Committee (reference# 11/NW/0382).

(a) AF. Zoom-in to better view details for the prevalent stroke and HF subgroups. Note that data for patients 75 or older was not available in the UKBB.
(b) ASCVD (female-White). Zoom-in to better view details for the prevalent HF subgroup.