Investigating Potential Gender-Based Differential Item Functioning for Items in the Kansas City Cardiomyopathy Questionnaire (KCCQ) Physical Limitations Domain

Women with heart failure report worse health-related quality of life, on average, than men. This may result from actual differences in care or from differing interpretations of and responses to survey questions. We investigated potential gender-based differential item functioning on the Kansas City Cardiomyopathy Questionnaire (KCCQ) Physical Limitations domain. Using data from the HF-ACTION trial, a multicenter, randomized controlled trial of exercise training in patients with chronic heart failure with reduced ejection fraction (661 women, 1670 men), we assessed gender-based differential item functioning using a Wald test based on item response theory and ordinal logistic regression. Both methods evaluated how men and women responded to each KCCQ item after adjusting for physical limitation status. No item exhibited statistically significant differential item functioning using the Wald method. Two items exhibited differential item functioning using the ordinal logistic regression method (KCCQ1e: Climbing a flight of stairs without stopping; KCCQ1f: Hurrying or jogging) (P < 0.01), but the magnitude of differential item functioning was negligible. To accurately measure patient-reported outcomes, it is important to evaluate potential biases that may influence the ability to compare patient subgroups. The magnitude of differential item functioning on a 5-item KCCQ Physical Limitations domain was negligible.


Introduction
Patient-reported outcome measures (PROMs) provide valuable information for cardiovascular clinical studies and clinical care because the outcomes assessed by PROMs can drive decision-making at the individual patient and population levels. PROMs provide complementary insight to standard objective clinical events (e.g., hospitalization or death) or clinician-reported ratings such as the New York Heart Association (NYHA) Functional Classification (Heidenreich, 2021).
An important presumption in the use of PROMs is that patients interpret the PROM questions similarly regardless of their race, ethnicity, gender, age, or any other patient characteristics (Weinfurt, 2021). If this were not the case, then varying interpretations of items might mask actual differences in health status between subgroups or suggest that there are differences between groups when there are no true differences. When two groups of patients (e.g., men and women, older and younger patients) interpret and respond to an item differently for reasons other than the outcome of interest, it is known as differential item functioning (DIF). The examination of DIF is an important step in the continued evidence accrual for use of PROMs in clinical studies.
DIF testing is particularly important for PROMs used in clinical studies of heart failure (HF) as the scientific community investigates the underlying biological mechanisms behind differences in HF outcomes by sex (Coles, 2022). Specifically, symptoms and functional limitations associated with HF are different for men and women (Adams et al., 1996; Lee et al., 2015; McSweeney et al., 2003). On average, women with HF report worse health-related quality of life using PROMs compared with men. The reason for this discrepancy in PROM scores is not understood (Dewan et al., 2019; Khariton et al., 2018). Given the variation in reported quality of life experiences between men and women with HF, and the historic low representation of women in HF studies (Heiat et al., 2002), it is possible that the questions and response choices on PROMs related to HF may be interpreted differently by gender.
As of 2016, at least 9 PROMs had been developed to evaluate quality of life in adults diagnosed with heart failure (Kelkar et al., 2016). Since the publication of Kelkar's work, the Patient-Reported Outcomes Measurement Information System® (PROMIS®) Plus-HF profile measure (Ahmad et al., 2019) was developed, adding another PROM option. Of these, only two PROMs have been evaluated for DIF: the PROMIS Plus-HF and the Minnesota Living with Heart Failure Questionnaire (Coles, 2022; Munyombwe et al., 2014; Rector & Cohn, 1992). The Kansas City Cardiomyopathy Questionnaire (KCCQ) (Green et al., 2000) is a PROM that measures quality of life for adults diagnosed with HF. While the KCCQ has been psychometrically evaluated in multiple studies (Eurich et al., 2006; Green et al., 2000; Spertus et al., 2008), it has not yet been evaluated for DIF. The KCCQ Physical Limitations (PL) domain has been used alone or in conjunction with other KCCQ domains as the basis for endpoints in numerous clinical trials (Armstrong et al., 2020; ClinicalTrials.gov, 2020, 2021a). The PL domain, as well as other KCCQ scores, has been qualified by the United States Food and Drug Administration (FDA) for evaluating effectiveness of medical devices in individuals with symptomatic HF (United States (U.S.) Food and Drug Administration (FDA)). In this study, we evaluated gender-based DIF specifically on the KCCQ PL domain. We obtained expert input from clinicians to refine specific item-level DIF hypotheses. We then conducted a quantitative DIF analysis to determine whether there was quantitative evidence to support potential interpretation differences on the KCCQ PL domain by gender.

DIF Hypothesis Refinement
We hypothesized that women and men interpret one or more KCCQ PL items differently. To provide direction for DIF testing and to explore the clinical context surrounding each KCCQ item, we further refined our KCCQ item-level hypotheses with insight from cardiology clinicians (Reeve et al., 2007). Methods, procedures, and a description of clinicians that participated in hypothesis refinement were published previously (Coles et al., 2023). Figure 1 presents an overview of the item-level gender-based DIF hypotheses for the three KCCQ items for which at least two clinicians shared gender-based DIF hypotheses on the PL domain.

Fig. 1
A priori-defined DIF hypotheses for 3 of 6 KCCQ PL items where at least 2 clinicians agreed on a hypothesis. The remaining 3 KCCQ PL items were not hypothesized to exhibit DIF by 2 or more clinicians. DIF, differential item functioning; HF, heart failure; KCCQ, Kansas City Cardiomyopathy Questionnaire; PL, Physical Limitations

Data
Data are from the Heart Failure: A Controlled Trial Investigating Outcomes of Exercise Training (HF-ACTION) trial, which is a National Institutes of Health (NIH)-funded, multicenter (82 sites in the United States, Canada, and France), randomized controlled trial. HF-ACTION was designed to test the long-term safety and efficacy of aerobic exercise training (n = 1159) versus usual care (n = 1172) in medically stable outpatients with chronic HF with reduced ejection fraction (Flynn et al., 2009; O'Connor et al., 2009; Piña et al., 2009). Our analyses were conducted on baseline data, including all 2331 participants with left ventricular dysfunction and NYHA classes II to IV HF. While HF-ACTION labeled individuals by "sex," the classification was not based on chromosomal identification and, therefore, is more accurately described as labeled by gender as a proxy for sex.
Institutional Review Board (IRB) approval for this secondary data analysis was granted by the local IRB in December 2018, and an exemption from FDA requirements was granted under 18159-144033783.

KCCQ Physical Limitations Subscale
The KCCQ comprises 23 questions that assess symptoms, physical functioning, social functioning, and quality of life. The PL domain includes 6 items addressing limitations due to HF, with 2 items in each of 3 levels of exertional requirement (low [<4 METS], moderate [5-7 METS], and high [>7 METS]). These 6 items include: (1) dressing yourself; (2) showering/bathing; (3) walking 1 block on level ground; (4) doing yard work, housework or carrying groceries; (5) climbing a flight of stairs; and (6) hurrying or jogging. All PL items include 6 response choices ranging from 1 = "Extremely limited" to 5 = "Not at all limited", and 6 = "Limited for other reasons or did not do the activity." Responses of "6" were recoded as missing in order to optimize the disease-specificity of the scale to HF-imposed limitations. The PL domain scores range from 0 to 100, with higher scores reflecting fewer physical limitations.
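As an illustration of the recoding and rescaling described above, the following minimal sketch computes a 0-100 PL score from raw item responses. The linear rescaling of the mean item response and the minimum of three usable items are assumptions made for this sketch, not rules stated in this paper, and the function name is hypothetical.

```python
from statistics import mean


def score_pl_domain(responses, min_answered=3):
    """Return a 0-100 PL score, or None if too few items are usable.

    responses: item codes 1-6, with None for a skipped item.
    """
    # Code 6 ("Limited for other reasons or did not do the activity")
    # is treated as missing, as described in the text.
    usable = [r for r in responses if r is not None and 1 <= r <= 5]
    # Assumed rule for this sketch: require a minimum number of usable items.
    if len(usable) < min_answered:
        return None
    # Assumed linear rescaling: a mean response of 1 maps to 0, and 5 maps to 100.
    return 100.0 * (mean(usable) - 1) / 4


# Three usable responses of 3; the 6s and the skipped item are ignored.
print(score_pl_domain([3, 6, 3, None, 3, 6]))  # 50.0
```

Under these assumptions, a respondent who answers "Not at all limited" to every item scores 100, and a respondent with fewer than three usable items receives no score.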

Analysis Methods
Descriptive statistics were calculated for the KCCQ PL domain overall and by gender. Figure 2 illustrates steps that were used to evaluate DIF quantitatively. In the following sections, we elaborate on each step.

Fig. 2
Steps for evaluating DIF for the KCCQ PL domain. Steps that were used to evaluate DIF quantitatively. DIF, differential item functioning; KCCQ, Kansas City Cardiomyopathy Questionnaire; PL, Physical Limitations

Steps 1 and 2: Evaluate Model Assumptions and Description of Methods for Adjustment
To confirm dimensionality of the PL domain (i.e., the 6 physical limitation items are reflective indicators of the concept of physical limitations, the latent variable), we conducted confirmatory factor analysis (CFA) fitting a one-factor model to the response data.
We conducted full-information CFA on items in the PL domain using IRTPRO™ (Vector Psychometric Group, 2022). Model fit was evaluated using the root mean square error of approximation (RMSEA < 0.06) (Cai et al., 2006), with lower numbers indicating better model fit. The constellation of model fit indices, as well as factor loadings and item response theory (IRT) parameters, were considered when evaluating model fit and unidimensionality.
Local independence (Toland, 2014) was examined using the standardized local dependence X² indices (Chen & Thissen, 1997). Pairs of items whose relationships exceed those predicted by the model (X² > 10) could be indicative of possible unmodeled dimensions. For a pair of items exhibiting local dependence, the relationship of the item to the domain score was evaluated via IRT parameters (i.e., the discrimination parameters), and the item with the strongest relationship to the domain was retained for DIF analysis.
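The decision rule described above can be sketched in a few lines. This sketch is illustrative only: the function names are hypothetical, and every numeric value in the example is a made-up placeholder rather than a study result. It assumes the standardized X² indices and discrimination parameters have already been exported from the IRT software.

```python
def flag_local_dependence(ld_x2, threshold=10.0):
    """Return the item pairs whose standardized LD X2 exceeds the threshold."""
    return sorted(pair for pair, x2 in ld_x2.items() if x2 > threshold)


def item_to_retain(pair, discrimination):
    """Of a locally dependent pair, keep the item with the larger discrimination."""
    return max(pair, key=lambda item: discrimination[item])


# Placeholder values for illustration only (not results from this study).
ld_x2 = {("item_A", "item_B"): 14.2, ("item_A", "item_C"): 3.1}
discrimination = {"item_A": 2.1, "item_B": 2.6, "item_C": 1.8}

for pair in flag_local_dependence(ld_x2):
    print(pair, "-> retain", item_to_retain(pair, discrimination))
```

In practice the choice of which item to set aside can also draw on substantive considerations, so a rule of this kind is a starting point rather than an automatic decision.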

Steps 3 and 4: Methods for DIF Analysis and Contextualizing Results
Two approaches were used to evaluate DIF: an IRT-based approach and an ordinal logistic regression (OLR)-based approach (described below). The null hypothesis is that adults diagnosed with heart failure interpret the KCCQ PL domain items the same way regardless of gender.

Primary DIF Analysis: Item Response Theory (IRT) Approach
To evaluate DIF, the Wald chi-square test was implemented within an IRT framework for unidimensional scales (Langer, 2008). If statistically significant DIF was found, then the magnitude of DIF was evaluated in the IRT framework using the weighted area between the expected score curves (wABC) procedure, which calculates the area between the item expected score curves by gender (Edelen et al., 2015). wABC values of at least 0.30 indicate that an item may exhibit clinically significant DIF and would benefit from being carefully considered before inclusion in a patient reported outcome measure (Edelen et al., 2015).
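To make the wABC idea concrete, the sketch below computes a weighted area between two expected score curves. It rests on several assumptions that are not taken from this paper: items follow a graded response model with categories scored 0 to K-1, the absolute difference between curves is weighted by a normal density standing in for the focal group's latent distribution, and the integral is approximated with the trapezoid rule. All parameter values are hypothetical.

```python
import math


def expected_score(theta, a, bs):
    """Expected item score under a graded response model (categories 0..K-1)."""
    # The expected score equals the sum of the cumulative probabilities P(X >= k).
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs)


def normal_pdf(x, mu=0.0, sd=1.0):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def wabc(ref, foc, mu=0.0, sd=1.0, lo=-6.0, hi=6.0, n=1201):
    """Weighted area between expected score curves (trapezoid rule).

    ref, foc: (discrimination, thresholds) for the two groups.
    mu, sd: assumed mean and SD of the weighting (focal group) distribution.
    """
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        gap = abs(expected_score(theta, *ref) - expected_score(theta, *foc))
        end_weight = 0.5 if i in (0, n - 1) else 1.0
        total += end_weight * gap * normal_pdf(theta, mu, sd) * step
    return total


# Hypothetical parameters: same slope, thresholds shifted by 0.2 in one group.
reference = (1.8, [-1.5, -0.5, 0.5, 1.5])
focal = (1.8, [-1.3, -0.3, 0.7, 1.7])
print(round(wabc(reference, focal), 3))
```

A value computed this way would then be compared with the 0.30 guideline cited above; identical item parameters in both groups give a wABC of exactly zero.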

Secondary DIF Analysis: Ordinal Logistic Regression Approach
Secondary DIF analyses were conducted using an OLR approach for unidimensional scales (Zumbo, 1999). Three hierarchically nested models were evaluated for each KCCQ item, with and without interactions for scores by gender (see Online Resource 1 for details on OLR methods). Briefly, the difference in -2 log likelihood between nested models was compared to a χ² distribution with 2 degrees of freedom; a statistically significant difference in fit indicated statistically significant DIF. To contextualize the results, the magnitude of DIF was evaluated by comparing pseudo-R² between nested models. We applied Zumbo et al.'s guideline: a pseudo-R² difference below 0.035 indicates negligible DIF, between 0.035 and 0.070 moderate DIF, and above 0.070 large DIF.
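The two decision steps described above (the 2-degree-of-freedom likelihood ratio test and the pseudo-R² classification) can be sketched as follows. The sketch assumes the -2 log likelihood and pseudo-R² values have already been obtained from fitted OLR models. The function names are hypothetical, and the treatment of values falling exactly on a threshold is an assumption.

```python
import math


def lr_test_2df(neg2ll_reduced, neg2ll_full):
    """Likelihood ratio statistic and p-value on 2 degrees of freedom."""
    stat = neg2ll_reduced - neg2ll_full
    # For 2 degrees of freedom the chi-square upper tail is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2.0) if stat > 0 else 1.0
    return stat, p_value


def classify_dif(delta_pseudo_r2):
    """Label DIF magnitude using the thresholds quoted in the text."""
    if delta_pseudo_r2 < 0.035:
        return "negligible"
    if delta_pseudo_r2 <= 0.070:  # boundary handling is an assumption
        return "moderate"
    return "large"


# Hypothetical fit statistics for one item (not results from this study).
stat, p = lr_test_2df(neg2ll_reduced=5230.4, neg2ll_full=5219.9)
print(round(stat, 1), round(p, 4), classify_dif(0.002))
```

Keeping the significance test and the magnitude classification as separate steps mirrors the logic of the analysis: an item can show statistically significant DIF and still be classified as negligible in magnitude.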

Description of the Sample
Detailed descriptions of the HF-ACTION sample overall and by gender have been previously published. In brief, the HF-ACTION analysis sample had a majority of men (72%; n = 1670), and most participants were classified as NYHA Class II (62%) with an average ejection fraction of 24.7%. Women were younger than men (median: 57.4 years vs. 60.2 years), and a larger proportion of women enrolled in the study were Black (45.8%) compared with men (26.7%). Women were more likely to have non-ischemic HF etiology (68.4% vs. 40.8%).
The average KCCQ PL domain score for the HF-ACTION sample was 69.4 (median = 70.8; standard deviation = 21.9; n = 2331). KCCQ PL domain scores were similar for men (mean = 69.9; standard deviation = 21.8) and women (mean = 68.1; standard deviation = 22.0). Online Resource 2 shows item-level response frequencies for the KCCQ PL domain overall and by gender.

Results for Steps 1 and 2: Evaluation of IRT Model Assumptions and Description of Adjustments
Inter-item polychoric correlations among PL items ranged from 0.46 to 0.87, indicating moderate to strong relationships among items (Table 1) (Cohen, 1992). Three pairs of KCCQ PL items exhibited local dependence, indicating that the assumption of local independence could not be met for the PL domain. These pairs are noted in Table 1: KCCQ1a (Dress yourself) and KCCQ1b (Showering/bathing); KCCQ1d (Yard work) and KCCQ1f (Hurrying or jogging); and KCCQ1e (Climbing flight of stairs) and KCCQ1f (Hurrying or jogging). The unidimensionality of the KCCQ PL domain was not supported by the full-information CFA (M₂ RMSEA = 0.10). Obtaining a unidimensional scale when there are items with local dependence is challenging, so we considered whether any of the items in these pairs could be set aside to improve fit. Among the item pairs exhibiting local dependence, KCCQ items 1a (Dress yourself) and 1b (Showering/bathing) yielded the highest X². Examination of the IRT parameters for both items suggested that 1b was slightly more related to the PL domain. Also, there was agreement from clinicians on DIF hypotheses for item 1a. Therefore, 1a was removed and the remaining 5 PL items were evaluated for the unidimensionality assumption. For the 5-item KCCQ PL domain (including items 1b, 1c, 1d, 1e, and 1f), local dependence was present for just one pair of items: 1e (Climbing flight of stairs) and 1f (Hurrying or jogging) (X² = 13.0). The new model did not meet standard cut-offs supporting unidimensionality of the 5-item PL domain (items 1b-1f) (M₂ RMSEA = 0.10); however, there was an improvement in local dependence issues. Given that the DIF analysis requires multiple anchors and there are very few items in the KCCQ PL domain, we decided to proceed with evaluating DIF using this set of 5 items (we discuss the limitation of modeling short measures such as the KCCQ in the Discussion).

Results for Steps 3 and 4: DIF Analyses, Comparison of Model Results, and Magnitude of DIF
No IRT-based Wald DIF tests were statistically significant for the 5-item model (Online Resource 3). wABC was not evaluated because no statistically significant DIF was identified using the IRT-based Wald DIF test. Using the OLR method, two items yielded statistically significant uniform DIF: 1e (Climbing flight of stairs) and 1f (Hurrying or jogging). However, the pseudo-R² difference was very small for both items (<0.002); therefore, the magnitude of DIF was determined to be negligible. An overview of the primary, secondary, and magnitude DIF evaluations is presented in Table 2.

Discussion
DIF evaluations provide important insights into the validity of subgroup comparisons and/or combination of data. DIF testing also contributes to evidence for the validity of a PROM score's interpretation when there is no DIF in the scale. We hypothesized that DIF may be present on KCCQ PL items for a few reasons: (1) the underlying mechanism causing women with HF to report worse quality of life than men with HF is not well understood (Dewan et al., 2019; Khariton et al., 2018); (2) the varying real-world experiences of women with HF could cause differential interpretation of questions about HF (Adams et al., 1996; Lee et al., 2015; McSweeney et al., 2003; Pope et al., 2000); and (3) the underrepresentation of women in HF studies (Heiat et al., 2002) may have excluded some women's perspectives in the development of HF measures. Clinicians independently described and agreed on DIF hypotheses for three items: KCCQ1a (Dress yourself), KCCQ1b (Showering/bathing), and KCCQ1d (Doing yardwork, housework or carrying groceries). Yet when we evaluated DIF quantitatively using two DIF testing methods, only two items (KCCQ1e [Climbing a flight of stairs without stopping] and KCCQ1f [Hurrying or jogging]) showed statistically significant DIF (through OLR only), and these were not items clinicians had hypothesized would be affected by DIF. Statistically significant DIF alone is not an indicator of a consequential measurement issue. The IRT and OLR methods used to evaluate DIF are powerful enough to detect very small differences that may not be clinically meaningful. Therefore, we evaluated the magnitude and impact of item-level DIF on scores. We found the magnitude of DIF to be negligible, and the impact on the 5-item KCCQ PL domain scores was not significant. Regarding cause of DIF, clinicians did not have strong hypotheses about the two items exhibiting statistically significant gender-based DIF using OLR. Therefore, these items most likely do not represent a validity threat to the KCCQ PL domain because the magnitude of DIF was negligible and clinicians did not have consistent DIF hypotheses for these items.
This study highlights the challenge in evaluating DIF for short measures with few anchors. Short measures minimize the amount of time needed to complete questionnaires, thereby reducing patient burden. Despite this advantage, short PROMs may not be good candidates for existing DIF methods that rely on enough items to serve as anchors. Though the KCCQ contains a total of 23 items, these are spread across multiple domains (symptoms, functions, impacts, quality of life), each of which needs to be evaluated individually to meet the unidimensional assumption for DIF methods. In our case, only the 6-item Physical Limitations domain could be included in a DIF evaluation. While the PL was one of the longest domains on the KCCQ, we still encountered challenges with an insufficient number of anchors to evaluate DIF. The KCCQ is not alone among HF measures with few items per domain (Kelkar et al., 2016). We hope evidence presented in this paper prompts DIF methodologists and other researchers to look more closely at this issue. Balance between the need for sufficient anchors and PROM length is essential to address potential DIF associated with a measure. Until this balance is achieved, many PROMs will not be able to be evaluated for DIF, leading to a potential for drawing biased conclusions from group-based scores.
Alternatively, researchers may consider qualitative approaches to complement quantitative studies. Differing interpretations of the same items could be described using qualitative methods, which do not require a pool of anchors as previously discussed. Concept elicitation studies could thematically compare how PROM concepts are described by patients of different groups. Cognitive interview studies could compare patient interpretation of PROM items and response choices by level of health status and across groups (e.g., gender).
The conclusions from this study are limited by the types of patients in the HF-ACTION trial, who were relatively healthy individuals diagnosed with HF participating in an exercise intervention. We are curious about consistency of results for patients with HF who were not included in HF-ACTION, specifically patients who are not healthy enough to exercise or those whose HF was not severe enough to be eligible for HF-ACTION (e.g., NYHA Class I). The results may vary if DIF were evaluated in individuals diagnosed with preserved ejection fraction, or those with left ventricular assist devices, especially given the differences in noncardiac comorbid conditions in HF with preserved ejection fraction versus HF with reduced ejection fraction groups (Mentz et al., 2014). The generalizability of results is an important consideration for DIF evaluations and highlights the need to replicate findings in similar populations, as well as to evaluate DIF in independent samples with more heterogeneous populations.

Another important consideration for contextualizing results is potential confounders. Statistically significant DIF between subgroups could indicate confounders that are strongly related to subgroup membership. For example, because women tend to have HF diagnosed at an older age, age rather than gender may drive DIF in this population (Dewan et al., 2019). In addition to gender-based DIF, researchers may consider other variables such as age, education, race/ethnicity, and language translations. Multiple DIF evaluations may be needed to untangle possible confounders. Obtaining insight on potential DIF confounders directly from clinicians and patients allows researchers to further refine DIF hypotheses, provide context to the study results, and drive DIF analyses on multiple variables.

Conclusion
DIF can affect study results by biasing the comparison or combination of data from subgroups and, importantly, may hide real inequities or show false inequities in patient outcomes. When subgroups of the population, such as those defined by gender or race, have better or worse patient-reported outcomes after DIF has been ruled out, the clinical community is prompted to investigate the cause of these differing scores. As clinical research continues to investigate and address equity in patient outcomes for HF, DIF methods are of critical importance to rule out validity issues or uncover inequities among population subgroups. DIF is relevant not only for patients with HF but also for any patient population in which subgroups are hypothesized to interpret PROM questions differently.