Comparing EQ-5D-3L and EQ-5D-5L in measuring the HRQoL burden of 4 health conditions in China

EQ-5D-3L has been used in the National Health Services Survey of China since 2008 to monitor population health. A five-level version of EQ-5D has since been developed, but evidence supporting its use in China is lacking. This study compared the measurement properties of the EQ-5D-3L and EQ-5D-5L in quantifying the health-related quality of life (HRQoL) burden of 4 different health conditions in China. Participants from China were recruited to complete the 3L and 5L questionnaires via the Internet. Quotas were set to recruit five groups of individuals: one group without any health condition and one group each with generalized anxiety disorder (GAD), HIV/AIDS, chronic hepatitis B (CHB), or depression. The 3L and 5L were compared in terms of response distributions, percentages of reporting ‘no problems’, index value distributions, known-group validity and relative efficiency. In total, 500 individuals completed the online survey, including 140 healthy individuals, 122 individuals with hepatitis B, 107 with depression, 90 with GAD and 101 with HIV/AIDS. The healthy group showed different response distributions from the four condition groups. The percentage of reporting ‘no problems’ decreased significantly in the 5L in all dimensions (P < 0.01), especially in the pain/discomfort dimension (relative difference: 43.10%). The 5L also had smoother and less clustered index value distributions. Relative efficiency suggested that the 5L had higher absolute discriminatory power than the 3L between healthy participants and the 4 condition groups, especially for the HIV/AIDS group, for which the 3L result was not significant. The 5L may be preferable to the 3L, as it showed higher sensitivity to mild health problems, better relative efficiency, and better response and index value distributions.


Introduction
EQ-5D has been used to measure health-related quality of life (HRQoL) across the globe [1]. The EQ-5D questionnaire consists of two essential parts: a multidimensional health descriptive system and the EQ visual analog scale (EQ-VAS). The EQ-5D descriptive system comprises five dimensions: mobility, self-care, usual activities, pain/discomfort and anxiety/depression [2]. It has two versions, a three-level version (EQ-5D-3L) and a five-level version (EQ-5D-5L). EQ-5D-3L (hereinafter 3L) was developed in 1987 and has been the most popular preference-based instrument. However, the 3L has been reported to have suboptimal sensitivity and to suffer from ceiling effects [3]. Therefore, EQ-5D-5L (hereinafter 5L) was introduced in 2009 [4]. The 3L defines a total of 243 unique health states, while the 5L defines 3,125. The higher number of health states described by the 5L is intended to improve sensitivity to small differences or changes in HRQoL [5]. Table 1 summarizes the findings of previous validation and 3L/5L comparison studies in China. Overall, both EQ-5D versions have been validated in different disease groups and in the general population in China, and most psychometric properties were either good or satisfactory, except for the ceiling effects [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. The detailed results of these studies can be found in Table 5 in the Appendix. Ten studies have compared the performance of the two versions of EQ-5D in terms of face validity [14], acceptability [14], ceiling effects [13][14][15][16][17][18][19][20][21], responsiveness [14], informativity [13-16, 18, 20, 21, 25], test-retest reliability [14,15,17], known-group validity [14,15,17,18] and convergent validity [13][14][15][17]. The findings of these studies supported the use of the 5L.
Despite this evidence from different health condition groups, little is known about how the two versions perform in mental health conditions, and most published studies neither reported the distribution of responses nor compared the relative efficiency of the two versions. To date, only one study found that the 3L had higher relative efficiency in individuals with hypertension [13]. Given these limitations, we designed this study to further compare the measurement properties of the two EQ-5D versions in China.
In this study, we aimed to compare the measurement properties of the two versions of EQ-5D in quantifying the HRQoL burden associated with 4 chronic conditions in China: chronic hepatitis B (CHB), depression, generalized anxiety disorder (GAD), and HIV/AIDS. The selected conditions covered two physical conditions and two mental conditions, with a healthy group as the reference group. We hypothesize that the HRQoL burden of these conditions is mainly in the mental or psychological domain, which may be difficult for the 3L to detect and necessitates a more sensitive measure like the 5L.

Table 1 Measurement properties of EQ-5D-3L and EQ-5D-5L from published studies
a ○: The measurement properties of EQ-5D-3L or EQ-5D-5L were acceptable in Chinese population in some studies
b √: The measurement properties of EQ-5D-3L or EQ-5D-5L had good to excellent performance in Chinese population in some studies
c ×: The measurement properties of EQ-5D-3L or EQ-5D-5L were not satisfactory in Chinese population in some studies
d ×/○: Some studies found serious ceiling effect of EQ-5D-5L, while other studies proved that it was acceptable
e 5L: 5L was better than 3L
f 3L & 5L: Both 3L and 5L were satisfactory and no significant difference between 3L and 5L was observed
g 5L / 3L: 5L performed better than 3L in some studies while 3L was better in other studies

Participants
This study used data collected in a cross-sectional online survey in China. The survey was part of an international study called 'extending the QALY (E-QALY) project' [35], which aims to develop a new quality of life measure. As for the sample size, Yfantopoulos et al. [36] used a sample of 396 to study the psychometric properties of the EQ-5D-3L and EQ-5D-5L instruments in psoriasis, and Bhadhuri et al. [37] included 224 patients in their psychometric analyses. Considering that many studies comparing the EQ-5D-3L and EQ-5D-5L used a sample size of 500 or less [15, 36-39], we used a sample size of 500, a number that allows for robust analysis within the groups of interest. Therefore, in the online survey in China, 500 respondents with and without a selected health condition were recruited to complete the E-QALY items, EQ-5D-3L, EQ-5D-5L and the Short Warwick-Edinburgh Mental Wellbeing Scale (SWEMWS). The data were collected online between April and July 2019 by Accent, a UK online survey company. Quotas and inclusion criteria were applied to recruit a sample of 500 individuals with similar numbers of individuals with GAD, HIV/AIDS, CHB, or depression, or without any of those 4 chronic conditions. The sample was broadly representative of the country in terms of geography, ethnicity and gender. The study was approved by the Ethics Committee of the University of Sheffield, United Kingdom (Approval letter number 025524) and the IRB of Jinan University, China (Approval letter number JNUKY-2020-001), and all methods were performed in accordance with the relevant guidelines and regulations. Electronic informed consent was obtained from all participants through the panel prior to the online survey. The online survey began by giving an outline of the research purpose. Participants were then asked to report their disease history.
Eligible respondents reported their background information, including education level, gender and age. Next, respondents completed a battery of questionnaires in the following order: a subset of E-QALY items, the 3L or 5L (half of the sample responded to the 5L first and the other half to the 3L first), some more E-QALY items, the SWEMWS, the other EQ-5D version (5L or 3L), and the EQ-VAS.

Instruments
The EQ-5D-3L and EQ-5D-5L are both preference-based HRQoL instruments developed by the EuroQol Group. Both instruments have the same five health dimensions, i.e., mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. The difference is that the 3L has three response levels (no problems, some problems, extreme problems), while the 5L has five response levels (no problems, slight problems, moderate problems, severe problems, and unable/extreme problems) for each dimension. To calculate the utility, we used the value set developed by Liu

Statistical analyses
We first described the characteristics of our sample and then reported, by condition group, (1) the response distributions and median responses; (2) sensitivity to mild health problems, as measured by the percentages of reporting 'no problems'; (3) the distributions of the 3L and 5L utility scores; and (4) the known-group validity when compared with the healthy group and the relative efficiency of the 5L versus the 3L. Data were analyzed using Stata for Windows, Version 14.0 MP, and IBM SPSS Statistics for Windows, Version 22.0 (IBM Corp., Armonk, NY, 2013).
Given the skewed distribution of EQ-5D responses [42], median responses were reported to describe the overall health state of each group. The percentage of respondents reporting 'no problems' on each dimension is often referred to as a 'ceiling effect' in published studies. This is not accurate, because the reasons for a large proportion reporting 'no problems' are unknown. We therefore refer to this phenomenon as sensitivity to mild health problems, defined as the proportion of respondents indicating 'no problems' in each dimension and in all five dimensions taken together [43]. Previous studies have shown that the 5L reduces the share of 'no problems' responses and is considered a more sensitive measure [16,18]. We therefore hypothesized that the 5L has better sensitivity and calculated the reduction in reporting 'no problems' from the 3L to the 5L, separately for each dimension and for all five dimensions taken together. Next, the index values of each group were calculated using the 3L and 5L value sets, respectively, and the distributions were plotted.
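The reduction in reporting 'no problems' described above can be illustrated with a minimal Python sketch. It assumes that the relative reduction is the absolute drop divided by the 3L share; the response data and the function names are hypothetical and are not taken from the study.

```python
def no_problem_share(levels):
    """Share of respondents answering level 1 ('no problems') on one dimension."""
    return sum(1 for level in levels if level == 1) / len(levels)


def reduction(share_3l, share_5l):
    """Absolute and relative reduction in 'no problems' reporting from 3L to 5L.

    Assumption: relative reduction = (3L share - 5L share) / 3L share.
    """
    absolute = share_3l - share_5l
    relative = absolute / share_3l if share_3l else float("nan")
    return absolute, relative


# Hypothetical responses of the same ten people on one dimension.
pain_3l = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3]
pain_5l = [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]

share_3l = no_problem_share(pain_3l)
share_5l = no_problem_share(pain_5l)
abs_red, rel_red = reduction(share_3l, share_5l)
print(f"3L: {share_3l:.0%}, 5L: {share_5l:.0%}, relative reduction: {rel_red:.0%}")
```

The same two functions can be applied to full profiles by treating '11111' as the 'no problems' response for all five dimensions taken together.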
Known-group validity of the two EQ-5D index scores was evaluated using analysis of variance (ANOVA). Relative efficiency (RE) was calculated as the ratio of the F statistics derived from the ANOVA. The F statistic has been widely used to assess the RE of measurement scales [19,44,45]. The index score with the higher F statistic is deemed more efficient than its comparator, since a higher F statistic is more likely to result in statistical significance. To understand the RE of the index scores, we compared the distributions of the responses to the EQ-5D dimensions between the healthy group and each condition group using the Mann-Whitney test. For reference, we listed the median values of each dimension reported by the healthy group and the 4 condition groups.
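As an illustration of the relative efficiency calculation, the sketch below computes a one-way ANOVA F statistic for a healthy group versus a condition group under each instrument and takes the ratio, assuming the 5L F statistic is the numerator. The index values are hypothetical, and the pure-Python F computation stands in for the Stata/SPSS routines used in the study.

```python
def one_way_f(*groups):
    """F statistic of a one-way ANOVA across the given groups of scores."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def relative_efficiency(f_5l, f_3l):
    """Ratio of F statistics; values above 1 favor the 5L (assumed orientation)."""
    return f_5l / f_3l


# Hypothetical index values, not data from the study.
healthy_3l = [1.0, 1.0, 1.0, 0.9, 1.0]
condition_3l = [1.0, 0.9, 1.0, 0.8, 0.9]
healthy_5l = [1.0, 1.0, 0.95, 0.95, 1.0]
condition_5l = [0.9, 0.85, 0.9, 0.8, 0.85]

f_3l = one_way_f(healthy_3l, condition_3l)
f_5l = one_way_f(healthy_5l, condition_5l)
re_5l_vs_3l = relative_efficiency(f_5l, f_3l)
print(f"F(3L) = {f_3l:.2f}, F(5L) = {f_5l:.2f}, RE = {re_5l_vs_3l:.2f}")
```

With two groups the F statistic equals the square of the pooled-variance t statistic, so the ratio reflects how much more clearly one index separates the same two groups.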

Results
In total, 500 individuals completed the online survey, including 140 healthy individuals, 122 individuals with CHB, 107 with depression, 90 with GAD and 101 with HIV/AIDS. Some respondents reported multiple conditions, e.g., 68 individuals reported both depression and GAD. In general, the study sample was young (mean age: 35.8, SD: 8.64) and well educated. The gender proportions of the five groups were generally balanced except for the HIV/AIDS group, in which about 87.1% of individuals were female. In terms of age distribution, the healthy group was mostly young; the CHB group had more participants aged between 40 and 49; the depression and GAD groups had individuals from all four age groups; and the HIV/AIDS group was mainly aged 30 to 49. Individuals with tertiary education accounted for over 80% of all four disease groups, and the healthy group had more individuals with secondary education. Table 2 shows the demographic information by condition.
There were 13, 34, 28, 26 and 26 unique health states reported by the healthy, CHB, depression, GAD and HIV/AIDS groups, respectively, for the 3L. The corresponding numbers for the 5L were 18, 43, 46, 42 and 35. When measured by the 3L, the median responses of the healthy and HIV/AIDS groups were 'no problems' across the five dimensions. In comparison, when measured by the 5L, the median responses of the healthy group remained 'no problems', while those of the HIV/AIDS group were all 'slight problems'. Similarly, the median responses of the CHB group were 'slight problems' for the last three dimensions when measured by the 5L, whereas only the pain/discomfort dimension had a median of 'moderate problems' when measured by the 3L. The median responses on the last two dimensions for the GAD and depression groups were both on the second level for the 3L (moderate problems) and the 5L (slight problems). The Mann-Whitney results were all significant at the 0.01 level, suggesting that all 4 condition groups had a different distribution of responses from the healthy group for both the 3L and the 5L.
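Counts of unique health states and per-dimension median responses of this kind can be derived from the five-digit EQ-5D profiles, as in the minimal Python sketch below. The profiles are hypothetical, and the use of `statistics.median_low` (which always returns an observed level for ordinal data) is an assumption, since the text does not say how ties in even-sized groups were handled.

```python
from statistics import median_low


def unique_states(profiles):
    """Number of distinct five-digit health profiles reported."""
    return len(set(profiles))


def median_profile(profiles):
    """Median level per dimension, taken column by column across profiles."""
    columns = zip(*(map(int, profile) for profile in profiles))
    return "".join(str(median_low(column)) for column in columns)


# Hypothetical profiles; digits are the levels for mobility, self-care,
# usual activities, pain/discomfort and anxiety/depression.
group_profiles = ["11111", "11122", "11122", "21222", "11121"]

n_states = unique_states(group_profiles)
medians = median_profile(group_profiles)
print(n_states, medians)
```

Note that the median profile is a summary built dimension by dimension; it need not coincide with any profile that a respondent actually reported.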
It is evident from Table 3 that the percentage of reporting 'no problems' was smaller for the 5L than for the 3L in all dimensions and all condition groups. When all dimensions are considered together, the number of 11111 health profiles decreased by almost 40% with the 5L. The reduction in reporting 'no problems' was most salient in the pain/discomfort and anxiety/depression dimensions. The most prominent difference was observed in the HIV/AIDS group, in which all dimensions had a relative reduction of over 30%.
For the 3L, the healthy group had a mean utility of 0.948 (SD: 0.

Table 3 Percentages of reporting 'no problems' between 3L and 5L across 5 groups

1). An exception is the 5L utility score clustered at 0.734 for the HIV/AIDS group. This is the utility score of health state 22222, which was reported most frequently in the HIV/AIDS group. Table 4 shows the mean index values by group and the ANOVA tests between the healthy group and the four condition groups. The last column shows the relative efficiency of the 5L versus the 3L in these four comparisons. The index values of the 5L ranged from 0.711 in the GAD group to 0.948 in the healthy group. In comparison, the index values of the 3L ranged from 0.718 in the GAD group to 0.947 in the healthy group. The two versions of EQ-5D produced comparable index values for each sub-group, and both versions demonstrated good known-group validity, except that the 3L did not show a statistically significant result in the comparison of the healthy and HIV/AIDS groups. The relative efficiency of the 5L index was higher in all four comparisons.

Discussion
Our study used both the 3L and 5L versions of EQ-5D to measure the HRQoL burden of 4 chronic conditions in China and focused on comparing the measurement sensitivity of the two versions. In general, we found both versions of EQ-5D to be sensitive tools for quantifying the HRQoL loss caused by the 4 chronic conditions, but the 5L showed improved sensitivity in picking up mild health problems. When combined with life expectancy data, quality-adjusted life years (QALYs) can be calculated and used as a standard measure of how a condition affects the length and quality of an individual's life, providing a single metric of disease burden. Disease burden is typically measured using disability-adjusted life years (DALYs), but QALYs can also be used and may be a better measure, as they provide individual-level HRQoL-based data and are a recommended measure for economic evaluations [46].
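As a simple illustration of how index values combine with time to give QALYs, the sketch below sums utility multiplied by duration. It is a simplified, undiscounted calculation with hypothetical utilities and durations, not figures or methods from this study.

```python
def qalys(periods):
    """Undiscounted QALYs: sum of utility x duration (years) over periods."""
    return sum(utility * years for utility, years in periods)


# Hypothetical values: ten years lived at a condition-specific utility
# versus ten years lived at a reference (healthy) utility.
with_condition = [(0.75, 10.0)]
reference = [(0.95, 10.0)]

qaly_loss = qalys(reference) - qalys(with_condition)
print(f"QALY loss over the period: {qaly_loss:.1f}")
```

Because the utility enters the sum directly, any difference between the 3L and 5L index values for the same group carries through proportionally to the estimated QALY loss.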
For the measurement properties, our results generally agree with previously reported findings: although the two versions produced highly concordant responses and index values and both had good known-group validity, the 5L performed better in terms of response distribution, sensitivity to mild health problems and index value distribution, and had higher relative efficiency [16,47,48]. The lower clustering at 'no problems' for the 5L is most evident in the pain/discomfort and anxiety/depression dimensions. Previous studies referred to this clustering at 'no problems' as a ceiling effect, but it should be noted that two kinds of responses are in play: first, respondents who do not have any problems and, second, respondents who do have problems but report 'no problems' anyway. Theoretically, ceiling effects   Our results showed that the relative reduction is smaller for the healthy group, which suggests that the large proportion reporting 'no problems' is not a ceiling effect but a genuine reflection of the health state of the healthy group. Overall, a relative reduction of around 40% in reporting 'no problems' was observed when the five dimensions were taken together. This was larger than the figures reported in other studies, which ranged between 6.9 and 33.7% [16,47]. This shows the limitation of the fewer cut-off points provided by the 3L descriptive system, which was first reported by Mathieu F et al. [48]. Hence, the 5L is more sensitive in measuring the HRQoL of individuals with mild health conditions.
For the index value distributions, the 3L had more gaps and clusters than the 5L. Two major factors determine the distribution of index values: the health profiles reported and the characteristics of the value set [49]. Notably, in the 3L there was a large gap between the index value of 1 (profile 11111) and the second highest index value of 0.887 (profile 11211). In the 5L, there are 5 different profiles worse than 11111 with an index value higher than 0.9. The 3L resulted in more clustering than the 5L, and the clustering of the 3L index values is due to the clustering of profiles [49]. There was still some clustering in the 5L distributions, but it was not due to the clustering of profiles; rather, more profiles were reported and some profiles have similar index values. Overall, the additional levels of the 5L defined more health states and provided more finely graded index values.
The relative efficiency results favored the 5L and are in line with the study of You et al. [16]. Based on the F-ratios in Table 4, the 5L appears to be more sensitive than the 3L in physical diseases (i.e., CHB and HIV/AIDS). This is mainly because in the two mental condition groups (i.e., GAD and depression) the profiles were more similar; that is, the median levels were 11122 for both the 3L and the 5L. In contrast, the median levels of the HIV/AIDS group were 22222 with the 5L and 11111 with the 3L. This was also observed in the CHB group, where the median level of usual activities was slight problems in the 5L but no problems in the 3L. The 5L could pick up more subtle problems than the 3L; with the 3L, respondents tended to report no problems. The minimal advantage of the 5L over the 3L in differentiating individuals with and without a mental disorder could be due to the weakness of EQ-5D in measuring mental health, as only one item in the instrument targets mental problems.
There are some limitations in this study. First, the study sample was young and highly educated. This is probably because older and less educated people are less active on the Internet. Therefore, the findings of this study may not be generalizable to older populations. It has been reported that elderly people with lower educational attainment may have more difficulty using the five-level descriptive system, and in such cases the 3L may be a more feasible alternative [33]. Second, some respondents reported more than one condition, but we did not analyze in depth the possible effect of multiple conditions. It should also be noted that our sample was recruited online and that the health conditions were self-reported. Ideally, clinical data would be used to verify the presence or absence of the diagnoses reported by the study subjects. Due to these limitations, the superiority of the 5L warrants further study.

Conclusions
In this study, both EQ-5D versions could quantify the disease burden of both physical and mental diseases in terms of self-reported HRQoL. Overall, our findings favor the 5L version of EQ-5D for measuring the health-related quality of life burden caused by chronic disease in China. As patient-reported outcomes (PROs) are increasingly being used all over the world to measure disease burden, EQ-5D can be a useful tool in this context.

Appendix
See Table 5.

Table 5 Measurement properties of EQ-5D-3L and EQ-5D-5L from published studies

Sensitivity
EQ-5D-3L: Acceptable in patients with stable angina pectoris (SAP) [6]
EQ-5D-5L: Satisfactory because it was able to detect statistical differences in treatment modalities in patients with adolescent idiopathic scoliosis (AIS) [7]

Ceiling effect
EQ-5D-3L: Serious ceiling effects were observed in patients with systemic lupus erythematosus (SLE) [8] and chronic diseases [9]
EQ-5D-5L: Ceiling effects of all dimensions were significant in patients with chronic hepatitis B (CHB) [10], AIS [7] and spondyloarthritis (SpA) [11] but acceptable in patients with haemophilia [12]
3L vs. 5L: Compared to the 3L version, the ceiling effect decreased in 5L in patients with common cancers, hepatitis B, elderly Chinese population, patients with acute myeloid leukemia (AML), hypertensive, diabetes, and general population [13][14][15][16][17][18][19][20][21]

Responsiveness
EQ-5D-3L: Good responsiveness was observed in differentiating patients with heart failure and without heart failure [22]
EQ-5D-5L: The responsiveness to changes of patients' health condition was satisfactory in patients with SpA [23], AIS [24] and CHB [10]
3L vs. 5L: Both 3L and 5L had good responsiveness because of their ability to detect changes of patients' health status and no significant difference between 3L and 5L was observed [14]

Informativity
EQ-5D-5L: Strong informativity of EQ-5D-5L has been confirmed since it could differentiate haemophilia patients with different health condition [12]
3L vs. 5L: The 5L version had better discriminatory power than 3L [13-16, 18, 20, 21, 25]

Acceptability
3L vs. 5L: The 5L version had better acceptability because most patients chose 5L as the easier system to answer [14]

Construct validity
EQ-5D-3L: Good in patients with SLE [8], chronic kidney disease (CKD) [26], chronic heart failure (CHF) [22], and Uygur cervical cancer patients [27], but moderate in knee osteoarthritis (KOA) [28]
EQ-5D-5L: The health descriptive system showed good construct validity in southwest residents [29] and health population in China [30]

Discriminant validity
EQ-5D-3L: Satisfactory in patients with SAP [6] and SLE [8] but only acceptable in Kashin-beck disease (KBD) [31]
EQ-5D-5L: Good discriminant validity has been confirmed in patients with southwest residents [29] and family caregivers of leukemia patients [32]

Convergent validity
EQ-5D-3L: Moderate correlation coefficient indicated that 3L had an acceptable convergent validity in patients with SAP [6] and KBD [31]
EQ-5D-5L: 5L had a good convergent validity in various population because moderate to strong correlation coefficient was found in patients with chronic hepatitis disease [10], haemophilia [12], southwest residents [29], family caregivers of leukemia patients [32]
3L vs. 5L: Both 3L and 5L had good convergent validity in patients with common cancer, hepatitis B and AML. Statistics suggested that 5L is slightly stronger than 3L [13][14][15][17]

Face validity
3L vs. 5L: The 5L version had better face validity because most patients considered that 5L expressed their ideas better [14]

Known-group validity
EQ-5D-3L: Good known-group validity was confirmed because lower scores were gained in patients with worse health status [31]
EQ-5D-5L: 5L showed good known-group validity in patients with haemophilia [12], and family caregivers of leukemia patients because of its ability to discriminate all known groups [32]
3L vs. 5L: Both 3L and 5L had good known group validity. Three studies show that 5L performed slightly better than 3L [14,15,17], while higher efficiency was observed in 3L in discriminating between known groups in rural hypertensive patients [18]

Content validity
EQ-5D-3L: Desirable in patients with KOA [28] and Uygur cervical cancer patients because strong correlation was found between all domains and total score [27]
EQ-5D-5L: Barely satisfactory in rural residents due to their poor comprehensibility of some dimensions [33]

Test-retest reliability
EQ-5D-3L: Fair to moderate in patients with KBD [31], stroke [34], CKD [26], and KOA [28]
EQ-5D-5L: Acceptable in patients with SpA [11], chronic hepatitis with stable health status [10], AIS [7] and very good in family caregivers of leukemia patients [32]
3L vs. 5L: Both 3L and 5L had good test-retest reliability. 5L was more reliable in some diseases while 3L had better reliability in other diseases [14,15,17]

Internal consistency
EQ-5D-3L: The internal consistency was moderate to good in patients with chronic diseases [9] and Uygur cervical cancer patients [27] with Cronbach's α coefficient > 0.6, and very high in patients with CKD [26] and KOA [28] with the coefficient > 0.8
EQ-5D-5L: High internal consistency was observed in patients with SpA [11], southwest residents [29] and AIS [7], with Cronbach's α coefficient > 0.7. But the coefficient was smaller than 0.7 in health population, indicating a moderate reliability [30]