Development and Validation of the EQ-5D in Taiwan Using Item Response Theory

Our study aims to provide validity evidence for the EuroQol five dimensions questionnaire (EQ-5D) in the 2013 wave of the National Health Interview Survey of Taiwan and to further interpret the EQ-5D scores for patients with chronic diseases. Another goal of the study was to use item response theory (IRT) to identify items that are informative for assessing quality of life using EQ-5D. Methods: Psychometric methods, including factor analysis and the IRT model known as the Graded Response Model (GRM), were used to assess the unidimensionality of EQ-5D and its item properties. Correlation analysis was used to assess whether EQ-5D scores are associated with scores from the 36-Item Short Form Survey (SF-36). Results:

of the EQ-5D in large-scale interview surveys in Taiwan had been shown to be an effective and simple approach [16][17][18]. We use item response theory to assess whether both relatively healthy people and patients with chronic diseases can be measured on this common health-related quality of life scale.
An item response theory (IRT) model was fit to item response data from the EQ-5D scale. This method is a popular tool in educational test development [19]. As scholars identified the advantages of IRT models [20,21], they have been increasingly applied in quality-of-life research [22][23][24][25][26][27]. In contrast to classical test theory, which focuses on average test scores, IRT focuses on a single dimension or latent construct. IRT analysis estimates different item features, and item characteristics are expected to remain the same regardless of the sampled population. The item parameters include location and information parameters. The information parameter indicates how well an item can distinguish people with different levels of ability. The location parameter shows where on the ability scale an item is most discriminating. Based on the item responses in EQ-5D, the IRT model could potentially inform us which items better distinguish between levels of quality of life.
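As a minimal sketch of how the two kinds of item parameters enter a graded response model, the Python fragment below computes category probabilities for a single item. It assumes the standard logistic form of the graded response model, in which the parameter a (called the information parameter in this paper and more commonly the discrimination parameter) scales the slope and each threshold b locates one boundary between adjacent response categories. The function names and the parameter values in the usage note are illustrative assumptions, not estimates from this study.

```python
from math import exp


def boundary_prob(theta, a, b):
    """Probability of responding above a category boundary located at b."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a graded response model.

    theta      : latent trait value (here, the HRQoL scale score)
    a          : slope ("information"/discrimination) parameter
    thresholds : ascending boundary locations; K-1 values for K categories
    """
    # Cumulative probabilities of reaching each category or higher,
    # padded with 1 (lowest category) and 0 (beyond the highest category).
    cumulative = [1.0] + [boundary_prob(theta, a, b) for b in thresholds] + [0.0]
    # Each category probability is the difference of adjacent cumulative values.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]
```

With a single threshold the model reduces to a two-category logistic item; for example, `grm_category_probs(-2.0, 7.0, [-2.0])` returns `[0.5, 0.5]`, because the trait value sits exactly at the item location.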
In this study, we aim to provide validity evidence that the EQ-5D measures quality of health based on its content, shows coherence in its scores, and correlates with other HRQoL measurements such as the SF-36.
Additionally, this study demonstrates how the IRT model can provide useful insights about the design of scales for quality of life.

Data and sample
The National Health Interview Survey (NHIS)-Taiwan is a national survey conducted jointly by the Health Promotion Administration, Ministry of Health and Welfare, and the National Health Research Institutes, Taiwan [17]. The survey is administered every 4 years to help Taiwan's public health sector monitor the health status of the population. Our data came from the NHIS conducted in 2013. The NHIS was approved by the ethics committee of the National Health Research Institute in Taiwan. The interviewees in this survey comprised a representative sample of national and city/county populations in Taiwan. A multistage stratified systematic sampling design was applied. The townships were stratified by urbanization and location before sampling. Villages/lins within the sampled townships, and then individuals within the sampled villages/lins, were selected step by step following the principle of probability proportional to size (PPS). A total of 159 interviewers were recruited and trained for this research. All interviewees signed consent forms. Data were collected through face-to-face interviews from July to December 2013. There were three sets of questions designated for three age groups: under 12, 12 to 64, and 65 or above. Our study selected only the interviewees aged 12-64 because the questionnaires for the other two age groups did not include the SF-36. The response rate was 72.2% for this age group, and 17,260 participants completed the interviews.
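The selection principle named above can be illustrated with a small Python sketch of systematic PPS sampling at a single stage. This is a generic textbook procedure written under stated assumptions, not the NHIS sampling program: the function name, the unit labels, and the population sizes are hypothetical.

```python
import random


def pps_systematic_sample(sizes, n, start=None):
    """Select n units with probability proportional to size.

    sizes : dict mapping unit label -> size measure (e.g. population)
    n     : number of selections
    start : optional starting point in [0, total/n); drawn at random if omitted
    """
    total = sum(sizes.values())
    step = total / n  # sampling interval on the cumulative size scale
    if start is None:
        start = random.random() * step  # random start within the first interval
    targets = [start + i * step for i in range(n)]

    picks = []
    cumulative = 0.0
    t = 0
    for unit, size in sizes.items():
        cumulative += size
        # A unit is selected once for every target falling in its size range,
        # so a unit larger than the interval can be selected more than once.
        while t < n and targets[t] < cumulative:
            picks.append(unit)
            t += 1
    return picks
```

For example, with hypothetical sizes `{"A": 10, "B": 30, "C": 60}`, `n=2`, and `start=25`, the targets are 25 and 75 on the cumulative scale, so units B and C are selected.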
The contents of the questionnaire include personal characteristics, health status, chronic diseases, EQ-5D, and SF-36 (optional) for measuring the quality of life in the general population. A total of 8,272 participants (47.9%) filled out the SF-36 forms. Since only about half of the participants finished the SF-36, we conducted a sensitivity analysis to compare the sociodemographic characteristics of those who filled out the SF-36 and those who did not. The chronic diseases registered in the NHIS catalog were hypertension, diabetes, dyslipidemia, stroke, asthma, chronic kidney disease, heart disease, gout, peptic ulcer, chronic obstructive pulmonary disease (COPD), liver/gallbladder disease, osteoporosis, cancer, osteoarthritis, psychiatric disorder, benign prostatic hyperplasia (BPH, for males only), and uterine/ovarian disease (for females only).
There are only 6 items in the EQ-5D scale, including five items in the descriptive system (each item assigned to one specific domain) and one item on a visual analogue scale (VAS). The domains include mobility (D1), self-care (D2), usual activities (D3), pain/discomfort (D4), and anxiety/depression (D5). Each domain was rated on 3 levels (1: no problem; 2: some/moderate problem; 3: unable to do/extreme problem). The average score of the first five items is the EQ-5D index. The SF-36 includes 36 items. In each item, the participants can respond with a score between 0 (representing the poorest health status) and 100 (representing the best health status). These items can be summarized into the physical component score (PCS), the mental component score (MCS), and the total score [7].

Statistical Methods
Descriptive summaries of demographic results are shown. Using classical test theory, we estimated the inter-item correlation coefficients and the Cronbach's alpha that described internal consistency. We examined the dimensionality using factor analysis and principal component analysis. The IRT graded response model was used for estimating location and information parameters. Based on the item responses in EQ-5D, the HRQoL can also be estimated from the graded response model, and we name this the EQ-5D scale score throughout our analysis.
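For readers unfamiliar with the internal consistency statistic, the following Python sketch computes Cronbach's alpha from a respondent-by-item score matrix using the usual formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The analysis in this study was run in Stata; the function name and the toy data below are illustrative assumptions only.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    items = list(zip(*rows))                     # transpose to item columns
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])  # variance of summed scores
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Toy data (hypothetical): three respondents answering three items identically,
# so the items are perfectly consistent with one another.
toy = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(toy)
```

In this toy case each item variance is 1 and the variance of the total scores is 9, so alpha equals 1.5 × (1 − 3/9) = 1.0.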
We calculated the EQ-5D scale score for each chronic disease. The item information functions are presented to show how much information the items can provide and in what range of the EQ-5D scale score the items are most informative. We then calculated the conditional standard error of measurement (CSEM) for the EQ-5D scale score using the graded response model. The CSEM measures the standard deviation of the observed scores of a survey taker with a fixed and unchanging true score over repeated measurements using these items. The CSEM indicates the precision of EQ-5D scale scores at different levels, and a smaller CSEM indicates that the measurement is more precise for examinees.
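The link between item information and the CSEM can be sketched in Python as follows. The sketch assumes two-category logistic items, for which the item information is a² P(1 − P), and the common maximum-likelihood approximation CSEM = 1/√(test information). The item parameters used in the usage note are hypothetical round numbers chosen for illustration, not the estimates from this study.

```python
from math import exp, sqrt


def item_information(theta, a, b):
    """Information of a two-category logistic item at trait value theta."""
    p = 1.0 / (1.0 + exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def csem(theta, items):
    """Conditional standard error of measurement at theta.

    items : list of (a, b) pairs, slope and location of each item
    """
    test_information = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / sqrt(test_information)


# Hypothetical item set: three steep items and two flat items, all located at -2.
hypothetical_items = [(7.0, -2.0)] * 3 + [(1.0, -2.0)] * 2
csem_low = csem(-2.0, hypothetical_items)   # where the items are located
csem_mid = csem(0.0, hypothetical_items)    # far from the item locations
```

Because information peaks at an item's location, `csem_low` is much smaller than `csem_mid` for this item set, which mirrors the qualitative pattern described for precision at different levels of the scale score.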
To gather predictive evidence, we also correlated the EQ-5D scores (EQ-5D index, EQ-5D VAS, and EQ-5D scale score) with the SF-36 scores, including the SF-36 physical component score (PCS), the SF-36 mental component score (MCS), and the SF-36 total scores. An alpha level of 0.05 was used as the cutoff for statistical significance. Stata/IC version 15.1 (StataCorp, 2017) was used for statistical analysis.

Descriptive statistics
The demographic data in Table 1 show that the mean age (± standard deviation) of the sample was 38 (± 16) years.
Among the interviewees, 51% were female, 49% were married, and 21% had at least one chronic disease. For educational attainment, one-third of the interviewees had a high school diploma, one-third had a bachelor's degree, and about 28% had received less than a high school education. The first five items in EQ-5D all have a mean close to one (D1: 1.02 ± 0.13; D2: 1.01 ± 0.10; D3: 1.02 ± 0.15; D4: 1.10 ± 0.32; D5: 1.04 ± 0.22) on a scale of 1 to 3, and the sixth item has a mean of 79.95 (SD: 13.58) on a scale of 0 to 100. These numbers show that our interviewees generally had a good health state. The first three items in EQ-5D (D1. Mobility, D2. Self-care, and D3. Usual activities) have a moderate correlation (range, 0.52-0.69) with each other. However, the last three items (D4. Pain/Discomfort, D5. Anxiety/Depression, and D6. Overall health) have a weak correlation (range, 0.12-0.31) with all the other items. Scores across the first 5 items of EQ-5D show moderate internal consistency, with a Cronbach's alpha of 0.60. This alpha value supports moderate reliability for scores.
The distributions of the scores of the first five items in EQ-5D are highly concentrated at 1. To enable the IRT graded response model to converge upon estimates, we dichotomized the first five items with the first choice scored as 1 and the second and third choices scored as 0. Most of the scores of the sixth item are in multiples of ten. We thus divided these scores into eleven parts (0 stands for scores less than 5; 1 stands for 5-15; …; 10 stands for scores more than 95) for analysis.
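The recoding described above can be sketched in Python as follows. The text does not say how scores that fall exactly on a bin edge (such as 15 or 95) were assigned, so this sketch assumes that an edge value goes to the upper bin; that choice, like the function names, is an assumption made for illustration.

```python
def dichotomize_domain(level):
    """Recode a 3-level domain response: 1 (no problem) -> 1; 2 or 3 -> 0."""
    if level not in (1, 2, 3):
        raise ValueError(f"unexpected EQ-5D level: {level!r}")
    return 1 if level == 1 else 0


def bin_vas(score):
    """Recode a 0-100 VAS score into eleven categories (0-10).

    Assumption: bins are centred on multiples of ten and an edge value
    (5, 15, ..., 95) is assigned to the upper bin.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"VAS score out of range: {score!r}")
    return int((score + 5) // 10)


def recode_respondent(domain_levels, vas_score):
    """Return the six recoded item scores (D1-D5 dichotomized, D6 binned)."""
    return [dichotomize_domain(x) for x in domain_levels] + [bin_vas(vas_score)]
```

For instance, a hypothetical respondent with domain levels `[1, 2, 1, 3, 1]` and a VAS score of 80 would be recoded as `[1, 0, 1, 0, 1, 8]`.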

Dimensionality of EQ-5D scale
A 1-dimensional factor analytic model was fit to the data to determine whether items measured a single underlying latent dimension. Standardized factor loadings ranged from 0.27 (D5, D6) to 0.84 (D3). Model fit indices were within acceptable ranges (χ² = 1392.41, df = 9, root mean square error of approximation = 0.14, comparative fit index = 0.88, standardized root mean square residual = 0.09), indicating that a single common factor can account for the relationships among item responses. A principal components analysis showed that an ideally weighted composite of item scores accounts for 32% of total variation. A scree plot from this analysis indicated that the first component accounted for substantially more variation than subsequent composites, suggesting that EQ-5D was measuring a unidimensional ability (HRQoL).

* D1-D5 rescaled as 0 (for those who answered 2 or 3) and 1; D6 rescaled as 0-10 (0 stands for less than 5; 1 stands for 5-15; …; 10 stands for more than 95)

Estimation of IRT graded response model
An IRT graded response model was fit to the sample to estimate information and location parameters. In Table 2, the first five items all have a location parameter near -2 (range, -2.58 to -1.72). The first three items have high information parameters (range, 6.56-8.66), which indicates that these three items can effectively distinguish people with severely low HRQoL (e.g. below the 10th percentile). The EQ-5D scale score was estimated from the graded response model. We present the EQ-5D scale score by dividing the sample into two groups: relatively healthy people and patients with chronic diseases (Fig. 1). About 10% of the scale scores among patients with chronic diseases were distributed around -2, whereas a scale score lower than -1.8 was rarely seen in relatively healthy people. To understand how each disease may affect HRQoL, we present EQ-5D scale scores for specific disease subgroups in Table 3. Stroke is the disease with the lowest scale score (-1.04), followed by psychiatric disorder (-0.91), COPD (-0.72), cancer (-0.70), and osteoarthritis (-0.69). Liver/gallbladder disease, gout, and uterine/ovarian disease (female) have the least impact on HRQoL, with scale scores ranging from -0.33 to -0.25. The item information functions are presented in Fig. 2. The first three items provide much more information than the last three items when the EQ-5D scale score is around -2. The last three items provide equal information across the whole range of scale scores. We then calculated the CSEMs for the EQ-5D scale score in the graded response model. The CSEM is lowest, around 0.2, when the scale score is near -2, meaning that precision is highest when we estimate the EQ-5D scale score for people with severely low HRQoL. The CSEM is high, around 1, when the scale score is near 0, which means that precision is low when estimating the EQ-5D scale score for people with average HRQoL.
Based on these results, we noted that patients with chronic diseases and HRQoL below the 10th percentile could be better differentiated by the first three items of EQ-5D than by the other items. If differentiation is necessary for patients with average HRQoL, the other items may provide some information.

Correlation analysis
To examine whether the EQ-5D scores can substitute for the SF-36 scores, we performed a correlation analysis for the 8,272 interviewees in our sample who had both EQ-5D and SF-36 scores. As shown in Table 4, the correlation between EQ-5D scores and SF-36 scores is moderate. Both the EQ-5D index (the average score of the first five items) and the EQ-5D scale score are moderately correlated with SF-36 PCS and SF-36 total score, with correlation coefficients of 0.61. However, the EQ-5D VAS (the sixth item) has a relatively weak correlation (r: 0.50) with SF-36 total score. The EQ-5D index, the EQ-5D VAS, and the EQ-5D scale score all have a lower correlation (r: 0.42-0.48) with SF-36 MCS.
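As a brief illustration of the statistic reported here, the Python sketch below computes a Pearson correlation coefficient between two score vectors. It assumes that the reported r values are Pearson coefficients, which the text does not state explicitly; the function name and the toy scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy scores (hypothetical): a scale score paired with a total score
# for five respondents.
scale_scores = [-2.0, -1.0, 0.0, 0.5, 1.0]
total_scores = [40.0, 55.0, 70.0, 72.0, 85.0]
r = pearson_r(scale_scores, total_scores)
```

A coefficient near 1 or -1 indicates a strong linear association, while values around 0.4 to 0.6, as reported in Table 4, are conventionally read as moderate.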

Discussion
According to our study, the correlation is moderate between EQ-5D scores and SF-36 scores. Using the IRT model, we found the EQ-5D scale score is moderately correlated with SF-36 PCS and SF-36 total score. Patients with stroke, psychiatric disorder, COPD, cancer, and osteoarthritis have a higher chance of impaired quality of life. The item information functions reveal that patients with chronic diseases and HRQoL below the 10th percentile could be better differentiated by the first three items of EQ-5D than by the other items.
The EQ-5D scale has only 6 items, far fewer than the 36 items of the SF-36 scale. For survey takers, it takes at least 10 minutes to complete the SF-36 scale, but less than 2 minutes to fill out the EQ-5D scale. The EQ-5D scale can bring potential benefits by saving time and money in public health and clinical investigation. It is also of great importance that each country validates its own use of the EQ-5D scores, which will inform future practice in the local context. The IRT graded response model has rarely been used for quality of life research in the previous literature in Taiwan, but it provides many insights into the analysis and interpretation of EQ-5D scores. Since our sample was representative of the national population in Taiwan, we can estimate the average HRQoL using the EQ-5D scale score from the graded response model and establish norms for future comparison.
Our study shows that both the EQ-5D index (the average score of the first five items) and the EQ-5D scale score (the ability value estimated from the IRT model) have a moderate correlation with the SF-36 total scores. Although the EQ-5D index and the EQ-5D scale score share similar correlation coefficients with the external criterion, the EQ-5D scale score has more information, because the graded response model weights each item according to its information. The correlation coefficient between the EQ-5D index and the EQ-5D scale score is 0.70 (far from 0.99), supporting that the EQ-5D scale score provides different information than the EQ-5D index does.
The information function from the IRT graded response model helps clarify which items are more informative at a specific range of the EQ-5D scale score. In our findings, three items (D1. Mobility, D2. Self-care, and D3. Usual activities) provide much more information for interviewees with an EQ-5D scale score near -2 than the other items do. For patients who have chronic diseases and an EQ-5D scale score below the 10th percentile, the first three items of the EQ-5D scale are useful for telling whether their quality of life is impaired (very low or low). Clinicians can use these three items as a set of screening questions when they encounter a patient with chronic diseases and suspected impaired quality of life. If the patient reports any decreased function on these three items, the clinician should arrange a corresponding management plan to improve (or at least maintain) the patient's health state and quality of life.
One thing worth mentioning is that we dichotomized the first five items of the EQ-5D scale and divided the scores of the sixth item into eleven parts (0, 1, 2, …, 10) for the IRT graded response model. This version of the scale showed concentrated information around a scale score of -2. Keeping the first three items of the EQ-5D scale with only 2 score points can be an even more efficient approach and can provide adequate information to differentiate patients who have very low and low levels of quality of life. We suggest using the 2-point scale in the clinical setting for its relative convenience.
In this NHIS dataset, we registered a variety of chronic diseases that were diagnosed by physicians rather than simply reported by the interviewees. When examining the EQ-5D scale scores by disease subgroups, we found that the level of HRQoL differed across types of chronic disease. Patients with stroke, psychiatric disorder, COPD, cancer, and osteoarthritis have a higher chance of impaired quality of life. Clinicians need to be attentive to these subgroups of patients with chronic diseases. The EQ-5D scale can be a useful tool to assess whether quality of life is impaired among the high-risk patient population.
Although IRT shows great benefits by revealing the item characteristics, there are some considerations when using the IRT graded response model. First, for polytomous items like the EQ-5D VAS, a large sample size (above 3,500) and coverage across polytomous item scales are needed [19]. The number of cases in our sample is large enough for us to fit the IRT model. Second, we can only gather information from the given items of our scale. If the goal is to find more details in each dimension of EQ-5D, an in-depth survey with more items is needed. According to our correlation analysis, the EQ-5D scale score itself is a moderate predictor of the SF-36 score. This finding supports that the EQ-5D scale could be a useful and efficient alternative to the SF-36 for quickly screening patients' HRQoL under time constraints. However, if we want to understand how patients with different diseases have different quality of life, it is vital for subsequent studies to examine the EQ-5D scores for each type of chronic disease and link them to scores of other HRQoL measures with more items.
The EQ-5D scores in our study demonstrate a higher correlation with the SF-36 physical component score than with the mental component score. Some diseases are known to be highly associated with mental health problems and may show a different pattern of information function of EQ-5D items in the IRT graded response model if the sample targets the population of people with these diseases. Similar issues have been raised in a previous review of psychometrics and qualitative assessment of EQ-5D [15]. Therefore, future research using IRT is needed to understand how to interpret the scores of EQ-5D items for patients with specific diseases and across different clinical settings.

Conclusions
Use of the EQ-5D scale scores is appropriate in the general population, particularly for distinguishing between patients who have very low and low HRQoL. The EQ-5D scores have moderate internal reliability and moderate correlation with SF-36 scores. The IRT graded response model strengthens our interpretation of the EQ-5D scores. The information function analysis demonstrates that Domain 1 (Mobility), Domain 2 (Self-care), and Domain 3 (Usual activities) are the three most informative items of the EQ-5D scale for patients who have chronic diseases and HRQoL below the 10th percentile. Subgroup analysis shows that patients with stroke, psychiatric disorder, COPD, cancer, and osteoarthritis have a higher chance of impaired quality of life. If the time constraints in clinical settings are severe and efficient distinction between very low and low HRQoL patients is desired, we suggest using EQ-5D instead of SF-36 to measure the HRQoL of patients with chronic diseases.

Declarations
Ethics approval and consent to participate
Participation was voluntary, and all respondents consented to participate. All study procedures were approved by the ethics committee of the National Health Research Institute in Taiwan.

Consent for publication
Not applicable.

Availability of data and materials
The data that support the findings of this study are available from the National Health Research Institute, Taiwan. Restrictions apply to the availability of these data, which were used under license for this study.