Development of a Chinese VAS Value Set for EQ-5D-3L and Difference Analysis with the TTO Value Set Based on the Same Nationally Representative Sample

Objective: To develop a Chinese VAS value set for the EQ-5D-3L and compare it with the TTO value set based on the same nationally representative sample. Methods: An adapted Measurement and Valuation of Health (MVH) protocol was applied with the VAS method. The EQ-5D-3L was used in face-to-face interviews conducted by trained interviewers with participants selected via multi-stage stratified clustered random sampling. Fifteen hypothetical health states (11 random states from the MVH protocol, plus 11111, 33333, unconscious and death) were assigned to each participant for valuation. Ordinary least squares, generalized least squares and weighted least squares models were constructed with or without N3. Four categories of indices, including quality of original data, distribution of rescaled values, goodness of fit of models, and distribution of predicted values, were adopted to compare the Chinese VAS and TTO value sets. Results: A total of 5,939 participants aged 15 and over completed the interview; 5,884 eligible participants were included in constructing the models. An OLS model that included ten dummy variables and a constant, without N3, was chosen as the best-fit VAS model (adjusted R² = 0.670). The mean absolute error was 0.0319, and the correlation coefficient between predicted and mean values was 0.9837. The curve of predicted values of the VAS model was uniformly lower than that of the Chinese TTO value set. Compared with the TTO value set, the VAS method had a higher responsive rate, less inconsistency, less skewed values and better goodness-of-fit values. Conclusions: The VAS value set performed rationally in the Chinese population owing to its simplicity, responsiveness and robustness. We recommend the VAS value set, Model 2, as the benchmark scoring algorithm of HRQoL for the Chinese population.


Introduction
With the extension of the human lifespan, the measurement of health-related quality of life (HRQoL) plays an increasingly important role worldwide [1][2][3]. HRQoL has been measured using specialized instruments, including the SF-36 [4], WHOQOL-100 [5] and EQ-5D [6], among others. As the most concise preference-based instrument, the EQ-5D has been widely applied since its publication in 1990 [7].
The EuroQol five-dimensional questionnaire (EQ-5D) consists of a descriptive system and a visual analogue scale (VAS). The EQ-5D-3L describes HRQoL in terms of five dimensions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression), each assessed at three severity levels (no problems, moderate problems, and severe problems). The VAS, a 20 cm vertical scale, solicits a self-rated value of HRQoL. Another commonly used elicitation technique is time trade-off (TTO), which forces trade-offs between length of life (a 10-year time frame) and quality of life under conditions of certainty [8]. After preferences are elicited from the general population for a selected subset of EQ-5D states, a scoring algorithm can be developed to predict the utility weights for all 243 states [9].
To date, there are 34 value sets for the EQ-5D-3L available on the EuroQol website [10]. More than half of the countries constructed value sets using the VAS method, others used the TTO method, and a few used both methods [10]. We published the Chinese TTO value set for the EQ-5D-3L in 2018 [11].
Although TTO is adopted as a mainstream approach to elicit health utility weights for the EQ-5D [8], it remains the subject of debate: TTO is burdensome [12], is subject to a duration effect [13] and an interviewer effect [14], and yields more inconsistencies [15]. In 2020, we showed that VAS and TTO could be equivalent under certain conditions, but that in the general population the VAS method appears to be a more logical approach for valuation owing to its greater reliability, simplicity and feasibility [16]. Hence, a Chinese VAS value set is critical not only for deriving a reliable value of HRQoL for the Chinese population but also for exploring the potential bias between the two methods.
Though studies have shown that the EQ-5D-5L is associated with reduced ceiling effects and improved discriminatory power [17], the EQ-5D-3L has a responsiveness advantage [18] and has been widely applied; for example, it has been adopted in the National Health Services Survey (NHSS) of China three times in a row from 2008 to 2018. Therefore, developing a Chinese VAS value set for the EQ-5D-3L is essential for deepening the use of the instrument. The main objectives of this study are 1) to develop a hypothesis-based VAS value set for the EQ-5D-3L; and 2) to compare the Chinese VAS value set with the TTO value set.

Health States Description and Selection
An EQ-5D-3L health state, 12333 for example, represents no problems with mobility (level 1), moderate problems with self-care (level 2), and severe problems with usual activities, pain/discomfort and anxiety/depression (level 3) [6; 19]. According to the Measurement and Valuation of Health (MVH) protocol, 43 health states were selected for valuation. Each respondent was assigned to value 11 randomly selected health states (two very mild states, three mild states, three moderate states, and three severe states) and 4 particular states (11111, 33333, "unconscious" and "dead") [6].
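The five-digit notation can be illustrated with a short sketch. The function and constant names below are hypothetical and are not part of the study's tooling; the sketch only decodes a state string into the dimension levels described above and enumerates the 243 possible states.

```python
from itertools import product

# Dimension order used in the five-digit EQ-5D-3L state notation.
DIMENSIONS = ("mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression")
LEVELS = {"1": "no problems", "2": "moderate problems", "3": "severe problems"}


def decode_state(state: str) -> dict:
    """Map a five-digit state such as '12333' to a level label per dimension."""
    if len(state) != 5 or any(ch not in LEVELS for ch in state):
        raise ValueError(f"not an EQ-5D-3L state: {state!r}")
    return {dim: LEVELS[ch] for dim, ch in zip(DIMENSIONS, state)}


def all_states() -> list:
    """Enumerate every state the descriptive system can express (3**5 = 243)."""
    return ["".join(levels) for levels in product("123", repeat=5)]


if __name__ == "__main__":
    for dim, label in decode_state("12333").items():
        print(f"{dim}: {label}")
    print(len(all_states()))
```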

Sample Size
For this study, Chinese aged 15 years and older in 2014 represent the target population. Based on Dolan's estimates, a sample size of 3,235 should be adequate for modeling [6]. Considering the regional variation and potential deviation of the Chinese population, a sample size of 6,000 respondents was selected.

Sampling Method
A multi-stage, stratified, clustered random sample was drawn from the target population in five areas: the provinces of Jiangsu, Shaanxi, Guangdong, Hebei, and Chongqing. One county (rural area) and one district (urban area) from each province were selected based on economic development status.
In each selected area, five rural and five urban streets were sampled. In each selected street, 60 households were systematically drawn according to the existing household code, and all family members 15 years and older were interviewed [20].

Interviewer Training
A total of 10 supervisors trained at Nanjing Medical University (NMU) were sent to the sample counties/districts for interviewer training and quality control. A total of 108 interviewers who had taken part in NHSS 2013 were chosen from the local Health Service Center. To decrease the interviewer effects [21], the procedure was described in a guidebook. A three-day training workshop was conducted, including lectures, demonstrations, mock interviews, and pilot interviews.

The Valuation Tasks
Face-to-face interviews were conducted in the participants' homes based on the MVH protocol [6]. The questionnaire included 17 demographic questions, the EQ-5D-3L descriptive system, the VAS rating scale and the 15 pre-selected health states. The interviewer first asked the respondent to describe and rate their own health. Respondents then ranked the 15 pre-selected health states and rated them on the VAS according to their severity. After the task, the first three states were re-valued. To help participants value each state, it was marked on a visual aid board (Figure S1 in Supplemental Materials).

Data Entry and Cleaning
The supervisors checked the questionnaires at the end of each day. Questionnaires with missing data and/or serious inconsistencies, namely 1) all states valued the same (except for non-traders) or 2) all states valued worse than dead, were returned to the interviewers for updating. Five percent of the respondents were randomly re-interviewed by the supervisor to confirm the accuracy of data collection. Data were entered into the database first by supervisors and later by NMU staff, with the aid of the real-time checking function in EpiData 3.1.
Exclusion criteria included: 1) participants who could not complete the questionnaire; 2) participants less than 15 years old; 3) more than four inconsistencies; 4) extreme outliers. Extreme outliers are values that met all of the following criteria: 1) points away from continuous groups in a boxplot; 2) a distance between the outlier and the nearer quartile of more than 3 times the interquartile range; 3) a value lower than the 5th percentile or higher than the 95th percentile [22]. An inconsistency means that a worse state was valued higher, by 0.1 or more, than a logically better state [23].
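The two screening rules lend themselves to a small sketch. It rests on assumptions that go beyond the text: "logically better" is read as being at least as good in every dimension, the 0.1 threshold is applied to rescaled values, and quartiles and percentiles use the default method of Python's statistics.quantiles. The visual boxplot criterion cannot be coded and is omitted. All function names are hypothetical.

```python
import statistics


def dominates(better: str, worse: str) -> bool:
    """True if `better` is no worse in every dimension and the states differ."""
    return better != worse and all(b <= w for b, w in zip(better, worse))


def count_inconsistencies(values: dict, threshold: float = 0.1) -> int:
    """Count pairs where a logically worse state is valued higher by >= threshold.

    `values` maps five-digit states to rescaled VAS values; other keys
    (for example 'unconscious') are ignored because they cannot be ordered.
    """
    states = [s for s in values if len(s) == 5 and s.isdigit()]
    count = 0
    for better in states:
        for worse in states:
            gap = round(values[worse] - values[better], 6)  # avoid float noise
            if dominates(better, worse) and gap >= threshold:
                count += 1
    return count


def is_extreme_outlier(value: float, sample: list) -> bool:
    """Apply the numeric outlier criteria (3 x IQR rule and 5th/95th percentile)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    cuts = statistics.quantiles(sample, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    p5, p95 = cuts[0], cuts[-1]
    iqr = q3 - q1
    far_from_quartile = value < q1 - 3 * iqr or value > q3 + 3 * iqr
    in_tail = value < p5 or value > p95
    return far_from_quartile and in_tail
```

In this reading, a respondent with more than four counted pairs would meet exclusion criterion 3, and any value flagged by `is_extreme_outlier` would still need the visual boxplot check before being treated as an extreme outlier.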

Modeling
The definition of the dummy variables and the specifications of the models constructed in this study are presented in Table 1. The main effect was represented by a set of ten dummy variables: MO2, SC2, UA2, PD2 and AD2 represent the movements from no problems (level 1) to moderate problems (level 2), and MO3, SC3, UA3, PD3 and AD3 represent the movements from no problems (level 1) to severe problems (level 3). The constant represents any move away from full health (level 2 or 3 in at least one dimension), and the variable N3 indicates level 3 in at least one dimension [9].
Table 1 Definition of the dummy variables and description of the models
The VAS values (dependent variable) were calculated as (X_raw - Dead_raw) / (11111_raw - Dead_raw), where X_raw is the original value on the scale for state X. The rescaled values are numerical: 1 refers to full health and 0 to death [6].
Six models were constructed with three types of regression methods. Ordinary least squares (OLS) models were constructed for the basic characteristics of the model. Generalized least squares (GLS) regression models with a random effect were constructed to correct the significance levels potentially influenced by within-respondent correlation, as each respondent valued 13 health states. Finally, weighted least squares (WLS) models were explored to correct for possible heteroscedasticity. The inverse of the variance of the residuals was employed as the weight. Multiple linear regression models were built using STATA/SE 12.0 (StataCorp, College Station, TX) with an α of 0.05 [27].
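A minimal sketch of the rescaling and the dummy coding may help make these definitions concrete. The function names are hypothetical, and the N3 coding assumes the usual meaning of the term (level 3 in at least one dimension); Table 1 remains the authoritative definition.

```python
DIMS = ("MO", "SC", "UA", "PD", "AD")  # dimension order of the state string


def rescale(x_raw: float, full_health_raw: float, dead_raw: float) -> float:
    """Rescale a raw VAS rating so that 11111 maps to 1 and dead maps to 0."""
    if full_health_raw == dead_raw:
        raise ValueError("11111 and dead were rated the same; cannot rescale")
    return (x_raw - dead_raw) / (full_health_raw - dead_raw)


def dummy_row(state: str) -> dict:
    """Build the ten main-effect dummies plus N3 for one five-digit state."""
    if len(state) != 5 or any(ch not in "123" for ch in state):
        raise ValueError(f"not an EQ-5D-3L state: {state!r}")
    row = {f"{dim}{level}": 0 for dim in DIMS for level in (2, 3)}
    for dim, ch in zip(DIMS, state):
        if ch != "1":
            row[f"{dim}{ch}"] = 1
    # Assumption: N3 flags any dimension at level 3.
    row["N3"] = int("3" in state)
    return row


if __name__ == "__main__":
    # A state rated below 'dead' on the raw scale gets a negative rescaled value.
    print(rescale(10, full_health_raw=90, dead_raw=20))
    print(dummy_row("12333"))
```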
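The three estimators can be summarised in a compact form. This is a sketch under stated assumptions rather than the specification in Table 1: the symbols v_{ij}, x_{jk}, u_i and w_{ij} are introduced here for illustration, and the left-hand side is written as the decrement from full health, 1 - v_{ij}, which is one way to match a scoring algorithm that subtracts coefficients from 1.

```latex
% Respondent i, health state j; x_{jk} are the ten main-effect dummies (MO2 ... AD3).
% OLS (the N3 term is dropped in the models without N3):
1 - v_{ij} = \alpha + \sum_{k=1}^{10} \beta_k x_{jk} + \gamma\, N3_j + \varepsilon_{ij}

% GLS with a respondent-level random effect u_i:
1 - v_{ij} = \alpha + \sum_{k=1}^{10} \beta_k x_{jk} + \gamma\, N3_j + u_i + \varepsilon_{ij}

% WLS: minimise the weighted squared residuals, with weights equal to
% the inverse of the estimated residual variance:
\min_{\alpha,\beta,\gamma} \sum_{i,j} w_{ij}
  \Bigl( 1 - v_{ij} - \alpha - \sum_{k=1}^{10} \beta_k x_{jk} - \gamma\, N3_j \Bigr)^{2},
\qquad w_{ij} = \frac{1}{\widehat{\mathrm{Var}}(\varepsilon_{ij})}
```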

Criteria for Choice of Model
We chose the best-fit model based on the following criteria [9]: 1) logical consistency in each dimension; 2) the sign of the coefficients should be positive; 3) mean absolute error (MAE), adjusted R², Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used for assessing goodness of fit [9]; 4) simplicity plays a more critical role when goodness of fit discriminates little between models; 5) the model should be easy for non-experts to understand [11].
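The goodness-of-fit indices in criterion 3 can be sketched as follows. The function names are hypothetical, and the AIC and BIC use the standard Gaussian log-likelihood form, which is an assumption about how the statistical package computed them. Which data each index is computed on (state means or individual responses) is left to the caller.

```python
import math


def fit_indices(observed: list, predicted: list, n_predictors: int) -> dict:
    """Return MAE, the number of absolute errors > 0.05, R-squared and adjusted R-squared."""
    n = len(observed)
    abs_errors = [abs(o - p) for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {
        "mae": sum(abs_errors) / n,
        "n_abs_error_gt_0.05": sum(e > 0.05 for e in abs_errors),
        "r2": r2,
        "adj_r2": adj_r2,
        "ss_res": ss_res,
    }


def gaussian_aic_bic(ss_res: float, n: int, k: int) -> tuple:
    """AIC and BIC from the Gaussian log-likelihood; k counts estimated parameters."""
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(ss_res / n) + 1)
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * math.log(n)
    return aic, bic
```

Lower MAE, AIC and BIC and a higher adjusted R² indicate a better fit; BIC penalises extra parameters more heavily than AIC once the number of observations exceeds about 8 (ln n > 2).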

Sensitivity and Subgroup Analyses and Reliability Tests
The database was randomly split into two groups; one group was used to construct the model, and the other to estimate goodness of fit. To identify their effect on the model coefficients, demographic and socioeconomic variables such as sex, age group, marital status, district, education level, employment, and economic status were added to the final model.
Based on the re-valued states, the following reliability tests were performed: the rates of consistency, Pearson correlation coefficient, Kappa, intra-class correlation coefficient, and Cronbach's α.

Data Exclusion and Reliability Tests
The survey was conducted from 10 July to 25 August 2014. A total of 6,041 individuals were interviewed, with a mean interview time of 48.6 ± 16.7 minutes. Ninety-eight participants were unable to complete the interview because of mental problems, and 4 were under 15 years old; no other participants were excluded at this stage. The remaining 5,939 participants were kept for checking of outliers and inconsistencies.
A total of 104 extreme outliers were identified, involving 55 participants, who were entirely excluded. A total of 571 participants were identified as having inconsistent values; 496 of them had only one inconsistency, and no participants had more than 4 inconsistencies. Inconsistencies were all kept in the final dataset, as suggested by Lamers et al. [23]. The final sample consisted of 5,884 individuals, representing 99.1% of the 5,939 participants.

Sample Characteristics
Based on the probabilistic sampling method, the demographic composition of the sample was similar to that of the general population in 2014 [30]. The demographic and socioeconomic characteristics are shown in Table 2. The composition of the final sample compared with the Chinese population is presented in Appendix Table S1 in Supplementary Materials. Table 2 also shows health-related characteristics, including problems reported on each EQ-5D-3L dimension.
Rates of chronic disease (25.1%) and disability (3.8%) diagnosed in a county hospital or higher were reported. On the EQ-5D, the most frequently reported problems were pain/discomfort (14.4%), followed by anxiety/depression (7.2%).
For the whole sample, the mean VAS value for the 43 evaluated states was 0.577 ± 0.323. The minimum was -1.375, and the 1st, 5th and 10th percentiles were -0.333, -0.0526 and 0.1333, respectively, indicating that no transformation was needed to rescale negative values.

Analysis of Models
Table 3 presents the six regression models with their coefficient estimates and statistics. All models and coefficients were significant (P < 0.001), but the models failed the tests for heteroscedasticity and within-respondent correlation. Differences in each dummy-variable coefficient across models were all less than 0.03 in absolute terms, and 90% of them were less than 0.01. The six predicted-value curves for Model 1 through Model 6 are shown in Appendix Figure S2; the six lines almost overlapped with each other. We chose the best-fit model based on the following process. In the last stage, Model 5 (WLS with constant and with N3) and Model 6 (WLS with constant and without N3) were constructed based on the WLS method to address heteroscedasticity issues [31]. Although the WLS models had a slightly higher adjusted R², their AIC, BIC and number of states with an absolute error > 0.05 increased more. Additionally, different weights might lead to different models, introducing new potential bias. The WLS models were therefore also dropped at the final stage. Owing to its consistency, parsimony and transparency as well as its high level of goodness of fit, Model 2 appeared preferable among all the models. Thus, we chose Model 2 as the best-fit model.
The predicted value, with "22321" as an example, can be calculated using the following algorithm: 1 - 0.0274 - 0.0727*1 - 0.1105*1 - 0.1518*1 - 0.0898*1 = 0.5478. All 243 predicted values from Model 2 are shown in Appendix Table S2 in Supplemental Materials.
Appendix Figure S3 shows the observed mean values compared with the predicted values based on Model 2 for the 43 valued states. The errors between observed and predicted values were minimal, especially on the left of the curve. Appendix Table S4 shows goodness-of-fit indices for the whole-sample and split-sample models, indicating a high level of robustness.
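The worked example for state 22321 can be reproduced with a short sketch. It uses only the Model 2 values quoted in the text (the constant, the four terms of the worked example, and the level-3 coefficients reported in the Discussion); the coefficients for UA2 and AD2 are not quoted, so the hypothetical function below refuses states that need them instead of guessing. Scoring 11111 as exactly 1 is an assumption based on the constant applying only when some dimension is at level 2 or 3.

```python
# Model 2 decrements as quoted in the text; UA2 and AD2 are not quoted
# and are deliberately left out rather than filled in.
CONSTANT = 0.0274
DECREMENTS = {
    "MO2": 0.0727, "SC2": 0.1105, "PD2": 0.0898,
    "MO3": 0.1788, "SC3": 0.2349, "UA3": 0.1518,
    "PD3": 0.1667, "AD3": 0.1702,
}
DIMS = ("MO", "SC", "UA", "PD", "AD")


def predict_value(state: str) -> float:
    """Predict the VAS value of a five-digit state: 1 - constant - decrements."""
    if len(state) != 5 or any(ch not in "123" for ch in state):
        raise ValueError(f"not an EQ-5D-3L state: {state!r}")
    if state == "11111":
        return 1.0  # assumption: no constant is applied to full health
    value = 1.0 - CONSTANT
    for dim, level in zip(DIMS, state):
        if level == "1":
            continue
        key = dim + level
        if key not in DECREMENTS:
            raise KeyError(f"coefficient {key} is not quoted in the text")
        value -= DECREMENTS[key]
    return value


if __name__ == "__main__":
    print(round(predict_value("22321"), 4))  # 0.5478, matching the worked example
```

The full set of 243 predicted values requires the complete coefficient list from Table 3 and is given in Appendix Table S2.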

Influencing Factors
Based on Model 2, the socio-demographic influence on health state valuation is presented in the Appendix.

Comparisons with Other Models
Figure 1A shows the predicted values of Model 2 compared with the value sets of the United Kingdom, New Zealand, Belgium, Spain, Finland, Slovenia and Denmark [22]. Except for full health, Model 2 gave higher values than all the models from these countries. Figure 1B shows the predicted values of Model 2 compared with the Chinese TTO value set derived from the same sample [11]. Except for full health, Model 2 gave uniformly lower values than the Chinese TTO value set. Table 4 shows four categories of indices for comparing Model 2 and the Chinese TTO value set: quality of original data, distribution of rescaled values, goodness of fit of models, and distribution of predicted values for the 5,939 cases of the sample [11]. The indices showed that the VAS method had less inconsistency in the original data, lower skewness of rescaled values, higher goodness of fit, and more responsive predictions for the sample. These characteristics meant that VAS was a more feasible approach than TTO in the Chinese population owing to its simplicity, robustness and responsiveness. All referenced value sets compared with Model 2 are presented in Appendix Table S5 [11; 22].

Modeling
This study reported on the development of a hypothesis-based VAS value set for the EQ-5D-3L with a nationally representative Chinese sample. The best-fit model was based on OLS regression with main effects and a constant. Although Dolan had long recommended GLS models [6], many authors still preferred OLS models [32][33][34]. In our study, we constructed GLS and WLS models for precise estimation of the significance of the coefficients. The final choice was based on a full comparison according to our prior criteria.
We dropped the N3 term for parsimony, since it did not significantly improve the goodness of fit in Model 1: the MAE increased by only 0.0002, from 0.0317 with N3 to 0.0319 without it, the number of states with an absolute error > 0.05 did not change, and the Pearson correlation coefficient decreased by only 0.001. Furthermore, almost half of the value sets published worldwide have no N3 term [35].
The constant, an average loss of utility for any deviation from full health according to Dolan [6], was included in the final model, as in most value sets. The constant in Model 2 was only 0.0274, less than one-fifth of that of the U.K. [6]. Through the algorithm of Model 2, the measured value for full health (0.9726) was very close to the value of 1 set by the original hypothesis. This loss of HRQoL arises because some reporters of "full health" are not fully healthy but do not reach a moderate level of problems. These reporters were identified by a head-to-head study [36].

Comparison with Value Sets of Other Countries
Figure 1A shows that the line of Model 2 was consistently higher than that of other countries. This meant that, under the same health condition, the utility expectation of Chinese people is higher than that of people in other countries. There are several possible explanations for this phenomenon. Firstly, different cultures lead to different understandings of quality of life [37][38][39][40]. China experienced a period of suffering in the early twentieth century; therefore, Chinese people's tolerance of a low quality of life is much higher than that of people in other countries. Secondly, China is a developing country with many relatively undeveloped areas, and people living in these areas have low expectations of quality of life. Thirdly, in recent decades, China's health service system has been significantly improved, and satisfaction with quality of life has greatly increased.
As to the most influential dimension, self-care had the largest impact on HRQoL for severe problems in our study. In countries such as New Zealand and Belgium, anxiety/depression had the greatest impact, followed by pain/discomfort [22]. The results emphasize that the importance of each dimension to overall HRQoL differs between populations.

Comparison with Chinese TTO Value Set
Figure 1B shows that the line of Model 2 was consistently lower than the line of the Chinese TTO value set. The same trend was also identified in Sweden [16]. These findings indicated that participants were reluctant to trade despite the low VAS score. The indices in Table 4 show a notable difference between the VAS and TTO methods. They also showed an obvious psychological burden under the TTO method [12]. For Chinese respondents, there are several possible reasons for this unwillingness to trade. Firstly, filial piety. An ancient Chinese book, the Classic of Filial Piety, says that filial piety begins with taking good care of one's body, skin and hair, so as to relieve parents' worries about physical injury. The ultimate goal of filial piety is to do good for the people and to set an example for future generations, leaving parents a great sense of honor [41]. These deep-rooted moral values lead everyone to stay away from any behavior that endangers his or her health [42]. Secondly, complex emotional combinations, such as emotional attachment to family members [42], attachment to wealth [43] and fear of death [44], reduce the willingness to trade. Thirdly, some generally accepted beliefs, such as "better a living dog than a dead lion", lead to people's unwillingness to trade [45].
Compared with the Chinese TTO value set, the coefficient ranking of Model 2 was basically the same, since 7 of the 10 coefficients were in the same position [11]. Self-care (-0.2349) had the largest impact on HRQoL for severe problems in Model 2, followed by mobility (-0.1788), anxiety/depression (-0.1702), pain/discomfort (-0.1667) and usual activities (-0.1518). The coefficients for moderate problems in all dimensions except self-care (-0.1105) were less than 0.1 in absolute terms. This phenomenon demonstrated the internal correlation between the two methods. [48]. Therefore, the performance of the VAS method in the Chinese population is more stable than that of the TTO method.

Strengths and Weaknesses
There were several strengths to our study. To our knowledge, this is the first attempt to develop a Chinese VAS value set for the EQ-5D-3L, and it provides abundant evidence that the VAS method still has value and advantages.
The interviewer effect is one of the weaknesses of this study. Even though we carried out extensive training, the 108 interviewers still had different backgrounds and interview skills. Another weakness was that the difference between Model 2 and the Chinese TTO value set was explained only qualitatively. Further quantitative analysis of the difference would be informative and should be encouraged.

Conclusions
The VAS method has value and advantages in a wider range of application scenarios owing to its simplicity, responsiveness, and robustness. We recommend the VAS value set, Model 2, as the benchmark scoring algorithm of HRQoL in the Chinese population.
The authors declare that they have no competing interests.

Compliance with Ethical Standards
Ethical approval: All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Nanjing Medical University Ethics Committee approved this study (EA#20140706004).
Informed consent: Written informed consent was obtained from all individual participants included in the study. Participants were informed that participation was voluntary and that they could terminate the interview at any time.

Funding
The project was funded by the National Natural Science Foundation of China (grant no. 71373183).