The 3L index scores generated from the two 3L value sets for China
The two 3L value sets were developed using different sampling methods, valuation protocols [26,27], and modeling methods, leading to distinct algorithms for calculating the 3L index scores (Table 1). For example, the utility score for health state “23221” is 0.466 (i.e., 1-0.039-0.099-0.208-0.074-0.092-0.022) according to the 2014 algorithm or 0.568 (i.e., 1-0.077-0.291-0.037-0.027) according to the 2018 algorithm. In this study, both algorithms were used to generate index scores for all 243 3L health states.
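The arithmetic in this example can be checked with a minimal sketch. It uses only the decrements quoted above for state “23221”; the full coefficient tables are in Table 1 and are not reproduced here. The names `DECREMENTS_2014`, `DECREMENTS_2018`, `index_score`, and `ALL_STATES` are illustrative and not from the original study.

```python
from itertools import product

# Decrements quoted in the text for health state "23221" only.
# The complete algorithms are given in Table 1 and are not reproduced here.
DECREMENTS_2014 = [0.039, 0.099, 0.208, 0.074, 0.092, 0.022]
DECREMENTS_2018 = [0.077, 0.291, 0.037, 0.027]


def index_score(decrements):
    """Subtract the applicable decrements from full health (1.0)."""
    return round(1.0 - sum(decrements), 3)


# Every 3L health state: five dimensions with three levels each.
ALL_STATES = ["".join(levels) for levels in product("123", repeat=5)]

print(index_score(DECREMENTS_2014))  # 0.466
print(index_score(DECREMENTS_2018))  # 0.568
print(len(ALL_STATES))               # 243
```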
We assessed the distributions of the two indices (i.e., the 3L2014 index score and the 3L2018 index score) using the Shapiro-Wilk test. A t-test or a Wilcoxon rank-sum test was then used, as appropriate, to compare their mean values.
A two-way mixed intraclass correlation coefficient (ICC) and a Bland-Altman plot were used to assess the degree of agreement between the two indices at the absolute level. Agreement was considered good when the ICC was higher than 0.7. The Bland-Altman plot was used to visualize and assess the level of agreement across different utility segments: the Y-axis shows the difference in score between the two indices, and the X-axis shows their mean. A threshold of 0.074, the minimally important difference (MID) of the 3L index score, was used to determine whether the magnitude of a difference would be clinically important.
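The two agreement measures can be illustrated as follows. This is a sketch under stated assumptions rather than the study's actual computation: the ICC is written in its two-way, absolute-agreement, single-measures form (the text does not say whether single or average measures were used), the 95% limits of agreement use the conventional mean difference ± 1.96 SD, and the function names and any input scores are hypothetical.

```python
from statistics import mean, stdev

MID = 0.074  # minimally important difference quoted in the text


def icc_absolute_single(x, y):
    """Two-way, absolute-agreement, single-measures ICC for two score lists.

    Assumption: the single-measures form; the source does not specify it.
    """
    n, k = len(x), 2
    grand = mean(x + y)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    col_means = [mean(x), mean(y)]
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )


def bland_altman(x, y):
    """Return plot coordinates, bias, limits of agreement, and share over MID."""
    means = [(a + b) / 2 for a, b in zip(x, y)]  # X-axis values
    diffs = [a - b for a, b in zip(x, y)]        # Y-axis values
    bias, sd = mean(diffs), stdev(diffs)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    share_over_mid = sum(abs(d) > MID for d in diffs) / len(diffs)
    return means, diffs, bias, limits, share_over_mid
```

In use, `x` and `y` would be the lists of 3L2014 and 3L2018 index scores for the same health states, in the same order.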
To examine the agreement of the two 3L index scores at the relative level, we simulated all possible health state transitions that may occur over time. The 243 health states were paired to form 29,403 ($C_{243}^{2}$) health state combinations, each of which was used to simulate a pair of health states before and after treatment. Within each pair, the health state with the higher index score was assumed to be the state after treatment (post-treatment), and the state with the lower index score the state before treatment (pre-treatment). Hence, the health gains of our simulated treatment were always positive. However, the index score of a given health state may differ between value sets, so a health state labeled as pre-treatment under the 3L2014 value set may instead be labeled as post-treatment under the 3L2018 value set in the same pair, or vice versa. We considered such a pair an “inconsistent” pair of health states, for which the choice of index score would have a substantial impact on health outcomes: for the same transition, one value set would yield a health gain while the other would yield a health loss.
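The pairing and the consistency check can be sketched as follows. The two scoring functions are toy stand-ins, not the 3L2014 or 3L2018 algorithms, and treating tied scores as not consistent is an assumption, since the text does not say how ties were handled.

```python
from itertools import combinations, product

STATES = ["".join(s) for s in product("123", repeat=5)]


def toy_score_a(state):
    """Hypothetical scoring rule, NOT the 3L2014 algorithm."""
    return 1.0 - 0.10 * sum(int(level) - 1 for level in state)


def toy_score_b(state):
    """Hypothetical scoring rule, NOT the 3L2018 algorithm."""
    weights = (0.05, 0.08, 0.10, 0.12, 0.15)
    return 1.0 - sum(w * (int(level) - 1) for w, level in zip(weights, state))


def is_consistent(s1, s2, score_a, score_b):
    """True if both value sets agree on which state is pre-treatment."""
    diff_a = score_a(s1) - score_a(s2)
    diff_b = score_b(s1) - score_b(s2)
    # Assumption: a tie under either value set counts as not consistent.
    return diff_a * diff_b > 0


# All unordered pairs of distinct health states.
PAIRS = list(combinations(STATES, 2))
N_CONSISTENT = sum(
    is_consistent(a, b, toy_score_a, toy_score_b) for a, b in PAIRS
)
```

Substituting the real scoring algorithms for the two toy functions would reproduce the split into consistent and inconsistent pairs described above.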
In contrast, for a “consistent” pair, the health state representing pre-treatment was the same under both the 3L2014 and 3L2018 value sets. Because the magnitude of health gains may vary from one value set to the other, the consistent group was further divided into four subgroups according to the perceived direction and magnitude of the change from pre-treatment to post-treatment: (1) major improvement (i.e., at least one dimension improves from level 3 to level 1 or level 2, and no dimension worsens); (2) minor improvement (i.e., at least one dimension improves from level 2 to level 1, no dimension improves from level 3 to level 1 or 2, and no dimension worsens); (3) mixed response with minor deterioration (i.e., at least one dimension worsens from level 1 to level 2, and no dimension worsens from level 1 or 2 to level 3); (4) mixed response with major deterioration (i.e., at least one dimension worsens from level 1 or 2 to level 3). Note that if any dimension deteriorates while others improve in a health transition, the transition is considered a mixed response with some deterioration and is assigned to subgroup 3 or 4. We then compared the health gains yielded by the two 3L indices for all transitions, for consistent transitions, and for each subgroup of the consistent transitions.
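The four subgroup rules can be expressed as a short classification function. This is an illustrative sketch: the function name and the order in which the rules are checked are assumptions chosen so that any deterioration takes precedence, as the note on mixed responses implies.

```python
def classify_transition(pre, post):
    """Assign a pre/post pair of 3L states (e.g. "23221") to a subgroup.

    Level 1 is the best level and level 3 the worst, so a move to a
    higher level number is a deterioration.
    """
    changes = [(int(before), int(after)) for before, after in zip(pre, post)]

    # Deterioration is checked first: any worsening makes the response mixed.
    if any(before < 3 and after == 3 for before, after in changes):
        return 4  # mixed response with major deterioration
    if any(before == 1 and after == 2 for before, after in changes):
        return 3  # mixed response with minor deterioration
    if any(before == 3 and after < 3 for before, after in changes):
        return 1  # major improvement
    if any(before == 2 and after == 1 for before, after in changes):
        return 2  # minor improvement
    return None  # identical states; cannot occur for distinct pairs
```

For example, `classify_transition("21111", "12111")` returns 3, because mobility improves from level 2 to 1 while self-care worsens from level 1 to 2.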
We also compared the responsiveness of the two 3L indices within the consistent group using Cohen's effect size. It is commonly used to measure the effect size of a treatment and, unlike a significance test, is independent of sample size. It is calculated as the difference between the mean post-treatment and pre-treatment scores divided by the standard deviation of the pre-treatment scores. The effect size was categorized as small (0.2–0.5), moderate (> 0.5–0.8), or large (> 0.8). Given that the hypothetical treatment was fixed in our simulation, the effect size reflects the ability of an index score to discern the change between two known health states: the higher the effect size, the more responsive the index score. We calculated and compared Cohen's effect size for all consistent pairs and for each subgroup. Microsoft Excel, Stata, and SAS were used for statistical analysis.
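The effect size calculation described above can be sketched as follows. The use of the sample standard deviation (n - 1 denominator) and the label for values below 0.2 are assumptions, as the text specifies neither; the function names are illustrative.

```python
from statistics import mean, stdev


def effect_size(pre_scores, post_scores):
    """Mean change divided by the standard deviation of pre-treatment scores.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return (mean(post_scores) - mean(pre_scores)) / stdev(pre_scores)


def categorize(es):
    """Apply the thresholds given in the text."""
    if es > 0.8:
        return "large"
    if es > 0.5:
        return "moderate"
    if es >= 0.2:
        return "small"
    return "below small"  # assumption: the text does not name this range
```

Here `pre_scores` and `post_scores` would hold the index scores of the pre-treatment and post-treatment states across the pairs in a given subgroup, computed with one value set at a time.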