This validation study was conducted using BGB-A317-208 trial data. BGB-A317-208 (NCT0341989) was an open-label, multicenter, international, Phase 2 clinical trial assessing the efficacy and safety of the anti-PD-1 monoclonal antibody tislelizumab in patients with unresectable HCC. Enrolled patients received tislelizumab (200 mg) intravenously every three weeks for a total of three or more 21-day treatment cycles, followed by long-term safety and survival assessments.
2.0.1 Patients
Patients were male and female adults (≥ 18 years of age), enrolled from international study sites, with histologically confirmed HCC that was not amenable to a curative treatment approach and who had received ≥ 1 line of systemic therapy for unresectable HCC. All patients were required to have an Eastern Cooperative Oncology Group (ECOG) performance status score of ≤ 1 [19].
2.0.2 Measures
HRQoL was assessed using three patient-reported outcome (PRO) instruments: the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire Core 30 (EORTC QLQ-C30), the corresponding HCC-specific module (QLQ-HCC18), and the EQ-5D-5L. These PROs were collected at baseline and the first day of treatment cycle 2 (week 3), then every other treatment cycle up to cycle 12 (week 36). At each treatment cycle visit, the PRO administration occurred prior to any clinical activities or dosing. For purposes of this psychometric analysis, only QLQ-HCC18 and QLQ-C30 results are reported (the EQ-5D-5L was not employed in validation).
The EORTC QLQ-C30 [20] is a validated generic HRQoL instrument for cancer patients. It comprises a global health status/QoL (GHS) scale (two items); five functional scales: physical functioning (five items), role functioning (two items), emotional functioning (four items), cognitive functioning (two items), and social functioning (two items); three symptom scales: fatigue (three items), nausea and vomiting (two items), and pain (two items); and six single items: dyspnea, insomnia, appetite loss, constipation, diarrhea, and financial impact [21]. The functional and symptom items are rated on a 4-point Likert scale (with 1 = ‘not at all’ to 4 = ‘very much’), while the GHS items are rated on a 7-point Likert scale (with 1 = ‘very poor’ to 7 = ‘excellent’). A high score on the GHS and functional scales indicates high HRQoL and a high level of functioning, whereas a high score on the symptom scales and items indicates a high level of symptom severity. The two individual GHS items were used as concurrent validators. The GHS scale of the QLQ-C30 was used as the PRO anchor variable in test-retest reliability, ability to detect change, and meaningful within-patient change analyses.
The EORTC QLQ-HCC18 [22] measures HCC-specific symptoms and HRQoL. The instrument is an 18-item scale, consisting of six symptom scales and two single items: fatigue (three items), body image (two items), jaundice (two items), nutrition (five items), pain (two items), fever (two items), sexual interest (one item), and abdominal swelling (one item). Items are rated on a 4-point Likert scale (with 1 = ‘not at all’ to 4 = ‘very much’); scaled scores for each domain range from 0 to 100, with a higher score indicating worse symptoms. In addition, an overall index score was calculated. Fatigue and index scores were prioritized in this validation exercise.
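The 0-100 scaling can be illustrated with a minimal Python sketch. It assumes the standard EORTC linear transformation for symptom scales (raw score = mean of answered items; scaled score = (raw − 1) / item range × 100) and the common rule that a scale is scored only when at least half of its items are answered. Neither detail is spelled out in this section, and the function name `symptom_scale_score` is hypothetical.

```python
from typing import Optional, Sequence


def symptom_scale_score(items: Sequence[Optional[int]],
                        item_range: int = 3) -> Optional[float]:
    """Scale 1-4 item responses to 0-100 (higher = worse symptoms).

    Assumption: the scale is scored only if at least half of its items
    were answered; otherwise the scale score is treated as missing.
    """
    answered = [x for x in items if x is not None]
    if not answered or 2 * len(answered) < len(items):
        return None
    raw_score = sum(answered) / len(answered)    # mean of answered items
    return (raw_score - 1) / item_range * 100    # linear transformation


# Hypothetical responses to the three fatigue items
fatigue = symptom_scale_score([2, 3, 2])
```

A patient answering ‘not at all’ to every item in a scale scores 0, and one answering ‘very much’ to every item scores 100.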
The ECOG performance status [19], a clinical measure of disease severity, was also used as a known-groups validator for this psychometric analysis. The ECOG criteria are used to assess how a patient’s disease is progressing and how the disease affects the patient’s activities of daily living. ECOG status was assessed at the baseline visit.
In addition, demographic and medical history data, including age, gender, race, geographic region, line of therapy, and viral hepatitis infection status, were collected at the screening visit.
2.1 Statistical Analyses
In accordance with existing and emerging FDA guidance [17, 18], psychometric validation of the QLQ-HCC18 was conducted to evaluate reliability (internal consistency and test-retest), construct validity (concurrent validity and known-groups validity), ability to detect change, and meaningful within-patient change (MWPC). These analyses were conducted using the safety population, which included all patients receiving at least one dose of tislelizumab. Known-groups validity and MWPC analyses were stratified on several pre-defined subpopulations, including region (Asia [China/Taiwan] versus Europe), line of therapy (second-line versus third-line or greater), and viral hepatitis infection status (HBV/HCV positive versus hepatitis negative). Table 1 provides a summary of these analyses.
Table 1
Summary of psychometric analyses of QLQ-HCC18
Property | Analysis Period | Definition | Test | Success Criterion |
Internal consistency | Baseline | Cronbach’s α | No test, point estimate reported | 0.70 ≤ α |
Test-retest reliability | Baseline to week 3 | ICC(2,1) | No test, point estimate reported | 0.70 ≤ ICC(2,1) |
Concurrent validity | Baseline | Spearman correlations | No test, point estimate reported | 0.40 ≤ |r| |
Known-groups validity | Baseline | Mean, mean difference, 95% CI, p-value, R² effect size | ANOVA | p < 0.05; effect size ≥ 5% |
Ability to detect change | Baseline to week 9 | Mean change from baseline in scores between anchor groups (QLQ-C30 GHS), 95% CI, p-value, and ω² semi-partial effect size | ANCOVA | p < 0.05; effect size ≥ 5% |
Meaningful within-patient change | Baseline to week 9 | Mean change from baseline in relation to change in anchor groups (QLQ-C30 GHS improvement, maintenance, deterioration) eCDFs and ePDFs plotted | No test, point estimates reported | No criterion, estimates reported |
QLQ-HCC18: Quality of Life Questionnaire – Hepatocellular Carcinoma 18-question module; ICC: intraclass correlation coefficient; CI: confidence interval; ANOVA: analysis of variance; ANCOVA: analysis of covariance; QLQ-C30 GHS: Quality of Life Questionnaire – Core 30 global health status/QoL scale; eCDF: empirical cumulative distribution function; ePDF: empirical probability density function.
Descriptive statistics for continuous variables were reported as means, standard deviations (SDs), medians, and numbers of missing values. Descriptive statistics for categorical variables were reported as frequency counts and the percentage of patients in corresponding categories. Statistical significance was evaluated using a two-tailed α = 0.05 level. Missing data for the QLQ-HCC18 and QLQ-C30 were handled according to the developers’ manuals and no imputation was carried out [22, 23]. All analyses were performed using SAS (version 9.4) and R statistical software (version 3.6.1).
2.1.1 Reliability
Internal consistency evaluates score reliability by assessing the strength with which each item measures an assumed single domain. Internal consistency was assessed for each of the multi-item QLQ-HCC18 scales at baseline using Cronbach’s alpha [24]. Internal consistency estimates of ≥ 0.70 were considered acceptable [23].
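As an illustration of the internal consistency estimate, the following Python sketch computes Cronbach’s alpha from a patients-by-items matrix using the usual formula α = k / (k − 1) × (1 − Σ item variances / variance of the sum score). The data are invented for the example, and the function name `cronbach_alpha` is hypothetical; the study itself used SAS and R.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete responses (one row per patient)."""
    k = len(rows[0])                                   # number of items
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])       # variance of sum score
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of four patients to a two-item scale
responses = [[1, 2], [2, 1], [3, 4], [4, 3]]
alpha = cronbach_alpha(responses)      # 0.75 for these invented data
acceptable = alpha >= 0.70             # criterion used in this study
```

The sketch assumes complete data; patients with missing items would need to be handled before calling the function.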
Test-retest reliability measures the degree to which an instrument reproduces scores over time in subjects whose condition has not changed [18]. Patients whose responses on the QLQ-C30 GHS scale anchor reflected no change in status between baseline and the first follow-up at week 3 were considered a stable subgroup, and test-retest reliability was assessed in this subgroup for each of the QLQ-HCC18 scales and single items. For continuous scores, one appropriate measure of test-retest reliability is the two-way random-effects intraclass correlation coefficient (ICC), denoted ICC(2,1) [25], which was employed in this analysis. Test-retest reliability estimates of ≥ 0.70 indicate satisfactory reliability [26]. Both unconditional estimates and estimates conditioned on no change in GHS were computed. Consistent with regulatory guidance, only estimates derived from the primary GHS anchor-based no-change definition (NC1, defined by a GHS change score of 0 between baseline and week 3) are reported [17, 18, 27]. To limit the impact of possible treatment effects, three alternative definitions of no change were examined in sensitivity analyses: unconditional, + 1 response category (‘NC2’), or + 2 response categories (‘NC3’). None of these definitions outperformed the pre-specified primary NC1 definition reported in this manuscript.
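The ICC(2,1) computation in a stable subgroup can be sketched as follows. This is a minimal illustration, assuming the Shrout and Fleiss two-way random-effects, single-measure formula, ICC(2,1) = (BMS − EMS) / (BMS + (k − 1)·EMS + k·(JMS − EMS) / n), with patients as rows and the two assessments as columns. The records and the name `icc_2_1` are hypothetical.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure."""
    n = len(scores)        # patients
    k = len(scores[0])     # assessments per patient (here: baseline, week 3)
    grand = sum(sum(r) for r in scores) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(r[j] for r in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = sum((scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k))

    bms = ss_rows / (n - 1)                # between-patient mean square
    jms = ss_cols / (k - 1)                # between-assessment mean square
    ems = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Hypothetical records: (baseline score, week-3 score, GHS change score)
records = [(10.0, 12.0, 0), (20.0, 22.0, 0), (30.0, 32.0, 0), (50.0, 10.0, -2)]
stable = [(b, w) for b, w, ghs_change in records if ghs_change == 0]  # NC1
icc = icc_2_1(stable)
```

Because ICC(2,1) measures absolute agreement, the constant 2-point shift between assessments in the stable patients lowers the estimate slightly below 1 even though their rank order is perfectly preserved.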
2.1.2 Construct Validity
Construct validity was assessed by tests of both concurrent validity and known-groups validity. Concurrent validity is a component of construct validity representing the extent to which two scales assessing similar constructs are related. This was estimated from Spearman correlations between the QLQ-HCC18 and QLQ-C30 scores at baseline. Larger positive correlations reflect convergent validity, while small or negative correlations reflect divergent or discriminant validity [28]. Spearman correlations with an absolute value of 0.40 or greater met the pre-specified criterion for acceptable concurrent validity [28].
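For readers less familiar with the statistic, a Spearman correlation is a Pearson correlation computed on ranks, with tied values given their average rank. The sketch below shows this in plain Python and applies the 0.40 absolute-value criterion. The scores and all function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical baseline scores: a symptom scale versus a functioning scale
hcc18_fatigue = [11.1, 33.3, 55.6, 77.8]
c30_physical = [93.3, 80.0, 60.0, 40.0]
rho = spearman(hcc18_fatigue, c30_physical)
meets_criterion = abs(rho) >= 0.40
```

A symptom scale and a functioning scale are scored in opposite directions, so a strong relationship between them appears as a negative correlation, which is why the criterion is applied to the absolute value.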
Known-groups validity assesses whether PRO scores can be differentiated between clinically distinct groups. Known-groups validity was estimated for the QLQ-HCC18 scores at baseline. Known-groups validators included geographic region (Asia versus Europe), line of therapy (second-line versus third-line or greater), ECOG status (0 versus 1), and viral hepatitis infection status (HBV/HCV positive versus hepatitis negative). The difference in QLQ-HCC18 scores between known groups was calculated and contrasted using analysis of variance (ANOVA), from which the mean difference between known groups, corresponding 95% confidence interval (CI), p-value, and R-squared (R²) effect size were estimated. Acceptable known-groups validity was achieved if a preponderance of the known groups had QLQ-HCC18 mean scores consistent with clinical expectations (i.e., more severe groups had worse symptoms or HRQoL compared to less severe groups). Such evidence was strengthened when the corresponding differences across known groups were statistically significant and the corresponding R² was greater than 5%.
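The two point estimates from a known-groups comparison, the mean difference and the R² effect size, can be sketched in a few lines of Python. In a one-way ANOVA, R² is the between-group sum of squares divided by the total sum of squares. The p-value and confidence interval are omitted here because they require an F or t distribution, which the standard library does not provide. Scores, group labels, and the function name are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def known_groups_summary(groups: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Mean difference and R-squared for a two-group one-way ANOVA."""
    all_scores = [x for g in groups.values() for x in g]
    grand = mean(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    less_severe, more_severe = groups.values()   # exactly two groups expected
    return {
        "mean_difference": mean(more_severe) - mean(less_severe),
        "r_squared": ss_between / ss_total,
    }


# Hypothetical baseline index scores by ECOG status
summary = known_groups_summary({
    "ECOG 0": [10.0, 20.0, 30.0],
    "ECOG 1": [40.0, 50.0, 60.0],
})
large_enough = summary["r_squared"] >= 0.05      # effect size criterion
```

A positive mean difference here means the more severe group reported worse symptoms, which is the direction expected clinically.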
2.1.3 Ability to Detect Change
Ability to detect change is a facet of longitudinal validity that evaluates the relationship between changes in the PRO instrument of interest over time in the context of changes in external criteria (i.e., ‘anchors’) [29]. Ability to detect change was assessed by analyzing the extent to which QLQ-HCC18 change scores could be predicted by change in the QLQ-C30 GHS anchor variable. The QLQ-C30 GHS anchor groups were operationalized as follows: improvement was defined by > 0-point change from baseline to week 9; maintenance was defined as 0-point change from baseline to week 9; deterioration was defined as < 0-point change from baseline to week 9.
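The anchor-group definitions translate directly into a small classification rule. The Python sketch below is illustrative only; the function name `anchor_group` and the example scores are hypothetical.

```python
def anchor_group(ghs_baseline: float, ghs_week9: float) -> str:
    """Classify a patient by change in QLQ-C30 GHS from baseline to week 9."""
    change = ghs_week9 - ghs_baseline
    if change > 0:
        return "improvement"      # higher GHS indicates better HRQoL
    if change < 0:
        return "deterioration"
    return "maintenance"


# Hypothetical (baseline, week 9) GHS scores for three patients
groups = [anchor_group(b, w) for b, w in [(50.0, 66.7), (50.0, 50.0), (66.7, 50.0)]]
```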
Analysis of covariance (ANCOVA) was used to estimate differences in QLQ-HCC18 change score marginal means across QLQ-C30 GHS anchor groups (improvement [effect] versus maintenance [reference]; deterioration [effect] versus maintenance [reference]), controlling for age, gender, region, and baseline QLQ-HCC18 mean. Effect size estimates were based on the omega-squared (ω²) statistic [30]. Acceptable ability to detect change was pre-specified as estimates meeting the following criteria: significant differences (p < 0.05) in marginal means across anchor group contrasts and effect sizes exceeding 5%.
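The manuscript does not state the exact ω² formula applied by its software. As a hedged illustration, the sketch below assumes one commonly used semi-partial estimator, ω² = (SS_effect − df_effect × MS_error) / (SS_total + MS_error), computed from quantities in an ANCOVA table. The input values are invented, and the function name is hypothetical.

```python
def semipartial_omega_squared(ss_effect: float, df_effect: int,
                              ms_error: float, ss_total: float) -> float:
    """Semi-partial omega-squared from ANCOVA sums of squares.

    Assumption: this estimator is one common definition; the study's
    software may use a different variant.
    """
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Hypothetical ANCOVA table entries for the anchor-group effect
omega_sq = semipartial_omega_squared(ss_effect=30.0, df_effect=2,
                                     ms_error=5.0, ss_total=200.0)
exceeds_threshold = omega_sq > 0.05     # effect size criterion of 5%
```

Subtracting df_effect × MS_error in the numerator corrects for the variance an effect would appear to explain by chance alone, which is why ω² is less biased than R² in small samples.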
2.1.4 Meaningful Within-patient Change
Traditional estimation of meaningful change thresholds has relied on distribution-based and anchor-based methods. Increasingly, regulatory reviewers are emphasizing the latter; therefore, anchor-based methods were the focus of the current analyses [17, 18, 27]. Furthermore, traditional estimates have emphasized between-group differences (e.g., minimally important differences or minimal clinically important differences). The FDA has justifiably taken the position that within-patient change is not acceptably approximated from between-group differences. Instead, regulatory guidance emphasizes MWPC for the derivation of clinical significance estimates [18].
Anchor-based methods aim to define the magnitude of MWPC on a PRO instrument of interest among patients classified as experiencing meaningful change (improvement/deterioration) on an ‘anchor’. Anchor-based MWPC thresholds were obtained by calculating the mean change in QLQ-HCC18 scores from baseline to week 9, stratified on the QLQ-C30 GHS anchor groups described above. In addition to primary analyses based on the total sample, meaningful improvement estimates were stratified by geographic region (Asia versus Europe), line of therapy (second-line versus third-line or greater), and viral hepatitis infection status (HBV/HCV positive versus hepatitis negative). These estimates of mean change were then validated by visualizing differences in the cumulative proportions achieving the point estimates, stratified on anchor groups, via empirical cumulative distribution functions (eCDFs) and empirical probability density functions (ePDFs).
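A minimal Python sketch of the anchor-based approach follows: the mean QLQ-HCC18 change is computed within each anchor group, and an eCDF then gives the proportion of patients in each group whose change reaches the improvement-group estimate. The change scores are invented and the helper names are hypothetical; plotting is omitted.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def ecdf(values: Sequence[float], x: float) -> float:
    """Empirical CDF: proportion of values less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


def changes_by_group(records: Sequence[Tuple[float, str]]) -> Dict[str, List[float]]:
    grouped: Dict[str, List[float]] = {}
    for change, group in records:
        grouped.setdefault(group, []).append(change)
    return grouped


# Hypothetical (QLQ-HCC18 change from baseline to week 9, anchor group).
# Higher QLQ-HCC18 scores mean worse symptoms, so improvement is negative.
records = [(-12.0, "improvement"), (-8.0, "improvement"), (-10.0, "improvement"),
           (0.0, "maintenance"), (-2.0, "maintenance"),
           (9.0, "deterioration"), (11.0, "deterioration")]

by_group = changes_by_group(records)
threshold = mean(by_group["improvement"])      # anchor-based point estimate
share_reaching = {g: ecdf(v, threshold) for g, v in by_group.items()}
```

Clear separation between the groups' eCDF values at the threshold supports the estimate; heavy overlap would suggest the threshold does not distinguish improved from stable or deteriorated patients.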