Development and Validation of a Voice Script for Telephone Administration of the EORTC QLQ-C30

Claire Piccinin (claire.piccinin@eortc.org), European Organisation for Research and Treatment of Cancer, https://orcid.org/0000-0002-3918-1174
Madeline Pe, European Organisation for Research and Treatment of Cancer
Dagmara Kuliś, European Organisation for Research and Treatment of Cancer
James W. Shaw, Bristol-Myers Squibb
Sally J. Wheelwright, University of Southampton Faculty of Health Sciences
Andrew Bottomley, European Organisation for Research and Treatment of Cancer

patient-reported outcome (PRO) measurement in cancer clinical trials [4], it is available for use in over 110 language versions, having undergone extensive testing to demonstrate its psychometric [5] and cultural [6] validity.
The majority of QLQ-C30 items (n=28) use a four-option Likert response scale ranging from 1 ("not at all") to 4 ("very much"), capturing the presence and/or severity of a symptom or issue and its impact on QOL. The final two items, which make up the global health status and QOL scale, are rated on a scale from 1 to 7, with 1 indicating "very poor" and 7 "excellent". The time frame for all items is "during the past week", with the exception of the first five items (the physical functioning scale), for which no specific timeframe is used, given the intent to capture a more global impact on physical functioning not limited to a one-week recall period. All single items and multi-item scales in the questionnaire are scored and transformed onto a 0-100 scale, with higher scores on the functional and global health status/QOL scales indicating higher levels of functioning and QOL, and higher scores on symptom scales and single items indicating a higher degree of symptomatology and problems.
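The transformation described above follows the linear scoring convention of the EORTC scoring manual: compute the raw score (the mean of a scale's item responses) and map it linearly onto 0-100, reversing the direction for functional scales. A minimal sketch in Python (the function names are illustrative, not the official scoring software):

```python
def raw_score(items):
    """Mean of the item responses making up a scale (all items answered)."""
    return sum(items) / len(items)

def functional_score(items, item_range=3):
    """Functional scales: higher transformed score = better functioning.
    item_range = max response minus min response (3 for 1-4 items)."""
    rs = raw_score(items)
    return (1 - (rs - 1) / item_range) * 100

def symptom_score(items, item_range=3):
    """Symptom scales and single items: higher score = more symptomatology."""
    rs = raw_score(items)
    return ((rs - 1) / item_range) * 100

def global_qol_score(items, item_range=6):
    """Global health status/QOL: two 1-7 items, higher score = better QOL."""
    rs = raw_score(items)
    return ((rs - 1) / item_range) * 100

# e.g., answering "not at all" (1) to every physical functioning item
# yields the best possible functional score:
print(functional_score([1, 1, 1, 1, 1]))  # 100.0
print(symptom_score([4, 4]))              # 100.0
print(global_qol_score([7, 7]))           # 100.0
```

Note that the official scoring manual also defines rules for handling missing items (a scale may still be scored when at least half of its items were answered); those rules are omitted from this sketch.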
In addition to its frequent use in cancer research and clinical trials [2, 7], the QLQ-C30 is increasingly being used for monitoring purposes in clinical practice [8]. By providing a direct means of measuring core symptoms and issues from the patient's perspective, the QLQ-C30 provides clinically meaningful information, distinct from that offered by clinical markers and clinicians' ratings [2, 7, 9]. In 2018, the EORTC Quality of Life Group (QLG) published guidelines to help facilitate the use and migration of EORTC questionnaires into electronic PRO (ePRO) formats (e.g., computer, tablet) [10]. A computerised adaptive testing (CAT) version of the QLQ-C30, the EORTC CAT Core [11], is also available; it consists of dynamic item banks corresponding to the QLQ-C30's 14 functional and symptom domains.
The purpose of this study was to pilot test the provisional QLQ-C30 phone script through cognitive debriefing interviews to ensure its acceptability and relevance, amending it if needed, and subsequently to validate the phone-administered version by carrying out equivalence testing between the paper and phone administration modes in a population of patients actively undergoing cancer treatment. An intraclass correlation coefficient (ICC) >0.70, the recommended threshold for demonstrating equivalence between modes of administration, was employed for the purpose of equivalence testing [12]. Previous research supports the use of ICC >0.70, as demonstrated in studies by Lundy and colleagues [13, 14], in which an interactive voice response (IVR) version of the QLQ-C30 was developed.
Similarly, in an equivalence study comparing tablet computer, IVR, and paper-based administration of the PRO-CTCAE [15], the degree of mode equivalence was assessed using ICC >0.70. Although previous work by Lundy and colleagues demonstrated the equivalence of an IVR version of the QLQ-C30 to its paper administration [13, 14], this is the first project aimed at validating a voice script for phone administration of the QLQ-C30 by an interviewer. A considerable body of research comparing paper to screen-based (e.g., tablet, computer) administration of PROs has demonstrated high levels of reliability between the two modes [16, 17, 18], but less work has compared paper administration to auditory modes (e.g., IVR, phone interview). Still, the existing research suggests that equivalence can be established between paper and phone PRO administration [13, 15].

Methods
Patient recruitment and data collection, management, and analysis were subcontracted to Mapi/ICON plc, who provided a final report to the EORTC detailing the methodology and findings.

Sample
Recruitment was carried out through a UK-based recruitment agency. Patients were eligible to participate if they were 18 years or older, currently receiving cancer treatment as confirmed by a clinician, able to read and understand English, voluntarily agreed to participate in the study, and provided written informed consent.

Pilot testing
Five patients were interviewed to test the acceptability, understanding, and relevance of the instructions for the QLQ-C30 voice script.

Equivalence testing
In addition to the previously described eligibility criteria, patients in the equivalence testing were required to have no changes in treatment planned between completion of the paper and phone versions. To support equivalence between paper-and-pen and phone administration modes using an ICC >0.70 and a minimally acceptable level of 0.50, a sample size of 63 patients was required [12]. Two waves of recruitment were conducted. In the first wave, 50 patients were recruited, the appropriate number for an equivalence threshold of ICC >0.90. However, protocol deviations were observed: only 26 patients completed the paper and phone versions of the QLQ-C30 within the pre-specified 2-day timeframe. A second wave of recruitment was therefore conducted to address these limitations. Thirty-seven additional patients were recruited based on the same eligibility criteria, bringing the total sample size to 63.

Pilot testing
Patient interviews were conducted by trained qualitative researchers and audio-recorded for the purpose of analysis. Interviews lasted approximately 60 minutes and were based on a study-specific interview guide, which contained a summary of the methods for conducting the interview, along with semi-structured questions. The guide also contained questions regarding demographic and clinical variables to capture during the interview. Patients' responses were recorded anonymously on a grid detailing results per patient, and the results were qualitatively reviewed and summarized. The QLQ-C30 phone script was subsequently revised accordingly. The interview recordings were destroyed after completion of the analysis, with an anonymised copy retained for the study files.
Equivalence testing

A randomised, cross-over design was used to compare the self-administered paper version and the interviewer-administered phone version of the QLQ-C30 in patients currently receiving treatment for cancer, following recommendations set out by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) PRO Mixed Modes Good Research Practices Task Force [19]. Patients were randomised (1:1) to complete either the paper or the phone-administered version first. After providing informed consent, each patient completed a brief sociodemographic and clinical form. Depending on randomisation, patients were then asked either to complete the paper version of the QLQ-C30 and return it to the recruitment agency in a prepaid envelope, or to respond to the questionnaire by phone, following the phone script as presented by the interviewer, a trained qualitative researcher. The interviewers recorded patients' responses on a paper version of the QLQ-C30. The paper version of the QLQ-C30 was estimated to take approximately 30 minutes to complete, and the administration time for the phone version was recorded for each patient. Any comments or observations made by the patient during the phone administration were recorded on a feedback form.
Two days after the first completion of the QLQ-C30, patients were asked to complete it again using the other mode of administration. The date of completion of the paper version was noted for each patient to assess compliance with the pre-specified two-day time frame. For patients who completed the phone interview first, the recruitment agency waited for confirmation of interview completion from the study team before sending the paper version by post.

Data Analysis
Patients were described in terms of clinical and socio-demographic variables, as reported during the phone interview (pilot testing) or on the socio-demographic/clinical form (equivalence testing). Age, gender, educational status, and disease history were reported. All data processing and analyses were performed with SAS® software for Windows, Version 9.2 or later (SAS Institute, Inc., Cary, NC, USA).

Pilot testing
Feedback from patients was compiled in an analysis grid, and reported per patient based on a qualitative assessment of the questionnaire, its instructions and individual items, with any additional comments also recorded.

Equivalence testing
All patients who met the inclusion criteria and completed enough items in the QLQ-C30 questionnaire during each administration for each domain to be scored were included in the equivalence testing analysis. Responses to items from the QLQ-C30 were described based on completion and distribution of responses per administration mode. Missing data were described in terms of number and percent of missing responses per item along with number and percent of missing items per patient, including the number of patients with at least one missing item. Continuous variables were described based on their frequency, mean, standard deviation, median, rst and third quartiles, and minimum and maximum values. Categorical variables were described based on the frequency and percentage of each response choice, with missing data included in the calculation of percentage.
Equivalence testing was performed at both the item and domain score levels, with the primary objective of evaluating equivalence at the score level between the two modes of administration using the ICC [20]. The widely used benchmark of ICC >0.70 was applied [21], with ICC values between 0.75 and 0.90 indicating good agreement and values greater than 0.90 indicating excellent agreement [22]. Weighted kappa coefficients [23] were used to assess the extent to which both administration modes produced the same patient responses to the QLQ-C30 items (results are reported in Appendix A). Following Fleiss' guidelines [24], a kappa value greater than 0.75 was characterized as excellent, 0.40-0.75 as fair to good, and less than 0.40 as poor. Mean differences in item-level scores were also calculated and are displayed in Appendix B.
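As an illustration of the primary analysis, agreement between two administration modes can be computed as a single-measure, absolute-agreement ICC derived from a two-way ANOVA decomposition (ICC(A,1) in the McGraw and Wong taxonomy). The sketch below is a minimal illustration, not the study's actual SAS code; the function name and the paper/phone scores are fabricated for demonstration.

```python
import numpy as np

def icc_a1(scores):
    """Single-measure, absolute-agreement ICC (ICC(A,1), two-way model)
    for an n-subjects x k-modes matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-mode means
    # Mean squares from the two-way ANOVA decomposition
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # modes
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative (fabricated) domain scores: paper vs. phone for six patients.
paper = [66.7, 33.3, 83.3, 50.0, 100.0, 16.7]
phone = [66.7, 41.7, 83.3, 50.0, 91.7, 16.7]
icc = icc_a1(np.column_stack([paper, phone]))
print(f"ICC(A,1) = {icc:.2f}")  # well above the 0.70 benchmark here
```

Because this form of the ICC penalises systematic shifts between modes (not just poor correlation), it is a stricter criterion than a consistency ICC, which is why it is preferred for mode-equivalence testing.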
To ensure robustness of results across the two waves of recruitment, a sensitivity analysis was conducted to compare the ICC values between patients included prior to the study amendment (first wave of recruitment: n=26) and those included after (second wave of recruitment: n=37), using scores from the paper and phone administration modes of the QLQ-C30. Additional sensitivity analyses were conducted on the full group of patients included in the equivalence testing (n=63) to compare ICC scores by age (<60 vs. >60) and gender.

Results

Sample

Pilot Testing

Five patients (three males and two females) with a mean age of 51 years completed the pilot testing interviews. Patients had liver, testicular, or bowel cancer or lymphoma, and one patient had breast, lung, and bowel cancer, as well as secondary liver cancer. More details regarding demographic and clinical characteristics are provided in Table 1.

Equivalence Testing

Sixty-three patients (26 from the first wave and 37 from the second wave) made up the total sample included in the equivalence testing. Patients had a mean age of 55 years and 65% were female. Almost half of the sample (48%) was employed full- or part-time and 76% of patients were living as a couple. Education levels varied, with 41% of patients having obtained a bachelor's or postgraduate degree. Breast cancer was the most common disease type, reported in 29% of patients, followed by prostate (11%), lung (10%), and bowel (6%) cancers. A large proportion of patients (41%) reported "other" disease types. The majority of patients were undergoing chemotherapy (25%) or hormone therapy (16%); other types of treatment included surgery (11%), radiotherapy (10%), biological therapy (13%), mixed therapy (8%), and "other" types of treatment (18%). Detailed demographic and clinical characteristics are provided in Table 2, grouped by whether patients completed the paper (n=31) or phone (n=32) version of the QLQ-C30 first.

Pilot Testing

All patients considered the instructions in the phone script to be clear and straightforward. Three comments were raised concerning the time and response scales of the questionnaire. Two patients made comments regarding the time scales, but these deviated from the source questionnaire and were thus not integrated into the script. One patient suggested numbering the response options from 1 to 4 for clarity. After discussion with the study team, numbers 1 to 4 were added to the response options in the phone script, thereby creating the final version of the phone script in UK English.

Equivalence Testing
All patients from both testing waves (n=63) completed all items in both the paper and phone versions of the QLQ-C30, and there were no missing data. Mean differences in domain-level scores were assessed between administration modes and are shown in Table 4. Results for mean differences at the item level are displayed in Appendix B. At the domain level, differences between modes were minimal in absolute magnitude, ranging from 0.00 to 11.00 points. The mean time for completion of the phone version of the QLQ-C30 was 8.6 ± 1.9 minutes, and 39 participants (62%) made comments or asked questions during the interview.
Sensitivity analyses comparing patients included before the study amendment (n=26) with those included after (n=37) revealed significant differences (i.e., non-overlapping 95% CIs) only for the nausea and vomiting ICC, which was lower in the first wave of patients, and the constipation ICC, which was lower in the second wave. The full results are displayed in Table 5. The results of additional sensitivity analyses to assess possible differences in scores based on age (<60 versus >60) and gender are displayed in Tables 6 and 7.

Discussion
This study aimed to develop and validate a voice script for phone administration of the QLQ-C30 and evaluate its equivalence to paper administration in a sample of patients actively undergoing cancer treatment. During pilot testing, the voice script was deemed understandable and relevant with minimal comments received from patients.
Results from the final sample of patients included in the equivalence testing indicated good equivalence between paper and phone administration modes, with all total ICC scores above the 0.70 threshold, ranging from 0.72 to 0.90. When paper administration came first, two ICC scores were found to be below the 0.70 threshold: nausea and vomiting (ICC 0.55; 95% CI 0.24-0.76) and financial difficulties (ICC 0.60; 95% CI 0.31-0.79). When comparing differences in means at the domain score level, the differences were still well below 10 points for the comparison of both administration modes, suggesting minimal differences despite the ICCs. Failure to reach the 0.70 ICC threshold for nausea and vomiting may also reflect greater ambiguity surrounding the rating of nausea. While vomiting is a more concrete occurrence, and it is unlikely that a patient's recollection would change over a 2-day timeframe, nausea may be subject to broader interpretation. Moreover, medications that help to resolve these symptoms on a day-to-day basis are generally readily available to patients, meaning that these symptoms can change within a two-day period.
In addition, a more general limitation of using the ICC to assess equivalence is that the absolute size of a given ICC depends on the variation observed within the sample. As such, minimal variation in nausea and vomiting scores may have contributed to the lower ICC. Still, the ICC for nausea and vomiting was well above 0.50 when paper administration came first, indicating that it remains within the minimally acceptable range, especially since the total ICC was over 0.70. It is worth noting that the nausea and vomiting domain score has performed poorly in a previous test-retest study carried out by Hjermstad and colleagues [25], so there may be other factors influencing that scale which were not identified in this study. Such factors could also account for the lower ICC score found for nausea and vomiting when the paper version was administered first.
Differences in mean scores at the domain score level were uniformly minimal, suggesting that, overall, results from both administration modes were equivalent. The relatively short completion time of 8.6 ± 1.9 minutes for the voice script suggests that it can be integrated into a study protocol with relative ease and minimal patient burden.
Following guidelines from ISPOR's PRO Mixed Modes Good Research Practices Task Force, and drawing on methodology used in similar PRO equivalence studies [13, 15], this study had a number of strengths.
The randomized cross-over design helped to minimize the potential for bias towards either administration mode, and the inclusion criteria ensured that the voice script was evaluated and tested by patients for whom it would be relevant and feasible (i.e., those actively undergoing treatment, with the appropriate language level). The final sample of patients was diverse and sufficiently well balanced in terms of demographic and clinical characteristics, helping to ensure representativeness across patients and disease types. The analyses were also strengthened by the absence of missing data in both the paper- and phone-administered versions of the questionnaire, making the results more easily interpretable.
The decision to lower the initial ICC threshold from >0.90 to >0.70, following the study amendment to include a second wave of testing, is well supported by the literature, in which an ICC >0.70 has also been used to evaluate equivalence in similar studies [13, 14, 15].
Moreover, the total ICCs for both waves were largely similar. Significant differences were only observed for nausea and vomiting, which was lower in the first wave, and constipation, which was lower in the second wave of testing. Although the same recruitment procedures and inclusion criteria were applied, and demographic and clinical characteristics were largely comparable between groups, differences in gender and age distribution were found between the two waves of testing. In light of these differences, sensitivity analyses were carried out across all participants by age and gender. While most scores were similar across groups, differences were found in ICC scores for the nausea and vomiting and diarrhoea domain scores by group, with younger patients scoring lower on these domains compared to older patients. Males also scored lower than females on both the nausea and vomiting and physical functioning scales.
Despite these findings, when examining all other ICCs and comparing them across subgroups, no consistent pattern was identified which would support a potential correlation between lower or higher ICCs and the age or gender of patients. Moreover, the limited sample size makes it difficult to draw robust conclusions at the subgroup level. Factors other than age and gender may be related to the experience of disease and treatment and may help to account for the differences observed; however, such interpretations are beyond the scope of this study, which is limited to the available demographic and clinical data.
Overall, sensitivity analyses showed that differences observed between the two waves of testing are minimal, thereby further supporting the equivalence of paper-and-pen and phone administration modes.

Conclusions
Results from this study support the equivalence of paper and phone administration modes of the QLQ-C30, consistent with findings from similar studies evaluating mode equivalence of other PRO measures.
In addition to its initial source language (UK English) development, the QLQ-C30 voice script is now available in multiple other languages, with more translations anticipated in the future. By providing an alternative means of questionnaire completion, the QLQ-C30 voice script helps to ensure that the questionnaire remains accessible in multiple formats across a wide range of patients.

Declarations
Funding: This research was funded by Bristol-Myers Squibb.