This inter-day test-retest reproducibility study was planned as one of two separate reproducibility studies, both of which were part of a randomized controlled multicenter trial (RCT) (ClinicalTrials.gov identifier: NCT02667171) investigating the effect of pulmonary tele-rehabilitation and conventional PR in patients with severe and very severe (FEV1 <50%) COPD36,37. We followed the Guidelines for Reporting Reliability and Agreement Studies (GRRAS)16.
Eligible patients for the RCT were identified and recruited by respiratory nurses during outpatient COPD control visits at the University Hospitals Amager, Hvidovre, Bispebjerg, Frederiksberg, Herlev, Gentofte, Frederikssund and Hillerød. All patients provided written informed consent. The RCT was approved by the Ethics Committee of the Capital Region of Denmark (H-15019380) and the Danish Data Protection Agency (jr.no.: 2012–58-0004).
All patients who agreed to participate in the RCT were consecutively asked to participate in the reproducibility study, which required an extra assessment visit prior to randomization and the start of the intervention. A consecutive convenience sample of 50 patients was chosen in accordance with the COSMIN recommendation (supplement 1 - flowchart)17.
Inclusion and exclusion criteria36 corresponded to the criteria for outpatient hospital-based routine PR in the Capital Region of Copenhagen, Denmark, and pertained to adults with a clinical diagnosis of COPD (defined as an FEV1/FVC ratio <0.70), FEV1 <50% and MRC ≥2 who had not participated in PR within the prior six months36.
The questionnaires were administered at the Respiratory and Physical Therapy Departments of five University Hospitals (Hvidovre, Bispebjerg, Herlev, Gentofte and Frederikssund) in Greater Copenhagen. The patients completed the questionnaires during a break between two sets of performance tests, i.e. the six-minute walk test and the 30-second sit-to-stand test (Figure 1). Ten raters administered the questionnaires. They were familiar with the questionnaires from clinical practice and had obtained accreditation as raters. The administration on the first test day (T1) was conducted by one rater, and a different rater conducted the administration on the second test day (T2). To ensure that the first administration of the questionnaires had no influence on the second, patients and raters were blinded to the previous responses, and the interval between the two administrations was 7–10 days. This interval was judged to be long enough to prevent recall bias and short enough to ensure that the patients had not changed on the constructs being measured.
The raters followed the same procedures (Figure 1), and the questionnaires were administered in the same location and at the same time of day during the outpatient clinics’ opening hours from 10am to 2pm, Monday to Friday. CAT, CCQ, HADS and EQ-5D were administered to all patients in the same order, and the patients filled out the questionnaires in an undisturbed room without interference from the rater. All patients received a brief, standardized pre-instruction from the rater: “Answer the questionnaires and questions consecutively in the prepared order. If you have difficulty understanding a question, I will help you with the clarification of the specific question when all other questions are answered. Take the time you need; you do not need to hurry.” Patients were instructed not to do any vigorous activities for three hours prior to the appointment and to take their prescribed medication as usual. The administration procedure reflects the conditions in everyday clinical practice, where several performance tests and questionnaires are conducted within a narrow time frame (Figure 1).
COPD Assessment Test (CAT) assesses the impact of COPD on self-reported health status and symptoms12. It is an 8-item questionnaire where each item scores from 0 to 5 points (0 indicating no impact or symptoms, 5 worst possible impact or symptoms) summing up to a total CAT score range of 0–40 points12.
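The CAT scoring rule described above can be sketched as follows. This is a minimal illustration only; the function name `cat_total` and the input validation are assumptions introduced here and are not part of the study's procedures.

```python
def cat_total(items):
    """Sum the eight CAT item scores (each 0-5) into a total score (0-40).

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(items) != 8:
        raise ValueError("the CAT has exactly 8 items")
    if any(not isinstance(s, int) or not 0 <= s <= 5 for s in items):
        raise ValueError("each CAT item must be an integer from 0 to 5")
    return sum(items)
```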
Clinical COPD Questionnaire (CCQ) assesses self-reported quality of life7. The CCQ consists of 10 items with a total score and three domain scores: Symptoms (4 items), Functional state (4 items) and Mental state (2 items). Total and domain scores range from 0 to 6 (0 = no impairment)7.
Hospital Anxiety and Depression Scale (HADS) assesses the level of anxiety and level of depression in medically ill persons38. The scale consists of two subscales, HADS anxiety (HADS-A) and HADS depression (HADS-D), each of which has seven questions with four possible answers (score range 0 to 3). A subscale total score of 0–7 is considered normal, 8–10 indicates a risk of anxiety or depression and 11–21 indicates considerable symptoms of anxiety or depression disorder38.
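As an illustration of the HADS cut-offs listed above, the following sketch sums one subscale and maps the total to the three bands. The function names and the band labels are hypothetical and chosen only for this example.

```python
def hads_subscale_total(answers):
    """Sum the seven answers (each scored 0-3) of one HADS subscale (0-21)."""
    if len(answers) != 7:
        raise ValueError("each HADS subscale has 7 questions")
    if any(a not in (0, 1, 2, 3) for a in answers):
        raise ValueError("each answer must be scored 0, 1, 2 or 3")
    return sum(answers)


def hads_category(total):
    """Map a subscale total to the bands given in the text (labels are illustrative)."""
    if not 0 <= total <= 21:
        raise ValueError("subscale total must be between 0 and 21")
    if total <= 7:
        return "normal"
    if total <= 10:
        return "risk"
    return "considerable symptoms"
```

The same two functions apply to HADS-A and HADS-D, since both subscales share the item count and the cut-offs.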
EuroQol 5-Dimension Questionnaire (EQ-5D) is a generic global questionnaire measuring health-related quality of life39. We used the three-level version (EQ-5D-3L), which has a descriptive system and a visual analogue scale. The descriptive system (EQ-5D) comprises five dimensions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression). Each dimension has three levels (no problem, some problem, severe problem), giving a total of 243 possible health states with utility scores ranging from -0.624 (worst possible health utility) to 1.0 (best possible health utility). The EQ-VAS records the overall self-rated health on a 20 cm vertical visual analogue scale ranging from zero (worst imaginable health) to 100 (best imaginable health)39.
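The figure of 243 follows from five dimensions with three levels each (3^5 = 243). A minimal sketch enumerating the possible profiles is shown below. The five-digit profile codes (e.g. "11111") follow common EQ-5D notation and are used here as an assumption; the mapping from profile to utility score depends on a value set that is not reproduced here.

```python
from itertools import product

DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)
LEVELS = (1, 2, 3)  # 1 = no problem, 2 = some problem, 3 = severe problem

# Every combination of one level per dimension, written as a five-digit code.
health_states = [
    "".join(str(level) for level in profile)
    for profile in product(LEVELS, repeat=len(DIMENSIONS))
]
```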
Demographic and descriptive variables
The following demographic and descriptive variables were registered at T1: age, gender, body mass index, smoking status, FEV1/FVC, FEV1, GOLD A/B/C/D stratification40, Charlson Comorbidity Index, BODE-index and oxygen supplementation41.
Descriptive data are presented as means with standard deviations (SD) for continuous data and as medians with range for ordinal data and data not normally distributed. Data distribution was inspected using histograms and Q-Q plots, and approximate normality was verified with the Shapiro–Wilk test. A paired t-test was used to test for systematic inter-day bias between the patients’ completed questionnaires at T1 and T2. The intra-class correlation coefficient (ICC) was calculated to describe reliability.
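The paired comparison between T1 and T2 can be sketched as below. This is an illustrative stdlib-only computation of the paired t statistic and its degrees of freedom, not the SPSS procedure used in the study; the function name and the direction of the difference (T2 minus T1) are assumptions.

```python
import math
from statistics import mean, stdev


def paired_t(t1_scores, t2_scores):
    """Return (t statistic, degrees of freedom) for paired T1/T2 scores.

    Differences are taken as T2 - T1 (an assumption made for this sketch).
    Raises ZeroDivisionError if all differences are identical.
    """
    if len(t1_scores) != len(t2_scores) or len(t1_scores) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [b - a for a, b in zip(t1_scores, t2_scores)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1
```

Obtaining the p-value additionally requires the t distribution; a library routine such as `scipy.stats.ttest_rel` returns both the statistic and the p-value.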
The ICC1.1 model was used because the assessments were conducted at five centers, and not every rater instructed each patient15,42. The ICC1.1 is a one-way random-effects model that addresses both systematic and random error. ICC values of 0–0.49 were considered to indicate weak, 0.50–0.75 moderate, >0.75–0.90 good and >0.90 excellent reliability43.
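For readers who wish to reproduce this type of analysis, the sketch below computes the ICC1.1 from the mean squares of a one-way analysis of variance, ICC = (MSB - MSW) / (MSB + (k - 1) × MSW), and applies the interpretation bands given above. It is an illustration under stated assumptions, not the SPSS routine used in the study; the function names are hypothetical, and values below 0.50 are all treated as weak.

```python
from statistics import mean


def icc_1_1(scores):
    """One-way random-effects, single-measurement ICC.

    scores: one list per patient, each holding k repeated measurements
    (k = 2 for a T1/T2 design).
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 patients with the same number (>= 2) of measurements")
    grand_mean = mean([x for row in scores for x in row])
    row_means = [mean(row) for row in scores]
    # Between-patient and within-patient mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(scores, row_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def reliability_label(icc):
    """Interpretation bands from the text; anything below 0.50 counts as weak."""
    if icc > 0.90:
        return "excellent"
    if icc > 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "weak"
```

Because the one-way model does not separate out a rater or occasion effect, any systematic shift between T1 and T2 ends up in the within-patient mean square and lowers the ICC.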
Agreement was calculated as the standard error of measurement (SEM) and the SEM95, using the equations SEM = SD × √(1 − ICC) and SEM95 = 1.96 × SEM, respectively15,42. The SEM expresses the measurement error that occurs within a single measurement when no real change has occurred. It indicates that there is a 68% likelihood that the “true” score of a group of patients (or, with the SEM95, a 95% likelihood that the “true” score of a single patient) lies within this measurement error35,43.
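A minimal sketch of the two agreement formulas follows. The text does not state which SD enters the formula, so the functions simply take an SD as input; the function names are hypothetical.

```python
import math


def sem(sd, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC)."""
    if icc > 1:
        raise ValueError("ICC cannot exceed 1")
    return sd * math.sqrt(1 - icc)


def sem95(sd, icc):
    """95% version of the SEM: 1.96 * SEM."""
    return 1.96 * sem(sd, icc)
```

For example, an SD of 10 points with an ICC of 0.75 gives an SEM of 5 points and an SEM95 of 9.8 points.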
The corresponding smallest real difference (SRD) was calculated as SRD = √2 × SEM and SRD95 = 1.96 × √2 × SEM. The SRD represents the smallest difference that can be detected beyond the measurement error of repeated measurements without real change, with 68% certainty for a group of patients (or, with the SRD95, 95% certainty for a single patient)42,44,45. The SEM, SEM95, SRD and SRD95 are presented in actual units. To facilitate comparison of our agreement parameters with results from other studies, these parameters were also expressed as a percentage of the mean of the two consecutive visits (grand mean).
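The SRD formulas and the conversion to a percentage of the grand mean can be sketched as below. Function names are hypothetical, and the grand mean is taken here as the mean of all T1 and T2 scores pooled, which is an assumption about how the percentage was formed.

```python
import math


def srd(sem_value):
    """Smallest real difference: sqrt(2) * SEM."""
    return math.sqrt(2) * sem_value


def srd95(sem_value):
    """95% smallest real difference: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem_value


def percent_of_grand_mean(value, t1_scores, t2_scores):
    """Express an agreement parameter as a percentage of the pooled T1/T2 mean."""
    grand_mean = (sum(t1_scores) + sum(t2_scores)) / (len(t1_scores) + len(t2_scores))
    return 100 * value / grand_mean
```

With an SEM of 5 points, the SRD is about 7.1 points and the SRD95 about 13.9 points; against a grand mean of 20 points, the SEM corresponds to 25%.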
The minimal important change (MIC), which is derived from longitudinal validity studies and preferably determined by anchor-based methods, represents the smallest amount of change in an outcome that might be considered important by the patient or clinician35,46. For evaluative purposes, it is important that the MIC can be distinguished from repeated measurement error. Therefore, we considered a questionnaire suitable for evaluative use when the SRD was smaller than the MIC.
Bland–Altman plots were used to visualize potential systematic bias around the zero line as well as heteroscedasticity. The mean difference with 95% CI and the 95% limits of agreement (95% LOA), calculated as mean difference ± 1.96 × SD of the differences, were included in the plots15,47. P values of less than 0.05 were considered statistically significant.
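The quantities drawn in a Bland–Altman plot can be sketched as follows. The direction of the difference (T2 minus T1) and the normal-approximation 95% CI for the mean difference are assumptions made for this illustration; the text does not specify either, and a t quantile would be more exact for small samples.

```python
import math
from statistics import mean, stdev


def bland_altman(t1_scores, t2_scores):
    """Return the summary values needed for a Bland-Altman plot."""
    if len(t1_scores) != len(t2_scores) or len(t1_scores) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [b - a for a, b in zip(t1_scores, t2_scores)]  # y-axis values
    pair_means = [(a + b) / 2 for a, b in zip(t1_scores, t2_scores)]  # x-axis values
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    half_ci = 1.96 * sd_diff / math.sqrt(len(diffs))  # normal approximation
    return {
        "pair_means": pair_means,
        "diffs": diffs,
        "mean_diff": mean_diff,
        "ci95": (mean_diff - half_ci, mean_diff + half_ci),
        "loa95": (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff),
    }
```

Plotting `diffs` against `pair_means`, with horizontal lines at `mean_diff` and the two `loa95` bounds, gives the usual display; a funnel-shaped scatter would suggest heteroscedasticity.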
Finally, we report the proportion of patients with the minimum and maximum score for each questionnaire, because this shows the population-specific risk of floor and/or ceiling effects. There is no consensus regarding cut-off values for floor or ceiling effects, but it has been suggested that such an effect is present if >15% of the participants achieve the lowest (floor) or highest (ceiling) score48. Floor and ceiling effects are of special interest in intervention studies, because patients with the lowest possible scores may not be able to decline further, and patients with the best possible scores may not be able to improve further, following an intervention. Data were analyzed using SPSS version 22.0 (SPSS Inc., Chicago, IL, USA).
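The floor and ceiling check described above can be sketched as below, using the suggested >15% threshold. The function name, the returned keys and the example scores are hypothetical and do not represent study data.

```python
def floor_ceiling(scores, min_score, max_score, threshold=0.15):
    """Proportion of patients at the lowest/highest possible score.

    An effect is flagged when the proportion exceeds the threshold (default 15%).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_prop = sum(s == min_score for s in scores) / n
    ceiling_prop = sum(s == max_score for s in scores) / n
    return {
        "floor_prop": floor_prop,
        "ceiling_prop": ceiling_prop,
        "floor_effect": floor_prop > threshold,
        "ceiling_effect": ceiling_prop > threshold,
    }


# Made-up scores on a 0-40 scale, for illustration only.
example = floor_ceiling([0, 0, 5, 10, 40, 20, 15, 30, 25, 12], min_score=0, max_score=40)
```

In this made-up example, 2 of 10 scores sit at the minimum (20%, flagged) and 1 of 10 at the maximum (10%, not flagged).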