Putting the Meaning Into Meaningful Change Research

Methods for deriving clinically meaningful change thresholds have advanced considerably in recent years; however, key questions remain about what the identified change score actually means for an individual patient or group of patients. This is particularly important in the case of ClinROs, where the translation from clinically meaningful change to patient-relevance in daily living is not clear. This paper provides case studies from an industry perspective, in which we have addressed this challenge using varied approaches. We have explored meaningful change at both the group and individual level.

Methods: We provide several case studies to illustrate different approaches to understanding and communicating a meaningful outcome on a ClinRO. These include alternative methods for interpreting group-level MCIDs, and several examples of linking ClinRO items to patient-relevant real-world concepts, e.g., through exit interviews, translation of ClinRO items into patient-friendly concepts, and use of the Rasch model to equate ClinRO items to real-world functional measures.


Introduction / Purpose (Stating The Main Purposes And Research Question)
Clinician-Reported Outcome (ClinRO) measures are used when patients are unable to self-report on their own health status or when clinical judgement is required to make an assessment (Powers et al. 2017). For ClinROs, the translation from a clinical trial score change into real-world impact is not always apparent. Multiple stakeholders require a clear link between a trial-based clinical change and real-life outcomes (e.g., Lievens et al. 2019).
The objective of this paper is to provide four illustrative cases in which we sought to establish the clinical meaningfulness of the concept measured and of the change scores of ClinROs through a variety of approaches. We explore examples at the group and individual level, with a focus on the variety of methods used rather than on the data examples presented here. Each case study provides examples of how we methodologically addressed the questions, and clearly identifies whether the methods within the case study support an evaluation of group- or individual-level change. The results focus on how we propose interpreting the outcomes. Three of the case studies focus on establishing patient-relevance of the meaning of score changes on ClinROs, with the last case study focusing on clinical approaches.

Methods And Results
Case study 1: Patient-centered perspective on individual-level meaningful change estimates for a COA, what do changes mean within a target population?

Context
Translating a ClinRO to patient-relevance is a common challenge. We describe a qualitative approach that translates ClinRO items into patient-friendly concepts, to allow in-depth exploration of how functional abilities relate to the ability to perform activities of daily living and how changes in functioning can impact health-related quality of life.

Methods
The Motor Function Measure 32 (MFM32) is a ClinRO measure, developed based on clinical expert input, assessing the motor function abilities of individuals with neuromuscular disease. It has been validated for use in individuals with Type 2 and Type 3 spinal muscular atrophy (SMA) aged 2-60 years (Berard et al. 2005; Vuillerot et al. 2013; Trundell et al. 2020). The MFM32 includes 32 items scored on a 0-3 point Likert scale [0 (unable to complete the ability) to 3 (able to complete the ability)], which are summed and transformed to a 0-100% total score, with higher scores associated with greater functioning. In-depth, semi-structured, qualitative interviews were conducted with individuals with SMA and caregivers from the US. To facilitate discussion with patients, a patient-friendly version of the MFM32 was created using the MFM User Manual (Berard et al. 2021) and input from clinical experts and patient advocacy groups. Clinical terminology used in the items of the validated clinician-reported version of the MFM32 was reworded into patient-friendly language, maintaining focus on the specific ability assessed by each item. Using the patient-friendly version of the MFM32, participants were asked to describe the Activities of Daily Living (ADLs) they considered to be related to the functional abilities assessed in the MFM32 (Berard et al. 2021), the relevance of score changes (at the item and total score level), and the impact these changes might have on their ability to perform ADLs.
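As a minimal sketch of the scoring described above, the following Python function sums 32 item scores (each 0-3) and rescales the sum to a 0-100% total. The function name is illustrative, and the code assumes that the transformation is simply the summed score expressed as a percentage of the maximum possible sum (32 x 3 = 96); the MFM User Manual remains the authoritative source for scoring rules, including the handling of missing items.

```python
def mfm32_total_percent(item_scores):
    """Return an MFM32-style total score on a 0-100% scale.

    Assumption: total = 100 * (sum of item scores) / (maximum possible sum).
    """
    if len(item_scores) != 32:
        raise ValueError("expected exactly 32 item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item score must be 0, 1, 2 or 3")
    max_sum = 3 * len(item_scores)  # 96 for 32 items
    return 100.0 * sum(item_scores) / max_sum


# Hypothetical patient: 16 items scored 3 and 16 items scored 0.
example_scores = [3] * 16 + [0] * 16
print(mfm32_total_percent(example_scores))
```

Under this assumption, a one-point change on a single item shifts the total score by roughly one percentage point (100/96), which is one reason item-level interpretation matters when discussing small changes in the total score.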

Results
Participants were able to relate one or more ADLs to each of the functional abilities assessed in the MFM32, and expressed the view that maintaining functional ability, as assessed by a patient-friendly version of the MFM32, was a meaningful outcome. This demonstrates the effectiveness of translating a complex ClinRO into a patient-friendly alternative to aid discussion on meaningful change with patients. The results obtained in this study are particularly important in a progressive disease, where they provide the patient's perspective on how functioning, as assessed by a ClinRO, could be related to real-world ADLs. A limitation of this approach is that the clinician-reported version of the MFM32 provides a more detailed assessment of motor function, and assesses intermediate functions that were difficult to explain in the context of the interviews, which instead focused on describing the functional ability associated with achieving a maximum score on each item. The changes in scores discussed are theoretical, and the fact that a patient-friendly MFM32 item can be associated with an everyday activity does not mean that change on an item would necessarily lead to changes in that specific daily activity in the real world. A further limitation is that other clinical (e.g., presence of contractures or scoliosis) and situational (e.g., use of assistive devices) factors may influence actual functioning. This work has been accepted for publication in a peer-reviewed journal (Duong et al. 2021).
Case study 2: Rasch measurement model for approximation of score changes: Can clinical changes translate into meaningful changes in functioning at the individual patient-level?

Context
A challenge in rare diseases is the heterogeneity of patients' symptoms, especially prominent when communicating the meaning of changes on a ClinRO across the spectrum of functional ability.
Qualitative data supported the understanding of the relationship of the MFM32 to daily life; however, regulatory and payer questions focused on the relevance of point changes in the context of a total score that ranges from 0-100.

Methods
Trundell et al. (2019) used a Rasch measurement model to predict probable item scores on the MFM32 based on an individual's level of underlying motor function. As described by Hobart and Cano (2009), the Rasch measurement model provides a method to score patients according to their abilities, presenting items according to their difficulties on the same linear interval scale. In this example, probable item scores on the MFM32 were predicted based on an individual's level of function on the underlying spectrum (i.e., from a logit [log-odds unit], which is the unit of measurement in Rasch) (Andrich, 2011). Trundell et al. (2019) described a provisional set of analyses using the following steps to establish the relationship between MFM32 item scores and daily functions. An example is provided after each step:

1. Qualitative data sources were reviewed to identify daily activities impacted by SMA, as reported by patients and their families.
Example: Turning in bed was raised as important to patients and their families.
2. Physicians (N=2) and physiotherapists (N=6) experienced in administering and interpreting the MFM32 in patients with Type 2 and 3 SMA identified key MFM32 items that were most related to each daily activity.
Example: Item 7 on the MFM32 (ability to roll from supine to prone position) was identified as the item most related to the ability to turn in bed.
3. For each function, an MFM32 item (or items), and the threshold between response options most likely to predict ability to perform the daily activity, were identified by the expert panel.
Example: Experts identified a change in score from 1 (individual can roll partially) to 2 (individual can turn over into prone with difficulty and compensatory movements and/or cannot free the upper limbs from under the trunk) as the threshold for gaining/losing the ability to roll in bed.
4. Rasch analysis was conducted using two independent data sources (one from a clinical trial and the other from a real-world data source) to identify the logit associated with each response threshold. For abilities with multiple items, the mean of the logits was used. Where items had disordered thresholds, categories were combined until thresholds were ordered.
Example: The item characteristic curves (curves showing the ordering of response options for each item; for an example see Petrillo et al. 2015) for item 7 were used to identify the point on the logit scale at which an individual is equally likely to score 1 or 2.
5. For each daily activity, the specified logit was used to identify the most probable item response (0-3) for each of the other MFM32 items, from which a total score (0-100) was calculated.
Example: This logit value was then used to calculate the most probable item score for all other items using the item threshold map, which allows prediction of the most likely response for each item based on a person's location on the logit scale (i.e., on the underlying construct of motor function ability). For the ability to turn in bed, the resulting score was 52 on the MFM32 total score, which is in the middle of the scale.
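Steps 4 and 5 above can be sketched in code. The example below assumes a partial credit parameterization of the polytomous Rasch model, in which the probability of scoring category k is proportional to exp of the cumulative sum of (theta - tau_j) over the first k thresholds; the source states only that a Rasch measurement model was used, so this specific form is an assumption. All threshold values and function names are hypothetical and are not MFM32 estimates.

```python
import math


def category_probabilities(theta, thresholds):
    """Probability of each response category (0..m) at person location theta.

    Assumed partial credit form: P(X = k) is proportional to
    exp(sum over j <= k of (theta - tau_j)), with an empty sum for k = 0.
    """
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    largest = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def most_probable_score(theta, thresholds):
    """Most probable item response at theta (the item threshold map lookup)."""
    probs = category_probabilities(theta, thresholds)
    return max(range(len(probs)), key=probs.__getitem__)


def predicted_total_percent(theta, item_thresholds):
    """Sum the most probable item scores and rescale to 0-100."""
    scores = [most_probable_score(theta, t) for t in item_thresholds]
    max_sum = sum(len(t) for t in item_thresholds)
    return 100.0 * sum(scores) / max_sum


# Hypothetical thresholds (in logits) for three items scored 0-3.
hypothetical_items = [
    [-2.0, -1.0, 0.0],  # an easier item
    [-0.5, 0.5, 1.5],   # the item linked to the daily activity
    [1.0, 2.0, 3.0],    # a harder item
]

# At the threshold between scores 1 and 2 (tau = 0.5), both are equally likely.
print(category_probabilities(0.5, hypothetical_items[1]))

# Just above that threshold, predict the remaining items and the total score.
print(predicted_total_percent(0.6, hypothetical_items))
```

With ordered thresholds, the most probable score for an item is simply the number of its thresholds lying below the person's logit, which is what an item threshold map displays graphically.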

Results
The result of this approach was a provisional figure depicting estimated MFM32 score thresholds associated with the gain or loss of meaningful daily activities (Figure 1).
The value of this approach is the ability to link the MFM32 items to real-world daily activities and to quantitatively approximate gain or loss of activities in the context of an MFM32 total score. These estimates were not used to form endpoints but rather to provide a schematic to demonstrate the relevance and meaning of score changes across the MFM32 score spectrum. While there are limitations to this method (e.g., imperfect fit to the Rasch model due to the multidimensional nature of the scale), and it is not recommended as the sole means for determining score thresholds, the approach provides a unique way to visualise that patients gain or lose meaningful daily activities, regardless of their starting position on the scale. This analysis moves beyond a one-size-fits-all approach to establishing what a change score on a ClinRO may mean in relation to a patient's starting level of functioning. This work has been expanded upon using an additional dataset, and this research has been submitted for publication in a peer-reviewed journal (Trundell et al. 2021).
Case study 3: Can exit interviews be used to provide context for the meaning of a change in quality of life (QoL)/functioning?

Context
The heterogeneity of autism spectrum disorder (ASD) in symptoms, symptom changes and impact is a well-established challenge. Additionally, available core symptom measures were not developed specifically for ASD. Additional work to understand what constitutes a meaningful change for available measures is required to validate the assessment and optimally interpret clinical trial data and real-world outcomes.

Methods
The Vineland™-II Adaptive Behavior Scales, Second Edition (Vineland™-II; Sparrow et al. 2005) [...] with ASD. Interviews consisted of in-depth, 60-minute, telephone-based conversations with study partners of trial participants who had recently completed the final, Week 24 visit. Open-ended questions were asked to gain insights into the impact of ASD on the lives of individuals and their families, followed by a focused discussion on the meaning of any changes experienced over the course of the trial. In particular, questions focused on the domains captured by the Vineland™-II: socialization, communication, and daily living skills, with discussion on changes in health-related QoL. Interview data were audio-recorded, transcribed and analyzed using thematic analysis to understand the meaning and real-world impact of changes. Quantitative ratings of change from baseline, overall and per domain (based on 7-point scales), were completed by study partners during the interview, with the intent to categorize exit interview participants for quantitative evaluation in the psychometric analyses. Experienced, independent clinician reviewers also rated changes, using the same rating scales as the study partners, for a number of transcripts (n=20). This approach required clinicians to review selected transcripts and rate the perception of change from their independent clinical perspective. These data, alongside blinded descriptive data from selected outcome variables in the clinical trials, were used to inform anchor-based analyses exploring the interpretation of meaningful change on the Vineland™-II.

Results
This approach provided insight into the change that study partners perceived in the trial participants. When study partners were asked to rank domains in order of importance, socialisation and communication were rated most highly in both aV1ation and V1aduct, thereby supporting the primary endpoint used in these studies. Upon review of the interviews and unblinding of treatment groups, the interviews provided additional context for the changes (most of which were improvements) reported in the placebo and active drug arms. This supported the results from the COAs also collected in the trial and went some way towards validating the trial findings. In aV1ation, the majority of study partners reported some form of improvement in the child they cared for, with few reporting no change and even fewer reporting worsening, whereas in V1aduct the majority reported no change. In aV1ation, when the anchor data and clinical trial data were triangulated to calculate meaningful change thresholds for the Vineland™-II, one noticeable difference between the clinician and study partner feedback from the interviews was that clinicians assessed only change above a certain amount as being meaningful (based on the clinical anchors). From the study partner perspective, however, any change, no matter how small, was usually considered meaningful. This suggests that the meaningful change threshold used as a benchmark, which had involved clinician input, was conservative, and that a lower value may still have been meaningful to study partners and people with ASD.
In aV1ation, the interviews also provided insights from study partners regarding changes that may have been a result of other life events and were not attributed to the drug. When study partners were asked whether they thought the trial participant was on placebo or active drug, just under half of those who selected placebo reported a minimal improvement when asked to rate change using the anchor. These same participants reported that such changes may have been a result of maturation, starting high school, moving or a personal desire to change. This important finding could be used to explain outliers, and highlights the importance of the mixed methods approach used in this study to better understand life changes throughout the duration of a clinical trial when interpreting results.
A limitation is that exit interviews are reliant on the willingness of study partners to take part, which may result in a potential selection bias. As the exit interviews were consistent with the clinical trial data, it is likely that the sample was representative of trial study partners. An additional consideration is that only study partners were involved in these interviews, not ASD participants. Thus, it cannot be assumed that the changes reported were consistent with ASD trial participants' own perception of change. The study was designed in this way to account for potential difficulties ASD participants may have had in communicating their experiences and therefore in accurately elaborating on any changes. A general limitation, applicable both to trials in this population and to the optional exit interviews, is that the willingness of study partners to see an improvement could have led to changes being observed and reported more sensitively in the placebo and treatment arms. Overall, the exit interviews were received well and also allowed for general feedback on the positive impact of changes experienced from the study, with one study partner noting: "If the people in the pharmaceutical company are going to read it, I would say that for families that are on the spectrum like us, it's life changing work what they're doing. It gives us the opportunity to, to fit and to hope for a fair future for our kid, that sometimes get shattered when we get the diagnosis. On a personal level I would like to say it feels like a miracle. So thank you."

Case study 4

Context
In designing pivotal trials to assess efficacy and safety of new therapeutics, drug developers should endeavor to power a study to detect a clinically meaningful treatment effect. Powering is done at a group level, whereby confidence is needed that an observed difference between groups is clinically important. Though recent guidance has moved from group-level meaningful change towards the more patient-relevant concept of individual response, group-level evaluation continues to be important among clinical decision makers, where group-level inferences can inform comparisons between different treatments or decisions regarding public policy (Rai et al. 2015).
While anchor- and distribution-based methods (e.g., Copay et al. 2007) are proposed to support arguments, when there are limited data to support meaningful differences, or in indications where there is a high unmet need and limited treatment options are available, clinical expert opinion is often sought to inform a relevant and meaningful treatment effect. The evaluation of the relative clinical importance of an observed group-level difference is context-dependent and is dynamic within an ever-evolving treatment landscape. This is important when primary endpoints are derived from ClinROs, where clinician opinion plays a prominent role. This case study illustrates how clinician-centric metrics can support discussions on a meaningful group-level difference on a complex multi-domain ClinRO.

Methods
This example is also based on research in ASD. We sought to define a clinically meaningful treatment effect on the Vineland™-II (Sparrow et al. 2005). The Vineland™-II is a complex ClinRO that relies on normative data to support a standardized scoring algorithm. This complexity, combined with a lack of data on meaningful change thresholds and a lack of existing treatments, meant that clinical experts struggled to identify a treatment difference that would be meaningful. To overcome this, we utilized three different clinically relevant metrics to support discussions:

Responder delta
Numbers needed to treat (NNT)
Standardized effect size

Results
Table 1 illustrates the different metrics using a hypothetical ClinRO with a standardized score ranging from 0-100 points (higher scores indicating better functioning) and an assumed standard deviation (SD) of 10.

Responder delta
The relevance of change is inherently more meaningful when individual-level changes are presented. Using either established or illustrative thresholds for clinically meaningful individual change, it is possible to model the proportion of responders in each treatment group by adjusting the group-level treatment effect and keeping the placebo arm constant. In considering the responder delta, clinical experts were better able to contextualize the group-level differences.
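One way to model the proportion of responders is sketched below. The sketch assumes that individual change scores in each arm are normally distributed with a common SD, which the source does not state; the responder threshold (5 points) and the placebo mean change (0 points) are hypothetical values chosen for illustration, and only the SD of 10 comes from the hypothetical ClinRO described above.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def responder_proportion(mean_change, sd, threshold):
    """Share of patients whose change meets or exceeds the threshold.

    Assumption: change scores are normally distributed with the given mean and SD.
    """
    return 1.0 - normal_cdf((threshold - mean_change) / sd)


def responder_delta(placebo_mean, treatment_effect, sd, threshold):
    """Difference in responder proportions (active minus placebo).

    The placebo arm is held constant and the active arm is shifted by the
    group-level treatment effect.
    """
    p_placebo = responder_proportion(placebo_mean, sd, threshold)
    p_active = responder_proportion(placebo_mean + treatment_effect, sd, threshold)
    return p_active - p_placebo


SD = 10.0               # from the hypothetical ClinRO
THRESHOLD = 5.0         # hypothetical individual-level responder threshold
PLACEBO_MEAN = 0.0      # hypothetical mean change in the placebo arm

for effect in (2.0, 5.0, 8.0):
    delta = responder_delta(PLACEBO_MEAN, effect, SD, THRESHOLD)
    print(f"treatment effect {effect:.0f} points: responder delta {delta:.3f}")
```

Presenting a range of group-level treatment effects side by side in this way lets a clinical panel see how a difference in mean scores translates into additional patients achieving an individually meaningful change.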

Numbers needed to treat
Building on the responder delta, another clinical concept is NNT. NNT can be defined as the average number of patients who need to have the treatment for one to have the positive outcome in the time specified (NICE Glossary 2021). The closer NNT is to 1, the more effective the treatment (CEBM 2021). Citrome and Ketter (2013) posit that NNT is one of the most clinically intuitive metrics, helping clinical decision makers evaluate the real-world effectiveness of a documented effect size. Even when no similar treatment is available, we have found that clinicians are able to evaluate the clinical relevance of any given NNT in the context of other treatments. Although numbers needed to harm (NNH) is required to obtain a full picture of the relative benefit-risk of a treatment, a hypothesized NNT value provides a unique and valuable lens for evaluating group-level treatment differences.
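The link between the responder delta and NNT can be shown with a short calculation. The sketch below uses the standard relationship NNT = 1 / (difference in responder proportions); the responder proportions are hypothetical, and rounding up to the next whole patient is a common reporting convention rather than something specified in the source.

```python
import math


def number_needed_to_treat(p_active, p_placebo):
    """NNT as the reciprocal of the difference in responder proportions."""
    delta = p_active - p_placebo
    if delta <= 0:
        raise ValueError("NNT is only defined here when the active arm has more responders")
    return 1.0 / delta


# Hypothetical responder proportions: 45% on active treatment, 30% on placebo.
nnt = number_needed_to_treat(0.45, 0.30)
print(f"NNT = {nnt:.2f}, reported as {math.ceil(nnt)}")
```

In this hypothetical case, a responder delta of 15 percentage points corresponds to treating about seven patients for one additional patient to respond.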

Standardized effect size
Finally, a standardized effect size is a metric that is well known to clinical experts. Effect size is traditionally used to support interpretation of clinical research (alongside statistical significance). Labels such as small (0.2), medium (0.5) or large (0.8) (Cohen, 1988) are often used by prescribing clinicians to consider the relative efficacy of a treatment and the resulting clinical importance.
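As a brief illustration, a standardized effect size can be computed as the between-group difference in mean scores divided by the SD, and Cohen's benchmarks can be converted back into points on the scale. The sketch below assumes the SD of 10 used for the hypothetical ClinRO; the function names and the 5-point difference are illustrative only.

```python
def standardized_effect_size(mean_difference, sd):
    """Between-group mean difference expressed in SD units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return mean_difference / sd


def points_for_effect_size(effect_size, sd):
    """Score difference (in scale points) corresponding to an effect size."""
    return effect_size * sd


SD = 10.0  # assumed SD of the hypothetical ClinRO

# A hypothetical 5-point group-level difference expressed as an effect size.
print(standardized_effect_size(5.0, SD))

# Cohen's benchmarks expressed as points on the hypothetical scale.
for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label}: {points_for_effect_size(d, SD):.1f} points")
```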
Converting group-level meaningful change scores into well-known metrics such as NNT, standardised effect size and responder deltas can enable a panel of clinical experts to better articulate the clinical meaning of a specified change on a ClinRO. This can be valuable during the early development of new therapies, where expert clinical opinion is sought to help power a study for a clinically meaningful effect.

Discussion
These case studies highlight a series of novel approaches that have been employed to derive and demonstrate the meaningfulness of score changes on ClinROs to support stakeholder decision-making. Regardless of context (clinical trial or real-world evidence generation) and whether a ClinRO is being used to generate individual- or group-level data, the importance of deriving meaning from data by 'going beyond the numbers' to define relevance to patients and clinical experts stands true.

Important learnings include:
Case study 1, the patient perspective is vital to determining the clinical relevance of items to their everyday lives, and providing patients with a patient-friendly version of the concepts measured in a ClinRO could be a useful approach to obtaining these insights.
Case study 2, in heterogeneous conditions, different ADLs may be gained or lost depending on patients' functional ability. Visual approaches using the Rasch measurement model can depict this reality to external audiences such as regulators and payers.
Case study 3, exit interviews could provide an opportunity to understand real-life examples of how study partners perceived change in study participants; to identify additional considerations when thinking about disease improvement, in support of the concepts covered by primary endpoints in clinical trials; to highlight the importance of the attribution of treatment effect, and that perceived effects can be driven by multiple factors; and to show that perceived change can vary across respondents.
Case study 4, clinician expertise is important in supporting the interpretation of a meaningful difference on a ClinRO at the group level, using a variety of well-known metrics.
Although the different approaches delivered value by improving the interpretability of score changes on ClinROs, the case studies also had several important limitations worth considering. First, the approaches presented here should be considered as supplementary to the traditional anchor-based approaches for the evaluation of meaningful change on complex ClinROs. The information obtained through these innovative methods cannot be relied upon as the sole source of an estimate. Second, ClinROs used in clinical trials are often measures taken from clinical practice, which require validation (i.e., using classical test theory and Rasch measurement theory analytic methods) to ensure optimal performance in a clinical trial. The linearity of such scales is often not verified, and as such single-point changes may not mean the same across the measure, causing challenges with describing and interpreting 'point changes'. It is important that appropriate validation studies detailing scale structure, reliability, validity and ability to detect change have been conducted prior to interpreting the meaning of changes on the scale.
Finally, in terms of future research in this field, it is important to note that in patient populations such as SMA and ASD, where a wide degree of heterogeneity exists with regard to functional ability, a single estimate of meaningful change is often not appropriate. The authors encourage greater discussion and consideration of appropriate methods for factoring in baseline function in the calculation of meaningful change estimates.

Conclusion
While methods for deriving meaningful change thresholds have evolved considerably in recent years, there remains a challenge in communicating what observed changes mean to the patient, a challenge which is further complicated in ClinROs. These case studies showcase novel approaches to addressing this challenge and may provide a useful addition to the COA scientist's toolbox. Our studies focused on neurological conditions, and the applicability of our approaches should be verified across multiple therapeutic areas.

Figure 1. Estimated MFM32 score thresholds associated with the gain or loss of meaningful daily functions