The challenges inherent in anchor-based approaches to the interpretation of important change in clinical outcome assessments

Anchor-based methods are group-level approaches used to derive clinical outcome assessment (COA) interpretation thresholds for meaningful within-patient change over time, supporting the understanding of the impacts of disease and treatment. The methods explore the associations between change in the targeted concept of the COA measure and the concept measured by the external anchor(s), typically a global rating, chosen as easier to interpret than the COA measure. While they are valued for providing plausible interpretation thresholds, group-level anchor-based methods pose a number of inherent theoretical and methodological conundrums for interpreting individual-level change. This investigation provides a critical appraisal of anchor-based methods for COA interpretation thresholds and details key biases in anchor-based methods that directly influence the magnitude of the interpretation threshold. Five important research issues inherent in the use of anchor-based methods deserve attention: (1) global estimates of change are consistently biased toward the present state; (2) the use of static current state global measures, while not subject to artifacts of recall, may exacerbate the problem of estimating clinically meaningful change; (3) the specific anchor assessment response(s) that identifies the meaningful change group usually involves an arbitrary judgment; (4) the calculated interpretation thresholds are sensitive to the proportion of patients who have improved; and (5) examination of anchor-based regression methods reveals that the correlation between the COA change scores and the anchor has a direct linear relationship to the magnitude of the interpretation threshold derived using an anchor-based approach, with stronger correlations yielding larger interpretation thresholds.
While anchor-based methods are recognized for their utility in deriving interpretation thresholds for COAs, attention to the biases associated with estimation of the threshold using these methods is needed to progress in the development of standard-setting methodologies for COAs.


Introduction
Clinical outcome assessments (COAs), a term that encompasses patient-reported outcomes (PROs), clinician-reported outcomes (ClinROs), observer-reported outcomes (ObsROs), and performance outcomes (PerfOs), as well as certain COAs derived from technologies such as mobile health technologies [1], are crucial to the interpretation of clinical studies as they uniquely describe or reflect how patients feel or function. It is essential to understand the COA change over time that is meaningful to patients in order to appropriately interpret clinical study findings.
Early pioneers in the quest to incorporate the patient voice into clinical studies and the need for meaningful interpretation of COA results used global patient-reported ratings of change as external anchor items to document patients' overall assessment of change [2,3]. The anchor-item responses were then used to classify patients' individual PRO change scores to inform the interpretation of changes in PRO observed in clinical studies. Jaeschke et al. noted "Despite the absence of a criterion measure, establishing the meaning of changes in a new measure requires some sort of independent standard. Global ratings represent one credible alternative. …This information will be useful in interpreting questionnaire scores, both in individuals and in groups of patients participating in controlled trials" [3].
Anchor-based approaches have been central to many empirical methods employed to aid the interpretation of clinically meaningful change in COAs. For over a decade, the US Food and Drug Administration (FDA) has promoted the use of these methods to support the interpretation of the threshold "for an individual patient PRO score change over a predetermined time period that should be interpreted as a treatment benefit" [4]. More recently, the FDA has confirmed this approach, noting "To aid in the interpretation of study results, FDA is interested in what constitutes a meaningful within patient change (i.e., improvement and deterioration from the patients' perspective) in the concepts assessed by COAs" and "recommends the use of anchor-based methods to establish meaningful within-patient changes" [5]. Indeed, this best estimate of change when individual patients have experienced a meaningful benefit over time in the concept assessed by a COA is the focus of this paper. In the same 2019 document, FDA diminished the role of distribution-based methods, stating: "Distribution-based methods (e.g., effect sizes, certain proportions of the standard deviation and/or standard error of measurement) do not directly take into account the patient voice and as such cannot be the primary evidence for within-patient clinical meaningfulness. Distribution-based methods can provide information about measurement variability" [5].
Anchor-based methods explore the associations between the change in the targeted concept of the COA measure and an external criterion measured by the anchor (or multiple anchors) chosen to be easier to interpret than the COA measure [1,4,5]. Simply stated, "the anchor measure(s) are used as external criteria to define patients who have experienced a meaningful change in their condition" [1]. As noted above in the early interpretation studies, the anchor is often a global rating [2,3]. By identifying the subgroup of patients who are considered to have experienced meaningful change based on the anchor measure(s), the meaningful change threshold of the COA measure can be derived [1].
While the anchor-based methods are valued for providing plausible criterion measures to aid in determining the meaningful change threshold for clinical trial interpretation, a number of inherent theoretical and methodological challenges arise from their use. The term meaningful change threshold used throughout this paper describes an estimate of change based on an analysis of COA scores from a group of patients who have experienced an individually meaningful benefit derived from an external standard [6].
The intent of this paper is to address five substantive research issues in the use of the anchor-based methods. We then propose strategies that may address these challenges, and recognize fundamental limitations for the interpretation of individual change that remain.

The research issues
Five important research issues in the use of anchor-based methods are as follows:
1. Global estimates of change are consistently biased toward the present state.
2. The use of static current state global measures, while not subject to artifacts of recall, may exacerbate the problem of estimating meaningful change.
3. The anchor assessment response(s) that indicates meaningful change usually involves an arbitrary judgment.
4. For receiver-operator characteristic (ROC) curve methods, the derived meaningful change thresholds are sensitive to the proportion of patients who have improved.
5. For anchor-based regression methods, the correlation between the COA change scores and the anchor has a direct linear relationship to the magnitude of the interpretation threshold, with stronger correlations yielding larger interpretation thresholds.
Each of these five key challenges is discussed below.

Global estimates of change are consistently biased toward the present state
Anchor-based methods based on a patient global estimate of change (e.g., "Please choose the response that best describes the overall change in your < symptom/overall status/etc. > since you started taking the study medication: Much better, A little better, No change, A little worse, Much worse" [7]) have consistently demonstrated bias resulting from overweighting the present state and underweighting the initial state. Theoretically, if global reports of within-patient change are an unbiased estimator of the difference between their baseline and present condition, the correlations between the global assessment and PRO scores of the present and baseline states should be equal and opposite [8]. However, empirical investigations correlating the assessment of change on the anchor with measures of baseline and present state have consistently demonstrated a high positive correlation with the patients' current status, and a near zero, and occasionally positive, correlation with baseline assessments [8][9][10][11][12]. The fundamental problem with the approach is that remembering and estimating change from a baseline several weeks or months earlier can be an extremely difficult recall task; as a consequence, people devise alternative, albeit unconscious, strategies [8]. One identified strategy is the implicit theory of change [13]. Using numerous examples from the social science literature, Ross [14] documented how individuals do not directly recall the initial state; instead, they use implicit theories based on their current state to work backwards to an estimate of their initial state and then reconstruct the estimate of change over time. As a result, global assessments of patient-perceived stability and/or change lead to an overweighting of current status in the change estimation.

The use of static current state global measures may exacerbate the problem of estimating clinically meaningful change
One alternative that avoids the problems arising when patients directly estimate change is to have patients estimate their present state using a patient global impression of severity (PGIS) (e.g., "Please choose the response below that best describes the severity of your < symptom/overall status/etc. > over the past week: None, Mild, Moderate, Severe"), as recommended by the FDA [7], and then determine the amount of change by subtraction. However, this approach does not directly elicit information from patients about the magnitude of meaningful change, as described below.
Moreover, these patient assessments of present state may also suffer from a bias analogous to the implicit theory of change or stability. The related bias is called response shift: as a patient's health state changes, their expectation of ideal health may change with it. Patients with chronic or degenerative diseases may acclimatize to their health state and so report good or excellent health despite obvious infirmities. As a consequence, "HRQoL scores can be stable despite changes in HRQoL" [15]. That is, the PGIS response given at baseline may not reflect the health framework used by the patient at later PGIS assessments, and this will bias any change score computed by taking the difference of the two estimates.

The anchor assessment response(s) that indicates meaningful change usually involves an arbitrary judgment
Using the patient global estimate of change (e.g., "Please choose the response that best describes the overall change in your < symptom/overall status/etc. > since you started taking the study medication: Much better, A little better, No change, A little worse, Much worse" [7]) to understand and interpret meaningful change requires the selection of a specific global response(s) to anchor the change score analyses. What then constitutes a meaningful change? Does a patient need to be Much better, or perhaps simply A little better, for the COA change/improvement to be meaningful? If the disease or condition is known for rapid patient deterioration on the COA's concept of interest, should a patient response of No change be considered meaningful given the historically known downward disease trajectory? In short, what estimate of change corresponds to patients who have meaningfully changed while, at the same time, have not changed too much [16]?
The situation is exacerbated when static state measures are used because relevant change levels are determined by computing the difference between the two states, yet meaningful change is not directly estimated by patients. Recognizing this, the FDA asks sponsors to specify and justify "the anchor response category that represents a clinically meaningful change to patients on the anchor scale, e.g., a 2-category decrease on a 5-category patient global impression of severity scale" [1,7]. However, when the criterion judgment of meaningful change over time on this static scale is left in the hands of the investigating team or an expert panel, this undermines the process of identifying a meaningful patient-informed change threshold using the static anchor.
It is suggested [1,5] that interpretation of meaningful change may be assisted by graphic display of the empirical cumulative density function (eCDF) at each global change value, and indeed, knowledge of the distributions of COA change scores is essential [17]. While the eCDF curves from each change level provide descriptive information on the relationship of the COA to the anchor's change score, these graphic displays do not directly inform the anchor's meaningful change threshold. That is, the statement that "The meaningful within-patient threshold of the target COA should be explored by the eCDF of the anchor category where the patients are defined and judged (by the anchor measure) as having experienced meaningful change in their condition" [1] assumes that the meaningful change level for the anchor is known or has been established.
Indeed, without an adequate qualitative investigation [18,19] of the global item's response options to understand what patients within the target population consider a meaningful change in how they feel or function on the global item's scale, the selected relevant level(s) for the global assessment response(s) that indicates meaningful change may involve an arbitrary judgment by the investigating team, and that judgment can differ over time [3,20]. The use of a static anchor (e.g., PGIS) and eCDF displays does not address the crucial issue of how the anchor's meaningful change level is established.
Finally, the reliability of anchor ratings is generally unknown, with limited evidence supporting test-retest reliability of anchor item(s). The anchor is often a single item and hence more prone to measurement error than a multi-item scale [21]. The paucity of evidence of reliability for anchor assessments was noted in 1997 by Norman, Stratford, and Regehr [8], and as described by Lavigne in 2016, if the anchor item(s) used to assess meaningful change is not reliable, the resulting change threshold for meaningful improvement or decline may not be reliable [22].
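The consequence of an unreliable anchor can be sketched with the classical attenuation formula from psychometric theory (a standard result, not an analysis from this paper): the observed correlation between two measures is the true correlation multiplied by the square root of the product of their reliabilities. The reliability values below are hypothetical, chosen only to illustrate the size of the effect for a single-item anchor.

```python
import math

def attenuated_r(r_true: float, rel_anchor: float, rel_coa: float) -> float:
    """Classical attenuation: observed correlation after measurement error.

    r_observed = r_true * sqrt(rel_anchor * rel_coa)
    """
    return r_true * math.sqrt(rel_anchor * rel_coa)

# Hypothetical values: a true change-score correlation of 0.60, a multi-item
# COA change score with reliability 0.85, and a single-item anchor with
# test-retest reliability 0.50.
print(round(attenuated_r(0.60, 0.50, 0.85), 3))  # roughly 0.39
```

Under these assumed values, the unreliable single-item anchor drags a true correlation of 0.60 down to about 0.39, which matters directly for the regression-based thresholds discussed later in the paper.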

The calculated interpretation threshold is sensitive to the proportion of patients who have improved
Terluin et al. [23] examined the impact of the proportion of improved patients on minimally important change (MIC) thresholds in (1) multiple simulations of patient samples from anchor-based MIC studies, and (2) a clinical study dataset. A group-level MIC was compared to the average of the individual MICs of patients who reported an important change/improvement on a global change anchor; the group MIC was calculated using two methods, receiver-operator characteristic (ROC) curves and predictive modeling [24]. Not surprisingly, the group MIC was strongly biased by the proportion improved. When less than 50% of the sample improved, the group MIC underestimated the average of individual MICs because proportionately more observations came from the unchanged group. Conversely, when more than half the patients had an important change/improvement, the group MIC overestimated the average of individual MICs for the same reason [23].
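The dependence on the proportion improved can be sketched under a deliberately simple model (this is an illustration of the general phenomenon, not the simulation design used in [23]). If change scores in the unchanged and improved groups are normal with equal variance, the predictive-modeling cutoff, the change score at which the posterior probability of belonging to the improved group equals 0.5, has a closed form that contains the proportion improved, p:

```python
import math

def predictive_mic(mu_unchanged: float, mu_improved: float,
                   sd: float, p_improved: float) -> float:
    """Change-score cutoff where P(improved | change) = 0.5.

    Derived from p*f1(x) = (1-p)*f0(x) with equal-variance normal densities:
    cutoff = midpoint + sd^2 * ln((1-p)/p) / (mu_improved - mu_unchanged)
    """
    midpoint = (mu_unchanged + mu_improved) / 2
    shift = sd**2 * math.log((1 - p_improved) / p_improved) / (mu_improved - mu_unchanged)
    return midpoint + shift

# Hypothetical distributions: unchanged ~ N(0, 1), improved ~ N(1, 1).
# The cutoff equals the distribution midpoint only when p = 0.5.
for p in (0.3, 0.5, 0.7):
    print(p, round(predictive_mic(0.0, 1.0, 1.0, p), 3))
```

The point of the sketch is simply that the derived threshold is a property of the sample composition, not of the instrument alone: changing the proportion improved moves the cutoff even when both change-score distributions stay fixed.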
The FDA has discouraged the use of ROC curve analysis as the primary method for understanding the meaningful within-patient change threshold [5]. FDA noted three key ROC curve concerns: (1) the method is a model-based approach, such that different models may yield different threshold values; (2) the most sensitive threshold identified by ROC may not actually be the most clinically meaningful threshold to patients; and (3) the method is partially a distribution-based approach [5], as demonstrated by Terluin et al. [23].

The strength of the relationship between changes in the anchor and changes in the COA has a direct impact on the magnitude of the meaningful change threshold
An important and often overlooked source of bias in anchor-based methods that directly influences the magnitude of the meaningful change threshold is the correlation between change assessed by the anchor and the COA change scores [16]. It is self-evident that there should be some relationship between change in the COA and change assessed by the anchor scale [1,5].
Without providing specific thresholds, FDA notes that an anchor should be "sufficiently correlated to the targeted COA" [1,5]. Hays, Farivar, & Liu reported in 2005 that a correlation coefficient of r ≥ 0.371 (equivalent to an effect size of 0.80) defines "a noteworthy (large effect) association" between change on the anchor and change on the target COA measure [25]. Other authors have recommended a range of 0.30-0.70 for the magnitude of this change-score correlation [26,27]. Leaving aside the broad nature of these recommendations, what is not recognized is that selecting a correlation range does not nullify the impact of the correlation on the calculation of the meaningful change threshold. Quite the opposite: the magnitude of the association is a direct determinant of the magnitude of this threshold when using linear regression [16]. The equation relating the anchor correlation to the magnitude of the COA's meaningful change threshold (MCT_COA) is:

MCT_COA = Δ_Anchor × r_COA-Anchor

with MCT_COA and Δ_Anchor expressed in standard deviation (SD) units. That is, the meaningful change threshold is the difference on the anchor scale corresponding to meaningful change, multiplied by the correlation coefficient between the COA and anchor change scores.
This important effect of the correlation can be visualized by considering extreme cases. If there is no relationship between change on the COA and change on the anchor (r COA-Anchor = 0.0), then no amount of change in the anchor will lead to a non-zero predicted change in the COA; the two measures are independent. Conversely, if there is a perfect linear relationship (r COA-Anchor = 1.0), then any change in the anchor will lead to an equivalent change in the COA (with both expressed in SD units). Intermediate change-score correlations between the two measures must result in correspondingly intermediate values of the meaningful change threshold on the COA. This equation makes it clear that the magnitude of the resulting meaningful change threshold derived from an anchor-based method will increase with the strength of the relationship between anchor and COA [16]. Moreover, the effect is non-trivial: depending on the strength of the correlation, the meaningful change threshold can vary from 0 to 1 SD, and it is smallest when the correlation, and hence the relationship, is weakest. Setting arbitrary ranges of correlation such as 0.30-0.70 limits the impact of the correlation, but it remains a major determinant of the magnitude of the threshold.
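The linear-regression relationship described above can be checked with a few lines of arithmetic (a worked illustration of the formula in the text, with all quantities in SD units):

```python
def mct_coa(delta_anchor_sd: float, r_coa_anchor: float) -> float:
    """Meaningful change threshold on the COA (SD units) via linear regression:
    MCT_COA = delta_anchor * r_COA-Anchor."""
    return delta_anchor_sd * r_coa_anchor

# A fixed 1-SD meaningful change on the anchor maps to very different COA
# thresholds depending only on the change-score correlation:
for r in (0.0, 0.3, 0.7, 1.0):
    print(r, mct_coa(1.0, r))
```

With r = 0.0 the threshold is 0 SD, with r = 1.0 it is a full 1 SD, and the recommended 0.30-0.70 correlation range still leaves the derived threshold varying by more than a factor of two.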
While the magnitude of this bias can be directly estimated, it is unclear how to address the role of this key determinant. Fayers and Hays recommended a strategy called linking that equates the standardized change in the COA and the anchor [28]; this scale-aligning strategy is equivalent to assuming a perfect linear relationship (r xy = 1.0) between the change scores [16]. The effect of this strategy is not trivial. Using the results from Suner et al. [29], Fayers and Hays report the estimated minimal important difference (MID) on the 25-item National Eye Institute Visual Function Questionnaire (NEI VFQ-25), which is scored from 0 (worst) to 100 (best vision-related function); change in visual acuity was the anchor. The authors [16] note that "the MID is best estimated by honing in on the change group that has improved by a non-trivial important amount but not by a medium or large amount," yet the three anchor levels (≥ 15 letters gained, ≥ 15 letters lost, < 15 letters change) prohibited this refinement [29]. The resulting MID was 4.3 points using linear regression models (r < 0.3) and 21.8 points via the linking approach: two widely different thresholds for interpreting visual function change for patients with neovascular age-related macular degeneration [16,29].
This finding is worrisome in that a stronger association between change on the COA and anchor will yield higher values for meaningful change threshold, while a weaker anchor relationship yields a lower threshold for demonstrating a meaningful change using linear regression analysis. In this case, a three-level change in letters read from a distance of 2 m may not be a strong anchor for understanding change in visual function, and a low correlation yielded a small MID [29]. The importance of this finding warrants reconsideration of existing meaningful change thresholds computed using anchor-based methods and linear regression analysis.
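The regression-versus-linking contrast can be reproduced with hypothetical numbers (the NEI VFQ-25 figures of 4.3 and 21.8 points come from [16, 29]; the values below are invented solely to show the mechanism):

```python
def regression_mid(delta_anchor_sd: float, r: float, coa_sd: float) -> float:
    """MID from linear regression: attenuated by the change-score correlation."""
    return delta_anchor_sd * r * coa_sd

def linking_mid(delta_anchor_sd: float, coa_sd: float) -> float:
    """MID from linking: equates standardized changes, i.e., assumes r = 1.0."""
    return regression_mid(delta_anchor_sd, 1.0, coa_sd)

# Hypothetical study: a 1-SD meaningful change on the anchor, a COA change
# score with SD 20 points, and an observed change-score correlation of 0.2.
print(regression_mid(1.0, 0.2, 20.0))  # 4.0 points
print(linking_mid(1.0, 20.0))          # 20.0 points
```

The fivefold gap between the two hypothetical estimates arises entirely from the treatment of the correlation, mirroring the 4.3-versus-21.8-point discrepancy reported in the text.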

Discussion
Anchors have assumed a central role in the interpretation of health status changes related to therapeutic interventions. This investigation explored five challenges to an external anchor-based approach for interpreting changes in the COA of interest.
At the heart of this discussion is the utility and interpretation of the external anchor measure. When detailing the considerations for selecting anchors, the FDA states "anchors should be plainly understood in context, easier to interpret than the COA itself, and sufficiently correlated to the targeted COA" [1]. Naturally, the patient's global change assessment is often used as such an anchor, allowing patients to directly provide "the standard by which to measure the benefits and harms of their treatments" [23]. However, global change assessments consistently suffer from recall bias; these and other anchor measures are often subject to arbitrary level-setting for meaningful change, limited and often unknown test-retest reliability, and vulnerability to biases from the implicit theory of change and response shift. In addition, the fact that the change-score correlation between the COA and the anchor directly determines the magnitude of the meaningful change threshold derived using linear regression analysis calls into question the role of these assessments in the interpretation process.
There is an underlying paradox. Extensive resources are invested in research related to the development of content and psychometrically valid COAs. Yet, the interpretation of change over time, and the critical act of determining the magnitude of meaningful change on these multi-item measures is accomplished by comparing the domain/concept change scores to a single anchor measure [27] with often unknown and irreducible biases, as in the case of retrospective transition ratings.
In our view, the external anchor-based method, while providing an early and useful approach to interpretation, may have reached the limit of its development unless researchers address these identified biases. Specifically, static anchors must reliably capture the COA's concept of interest, and the anchors' meaningful change level(s) need to be informed by the target patient population [1,7]. Moreover, anchor-based analyses must avoid the identified analysis pitfalls (e.g., ROC curves, linear regression) [5]. We appreciate recommendations for multiple strong anchors when all are supportive/convergent to a single estimate [1,25,30], but cautiously recognize that when non-convergent estimates result, the subsequent disentanglement to identify the "best" anchor can become a cherry-picking exercise.

A new approach: defining the meaningful change threshold within the COA
Another solution is to refocus efforts on the development of methods that derive thresholds of change within the COA itself. One promising development in this regard, initiated by the Patient Reported Outcomes Measurement Information System® (PROMIS® [31]) researchers, is the PRO-Bookmarking [32] approach, an adaptation of the Bookmark procedure of standard setting used by US state academic achievement assessment systems [33]. Using Item Response Theory (IRT) methods, PROMIS® calibrated PROs to assess physical, emotional, and social health of patients in clinical care, observational studies, and clinical trials [31]; in addition, PROMIS® researchers developed PRO-Bookmarking to identify key change thresholds in PROs using the items within the PROMIS® measure [32].
In its original form, a key feature of the Bookmark procedure is the Ordered Item Booklet (OIB), which contains all academic test items (one per page) ordered by empirically determined difficulty (easiest to hardest). Using the OIB, a panel of subject matter experts (SMEs) determines where to place a bookmark between two items such that the "minimally qualified" student is expected to have mastered the items below the bookmark, with multiple rounds of training and discussion between SME panelists [33,34].
In a recently published PROMIS® bookmarking report investigating meaningful change in the concepts of rheumatoid arthritis (RA) pain interference and fatigue, the original Bookmark procedure was modified in a number of ways [35]. The OIB was replaced with a series of patient vignettes (natural language short stories) that describe 4 or 5 key symptoms from the PROMIS Likert-type item banks of the two investigated concepts. Each vignette had a PROMIS® IRT-based score, and these vignettes were presented to the panels in order from lowest to highest scores. The SME panels included two key informant groups, patients (n = 11) and clinicians (n = 8); each panel separately reviewed the ordered vignettes to identify (bookmark) three transition points, from none-to-mild, mild-to-moderate, and moderate-to-severe, across the severity spectrum for these two concepts. Using vignettes at the severe level as a baseline, each panel was then asked to identify the vignettes that represented a meaningful improvement for each concept (i.e., change indicating that "treatment was working" or "enough to be important to you" [35]), with the PROMIS® change scores then used to identify the magnitude of the meaningful change thresholds reported for each panel. The same exercise was conducted to identify meaningful worsening using mild vignettes as baseline [35].
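The arithmetic behind the bookmarking thresholds is straightforward and can be sketched as follows. All scores below are invented for illustration (on a hypothetical T-score metric with mean 50 and SD 10); they are not the RA study's vignettes or results [35].

```python
# Hypothetical IRT-based scores for six vignettes, ordered from least to most
# severe, as they would appear to a bookmarking panel.
vignette_scores = [38.0, 44.0, 50.0, 56.0, 62.0, 68.0]

# Suppose a panel, anchoring on the most severe vignette as baseline, judges
# the fourth vignette (score 56.0) to be the first that represents a
# meaningful improvement. The threshold is the difference in vignette scores.
baseline_severe = vignette_scores[-1]
improved_vignette = vignette_scores[3]
meaningful_improvement = baseline_severe - improved_vignette
print(meaningful_improvement)  # 12.0 T-score points
```

The key point is that the threshold is derived entirely from scores calibrated within the measure plus a panel judgment about the vignettes, with no external anchor instrument involved.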
The approach has a critical difference from previous distribution- and external anchor-based methods in that calibration arises within the COA, thereby obviating many of the concerns raised earlier regarding anchor-based assessments (bias due to retrospective change estimation, response shift, arbitrary/uninformed meaningful change levels, ROC curve concerns, and the anchor correlation influencing the estimate magnitude), although the reliability/reproducibility of the bookmarking estimates deserves investigation. Second, it clearly articulates the patient's voice, unlike distribution-based methods, and provides a clear comparison to the clinician perspective, which did not always align with the patient perspective [35]. A third difference is that, while it originated in the educational context of standard setting for academic tests, the PROMIS® bookmarking adaptation for PRO assessments provides a relevant and rich opportunity to learn from key informants how meaning is discerned.
However, this emerging method to understand meaningful COA change has limitations. Most notably, this bookmarking example used IRT-based scoring to order the vignettes by empirically-determined difficulty. While most COAs are not IRT scored, promising mobile health technologies with interval scaling (e.g., steps) may benefit from this method for understanding meaningful change within the appropriate patient population. In addition, the patient and clinician samples were small, and each informant group wanted more contextual information about the patients described in the vignettes (age, occupation, treatment history, gait, etc.) to aid in classifying the severity levels and make the meaningful change determinations [35], demonstrating the cognitive challenges of theoretical vignettes and the potential drawback of judgements based on vignettes rather than patients' lived experiences [19].
It is also important to recognize that the anchor-based and bookmarking methods used to derive thresholds to classify individuals who have achieved meaningful change over time are grounded in group-level approaches that yield a classification criterion but do not inform about the meaningfulness of a specific patient's change. Anchor-based methods rely on the distribution of COA change scores for the subgroup of patients reporting the selected anchor-derived level of change, followed by analyses (e.g., mean, median, regression, ROC analyses) that compare this subgroup to other anchor-informed subgroups. Similarly, bookmarking efforts use stakeholders to classify patient scenarios to create meaningfully changed subgroups, followed by analyses of these hypothetical patients' associated change scores, which are then applied to classify the individual change scores of other patients. The resulting thresholds derived from these group methods are not inherent to the COA measure and a specific context of use; rather, different samples/groups within the context of use can yield different results.
Indeed, the derived individual change threshold from these group-level methods does not directly inform on the meaningfulness of a specific patient's change over time; rather, it becomes an applied threshold to classify patients' outcomes. Alternatively, methods for classifying a patient as meaningfully changed using only an individual patient's COA scores (without group reference information) would require (1) a predetermined level of confidence that the observed change is beyond chance, and (2) multiple assessments over time to understand the intra-individual change variation, which can considerably increase patient burden. While the resulting change threshold could signal that the patient has achieved change that is beyond chance, understanding the importance or meaningfulness of this change to the patient is still needed.
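One established way to operationalize point (1), a predetermined confidence level that an individual's observed change exceeds chance, is the Jacobson-Truax reliable change index (RCI). This is offered only as an illustration of a beyond-chance criterion, not as a method proposed in this paper, and as the paragraph above notes, it says nothing about whether the change is meaningful to the patient.

```python
import math

def reliable_change_index(x1: float, x2: float,
                          sd_baseline: float, reliability: float) -> float:
    """Jacobson-Truax RCI: (x2 - x1) / SEdiff, where
    SEdiff = SD_baseline * sqrt(2) * sqrt(1 - reliability).
    |RCI| > 1.96 suggests change beyond measurement error (~95% confidence)."""
    se_diff = sd_baseline * math.sqrt(2) * math.sqrt(1 - reliability)
    return (x2 - x1) / se_diff

# Hypothetical patient: a 10-point improvement on a COA with baseline SD 8
# and test-retest reliability 0.85.
rci = reliable_change_index(50.0, 60.0, 8.0, 0.85)
print(round(rci, 2), abs(rci) > 1.96)
```

Under these assumed values the change clears the 1.96 criterion, yet the RCI alone still cannot say whether a statistically reliable 10-point change is important to this patient.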
Finally, we note that the Bookmark procedure is only one strategy used by educators for calibrating criterion-referenced cut scores using known information from within the test. Others, including Angoff [36], Hofstee [37], and Nedelsky [38], may be adaptable to COA assessments and deserve investigation [39][40][41]; however, all employ judgments made on hypothetical states, either at the item or test level, and therefore do not yield information at the individual student (or patient) level. With heightened awareness and a deeper understanding of the biases inherent in external anchor-based methods used to derive meaningful change thresholds, we encourage researchers to further pursue new methods that use the information within a COA, building on their demonstrated usefulness in the education standard-setting arena.
Funding No funding was received to assist with the preparation of this manuscript.

Declarations
Conflict of interest Kathleen W. Wyrwich is an employee of Bristol Myers Squibb; the authors have no other conflicts of interest to declare that are relevant to the content of this article.
Ethical standards All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.