How strong should my anchor be for estimating group and individual level meaningful change? A simulation study assessing anchor correlation strength and the impact of sample size, distribution of change scores and methodology on establishing a true meaningful change threshold

Treatment benefit, as assessed using clinical outcome assessments (COAs), is a key endpoint in many clinical trials at both the individual and group level. Anchor-based methods can aid interpretation of COA change scores beyond statistical significance and help derive a meaningful change threshold (MCT). However, evidence-based guidance on the selection of appropriately related anchors is lacking. A simulation study was conducted which varied sample size, change score variability and anchor correlation strength to assess the impact of these variables on recovering the simulated MCT for interpreting individual- and group-level results. To assess MCTs derived at the individual level (i.e. responder definitions; RDs), Receiver Operating Characteristic (ROC) curve and Predictive Modelling (PM) analyses were conducted. To assess MCTs for interpreting change at the group level, the mean change method was used. Sample size, change score variability and magnitude of anchor correlation all affected the accuracy of the estimated MCT. For individual-level RDs, ROC curves were less accurate than PM methods at recovering the true MCT. For both methods, smaller samples led to higher variability in the returned MCT, with variability higher still for ROC. Anchors with weaker correlations with COA change scores produced increased variability in the estimated MCT. An anchor correlation of around 0.50–0.60 identified a true MCT cut-point under certain conditions using ROC, whereas anchor correlations as low as 0.30 were adequate when using PM under certain conditions. For interpreting group-level results, the MCT derived using the mean change method was consistently underestimated regardless of the anchor correlation. Sample size and change score variability influence the anchor correlation strength necessary to recover individual-level RDs; often, this needs to be higher than the commonly accepted threshold of 0.30.
Stronger correlations than 0.30 are also required when using the mean change method. These results can assist researchers in selecting and assessing the quality of anchors.


Introduction
Statistical significance alone does not necessarily reflect a treatment benefit that is meaningful from the patient perspective. Interpretation of meaningful change from the patient perspective is, therefore, important when assessing concepts measured by clinical outcome assessments (COAs) [1,2]. Meaningful change thresholds (MCTs) can be used for assessing the treatment benefit experienced. However, these thresholds are derived differently for interpreting within-individual change compared to group-level comparisons. The use of such thresholds can arise in a healthcare setting during health discussions with physicians. Alternatively, they are necessary in clinical trial work either at the individual level (to define responders) or at the group level (to interpret mean differences between treatment and comparator arms) for assessing the potential treatment efficacy of a new product.

Pip Griffiths and Joel Sims are joint first authors. * Joel Sims joel.sims@adelphivalues.com. Extended author information is available on the last page of the article.
Anchor-based methods are the preferred approach employed to aid the interpretation of meaningful changes in COA scores [1,2]. Anchors provide an external indicator to classify patients depending on their degree of change in their overall disease condition or the specific concept of interest. The Food and Drug Administration (FDA) recommend that anchors should be simply worded, easy to understand and assess a specific concept [3]. Multiple anchors can be used to aid interpretation and selection of an appropriate meaningful change threshold [4].
Although this work is primarily focussed on defining an appropriate anchor, it is important to define the context in which MCTs are derived and applied; the group-level and the individual-level. At the group level there are two forms of meaningful change threshold, a meaningful change (e.g., the minimally important change) and a meaningful difference (e.g., minimally important difference). A meaningful change is the score change a given group of patients (e.g., a treatment group) would have to change by on average for this change to be meaningful. A meaningful difference is the amount of score change a group needs to show relative to another group for their score change to be meaningful (Fig. 1). The key point in both cases is that the MCT is derived using group means (i.e. the mean change approach) and later used to interpret the mean change of the group (or the mean difference between two groups) [5]. Therefore, not every individual in a group will necessarily experience a meaningful change, but the group, on average, does. In contrast, an individual-level meaningful change threshold is the amount of score change that an individual needs to experience for that individual to be considered improved (or worsened). It is likely that every individual has their own score change that would be meaningful to them. However, the goal when deriving an individual-level meaningful change threshold, much like in diagnostic testing, is to find the score change value at which a given individual is more likely to have changed than not changed [6].
When conducting anchor-based analyses, the FDA Patient Focused Drug Development (PFDD) guidance recommends use of both empirical cumulative distribution function (eCDF) and probability density function (PDF) curves to aid identification of an appropriate meaningful change threshold [7]. Importantly, for anchor-based analyses to be possible, external anchors should be sufficiently correlated with the change score of the COA of interest [4]. The strength of the correlation between the external anchor(s) and the target COA change score, however, remains a topic of uncertainty and debate. Cohen's rules of thumb have frequently been used as a guide to choosing a suitable correlation coefficient (r), with correlations of 0.10, 0.30 and 0.50 representing small, medium and large correlations respectively [8]. Based on this, published literature recommends a correlation of 0.30–0.40 as an appropriate threshold [2,9], while others have chosen higher values such as 0.50 despite this being acknowledged as an arbitrary cut-point [10][11][12]. Coon and Cappelleri (2015) have recommended a correlation of 0.40–0.70 as preferred [1]. Amidst the mixed recommendations for correlation thresholds, empirical reports continue to suggest a correlation of 0.30–0.40 as an acceptable lower limit of the level of association between an external anchor and the target COA [13]. The FDA does not explicitly recommend a correlation threshold, and offers little guidance on how researchers should assess the strength of anchor correlations [3]. Therefore, the responsibility lies with researchers to evaluate the suitability of anchors prior to conducting anchor-based analyses.
The strength of the relationship may affect the accuracy and reliability of results, alongside the meaningful change threshold that is ultimately selected. A poor correlation between an external anchor and the target COA can increase error in the derived threshold estimates [1]. Any value of r below 1.0 leads to attenuation of the meaningful change threshold, as demonstrated previously at the group level [14]. Although limitations such as correlation strength and sample size have been noted [13], how thresholds may be affected by varying anchor correlations, sample sizes and distributions of change scores remains empirically unexplored. Therefore, there is an unmet need to provide clearer guidelines on the suitability of anchors for anchor-based analyses. This has significant implications for studies aiming to aid the interpretation of COA meaningful change thresholds.
To address this gap in the literature, we conducted a simulation incorporating different samples of varying change score distributions and anchor correlation strengths to assess the impact of these factors on estimating a meaningful change threshold (MCT) using receiver operating characteristic (ROC) curves and predictive modelling (PM) approaches (which define responder definitions used to interpret results at the individual level) and the mean change method (used to interpret results at the between-group level). Throughout the remainder of this work, we refer to this as assessing "individual-level" and "group-level" MCTs.

Methods
The methods for this simulation study were developed in accordance with best-practice guidelines for conducting simulation studies to evaluate statistical methods [15].

Simulation of data (data generating mechanisms)
Data were simulated separately for the individual- and group-level meaningful change analyses. For each analysis, several conditions were created which varied the distribution of change scores and the sample size. Conditions tested sample sizes of between 100 and 2000 patients, with differing variability (standard deviation scaled in relation to the simulated mean) and differing strengths of correlation between the simulated anchor and target score (from 0.30 upwards). The two simulations differed in order to maintain an underlying meaningful change threshold of 15 points both at the individual level (where this represents the responder definition that specifies whether a patient has improved) and at the group level (where this represents the mean difference between groups of stable and improved patients).

Individual level
For the individual-level meaningful change data, sample sizes of 100, 250, 500 or 2000 patient records were created. For each patient, a COA change score was drawn from a normal distribution with a mean of 15 and, depending on the condition, a standard deviation (SD) of 3.0, 5.0 or 7.5 to represent different levels of deviation. For each patient, anchor variables were created which correlated with the COA change score at 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90 and ~ 1.0. This was done by drawing a random number z from a normal distribution with a mean equal to the sample mean and an SD equal to the sample SD (z ∼ N(μ, σ)) and using it in the following formula:

anchor = ρ × COA change score + √(1 − ρ²) × z

where ρ (rho) represents the simulated level of correlation between the COA change score and the anchor variable, and μ and σ were the mean and SD of the specific simulated sample.
This led to a series of variables which were correlated at the respective levels but had an inflated mean compared to the original COA change score. To be able to categorize patients on the new anchor variables, a correction was applied to each sample to re-centre the mean of the anchor variables to the mean of the original COA change score. This correction used the difference between the mean COA change score and the mean of each simulated anchor variable.
For each patient in each sample, a variable was created to represent patient improvement status, based on the simulated anchor variables. Because the simulated COA change score had a mean of 15, this value was used as the "true" cut point to determine change. Where anchor scores fell above 15, patients were defined as "responders"; where anchor scores fell below this value, patients were defined as "non-responders". This meant that 50% of the patients were "responders". This was done because these methods are biased when there is an imbalance in the number of patients in each group [15].
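The individual-level data-generating mechanism described above can be sketched as follows (a minimal illustration assuming numpy; the function and variable names are ours, not taken from the study code):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_individual(n, sd, rho, true_mct=15.0):
    """Sketch of the individual-level data-generating mechanism:
    COA change scores ~ N(15, SD), a correlated anchor, and a
    responder flag defined by the fixed cut point of 15."""
    change = rng.normal(true_mct, sd, n)        # COA change scores
    z = rng.normal(true_mct, sd, n)             # independent draw, same mean/SD
    anchor = rho * change + np.sqrt(1 - rho**2) * z
    anchor += change.mean() - anchor.mean()     # re-centre to change-score mean
    responder = anchor > true_mct               # anchor above 15 -> "responder"
    return change, anchor, responder

change, anchor, responder = simulate_individual(n=2000, sd=5.0, rho=0.6)
```

The re-centring step makes the anchor mean equal to the change-score mean, so the fixed cut point of 15 splits each sample into roughly equal "responder" and "non-responder" groups, as the text above describes.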

Group level
Simulation of data for deriving MCTs for group-level interpretation (the mean change approach) followed the same methodology as the individual-level data (i.e., the same sample size and standard deviation conditions) but with some small differences. A cohort of patients was simulated from a single normal distribution, with a mean of 7.5 and an SD of 3.0, 5.0 or 7.5. Simulated anchors were derived in the same manner as described above for individual-level change. Patients were then re-classified into "improved" and "stable" groups using the simulated anchors. Given that the overall mean for the COA change score in each sample was 7.5, this value was used to classify patients on the anchors. When a patient's simulated anchor score was ≤ 7.5 they were defined as "stable"; where a patient's simulated anchor was > 7.5, they were defined as "improved". A threshold of 7.5 is arbitrary, but allowed for relatively equal group sizes.
The "true" difference in mean change between these groups was determined as the difference seen when patients were grouped by a perfectly correlated anchor, and this threshold was used for comparison with all other anchors with a correlation < 1.
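Under the same assumptions as the individual-level sketch, the group-level mechanism and the r = 1 "true" difference can be illustrated as follows (names are ours, not the study code):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_group(n, sd, rho, mean=7.5):
    """Sketch of the group-level data-generating mechanism: a single
    N(7.5, SD) cohort split into "stable"/"improved" by a correlated anchor."""
    change = rng.normal(mean, sd, n)
    z = rng.normal(mean, sd, n)
    anchor = rho * change + np.sqrt(1 - rho**2) * z  # correlated anchor
    anchor += change.mean() - anchor.mean()          # re-centre as before
    improved = anchor > mean                         # > 7.5 -> "improved"
    return change, improved

# "True" MCT: the between-group difference under a perfect anchor (r = 1)
change, improved = simulate_group(n=2000, sd=5.0, rho=1.0)
true_mct = change[improved].mean() - change[~improved].mean()
```

With SD = 5 and a perfect anchor, the recovered difference sits just under 8 points, consistent with the "true" MCT reported in the Results.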

Individual level
To recover the MCT (that is, to output the estimated MCT for each simulated condition) at the individual level, a series of Receiver Operating Characteristic (ROC) curve-based and Predictive Modelling-based (PM) analyses were conducted. Briefly, using the simulated anchor groups ("non-responders" and "responders"), the ROC method identifies the number of "non-responder" and "responder" patients who are correctly and incorrectly classified by each possible threshold. For each sample, the COA change score which most accurately represents the ROC curve point closest to the top-left corner of the plot is used as the estimate for the responder definition. This is calculated for each COA change score using the following formula, with the smallest resulting value being selected as the responder definition:

√((1 − se)² + (1 − sp)²)

where se represents the sensitivity of the cut point and sp represents the specificity.
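A minimal sketch of this ROC procedure (our own illustrative implementation, not the study code) is:

```python
import numpy as np

def roc_cutpoint(change, responder):
    """Pick the candidate cut point whose (sensitivity, specificity) pair
    lies closest to the top-left corner of the ROC plot, i.e. the one
    minimising sqrt((1 - se)^2 + (1 - sp)^2)."""
    n_resp = responder.sum()
    n_non = (~responder).sum()
    best_cut, best_dist = None, np.inf
    for cut in np.unique(change):
        predicted = change >= cut                     # classify by candidate cut
        se = (predicted & responder).sum() / n_resp   # sensitivity
        sp = (~predicted & ~responder).sum() / n_non  # specificity
        dist = np.sqrt((1 - se) ** 2 + (1 - sp) ** 2)
        if dist < best_dist:
            best_cut, best_dist = cut, dist
    return best_cut

# With a near-perfect anchor, the recovered cut point should sit close to
# the simulated "true" responder definition of 15.
rng = np.random.default_rng(7)
change = rng.normal(15, 5, 2000)
z = rng.normal(15, 5, 2000)
anchor = 0.99 * change + np.sqrt(1 - 0.99**2) * z
anchor += change.mean() - anchor.mean()
cut = roc_cutpoint(change, anchor > 15)
```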
The PM approach, on the other hand, uses logistic regression to predict the COA change score at which patients are most likely to be accurately classified into two groups [16]. This logistic regression model uses the improvement status on the anchor ("responder" vs "non-responder") as the dependent variable and the COA change score as the independent variable. The PM approach then uses the regression output in the following formula, which takes into account the prevalence of responders:

(ln(p / (1 − p)) − C) / β

where p is the proportion of patients in the improved group (responders) expressed as a decimal, C is the intercept from the logistic regression and β is the regression coefficient.
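The PM calculation can be sketched as follows; to keep the example dependency-free, the logistic regression is fitted with a simple Newton-Raphson loop rather than a statistics package, and all names are illustrative:

```python
import numpy as np

def pm_threshold(change, responder, iters=25):
    """Sketch of the PM responder definition: fit a logistic regression of
    anchor status on the COA change score, then apply the
    prevalence-adjusted formula (ln(p / (1 - p)) - C) / beta."""
    X = np.column_stack([np.ones_like(change), change])
    y = responder.astype(float)
    coef = np.zeros(2)
    for _ in range(iters):                       # Newton-Raphson (IRLS) fit
        mu = 1.0 / (1.0 + np.exp(-(X @ coef)))
        W = mu * (1.0 - mu)
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(2)
        coef += np.linalg.solve(H, X.T @ (y - mu))
    C, beta = coef                               # intercept and slope
    p = y.mean()                                 # prevalence of responders
    return (np.log(p / (1.0 - p)) - C) / beta

# Even with a moderate anchor correlation (rho = 0.6) the PM threshold
# should land near the simulated responder definition of 15.
rng = np.random.default_rng(5)
change = rng.normal(15, 5, 2000)
z = rng.normal(15, 5, 2000)
anchor = 0.6 * change + np.sqrt(1 - 0.6**2) * z
anchor += change.mean() - anchor.mean()
threshold = pm_threshold(change, anchor > 15)
```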
Once both of these analyses had been completed for each of the 1000 samples in each of the 12 conditions, the recovered cut points were displayed graphically using a histogram, with the recovered threshold along the x-axis and the number of samples in which that threshold was recovered on the y-axis. This was done to assess accuracy and variation around the "true" simulated threshold.

Group level
To recover the MCTs used for assessing between-group results, the mean change method was used. This defined groups of patients based on the simulated anchor as "improved" and "stable". The difference between the group means of the "stable" and "improved" patients was calculated for each sample in each condition (correlation with simulated anchor, standard deviation of change score and sample size). Recovered MCT frequencies for each of the 12 conditions (sample size × change score variability) were graphed using probability density functions, with the difference in change score on the x-axis and the density on the y-axis. All simulated anchor conditions were displayed on the same plot, with the r = 1 condition shown as the "true" change against which to compare the other correlation strength scenarios. Therefore, anchors with a correlation < 1 could be compared to what might be expected if a perfect anchor existed.
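Putting the pieces together, a small sketch (illustrative only, assuming numpy) shows how the mean change method behaves as the anchor correlation weakens:

```python
import numpy as np

rng = np.random.default_rng(11)

def mean_change_mct(n, sd, rho, mean=7.5):
    """Mean change method under an imperfectly correlated anchor:
    classify by the anchor, then take the improved-minus-stable
    difference in mean change score."""
    change = rng.normal(mean, sd, n)
    z = rng.normal(mean, sd, n)
    anchor = rho * change + np.sqrt(1 - rho**2) * z
    anchor += change.mean() - anchor.mean()
    improved = anchor > mean
    return change[improved].mean() - change[~improved].mean()

# The recovered MCT shrinks roughly in proportion to the anchor correlation
# (for a normal cohort it is approximately rho * 1.596 * SD).
mcts = {rho: mean_change_mct(20000, 5.0, rho) for rho in (0.3, 0.6, 1.0)}
```

A large n is used here so the proportionality is visible without simulation noise; the study itself examined how sample size and SD alter the spread of these estimates.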

Results

Individual level
Results showed that each of the variables controlled in this simulation study (i.e., magnitude of correlation coefficient, sample size and variability of change score) played a role in the accuracy of the recovered MCT. As can be seen in Table 1 and Figs. S1-S4 in the supplementary material, ROC curve-based methods were less accurate than PM-based methods at recovering the true MCT, as shown by the spread of the recovered MCT around the expected value of 15. To help the reader assess bias in relative terms, Table 1 shows the bias from the simulated MCT as a deviation from 0. Furthermore, smaller sample sizes led to higher variability in the returned MCT for both methods, but variability was higher still for the ROC methods. Anchors which had a low correlation coefficient with the COA change score also led to increased variability in the returned MCT when using ROC methods; notably, this increased variability was not observed for the PM-based methods (Table 1). Interestingly, although larger COA change score SDs led to a less accurate estimate of the MCT (particularly for ROC curve-based assessments), at the highest level of variability (SD = 7.5) and for sample sizes of less than 2000, there was some small variability in the returned MCT even when the anchor had a near-perfect correlation with the COA instrument (r = ~ 1). This is due to the variability inherent in performing such analyses on small samples with a large variance-to-estimate ratio.

Group level
When the mean change method was used to derive MCTs, results showed that the resulting MCT was consistently underestimated. For example, an anchor correlated at 0.30 typically returned a value of 2.0–3.0 and an anchor correlated at 0.60 typically returned a value of 4.5–5.5 (Fig. 2). This compares to a "true" MCT (anchor correlation of 1) of just under 8.0. Although the variability around these estimates changed with sample size and COA change score variability, the pattern remained the same (Fig. 3). Of note, the "true" MCT also changed as the SD of the COA change score increased. This is expected and is considered in the Discussion.

Discussion
This study aimed to address the unmet need for clearer guidelines about what level of correlation between external anchor(s) and COAs can be considered sufficient to conduct anchor-based analyses and produce an appropriate MCT that reflects a true meaningful change.
While prior studies have explored the effect of anchor correlations and sample sizes on MCTs based on the mean change method and used for interpretation at the group level, this has not been assessed for methods used for individual-level interpretation. The overall objective of this study was therefore to conduct a simulation exploring the degree to which various anchor, sample and change score conditions may influence selection of a true MCT at both the individual level (responder definition; RD) and the group level. Different anchor correlations were assessed while evaluating different sample sizes and distributions of change scores to explore the effect on MCTs.
Overall, the results show that an anchor correlation of 0.50–0.60 may be appropriate to identify a true MCT (RD) cut point at the individual level when using ROC analyses. Notably, this range is dependent on the sample size and change score variability [17]. However, here we show that PM methods (as described elsewhere) [16] were better able to accurately recover the true MCT (RD) at the individual level, even with weaker anchor correlations. Even correlations at the lower end of the literature-recommended level (r = 0.30) provided PM results which were more or less equivalent to those obtained with ROC analysis at a correlation of r = 1.0. In addition, the PM approach provided robust MCT estimates despite small sample sizes and high variability in COA change scores. These factors appear to reduce the accuracy of ROC estimates more substantially than that of PM estimates.
Using the mean change method to derive MCTs for group-level interpretation showed that any level of anchor correlation lower than near perfect (i.e., r = ~ 1.0) led to underestimation of the true MCT. This is theoretically unsurprising, given that an anchor which is used to divide patients into two groups will introduce noise and error into this division at a rate that is inversely related to the magnitude of its correlation. Therefore, a weakly correlated anchor will be worse at correctly classifying patients into groups of those who have, and have not, improved. The result of this is not only a widening of the distribution of change scores in each of the "stable" and "improved" groups, but crucially, a shift in the group means. As more patients are misclassified by the anchor, their associated COA change score used to derive the MCT is also incorrectly classified. This directly leads to a shift in the mean of each group, whereby the erroneous inclusion of improved patients in the "stable" group increases the stable group's mean and vice versa for the "improved" group. Differences in group means are therefore increasingly diminished as misclassification increases leading to an underestimated MCT. Relative to the true MCT, the returned MCT for any given correlation appears proportional to the magnitude of the correlation. These results appear similar to the attenuation of group-level MCTs as described elsewhere when linear regression techniques are used [14]. In the case of linear regression, r is directly incorporated into the calculation of the MCT such that any value of r below 1.0 leads to an underestimate of the true MCT, a method argued against by Fayers and Hays (2014) [14]. In addition, this analysis was shown to be heavily influenced by the SD of the COA change scores. Even with a perfect anchor, a wider distribution of change scores leads to a larger difference between groups of "improved" and "stable" patients. 
This is because the mean of "improved" and "stable" groups also diverges as the change score SD increases. Although this, in itself, is not an issue, it is then compounded with the strength of the anchor correlation. A poorly correlated anchor can underestimate a between-group MCT much more substantially when the "true" MCT is higher by virtue of a large change score SD.
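For the normal cohorts simulated here, this compounding can be made precise with standard results for a bivariate normal change score X and anchor A (equal means μ, equal SDs σ, correlation ρ); this derivation is ours, offered as a plausibility check rather than as part of the original analysis:

```latex
\mathbb{E}[X \mid A > \mu] \;=\; \mu + \rho\sigma\,\frac{\phi(0)}{1-\Phi(0)}
                           \;=\; \mu + \rho\sigma\sqrt{2/\pi},
\qquad
\mathrm{MCT}(\rho) \;=\; 2\rho\sigma\sqrt{2/\pi} \;\approx\; 1.596\,\rho\sigma .
```

For σ = 5 this gives roughly 7.98 at ρ = 1, 4.79 at ρ = 0.6 and 2.39 at ρ = 0.3, consistent with the pattern reported in the Results.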
The FDA PFDD guidance recommends use of both eCDF and PDF curves to aid identification of an appropriate MCT [7]. While this recommendation is clear, the PFDD guidance lacks specificity on what anchor-based analyses can be considered acceptable. While the PFDD guidance does make the distinction between individual-level within-patient change and between-group mean differences, both the PFDD and PRO guidance documents lack specific individual-level analyses which can be used to determine meaningful change, focusing only on assessment of group-level change and applying this level of change to define individual-level RDs [1]. As illustrated in this study, a threshold at the group level is likely to be very different from a threshold considered appropriate at the individual level, and as such individual- and group-level thresholds cannot, and should not, be considered interchangeable. Results from this study indicate that PM anchor analyses were more reliable and less susceptible to small sample sizes and variability in COA change scores. However, if sample sizes are not prohibitive, somewhat reliable results can also be produced using ROC analyses. Selection of an appropriate anchor-based analysis should be based on these factors.
Regardless of the analysis selected, however, a sufficient correlation should ideally be demonstrated between the external anchor(s) and the target COA. As shown here, what counts as 'sufficient' will depend on whether meaningful change will be assessed at the individual or the group level. It is acknowledged that, in reality, achieving a 'sufficient' correlation between the external anchor(s) and target COA may be challenging. It may also introduce some level of circularity, whereby only self-report external anchors assessing the same concept of interest as the COA correlate at a sufficiently high level, ruling out the possibility of using more clinical assessments as external anchors. In practice, using multiple anchors, potentially combining clinical assessments and self-reports, and accounting for the level of correlation between the external anchor(s) and target COA when triangulating across multiple anchors, is recommended.
It is important to note the limitations of the study when interpreting the results. The simulations presented here included only one "stable" and one "improved" group (representing the "minimally improved" patients), as it is these two groups which typically define meaningful change. One problem with this is that there was no opportunity for an imperfectly correlated anchor to "misclassify" patients outside of this range. In real studies, imperfectly correlated anchors are likely to misclassify patients who have worsened as "stable" and to misclassify patients with a moderate improvement as "minimally improved". This could have some effect in controlling the shift in the means observed in this study. In addition, for the simulation of group-level meaningful change, a single normal distribution was used and the mean of this distribution was used to divide patients into "stable" and "improved". The mean difference between these groups varied based on the SD simulated, but the arising results were consistent with one another. It would have been advantageous to simulate each of the groups from its own normal distribution. Unfortunately, this made recategorization of the patients (based on the correlated anchor) impossible. For example, when a group has a mean of 0, patients will fall on either side of this mean. Using a score of 0 to regroup patients into "stable" and "improved" groups would result in half of the "stable" patients being reclassified as "improved" even when the anchor has a perfect correlation with the measure under examination. Future work could take another approach, perhaps simulating groups based on a perfect anchor measure (r = 1.0) as suggested above, but then changing the strength of the correlation through manipulation of the standard deviation.
Under that approach the standard deviation could not be varied independently of the anchor correlation strength, but it may help to address this important question and test whether the underestimation persists.
Finally, we assessed methods using anchors correlated with the change score at between 0.30 and 1.0. The PM method worked well at all tested levels. The ability of this method to perform well at such low correlation levels was unexpected, and we cannot make claims about its performance for anchors correlated below 0.30.
Findings from this study support the use of anchor correlations above 0.30 to identify an appropriate individual-level MCT when using PM-based methods, while stronger correlations (perhaps around 0.50–0.60) are needed for ROC-based methods. At the individual level, 0.50–0.60 may represent an ideal threshold; however, correlations in the 0.30–0.50 range remain a viable outcome in practice. In such cases, consideration of the PM method as a primary analysis would be beneficial, and close attention to the correlation when triangulating across multiple anchors is recommended. For deriving MCTs for interpreting group-level results using the mean change method, there will always be a bias in the MCT derived from a less-than-perfect correlation. Researchers should assess what they consider acceptable as an anchor correlation in their own work based on the results here, and perhaps err on the side of a more conservative estimate given the apparent underestimation present in this method. No recommendation for the group level is offered here, as any relationship less than r = 1.0 leads to an attenuation of the true threshold under the current simulation method. However, given this knowledge, it may be possible in future to develop an anchor correlation-based adjustment for group-level MCTs to help account for the bias observed. This adjustment would also need to account for the SD of the COA change score, but perhaps not the sample size. Further work is needed to support development of guidance for the conduct of appropriate anchor-based analyses.
Importantly, selection of an anchor should be based on one that is simple, easy to understand and representative of the concept the researcher is aiming to classify. Only in this way can misclassification error be reduced and some faith in the group-level anchor-based MCT be assured.
Funding The authors have no relevant financial or non-financial interests to disclose.