This study aimed to provide some answers to the unmet need for clearer guidelines about what level of correlation between external anchor(s) and COAs can be considered sufficient to conduct anchor-based analyses and produce an appropriate MCT that reflects a true meaningful change. While prior studies have explored the effect of anchor correlations and sample sizes on MCTs at the group-level, this has not been assessed at the individual-level. The overall objective of this study was to conduct a simulation exploring the degree to which various anchor, sample and change score conditions may influence selection of a true MCT at both the individual-level (responder definition; RD) and group-level. Different anchor correlations were assessed alongside different sample sizes and distributions of change scores to explore their effect on MCTs.
Overall, the results show that an anchor correlation of 0.50–0.60 may be appropriate to identify a true MCT cut point at the individual-level when using ROC analyses. Notably, this range depends on the sample size and the variability of the change score under assessment. Smaller samples and larger variability in the COA change score reduce the reliability of the ROC-based results produced here, regardless of anchor correlation. Although not assessed in this study, Terluin et al. (2020) described how the MCT derived from ROC curve analyses may be biased when anchor groups are unequal in size [15]. However, here we show that PM methods (as described elsewhere) [14] were better able to accurately represent the true MCT at the individual-level, even with weaker anchor correlations. Even correlations at the lower end of the literature-recommended level (r = 0.30) provided PM results broadly equivalent to those obtained with ROC analysis at a correlation of r = 1.0. In addition, the PM approach provided robust MCT estimates despite small sample sizes and high variability in COA change scores; these factors reduced the accuracy of ROC estimates considerably more than that of PM estimates.
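To make the ROC procedure referred to above concrete, the following is a minimal, purely illustrative sketch in which the MCT cut point is the change score maximising Youden's J across anchor-defined groups. The group means, SDs and sample sizes are hypothetical and are not taken from the study; the PM approach would instead fit a logistic model of anchor status on the change score.

```python
import random

random.seed(7)

# Hypothetical COA change scores for anchor-defined groups:
# "not improved" ~ N(0, 4), "improved" ~ N(5, 4); n = 200 per group.
not_improved = [random.gauss(0, 4) for _ in range(200)]
improved = [random.gauss(5, 4) for _ in range(200)]

def roc_cut_point(neg, pos):
    """Return the change-score threshold maximising Youden's J
    (sensitivity + specificity - 1) over all observed candidate cut points."""
    candidates = sorted(set(neg + pos))
    best_t, best_j = None, -1.0
    for t in candidates:
        sens = sum(x >= t for x in pos) / len(pos)   # improved correctly above t
        spec = sum(x < t for x in neg) / len(neg)    # not-improved correctly below t
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t

mct = roc_cut_point(not_improved, improved)
print(round(mct, 2))  # a cut point lying between the two group means
```

Wider change-score SDs or smaller groups make the maximising threshold unstable across repeated samples, which is the ROC fragility described above.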
At the group-level, any level of anchor correlation lower than near perfect (i.e., r = ~ 1.0) led to underestimation of the true MCT. This is theoretically unsurprising, given that an anchor which is used to divide patients into two groups will introduce noise and error into this division at a rate that is inversely related to the magnitude of its correlation. Therefore, a weakly correlated anchor will be worse at correctly classifying patients into groups of those who have, and have not, improved. The result of this is not only a widening of the distribution of change scores in each of the “not-improved” and “improved” groups, but crucially, a shift in the group means. As more patients are misclassified by the anchor, their associated COA change score used to derive the MCT is also incorrectly classified. This directly leads to a shift in the mean of each group, whereby the erroneous inclusion of improved participants in the “not-improved” group increases the not-improved group’s mean and vice versa for the “improved” group. Differences in group means are therefore increasingly diminished as misclassification increases, leading to an underestimated MCT. Relative to the true MCT, the returned MCT for any given correlation appears proportional to the magnitude of the correlation. These results appear similar to the attenuation of group-level MCTs as described elsewhere when linear regression techniques are used [12]. In the case of linear regression, r is directly incorporated into the calculation of the MCT such that any value of r below 1.0 leads to an underestimate of the true MCT, a method argued against by Fayers and Hays (2014) [12]. In addition, this analysis was shown to be heavily influenced by the SD of the COA change scores. Even with a perfect anchor, a wider distribution of change scores leads to a larger difference between groups of “improved” and “not-improved” patients.
This is because the mean of “improved” and “not-improved” groups also diverges as the change score SD increases. Although this, in itself, is not an issue, it is then compounded with the strength of the anchor correlation. A poorly correlated anchor can underestimate a between-group MCT much more substantially when the “true” MCT is higher by virtue of a large change score SD.
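The mechanism described above, whereby anchor misclassification pulls the two group means together and shrinks the mean-difference MCT, can be illustrated with a small simulation. The group means, SDs and misclassification rates below are hypothetical, not those used in the study:

```python
import random

random.seed(1)

# Hypothetical true change scores: improved patients ~ N(5, 3),
# not-improved patients ~ N(0, 3); n = 1000 per group.
n = 1000
truth = [(random.gauss(5, 3), True) for _ in range(n)] + \
        [(random.gauss(0, 3), False) for _ in range(n)]

def group_mct(data, misclass_rate):
    """Mean-difference MCT when the anchor randomly misclassifies
    a given fraction of patients into the wrong group."""
    improved, not_improved = [], []
    for score, truly_improved in data:
        flipped = random.random() < misclass_rate
        anchor_label = truly_improved != flipped  # flip label on misclassification
        (improved if anchor_label else not_improved).append(score)
    return sum(improved) / len(improved) - sum(not_improved) / len(not_improved)

for rate in (0.0, 0.1, 0.25, 0.4):
    print(rate, round(group_mct(truth, rate), 2))
# The mean difference (and hence the MCT) shrinks as misclassification grows,
# and a larger change-score SD raises the perfect-anchor MCT that is then attenuated.
```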
Although the results presented here were stark and may require researchers to reconsider the MCT practices they employ, they were confirmed through the use of an alternate procedure for calculating the anchor correlations (not presented here). The alternate method used a Cholesky decomposition rather than the algebraic formula to specify the intended correlation between the COA change score and the anchor score.
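For a bivariate normal pair, the Cholesky factor of the 2×2 correlation matrix [[1, r], [r, 1]] is [[1, 0], [r, √(1 − r²)]], so a correlated anchor can be built by mixing two independent standard normals. The sketch below illustrates this generation step with purely illustrative parameters, not the study's actual simulation settings:

```python
import random
import math

random.seed(3)
r = 0.5        # target correlation between COA change score and anchor score
n = 50_000     # large n so the realised correlation is close to the target

pairs = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    change = z1                                   # COA change score
    anchor = r * z1 + math.sqrt(1 - r * r) * z2   # Cholesky-mixed anchor score
    pairs.append((change, anchor))

# Check the realised Pearson correlation against the target
mx = sum(c for c, _ in pairs) / n
my = sum(a for _, a in pairs) / n
cov = sum((c - mx) * (a - my) for c, a in pairs) / n
sx = math.sqrt(sum((c - mx) ** 2 for c, _ in pairs) / n)
sy = math.sqrt(sum((a - my) ** 2 for _, a in pairs) / n)
realised_r = cov / (sx * sy)
print(round(realised_r, 3))  # close to the target r = 0.5
```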
The FDA PFDD guidance recommends use of both eCDF and PDF curves to aid identification of an appropriate MCT [5]. While this recommendation is clear, the PFDD guidance lacks specificity on which anchor-based analyses can be considered acceptable. Although the PFDD guidance does make the distinction between individual-level within-patient change and between-group mean differences, in both the PFDD and PRO guidance documents there is a lack of specific individual-level analyses that can be used to determine meaningful change; both focus on assessment of group-level change and apply this level of change to define individual-level RDs [1]. As illustrated in this study, a threshold at the group-level is likely to be very different to a threshold considered appropriate at the individual-level, and as such individual and group-level thresholds cannot, and should not, be considered interchangeable. Results from this study indicate that PM anchor analyses were more reliable and less susceptible to small sample sizes and variability in COA change scores. However, if sample sizes are not prohibitive, reasonably reliable results can also be produced using ROC analyses. Selection of an appropriate anchor-based analysis should be based on these factors.
Regardless of the analysis selected, however, a sufficient correlation should ideally be demonstrated between the external anchor(s) and target COA. As shown here, what counts as ‘sufficient’ will depend on whether meaningful change is assessed at the individual or group-level. It is acknowledged that, in reality, achieving a ‘sufficient’ correlation between the external anchor(s) and target COA may be challenging. It may also introduce some level of circularity, whereby only self-report external anchors assessing the same concept of interest as the COA correlate at a sufficiently high level, ruling out the possibility of using more clinical assessments as external anchors. In practice, it is recommended to use multiple anchors, potentially combining clinical assessments and self-reports, and to account for the level of correlation between the external anchor(s) and target COA when triangulating across them.
It is important to note the limitations of the study when interpreting the results. The simulations presented here included only one “not-improved” and one “improved” group (representing the “minimally improved” patients), as it is these two groups which typically define meaningful change. One consequence is that there was no opportunity for an imperfectly correlated anchor to “misclassify” patients outside of this range. In real studies, imperfectly correlated anchors are likely to misclassify patients who have worsened as “not-improved” and patients with a moderate improvement as “minimally improved”, which could partially offset the shift in the group means observed in this study. However, further simulations conducted alongside this study (not reported here) have assessed the impact of this more realistic situation, involving groups of patients outside the minimally improved and not-improved groups (those who have worsened or improved to a greater degree), and have not resolved the issues observed here. As such, although it remains necessary to complete and share work examining how typical the extreme group-level results presented here are of real-life studies, it is unlikely that future comprehensive simulations would offer renewed faith in this method. Importantly, an anchor should be selected that is simple, easy to understand and representative of the concept the researcher is aiming to classify. Only in this way can misclassification error be reduced and some faith in the group-level anchor-based MCT be assured.
Another limitation is that, under some conditions, even a near-perfectly correlated anchor (r = ~ 1.0) showed some variability in the returned MCT cut point. This could be a result of the anchor approaching, rather than exactly equalling, r = 1.0. Equally, it could be due to the wide distribution of change scores influencing the ability of these anchor-based methods to determine the true MCT. Given that this effect also varies by sample size, it is likely due to between-sample fluctuation, where the true mean of the sample varies randomly in line with the procedures used to generate the data. Sample size alone affects the variability of frequencies for returned MCTs. While many studies are unavoidably limited by sample size, such as in rare diseases, longitudinal study designs can be used to increase the number of data points over time for a limited sample size, increasing the reliability of estimates and the likelihood of selecting a true MCT. Collection of data from multiple early and later timepoints may therefore mitigate the limitations arising from small sample sizes.
Findings from this study support the use of anchor correlations above 0.30 to identify an appropriate individual-level MCT when using PM-based methods, while stronger correlations (around 0.50–0.60) are needed for ROC-based methods. At the individual-level, 0.50–0.60 may represent an ideal threshold; however, correlations in the 0.30–0.50 range remain viable in practice. In such cases, consideration of the PM method as a primary analysis would be beneficial, and close attention to the correlation when triangulating across multiple anchors is recommended. At the group-level, there will always be a bias in the MCT derived from a less-than-perfect correlation; researchers should assess what they consider acceptable as an anchor correlation in their own work based on the results here, and perhaps err on the side of a more conservative estimate given the under-estimation apparent in this method. No recommendation for the group-level is offered here, as any relationship below r = 1.0 leads to an attenuation of the true threshold. However, given this knowledge, it may be possible in future to develop an anchor correlation-based adjustment for group-level MCTs to help account for the bias observed. This adjustment would also need to account for the SD of the COA change score, though perhaps not the sample size. Further work is needed to support the development of guidance for the conduct of appropriate anchor-based analyses.
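As a purely hypothetical illustration of the form such a correlation-based adjustment might take, the sketch below assumes, as the group-level results suggest, that attenuation is roughly proportional to r. It deliberately ignores the change-score SD, which, as noted above, a real adjustment would also need to model, and it is not a validated method:

```python
def naive_adjusted_mct(observed_mct, anchor_r):
    """Hypothetical correction assuming the observed group-level MCT is
    attenuated in strict proportion to the anchor correlation.
    Illustrative only; not a validated or recommended procedure."""
    if not 0 < anchor_r <= 1:
        raise ValueError("anchor correlation must be in (0, 1]")
    return observed_mct / anchor_r

# e.g. an observed mean-difference MCT of 2.0 under an anchor correlation of 0.5
print(naive_adjusted_mct(2.0, 0.5))  # → 4.0
```

Any real adjustment would need to be derived and validated against simulations such as those reported here before use.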