Comparison of Anchor- and Distribution-Based Methods for Estimating Thresholds of Meaningful Within-Patient Change Using Simulated PROMIS PF 20a Data Under Various Joint Distribution Characteristic Conditions

Purpose: To compare the performance of anchor-based and distribution-based methods for estimating thresholds of meaningful within-patient change of clinical outcome assessments under conditions reflecting data characteristics of small- to medium-sized clinical trials. Data sets were generated from the joint distributions of PROMIS PF 20a T-score changes and a seven-point global change anchor measure. The 108 simulation conditions (1,000 replications per condition) combined three marginal distributions of T-score changes, three improvement percentages on the anchor measure, four levels of responsiveness correlation, and three sample sizes. Threshold estimation methods included mean change, median change, ROC curve, predictive modeling, half SD, and SEM. Relative bias, precision, accuracy, and measurement significance of the estimates were evaluated against the true thresholds and the IRT-based individual reliable changes of PROMIS scores. Quantile regression models were applied to select and interpret effects of simulation conditions on estimation bias.


Introduction
Statistically significant mean change on clinical outcome assessment (COA) scores does not necessarily correspond to meaningful change from the perspective of patients, observers, or clinicians [1]. To guide interpretation of COA score changes for regulatory review, thresholds characterizing a clinically meaningful within-patient change (or responder thresholds) are commonly requested by the United States Food and Drug Administration (FDA) in submissions for COA labeling claims [1,2]. Patients can be classified as either "responders" or "nonresponders" based on the threshold to facilitate later treatment efficiency evaluation.
Given the high stakes of COAs used as primary or key secondary endpoints, it is important to ensure that thresholds of meaningful within-patient changes are developed using thoughtful study design and rigorous methods. Of the available estimation methods, the current FDA COA guidance documents recommend anchor-based methods as the primary approach, with distribution-based methods being supportive [1,2]. Anchor-based methods rely on an external criterion to characterize meaningful improvement, such as change based on a relevant global rating or change in a well-established outcome (e.g., biomarker or COA) with an accepted threshold identifying clinical improvement. Commonly used anchor-based methods include anchor-based mean, anchor-based median, receiver operating characteristic (ROC) curve, and predictive modeling methods [3][4][5]. Distribution-based methods use the target COA's statistical or measurement attributes (e.g., the half standard deviation [SD] of the baseline COA score or the standard error of measurement [SEM] [7-9]) as a guide for interpreting change.
To date, few studies have systematically evaluated the performance of different estimation methods for meaningful within-patient change thresholds under various combinations of data conditions [3,9]. This study addresses this gap with a simulation study comparing performance under conditions designed to reflect realistic characteristics of clinical trial or observational data. Although the current study focuses on anchor-based methods, two distribution-based methods were also incorporated to provide further context and supportive boundaries. Estimation performance was evaluated by comparing the sample estimates of each method to the true thresholds under the corresponding simulation conditions and to the respective individual measurement precision. Additionally, the impacts of simulation conditions on the biases in anchor-based estimates were evaluated to facilitate researchers' planning of key study design features and post-hoc data selection for the application of these methods.
The current simulation considered the systematic impacts of four data conditions. The first condition was the distribution of target COA changes, characterized by differences in variance and skewness. The second condition was the proportion of patients reporting improvement on a global anchor measure, which was used to identify the population-level true threshold of meaningful within-patient change in the target COA. The third condition was the relationship strength (i.e., correlation) between the target COA change score and the anchor measure. Revicki et al. [11] suggested a correlation greater than 0.30; Hays et al. [12] suggested at least 0.371; and de Vet et al. [13] recommended a higher bar of 0.50. The fourth condition was overall sample size. Although it is essential to ensure that the estimation methods provide consistent thresholds with large sample sizes, it is also important to understand how the methods behave with small sample sizes, as the recruitment of large samples is often not feasible in rare disease clinical trials.

Methods
Simulation

Table 1 provides an overview of the simulation study design conditions.
1. PROMIS T-score change distribution: Changes were sampled from three marginal distributions. The first two distributions were normal with the same mean but different SDs, and the third was a negatively skewed distribution with the same mean and an approximately matching SD relative to the second distribution. The population mean of the T-score change was fixed to 7 such that at least 50% of subjects could achieve change above an extreme RC of 6.9 (computed from an extreme SE = 2.5 for the T-scores). This is intended to represent a trial or other longitudinal study designed with an effect size of overall change that PROMIS PF SF 20a can detect for at least 50% of the subjects.
2. Anchor measure distribution: The anchor measure was a hypothetical seven-category Patient Global Impression of Change (PGIC) generated separately from three types of marginal distributions (Table 1) characterized by levels of meaningful improvement or responder percentages: 30%, 50%, and 70%.
3. Strength of correlation: To create four weak to strong Spearman correlations (ρ = 0.1, 0.3, 0.5, and 0.7) between the T-score change and the PGIC, the Iman-Conover method was implemented through a series of matrix factorization, multiplication, and pairing rearrangement [15,17]. The method generated monotonic relationships between the two variables, without specifying a precise linear or nonlinear model or distributional assumptions (unlike the polyserial correlation), to extend the generalizability of the current results.
4. Sample size: Three sample sizes (n = 50, 100, and 300) were simulated. The first sample size represented a reasonable scenario for rare disease. The second and third reflected typically sized clinical trials evaluating COA measures in practice [17].
When combined, the design conditions yielded 108 settings (3×3×4×3). For each setting, 1,000 datasets were generated via repeated sampling. All data generation and analyses were performed using SAS version 9.4 or higher for Windows statistical software [19].
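As a rough illustration of the Iman-Conover step described above, the sketch below induces a target Spearman correlation between two simulated marginals. This is a two-variable Python sketch under assumed marginals; the function and variable names are ours, not the study's SAS implementation.

```python
import numpy as np
from scipy import stats

def iman_conover_pair(x, y, rho, rng):
    """Rearrange the pairing of x and y so the pair attains approximately
    the target Spearman correlation rho, while preserving both marginal
    distributions (a two-variable sketch of the Iman-Conover method)."""
    n = len(x)
    # Van der Waerden (normal) scores
    scores = stats.norm.ppf(np.arange(1, n + 1) / (n + 1))
    z1 = rng.permutation(scores)
    # Second score vector correlated rho with the first
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.permutation(scores)
    # Pair the sorted samples according to the ranks of the score vectors
    x_out = np.sort(x)[stats.rankdata(z1).astype(int) - 1]
    y_out = np.sort(y)[stats.rankdata(z2).astype(int) - 1]
    return x_out, y_out

rng = np.random.default_rng(2024)
t_change = rng.normal(7.0, 3.5, 300)          # simulated T-score changes
pgic = rng.integers(1, 8, 300).astype(float)  # 7-category anchor values
t_c, p_c = iman_conover_pair(t_change, pgic, 0.5, rng)
```

With n = 300 the realized Spearman correlation lands near, but not exactly at, the target; this is consistent with the study's additional retention filter on the sample responsiveness correlation (see Declarations).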

Methods for estimating thresholds of within-patient meaningful change
The anchor-based mean method, anchor-based median method, ROC curve analysis (using logistic regression and optimizing sensitivity and specificity), and a predictive modeling method (using logistic regression) (Table 2) were applied to each simulated dataset [3,5,12,13,20]. Two distribution-based methods were applied to provide supportive estimates: half SD and SEM at baseline. Finally, individual RCs served as reference values to evaluate the estimates, given that these IRT-based values are constant for specific item response patterns.
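A compact sketch of the six estimators is given below, in Python rather than the study's SAS; the function names, the minimal logistic fitting routine, and the toy inputs are ours, not the study's code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(x, y):
    """Minimal maximum-likelihood logistic regression of y (0/1) on x."""
    def nll(b):
        z = b[0] + b[1] * x
        return np.sum(np.logaddexp(0.0, z) - y * z)
    return minimize(nll, np.zeros(2), method="BFGS").x

def threshold_estimates(t_change, responder, min_improved, sd_bl, sem_bl):
    """Sketch of the six estimators. responder: 1 if the anchor is at least
    'minimally improved'; min_improved: True only for the 'minimally
    improved' category; sd_bl/sem_bl: baseline SD and SEM."""
    est = {
        "anchor_mean": t_change[min_improved].mean(),
        "anchor_median": np.median(t_change[min_improved]),
        "half_sd": 0.5 * sd_bl,   # distribution-based, supportive
        "sem": sem_bl,            # distribution-based, supportive
    }
    # ROC: observed change minimizing (1 - sens)^2 + (1 - spec)^2
    resp, nonresp = t_change[responder == 1], t_change[responder == 0]
    loss = [((1 - np.mean(resp >= c)) ** 2 + (1 - np.mean(nonresp < c)) ** 2, c)
            for c in np.unique(t_change)]
    est["roc"] = min(loss)[1]
    # Predictive modeling: change score where predicted P(responder) = 0.5
    b0, b1 = fit_logistic(t_change, responder)
    est["pred_model"] = -b0 / b1
    return est

rng = np.random.default_rng(7)
t = rng.normal(7.0, 3.5, 300)
resp = (t + rng.normal(0.0, 3.0, 300) > 5.0).astype(float)      # noisy anchor
min_imp = (resp == 1) & (t <= np.quantile(t[resp == 1], 0.4))   # toy category
est = threshold_estimates(t, resp, min_imp, sd_bl=10.0, sem_bl=2.5)
```

Note how the two logistic-based estimators use the whole sample, whereas the mean and median estimators use only the "minimally improved" subgroup; this difference underlies the precision contrasts reported in the Results.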

Performance evaluation criteria
The 1,000 threshold estimates per method in every simulation setting were compared with two types of reference values: the population-level true thresholds and the individual RCs.

Comparison with true thresholds
For each simulation setting, the true threshold was defined as the quantile of T-score change corresponding to the target population-level anchor-based percentage of improvement. For example, for the 30%-improvement condition, the true threshold corresponded to the 0.70 quantile of the normal distribution of T-score change and, for the skewed condition, to 16 minus the 0.30 quantile of the Gamma(shape = 1.5, scale = 6) distribution (Table 1).
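The two true thresholds in this example can be computed directly from the stated quantiles; the sketch below uses the Normal(7, 3.5) marginal for illustration.

```python
from scipy import stats

# 30% improvement => true threshold at the 0.70 quantile of the change score
thr_normal = stats.norm.ppf(0.70, loc=7.0, scale=3.5)

# Negatively skewed condition: change = 16 - G with G ~ Gamma(shape=1.5, scale=6),
# so P(change >= thr) = 0.30 gives thr = 16 - (0.30 quantile of G)
thr_skewed = 16.0 - stats.gamma.ppf(0.30, a=1.5, scale=6.0)
```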
Three performance statistics were computed: relative bias (RB) as systematic difference from the true value, coefficient of variation (CV) as random error around the average of the estimates, and relative root mean squared error (rRMSE) to quantify how accuracy was impacted by both bias and precision. These statistics were computed as percentages to facilitate comparison across settings:

RB = 100% × (E(T) − τ) / τ
CV = 100% × √Var(T) / E(T)
rRMSE = 100% × √E[(T − τ)²] / τ

where T was one estimate from one sample; E(T) and Var(T) were the mean and variance of the 1,000 estimates per method × simulation setting, respectively; and τ was the corresponding true threshold. Relative bias closest to 0% (no bias) is preferred, with a positive or negative direction indicating the risk of misclassifying responders or increasing false responders, respectively. Smaller CV and rRMSE values indicate better estimation with higher precision and higher accuracy.
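These three statistics can be computed as follows (a sketch using the conventional definitions: RB and rRMSE taken relative to the true threshold τ, CV relative to the mean estimate):

```python
import numpy as np

def performance_stats(estimates, tau):
    """Relative bias, coefficient of variation, and relative RMSE (all in %)
    for a set of replicate threshold estimates against the true value tau."""
    t = np.asarray(estimates, dtype=float)
    rb = 100.0 * (t.mean() - tau) / tau                      # systematic error
    cv = 100.0 * t.std(ddof=1) / t.mean()                    # random error
    rrmse = 100.0 * np.sqrt(np.mean((t - tau) ** 2)) / tau   # bias + precision
    return rb, cv, rrmse
```

For example, four replicate estimates of 12.0 against a true threshold of 10.0 give RB = 20%, CV = 0%, and rRMSE = 20%.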

Comparison with individual reliable change
For each subject in each simulated sample, the individual RC was computed as

RC = 1.96 × √(SE_BL² + SE_FU²)

where SE_BL and SE_FU were the reference SEs of the T-score at baseline and follow-up, respectively (Supplemental Table 1). Two performance statistics were computed: the percentage of subjects whose individual RCs were not greater than an estimate, and the median positive difference of the threshold estimate minus RC within every simulated sample. Higher values of both indicate that the threshold estimate was more likely to exceed individual measurement errors, one necessary property of an appropriate within-person threshold. The distributions of the first statistic were tabulated. The values of the second statistic were displayed in probability density function (PDF) plots.

Impact and interaction of study data characteristics within each anchor-based method

The predictors (Table 1) of estimation bias (i.e., T − τ) for each anchor-based method were screened through a quantile regression selection process using an adaptive Lasso method [21]. Candidate predictors were treated as categorical, as were all higher-order interactions. Due to nonconvergence of the full model using all candidate predictors, separate quantile regression selection processes were implemented, stratified by distribution type of T-score change, which, however, limited the later evaluation of interactions with distribution type. Based on results from these stratified models, a subset of predictors was selected.
These selected predictors, their lower-order terms, and the distribution type of T-score change were used to predict estimation bias in a final quantile regression model for each anchor-based method. The effects of significant predictors on the 0.25, 0.50, and 0.70 quantiles of estimation bias were plotted.

Results

Table 3 presents results for the comparison between estimated and true thresholds across the 108 simulation settings. No one method was the best performer overall (Supplemental Table 2). The strengths and weaknesses of the methods varied with settings and performance criteria. The most consistent observation was that the SEM method had the lowest CV values (smallest variability around the mean of every 1,000 estimates) across all settings, an inherent benefit of the large-scale IRT calibration of the PROMIS items. The second consistent observation was that, among all anchor-based methods, the predictive modeling method produced the lowest CV values across all settings. The third consistent observation was that the predictive modeling method was usually the best performer on all criteria when 50% or 70% of subjects were classified as responders by the anchor at the population level and the T-score changes formed generally normal distributions.

Performance evaluation
The remaining findings in Table 3 were more setting and/or criterion specific. To facilitate interpretation, the between-method difference in the magnitude of a performance statistic was considered similar if < 1%, comparable if ≥ 1% but < 5%, and superior/inferior if ≥ 5%. When about 30% of subjects were responders with normally distributed T-score change, the RBs of the mean and median methods were similar and smaller than those of the two logistic methods. The mean method had the smallest rRMSE (highest accuracy) in these settings. When the T-score changes were negatively skewed, the median method was clearly superior with 30% of responders, yielding the smallest RBs and rRMSE; the ROC curve method performed best with 50% of responders; and the predictive modeling method showed advantages again with 70% of responders. Additionally, the RBs and rRMSE of the half-SD method were the smallest in magnitude across all methods under the normal distribution (7.0, 7.0) of T-score change and 70% of responders, likely due to the coincidence that the true threshold of 3.3 under these conditions was near the half SD of 2.5 for the simulated population (Supplemental Table 2).

Table 4 presents the minimum, median, and maximum values of the percentages of subjects with estimates greater than individual RCs for the 1,000 replications (see Supplemental Table 3 for complete lists). (The comparison with half SD and SEM was not included because, notably, the two distribution-based estimates were always smaller than individual RCs.) As Table 4 shows, the predictive modeling method tended to provide the most consistent protection against individual measurement errors, in that its minimum percentages of subjects with individual RCs not greater than the estimates were almost always the highest across the simulation settings.
However, based on the median value of those percentages (≥ 95%), the threshold estimates of all four anchor-based methods were greater than RCs most of the time. The PDF plots of the positive differences indicated that when the estimates of the other methods (especially the median method [orange curves]) exceeded individual measurement errors, their positive differences tended to be larger (higher thresholds) than those from the predictive modeling method.

Significant impact and interaction of clinical data characteristics
The predictor selection process retained only first-order predictors and select two-way interactions. The reference predictor classes were designated as the normal distribution (7.0, 3.5), 50% improvement, ρ = 0.70, and n = 300 because the overall rRMSE (systematic and random difference) tended to be minimal within those reference classes, despite a few local reverse trends (Table 3).
The selected predictors were very similar across the models for the different estimation methods. All estimated effects of significant predictors for the anchor-based methods are plotted in Fig. 2, except for the interaction between ρ and n, which yielded very small effect sizes (−0.00 to 0.40) and almost undifferentiated lines if plotted (Supplemental Table 4). Across the anchor-based methods, the most prominent and consistent predictors were improvement percentage and its interaction with correlation strength (ρ): departure from the reference classes (50% improvement and ρ = 0.70) generally increased estimation bias. For example, for the predictive modeling method (the anchor-based method most sensitive to varying improvement percentage), 70% improvement had a main effect of a 1.0-point positive increase on the 0.50 quantile of estimation bias compared with 50% improvement; at ρ = 0.30, an additional 0.87-point positive increase was introduced by the improvement percentage × ρ interaction.
The next most prominent and consistent predictors were ρ and n, which impacted the mean and median methods more than the other two methods (Fig. 2). Lower correlation generally produced a negative increase in the bias; for example, a main effect of 1.23-point negative increase was shown on the 0.50 quantile of estimation bias by the mean method when ρ was reduced from 0.70 to 0.10. Smaller sample sizes generally increased the bias positively or negatively depending on the quantile location.
With normal distributions, larger variance (i.e., population SD = 7.0) tended to increase bias positively for the mean and median methods. A comparison of skewed and normal distributions (both with SD ~ 7) indicated an ~ 1.0-point negative increase in bias with skewed distribution when using the mean and predictive modeling methods on the 0.50 quantile of estimation bias.

Discussion And Conclusion
Although no single recommended method exists for estimating thresholds of meaningful within-patient change, in practice researchers tend to use the anchor-based mean approach as the primary method and distribution-based approaches as supportive. Alternatively, researchers tend to prefer the anchor-based median method whenever the COA change scores or anchor-measure distributions are skewed [e.g., 22, 23]. Using data generated for changes in PROMIS PF SF 20a T-scores, our simulation study compared four widely recognized anchor-based and two distribution-based methods for estimating thresholds of meaningful within-patient change under conditions designed to mimic realistic clinical and observational studies.
As expected, among the anchor-based methods, the optimal choice depended on the clinical data characteristics. Although the results supported the common application of mean or median anchor-based methods, they also identified scenarios where the other methods should be strongly considered. Specifically, when ≥ 50% of participants were true responders and PROMIS change scores generally formed a normal distribution, the predictive modeling method performed best overall in controlling bias, increasing precision and accuracy, and exceeding individual measurement errors. Although this method did not always yield the smallest bias on average, its variability around mean estimates was almost always the smallest among the anchor-based methods. This high precision was consistent with the simulation finding by Terluin et al.
[6] that the 95% CI for the ROC curve method was wider than that obtained by the predictive modeling method in the setting of 50% improvement prevalence and a normal distribution of target COA change. The likely reason for this finding is that both logistic regression methods use the entire sample to locate the threshold estimate based on sensitivity, specificity, or odds, whereas the mean and median methods focus on the group at one anchor level (e.g., "minimally improved"). Therefore, higher precision (low CV), especially at larger sample sizes for the two logistic methods, was not surprising.
With < 50% (e.g., 30%) of responders under normal distributions of T-score change, method preferences trended toward the mean and median anchor-based methods, which yielded the smallest RBs and satisfactory protection against measurement error most of the time. One major reason for this preference, as shown in Table 3 and Fig. 2, is that the mean and median methods had smaller increases in bias than the two logistic methods for the 30%-improvement group when the 50%-improvement group was used as the reference. At first glance, this finding seemed to conflict with the simulation findings by Terluin et al. [6], in which changing the "prevalence of improvement" alone did not affect the estimates of the two logistic-based methods. However, the current study and Terluin et al.
[6] applied different simulation conditions. The population percentage of improvement simulated for the anchor-based methods impacted the true threshold or responder definition in the current study, while the "prevalence of improvement" in Terluin et al. [6] may not have matched the underlying responder percentage. In Terluin et al.
[6], the true threshold was fixed at 3.5 when the prevalence changed from 50% to 70%, but in the current study, the true thresholds varied with the population improvement percentage of the PGIC.
For skewed T-score change distributions, the median method and the ROC curve method performed best under the conditions of 30% and 50% improvement, respectively. As shown in Table 3 and Fig. 2, this finding was likely related to the smaller positive increases in bias due to skewed distributions and the countereffect of a negative increase in bias due to 30% improvement for these two methods, in contrast to the larger positive effects of both predictors on the mean and predictive modeling methods. In the 70%-improvement condition, the countereffects were observed in the predictive modeling and mean methods, while the combined positive increases further inflated the bias from the other two methods.
Among the conditions investigated, the most suitable for minimizing rRMSE (hence reducing bias and increasing precision overall) was the setting with a normal distribution (7.0, 3.5), 50% improvement, ρ = 0.70, and n = 300. As a result of the PROMIS IRT-based calibration, the SEM method consistently demonstrated much smaller CV values than the anchor-based methods and the half-SD method; the median within-sample percentages of subjects with individual RCs not greater than the anchor-based estimated thresholds were at least 95%. These findings highlight the importance of selecting a reliable (small random variance in measurement) and valid (adequate relationship with the anchor measure) COA, in addition to identifying a robust data source (where both responders and nonresponders are well represented), when conducting analyses to identify a meaningful within-patient change threshold.

For example, if researchers intend to use interim data cuts of ongoing trials to establish the meaningful within-person threshold, it is sensible to wait until ~ 50% of the subjects can be considered responders, based on multiple anchor measures or external gold standards (where bias tends to be minimal and precision and accuracy tend to be maximized across methods), if feasible for related therapeutic areas. For literature reviews or meta-analyses of meaningful change, greater weight can be placed on thresholds estimated when approximately 50% of the participants were responders. Not surprisingly, this study's results further emphasize the need for a strong responsiveness correlation; however, this does not imply that the correlation must be perfect, because the unique value of the target COA (in addition to the anchor measures) is established in theory and qualitatively.
To maximize estimation precision, wise decisions must be made with respect to item selection, calibration, and scoring rule (i.e., valid, reliable, discriminative, highly intercorrelated items; raw versus pattern scoring; weekly versus monthly scores; and missing-data rule). As always, a larger sample and normal distribution of target COA change are desirable.
Finally, the half-SD and SEM methods generally underestimated the thresholds in most settings specified. This finding confirmed their roles as supportive estimates, in addition to the RC value, in identifying the minimal value when reporting a range of thresholds.

Limitations And Future Research
Although this study was designed to generalize to typical applications, there are limitations. This research focused on thresholds for detecting improvement in a COA; therefore, the results cannot be easily applied to COA thresholds for use in clinical trials or observational studies aimed at mitigating the progression (worsening) of a condition.
In addition, the correlation between PROMIS change and PGIC was simulated as a Spearman correlation to avoid assumptions of a linear relationship or a normal distribution of the target COA change. Readers should be cautious if directly applying these findings to situations with other correlation types (e.g., Pearson).
Another important consideration is that the simulation used a retrospective anchor measure with minimal measurement error (only from random sampling). In practice, retrospective anchors can be subject to additional measurement error due to response-shift bias or recall bias [24]. Fayers and Hays [24] recommend including both retrospective and concurrent anchors (e.g., global ratings of current severity) in clinical trial designs. Our simulated PGIC values could be considered as change between two administrations of Patient Global Impression of Severity (PGIS) rating scales. However, PGIS change likely would have provided more levels than our simulated PGIC, resulting in use of a different type of correlation. Similar caution would be required in settings using a continuous anchor measure but only two response classes, "responder" versus "nonresponder" (e.g., a biomarker with only one reference cutoff, or change in the 22-item Sinonasal Outcome Test using the recommended cutoff of −8.9 [25]), which would allow more flexibility in correlation computation.
Regardless of anchor measure type (retrospective or concurrent), more measurement error is still possible in practice. This would not only undermine the responder classification but also attenuate the responsiveness correlation [25]. Hence, a correlation corrected for measurement error [25] and sensitivity analyses of the responder classification at different confidence limits of the anchor should be considered in these situations.
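One standard option for the correction mentioned here is Spearman's classic disattenuation formula, which divides the observed correlation by the square root of the product of the two measures' reliabilities. The sketch below is illustrative; the reliability values are hypothetical, not from the study.

```python
def disattenuated_correlation(r_obs, rel_x, rel_y):
    """Spearman's correction for attenuation: estimate the error-free
    correlation from the observed correlation and the two reliabilities."""
    r = r_obs / (rel_x * rel_y) ** 0.5
    return max(-1.0, min(1.0, r))  # clip to the admissible range

# e.g., observed responsiveness correlation 0.35 with assumed
# reliabilities of 0.85 (COA change) and 0.70 (anchor)
r_corrected = disattenuated_correlation(0.35, 0.85, 0.70)
```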
Finally, due to computational limitations, the current study did not model the relationship between the baseline score and follow-up change in the target COA and did not allow for varying true thresholds or responder percentages conditioned on baseline scores. These knowledge gaps can be addressed by future studies to facilitate discussions about how to thoughtfully estimate responder thresholds under different clinical data characteristics.

Declarations
Funding: The study was supported by RTI Health Solutions.

Note: Except for sample size, the sample statistics of the population parameters were subject to random sampling fluctuation. To maintain the different levels of percentage improvement (based on PGIC) and responsiveness correlation, each simulated data set retained for analysis was required to have an absolute difference between the sample and population responsiveness correlations of < 0.065 and an absolute difference between the sample and population improvement percentages of ≤ 10%.

Conflicts of interest/Competing interests: Shanshan
a Bold text indicates improvement.

Table 2. Anchor- and distribution-based estimation methods

Method | Estimation

Anchor-based

Mean method
Arithmetic mean of the changes in the PROMIS T-scores for the subjects with "minimally improved" PGIC values

Median method
Median value of the changes in the PROMIS T-scores for the subjects with "minimally improved" PGIC values

ROC method
Observed PROMIS T-score change that minimized the sum of (1 − sensitivity)² and (1 − specificity)² in the receiver operating characteristic curve of the logistic regression to predict PGIC responder classification of at least minimally improved