Comparison of raw and regression approaches to capturing change on patient-reported outcome measures

Patient-reported outcome (PRO) analyses often involve calculating raw change scores, but the limitations of this approach are well documented. Regression estimators can incorporate information about measurement error and potential covariates, potentially improving change estimates. Yet adoption of these regression-based change estimators is rare in clinical PRO research. Both simulated and real PROMIS® pain interference data were used to calculate change employing three methods: raw change scores and the regression estimators proposed by Lord and Novick (LN) and Cronbach and Furby (CF). In the simulated data, the estimators' ability to recover true change was compared. Standard errors of measurement (SEM) and estimation (SEE), with associated 95% confidence limits, were also used to identify criteria for significant improvement. These methods were then applied to real-world data from the PROMIS® study. In the simulation, both regression estimators reduced variability compared to raw change scores by almost half. Compared to CF, the LN regression better recovered true simulated differences. Analysis of the PROMIS® data showed similar themes, and change score distributions from the regression estimators showed less dispersion. Using distribution-based approaches to calculate thresholds for significant within-patient change, smaller changes could be detected using both regression estimators. These results suggest that calculating change using regression estimates may result in increased measurement sensitivity. Using these scores in lieu of raw differences can help better identify individuals who experience real underlying change in PROs in the course of a trial, and enhance established methods for identifying thresholds for meaningful within-patient change in PROs.


Introduction
Estimating meaningful within-patient change is among the most important elements of statistical analysis to support patient-reported outcomes (PROs) as endpoints in clinical trials. In its most recent guidance for clinical outcome assessments, the United States Food and Drug Administration (FDA) defines meaningful within-patient change as improvement or deterioration from the patient's perspective, which serves as a way of defining clinical benefit on a PRO [1]. Improvement and deterioration are captured in terms of change in PRO scores over the course of a clinical trial. For example, on a fatigue PRO where higher scores indicate worse fatigue, deterioration is indicated by an increase in scores and improvement by a reduction in scores.
In clinical trials, change in PRO scores is calculated almost exclusively as the difference in scores between a post-baseline timepoint and a baseline timepoint, i.e., the raw change. These raw change scores are then used in conjunction with other methods to estimate meaningful within-patient change [2]. To determine meaningful thresholds for within-patient change, the FDA currently recommends stratifying raw change scores on a PRO by an anchor variable. Then, distributions of change scores by anchor group category are visualized by plotting empirical cumulative distribution function (eCDF) and probability density function (PDF) curves [3]. While raw change scores are simple to calculate and often easy to interpret, they have several notable disadvantages [4]. The largest of these is that raw change scores have high measurement error and can lead to misguided conclusions [5].
Given the problems with raw change scores, alternative approaches are needed to adjust these scores for measurement error. Fortunately, the classic psychometric literature offers several potential directions for advancing the estimation of change on PROs, but these have generally gone unused in health research and clinical trials. Lord offered regression estimators of a true difference on a measure over two timepoints [6-8].
Regression provides a framework for discerning true change from error. In classical test theory (CTT), any score, including a change score, comprises a true element and error (e.g., measurement error). Approaches that can distinguish true change from error are likely superior to the primitive difference between scores at two timepoints. An additional element of more advanced approaches to estimating change involves predicting post-test scores and determining how much the observed post-test value deviates from the prediction [4,9]. Notably, Lord's estimator incorporates this element of deviation from the predicted post-test value, as well as another key element, the correlation between pre- and post-test. Additionally, Cronbach and Furby extended Lord's estimator by accommodating additional variables that may improve estimation [8]. These additional variables may be measured at the pre- or post-test and can be alternative, potentially "gold standard," measures of the construct. Cronbach and Furby refer to this as complete estimation. Both innovations address the poor reliability of raw change scores directly by incorporating information other than the baseline scores in the calculation of the change. Ultimately, because regression methods account for measurement error, a difference score estimated by such a method can be interpreted as a true difference when it deviates from zero. The current paper explores methods that are appropriate for quantifying disattenuated changes in scores.
In this paper, we demonstrate how these regression-based approaches can be used to incorporate measurement error into estimates of change scores. Utilizing these regression-based approaches extends the CTT methods that are ubiquitous in PRO research. We provide a simulation to demonstrate the performance of the regression-based estimates, followed by an example that applies these methods to an applied dataset. The applied example represents a common PRO administration paradigm in clinical research with only two assessment times. Because there are only two administrations, one of which often comes after a treatment, there is limited information with which to characterize the measurement error.

Accounting for individual-level changes with limited information
Traditional treatments within CTT model observed scores as a decomposition of true score and error, as in Eq. 1:

X = τ_X + e_X (1)

with errors (e) being independent. However, to account for repeated measures, i.e., X and the subsequent measurement Y, we need to update Eq. 1 as in Eq. 2:

X = τ_X + s + e_X (2)

where s represents a random effect attributable to individuals' repeated measurements [8]. The decomposition in Eq. 2 for X also applies to Y. The correlations between measurements can be characterized, based on these two models, as unlinked, i.e., based on Eq. 1, or linked, based on Eq. 2. The unlinked correlation is represented in Eq. 3:

r_XY = σ_{τ_X τ_Y} / (σ_X σ_Y) (3)

and the linked version in Eq. 4:

ρ_XY = (σ_{τ_X τ_Y} + σ_s²) / (σ_X σ_Y) (4)

Thus, the linkage between X and Y is taken into account with ρ_XY, which represents a within-individual correlation.
These linked and unlinked correlations can be used to assess the reliability of a change score [8,9]. As both X and Y potentially have both independent and dependent measurement error components, the reliability of the difference between them should take these errors into account. Equation 5 shows how this relationship is calculated for the linked case:

ρ_DD' = (σ_X² ρ_XX' + σ_Y² ρ_YY' - 2 σ_X σ_Y ρ_XY) / (σ_X² + σ_Y² - 2 σ_X σ_Y r_XY) (5)
Note that the linked case was an extension posited by Cronbach and Furby [8] to the Lord and Novick [9] presentation, in which ρ_XY is replaced by r_XY.
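As an illustrative sketch (not the authors' code), Eq. 5 can be computed directly. The function name and example values below are our assumptions; the example uses reliabilities and correlations of the same magnitude as those reported later for the PROMIS® data.

```python
def diff_reliability(sd_x, sd_y, rel_x, rel_y, r_xy, rho_xy=None):
    """Reliability of the difference between X and Y under CTT (Eq. 5).

    rel_x, rel_y : reliabilities of X and Y
    r_xy         : observed (unlinked) correlation between X and Y
    rho_xy       : linked (within-individual) correlation; defaults to r_xy,
                   which recovers the Lord-Novick (unlinked) form.
    """
    if rho_xy is None:
        rho_xy = r_xy
    num = sd_x**2 * rel_x + sd_y**2 * rel_y - 2 * sd_x * sd_y * rho_xy
    den = sd_x**2 + sd_y**2 - 2 * sd_x * sd_y * r_xy
    return num / den

# Illustrative values (equal SDs assumed for simplicity)
rel_d = diff_reliability(sd_x=8.0, sd_y=8.0, rel_x=0.924, rel_y=0.935,
                         r_xy=0.834, rho_xy=0.828)
```

Note how high reliabilities at each occasion combined with a high between-occasion correlation still yield a modest difference-score reliability, which is the core motivation for the regression estimators below.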

Regression-based estimators for individual change scores
To estimate change scores while taking measurement error and individual differences into account, one approach is to treat the problem as a regression estimation. The regression estimator described by Lord and Novick [9] and extended by Cronbach and Furby [8] can be rearranged to look like a regression equation as [10]:

D̂_LN = β_X X + β_Y Y + C

where X is a score at time 1, Y is a score at time 2, r_XY is the unlinked correlation between times 1 and 2, and ρ_XY is the linked version of that correlation; the coefficients β_X and β_Y are functions of the reliabilities, correlations, and standard deviations of X and Y, and the constant C aligns the mean of the estimator with that of the raw difference. Note that Lord and Novick did not make the linked versus unlinked distinction, i.e., they did not use a linked correlation from within subjects in their formulation, but it has been added to their formula here. This equation is based on the idea that measurement error is inherent in each measurement occasion, i.e., in X and Y, and so some regression to the mean for the sample is expected. The expressions that include the reliabilities, correlations, and standard deviations of X and Y are treated as regression coefficients to account for the measurement error in each administration of X and Y.
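To make the regression logic concrete, the following sketch (our illustration, not the paper's implementation) derives the coefficients from the CTT-implied covariances between the true difference and the observed scores. The function name and the inputs `rel_x`, `rel_y`, and the optional linked correlation `rho_xy` are assumptions.

```python
import numpy as np

def ln_change(x, y, rel_x, rel_y, rho_xy=None):
    """Sketch of a Lord-Novick-style regression estimate of true change (y - x).

    Under CTT with independent errors, the covariances between the true
    difference D and the observed scores are:
      Cov(D, X) = rho * sd_x * sd_y - rel_x * sd_x**2
      Cov(D, Y) = rel_y * sd_y**2 - rho * sd_x * sd_y
    Solving the normal equations gives the regression weights; the constant
    aligns the estimator's mean with the mean raw difference.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    sd_x, sd_y = x.std(ddof=1), y.std(ddof=1)
    r_xy = np.corrcoef(x, y)[0, 1]
    rho = r_xy if rho_xy is None else rho_xy  # linked correlation, if available
    cov_obs = np.array([[sd_x**2, r_xy * sd_x * sd_y],
                        [r_xy * sd_x * sd_y, sd_y**2]])
    cov_d = np.array([rho * sd_x * sd_y - rel_x * sd_x**2,
                      rel_y * sd_y**2 - rho * sd_x * sd_y])
    beta = np.linalg.solve(cov_obs, cov_d)
    return beta[0] * (x - x.mean()) + beta[1] * (y - y.mean()) + (y.mean() - x.mean())
```

Because the coefficients shrink person-level deviations toward the sample mean in proportion to unreliability, the resulting change scores show less dispersion than the raw differences.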
A further extension of this estimator was also developed by Cronbach and Furby [8] and is described as the complete estimator. It builds on the prior work of Lord and Novick [9] by also adjusting for covariates, thereby incorporating validating information into the change score estimation. The idea is to take the basic notion of accounting for measurement error in the target scores and add relevant information from other measurements to form a more precise estimate of change. The estimator takes the form:

D̂_CF = β_X X + β_Y Y + Σ_i β_{W_i} W_i* + Σ_j β_{Z_j} Z_j* + C

where the coefficients for X and Y and the constant C are defined as above, and the W_i* and Z_j* terms represent residual scores, i.e., partial variates, of the other terms. The covariates are additional information about each respondent, with W_i being variables collected concurrently with X and Z_j similarly for Y. The covariates are entered into the estimate as partial variates, i.e., residual scores from regressing W_i and Z_j onto the other variables considered. Thus, to the extent relevant covariates are included, the approach adds unique information to the model used to estimate a change score. The terms are partial variates to ensure that the information added by W_i and Z_j is not redundant with, or confounding to, the estimation of D̂_CF. That is, the terms for W_i and Z_j only account for unique information in estimating D̂_CF above and beyond that in X and Y. Therefore, given limited information on individuals' measurement error, we can improve score estimation by using between-individual information about the measures. The scores generated from the target instrument are adjusted according to its reliability, and, based on what is known about parallel instruments, additional information can be employed to improve estimation of true scores.
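The partialing step can be illustrated as follows. This is a minimal sketch of the residualization idea only, assuming ordinary least squares with an intercept; the function name is our assumption, not the authors' code.

```python
import numpy as np

def partial_variate(w, x, y):
    """Residual of covariate w after regressing it on x and y (with intercept).

    The residuals carry only the information in w that is not already
    contained in the pre- and post-scores, matching the 'partial variate'
    idea in the Cronbach-Furby complete estimator.
    """
    w, x, y = (np.asarray(v, float) for v in (w, x, y))
    X = np.column_stack([np.ones_like(x), x, y])
    coef, *_ = np.linalg.lstsq(X, w, rcond=None)
    return w - X @ coef
```

By construction the residuals are uncorrelated with X and Y, so a covariate that is largely redundant with the target scores contributes almost nothing, while a high-quality anchor contributes its unique information.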

Methods
The main objective of the current study is to compare the recovery of real population differences in individuals' scores across two time points using Lord and Novick's regression estimator and Cronbach and Furby's complete estimator. These two regression-based estimators for calculating individual change scores are compared against the calculation of raw differences. First, we present a simulation study as a proof of concept and to demonstrate the performance of the regression-based estimators of individual differences in a context where the true population is known. Next, the applicability of these methods is demonstrated using data from the PROMIS® 1 Wave 2 Depression and Pain validation study. The methods presented herein can apply to any number of scoring algorithms used to generate scores cross-sectionally and applied in the analysis of change over time.

Simulation study
Simulated item response data were generated for 200 respondents. A total of 21 items were simulated with 4 response categories. Factor loadings (λ) for the first 20 items ranged from 0.3 to 0.9 to mimic what is typically encountered in validation studies of patient-reported outcome (PRO) measures. Item parameters are presented in Supplementary Table 1. The 21st item was to represent a patient global impression of severity (PGIS) item. Conceptually, the PGIS item should be a near perfect single-item measure of the construct. Therefore, item 21 was hard coded with a loading of 0.9. Values of θ were simulated at the two time points. θ at the first time point (t0) was distributed θ0 ∼ MVN(0, Σ), with Σ a covariance matrix of 1. The second time point (t1) applied a decrease to the distribution of θ at t0, with θ1 = θ0 + N(−0.6, 1). Item response data were then simulated for t0 and t1 using the graded response model (GRM) parameterization with the common population parameters. Item responses were generated using the simdata() function in the R package mirt, and the code for this simulation is supplied in the supplementary materials.
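The paper's simulation used the simdata() function from the R package mirt; a rough, self-contained analogue of the generating step can be sketched in Python. Parameter values here are illustrative, not those of Supplementary Table 1.

```python
import numpy as np

def simulate_grm(theta, a, b):
    """Simulate graded response model (GRM) item responses.

    theta : (n,) latent traits
    a     : (m,) item discriminations
    b     : (m, k-1) ordered thresholds; responses fall in 0..k-1
    """
    rng = np.random.default_rng(1)
    # Cumulative probabilities P(response >= category c) via 2PL curves
    p_ge = 1 / (1 + np.exp(-a[None, :, None] * (theta[:, None, None] - b[None, :, :])))
    u = rng.uniform(size=(len(theta), len(a), 1))
    # Count how many cumulative thresholds the uniform draw clears
    return (u < p_ge).sum(axis=2)

rng = np.random.default_rng(0)
theta0 = rng.normal(size=200)                       # baseline theta
theta1 = theta0 + rng.normal(-0.6, 1.0, size=200)   # follow-up with mean decrease
a = rng.uniform(0.8, 2.0, size=21)                  # illustrative discriminations
b = np.sort(rng.normal(size=(21, 3)), axis=1)       # 4 response categories
resp0 = simulate_grm(theta0, a, b)
resp1 = simulate_grm(theta1, a, b)
```

Using a single uniform draw per person-item against the ordered cumulative probabilities is a standard inverse-CDF way to sample ordinal GRM responses.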
Next, four scores were generated from the simulated response data at t0 and t1, respectively: (1) a sum score based on the first 20 items (SS), (2) expected a posteriori (EAP) scores on the Z-score metric, (3) T-transformations of the EAP scores (TS), and (4) the true score (τ), also on the Z-score metric. The sum score was simply the sum of the responses to the first 20 items within t0 and t1. EAP scores at t0 were generated by fitting a GRM to the responses to the first 20 items at t0. EAP scores at t1 were created by carrying forward the item parameters from the fitted model at t0 to the GRM fit to the responses at t1 and freeing the mean and variance of θ1 to maintain measurement invariance. EAP scores were then computed for the t1 responses. T-score transformations of the EAP scores were created by applying the basic transformation to the EAP scores within each time point, i.e., θ ∼ N(0, 1) → T ∼ N(50, 10). A final true score (τ) was estimated by applying the population parameters used to create the data to a GRM model fit to all 21 items simulated at each time point and freeing the mean and variance. This set of parameters was then used to generate the EAP scores for the observed responses at t0 and t1.
Once scores were determined, significant change was assessed using the standard error of measurement (SEM) for the raw change scores and the standard error of estimation (SEE) for the LN and CF estimators [10]. These standard errors were then used to generate a lower 95% confidence limit in each data set [9,10]. Difference scores smaller than the lower 95% confidence limit would, therefore, be considered a meaningful improvement.
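Under the usual CTT formulas, SEM = SD·sqrt(1 − r) and SEE = SD·sqrt(r(1 − r)). A minimal sketch of the corresponding lower-limit calculation follows; the function names and the z = 1.96 multiplier are our assumptions.

```python
import math

def sem(sd, rel):
    """Standard error of measurement, used with raw change scores."""
    return sd * math.sqrt(1 - rel)

def see(sd, rel):
    """Standard error of estimation, used with regression-based estimates."""
    return sd * math.sqrt(rel * (1 - rel))

def lower_limit(sd, rel, use_see=False, z=1.96):
    """Lower 95% limit; change scores below it flag meaningful improvement."""
    se = see(sd, rel) if use_see else sem(sd, rel)
    return -z * se
```

Since SEE = SEM·sqrt(r), the SEE-based limit is always the less extreme of the two, which is why the regression estimators can detect smaller changes.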

Application to PROMIS® Pain Interference
The NIH Patient-Reported Outcomes Measurement Information System (PROMIS®) is an innovative set of PROs covering multiple domains in physical, mental, and social health that leverages item response theory (IRT) [11]. Its basis in IRT allows for the construction of large item banks that represent the construct that can be tapped to implement the PRO in various ways, including computer adaptive tests (CAT). Though IRT provides a framework for understanding the performance of individual items, it also generates highly-reliable scores, especially under CAT implementation, where the most informative items from an item bank are selected in sequence until a pre-specified reliability threshold is reached [12]. These characteristics make PROMIS® a good resource to explore methods to estimate individual change.
We sourced data from the PROMIS® 1 Wave 2 Depression and Pain validation study (Protocol 07-04) to further compare the raw and regression estimators [13]. This was a prospective longitudinal study aiming to test the validity of the PROMIS Depression and Pain item banks in a "real world" setting. Among PROMIS instruments administered, the PROMIS Pain Interference adult item bank v1.1 was administered by CAT at a baseline timepoint, then again at one-and three-months post-baseline. Eligible patients had a diagnosis of low back pain with or without sciatica for at least 6 weeks and were scheduled for any kind of spinal injection for pain management (e.g., steroid injection). Injections provide short term pain relief and take effect within a matter of weeks. Therefore, we expected some decrease in participants' PROMIS Pain Interference scores, reflecting a reduction in pain interference. The PROMIS Pain Interference adult item bank was developed during PROMIS Wave 1 and contains 40 items in total focusing on the consequences of pain in the patient's life, including impacts on social, cognitive, emotional, physical, and recreational activities. All items are universal (i.e., not focused on a particular clinical population or health condition) [14]. Per PROMIS standards, an IRT score (expected a posteriori) is transformed to a T-score with a population mean of 50 and standard deviation of 10, and higher scores indicate greater pain interference.
We analyzed PROMIS pain interference T-scores for 159 patients at baseline and one-month post-baseline. As the pain interference scores were based on a CAT system, no sum scores were analyzed. Brief Pain Inventory (BPI) empirical reliability based on the individual standard errors was used for the reliabilities at each time point in the calculation of the LN and CF estimators [15]. Direct comparisons between these estimators were made, and both were compared to assessments of change based on the IRT-determined standard error of individuals' T-scores, i.e., T ∼ N(50, 10).

Methods for the comparison of scores
To meet the main objective of this study, several change scores (i.e., t0 − t1) were calculated using the sum scores, T-scores, and true scores (τ). Specifically, for the simulated data, the raw change scores and the two regression-based scores (i.e., LN and CF) were calculated. Additionally, since the objective of the analyses was to compare the performance of these different methods for calculating change scores, all were placed on a common Z-scale metric. Of main interest were the two regression-based scores, as they adjust for measurement error in calculating individual change scores.
Using the simulated data, change scores and regression estimators were calculated for both SS and TS, totaling six scores for comparison to Δτ, i.e., the true score change. Standardized effect sizes were used to characterize the recovery of Δτ. To calculate these effect size differences, the absolute value of the difference between the z-transformed change scores and the z-transformed Δτ was taken. These standardized differences between the calculated change score and Δτ can be conceptualized as effect sizes (d), with d = 0.2 considered small, d = 0.5 medium, and d = 0.8 large [20]. Probability density function (PDF) curves for the standardized differences in the change scores were used to compare the recovery of Δτ, as were descriptive statistics.
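A minimal sketch of this deviation-and-classification step (our illustration; function names and dictionary labels are assumptions) is:

```python
import numpy as np

def deviation_effect_sizes(change_est, true_change):
    """Absolute standardized deviation of an estimator from true change."""
    z = lambda v: (v - v.mean()) / v.std(ddof=1)
    return np.abs(z(np.asarray(change_est, float)) - z(np.asarray(true_change, float)))

def classify(d):
    """Share of deviations at or above Cohen's small/medium/large thresholds."""
    d = np.asarray(d)
    return {"small(>=0.2)": np.mean(d >= 0.2),
            "medium(>=0.5)": np.mean(d >= 0.5),
            "large(>=0.8)": np.mean(d >= 0.8)}
```

An estimator that recovers Δτ well will concentrate its deviations below the d = 0.2 threshold, with few or no values reaching the large range.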
For the PROMIS® data, there is no Δτ and thus no comparison to a true score is possible. In addition, since the data were collected in a CAT format, patients had different numbers of items answered. Thus, T-scores from the PROMIS® validation and the individual-level standard errors from the CAT scores were used for assessment of these data.
Significant change was then assessed similarly to the simulated results by comparing computed lower limits of 95% CI's as well as comparison to the IRT SE values from the PROMIS® scoring.

Simulated data
Details of the simulation are supplied in the supplementary materials, and additional descriptive statistics for the simulation can be provided upon request. Descriptive statistics of ΔPGIS, Δτ, and the change score estimators appear in Table 1.
Although the scores are on different scales, making some direct comparisons difficult, both TS and SS showed a similar pattern for the change score estimates. Across estimators, including more information in the form of reliability and/or covariates reduced variability in the estimates, as evidenced by smaller standard deviations for the LN and CF estimates. The impact was more pronounced for the SS than the TS, with the LN and CF standard deviations approximately half those of the raw change scores.
To compare estimates on a comparable scale, deviations from Δτ on the Z-score scale were computed, expressed as a standardized effect size, d, and are shown in Table 2.
Among TS estimates, remarkably similar results were observed, with deviations largest for the CF estimates. The small differences between the raw and LN estimates are more pronounced for the SS scores, with 4.5% of raw changes in SS reaching large effect sizes versus only 1% of LN estimates. For the SS, a similar overall pattern was observed, with CF estimates showing higher levels of deviation from true scores than for the TS scores. Distributions of the deviations are displayed in Fig. 1.
Using these estimates to determine a generalized limit for improvement, thresholds for significant individual improvement are presented in Table 3.
For the T-scores, the LN estimates showed the best results, with the smallest confidence limit of −3.49; the CF and raw change scores showed increasingly larger limits. For the SS estimates, the LN estimates also showed the best precision, with the smallest confidence limit, such that an improvement of just over three points indicates a significant improvement. In comparing T-scores to sum scores, T-scores outperformed sum scores when looking at the raw differences, but for the regression methods, results were remarkably similar between the two scoring schemes.

PROMIS® Pain interference
Details of the PROMIS® data analysis can be provided upon request. Descriptive statistics for the PROMIS® scores used for the current analyses are in Table 4. Reliability estimates for Administration 1 and 2 Pain Interference T-scores were computed using the individual T-score SE values, with r11' = .924 and r22' = .935 for each administration, respectively. Linked and unlinked correlations between administrations were calculated and found to be ρ12 = .828 and r12 = .834. These calculations then fed into the assessment of the reliability of the change in Pain Interference T-scores, with rDD' = .594.
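Empirical reliability from individual score SEs can be computed under one common IRT convention, the ratio of score variance to score variance plus the mean squared SE. This sketch is our illustration, not the study's code, and the function name is an assumption.

```python
import numpy as np

def empirical_reliability(scores, se):
    """Empirical reliability of IRT scores from individual standard errors.

    Uses the convention rel = var(scores) / (var(scores) + mean(se^2)),
    appropriate for EAP-type scores with posterior standard deviations.
    """
    scores, se = np.asarray(scores, float), np.asarray(se, float)
    var_obs = scores.var(ddof=1)
    return var_obs / (var_obs + np.mean(se**2))
```

With T-scores near their population SD of 10 and typical CAT SEs in the 2-3 point range, this convention yields reliabilities in the low-to-mid .90s, consistent with the values reported above.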
The reliabilities and correlations were then employed in calculating the LN and CF estimates, also summarized in Table 4. The LN and CF estimates showed notably smaller standard deviations and less extreme minima and maxima, as expected with regression estimators.
Comparison of the raw and regression estimates, as in the simulated analysis, was done on a Z-score scale, and Fig. 2 shows the density plots of those estimates. All three estimates are very close to one another in distribution, indicating that the one with the smallest dispersion on its natural scale is likely the strongest estimator. Table 5 displays the computed standard errors. To ground these estimates, descriptive statistics of the individual-level T-score standard errors were also computed, as this was the natural error measurement for the CAT scores in the dataset.
The lower confidence limit values for each estimator indicate that the LN estimates again have the most precision  and are closest to the Individual SE summaries, which take the most information into account. Of note, both regression methods yielded confidence limits that were within the maximum value of the Individual T-score SE of 5.61.

Discussion
The current paper's objective was to illustrate how the use of additional information can enhance the assessment of raw change scores, especially for individuals. We presented methodologies that, although seasoned, are not commonly used in PRO measure applications. The LN and CF estimators both incorporate measurement error by taking into account the reliabilities of the measurements at two occasions as well as the reliability of the difference between those occasions. The CF estimator also adds information in the form of one or more covariates at one or both time points. While the current methodologies do incorporate both within-individual and between-individual levels of information, we have proposed them as a utility for practitioners in cases where individual-level information may be limited. As the measurement of latent states within individuals manifests as the interaction between individuals' responses and instruments' measurement properties, the methods attempt to quantify both sources of information and combine them to increase the precision of measurement. We presented these methods to be consistent with CTT methods commonly used in PRO work.
Regarding the simulated dataset, within score type the pattern was clear: the regression estimators reduced dispersion, with smaller SD values and minima and maxima closer to the mean in both the LN and CF cases. Attenuation of dispersion was apparent for the T-scores, but noticeably more so for the sum scores, when considering the regression approaches. Whereas one could argue that the differences among the T-score estimates were small, the sum score comparison showed that the SD values for the regression estimates reduced the raw difference SD by close to half. Comparisons of deviances from τ values on the Z-scale, and as d-values, also showed a similar pattern, although the CF estimator showed a relatively high percentage of individual deviances at a medium or high d-value for both T- and sum scores. This result warrants further investigation but may be related to the use of a single-item PGIS, as the coarse grading of this variable compared to the continuous scores could affect the estimator. Additionally, it is worth noting that the current study is not a full simulation study of estimator behavior, but an example intended to initiate further inquiry.
Specifically, with regard to the sum scores, the Z-score transformations showed poorer, i.e., larger, deviances from the τ values in general, with Raw and CF showing 4.5% and 8.5% of d-values at 0.8 or higher, respectively; for the LN estimator it was only 1%. While further investigation is warranted, these results suggest that, especially for sum scores, the LN estimator performs best.
Comparisons of PROMIS® Pain Interference scores also showed attenuated dispersion for the regression estimators, with the LN estimator showing the smallest values, indicating better precision. The right panel of Fig. 2 underscores this point, as plots of the estimators on the Z-scale reveal very small distributional differences; therefore, all things equal, the LN would be the estimator of choice given its smallest dispersion.
Using the individual SE values from the CAT T-scores as a basis for further comparing the performance of the estimators suggested the regression estimators had SE values closest to the mean SE of the CAT-determined errors. Further, when the lower 95% confidence limits for the regression estimators are computed, they are remarkably close to the maximum value of the Individual SE. This would suggest that both SE-based limits for individual improvement would, in the current case, classify improvement for 95% of those who would exceed a value of improvement based on their individual SE value, i.e., 1.98 × SE. Further, when comparing observed values, one should use the SEM value, while the regression estimators' equivalents should use the SEE values [10].
Some of the limitations of the current methods could be addressed by modeling with IRT methodologies, specifically, assessing reliability and information as a function of the latent variable rather than as a single point estimate. Of note, the regression approaches presented are not new, as their mathematical underpinnings were developed in the 1950s to 1970s. However, we believe they offer value to those incorporating PRO scores into research designs in which administration of a PRO is limited in frequency, methods used to score and evaluate the performance of the PRO are most commonly drawn from CTT, and methods used to assess efficacy often rely on the simple change score. An extension of the methods here that would strengthen the individual-centric idea could be to employ IRT modeling. As IRT models conceive of reliability as an information function that varies along the latent variable continuum, I(θ), respondents' scores could be evaluated according to a more precise set of information, i.e., taking into account item-level information that is not marginalized into a summary score. We think IRT models within a longitudinal framework, anchoring θ for the first administration and assessing individual changes according to empirical standard errors of the computed scores, is an approach with potential in PRO research. Further investigation of such methods and the parameterization of such models should, we believe, be the subject of future research.
While it is best, if available, to use the tools from IRT analyses to determine individual changes in latent states rather than simply looking at raw score values, we feel that the regression estimators described here present a compromise that is available in many situations, including those in which a legacy instrument has been validated with a CTT set of techniques (e.g., sum scores).

Conclusions
Our results suggest that calculating change using regression estimates may result in increased measurement sensitivity. Both regression estimators incorporate information other than baseline scores, such as measurement error and the correlation between scores at different time points, into the estimation of a change. Using these scores in lieu of raw differences can help better identify individuals who experience real underlying change in PROs in the course of a trial, and can enhance the established methods for identifying thresholds for meaningful within-patient change in PROs, since significant change is observed directly for individuals whose change score, calculated using the LN and CF estimators, deviates from zero. Further, the use of regression estimators for change may result in increased power to detect change in trials.
The choice of whether to use the LN or CF estimator is dependent on the situation at hand. The important point to consider is that both utilize regression to the mean with the reliability of an instrument as the regression weight, which incorporates what is known about measurement error into the estimate. The CF, or "complete," estimator could in theory improve precision for individual measurement over the LN, but this is almost assuredly dependent upon the quality of the partial terms added to enhance this precision. In the CF example above, a high-quality anchor item is used to create a partial term. Therefore, the estimation of change at the individual level subsumes the useful information from the anchor. To the extent that the information from this anchor item is capable of sorting individuals into those who have experienced a meaningful change, it is done directly at the individual level without the need to reclassify all individuals post hoc based on statistics produced from a sub-group in the data (e.g., creating a cut for scores to identify meaningful change for all participants based on the median from the 1-point improvement group on the PGIS). The regression-based incorporation of the anchor in the CF estimator also takes the onus off the analyst in determining whether an anchor is of sufficient quality, as the term for the partial from a poor anchor item would take on less weight in sorting change scores at the individual level. Coupled with distribution-based indices, such as the SEE, a hypothetical reliable difference at the group level might be identified in a trial using either the LN or CF regression-based change scores by identifying a group-level difference that exceeded the lower 95% confidence interval for the SEE. For a difference test that used the CF estimator to generate the change scores, the lower 95% SEE for the distribution of scores would also incorporate information from the anchor directly.
Of note, the CTT estimators we have explored here still contain an element of marginalization with regard to the information contained in PRO item responses. The EAP and other scoring methods employed for scoring IRT models allow individual-level errors to be calculated based on the respondents' levels on the latent variable being measured, i.e., θ. While we advocate using IRT when appropriate, we also think the regression estimators presented here represent a better and more accessible alternative to raw change scores for determining individual improvement or worsening.
Author contributions All authors contributed to the conceptualization, drafting, and review of the manuscript. DAA: conducted the analyses. JDP: supplied the PROMIS® dataset. All authors approved the final manuscript.
Funding The current project did not have explicit extramural funding sources. All authors are employees of their respective institutions.

Conflict of interest
The authors have no competing interests to declare.
Ethical approval and consent to participate Not applicable.
Consent for publication Not applicable.