The significance of group-level change is evaluated to assess treatment efficacy and effectiveness. In addition, group-level minimally important change (MIC) thresholds are used because trivial mean change can be statistically significant if the sample size is large enough. The MIC indicates whether statistically significant group mean differences are large enough to be important or meaningful to patients and clinicians. Identifying those who improve (“responders” to treatment) provides important supplemental information beyond group-level change. This paper reviews approaches for assessing the MIC and for estimating responders to treatment. We note that while group-level MIC thresholds have been used to identify responders to treatment in HRQOL studies [1,2], other approaches are more appropriate.
Estimating the MIC
MIC estimates rely on anchors to provide an external indication of the level of underlying change. The variety of possible anchors makes a single MIC estimate problematic, and it is advisable to use multiple anchors whenever possible. The most commonly used anchor is a retrospective rating of change question such as:
How is your health now compared to 6 weeks ago?
Much better
A little better
About the same
A little worse
Much worse
This example item refers to change in “health.” Depending on the context and the measure being evaluated, the anchor might be worded more specifically, such as “physical functioning,” “pain,” “getting along with family,” etc. The choice of wording is likely to result in different MIC estimates. In addition, there are known limitations of retrospective ratings of change, including a tendency to reflect the patient’s current state more than change, potentially due to recall bias [3,4].
Change on the target measure should be correlated with, and have a monotonic association with, change indicated on the anchor. Mean change on the target measure should be larger for the subgroup of people who report they are much better on the anchor than for the other subgroups, and those who report no change on the anchor should have no more than minimal change on the target measure [5]. The mean group change on a HRQOL measure for those who report being “a little better” (improvement) or “a little worse” (decrement) is the basis for MIC estimates. But sometimes investigators fail to limit the MIC estimate to those who changed a little and instead include everyone reporting any change on the retrospective rating of change item. This was the case in a sample of 123 adult spinal surgery patients [6] and in a study of 223 patients with chronic low back pain [7]. Including all those who changed, rather than focusing on those with minimal but important change, led to MIC estimates that were too large.
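To make the anchor-based calculation concrete, the sketch below (not code from the cited studies) computes mean change on a target measure within each anchor category; the MIC for improvement and for deterioration is taken as the mean change among those reporting “a little better” or “a little worse,” respectively. The DataFrame and column names are hypothetical.

```python
# Hypothetical sketch of anchor-based MIC estimation; assumes a pandas
# DataFrame `df` with columns `change` (follow-up minus baseline on the
# target measure) and `anchor` (retrospective rating of change category).
import pandas as pd

def anchor_based_mic(df: pd.DataFrame) -> dict:
    means = df.groupby("anchor")["change"].mean()
    return {
        # MIC estimates come only from the minimally changed groups
        "mic_improvement": means.get("A little better"),
        "mic_deterioration": means.get("A little worse"),
        # The "about the same" group should show no more than minimal change
        "no_change_group_mean": means.get("About the same"),
        # Inspect all category means to confirm a monotonic association
        "all_category_means": means.to_dict(),
    }
```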
Identifying Responders to Treatment
Individual-level variation and change can be estimated using simulation modeling for time series data, but this requires a minimum of 10 observations in the data stream [8]. Similarly, Moinpour et al. [9] estimated mixed effects models and noted that the PROMIS fatigue computer adaptive test would need 15 total assessments to obtain 0.90 reliability of change. Because of limits on research budgets and concerns about respondent burden, nearly all longitudinal HRQOL studies are limited to a few waves of assessment (e.g., two time points). Guidance for identifying responders to treatment in this environment is needed. Hence, we review approaches for estimating individual change from baseline to a single post-baseline assessment.
Table 1 lists several formulae previously proposed for estimating the significance of individual change that are analogous to between-group t-tests [10,11]. All the formulae include individual change in the numerator and error in the denominator. The methods differ in how they estimate that error term (e.g., the time 1 standard error of measurement versus the standard error of the difference or of prediction; see Table 1).
![](https://myfiles.space/user_files/58677_ec8811c6b4185256/58677_custom_files/img1613667565.png)
Following the conventional p < .05 threshold for group-level research, responders are usually defined by an RCI of 1.96 or larger. A variant of the RCI used for cognitive measures corrects for practice effects [12], though caution has been raised about use of this particular RCI variant [13]. The denominator of the RCI for item response theory (IRT) calibrated measures uses the IRT standard errors at time 1 and time 2 [14]. The coefficient of repeatability indicates the amount of change necessary to be significant on the RCI and is, therefore, equivalent to it. This coefficient is also known as the minimally detectable change, the smallest real difference, and the smallest detectable change [15].
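As a concrete illustration (a minimal sketch, not the exact formulae in Table 1), the code below computes the classic RCI and the coefficient of repeatability from a time 1 SD and a reliability estimate; the illustrative values are the approximate AFEQT figures discussed later in this paper.

```python
# Minimal sketch of the reliable change index (RCI) and coefficient of
# repeatability computed from the time 1 SD and a reliability estimate.
import math

def sem(sd_time1: float, reliability: float) -> float:
    """Standard error of measurement at time 1."""
    return sd_time1 * math.sqrt(1.0 - reliability)

def reliable_change_index(x1: float, x2: float, sd_time1: float, reliability: float) -> float:
    """Individual change divided by the standard error of the difference."""
    se_diff = math.sqrt(2.0) * sem(sd_time1, reliability)
    return (x2 - x1) / se_diff

def coefficient_of_repeatability(sd_time1: float, reliability: float, z: float = 1.96) -> float:
    """Smallest change reaching |RCI| >= z (a.k.a. MDC, SDC, smallest real difference)."""
    return z * math.sqrt(2.0) * sem(sd_time1, reliability)

# Illustration with approximate AFEQT values discussed below (SD ~17.5,
# reliability 0.90): the coefficient of repeatability is ~15 points.
print(round(coefficient_of_repeatability(17.5, 0.90), 1))  # ~15.3
```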
Variations to these methods have been proposed to account for regression to the mean (see Table 1). Regression-based approaches compare observed scores at time 2 with regression predicted scores based on time 1 score and other time 1 variables. This can be useful clinically because time 2 status is compared to what would be expected based on time 1 characteristics.
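A hedged sketch of one regression-based variant follows; it assumes a simple linear regression of time 2 scores on time 1 scores only, and standardizes the residuals as a simplified stand-in for the formal standard error of prediction used in published formulae.

```python
# Simplified regression-based reliable change: compare observed time 2
# scores with scores predicted from time 1, then standardize the residuals.
import numpy as np

def regression_based_change(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    slope, intercept = np.polyfit(x1, x2, deg=1)
    predicted = intercept + slope * x1
    residuals = x2 - predicted
    # |z| >= 1.96 flags change beyond what time 1 status would predict
    return residuals / residuals.std(ddof=2)  # ddof=2: two estimated parameters
```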
MIC Thresholds Should Not Be Used to Identify Responders to Treatment
There are two major problems with applying group-based MIC methods to categorize individual patients as having changed or not: one conceptual and one statistical. The conceptual issue concerns using averages derived from groups that may not be relevant to any one patient. MIC estimates are averages of individual-level MICs, implying a distribution of individual MICs; small changes may be meaningful for some and large changes for others [16,17]. Even if such MIC estimates are derived from patient-reported anchors representing the construct of interest, these averages may not represent change that is meaningful to individuals. For example, an individual patient who would consider only a large improvement in physical function to be meaningful is likely not interested in achieving the average improvement, since the average value falls below that individual’s perception of meaningful change. The statistical issue is that group-based MIC methods drastically underestimate the amount of change needed to be significant at the individual level because of the large measurement error around individual change scores [18]. “Any inspection of measured data reveals an order of magnitude difference between the variability in group versus individual changes” [19]. Thus, group-based MIC estimates will often be indistinguishable from individual score error [20].
Abu et al. [21] is a recent example of using MIC thresholds to identify whether patients improved or declined on the Atrial Fibrillation Effect on QualiTy-of-Life (AFEQT) Questionnaire. A five-point change was used as the threshold for “clinically meaningful change.” This threshold was based on group-level MIC estimates from a prior study of the AFEQT MIC that used physician assessment of functional status [22]. The authors concluded that 22% declined and 40% improved from baseline to 1 year later in a sample of 1097 older adults with atrial fibrillation. Table 2 shows the standard deviations, internal consistency reliabilities, and coefficients of repeatability for the four AFEQT scores we computed. The coefficients of repeatability are two to three times larger than the 5-point change threshold the authors used. Ironically, Abu et al. could have adopted the more appropriate SDC estimates (equal to the coefficient of repeatability) reported by Spertus et al. [22].
It is clear that Abu et al.’s [21] paper is among the cases in which the MIC, derived from group-based estimates, falls well below the coefficient of repeatability. When this is the case, Kemmler et al. [20] suggest raising the MIC threshold to the coefficient of repeatability. Terwee et al. [15] recommend examining how measurement error might be reduced: 1) increasing the homogeneity of the study sample’s scores at the first measurement time point, thereby reducing the SD; and/or 2) increasing the reliability of the measure. Both options are difficult when the required SD reduction or reliability increase is substantial.
Using the Abu et al. [21] example, we calculated and plotted the SDs needed at 0.90, 0.95, and 0.99 reliability on the AFEQT. Figure 1 takes the approximate SD (~17.5) and coefficient of repeatability (~15) observed for the AFEQT overall scale at 0.90 reliability as a starting point and then examines scenarios in which the reliability is increased or the SD is decreased. As seen in the plot, at 0.90 reliability the SD must drop to about 5 for the coefficient of repeatability to equal the MIC. If the reliability were 0.99, SDs under 17.5 would result in a coefficient of repeatability at or below the MIC. This example demonstrates the conditions required for an instrument’s coefficient of repeatability to equal its MIC; many instruments will not achieve such low SDs or high reliabilities under any circumstances.
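The arithmetic behind Figure 1 is a simple rearrangement of the coefficient-of-repeatability formula; the sketch below reproduces it under the assumptions stated above (MIC = 5, z = 1.96), so the printed values are approximations rather than exact reproductions of the plotted curves.

```python
# How small must the SD be, at a given reliability, for the coefficient of
# repeatability to shrink to the MIC? Rearranging
#   CR = 1.96 * sqrt(2) * SD * sqrt(1 - reliability)
# gives SD = MIC / (1.96 * sqrt(2) * sqrt(1 - reliability)).
import math

def sd_needed(mic: float, reliability: float, z: float = 1.96) -> float:
    return mic / (z * math.sqrt(2.0) * math.sqrt(1.0 - reliability))

for rel in (0.90, 0.95, 0.99):
    print(rel, round(sd_needed(5.0, rel), 1))
# 0.90 -> ~5.7, 0.95 -> ~8.1, 0.99 -> ~18.0 (close to the values read from Figure 1)
```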
Combining Statistically Significant and Meaningful Individual Change
A clinician or researcher might also regard relative standing on the measure at the follow-up time point to be important. In some areas of medicine, change in clinical status alone is enough to be important. For example, COVID-19 patients who changed to a more positive level on a six-point ordinal scale (not hospitalized; hospitalized but not requiring supplemental oxygen; hospitalized, requiring supplemental oxygen; hospitalized, requiring nasal high-flow oxygen therapy, non-invasive ventilation, or both; hospitalized, requiring invasive mechanical ventilation, extracorporeal membrane oxygenation, or both; dead) were regarded as improved in one study [23]. Or, a primary care physician might be interested in whether a patient ends up within the normal blood pressure range following initiation of high blood pressure medicine. Similarly, a rehabilitation clinician might want to know whether a patient with impaired physical functioning at the beginning of treatment ends up functioning as well as other people with a similar condition. The FDA has suggested that meaningful change needs to be assessed in addition to significant individual change [1]. Some contend that any individual change that is significant at p < .05 is substantial and likely to be meaningful to patients [10,24].
Jacobson and Truax [25] classified change as: 1) recovered (statistically significant and clinically significant); 2) improved (statistically significant but not clinically significant); 3) unchanged (not statistically significant); and 4) deteriorated (statistically significant decrement). In one study, responders were those with significant individual improvement on the Functional Disability Inventory (FDI) and improvement in FDI severity level (no/minimal disability, moderate disability, severe disability) [26]. These change categories offered by Jacobson and Truax may be more appealing than use of either statistically significant change (the coefficient of repeatability) or the MIC alone.
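A schematic of how these four categories might be operationalized is sketched below; it assumes an RCI for the significance criterion and the crossing of a severity cutoff (e.g., an FDI severity band) for the clinical criterion. The function and argument names are illustrative, not taken from the cited studies.

```python
# Illustrative Jacobson-Truax style classification combining statistical
# significance (|RCI| >= 1.96) with a clinical criterion (whether the score
# falls in the clinical range, e.g., above an FDI severity cutoff).
def classify_change(rci: float, clinical_at_baseline: bool, clinical_at_followup: bool) -> str:
    if rci >= 1.96 and clinical_at_baseline and not clinical_at_followup:
        return "recovered"      # significant improvement that crosses the clinical cutoff
    if rci >= 1.96:
        return "improved"       # significant improvement only
    if rci <= -1.96:
        return "deteriorated"   # significant decrement
    return "unchanged"          # change not statistically significant
```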
Secondary Analysis Combining Significant and Meaningful Individual Change
To illustrate how significant individual change and meaningful individual change can be presented together, we conduct a secondary analysis of the Impact Stratification Score (ISS), computed from the PROMIS-29 administered in a prospective comparative effectiveness clinical trial of 750 active-duty U.S. military personnel [27]. The average age of the sample was 31; 76% were male and 67% were white. Most of the participants reported low back pain for more than 3 months.
The ISS was proposed for use with chronic low back pain patients by a National Institutes of Health Pain Consortium research task force. The ISS is the sum of the PROMIS-29 v2.1 physical function, pain interference, and pain intensity scores [28]. The ISS has a possible range of 8 (least impact) to 50 (greatest impact). Physical function (4 items with response options ranging from “without any difficulty” = 1 to “unable to do” = 5) and pain interference (4 items with response options ranging from “not at all” = 1 to “very much” = 5) each contribute 4 to 20 points, and the pain intensity item contributes 0 to 10 points. The task force proposed three categories of ISS severity: 8-27 (mild), 28-34 (moderate), and 35-50 (severe).
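To make the scoring concrete, the sketch below computes the ISS and its severity category as described above; the example item scores are invented for illustration.

```python
# Minimal sketch of the ISS: raw PROMIS-29 physical function and pain
# interference sums (4-20 points each) plus the 0-10 pain intensity item.
def impact_stratification_score(physical_function: list[int],
                                pain_interference: list[int],
                                pain_intensity: int) -> tuple[int, str]:
    iss = sum(physical_function) + sum(pain_interference) + pain_intensity  # range 8-50
    if iss <= 27:
        severity = "mild"       # 8-27
    elif iss <= 34:
        severity = "moderate"   # 28-34
    else:
        severity = "severe"     # 35-50
    return iss, severity

# Example: four physical function items scored 3, four pain interference
# items scored 3, and pain intensity of 6 -> ISS = 12 + 12 + 6 = 30 (moderate).
print(impact_stratification_score([3, 3, 3, 3], [3, 3, 3, 3], 6))
```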
Following guidelines by de Vet et al. [29], Dutmer et al. [7] estimated an SEM of 5.2 for the ISS based on test-retest reliability. But test-retest reliability estimates can be problematic: test-retest reliability can underestimate reliability when there is true underlying change. Reeve et al. [30] noted that:
ISOQOL respondents agreed that as a minimum standard a multi-item PRO measure should be assessed for internal consistency reliability.… However, they did not support as a minimum standard that a multi-item PRO measure should be required to have evidence of test–retest reliability. They noted practical concerns regarding test–retest reliability; primarily that some populations studied in PCOR are not stable and that their HRQOL can fluctuate. This phenomenon would reduce estimates of test–retest reliability, making the PRO measure look unreliable when it may be accurately detecting changes over time. In addition, memory effects will positively influence the test–retest reliability when the two survey points are scheduled close to each other.
We estimated a much smaller SEM of 2.4 using an internal consistency reliability estimate from another study [27]. In this dataset, we examine the significance of individual change on the ISS between baseline and 6 weeks later using the coefficient of repeatability (= 6.6). In addition, we compare the significance of change with self-reports on a retrospective rating of change item administered at 6 weeks: “Compared to your first visit, your low back pain is: much worse, a little worse, about the same, a little better, moderately better, much better, or completely gone?”
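The 6.6-point coefficient of repeatability follows directly from this SEM: coefficient of repeatability = 1.96 × √2 × SEM = 1.96 × 1.414 × 2.4 ≈ 6.6 points of ISS change.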
Thirty-seven percent of the sample improved significantly on the ISS over these 6 weeks, and 59% reported on the retrospective change item that they were better (16% a little better, 14% moderately better, 23% much better, and 6% completely gone). Among those who improved significantly on the ISS, 89% reported they were better on the retrospective rating item. Thirty-three percent of the sample improved significantly and reported improvement on the retrospective change item (statistically and clinically significant), 4% improved significantly but did not report that they were better on the retrospective change item (statistically but not clinically significant), 26% did not improve significantly but reported improvement on the change item, and 37% neither improved significantly nor reported improvement on the change item.
Extending this application to further illustrate how group-based methods of estimating MICs can underestimate significant individual change, we compared two alternative ways of defining improvement on the retrospective rating of change item to identify optimal cut points on the ISS. The first definition is more inclusive: improvement from baseline to 6 weeks later included those who reported on the retrospective change item at 6 weeks that their back pain was a little better, moderately better, much better, or completely gone. The second definition is more restrictive: improvement was limited to those who reported their back pain was moderately better, much better, or completely gone.
The Youden index [31], (sensitivity + specificity) − 1, suggested an optimal cut point of 5 points of change on the ISS from baseline to 6 weeks later for the first definition of improvement: sensitivity of 65%, specificity of 82%, negative predictive value of 62%, and positive predictive value of 84%. For the second definition of improvement, the Youden index indicated an optimal cut point of 7 points of ISS change: sensitivity of 66%, specificity of 85%, negative predictive value of 77%, and positive predictive value of 76%. The group-level threshold estimated for the second definition, which excluded from the improvement group those who said they were only a little better, was closer to the coefficient of repeatability.
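For readers who wish to apply this kind of cut-point search to their own data, the sketch below implements the Youden index, (sensitivity + specificity) − 1, over candidate change-score cut points; the array names are hypothetical and this is not the analysis script used for the trial data.

```python
# Hedged sketch of a Youden index search for the optimal change-score cut
# point against an anchor-defined improvement flag.
import numpy as np

def youden_optimal_cutpoint(change: np.ndarray, improved: np.ndarray) -> tuple[float, float]:
    """change: improvement scores (e.g., baseline minus 6-week ISS);
    improved: 1 if the anchor classifies the person as improved, else 0."""
    best_cut, best_j = None, -np.inf
    for cut in np.unique(change):
        predicted = change >= cut
        sensitivity = predicted[improved == 1].mean()
        specificity = (~predicted)[improved == 0].mean()
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```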