Censoring patients at the time of an intercurrent event has been a widely used strategy in the analysis of PFS. Before the introduction of the estimand framework, sensitivity analyses varied the censoring rules, and little attention was given to the fact that different censoring rules for intercurrent events (such as treatment discontinuation or initiation of subsequent treatment), often chosen to satisfy different stakeholders, actually addressed different clinical questions or provided a biased estimate of the treatment effect of interest. Within the new framework, these analyses represent different estimands rather than sensitivity analyses. As an example, censoring patients at the time of treatment discontinuation due to toxicity can bias the analysis in favor of the more toxic treatment in a randomized trial [3, 24].
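The toxicity-discontinuation bias can be illustrated with a small simulation; this is a minimal sketch under purely hypothetical hazards, not a re-analysis of any trial. If a shared frailty makes the same patients both progress early and discontinue early for toxicity, censoring at discontinuation preferentially removes poor-prognosis patients from the risk set, and the Kaplan-Meier median is inflated:

```python
import random

random.seed(1)

def km_median(data):
    """Kaplan-Meier median from (time, is_event) pairs."""
    data = sorted(data)
    at_risk, s, i = len(data), 1.0, 0
    while i < len(data):
        t, d, c = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            d += data[i][1]          # events at time t
            c += not data[i][1]      # censorings at time t
            i += 1
        if d:
            s *= 1.0 - d / at_risk
        at_risk -= d + c
        if s <= 0.5:
            return t
    return float("inf")

n = 20000
uncensored, censored = [], []
for _ in range(n):
    frail = random.random() < 0.4            # 40% poor-prognosis patients
    mult = 3.0 if frail else 1.0             # frailty drives BOTH hazards
    t_prog = random.expovariate(0.2 * mult)  # time to progression
    t_disc = random.expovariate(0.3 * mult)  # time to toxic discontinuation
    uncensored.append((t_prog, True))                         # full follow-up
    censored.append((min(t_prog, t_disc), t_prog <= t_disc))  # censor at disc.

m_full = km_median(uncensored)
m_cens = km_median(censored)
print(f"median PFS, full follow-up:              {m_full:.2f}")
print(f"median PFS, censored at discontinuation: {m_cens:.2f}")  # inflated
```

Because discontinuation and progression share the frailty, the censoring is informative, and the second estimate is systematically larger than the first.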
For clinical trials designed before the Covid-19 pandemic and ongoing at the time it started, one of the main questions researchers have been asking is: "Will the originally defined analysis (typically using a 'treatment policy' strategy) still correctly address the scientific question of interest?"
The oncology estimand working group, a cross-industry international collaboration, was established by the European Federation of Statisticians in the Pharmaceutical Industry as a European special-interest group for estimands in oncology and was later granted the official status of an American Statistical Association scientific working group. In their publications, they have argued that the clinical trial objective should relate to a world without the ongoing Covid-19 pandemic, including no major disruption of healthcare systems [20]. Censoring time-to-event endpoints at the intercurrent event was suggested as a possible way to target a hypothetical estimand, even in the context of pandemic-related healthcare system disruptions in the PFS analysis [25].
Our simulated examples illustrate the indirect impact of a pandemic on the results of ongoing trials when a "hypothetical" strategy, rather than the "treatment policy" strategy, is used to handle assessments missed because of a shutdown. Our simulations show that the "hypothetical" strategy, which censors the events documented immediately after the shutdown period, has serious statistical implications. First, the loss of events due to censoring reduces the power of the statistical comparison unless the analysis time is delayed by several months. Second, this method results in a greater overestimation of the median PFS than the "treatment policy" strategy. Finally, as the maturity of the PFS curve and the median PFS are reached at different times in the two treatment arms, the pandemic may affect different portions of the PFS curves in the two groups. For this reason, the imbalance in median overestimation can be more pronounced with the "hypothetical" strategy than with the "treatment policy" strategy. For all these reasons, the treatment policy strategy should remain the primary method of assessment. Interval-censoring methods, although not broadly used for primary analyses, would address the issue of unduly long intervals between scans. However, this more sophisticated approach does not provide a point estimate of the median PFS.
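The two strategies can be contrasted in a single-arm simulation sketch; every design number here is hypothetical (12-month uniform accrual, true median PFS of 9 months, a calendar shutdown from month 18 to month 24). Events whose documentation falls in the shutdown window are either documented late at reopening ("treatment policy") or censored at the shutdown start ("hypothetical"), showing both the loss of events and the upward shift of the estimated median:

```python
import math
import random

random.seed(2)

def km_median(data):
    """Kaplan-Meier median from (time, is_event) pairs."""
    data = sorted(data)
    at_risk, s, i = len(data), 1.0, 0
    while i < len(data):
        t, d, c = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            c += not data[i][1]
            i += 1
        if d:
            s *= 1.0 - d / at_risk
        at_risk -= d + c
        if s <= 0.5:
            return t
    return float("inf")

ACCRUAL, SHUT_LO, SHUT_HI = 12.0, 18.0, 24.0  # calendar months (hypothetical)
n = 40000
full, policy, hypo = [], [], []
for _ in range(n):
    a = random.uniform(0.0, ACCRUAL)           # accrual (calendar) date
    t = random.expovariate(math.log(2) / 9.0)  # true PFS, median 9 months
    full.append((t, True))                     # world without a pandemic
    if SHUT_LO <= a + t < SHUT_HI:             # event falls in the shutdown
        policy.append((SHUT_HI - a, True))     # documented at reopening
        hypo.append((SHUT_LO - a, False))      # censored at shutdown start
    else:
        policy.append((t, True))
        hypo.append((t, True))

n_ev_hypo = sum(e for _, e in hypo)
m_full, m_policy, m_hypo = km_median(full), km_median(policy), km_median(hypo)
print(f"events: treatment policy {n}, hypothetical {n_ev_hypo}")
print(f"medians: no pandemic {m_full:.2f}, policy {m_policy:.2f}, "
      f"hypothetical {m_hypo:.2f}")
```

In this configuration, the "hypothetical" strategy loses all events documented during the window, which is the power loss noted above, and both strategies push the estimated median above the true 9 months.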
Censoring a patient at the time of an intercurrent event effectively estimates the PFS that this patient would have had in the absence of the event from the PFS of the patients in the same treatment arm who had not experienced the intercurrent event by that time. The underlying assumption is therefore that a patient in arm A who experienced the intercurrent event at time t has the same PFS expectation from time t onward as the patients remaining at risk in arm A at time t ('non-informative censoring'). In many situations, e.g., censoring at a switch to another treatment, this assumption does not hold. It may, however, be reasonable in the specific case of the shutdown of healthcare facilities during a pandemic. Nevertheless, as shown in our simulations, even when the assumption holds, censoring can distort the treatment comparison when medians are used.
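A toy simulation (rates hypothetical) shows why the assumption matters: when the censoring time is drawn independently of the progression time, the Kaplan-Meier median stays close to the truth, which is precisely what fails in the informative case of censoring at a prognosis-driven treatment switch:

```python
import math
import random

random.seed(3)

def km_median(data):
    """Kaplan-Meier median from (time, is_event) pairs."""
    data = sorted(data)
    at_risk, s, i = len(data), 1.0, 0
    while i < len(data):
        t, d, c = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            c += not data[i][1]
            i += 1
        if d:
            s *= 1.0 - d / at_risk
        at_risk -= d + c
        if s <= 0.5:
            return t
    return float("inf")

n = 40000
true_median = 9.0
data = []
for _ in range(n):
    t = random.expovariate(math.log(2) / true_median)  # progression time
    c = random.expovariate(0.05)   # censoring time, independent of t
    data.append((min(t, c), t <= c))

m = km_median(data)
print(f"KM median under independent censoring: {m:.2f} (true {true_median})")
```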
To evaluate whether the pandemic shutdown induced any bias in the evaluation of PFS, several sensitivity or supplementary analyses may be performed. As an example, estimating the times to the nth visit (n = 1, 2, …, last) by treatment group might reveal an imbalance between treatment groups in the occurrence of pandemic-induced delayed assessments.
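As a sketch of such a diagnostic (all schedules, dates, and the accrual imbalance are hypothetical), suppose tumor assessments are planned every 2 months and any visit falling in a calendar shutdown window is postponed to the reopening date. If one arm happened to be accrued later, its early visits are hit harder, and the arm-wise mean time to the 4th visit reveals the imbalance:

```python
import random
import statistics

random.seed(4)

SHUT_LO, SHUT_HI = 18.0, 24.0   # calendar shutdown window (hypothetical)
VISIT_GAP = 2.0                 # planned assessment interval, months

def time_to_visit(accrual, k):
    """Time from randomization to the k-th assessment, with shutdown delay."""
    planned = k * VISIT_GAP
    calendar = accrual + planned
    if SHUT_LO <= calendar < SHUT_HI:
        return SHUT_HI - accrual        # visit postponed to reopening
    return planned

# Hypothetical imbalance: arm B accrued later than arm A
arm_a = [random.uniform(0.0, 6.0) for _ in range(5000)]
arm_b = [random.uniform(6.0, 12.0) for _ in range(5000)]

mean_a = statistics.mean(time_to_visit(a, 4) for a in arm_a)
mean_b = statistics.mean(time_to_visit(a, 4) for a in arm_b)
print(f"mean time to 4th visit: arm A {mean_a:.1f}, arm B {mean_b:.1f}")
```

Here the 4th visit of arm A never falls in the window, while a third of arm B's do, so the arm-wise means diverge even though the shutdown itself is unrelated to treatment.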
It is important to note that each patient is affected by the pandemic at a different time point relative to their date of randomization. This difference between the time scale used for the treatment comparison (time from randomization) and calendar time stretches the portion of the PFS curves affected by the pandemic shutdown to an extent that depends on the duration of accrual. As an example, if a 6-month pandemic shutdown starts 6 months after the end of a 12-month accrual period, we can only be reassured that events observed within 6 months of randomization or beyond 24 months from randomization are unaffected by the shutdown. As the true median PFS is assumed to differ between treatments, the pandemic period, although unrelated to treatment, will affect each PFS curve in a different way, which may translate into an imbalanced overestimation of the median PFS between treatment groups. Indeed, when the shutdown starts before the median is reached in the experimental arm, but after sufficient follow-up has accrued and half of the patients have progressed in the control arm, only the median PFS in the experimental arm can be overestimated. This is illustrated, for example, by our findings for Scenario 1 with the shutdown starting at 24 months and for Scenario 2 with the shutdown starting at 21 and 24 months.
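The affected portion of the randomization time scale follows directly from the accrual and shutdown dates. A minimal helper (names hypothetical) reproduces the arithmetic of the example above:

```python
def affected_window(accrual_end, shutdown_start, shutdown_end):
    """Randomization-time interval in which events can be hit by the shutdown.

    A patient accrued at calendar time a (0 <= a <= accrual_end) is affected
    at randomization time t exactly when shutdown_start <= a + t < shutdown_end,
    so some patient is affected for every t in the interval returned below.
    """
    return (max(0.0, shutdown_start - accrual_end), shutdown_end)

# 12-month accrual, 6-month shutdown starting 6 months after accrual ends:
lo, hi = affected_window(12, 18, 24)
print(f"only events before month {lo:g} or after month {hi:g} are unaffected")
# → only events before month 6 or after month 24 are unaffected
```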
Under Scenario 1, results of the simulated study affected by a pandemic shutdown occurring 24 months from start of accrual illustrate that phenomenon, with a control median PFS estimate left unaffected but a treatment difference in median PFS of up to 10 months with the “hypothetical” approach (instead of the 5 months in the absence of a pandemic). Similarly, in Scenario 2, a treatment difference of 13 months (instead of the true difference of 8 months) was observed when the pandemic occurred 18 months from accrual start or later.
Our simulation studies were based on an exponential distribution (in the absence of the pandemic), which implies proportional hazards. The hazard ratio, the appropriate measure of treatment effect under this assumption, is relatively unaffected by intercurrent events and censoring conventions. In situations commonly seen with immuno-oncology agents, such as delayed separation or crossing of the PFS curves (violating the proportional hazards assumption), and/or in settings in which a proportion of patients are expected to be cured, censoring PFS because of the pandemic may affect the two arms in an even more unbalanced way than in our examples. These situations illustrate how censoring a time-to-event endpoint for intercurrent events that are entirely independent of treatment can induce a bias in the treatment comparison. For this reason, in addition to the loss of power caused by massive censoring (which may not always be recoverable), the practice of censoring PFS, however well intended, is counterproductive. If interest truly focuses on a hypothetical estimand, methods based on causal inference can often be used instead, although in the presence of a pandemic affecting all patients, the opportunity for using such methods may be severely limited. Finally, the difference in median PFS is a statistically unstable and unreliable measure of treatment effect, and our results confirm that this statistic should generally not be used, regardless of the chosen estimand strategy [26, 27].
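The relative instability of the median difference can be seen in a small Monte Carlo sketch; all design numbers are hypothetical (repeated exponential trials with true medians of 6 and 11 months, no censoring). The hazard ratio is estimated by the exponential rate MLE, and the treatment effect alternatively as a difference in sample medians (which, without censoring, essentially coincides with the KM median); the coefficient of variation of the median difference is far larger:

```python
import math
import random
import statistics

random.seed(5)

M0, M1, N = 6.0, 11.0, 300      # true medians (months) and patients per arm
hrs, diffs = [], []
for _ in range(300):            # replicate trials
    arm0 = [random.expovariate(math.log(2) / M0) for _ in range(N)]
    arm1 = [random.expovariate(math.log(2) / M1) for _ in range(N)]
    # exponential MLE: rate = 1 / mean, so HR = mean(control) / mean(exp.)
    hrs.append(statistics.mean(arm0) / statistics.mean(arm1))
    diffs.append(statistics.median(arm1) - statistics.median(arm0))

cv_hr = statistics.stdev(hrs) / statistics.mean(hrs)
cv_diff = statistics.stdev(diffs) / statistics.mean(diffs)
print(f"CV of hazard ratio estimate:      {cv_hr:.3f}")
print(f"CV of median-difference estimate: {cv_diff:.3f}")
```

The difference in medians depends on two local features of the estimated curves, so its sampling variability is much larger relative to its value than that of the hazard ratio, which pools all events.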