Emulating trials from observational data
Given the large costs associated with prospective RCTs, an important question to consider is whether one is needed to answer a research hypothesis. This question has received particular attention in recent years, given the increasing amount of routinely collected data available, from sources such as CALIBER(11). Furthermore, there are now an array of patient cohorts and registries, with IMID-Bio-UK(12) an example of a UK initiative to bring these together for various IMIDs.
These data sources allow comparisons of different treatment strategies to be conducted through retrospective observational studies. Results from such analyses can be valuable, but are subject to confounding and other flaws such as selection bias and immortal-time bias(13). This is especially true if inappropriate analyses are applied.
An example from outside IMIDs of how an inappropriate analysis can give a misleading answer is presented by Dickerman et al(14). The effect of statins on the risk of developing cancer was assessed from retrospective data by comparing individuals who had received multiple years of statin therapy against those who had not. Even after adjustment for potential confounders, this approach was severely biased, because individuals could only have received multiple years of statin therapy if they had not died from cancer before or during that time. Within IMIDs, a recent paper(15) reviewed retrospective comparative effectiveness evaluations in rheumatoid arthritis and found that most analyses had flaws that could lead to bias.
Instead, an approach called emulation of a target trial(16) can address many of these biases and provide more reliable answers. This involves specifying the ‘target trial’ that one would ideally have conducted (i.e., which patient population, intervention, comparator, and outcomes) and analysing the observational data in a way that emulates this as closely as possible. Each timepoint in the retrospective data is examined to identify which patients would have been eligible for randomisation in the target trial. The probability that they could have received the intervention or comparator is then modelled in a way that emulates random assignment as closely as possible. Dickerman et al(14) demonstrate how this approach, applied to data from CALIBER, yields the same conclusion as a large meta-analysis of RCTs regarding the (lack of) effect of statins on reducing the risk of cancer.
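To illustrate the core idea, the sketch below (in Python, using an entirely hypothetical simulated dataset and variable names) emulates a single baseline of a target trial: eligibility criteria are applied, the probability of receiving treatment is modelled from baseline covariates, and patients are weighted by the inverse of that probability to approximate the balance that randomisation would have produced. A full emulation would repeat this at every timepoint at which patients become eligible and would also handle censoring and time-varying confounding.

```python
# A minimal sketch of one step of target trial emulation using inverse-probability-of-
# treatment weighting. The dataset is simulated and all column names (age, biomarker,
# treated, outcome) and the eligibility rule are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
data = pd.DataFrame({"age": rng.normal(55, 10, n), "biomarker": rng.normal(0, 1, n)})
# Treatment assignment depends on covariates (confounding by indication)
p_treat = 1 / (1 + np.exp(-(0.03 * (data["age"] - 55) + 0.8 * data["biomarker"])))
data["treated"] = rng.binomial(1, p_treat)
# Outcome depends on the covariates but (in truth) not on treatment
data["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * (data["age"] - 55) + data["biomarker"]))))

# 1. Apply the target trial's eligibility criteria at 'baseline'
eligible = data[(data["age"] >= 18) & (data["age"] <= 75)].copy()

# 2. Model the probability of receiving treatment given baseline covariates
ps_model = LogisticRegression().fit(eligible[["age", "biomarker"]], eligible["treated"])
ps = ps_model.predict_proba(eligible[["age", "biomarker"]])[:, 1]

# 3. Weight each patient by the inverse probability of the treatment actually received,
#    emulating the covariate balance that randomisation would have produced
eligible["weight"] = np.where(eligible["treated"] == 1, 1 / ps, 1 / (1 - ps))

# 4. Compare weighted outcome rates between the emulated 'arms'
treated = eligible["treated"] == 1
risk_treated = np.average(eligible.loc[treated, "outcome"], weights=eligible.loc[treated, "weight"])
risk_control = np.average(eligible.loc[~treated, "outcome"], weights=eligible.loc[~treated, "weight"])
print("Weighted risk difference:", round(risk_treated - risk_control, 3))
```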
With many IMIDs being chronic conditions, RCTs are often used to compare different strategies for employing treatments known to be efficacious. Examples include testing different ‘treat-to-target’ strategies(8), which employ more aggressive treatment until a measure of disease activity falls below a set threshold. When different strategies are already being employed in practice, and frequent measures of disease activity are recorded in routine data, emulation of target trials may be an efficient approach for evaluating different strategies.
It is important to note, however, that target trial emulation is still subject to bias. This is especially true if the routine dataset does not record sufficient information on potential confounding variables (or if there is a lot of missing data). Consequently, there may still be a need for prospective RCTs of treatment strategies. Nonetheless, target trial emulation could play an important role in prioritising which strategies should be tested and whether an RCT is likely to be successful in finding a significant effect.
Adaptive trial designs
An adaptive design is one “that offers pre-planned opportunities to use accumulating trial data to modify aspects of an ongoing trial while preserving the validity and integrity of that trial”(17). Adaptive designs encompass a wide range of approaches that can improve the efficiency of trials. Unlike the other innovative methodologies we discuss here, they have been discussed at length in other recent articles, both providing an overview of adaptive designs in general(18) and covering specific clinical areas such as rheumatology(19). We refer the reader to these articles for a comprehensive introduction to adaptive designs.
However, we provide in Table 1 a brief summary of several available types of adaptation and their potential advantages. We also highlight one key factor that influences the added efficiency provided by an adaptive design: the ratio between the recruitment length of the trial and the time taken to observe the primary endpoint(20). If it takes a long time to observe the primary endpoint, then at an interim analysis there will be a proportion of patients who do not yet contribute outcome information and who do not benefit from an adaptation. As an example, if the primary outcome takes one year to observe and all patients are recruited in six months, then by the time the first patient’s one-year outcome has been observed, all patients have been recruited and the adaptive design cannot provide any utility. A more quickly observed ‘intermediate’ outcome can be used to make adaptations, but it must be sufficiently informative about the primary outcome to be useful.
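The arithmetic behind this consideration can be made concrete with a short sketch; the recruitment length, follow-up time, interim timing, and sample size below are purely illustrative.

```python
# Toy calculation of how many patients have observed primary-outcome data at an
# interim analysis, assuming uniform recruitment. All numbers are hypothetical
# illustrations of the point made in the text.
def patients_with_outcome(recruit_months, followup_months, interim_month, n_total):
    """Number recruited, and number with the primary outcome observed, at the interim."""
    recruited = n_total * min(interim_month / recruit_months, 1.0)
    # A patient's outcome is observed only if they were recruited at least
    # `followup_months` before the interim analysis.
    observed = n_total * min(max(interim_month - followup_months, 0) / recruit_months, 1.0)
    return recruited, observed

# Example from the text: 12-month outcome with a 6-month recruitment window.
# By the time the first outcome is observed (month 12), everyone is already recruited.
print(patients_with_outcome(recruit_months=6, followup_months=12, interim_month=12, n_total=300))
# Contrast: a 3-month outcome with 24-month recruitment leaves many patients still to recruit,
# so an interim adaptation can still change how they are treated or allocated.
print(patients_with_outcome(recruit_months=24, followup_months=3, interim_month=12, n_total=300))
```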
Given the amount of well-developed methodology now available for adaptive trial design, we believe it is this consideration, the choice of primary outcome and its observation time relative to the anticipated recruitment rate, that will principally determine whether an adaptive approach provides efficiency advantages for a given IMID trial.
Basket and umbrella trial designs
Because of rapid advancements in biological and genomic understanding during the past few decades, an increasing number of new therapies are being formulated to target specific molecular or immune aberrations. Given that many IMIDs share common mechanisms, these targeted therapies may perform equally well for multiple distinct IMIDs.
Originating in oncology settings, basket and umbrella trial designs have recently emerged as efficient approaches for testing treatment efficacy in potentially heterogeneous subgroups(21). These designs are administratively efficient because they investigate multiple treatments or diseases, sometimes both, in a single study under an overarching protocol. Figure 1 gives conceptual illustrations of basket and umbrella trial designs with components (sub-studies) defined by biomarkers or genetic mutations, to which the new treatment(s) under evaluation are matched.
(Figure 1 here)
While traditional oncology trials focus on a single treatment for a specific cancer histology, basket trials can involve multiple histologies and enrol patients with a common mutation that the new therapy targets. As shown in Figure 1, an oncology basket trial consists of a number of sub-studies, each specific to a histology or disease subtype. The principal aim is to test treatment efficacy in the various sub-studies simultaneously. As examples, Drilon et al(22) evaluated the efficacy of larotrectinib, a tropomyosin receptor kinase inhibitor, in diverse TRK fusion-positive tumours. Hyman et al(23) evaluated the BRAF inhibitor vemurafenib, finding significant activity in some tumours (e.g., non-small cell lung carcinoma (NSCLC) and Erdheim-Chester disease), yet inactivity in pancreatic cancer and multiple myeloma.
Efforts have been made to translate the idea of basket designs to disease areas outside of oncology. For example, patients can be stratified to enter a trial with multiple sub-studies by biological characteristics, such as disease stage, number of prior therapies, specific genetic/epigenetic changes, or demographic characteristics(24). There is also precedent for a basket-type approach in IMID research. Although not officially labelled a basket trial, TRANSREG(25) is a multicentre open-label trial involving 11 IMID patient subgroups evaluating the safety, biological and clinical effects of low-dose interleukin-2. The broad eligibility criteria allow patients with rare IMIDs to participate in the trial.
Early strategies for analysing basket trials regarded the sub-studies in isolation. Although this fully acknowledges the heterogeneity in response to the same treatment across the various patient subgroups, it inevitably leads to low-powered tests because of the small sample sizes. Several more sophisticated approaches have been developed to enable sharing of information across sub-studies(26–29), among which the proposal by Zheng and Wason(26) can be readily applied to non-oncology basket trials with covariates. With appropriate extension or modification, these approaches could enable the efficient design and analysis of IMID basket trials.
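As a simple illustration of what information sharing achieves, the sketch below applies a basic normal–normal random-effects (shrinkage) model to hypothetical sub-study response rates; the published approaches cited above are considerably more sophisticated, but the principle of shrinking small sub-studies towards a common estimate is the same.

```python
# A minimal sketch of borrowing information across basket-trial sub-studies using a
# normal-normal hierarchical (random-effects) model fitted by empirical Bayes.
# The responder counts and sample sizes below are hypothetical.
import numpy as np

# Observed responses in each sub-study, with approximate log-odds and standard errors
responders = np.array([6, 9, 2, 11])
n = np.array([15, 20, 14, 22])
p_hat = responders / n
theta_hat = np.log(p_hat / (1 - p_hat))                  # sub-study log-odds of response
se2 = 1 / responders + 1 / (n - responders)              # approx. variance of each log-odds

# Crude empirical-Bayes estimates of the common mean and between-sub-study variance
mu = np.average(theta_hat, weights=1 / se2)
tau2 = max(np.var(theta_hat, ddof=1) - se2.mean(), 0.0)  # method-of-moments estimate

# Shrink each sub-study estimate towards the common mean; sub-studies with little
# data are shrunk the most, which is the source of the efficiency gain
shrinkage = tau2 / (tau2 + se2)
theta_shrunk = shrinkage * theta_hat + (1 - shrinkage) * mu
print("Raw response rates:     ", np.round(p_hat, 2))
print("Shrunken response rates:", np.round(1 / (1 + np.exp(-theta_shrunk)), 2))
```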
By contrast, umbrella designs, illustrated in Figure 1, offer the possibility to efficiently test multiple targeted therapies in a single disease population(24). To date, umbrella designs have only been implemented in oncology(30): patients of the same tumour type, screened with an array of biomarkers, receive the treatment specific to their genetic aberration. The ongoing ALCHEMIST trial(31) represents an early example of an umbrella trial. It enrols NSCLC patients and evaluates therapies targeting two types of genetic change, EGFR mutations and ALK translocations, which are hypothesised to be key drivers of tumour growth and disease progression.
Increased understanding of the pharmacogenomics and pharmacogenetics of IMIDs, especially rheumatoid arthritis(9,32), makes umbrella designs a suitable approach for answering more treatment-related questions efficiently in a single trial. The identification of specific genes and epigenetic changes involved in the development of rheumatoid arthritis, which may be predictive of the response to treatment, could potentially lead to the initiation of an umbrella trial.
With the multi-biomarker approach of umbrella trials, more patients are likely to meet the eligibility criteria for at least one of the biomarker-defined subgroups. This is particularly beneficial compared with an alternative ‘enrichment’ trial that tests one targeted treatment in a single subgroup. However, there are unresolved issues in how best to allocate patients who test positive for more than one biomarker, or for none, in an umbrella trial; allocating the most suitable treatment to such patients is not straightforward.
Umbrella designs are flexible and can be integrated with various adaptive design features to make them more efficient. Biomarker-adaptive randomisation could be incorporated to assign patients to the most promising biomarker-linked treatments using accruing trial data (e.g., as in the recent BATTLE trials(33)); a multi-arm multi-stage (MAMS) type approach could be used when a number of treatments are available for evaluation within a cohort; and if promising treatments that were unavailable at the start of the trial become available, protocol amendments could allow the addition of trial arms.
Ultimately, both basket and umbrella designs allow investigators to test more research questions in the same trial. Basket trials help assess whether a new therapy works in distinct patient subgroups (or related diseases) and to what extent(34), while umbrella trials identify whether biomarker-treatment pairs are valid and which one(s) can best improve outcomes.
Sequential multiple assignment randomised trial (SMART) designs
Therapy of chronic conditions or rapidly fatal diseases often requires several lines of treatment with different drugs or interventions used as the disease progresses. In each line, the treatment may achieve the required clinical objective (e.g., response), or not (e.g., non-response). When treatment fails for a patient at a certain line, it is common medical practice to switch to a different treatment or strategy for the next line. The type or dose of the treatment/intervention may be adjusted repeatedly according to a patient's ongoing clinical information, including their treatment history and response to previous treatments(35,36).
An adaptive intervention is a treatment strategy that personalises treatment through established decision rules that recommend when and how the treatment changes, taking into account the history of previous treatments and response to those treatments(37). A Sequential Multiple Assignment Randomised Trial (SMART) is a multistage trial design used to construct effective dynamic treatment regimens (DTRs), also known as adaptive interventions (AIs) or adaptive treatment strategies(38). Figure 2 depicts an example of a SMART design in which only non-responders to the first-stage intervention are re-randomised in the second stage. This provides the information needed to construct an AI that chooses which first-line intervention to use and how to subsequently treat patients who do not respond to it.
(Figure 2 here)
An AI consists of four key elements: critical decision point(s), intervention component(s), tailoring variable(s), and decision rule(s). The first element, a sequence of critical decision points, comprises the intervention to begin with, when and how to measure signs of response/non-response, how to maintain the success of the initial intervention, and what interventions may be used for non-responders. The second element, the intervention components, is the set of intervention/treatment options at each critical decision point. From Figure 2 we can see that there are two treatment options in the first stage (treatments A and B) and six treatment options in the second stage (two options for responders and four options for non-responders). The third element is the tailoring variable(s): an early indicator of the overall outcome (success or failure of the intervention). The response status at week 24 plays the role of the tailoring variable in the example shown in Figure 2. Lastly, the decision rules at each critical decision point link the tailoring variable(s) to the intervention components. Each stage in a SMART corresponds to one of the critical decisions involved in the adaptive intervention. Each participant moves through the multiple stages, and at each stage the participant is randomly (re)assigned to one of several intervention options(35,39). Each AI can be summarised in the form (X1, X2, X3), where X1 is the recommended first-stage treatment, X2 the recommended second-stage treatment for responders, and X3 the recommended second-stage treatment for non-responders. There are four different adaptive interventions embedded in the SMART depicted in Figure 2: (A, A, C), (A, A, D), (B, B, E), and (B, B, F).
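The decision rules of the embedded AIs can be written down explicitly. The sketch below encodes the four AIs from Figure 2 in this way, using the (X1, X2, X3) notation; the treatment labels and the week-24 tailoring variable follow the hypothetical example of the figure.

```python
# A sketch of how one adaptive intervention (AI) embedded in the SMART of Figure 2
# could be encoded as a decision rule. The treatment labels and the week-24 tailoring
# variable are those of the hypothetical example in the figure.
from dataclasses import dataclass

@dataclass
class AdaptiveIntervention:
    first_stage: str       # X1: treatment recommended at stage 1
    if_responder: str      # X2: stage-2 treatment if the patient responds by week 24
    if_nonresponder: str   # X3: stage-2 treatment if the patient does not respond

    def second_stage(self, responded_at_week_24: bool) -> str:
        """Decision rule linking the tailoring variable to the stage-2 option."""
        return self.if_responder if responded_at_week_24 else self.if_nonresponder

# The four AIs embedded in the SMART of Figure 2
embedded_ais = [
    AdaptiveIntervention("A", "A", "C"),
    AdaptiveIntervention("A", "A", "D"),
    AdaptiveIntervention("B", "B", "E"),
    AdaptiveIntervention("B", "B", "F"),
]
print(embedded_ais[0].second_stage(responded_at_week_24=False))  # -> "C"
```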
SMARTs have been used for a wide range of chronic conditions, including some IMIDs. Recent studies that have used them include the CATIE study of treatments for schizophrenia(40), the EXTEND trial of treatments for alcohol dependence(41), and studies of treatments for metastatic renal cell carcinoma(42), depression(43), HIV infection(44,45), ulcerative colitis(46), autoinflammatory recurrent fever syndromes(47), psoriasis(48–50), and rheumatoid arthritis(51).
An alternative to a SMART is the use of “multiple one-stage-at-a-time” randomised trials, in which each critical decision point is treated as an independent trial(39). For instance, the SMART in Figure 2 corresponds to three different “one-stage-at-a-time” trials: the first would compare the first-stage treatment options, the second would study treatment in non-responders to treatment A, and the third would study treatment in non-responders to treatment B. One advantage of the SMART design over the “multiple one-stage-at-a-time” approach is that it uses information from all stages to find the best AI. To do this, it uses Q-learning, a multistage regression method that uses data from a SMART to examine whether and how particular variables can be used to develop an AI or improve an existing one.
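The sketch below illustrates the backward-induction logic of Q-learning on simulated SMART-like data; the variable names and the data-generating model are hypothetical, and a real analysis would involve more careful model building and inference.

```python
# A minimal two-stage Q-learning sketch on simulated SMART-like data. The point is the
# backward-induction logic: fit the stage-2 regression first, replace each non-responder's
# outcome by the best predicted stage-2 value, then fit the stage-1 regression on that
# pseudo-outcome. All variable names and the data-generating model are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000
severity = rng.normal(0, 1, n)            # baseline covariate
a1 = rng.choice([0, 1], n)                # stage-1 treatment (A=0, B=1)
response = rng.binomial(1, 0.4, n)        # intermediate response at week 24
a2 = np.where(response == 1, a1, rng.choice([0, 1], n))   # only non-responders re-randomised
# Final outcome: stage-2 treatment 1 helps more severe non-responders (hypothetical truth)
y = 1 + 0.3 * a1 - 0.5 * severity + (1 - response) * a2 * (0.4 + 0.6 * severity) + rng.normal(0, 1, n)

# Stage 2: regression for non-responders, including a treatment-by-severity interaction
nr = response == 0
X2 = np.column_stack([severity[nr], a2[nr], a2[nr] * severity[nr]])
q2 = LinearRegression().fit(X2, y[nr])

def predict_stage2(sev, treat):
    """Predicted outcome for non-responders with given severity under stage-2 treatment `treat`."""
    return q2.predict(np.column_stack([sev, np.full_like(sev, treat), treat * sev]))

# Pseudo-outcome: responders keep their outcome; non-responders get their best stage-2 prediction
y_tilde = y.copy()
y_tilde[nr] = np.maximum(predict_stage2(severity[nr], 0), predict_stage2(severity[nr], 1))

# Stage 1: regress the pseudo-outcome on stage-1 treatment (and the covariate)
q1 = LinearRegression().fit(np.column_stack([severity, a1]), y_tilde)
print("Estimated stage-1 benefit of B over A:", round(q1.coef_[1], 2))
```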
SMARTs are not without limitations, however. In particular, issues arise in modelling data from SMARTs when estimation of the optimal AI is of interest, including model building, missing data, statistical inference, and choosing an outcome when only non-responders are re-randomised(36). The fact that re-randomisation depends on the evolving patient status, together with the sequential nature of the design, brings more complexity to the handling of missing data than in classical clinical trials. For instance, in a SMART where only non-responders are re-randomised at the second stage, a patient who is lost to follow-up during the first stage will have missing information on their intermediate response status, second-stage treatment, and outcome. It is not possible to know whether the second-stage information is truly missing or is missing by design, since this depends on the unobserved response status. Furthermore, the use of flexible regression approaches to avoid specifying complex functions in Q-learning can make it difficult to obtain interpretable results and valid statistical inference, owing to potentially high variability(36).
Nonetheless, SMARTs offer considerable potential utility in chronic IMIDs, where identifying the most suitable AI is of interest.
Use of high-dimensional data to stratify patients: Adaptive signature trial designs
It is common in clinical trials that only a subgroup of treated patients benefits from an experimental therapy(52–55). Identifying these subgroups would allow tailoring of treatment, avoiding costly or toxic treatment of individuals who will not benefit. To identify such subgroups, predictive biomarkers are required: objective characteristics of a patient’s function or health that are associated with the response to treatment and can therefore be used to predict a patient’s likely response. Some clinical areas, such as oncology, have good availability of predictive biomarkers. For example, RAS mutation status identified a subgroup of patients with colorectal cancer who had a significant benefit from treatment across all efficacy endpoints(56).
However, predictive biomarkers are lacking for most IMIDs, making prediction of response to treatment more difficult(57–59). For example, in rheumatoid arthritis, although genetic variants associated with response to methotrexate have been identified(60–63), there is a lack of consensus on the predictive utility of these variants.
In the absence of predictive biomarkers, alternative methods that utilise high-dimensional information could be used. With the rapid development of next-generation sequencing, proteomics, and medical imaging technologies, a large amount of high-dimensional data on patients is starting to be collected in clinical trials. This information has the potential to identify subgroups of patients who are likely to benefit from a new treatment.
To utilise high-dimensional information in RCTs, a method has been developed known as the adaptive signature design (ASD). The aim of the ASD is to allow a single RCT both to test the overall treatment effect in all patients and to form a predictive biomarker signature that identifies a subgroup of patients who strongly benefit from the treatment. Although the ASD has ‘adaptive’ in its name, it is not an adaptive design in the sense used earlier, as it does not change any aspect of the trial while it is ongoing.
The original method(64,65) utilised (high-dimensional) gene expression data in an oncology setting, but the approach can be used whenever heterogeneity in the treatment effect is expected and high-dimensional information is available. Which of the high-dimensional covariates are included in the signature is determined by imposing thresholds on the significance level, the odds ratios, and the number of biomarkers. Further papers have proposed modifications of the original ASD(66–68) to provide improved performance (in terms of correctly identifying a subgroup who benefit from treatment). In these methods, the high-dimensional data are used to form a signature computed from the interactions between these data and the treatment. The adaptive signature is represented by a single score for each patient. These scores can then be used to divide the patients into subgroups using a variety of clustering techniques, or as covariates in tests of association with the outcome. The overall comparison between the arms can be performed by testing for a difference between the arms in the full trial population (at significance level α1) and testing for a difference between the arms in the identified subgroup (at significance level α2). The overall significance level of the trial is then controlled at the α = α1 + α2 level (Figure 3).
(Figure 3 here)
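The sketch below illustrates this split-alpha logic on simulated data: an overall test at level α1, a signature built from treatment–covariate interactions in a training subset, and a subgroup test at level α2 in the remaining patients. The data, the interaction model, and the choice α1 = α2 = 0.025 are hypothetical; the original proposals use a training/validation split or cross-validation with their own recommended significance levels.

```python
# A sketch of the adaptive signature design's split-alpha testing on simulated data.
# The data-generating model, the per-covariate interaction screening, and the levels
# alpha1 = alpha2 = 0.025 are hypothetical illustrations of the ideas in the text.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 600, 20
X = rng.normal(size=(n, p))                    # high-dimensional covariates
treat = rng.choice([0, 1], n)
# Hypothetical truth: only patients with a high value of covariate 0 benefit
lin = -0.5 + treat * 1.2 * (X[:, 0] > 0.5)
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

def fisher_p(y_arm, t_arm):
    """p-value from Fisher's exact test comparing response rates between arms."""
    table = [[np.sum(y_arm[t_arm == 1]), np.sum(1 - y_arm[t_arm == 1])],
             [np.sum(y_arm[t_arm == 0]), np.sum(1 - y_arm[t_arm == 0])]]
    return stats.fisher_exact(table)[1]

alpha1, alpha2 = 0.025, 0.025                  # overall alpha = alpha1 + alpha2 = 0.05

# Test 1: overall treatment effect in all patients, at level alpha1
p_overall = fisher_p(y, treat)

# Build the signature on a training half: strength of each covariate's interaction with treatment
train = np.arange(n) < n // 2
coefs = []
for j in range(p):
    Zj = np.column_stack([treat[train], X[train, j], treat[train] * X[train, j]])
    coefs.append(LogisticRegression().fit(Zj, y[train]).coef_[0][2])
score = X[~train] @ np.array(coefs)            # signature score for each test-half patient

# Test 2: treatment effect within the predicted-sensitive subgroup (top half of scores), at level alpha2
sensitive = score > np.median(score)
p_subgroup = fisher_p(y[~train][sensitive], treat[~train][sensitive])

print("Overall test significant:", p_overall < alpha1)
print("Subgroup test significant:", p_subgroup < alpha2)
```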
In conclusion, ASDs are a novel methodology that can develop and validate predictive signatures in a single trial. They have the potential to increase the efficiency of clinical trials by finding the group of patients benefiting from particular treatments. However, when the clinical benefit for a subgroup is minimal, a large sample size might be required to detect it with sufficient power. Additionally, the performance of the designs deteriorates if there are many covariates that are not associated with patient benefit. To address this issue, an additional pre-filtering of the covariates might be required. This family of designs may also benefit from exploring different methods of interaction of treatment with high dimensional covariates(69), and from considering multiple trial endpoints(70). These considerations notwithstanding, ASDs offer a potential route to identifying patient subgroups that will benefit from treatment in IMIDs for which predictive biomarkers are currently lacking.
Composite responder endpoints and augmented analysis methods
Clinical trials specify primary and secondary outcomes that measure how patients respond to a treatment or intervention. The primary outcome should be chosen as a measurement that will be more favourable if the treatment being tested is efficacious or effective. As many IMIDs have complex manifestations and multiple symptoms, it can be difficult to specify a single measurement as the most important. For this reason, primary outcomes in IMID trials commonly combine multiple relevant measurements into a single composite outcome. A specific type of composite endpoint is a responder endpoint, which divides patients into responders and non-responders based on several measurements, or components. Some components may be binary, while others are defined by whether a continuous measurement crosses a threshold.
The standard method of analysis for composite responder endpoints is to treat them as binary variables (responder or non-responder). The analysis then estimates the proportion of patients who are responders and whether there is a significant difference between arms: this is done with a suitable binary method such as Fisher’s exact test or logistic regression, amongst many others.
Responder endpoints have the appealing property of summarising very complex information in a single, easy-to-interpret quantity. This is also a limitation when analysis methods treat the outcome as binary: much information is discarded, especially from continuous components that are dichotomised (see, e.g., (71,72)), which can lead to a reduction in power(73).
Assuming that the responder endpoint is clinically relevant, there are alternative ways of estimating the proportion of patients who are responders. For endpoints that define response based on a single continuous component, methods were proposed in the 1990s to more precisely estimate the proportion of responders(74,75). For composite responder endpoints that are a mixture of continuous and binary components, the augmented binary method has been proposed to provide higher efficiency. This was originally proposed for response criteria endpoints used in phase II oncology trials(76) but has since been extended to endpoints used in IMIDs such as rheumatoid arthritis(77) and systemic lupus erythematosus (SLE)(78). The method has also been extended to endpoints that are formed from the time until a composite event occurs(79) (e.g., time until relapse, where relapse involves a continuous biomarker being above a certain level), although further work in this area is needed.
The augmented binary method requires no additional data to be collected; it simply fits a more complex statistical model to the data collected on the different components and uses this model to estimate the difference between arms in the proportion of responders (together with a confidence interval and p-value). It has been shown in various papers(77,78,80,81) to provide large gains in efficiency, equivalent to increasing the sample size of the traditional binary analysis by 30% or more. The extent of the efficiency gain depends on how well the continuous component(s) distinguish between responders and non-responders(82).
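The source of this gain can be seen in the simpler setting of a single continuous component: modelling the underlying continuous measurement, rather than dichotomising it, uses all of the available information. The sketch below illustrates this with hypothetical values; the full augmented binary method extends the same idea by jointly modelling continuous and binary components.

```python
# A minimal sketch of why modelling the underlying continuous measurement is more
# efficient than dichotomising it, for a responder endpoint defined by a single
# continuous component. The threshold, effect size, and sample size are hypothetical,
# and the full augmented binary method handles mixed continuous and binary components.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_arm, threshold = 100, 0.0
control = rng.normal(loc=0.2, scale=1.0, size=n_per_arm)   # continuous component, control arm
active = rng.normal(loc=-0.3, scale=1.0, size=n_per_arm)   # lower values = better
# 'Response' means the continuous measurement falls below the threshold

# Binary analysis: dichotomise first, then compare observed proportions
resp_c, resp_a = np.mean(control < threshold), np.mean(active < threshold)

# Model-based analysis: fit each arm's normal distribution and derive the responder
# proportion from the fitted model, using all of the continuous information
p_c = stats.norm.cdf(threshold, loc=control.mean(), scale=control.std(ddof=1))
p_a = stats.norm.cdf(threshold, loc=active.mean(), scale=active.std(ddof=1))

print(f"Dichotomised difference in response: {resp_a - resp_c:.3f}")
print(f"Model-based difference in response:  {p_a - p_c:.3f}")
# Across repeated simulations the model-based estimate has a smaller standard error,
# which is the source of the sample-size savings reported for such methods.
```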
A previous review(83) found that composite responder outcomes are used in several IMIDs; examples are shown in Table 2.