On Small-Sample Inference in Multiphase Stepped Wedge Cluster Randomized Trial (MSW-CRT) With Binary Outcomes

Background: The stepped wedge cluster randomized trial (SW-CRT) design is now preferred for many health-related trials because of its flexibility in resource allocation and its advantages with respect to clinical ethics. However, multiphase stepped wedge designs (MSW-CRTs), a necessary extension for studying multiple interventions, have not been studied adequately. Since the intervention effect estimated from generalized estimating equations (GEE) has a population-average interpretation, valid GEE-based inference methods for binary outcomes are preferred by public health policy makers. Methods: We formulate hypothesis tests of the add-on effect of a second treatment based on GEE analysis of an MSW-CRT design with a limited number of clusters. Four variance-correction estimators are used to adjust the bias of the sandwich estimator. Simulation studies are used to compare the statistical power and type I error rate of these methods under different correlation structures. Results: We demonstrate that the average estimator with a t(I − 3) sampling distribution stably maintains the type I error rate close to the nominal level with limited sample sizes in our settings. We show that the power of testing the add-on effect depends on the baseline event rate, the effect sizes of the two interventions and the number of clusters. Moreover, a power benefit can be achieved by changing the design to include more sequences. Conclusions: For designing MSW-CRTs, we suggest using more sequences and checking the event rate after initiating the first intervention via interim analysis. When the number of clusters in an MSW-CRT is not very large, inference can be conducted using GEE analysis with the average estimator and a t(I − 3) sampling distribution.


Background
First applied in a hepatitis B virus (HBV) study (Hall et al., 1987), stepped wedge cluster randomized trials (SW-CRTs) have gained much popularity from both resource-allocation and clinical ethics perspectives. Since SW-CRTs allow the intervention to be rolled out to two or more communities in stages, they are especially useful under labor and financial constraints (Cook and Campbell, 1979). Additionally, participating communities in SW-CRTs keep receiving the intervention once they cross over from the placebo group to the intervention group; eventually, all participants receive the intervention. Therefore, the SW-CRT is more clinically ethical than the parallel design or the crossover design when the intervention is believed to be beneficial (Brown and Lilford, 2006; Hemming et al., 2015a). The traditional SW-CRT is a cluster randomized trial (CRT) that measures participant-level outcomes (observation level) within units such as hospitals or schools (cluster level). In the conventional SW-CRT design, the clusters are randomly assigned to initiate the intervention at pre-determined time points (Figure 1A). Moreover, clusters can be grouped into sequences for randomization and for the transition from placebo to intervention. Comparison between sequences can help ameliorate the impact of contamination between clusters (Hemming et al., 2015b). Traditional SW-CRTs use a single-phase design that considers only one intervention during the study. However, different interventions may be combined to treat participants (Pears et al., 2016; Salmon et al., 2011), so the MSW-CRT design becomes a necessary extension for studying effects of multiple interventions. Lyons et al. (2017) introduced the MSW-CRT and pointed out that this design can reduce study time and cost, increase engagement among participants and improve efficiency.
Moreover, it allows assessment of possible interaction effects between multiple treatments and requires fewer clusters than running two separate trials. Figure 1B shows the simplest case of an MSW-CRT: treatment B is sequentially implemented in each sequence after treatment A. This design therefore allows estimation of the add-on effect of the second treatment (treatment B).
The inference of continuous outcomes in MSW-CRTs has been completed by our group and is under review; in that work, closed-form sample size and power formulas are derived based on the linear mixed model. The relationships between design parameters and statistical power shown in that study provide general guidance for MSW-CRT study design. Although we have investigated continuous outcomes, inference methods for binary outcomes in MSW-CRTs remain less well developed. Binary outcomes, however, are very common in health-related trials and differ from continuous outcomes in both modeling and analysis.
Since responses observed from the same cluster are usually correlated, two types of statistical models are commonly used for analysis: conditional and marginal models. Though they are distinct in other respects, such as model assumptions, one important difference is the interpretation of parameters (Preisser et al., 2003). For SW-CRTs, treatment effects based on the conditional model are interpreted as cluster-specific changes, while treatment effects estimated from the marginal model are interpreted as population-average changes. Generalized linear mixed models are often used for inference on the conditional effects (Hussey and Hughes, 2007; Hemming et al., 2015; Hooper et al., 2016; Woertman et al., 2013). However, in population health sciences research, the population-level interpretation from marginal models may be preferred for making health policies (Li et al., 2018). To evaluate the statistical power of testing treatment effects in SW-CRT designs, a marginal model is commonly assumed with a constant exchangeable intraclass correlation (ICC) and a logit link for binary responses (Crespi et al., 2009). GEE is then used to fit this model and make inferences (Li et al., 2018; Li, 2020; Scott et al., 2017; Ford and Westgate, 2020).
Though some SW-CRTs have been analyzed using marginal models with binary outcomes (Li et al., 2018; Ford and Westgate, 2020; Li, 2020; Thompson et al., 2020), binary responses from MSW-CRTs have not been studied. Therefore, the focus of this paper is to demonstrate appropriate inference methods for the analysis of binary subject-level outcomes arising from MSW-CRTs. In the following three sections, we introduce the GEE analysis for MSW-CRT designs with binary outcomes and conduct simulations to assess the performance of inference methods using the Mancl and DeRouen (2001) (MD) estimator, the Kauermann and Carroll (2001) (KC) estimator, the average of the MD and KC estimators (Ford and Westgate, 2017) and the Fay and Graubard (2001) (FG) estimator. The Simulation study section describes our simulation settings, hypotheses, and planned analyses.

GEE analyses of MSW-CRT
In this study, we consider a cross-sectional stepped wedge design with a placebo phase and two consecutive intervention phases, A and B. For example, Figure 1B shows an MSW-CRT with 7 periods and 3 sequences, where each cross section defines one period. Each sequence consists of a group of clusters receiving the same treatment in every period. In this example, we have two interventions, A and B, while 0 refers to placebo or control, and intervention B is initiated only after all the clusters receive intervention A. We denote the binary outcome of interest by Y_ijk, representing the measurement for the kth individual in the ith cluster at the jth period. The marginal means μ_ij = E(Y_ijk) can be modeled as:

log(μ_ij / (1 − μ_ij)) = μ + β_j + A_ij θ_1 + B_ij θ_2,  i = 1, 2, …, I; j = 1, 2, …, J; k = 1, 2, …, K,  (1)

where A_ij and B_ij are the indicators of receiving intervention A or B, respectively (A_ij = 1 if cluster i receives treatment A at period j, and A_ij = 0 otherwise; B_ij is defined analogously); and μ is the log odds of the baseline event rate. Therefore, θ_1 is the treatment effect of A; θ_2 is the add-on effect of treatment B; and β_j is the period effect (β_1 = 0 for identifiability). Considering that a multiphase design usually arises when the first intervention has been shown to be beneficial, our model is designed to answer a specific question: is adding the second intervention B more beneficial than using the first intervention A alone? Therefore, we did not include an interaction term between the two interventions in our model. We denote the parameter vector in the mean model (1) as β = (μ, β_2, …, β_J, θ_1, θ_2)′. For cluster i, the vector of outcomes is Y_i = (Y_i11, Y_i12, …, Y_iJK)′, and the corresponding JK × 1 marginal mean vector is μ_i = (μ_i1, μ_i1, …, μ_iJ)′, with each μ_ij repeated K times. The JK × (J + 2) design matrix X_i can be written based on model (1) and the pre-defined design.
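To make the structure of the design matrix concrete, the following Python sketch builds the rows for one cluster under model (1). The treatment schedule below is a hypothetical 3-sequence, 7-period rollout consistent with the verbal description of Figure 1B (B initiated only after all clusters receive A); the exact periods used in the paper's figure are an assumption here.

```python
import numpy as np

def design_row(j, a, b, J):
    """Row of the design matrix for model (1): intercept (log odds of the
    baseline rate), period indicators for beta_2..beta_J (beta_1 = 0 for
    identifiability), then the treatment indicators A_ij and B_ij."""
    row = np.zeros(J + 2)
    row[0] = 1.0
    if j > 1:
        row[j - 1] = 1.0        # period indicator for beta_j
    row[J] = a                  # A_ij: cluster on treatment A at period j
    row[J + 1] = b              # B_ij: cluster on treatment B at period j
    return row

# Hypothetical schedule: 0 = control, 1 = A only, 2 = A + B (add-on).
schedule = [[0, 1, 1, 1, 2, 2, 2],   # sequence 1
            [0, 0, 1, 1, 1, 2, 2],   # sequence 2
            [0, 0, 0, 1, 1, 1, 2]]   # sequence 3

J = 7
# One row per period for a cluster in sequence 1; in the full design matrix
# each row would be repeated K times, once per individual.
X_seq1 = np.vstack([design_row(j + 1, int(s >= 1), int(s == 2), J)
                    for j, s in enumerate(schedule[0])])
print(X_seq1.shape)  # (7, 9): J periods by J + 2 parameters
```

The full X_i for a cluster stacks these rows K times per period, matching the ordering of Y_i.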
We can then write model (1) as logit(μ_i) = X_i β. The marginal variance in our case is Var(Y_ijk) = μ_ij(1 − μ_ij), and we can obtain β̂ by solving

Σ_{i=1}^{I} D_i′ V_i^{−1} (Y_i − μ_i) = 0,  (2)

where D_i = ∂μ_i/∂β, and V_i = W_i^{1/2} R(ρ) W_i^{1/2} is the working covariance matrix for Y_i, with W_i the JK × JK diagonal matrix with elements Var(Y_ijk) (setting the dispersion parameter to 1) and R(ρ) the working correlation matrix. Similar to the Hussey and Hughes model for continuous outcomes (Hussey and Hughes, 2007), we assume a common exchangeable structure with a uniform correlation parameter ρ. We consider only cross-sectional sampling here, which makes the constant correlation assumption reasonable (Li et al., 2021). We note that the constant correlation is restrictive, and the use of multiple correlation parameters has been suggested for longitudinal CRT designs (Li, 2020; Hemming et al., 2020). For example, a within-period correlation ρ_0 and an inter-period correlation ρ_1 can be included in sample size calculation. Specifically, ρ_0 quantifies the similarity between individuals in the same cluster at the same period, while ρ_1 represents the similarity between individuals in the same cluster across different periods. Therefore, we study the impact of different correlation structures in the following sections.
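The working covariance construction V_i = W_i^{1/2} R(ρ) W_i^{1/2} can be sketched in a few lines of Python (an illustration, not the paper's code; function names are ours):

```python
import numpy as np

def exchangeable_corr(n, rho):
    """Exchangeable working correlation: 1 on the diagonal, rho elsewhere."""
    return np.full((n, n), rho) + (1.0 - rho) * np.eye(n)

def working_cov(mu, rho):
    """V_i = W_i^{1/2} R(rho) W_i^{1/2}, where W_i = diag(mu * (1 - mu))
    holds the binary marginal variances (dispersion parameter set to 1)."""
    w_half = np.sqrt(mu * (1.0 - mu))
    return np.outer(w_half, w_half) * exchangeable_corr(len(mu), rho)

mu = np.array([0.2, 0.2, 0.3, 0.3])   # marginal means for one small cluster
V = working_cov(mu, rho=0.05)
# The diagonal recovers the marginal variances mu * (1 - mu).
print(np.allclose(np.diag(V), mu * (1 - mu)))  # True
```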
Since the variance function is related to the mean for binary responses, an analytical solution of equation (2) cannot be obtained, and iteratively reweighted least squares (IRLS) is used to obtain a numerical solution. When the number of clusters is large enough, the estimator β̂ can be assumed to be multivariate normally distributed with mean β. The variance-covariance matrix of β̂ can be estimated using the model-based variance

Σ_mb = (Σ_{i=1}^{I} D_i′ V_i^{−1} D_i)^{−1},  (3)

or the empirical sandwich estimator

Σ_sw = Σ_mb (Σ_{i=1}^{I} D_i′ V_i^{−1} e_i e_i′ V_i^{−1} D_i) Σ_mb,  (4)

where e_i = Y_i − μ̂_i is the residual vector of the ith cluster. The consistency of the model-based variance estimator relies on correct specification of the working correlation structure, so the robust sandwich variance estimator is preferred. However, several studies have shown that when the number of clusters is limited, the fitted mean μ̂_i tends to be close to Y_i, leading to negatively biased sandwich estimators due to the small residuals (Mancl and DeRouen, 2001; Lu et al., 2007; Murray et al., 2004). To correct this bias, several correction methods have been proposed for valid inference. In this paper, we consider the MD, KC and FG estimators, together with the average of the MD and KC estimators, which has been shown to achieve competitive performance relative to the three individual estimators (Ford and Westgate, 2017). For the sampling distribution of the Wald test statistic, we use either the standard normal distribution or a t distribution with I − 3 degrees of freedom, the number of clusters minus the number of cluster-level regression parameters. Tests with I − 3 degrees of freedom have shown good performance in parallel CRT settings (Ford and Westgate, 2017; Li and Redden, 2015), while using I − p degrees of freedom, with p the total number of regression parameters, leads to conservative tests (Ford and Westgate, 2020).
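A minimal sketch of the sandwich estimator and the MD correction follows, assuming the cluster-level quantities D_i, V_i and residuals e_i have already been computed (the KC correction instead inflates residuals by (I − H_i)^{−1/2}, and the average estimator averages the MD and KC matrices; we show only MD here for brevity):

```python
import numpy as np

def sandwich_variance(D_list, V_list, e_list, md_correct=False):
    """Empirical sandwich variance of beta-hat. With md_correct=True, each
    cluster residual is inflated by (I - H_i)^{-1}, where H_i is the cluster
    leverage matrix D_i M^{-1} D_i' V_i^{-1} -- the Mancl-DeRouen correction."""
    p = D_list[0].shape[1]
    bread = sum(D.T @ np.linalg.solve(V, D) for D, V in zip(D_list, V_list))
    bread_inv = np.linalg.inv(bread)   # this is also the model-based variance
    meat = np.zeros((p, p))
    for D, V, e in zip(D_list, V_list, e_list):
        if md_correct:
            H = D @ bread_inv @ D.T @ np.linalg.inv(V)
            e = np.linalg.solve(np.eye(len(e)) - H, e)
        u = D.T @ np.linalg.solve(V, e)
        meat += np.outer(u, u)
    return bread_inv @ meat @ bread_inv

# Toy check with 8 clusters, identity working covariance and random inputs:
rng = np.random.default_rng(1)
D_list = [rng.standard_normal((6, 3)) for _ in range(8)]
V_list = [np.eye(6) for _ in range(8)]
e_list = [rng.standard_normal(6) for _ in range(8)]
plain = sandwich_variance(D_list, V_list, e_list)
md = sandwich_variance(D_list, V_list, e_list, md_correct=True)
print(np.allclose(md, md.T))  # the corrected estimator stays symmetric PSD
```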

Simulation study
In this section, we describe the simulation studies under two correlation structures. In section 1.1, datasets are simulated by specifying the correlation structure for the outcomes as common exchangeable; we also introduce the study questions, simulation settings (section 1.1.1), data generating process (section 1.1.2) and analysis method (section 1.1.3). In section 1.2, we simulate data under the common block exchangeable correlation structure with two correlation parameters and describe the differences in settings; the study questions, data generation and analysis methods are the same as those in section 1.1.

Inference with a constant correlation parameter
In this section, the correlation structure is set to be exchangeable with a uniform parameter ρ. We define the effect size as 1 − exp(θ), which can be interpreted as the extent to which the intervention decreases the odds of events. For instance, exp(θ_1) = 0.75 means that treatment A decreases the odds of death by 25% relative to the control group. We propose simulations to answer the following study questions.
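A short worked example of this odds-ratio interpretation (the baseline rate of 0.30 below is an arbitrary illustration, not a value from our settings):

```python
import math

theta1 = math.log(0.75)          # treatment effect of A on the log-odds scale
odds_ratio = math.exp(theta1)    # 0.75: odds under A are 75% of control odds
effect_size = 1 - odds_ratio     # 0.25: A decreases the odds of events by 25%

# On the probability scale the change depends on the baseline event rate:
p0 = 0.30                                 # illustrative baseline event rate
odds1 = (p0 / (1 - p0)) * odds_ratio      # odds under treatment A
p1 = odds1 / (1 + odds1)                  # event rate under treatment A
print(round(effect_size, 2), round(p1, 3))   # effect size 0.25, rate ~0.243
```

Note that the same odds ratio translates into different absolute risk reductions at different baseline rates, which is why the baseline event rate matters for power.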

Simulation settings
To study type I error, the parameters in six different scenarios are set as shown in Table 1. For simplicity, we set the number of clusters to be the same in each sequence, and all clusters in the same sequence receive the same intervention(s) in every period. In each scenario, one parameter is varied to study its impact on the empirical test size when testing the null add-on treatment effect. In this way, we can see how the type I error changes with each parameter. Similarly, six scenarios are shown in Table S1 (see the Appendix) to study the empirical power of testing the add-on effect. All parameter settings in Table S1 are the same as in Table 1 except the effect size for intervention B: we use a null effect size for intervention B in the scenarios in Table 1 but nonzero effect sizes in Table S1.

Data generation
For data simulation, the cluster mean at each period is specified by model (1) with β_j = 0, because we do not focus on exploring the impact of the period effect. We simulate correlated individual-level binary outcomes within each cluster using the EP method (Emrich and Piedmonte, 1991), which can be easily implemented with the R package "mvtBinaryEP" (By and Qaqish, 2011). The algorithm first simulates a multivariate normal variable and then dichotomizes the coordinates to obtain the binary outcomes. We note that since the correlation matrix for simulating the multivariate normal data is obtained by solving equations involving the marginal means and the correlation parameter(s), the original R program may fail if the resulting correlation matrix is not positive semidefinite. In that case, we replace it by the nearest positive definite matrix using the algorithm proposed by Higham (2002).
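The dichotomization step can be sketched as follows. This is a simplified illustration: we use the latent-normal correlation directly, whereas the EP method first solves for the latent correlation that yields a target binary correlation, so the realized binary correlation here only approximates the latent one.

```python
import numpy as np
from statistics import NormalDist

def sim_cluster_binary(mu, rho_latent, rng):
    """Correlated binary outcomes for one cluster: draw an exchangeable
    multivariate normal and dichotomize each coordinate at the mu-quantile,
    so each outcome is Bernoulli(mu_j) marginally."""
    n = len(mu)
    R = np.full((n, n), rho_latent) + (1 - rho_latent) * np.eye(n)
    z = np.linalg.cholesky(R) @ rng.standard_normal(n)
    thresh = np.array([NormalDist().inv_cdf(m) for m in mu])
    return (z < thresh).astype(int)

rng = np.random.default_rng(2024)
draws = np.array([sim_cluster_binary(np.full(20, 0.3), 0.05, rng)
                  for _ in range(2000)])
print(abs(draws.mean() - 0.3) < 0.02)  # marginal event rate recovers mu
```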

Analysis
Considering that MSW-CRTs are most useful as add-on trials, we focus on testing the add-on effect of the second treatment B in this study. Therefore, we evaluate test sizes for the following hypotheses: H_0: θ_2 = 0 vs. H_a: θ_2 ≠ 0.
For each dataset simulated under the 12 different settings in Tables 1 and S1, model (1) is used for GEE analysis. We compare Wald tests based on the four variance-correction estimators: MD (Mancl and DeRouen, 2001), KC (Kauermann and Carroll, 2001), FG (Fay and Graubard, 2001) and the average of MD and KC (Ford and Westgate, 2017). The variance of θ̂_2 used in the test statistic is the (J + 2, J + 2) element of the estimated variance-covariance matrix of β̂.
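The Wald test with the t(I − 3) reference distribution can be sketched as below. The helper `t_sf` computes the t tail probability by numerically integrating the t density so the sketch needs no external statistics library; the input values in the usage example are made up for illustration.

```python
import math
from statistics import NormalDist

def t_sf(x, df, grid=100000, upper=60.0):
    """P(T > x) for a t distribution with df degrees of freedom, via
    trapezoidal integration of the t density (adequate for p-values here)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    f = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = (upper - x) / grid
    s = 0.5 * (f(x) + f(upper)) + sum(f(x + i * h) for i in range(1, grid))
    return s * h

def wald_p_value(theta2_hat, var_hat, I):
    """Two-sided p-value for H0: theta2 = 0 using t(I - 3), where var_hat is
    the (J + 2, J + 2) element of the chosen bias-corrected variance matrix."""
    t_stat = theta2_hat / math.sqrt(var_hat)
    return 2 * t_sf(abs(t_stat), I - 3)

# Illustrative values: estimated add-on effect -0.35, variance 0.03, I = 12.
p_t = wald_p_value(-0.35, 0.03, I=12)
p_z = 2 * (1 - NormalDist().cdf(abs(-0.35) / math.sqrt(0.03)))
print(p_t > p_z)  # the t(I - 3) reference is more conservative than the normal
```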

Inference with two correlation parameters
Although a constant ICC can be used under restrictive assumptions, it has been argued that the within-period and inter-period correlations should be treated as two distinct types of intraclass correlations, and a common block exchangeable correlation structure is preferred in GEE analysis (Martin et al., 2016; Li et al., 2018; Ford and Westgate, 2020). Thus, to study the impact of assuming a different correlation structure on design and analysis, we simulate data under the common block exchangeable correlation structure. In this case, R is characterized by two parameters: the within-period correlation ρ_0, quantifying the similarity between responses from different individuals in the same cluster in the same period, and the inter-period correlation ρ_1, quantifying the similarity between responses from different individuals in the same cluster in different periods. In other words, Corr(Y_ijk, Y_ijk′) = ρ_0 and Corr(Y_ijk, Y_ij′k′) = ρ_1 for j ≠ j′. Notably, previous parallel CRTs reported small values of ρ_0 (Murray et al., 1998), and the inter-period correlation ρ_1 is usually smaller than ρ_0 (Martin et al., 2016).
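The common block exchangeable matrix for one cluster can be assembled with a Kronecker construction (an illustrative sketch; the ordering is period-major to match Y_i):

```python
import numpy as np

def block_exchangeable(J, K, rho0, rho1):
    """Common block exchangeable correlation for a cluster with J periods and
    K individuals per period: 1 on the diagonal, rho0 for two individuals in
    the same period, rho1 for two individuals in different periods."""
    within = np.full((K, K), rho0) + (1 - rho0) * np.eye(K)
    between = np.ones((J, J)) - np.eye(J)
    return np.kron(np.eye(J), within) + rho1 * np.kron(between, np.ones((K, K)))

R = block_exchangeable(J=3, K=2, rho0=0.1, rho1=0.05)
print(R[0, 1], R[0, 2])  # rho0 within a period, rho1 across periods
```

Setting rho0 = rho1 recovers the constant exchangeable structure of the previous section.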
Twelve simulation settings are proposed to study type I error and empirical power when the working correlation is incorrectly specified as common exchangeable with a uniform parameter.
Tables 2 and S2 (see the Appendix) show the simulation settings, in which we replace the single correlation parameter with the pair (ρ_0, ρ_1) but keep the other configurations the same as in Tables 1 and S1, respectively. The specific (ρ_0, ρ_1) combinations are listed in Table S3 in the Appendix. For computing the theoretical power under these settings, we still use equation (5), but the correlation matrix used in the computation should be common block exchangeable with the two parameters (ρ_0, ρ_1).
Results

Inference with a constant correlation parameter

Figure 3 shows the empirical type I error rates for the scenarios in Table 1 using the true working correlation structure; the gray band in each plot represents the 95% confidence interval around the nominal type I error rate of 0.05. Figure 4A-F shows the empirical power, together with the theoretical power from equation (5), for testing a non-zero treatment effect of B using 5,000 simulated datasets. The Wald tests are not included in the power comparison due to their inflated type I error rates (Figure 3). We observe that when the working correlation is correctly specified, KC_t is closest to the theoretical value. The other tests (MD_t, FG_t, Average_t) are only slightly less powerful, indicating no significant power loss when we use conservative tests under these settings. Figure 4 presents the empirical power of the eight tests for the scenarios in Table S1, together with the predicted power calculated from equation (5).
Overall, we observe that greater power can be achieved when the number of clusters (Figure 4A), baseline event rate (Figure 4D) or effect size (Figure 4F) increases. Figure 4E shows the power comparison across the different designs shown in Figure 2: adding more sequences results in greater power, which can be seen from the increase in power from design A to B to C (and likewise from D to E to F). On the other hand, extra measurements with more periods at the end of the study do not show much power benefit; there is little change in power comparing designs A and D, B and E, or C and F. In contrast, the variability of cluster size does not have an obvious impact on power or type I error rate (Figure 4B), and variation of the intraclass correlation does not change power significantly (Figure 4C).

Inference with two correlation parameters
When we misspecify the working correlation structure in the analysis, we observe patterns (Figure 5B) generally similar to those in Figure 5A with the correct working correlation structure. The sandwich estimator (SW) is robust to a misspecified working correlation structure but is still biased because of the small sample size, and the bias-correction variance estimators behave similarly to the correctly specified case. Results for the empirical type I error rate and power using the eight different tests are generally consistent with those from the constant ICC model (see Appendix Figures S1 and S2). As an example, we show the type I error and empirical power results for scenario 1 in Figures 6A and 6B. The pattern across sample sizes in each figure is the same as in Figure 3A or Figure 4A. In addition, power is slightly reduced under this misspecification, as seen by comparing Figure 6B with Figure 4A, and the differences in power among the 6 bias-correction tests are smaller in Figure 6B than in Figure 4A. Additionally, the magnitude of the correlation parameters has a substantial impact on power: increasing the within-period correlation reduces power, while increasing the inter-period correlation increases power (see Appendix Figure S2). Figure 6A shows the empirical type I error rate in scenario 1 using the misspecified working correlation (common block exchangeable with two parameters) in Table 2, and Figure 6B shows the corresponding empirical power for scenario 1 in Table S2.

Discussion
In this study, we build a marginal model for binary outcomes in MSW-CRT designs. We compare the performance of inference methods based on GEE and five variance estimators with t or normal sampling distributions, which can serve as guidance for the implementation of MSW-CRT designs. We argue that this is especially helpful when investigators are interested in testing the add-on effect of a second intervention after a first beneficial treatment. From our simulations, when the inter-period and within-period correlations differ, GEE analysis with an incorrectly specified correlation structure leads to incorrect inference when using the Wald test with the model-based variance or the sandwich estimator. Therefore, we suggest using bias-correction inference methods when the number of clusters is limited (up to 60 in our settings). We found that the average of the MD and KC estimators (Ford and Westgate, 2017) with a t(I − 3) sampling distribution can adjust the bias and maintain the type I error rate close to the nominal level. The Kauermann and Carroll (2001) estimator with t(I − 3) also performs well in controlling the type I error rate when the cohort sizes of the clusters have little variation (coefficient of variation < 0.3). This conclusion is consistent with the results in Ford and Westgate (2017). We also find that the sandwich estimator is usually smaller than the MD estimator, which is close to the FG estimator but larger than the KC estimator. Moreover, the type I error rates produced by the Average_t test are stable and robust to different parameter values and to working correlation structure misspecification. For testing the second intervention effect, increasing the number of clusters, baseline event rate or effect size increases the empirical power and thus reduces the required sample size.
Since the effect size of the first intervention determines the event rate before the initiation of the second intervention, power and required sample size for the second intervention depend on the effect size of the first treatment.
Another finding of this study is that when the working correlation structure is misspecified, we lose power and more data are needed to test the effectiveness of the intervention. Although we do not observe a major impact of changing the correlation magnitude when the inter-period and within-period correlations are equal, power decreases as the difference ρ_0 − ρ_1 becomes large. Our simulations show that the four tests (MD_t, FG_t, KC_t and Average_t) perform quite differently in terms of power as the (ρ_0, ρ_1) combination changes. We found that when the difference ρ_0 − ρ_1 is largest, i.e., when we are furthest from the constant correlation assumption, these four tests are much less conservative and closer to the theoretical power values; conversely, they lose more power as the difference between the two correlation parameters decreases.
In our study, we use the 6 designs shown in Figure 2 to investigate the impact of using more sequences when the total number of clusters is fixed. Since within-period comparisons may help gain statistical power, more sequences may increase power. From our simulations, we conclude that when implementation resources are sufficient, we can gain power by using more sequences, even though we must then initiate the second intervention before all the clusters receive the first intervention. A concern with this design is that the cost of more sequences in a resource-limited community-based study is high. For example, comparing designs D and F in Figure 2, design F simultaneously implements two different interventions in different sequences (periods 3, 4 and 5), while in design D the interventions are initiated in different periods for each sequence. In practice, this usually means higher travel costs and more implementation resources in the field, which could be infeasible and delay trial progress.

Conclusion
Based on the above discussion, when the number of clusters in an MSW-CRT is not very large, inference can be conducted using GEE analysis with the average of the MD and KC estimators (Ford and Westgate, 2017) and a t(I − 3) sampling distribution. When the cohort size does not vary much across clusters, the Kauermann and Carroll (2001) estimator with t(I − 3) can also be used for inference. For designing an MSW-CRT, we suggest using more sequences when implementation resources are sufficient, which means we may initiate the second intervention before all participants receive the first intervention. Moreover, an interim analysis is recommended to check the event rate after initiating the first intervention, because the add-on effect of the second intervention will be hard to detect if the event rate is too low. Future work on MSW-CRTs includes exploring the combined effect of different parameters and design aspects on power, and the effect of incorporating decaying correlations. Even though the simulation studies are limited compared with the range of scenarios in practice, this work provides basic guidance for MSW-CRT designs, which can save recruiting cost and time compared with implementing separate SW-CRT designs.