Covariate-constrained randomization in cluster randomized 2x2 factorial trials: Application to a diabetes prevention study

Background: Cluster randomized trials (CRTs) are randomized trials where randomization takes place at an administrative level (e.g., hospitals, clinics, or schools) rather than at the individual level. When the number of available clusters is small, researchers may not be able to rely on simple randomization to achieve balance on cluster-level covariates across treatment conditions. If these cluster-level covariates are predictive of the outcome, covariate imbalance may distort treatment effects, threaten internal validity, lead to a loss of power, and increase the variability of treatment effects. Covariate-constrained randomization (CR) is a randomization strategy designed to reduce the risk of imbalance in cluster-level covariates when performing a CRT. Existing methods for CR have been developed and evaluated for two- and multi-arm CRTs but not for factorial CRTs. Methods: Motivated by the BEGIN study—a CRT for weight loss among patients with pre-diabetes—we develop methods for performing CR in 2x2 factorial cluster randomized trials. We apply our methods to the BEGIN study and use simulation to assess the performance of CR versus simple randomization for estimating treatment effects by varying the number of clusters, the degree to which clusters are associated with the outcome, the distribution of cluster level covariates, and analysis strategies. Results: Compared to simple randomization of clusters, CR in the factorial setting is effective at achieving balance across cluster-level covariates between treatment conditions and provides more precise inferences. When cluster-level covariates are included in the analyses model, CR also results in greater power to detect treatment effects, but power is low compared to unadjusted analyses when the number of clusters is small. Conclusions: CR should be used instead of simple randomization when performing factorial CRTs to avoid highly imbalanced designs and to obtain more precise inferences. Except when there are a small number of clusters, cluster-level covariates should be included in the analysis model to increase power and maintain coverage and Type 1 error rates at their nominal levels.


Background
Cluster randomized trials (CRTs) are randomized controlled trials where randomization takes place at an administrative level (e.g., hospitals, clinics, or schools) rather than at the individual level.CRTs are an attractive research design when there are concerns of treatment contamination among participants, when it is logistically easier to conduct the trial by randomizing at the cluster level, and when the intervention of interest is delivered at the cluster level [1].
A major practical limitation when conducting CRTs is the ability to enroll a large number of clusters.When the number of available clusters is small, researchers may not be able to rely on simple randomization to achieve balance on cluster-level covariates across treatment conditions [2].If these cluster-level covariates are predictive of the outcome, covariate im-balance across treatment conditions may distort treatment effects, threaten internal validity, lead to a loss of power, increase the variability of the treatment effect, and usually requires statistical adjustment in the analysis stage [3].For example, in a two-arm CRT where clinics are randomized to treatment conditions and where the size of a clinic is related to the outcome of interest, researchers would want equal numbers of small and large clinics in the treatment and control conditions, respectively.
Factorial experiments are an efficient approach to determine which of several possible components of a proposed intervention have effects of practical significance [4].When implementing factorial experiments at the cluster level, the challenges involved in balancing cluster-level covariates across arms is magnified because there are more than two treatment conditions.For example, in a 2x2 factorial CRT, clusters will be randomized to one of 4 treatment conditions.
One approach to address imbalance in prognostic cluster-level covariates across treatment conditions is to include these covariates in the analysis model which can help ensure an unbiased estimate of the treatment effect.The drawback to including cluster-level covariates in the analysis model is the subsequent loss of degrees of freedom that are available to estimate treatment effects.This resulting loss of power can be substantial when there are a small number of clusters [5].An alternative to model-based covariate adjustment is to control for potential confounders at the design stage, by balancing the distribution of select measured characteristics across treatment arms.This can help ensure more precise treatment effects as well as confidence that observed treatment effects are not due to imbalance in prognostic covariates while at the same time avoiding the resulting loss of power due to covariate adjustment.
Individually randomized trials often rely on stratification to achieve balance on prognostic factors across treatment conditions.In CRTs with a small number of clusters, stratifying on more than one variable can be challenging because of an insufficient number of clusters to distribute among strata.This phenomenon is only exacerbated in factorial trials where there are at least 4 treatment conditions.For example, with two binary stratification variables there will be total of four strata.To conduct a 2x2 factorial CRT would require at least 4 clusters per stratum (16 clusters total) to avoid unequal allocation of treatments within strata [3].Furthermore, stratifying on a continuous factor requires converting it to a categorical variable, a process that can result in a loss of information.
Covariate-constrained randomization (CR) is an alternative procedure for achieving balance across treatment conditions on a set of pre-specified cluster-level covariates.Unlike individual level trials where participants are recruited sequentially, the participating units in a CRT are generally assembled at the start of the study so that cluster-level covariate values such as geographic location, clinic size, and the income level of patients are available at the design stage.
The first step in CR is to identify those cluster-level covariates that are predictive of the outcome on which one wishes to achieve balance.Using the terminology of Li et al. [6], we refer to these covariates as "potential confounders" because they are cluster-level prognostic factors that, when imbalanced, could distort estimates of treatment effects.
The second step in CR is-for every possible randomization scheme (or a random subset of schemes when the number of clusters is large)-to calculate a balance score that measures the difference in the distribution of cluster-level covariates across treatment conditions [3,7].
Next, a subset of schemes is chosen that meet some pre-specified balance criteria, such the 10% of schemes with the best balance scores.Finally, an allocation is randomly selected among those schemes that meet the pre-specified criteria and is used to randomize clusters.CR tends to produce better balance on average across treatment conditions as compared to simple randomization in which a randomization scheme is selected from all possible schemes with equal probability assigned to each scheme.Compared with stratification, CR may be preferred due to its capacity to accommodate multiple covariates, both categorical and continuous [8].
There are numerous variations of CR that use different balance metrics and different analysis strategies.In the two-arm setting, Raab and Butcher [7] and Li et al. [6] consider weighted and unweighted pairwise balance scores based on the difference in covariate means between arms.In the multi-arm setting, Zhou et al. [9] extend the pairwise balance score method, while Watson et al. [10] present a balance metric based on the sum of cluster-level mean differences.Ciolino et al. [11] calculate a Kruskal-Wallis test for each covariate across arms and assesses balance based on the p-values of these tests where a minimum p-value greater than 0.30 was found to appropriately identify acceptable balance.
Li et al. [6], Watson et al. [10], and Zhou et al. [9] also recommend adjustment for potential confounders in the analyses stage to maintain Type 1 error and provide adequate power.The most common approach for the analysis of CRTs is mixed-effects regression modeling with random cluster-level effects to account for within-cluster correlation.Mixed-effects models are sufficiently flexible to allow for adjustment of both cluster-level and participant-level covariates.Random effects at the participant level can be included if, for example, the study has repeated observations on the same individuals.
Existing work on CR has focused on two-or three-arm CRTs.The performance of CR in a factorial setting-where the minimum number of randomization conditions is 4has not been explored.Unlike in multi-arm trials, in CRT factorial designs, clusters are "recycled" [4] when estimating treatment effects so that every cluster is used in every estimate of a treatment effect.Whether CR operates differently in this setting is an area that requires further investigation.

Motivating Example
Our methods are motivated by the Behavioral Nudges for Diabetes Prevention (BEGIN) study [12], a 2x2 factorial CRT studying two pragmatic behavioral interventions that prompt In this manuscript, motivated by the BEGIN study, we extend and evaluate CR methods for multi-arm trials [6,7,10] to the 2x2 factorial CRT setting.The outline for the rest of this paper is as follows.In Section 2, we present methods for CR in the setting of a 2x2 factorial CRT and describe a simulation study to assess the performance of our methods as compared to simple randomization of clusters.In Section 3, we present the results of our simulation study and apply our methods to the BEGIN study.Section 4 provides discussion and areas of future work.We conclude in Section 5.

Methods
As mentioned above, once a set of potential cluster-level confounders are identified, the next step in performing CR is to calculate a balance metric to measure the difference in the distribution of these cluster-level covariates across treatment conditions for all possible randomization schemes.In this section we describe a balance metric for factorial trials that extends the balance metrics of Li et al. [6], Raab and Butcher [7] and Watson et al. [10].
Let J be the number of clusters and T be the number of treatment conditions so that n T = J T clusters are randomized to each treatment condition.Let x jk be the value of the k th covariate (k = 1, . . ., K) in cluster j (j = 1, . . ., J), and xtk = 1 n T j∈t x jk the mean value of the k th covariate in clusters assigned to condition t, (t = 1, . . ., T ).Finally xk = 1 J J j=1 xjk is the overall mean of covariate k across all clusters.Our balance metric is: where w k is a predetermined weight for the kth covariate.Following Raab and Butcher [7] and Li et al. [6], we set w k as the inverse of the variance of the kth covariate across all clusters.That is The metric in ( 1) and ( 2) describe the balance score introduced by Watson et al. [10] for use in multi-arm trials.A limitation to this metric is that balance is purely defined by covariate values and does not take into account clinical importance.For example, in the BEGIN study, if clinic volume is considered to be a stronger predictor of weight loss than percent of female visits, we may want to give clinic volume greater weight in the balance metric so that smaller balance scores using the weighted metric will reflect better balance on clinic volume at the expense of less balance on clinic percent female.To incorporate weights into the balance metric in (1) we use the approach of Yu et al. [8] to produce the weighted balance metric: where d k is a user-defined weight for the kth covariate.When d k = 1 for all covariates, then (3) reduces to the balance metric in (1).When researchers consider certain variables to be more predictive of the outcome than others or for which there is greater variability across clusters, a user-defined weight d k > 1 could be assigned to those variables when calculating balance scores [6].
To perform CR, the balance metric B (or B w ) is generated for all possible randomization schemes of the J clusters.The final allocation is chosen from a subset of allocations that meet a pre-specified balance criteria.Here, we select a cutoff value q which is the qth percentile of the balance scores.Yu et al. [8] note that the cutoff value q should be small and away from 1.0 (simple randomization) so that only the more balanced randomization schemes are retained in the constrained space.Following Yu et al. [8], we set q = 0.1 so that only the schemes in the top 10% of balance scores are included in our constrained allocation space.
When the number of clusters is small, it is feasible to calculate the balance score for all possible allocations where the number of allocations is J!
[(J/T )!] T .For example, when J = 8 and T = 4, there are only 2520 possible ways to randomize clusters.However, for CRTs with more clusters, for example, when J = 12 and T = 4, there are 369,600 possible ways to randomize the clusters and enumerating all possible allocations becomes computationally expensive.Following Li et al. [6], when J > 8, we randomly sample a subset of 20,000 allocations from all possible allocations, then select our final allocation from the top 10% (2,000) of allocations in terms of balance scores.

Simulation Study
We use simulation to assess our method of CR in the setting of a 2x2 factorial cluster randomized trial and how it compares to simple randomization in terms of estimating treatment None (β = 0), Medium (β = 0.5), Large (β = 1) Analysis model Control/Do not control for covariates effects.Following Li et al. [6] we simulate data using the following approach.Let x j1 , x j2 , x j3 be three independent cluster level covariates for cluster j, (j = 1, . . ., J); that are normally distributed with mean 1 and variance σ 2 x on which we wish to achieve balance.Let y ij be the outcome of interest for subject i(i = 1, . . ., n j ); in cluster j.We set n j = 100 throughout.
Let Trt 1j and Trt 2j indicate whether cluster j is assigned to treatments 1 and/or 2, respectively, where treatment is based on the factorial design in Table 1.We generate y ij from the following linear mixed-effects model: The parameters β 1 , β 2 , and β 3 are regression coefficients on the cluster-level covariates that are predictive of the outcome (when β ̸ = 0).For simplicity, we let β 1 = β 2 = β 3 .The coefficients γ 1 and γ 2 correspond to the effects of the two interventions.We set γ 1 = 5 and γ 2 = 0.The parameter b 0j is a cluster-level random effect where b 0j ∼ N (0, σ 2 b ) and ε ij is an error term where ε ij ∼ N (0, σ 2 ε ).We assume σ 2 ε = 36 and an intra-cluster correlation (ICC) We sought to investigate the following factors in our simulation study and examine how their effects differ when using CR as compared to simple randomization: number of clusters, the variability of cluster-level covariates, the magnitude of cluster-level effects on the outcome, and whether or not cluster-level covariates are controlled for in the analysis model.
Table 3 shows the factors that vary in the simulation.With five factors with two or three levels each, we evaluated a total of 2 × 2 × 3 × 3 × 2 = 72 scenarios.Simulation is based on the following steps: 1. Simulate K = 3 independent cluster level covariates of size J, where x jk ∼ N (1, σ 2 x ).
2. Use either CR (see code in Appendix A for implementing CR in R) or simple randomization to randomize the J clusters to one of the 4 conditions in Table 1. 5. Generate n j = 100 values of y ij using (4).
6. Analyze the data using a linear mixed-effects model with a random intercept for cluster and indicator variables for the two treatment conditions.Based on the simulation scenario, the analysis model either controls for or does not control for cluster-level covariates.
When controlling for cluster-level covariates in the analysis model, the analysis model is identical to (4).When the analysis model does not control for cluster-level covariates, the analysis model excludes x j1 , x j2 , x j3 .
Steps 1-6 were performed 10,000 times to generate 10,000 parameter estimates for each of the 72 simulation scenarios.We focus our attention on the performance of the treatment effects γ.Specifically, using γ 1 we assess the percent bias, variance, mean squared error (MSE), coverage and width of the 95% confidence interval, and the power to reject the null hypothesis.Using γ 2 , we assess Type 1 error.

Simulation Results
Tables 4 and 5 summarize the results of our simulation study for 8 and 12 clinics, respectively, using both CR and simple randomization under various degrees of cluster-level variability (σ x ) and cluster-level covariate effects (β).The results in Tables 4 and 5 are from simulations where cluster-level covariates are not controlled for in the analysis model.
Looking at Table 4, across all scenarios, the percent bias is essentially 0 for both CR and simple randomization.As the magnitude of cluster-level covariate effects increases (as measured by β) variance and MSE increase, with both performance criteria better under CR.A similar trend is seen with increasing values of cluster-level variability (as measured by σ x ), where variance and MSE increase as σ x increases and both performance criteria are lower under CR.Coverage and Type 1 error tend to be conservative under CR while these values are at their nominal levels under simple randomization.
Power in Table 4 is similar for both CR and simple randomization.However, in those settings where the magnitude of potential confounding is high and cluster-level variability is also high, power is low for both CR and simple randomization.For example, when σ x = 2 and β = 1, power is 24% under CR and 33% under simple randomization.As expected, covariate balance is better under CR compared to simple randomization.Because the balance metric in (1) standardizes each covariate by the inverse of its variance, values of σ x do not have an effect on the balance metric and balance is the same across all values of σ x under CR.
The results in Table 5 based on 12 clusters are similar to those based on 8 clinics, with better variance and MSE under CR and similar power as compared to simple randomization.
Again, coverage and Type 1 error are conservative under CR while these criteria are at their nominal level under simple randomization.However, with 12 clusters, power is much greater than in the setting with 8 clusters such that power is only inadequate in the scenario with the highest potential confounding (β = 1) and the highest between-cluster variability (σ x = 2).
It is also worth noting that in Tables 4 and 5, the performance criteria are the same when the product of σ x and β are the same.For example, the performance criteria when σ x = 0.5 and β = 1.0 are the same as when σ x = 1.0 and β = 0.5.This is because while the conditional variance of the outcome is the same in all the simulation scenarios, in the unadjusted analyses, we do not condition on the cluster-level covariates so that the variance of the outcome varies across all simulation scenarios and is reflected in an inflated where the term 3β 2 σ 2 x is the increase in variance due to not conditioning on covariates.When the analysis model does condition on covariates, V ar(y ij |x j ) = σ 2 e + σ 2 b .Appendix Tables 7 and 8 summarize the simulation results for 8 and 12 clusters, respectively, now based on an analysis model that controls for cluster-level covariates.Here, the analysis model is identical to the data generating model so that the results for CR are the same across all scenarios and the results for simple randomization are the same across all scenarios.Overall, even when controlling for covariates in the analyses, there is a benefit to using CR as compared to simple randomization in terms of lower MSE, greater power, and narrower confidence interval width.And unlike in the unadjusted analyses, coverage and Type 1 error are not conservative and are at their nominal levels when using CR with 12 clusters.
Comparing simulations with 8 clusters where the analyses does not control for covariates (Table 4) to simulations with 8 clusters where the analysis model does control for covariates (Table 7) we see that controlling for covariates has an especially adverse effect on power such that power is only 54% under CR and 43% under simple randomization.The only scenario where controlling for covariates produces better results than not controlling for covariates is the extreme scenario with the highest potential confounding and the highest between-cluster variability.Here, variance, MSE, power, and CI width are all better when controlling for covariates.
With 12 clusters (Table 8) there appears to be a clear advantage to controlling for clusterlevel covariates in the analyses.The effect on power as compared to not controlling for covariates (Table 5) is modest, and in those scenarios with a high degree of potential confounding, controlling for covariates results in a marked increase in power.For example, under CR in the scenario with the highest potential confounding and the highest between-cluster variability, power goes from 0.49 when not controlling for covariates to 0.98 when controlling for covariates.And as mentioned earlier, coverage and Type 1 error are at their nomial levels when controlling for covariates.

Application to the BEGIN study
We applied our methods for CR in factorial trials to the BEGIN study, using the clusterlevel covariate information in Table 2.With 8 clusters and 4 treatment conditions there are 8! 2 4 = 2520 possible schemes.Using the balance metric in (3), we calculated the balance score for each of these possible 2520 allocation schemes.Based on a belief by the BEGIN investigators that clinic volume was an important predictor of weight loss, and the fact that mean BMI was similar across all clinics, clinic volume was given a weight of 2 in (3), while percent female and mean BMI were given weights of 1. Figure 1 displays a histogram of the balance scores for all 2520 possible schemes.The vertical red line in Figure 1 indicates the cutoff corresponding to the top 10% balance scores among the 2520 scores.
In our setting, for a given set of clinic matches, the treatment assignments can be labeled 4! = 24 different ways, so that our 2520 possible allocations correspond to only 2520/24 = 105 unique balance scores.The allocations corresponding to the top 10 unique balance scores are listed in Table 6.

Discussion
In this paper we presented a method for performing CR in factorial cluster randomized trials.
We performed a simulation study to assess the effectiveness of our method as compared to simple randomization in terms of estimating treatment effects in the setting of a 2x2 factorial trial.In all scenarios, bias of the treatment effect was essentially 0. However, by balancing prognostic covariates across treatment arms, CR resulted in more precise estimates of the treatment effect as measured by MSE, a finding also noted by Kalish and Begg [16].And by constraining the allocation space, CR eliminates the possibility of a highly imbalanced allocation which may significantly undermine the power of a trial as well as threaten its internal validity [10].When covariates were not controlled for in the analysis model, we found that both CR and simple randomization produced similar rates of power but coverage and Type 1 error rates were conservative under CR, a finding that was also found in Li et al. [6].
When covariates were controlled for in the analysis, simulations again showed a clear benefit of CR versus simple randomization across all performance criteria in addition to coverage and Type 1 error close to or at their nominal levels.Still, the question of whether or not one should control for covariates in the analysis model is not clear-cut.The rationale to control for cluster-level covariates even when performing CR is that including these covariates helps adjust for any residual imbalances not controlled for during randomization and can also reduce residual variance.The trade-off is a reduction in the number of degrees of freedom for estimating treatment effects.For example, when there are 8 clusters and covariates are not included in the analysis model, there are 8 − 3 = 5 degrees of freedom available to estimate the treatment effects.Including 3 cluster-level covariates in the analysis model reduces this to only 2 degrees of freedom.
In our simulations with 8 clusters, the loss of power when controlling for covariates was so substantial that controlling for covariates is not recommended due to the decrease in degrees of freedom for estimating treatment effects.This loss of power highlights another benefit of CR-it allows to user to control for cluster-level covariates in order to avoid highly imbalanced designs and obtain more precise inferences-without the resulting decrease in degrees of freedom that would occur if covariates were controlled for in the analysis model.
With 12 clusters, the loss of power when controlling for covariates in the analysis model was minimal, and in some scenarios produced better power than not controlling for covariates.
When cluster-level covariates have small variance, as was the case in our simulations when σ x = 0.5, there is little benefit to controlling for covariates in the analysis model and a substantial loss of power.This can be seen by comparing Table 4 and Appendix Table 7 when σ x = 0.5 and β = 1.Here, power is 89% when not controlling for covariates but only 54% when covariates are included in the analysis model.Only when σ x = 2 and the degree of confounding is high is power better when controlling for covariates.
This finding is relevant to the BEGIN study, where there are only 8 clusters and the variability in the cluster-level covariates is small.In our simulation studies, where the mean of the covariates was 1, the coefficient of variation in the cluster-level covariates ranged from 0.7 when σ x = 0.5, to 1.4, when σ x = 2.In Table 2, the clinic volume coefficient of variation is 0.44.But the coefficient of variation for percent female is 0.08 and the coefficient of variation for mean BMI is only 0.01.These values suggest that the analysis model for the BEGIN study should not control for clinic-level covariates unless the distribution of clinic-level covariates in the actual trial data is much different from the values in Table 2.
There are several limitations to our study.CR requires the enumeration of all possible allocations.In a 2x2 factorial study, it is only feasible to enumerate all possible allocations when there are 8 clusters.When we simulated data with 12 clusters, we randomly sampled 20,000 allocations following the approach in Li et al. [6].We did not investigate whether this sample size is large enough to adequately represent all possible allocations.Furthermore, we did not assess the size of our constrained allocation space and used the top 10% of balanced allocations throughout.As shown by Li et al. [6], an overly constrained allocation space can result in conservative Type 1 error rates which was the case in our unadjusted analyses.
Expanding the allocation space may reduce this phenomenon while still retaining the benefits of CR over simple randomization.Finally, we evaluated our balance metric using simulated continuous covariates.Future work will evaluate how well our methods perform when binary or categorical group-level covariates are used to constrain the randomization set.

Conclusions
Our findings provide evidence for the use of CR instead of simple randomization when performing factorial CRTs to avoid highly imbalanced designs and to obtain more precise inferences.Except when there are a small number of clusters per treatment condition, clusterlevel covariates should be included in the analysis model to increase power and produce coverage and Type 1 error rates at their nominal levels.When there are a small number of clusters, we recommend cluster-level covariates should not be included in the analysis model due to the loss of power even though coverage and Type 1 error rates will be conservative in the unadjusted analyses.

Figure 1 :
Figure 1: Histogram of total balance scores for the 2520 possible allocation schemes for the BEGIN cluster randomized trial with 8 clusters and 4 randomization conditions.The vertical red line indicates the cutoff corresponding to the top 10% of balance scores among the 2520 possible scores.

Table 2 :
Total volume, percent female, and mean BMI of visits by patients who met the BEGIN eligibility criteria in 2019-2020 for each of the 8 clinics in the BEGIN trial.
study.These three potential confounders are: 1) Clinic volume, as measured by the number of office visits; 2) Percent of office visits by female patients; 3) Mean BMI of visits.It is worth noting that mean BMI is similar across the 8 clinics, but total volume varies considerably.

Table 3 :
Factors that vary in the simulation study

Table 4 :
Simulation results for the effect of treatment with 8 clusters, based on an analysis model that does not control for cluster-level covariates

Table 5 :
Simulation results for the effect of treatment with 12 clusters, based on an analysis model that does not control for cluster-level covariates

Table 6 :
Clinic pairings associated with the top 10 unique balance scores sorted by total balance score, using data from the BEGIN study.Note that in seven of the ten allocations in Table6, clinics C7 and C8 are matched together.Clinics C7 and C8 have the largest and smallest clinic volumes, respectively.Assigning them to the same treatment condition helps ensure balance across treatment conditions.
Conversely, clinics C1 and C2 are only matched together in two of the ten allocations.Clinics C1 and C2 are the second and third largest clinics.Putting them in different treatment conditions also helps ensure balance.

Table 8 :
Simulation results for the effect of treatment with 12 clusters, based on an analysis model that controls for cluster-level covariates