Aim One: Evaluating attained power and the impact of different factors on the risk of low attained power
In this study, we assumed that individual outcomes, $Y_{ijk}$, were generated from the following linear mixed-effects model:
$$Y_{ijk} = \mu + \beta j + \theta X_{ij} + \alpha_i + \epsilon_{ijk} \qquad (1)$$
where $i$ indexed the cluster, $j$ indexed the time period ($j = 0$ denotes baseline), and $k$ indexed the individual within cluster $i$ and time period $j$; $\mu$ denoted the intercept. The error terms, $\epsilon_{ijk}$, were assumed to be independently sampled from a standard normal distribution (mean zero and variance $\sigma_e^2 = 1$). The treatment group indicator (0 = standard-of-care, 1 = intervention) was denoted as $X_{ij}$ and the treatment effect, $\theta$, was set to 0.26. As $\sigma_e = 1$, the standardized effect size, $\theta / \sigma_e$, also equaled 0.26. Because the variance of the treatment effect estimate does not depend on the linear time effect coefficient, $\beta$, its value was set to 1 without loss of generality. The cluster-specific effects, $\alpha_i$, were assumed to be independently sampled from a normal distribution with mean zero and variance $\sigma_\alpha^2$. The value of $\sigma_\alpha^2$ was calculated from set values of the ICC, where ICC was defined as $\sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_e^2)$. The treatment-vs-time period correlation (TTC) was defined as the Pearson correlation coefficient between the treatment indicator, $X_{ij}$, and the time period, $j$, across all individual observations with clustering ignored. The treatment group imbalance (TGI) was defined as the difference between the number of participants who received the new intervention and the number who received the standard-of-care.
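The two allocation summaries defined above, and the conversion from ICC to the cluster-level variance, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the argument layout (a transition period per cluster and a table of cluster-period sizes), and the toy design are all assumptions made for the example.

```python
import math

def allocation_summaries(step_of_cluster, sizes, n_periods):
    """Return (TTC, TGI) for one allocation (hypothetical helper).

    step_of_cluster[i]: first period in which cluster i receives the
                        intervention (period 0 is baseline)
    sizes[i][j]:        number of individuals in cluster i at period j
    """
    n = sx = sj = sxx = sjj = sxj = 0
    for step, row in zip(step_of_cluster, sizes):
        for j in range(n_periods):
            x = 1 if j >= step else 0  # treatment indicator for (i, j)
            w = row[j]                 # individuals sharing this (x, j) pair
            n += w
            sx += w * x
            sj += w * j
            sxx += w * x * x
            sjj += w * j * j
            sxj += w * x * j
    # Pearson correlation over individuals, clustering ignored
    cov = sxj / n - (sx / n) * (sj / n)
    var_x = sxx / n - (sx / n) ** 2
    var_j = sjj / n - (sj / n) ** 2
    ttc = cov / math.sqrt(var_x * var_j)
    tgi = sx - (n - sx)  # intervention count minus standard-of-care count
    return ttc, tgi

def cluster_variance_from_icc(icc, error_variance=1.0):
    """Solve ICC = v / (v + error_variance) for the cluster variance v."""
    return icc * error_variance / (1.0 - icc)

# Toy design (assumed): 4 clusters, 5 periods, 2 individuals per cluster-period,
# one cluster crossing over at each of periods 1..4.
ttc, tgi = allocation_summaries([1, 2, 3, 4], [[2] * 5] * 4, 5)
print(round(ttc, 3), tgi)                        # 0.707 0
print(round(cluster_variance_from_icc(0.05), 4)) # 0.0526
```

With equal cluster-period sizes the toy design is perfectly balanced (TGI of zero); unequal sizes shift both summaries away from these values.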
Computational challenge and a practical solution:
Our aim was to obtain the set of attained powers by simulation and their corresponding power distribution for a specified randomization algorithm. First, we needed to obtain a list of all possible allocations that the randomization algorithm can generate, and then to evaluate the attained power associated with each possible allocation.
However, the evaluation of attained power for all potential allocations can be computationally challenging, given the potentially huge number of possible allocations. For example, a twenty-cluster cross-sectional SW-CRT with unique cluster sizes and four clusters transitioning at each of five steps has more than 300 billion unique allocations.
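The count quoted in this example follows from a multinomial coefficient, and can be checked with a short sketch. The function name is hypothetical, and the calculation assumes that every step receives the same number of clusters and that all clusters are distinguishable.

```python
from math import factorial

def n_allocations(n_clusters, n_steps):
    """Number of ways to assign distinguishable clusters to steps,
    with an equal number of clusters transitioning at each step."""
    per_step, remainder = divmod(n_clusters, n_steps)
    if remainder:
        raise ValueError("clusters must divide evenly across steps")
    # Multinomial coefficient: n! / (per_step!)^n_steps
    return factorial(n_clusters) // factorial(per_step) ** n_steps

print(n_allocations(20, 5))  # 305540235000
```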
One strategy to reduce the number of unique allocations is to categorize the clusters into size groups (e.g., (S)mall/(L)arge or (S)mall/(M)edium/(L)arge) and treat the clusters within each size group as identical, thereby reducing the number of unique allocations while increasing their multiplicities. Doing so yields only an approximate solution. However, this is often adequate in practice, given that the number of individuals a cluster will enroll can often be only approximated at the start of a trial. We do not recommend using more than four size categories, as the number of unique allocations will then likely be too large to evaluate.
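To illustrate how grouping shrinks the problem, the sketch below enumerates the unique allocations, with their multiplicities, once clusters within a size group are treated as interchangeable. It is an assumption-laden example, not the authors' implementation: the function names are hypothetical and each step is assumed to receive the same number of clusters.

```python
from itertools import product
from math import factorial

def compositions(total, parts, cap):
    """Yield tuples of `parts` integers in 0..cap that sum to `total`."""
    if parts == 1:
        if 0 <= total <= cap:
            yield (total,)
        return
    for first in range(min(total, cap) + 1):
        for rest in compositions(total - first, parts - 1, cap):
            yield (first,) + rest

def unique_allocations(group_counts, n_steps):
    """Return [(allocation, multiplicity)], where an allocation maps each
    size group to the number of its clusters transitioning at each step."""
    total = sum(group_counts.values())
    per_step, remainder = divmod(total, n_steps)
    if remainder:
        raise ValueError("clusters must divide evenly across steps")
    groups = list(group_counts)
    options = [list(compositions(group_counts[g], n_steps, per_step))
               for g in groups]
    result = []
    for combo in product(*options):
        # Keep only combinations that fill every step exactly
        if all(sum(c[j] for c in combo) == per_step for j in range(n_steps)):
            mult = 1
            for g, counts in zip(groups, combo):
                m = factorial(group_counts[g])
                for c in counts:
                    m //= factorial(c)  # multinomial coefficient per group
                mult *= m
            result.append((dict(zip(groups, combo)), mult))
    return result

allocs = unique_allocations({"S": 6, "L": 6}, n_steps=4)
print(len(allocs))                 # 44 unique allocations
print(sum(m for _, m in allocs))   # 369600 labelled allocations in total
```

Under unrestricted randomization, dividing each multiplicity by the total gives the probability of the corresponding unique allocation.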
Simulation specifications:
As shown in Table 1, we investigated the impact of the total number of clusters (12, 24, or 48), the number of cluster size groups (2: S/L, or 3: S/M/L), the distribution of clusters across cluster size groups (equal or unequal), the coefficient of variation (CV) of the cluster sizes (0.4, 0.7, 1.0, and 1.3), defined as the ratio of the standard deviation of the cluster sizes to their mean, and the ICC of the response variable (0.01, 0.05, and 0.10). A full factorial layout was investigated for all of these factors except the CV. An equal distribution of clusters across cluster size groups corresponded to an equal number of clusters in each cluster size group; an unequal distribution corresponded to an S:L ratio of 3:1 for scenarios with two cluster size groups, or an S:M:L ratio of 3:2:1 for scenarios with three cluster size groups. CVs of 0.4, 0.7, 1.0, and 1.3 were investigated for all unequal distribution scenarios, but only CVs of 0.4 and 0.7 were investigated for equal distribution scenarios, as a CV larger than one cannot be obtained with equal distributions. (Thus, when the actual cluster sizes have a CV larger than one, trial designers need to use unequal distributions to apply the approach presented here.) In total, there were 108 scenarios.
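The scenario count can be reproduced by enumerating the factor levels described above. The variable names in this sketch are illustrative only.

```python
from itertools import product

n_clusters = (12, 24, 48)
n_size_groups = (2, 3)
iccs = (0.01, 0.05, 0.10)
# The CV levels depend on the distribution of clusters across size groups
cvs_by_distribution = {
    "equal": (0.4, 0.7),
    "unequal": (0.4, 0.7, 1.0, 1.3),
}

scenarios = [
    (clusters, groups, distribution, cv, icc)
    for clusters, groups, icc in product(n_clusters, n_size_groups, iccs)
    for distribution, cvs in cvs_by_distribution.items()
    for cv in cvs
]
print(len(scenarios))  # 108 = 3 x 2 x 3 x (2 + 4)
```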
Table 1 Factors and their specific values explored in the simulation study.
Factor | Values
Number of clusters | 12, 24, 48
Number of cluster size groups | 2, 3
Distribution of clusters to cluster size groups | Equal (6S+6L, 12S+12L, 24S+24L, 4S+4M+4L, 8S+8M+8L, 16S+16M+16L), Unequal (9S+3L, 18S+6L, 36S+12L, 6S+4M+2L, 12S+8M+4L, 24S+16M+8L)
ICC | 0.01, 0.05, 0.1
CV | 0.4, 0.7, 1.0*, 1.3*
* For equal distribution of clusters to cluster size groups, only CVs of 0.4 and 0.7 were investigated as a CV of 1 or larger is not possible.
Trial parameter values for all 108 scenarios are listed in the Additional file (Table A1). The number of transition steps was fixed at four. To determine the total number of individuals needed to obtain approximately 80% power in each scenario, we used Hussey and Hughes’s [2] sample size formula for a SW-CRT with equal cluster sizes, as implemented by Baio et al. [11] in the R package “SWSamp” [14]. Then, we redistributed the total number of individuals to create unequal-sized clusters that matched the CV set under each scenario (see Additional file, A2 for details) [28]. If a resulting cluster size was not an integer, we set it at random to one of the two bracketing integers such that the expected value matched the initially calculated cluster size. For example, if the calculated cluster size was 12.3, the cluster size was set at random to either 12 or 13, such that its expected value was 12.3. Within a cluster, individuals were allocated to the time periods using a multinomial distribution with equal probability for each period. The total number of unique allocations is presented in the third column from the right in Table A1. Among the 108 scenarios, 72 (referred to as ‘completed scenarios’) had fewer than 2,000 unique allocations. For these scenarios, the attained power for every allocation was evaluated via simulation. For the other 36 scenarios (referred to as ‘sampled scenarios’), we evaluated the attained power for only a sample of 2,000 allocations. Subsequently, for two sampled scenarios, one with a relatively small number of allocations (scenario #64, with 8,623 allocations) and one with a relatively large number (scenario #103, with 113,949 allocations), we evaluated the attained power for all allocations to enable validation of the prediction models that were constructed using only the original 2,000 allocations.
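The random rounding of non-integer cluster sizes and the multinomial spread of individuals over time periods might be sketched as below. The function names and the fixed seed are assumptions for the example, and five periods (baseline plus four steps) is assumed from the design description.

```python
import math
import random

def stochastic_round(x, rng):
    """Round x to one of its two bracketing integers so that the
    expected value of the result equals x."""
    lower = math.floor(x)
    return lower + (1 if rng.random() < x - lower else 0)

def spread_over_periods(cluster_size, n_periods, rng):
    """Assign each individual to a period with equal probability,
    i.e. a multinomial draw with equal cell probabilities."""
    counts = [0] * n_periods
    for _ in range(cluster_size):
        counts[rng.randrange(n_periods)] += 1
    return counts

rng = random.Random(2024)           # seed chosen arbitrarily for repeatability
size = stochastic_round(12.3, rng)  # 12 with probability 0.7, 13 with 0.3
print(size, spread_over_periods(size, 5, rng))
```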
Performance metrics:
For each selected allocation, 10,000 datasets were simulated and analyzed using model (1). Analysis was performed using the lme function in R [29]. The P-value for the treatment effect was calculated by comparing the Wald statistic to quantiles of a t-distribution with the degrees of freedom that would be used in a balanced, multilevel ANOVA design [30]. The attained power was estimated as the proportion of P-values less than 0.05. Assuming a power near 80%, the simulation standard error of each attained power was approximately 0.4%. The power distribution (PD) associated with the unrestricted randomization algorithm was then constructed from the estimated attained powers by weighting each attained power by the probability of the corresponding allocation. Two measures of the risk of low attained power were then obtained from the PD: (1) the probability that the attained power falls more than 5 percentage points below the expected power (obtained in the simulations), and (2) the probability that the attained power falls more than 5 percentage points below the nominal 80% power (i.e., less than 75%) that would be achieved in a trial with equal cluster sizes. The first measure indicates whether potential low attained power is a concern when power is calculated using the approach presented here for handling unequal cluster sizes. The second measure would be of greater interest when one is assessing whether power calculations obtained under the assumption of equal cluster sizes are adequate. We set a power loss of 5 percentage points as being meaningful, but trial designers may wish to choose a different value more relevant to their context. In this paper, the phrase “risk of low attained power” will be used as a short form to refer to either of these measures. All computations were conducted using R version 3.5.0 on Cedar, Compute Canada [31,32].
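The simulation standard error and the two risk measures can be illustrated with a small sketch. The function names and the four-allocation power distribution are invented for the example and are not results from the study; the weights stand in for allocation probabilities (for instance, multiplicities).

```python
import math

def simulation_se(power, n_sim):
    """Binomial standard error of a power estimated from n_sim datasets."""
    return math.sqrt(power * (1.0 - power) / n_sim)

def expected_power(powers, weights):
    """Weighted mean of the attained powers."""
    return sum(p * w for p, w in zip(powers, weights)) / sum(weights)

def risk_of_low_power(powers, weights, reference, margin=0.05):
    """Probability that the attained power falls more than `margin`
    below `reference` under the weighted power distribution."""
    low = sum(w for p, w in zip(powers, weights) if p < reference - margin)
    return low / sum(weights)

print(round(simulation_se(0.80, 10_000), 4))  # 0.004

# Made-up power distribution: attained powers and allocation multiplicities
powers = [0.82, 0.79, 0.74, 0.70]
weights = [4, 3, 2, 1]
mean_power = expected_power(powers, weights)
print(round(mean_power, 3))                                       # 0.783
print(round(risk_of_low_power(powers, weights, mean_power), 3))   # 0.1
print(round(risk_of_low_power(powers, weights, 0.80), 3))         # 0.3
```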
Aim Two: Explaining the variation in attained power across allocations and predicting attained power using allocation characteristics
Using the simulation results from Aim One, we first constructed scatterplots to examine the bivariate relationship between the simulated attained power and each of TTC and TGI separately. Then, to predict the attained power for each scenario, we fitted logistic regression models of the proportion of simulations in each allocation that achieved statistical significance (at level 0.05) as a function of linear and quadratic terms for TTC and TGI.
Based on the relationships observed in the bivariate scatterplots, we chose to investigate a sequence of four nested models. The terms included in the four models were: (Model 1: TTC), (Model 2: TTC, TGI), (Model 3: TTC, TGI, TTC²), and (Model 4: TTC, TGI, TTC², TGI²).
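Written out under assumed notation, the largest of the four models (Model 4) would take the form below, with Models 1 to 3 obtained by dropping the trailing terms. The symbols $\pi_a$ (the probability of a statistically significant result under allocation $a$) and $\gamma_0, \dots, \gamma_4$ (regression coefficients) are illustrative labels and do not come from the source.

```latex
\operatorname{logit}(\pi_a)
  = \gamma_0
  + \gamma_1\,\mathrm{TTC}_a
  + \gamma_2\,\mathrm{TGI}_a
  + \gamma_3\,\mathrm{TTC}_a^{2}
  + \gamma_4\,\mathrm{TGI}_a^{2}
```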
We measured the predictive accuracy of each model by using five-fold cross-validation, conducted using cv.glm [33] in R, to estimate the root mean squared prediction error (RMSPE), defined by $\mathrm{RMSPE} = \sqrt{\sum_{k=1}^{5} \frac{n_k}{N} \cdot \frac{1}{n_k} \sum_{a \in S_k} (p_a - \hat{p}_a)^2}$, where $S_k$ represented the set of allocations in the $k$-th partition, $n_k$ was the number of allocations in the $k$-th partition, $p_a$ and $\hat{p}_a$ were the simulated and predicted powers, respectively, for the $a$-th allocation, and $N$ was the number of unique allocations in the scenario. From these models, we selected the one for which no meaningful improvement in RMSPE was achieved by adding another term to the model. For the selected model, we examined additional predictive performance measures for each scenario, including the maximum and average absolute prediction errors ($|p_a - \hat{p}_a|$). For two sampled scenarios (#64, #103), we validated the selected model with respect to both prediction of attained power for individual allocations and estimation of the risk of low attained power. For the former objective, we compared the predicted attained power to the simulated attained power for each allocation that was not in the set used to fit the predictive model. For the latter objective, we repeatedly sampled (10,000 times) 2,000 allocations and estimated the risk of low attained power from the fitted model. We then compared these estimated risks to the “true” risk derived from the simulated attained powers.
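The cross-validated error summaries can be computed from fold-wise pairs of simulated and predicted powers, as in the sketch below. It does not reproduce cv.glm or fit any model: the function name, the data layout (a list of folds, each a list of (simulated, predicted) pairs from a model fitted on the other folds), and the numbers are assumptions for illustration.

```python
import math

def cv_error_summaries(folds):
    """Return (RMSPE, max abs error, mean abs error).

    folds: list of folds; each fold is a list of (simulated, predicted)
           power pairs for the allocations held out in that fold.
    """
    n_total = sum(len(fold) for fold in folds)
    weighted_mse = 0.0
    abs_errors = []
    for fold in folds:
        sq = [(p - p_hat) ** 2 for p, p_hat in fold]
        fold_mse = sum(sq) / len(fold)
        weighted_mse += (len(fold) / n_total) * fold_mse  # weight n_k / N
        abs_errors.extend(abs(p - p_hat) for p, p_hat in fold)
    rmspe = math.sqrt(weighted_mse)
    return rmspe, max(abs_errors), sum(abs_errors) / n_total

# Made-up held-out results for two folds
folds = [[(0.80, 0.81), (0.78, 0.77)], [(0.75, 0.75)]]
rmspe, max_err, mean_err = cv_error_summaries(folds)
print(round(rmspe, 4), round(max_err, 4), round(mean_err, 4))
```

Because each fold's mean squared error is weighted by its share of the allocations, the result equals the root mean squared error over all held-out allocations pooled together.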