Statistical approaches to computing sample size in cluster randomized trials: a simulation study

Background: Cluster randomized trials, which randomize groups of individuals to an intervention, are common in health services research when one wants to evaluate improvement in a subject's outcome by intervening at an organizational level. For many such trials, sample size calculation is performed under the assumption of equal cluster size, yet many trials that set out to recruit equal clusters end up with unequal clusters for various reasons. This leads to a misalignment between the method used for sample size calculation and the data analysis, which may affect trial power. Various weighted methods for analyzing cluster means have been suggested to overcome the problem introduced by unbalanced clusters; however, the performance of such methods has not been evaluated extensively.

Methods: We examine the use of the general linear model for the analysis of cluster randomized trials that assume equal cluster sizes during the planning stage but end up with unequal clusters. We demonstrate the performance of three approaches using different weights for analyzing the cluster means: (1) the standard analysis of cluster means, (2) weighting by cluster size, and (3) minimum variance weights. Several distributions are used to generate cluster sizes in order to cover a wide range of patterns of imbalance, with the variability in cluster size measured by the coefficient of variation (CV). By means of a simulation study, we assess the impact of each of the three analysis methods on the type I error and power of the study, and how that impact is affected by the variability in cluster size.

Results: An analysis that assumes equal clusters provides a reasonable approximation when cluster sizes vary minimally (CV < 0.30). In an analysis weighted by cluster size, type I error was inflated, and the inflation worsened as the variation in cluster size increased.
However, a minimum variance weighted analysis best maintains the target power and level of significance under all degrees of imbalance considered.

Conclusion: The unweighted analysis works well as an approximate method when the variation in cluster size is minimal. However, minimum variance weights perform much better across the full range of variation in cluster size and are recommended.

Keywords: Cluster-randomized trials, unequal clusters, coefficient of variation, weighted analysis, type I error, power

BACKGROUND

Importance of cluster-randomized trials

A cluster-randomized trial (CRT) is a trial in which groups of individuals are randomized to different interventions. These designs are commonplace in health care research,(1) where they are increasingly being used to assess the effectiveness of interventions against infectious diseases as well as the prevention and treatment of non-communicable diseases.(2, 3) Findings from such studies have the potential to help government agencies formulate health policies.

In a CRT, the responses of individuals within a cluster tend to be correlated, a tendency measured by the intra-cluster correlation (ICC). The correlated responses preclude the use of classical methods of inference based on the assumption of independence.

Sample Size Calculations in CRTs
Compared to an individually randomized trial, sample size in a cluster-randomized setting is a function of the number of clusters, the cluster size, and the ICC, in addition to the other necessary parameters. When the cluster sizes are equal, the simple sample size calculation method based on the assumption of equal cluster sizes, or using the mean cluster size, provides exact inference.(4) However, in reality, cluster sizes are often unequal, even for trials that assumed equal clusters. In fact, a review of 200 CRTs found that the recruitment strategy had led to unequal clusters in two-thirds of the trials.(5) Unequal clusters affect trial power when the calculation is performed under the assumption of equal cluster sizes, and the power of a trial decreases as the imbalance among the clusters increases.

To overcome the loss in trial power, Manatunga et al.(6) used an asymptotic correction term, based on the distribution of cluster size, in the large-sample formula provided by Donner et al.(4) A practical difficulty with such an approach is determining the distribution of cluster size before the trial has begun. Further, in their simulation study, they consider only values of the ICC greater than 0.25 (rare in the CRT literature) and do not report the impact of the proposed correction on the type I error of the trial.

Eldridge et al.(5) suggested incorporating the coefficient of variation (CV) of cluster size into the sample size formula and show the utility of the proposed approach using a hypothetical example. However, accurate estimates of the CV may not be easy to obtain, and the appropriateness of this method depends on the ICC estimate being accurate.
Additionally, their work mainly focused on primary health care in England, which provides inferential guidance to researchers working in similar fields, although further investigation into the accuracy of the ICC is necessary when using the approach. Van Breukelen et al.(7) used the relative efficiency of unequal versus equal cluster sizes and presented an approximate formula based on the CV of cluster size and the ICC. More recently, You et al.(8) also proposed a relative efficiency measure based on the non-centrality parameter. Once again, both of these methods require prior knowledge of the CV of cluster size, which may not always be available at the planning stage. Thus, when accurate estimates of the CV cannot be obtained and the distribution of cluster size is unknown, the simplest approach available to clinical researchers for sample size calculation in cluster randomized trials is to assume equal cluster sizes.
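The two sample size strategies just discussed can be sketched numerically. The fragment below is our illustration, not the authors' code: the function names are ours, the normal approximation replaces the iterative t-based formula, and the CV-adjusted design effect 1 + ((CV² + 1)·m̄ − 1)·ICC is the form attributed to Eldridge et al.(5), used here under that assumption.

```python
from math import ceil

def clusters_per_arm(delta, sigma, m, icc):
    """Clusters per arm for a two-arm CRT with equal cluster size m,
    two-sided alpha = 0.05 and 80% power (normal approximation)."""
    z = 1.959964 + 0.841621                 # z_{0.975} + z_{0.80}
    n_ind = 2 * (z * sigma / delta) ** 2    # subjects per arm, individual randomization
    deff = 1 + (m - 1) * icc                # design effect for equal clusters
    return ceil(n_ind * deff / m)

def deff_unequal(m_bar, cv, icc):
    """Design effect allowing for unequal cluster sizes through their CV
    (form attributed to Eldridge et al.); cv = 0 recovers 1 + (m - 1)*icc."""
    return 1 + ((cv ** 2 + 1) * m_bar - 1) * icc

print(clusters_per_arm(delta=0.5, sigma=1.0, m=25, icc=0.05))  # -> 6
print(deff_unequal(m_bar=25, cv=0.0, icc=0.05))  # matches 1 + (m-1)*icc
print(deff_unequal(m_bar=25, cv=0.6, icc=0.05))  # larger when clusters vary
```

Note how the CV enters only through an inflation of the design effect, which is precisely why a defensible prior estimate of the CV is needed for this route.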

Analytical methods for CRTs

When unequal cluster sizes cannot be accounted for at the planning stage, the issue must be addressed at the analysis stage. Three commonly employed analytic methods for data arising from a CRT are: (a) analysis of cluster means using classical methods, (b) the linear mixed effects model, and (c) generalized estimating equations.(9) As long as the planned analysis involves a test of hypothesis, one may use power analysis to guide the design and testing strategy; although complex models (multivariate methods, mixed models, and GEEs) are popular for analyzing the data, the corresponding methods for power analysis have not found such universal applicability.(10) Therefore, in this paper, we focus only on the first analysis approach so that the power analysis and data analysis are aligned under the generalized linear model (GLM) framework.

When the cluster sizes are equal, i.e. balanced, using classical methods for analyzing the cluster means allows us to use all the methods of classical theory.(11) However, this Exact Under Balanced Design (EUBD) approach provides only approximate results in the presence of unequal clusters. The EUBD approach provides a unified setup for power and data analysis under the GLM framework and, in doing so, makes all commercial and free software capable of handling linear models available for the design and analysis of CRTs with unequal cluster sizes.

As mentioned above, a CRT loses power in the presence of unequal clusters. A weighted analysis of cluster means has been suggested to offset this loss in power.(5, 12) Kerry and Bland(12) compared different weighting schemes and found that minimum variance weights, proportional to the inverse of the variances of the cluster means, were the best in terms of achieving target power.
However, they too did not discuss the implications of the different weights for the type I error. Guittet et al.(13) used these weights to study "Pareto"-type imbalance, where 80% of subjects belong to only 20% of the clusters. Many CRTs may not show imbalance of this nature.
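The weighting schemes discussed above can be made concrete. In the sketch below (our illustration; the function name and the normalization to unit sum are assumptions), the minimum variance weight for a cluster of size m is proportional to m / (1 + (m − 1)·ICC), i.e. the inverse of the variance of its cluster mean up to a constant.

```python
def cluster_weights(sizes, icc, scheme="minvar"):
    """Normalized weights for an analysis of cluster means.

    'equal'  -- every cluster mean weighted equally (unweighted analysis)
    'size'   -- weight proportional to cluster size
    'minvar' -- weight proportional to m / (1 + (m - 1)*icc), the inverse
                of the cluster-mean variance up to a constant
    """
    if scheme == "equal":
        w = [1.0] * len(sizes)
    elif scheme == "size":
        w = [float(m) for m in sizes]
    elif scheme == "minvar":
        w = [m / (1 + (m - 1) * icc) for m in sizes]
    else:
        raise ValueError(scheme)
    total = sum(w)
    return [x / total for x in w]  # normalized to sum to 1

print(cluster_weights([10, 20, 40, 80], icc=0.05, scheme="minvar"))
```

As the ICC approaches 0 the minimum variance weights coincide with cluster size weights, and as it approaches 1 they approach equal weights, so the scheme interpolates between the two extremes.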

In general, within the approach of analyzing cluster means under the generalized linear model framework, there are two ways to analyze the data: a simple unweighted analysis of cluster means and a weighted analysis of cluster means. Where weighted analyses have been studied, the effect of the weights on the type I error has not been given much attention. Further, there appears to be no recommendation about which weights to use under different scenarios.

Aligning the sample size calculation and analysis approach

In general, one would always like to align the sample size calculation and the analysis approach for the best design characteristics. The EUBD approach, mentioned earlier, allows this alignment and provides exact inference as long as equal clusters are expected at the design stage and equal clusters are, in fact, obtained during recruitment. For cases where a defensible estimate of the CV of cluster size is available at the design stage, methods exist to compute a sample size that also aligns the calculation with the analysis.

However, there are scenarios where this type of alignment may not be practical. A very likely situation arises when equal clusters are expected at the design stage but the study recruits unequal clusters. Another occurs when unequal clusters are expected but there are no reliable estimates of the CV of cluster size on which to base the sample size calculation during the design phase of the trial, e.g. when no prior information can be obtained from similar previous trials. Under such circumstances, researchers will likely need to use a sample size calculation based on equal clusters while analyzing the data taking the unequal clusters into account. This introduces a misalignment between the sample size calculation and the data analysis approach and adversely affects the trial power.

Objective and outline
Our objective is to evaluate the impact of the misalignment between a sample size calculation based on the assumption of equal cluster size and three analytic approaches using different weights, with respect to the power and type I error of a trial under a range of scenarios.

The three approaches are as follows. The first method is based on the EUBD approach and performs a standard analysis of cluster means. The second method performs a weighted analysis of cluster means using cluster size weights, while the third uses minimum variance weights. Based on the assessment of the type I error and power obtained, we make recommendations on the suitability of the approaches under different conditions.

In Section 2, we describe the model of a CRT with unequal cluster sizes that will be used for generating data, together with the univariate model associated with a balanced cluster trial for a two-group parallel study. Details of the simulation study are given in Section 3. We present the results of the simulation in Section 4, followed by a discussion of the results in Section 5.

The critical values of a t-distribution depend on the degrees of freedom of the test, and the formula may therefore require two or more iterations;(14) however, this complexity is easily addressed by using free or commercial software programs.

When cluster sizes vary, no exact method exists (see Appendix), and the transformation leads to a GLM with heteroscedastic errors whose covariance matrix is of the form σ²D, where D is a positive definite matrix. The elements of D are functions of the ICC and the cluster sizes.
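To make the structure of D concrete: for a cluster of size g, the variance of its cluster mean is σ²·[1 + (g − 1)·ICC]/g, so the diagonal of D can be sketched as follows (an illustrative fragment; the function name is ours).

```python
def d_elements(sizes, icc):
    """Diagonal elements of D: Var(cluster mean) / sigma^2 for each cluster,
    i.e. (1 + (g - 1)*icc) / g for a cluster of size g."""
    return [(1 + (g - 1) * icc) / g for g in sizes]

# Equal clusters give a constant diagonal (homoscedastic cluster means);
# unequal clusters make the transformed errors heteroscedastic.
print(d_elements([25, 25, 25], icc=0.05))
print(d_elements([5, 25, 100], icc=0.05))
```

Larger clusters have smaller cluster-mean variance, but because of the ICC the variance does not shrink proportionally to 1/g, which is why cluster size weights are not the variance-minimizing choice.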

Model Statement and Sample Size Calculation
In this paper, we consider three analyses based on the structure of the D matrix; they are presented in Table 1. As such, we investigate the performance of cluster-level analyses that allow us to make use of a t-test to compare two independent samples. The first method uses a two-sample t-test on the cluster means, ignoring the variability in cluster sizes. The second and third approaches use a weighted t-test on the cluster means, with the means weighted by cluster size and by minimum variance weights, respectively. When cluster sizes are used as weights, more weight is given to larger clusters. The scenarios considered in the simulation study are summarized in Table 2.
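The weighted t-test on cluster means can be sketched as a weighted least squares computation. This is our illustrative implementation, not the authors' code: the weights are treated as known constants (e.g. the minimum variance weights m / (1 + (m − 1)·ICC)), and with all weights equal it reduces to the ordinary pooled two-sample t-test.

```python
from math import sqrt

def weighted_t(means1, w1, means2, w2):
    """Two-sample weighted t-test on cluster means (weighted least squares).

    Returns (t statistic, degrees of freedom). Each weight plays the role
    of the inverse variance of its cluster mean, up to a common constant.
    """
    def fit(y, w):
        W = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, y)) / W
        rss = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
        return mu, W, rss

    mu1, W1, rss1 = fit(means1, w1)
    mu2, W2, rss2 = fit(means2, w2)
    df = len(means1) + len(means2) - 2
    s2 = (rss1 + rss2) / df                      # pooled residual variance
    t = (mu1 - mu2) / sqrt(s2 * (1 / W1 + 1 / W2))
    return t, df

# Equal weights reproduce the ordinary two-sample t-test on cluster means.
t, df = weighted_t([1.0, 2.0, 3.0], [1, 1, 1], [2.0, 3.0, 4.0], [1, 1, 1])
print(round(t, 4), df)
```

The same routine covers all three analyses in Table 1 by swapping the weight vectors, which is what keeps the power calculation and the data analysis inside one GLM framework.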

RESULTS
Performance of the three methods for analyzing the data

Summary statistics for the observed type I error and power are presented in Table 3, which shows the minimum, maximum, mean, and standard deviation of the type I error and power values obtained under the different data analysis scenarios for the three categories of variation in cluster size. We also present the percentage of cases where the type I error is inflated and where the power is below the target level. Under the EUBD approach to sample size calculation, both weighted approaches perform similarly in attaining the target power. The choice of method for data analysis does not seem to make a material difference as long as the CV of cluster sizes is less than 0.30. However, the performance of the weighted analysis using cluster size as weights deteriorates in terms of maintaining the nominal level of significance as the CV of cluster size increases. The average type I error was inflated to as much as 0.073 (s.d. = 0.019) when the CV of cluster size was greater than 0.50.

In general, the simple analysis of cluster means assuming equal cluster sizes performs reasonably well with respect to controlling the inflation of the type I error compared with the other methods. At the same time, however, it fails to maintain the target power as the variation in cluster size increases (CV ≥ 0.30). On the other hand, weighting the cluster means by cluster size does a better job of maintaining power at the target level, but at the cost of an inflated type I error in a number of situations: nearly 85% of cases have an inflated type I error rate when CV > 0.50. Finally, weighting the cluster means by the minimum variance weights performs best in terms of balancing the type I error and the target power across the entire range of conditions.
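The qualitative pattern just described can be reproduced in miniature with a short Monte Carlo sketch. All settings below are illustrative assumptions of ours, not the paper's simulation design: 15 clusters per arm, cluster sizes uniform on [5, 45] (CV ≈ 0.47), ICC = 0.05, unit total variance, and a hard-coded 5% critical value for 28 degrees of freedom.

```python
import random
from math import sqrt

TCRIT = 2.048  # two-sided 5% critical value of t with 28 df (15 clusters/arm)

def wls_t(y1, w1, y2, w2):
    """Weighted two-sample t statistic on cluster means (WLS form)."""
    def fit(y, w):
        W = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, y)) / W
        return mu, W, sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
    mu1, W1, r1 = fit(y1, w1)
    mu2, W2, r2 = fit(y2, w2)
    s2 = (r1 + r2) / (len(y1) + len(y2) - 2)
    return (mu1 - mu2) / sqrt(s2 * (1 / W1 + 1 / W2))

def type1_error(scheme, k=15, icc=0.05, reps=2000, seed=1):
    """Empirical type I error under the null for one weighting scheme."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        arm = []
        for _ in range(2):
            sizes = [rng.randint(5, 45) for _ in range(k)]
            # the mean of a size-m cluster has variance (1 + (m-1)*icc)/m
            means = [rng.gauss(0.0, sqrt((1 + (m - 1) * icc) / m)) for m in sizes]
            if scheme == "equal":
                w = [1.0] * k
            elif scheme == "size":
                w = [float(m) for m in sizes]
            else:  # minimum variance weights
                w = [m / (1 + (m - 1) * icc) for m in sizes]
            arm.append((means, w))
        t = wls_t(arm[0][0], arm[0][1], arm[1][0], arm[1][1])
        hits += abs(t) > TCRIT
    return hits / reps

for scheme in ("equal", "size", "minvar"):
    print(scheme, type1_error(scheme))
```

With these settings, the "size" scheme typically shows a rejection rate noticeably above 0.05, while "equal" and "minvar" stay near the nominal level, mirroring the direction of the pattern reported in Table 3.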
Since the simple analysis of cluster means provides results comparable to the other two methods when the variation in cluster size is low (CV < 0.30), we do not discuss that scenario further, for brevity.

Results by ICC are given in Table 4. It is clear from the results that the trial ICC can affect the power of a study to a much greater extent when an analysis assuming equal clusters is performed. For small ICC values, the average power dropped to 0.59 with the standard analysis of cluster means when the CV of cluster size was greater than 0.50. Either of the weighted approaches maintains the study power at the target level, but cluster size weights lead to an inflated type I error (up to 0.086 when CV > 0.50). These results are similar to those in Section 4.1, except for one interesting observation: the power decreases as the ICC increases. This appears contrary to the established fact that the power should be maintained at or above the target level as the ICC increases (since a larger ICC leads to an increase in the number of clusters required). The effect can be explained by the conservative rounding of the number of clusters computed by the SAS GLMPOWER software; a trial with a correspondingly small number of clusters is more affected by this rounding.

DISCUSSION

Sample size calculation is an important part of designing a trial, and it may become a complicated task in a CRT when unequal clusters are anticipated, owing to the absence of an exact method. This has led to approximate approaches that help overcome the loss in power due to imbalance in cluster sizes. Methods are available to inflate the sample size when the variability in cluster size is known a priori; however, knowledge of the distribution of cluster size may not be available at the time of planning a study.
Further, many trials anticipating equal clusters eventually end up with unequal clusters for a variety of reasons. Given these complications, the simplest course is to calculate the sample size assuming equal cluster sizes. This leads to a misalignment between the power analysis and the data analysis, resulting in a loss of trial power. Researchers may try to minimize the impact of unequal clusters on trial power by performing a weighted analysis; however, several weights may be employed in such a procedure, and there is a lack of consensus as to which method is best suited to which circumstances.

In this paper, we considered three methods of analysis for a CRT: assuming equal cluster sizes, weighting by cluster size, and weighting by minimum variance weights. Among these methods, we chose to focus on the approach that allows us to use the analysis procedures of a GLM. This approach allows us to align the power analysis with the analysis of data based on the test of hypothesis. Although these methods have existed in the literature and have been used in the setting of CRTs for more than a decade, we found that an extensive comparison of the procedures had not been provided. Thus, we set out to investigate these methods under a variety of imbalance conditions and to determine their effect on the type I error and power of the study. To this end, we used a variety of distributions to generate the cluster sizes, which should encompass the range of imbalances found in practice. We did not evaluate the bias of the estimates because the mean treatment effect remains the same irrespective of whether an individual-level or a cluster-level analysis is undertaken.(12)

In this paper we looked in detail at the performance of three methods of analysis. The first is a simple analysis of cluster means assuming equal clusters. This approach yields satisfactory results as long as the variation in cluster size is low (CV < 0.30), which agrees with a prior result(5) in which the authors concluded that variation in cluster size may be ignored if the CV of cluster size is less than 0.23. As such, investigators observing low variation in cluster size may use a two-sample t-test to carry out their analysis. An example of such a case is when whole classrooms of a school are randomized: a 2008 study by Salmon et al. examined the effectiveness of an intervention to prevent excess weight gain in children.
Seventeen classes with approximately 25 children in each class were randomly assigned to one of three intervention groups or a control group.(15) In such scenarios, we expect to see minimal variation in cluster size from classroom to classroom of the same school. In situations where the cluster size variation is moderate to high, however, the approach fails to maintain the target power.

The other two methods take a weighted approach to the final analysis: one uses cluster size weights, whereas the other uses minimum variance weights. A common practice among medical researchers is to use cluster size weights in the final analysis for their intuitive appeal of giving more weight to larger clusters.(12) However, we found the performance of this approach to be sub-par: although it seems to maintain power at the target level, the associated inflation of the type I error may render the conclusions questionable, and the inflation becomes progressively worse as the variation in cluster size increases. Among all weights, minimum variance weights have been found by different researchers to be the best in terms of maintaining the power of the trial.(1, 12, 14) We additionally assessed the impact of such weights on the type I error of the trial, which was absent from the literature. Our simulation results support the claim that these weights perform best. In addition, we present an explicit proof in the Appendix of why such weights are optimal for an unbalanced cluster trial. Thus, we recommend using minimum variance weights in all situations where the cluster size variation is more than minimal.

It should be noted that individual-level analyses are more efficient than analyses of cluster means(5); however, the use of cluster means allowed us to frame the problem in the context of the GLM. Hence, the results in this paper may not be applicable to some other commonly used methods for analyzing data from CRTs (e.g., mixed model analysis, generalized estimating equations).
With respect to sample size calculation, an alternative strategy for the better design of a CRT with unequal clusters makes use of the CV of cluster sizes in the sample size formula(5) or adjusts the sample size using the relative efficiency of unequal versus equal cluster sizes.(7, 8) Application of such methods requires precise knowledge of the CV of cluster size and the ICC, which may not be possible at the design stage of the study. We believe these methods are better suited to sample size re-estimation procedures, in which the knowledge gained in the initial phase of a trial is used to revisit the sample size calculation. Given the widespread use of cluster-randomized designs, we are currently pursuing such methodologies.

Another limitation of our study is that the weighted t-test requires a good prior estimate of the intra-cluster correlation coefficient, or of the between-cluster variability, which is hard to obtain with a small number of clusters.(16)

APPENDIX

Theorem 1: For a CRT with unequal cluster sizes, the linear mixed model may be equivalently expressed as a weighted univariate linear model that can provide exact inference when the weights are completely known. The approach provides approximate estimates when the weights must be estimated from the data, as is the case for unbalanced cluster trials.

(1) As described in Section 2.1, a linear mixed model for the g_i ρ-correlated observations in a cluster may be written as y_i = X_i β + 1_{g_i} b_i + e_i, with b_i ~ N(0, σ_b²) and e_i ~ N(0, σ_e² I_{g_i}), where 1_{g_i} is a g_i-vector of ones. With only one random effect for the cluster, Cov(y_i) = σ_b² J_{g_i} + σ_e² I_{g_i}, where J_{g_i} = 1_{g_i} 1'_{g_i}, which leads to a common correlation between any two observations in the cluster. In terms of the total variance of the outcome variable, σ² = σ_b² + σ_e², the covariance matrix is Cov(y_i) = σ² [(1 − ρ) I_{g_i} + ρ J_{g_i}]. Thus, ρ = σ_b²/(σ_b² + σ_e²).

(2) The data from all the clusters may be "stacked" to represent a combined model. With a total of n observations, the model can be succinctly written as y = Xβ + Zb + e. (1A)

(3) The covariance matrix of the stacked model is V = ⊕_{i=1}^{k} Σ_i, which is of order n × n. This can be realized by using cell-mean coding for the design matrix X and, with only a single random effect per cluster, Z = ⊕_{i=1}^{k} 1_{g_i}.
(Note: ⊕ denotes the direct sum operator.)

(4) Pre-multiplying the stacked model by the k × n matrix of known constants T = ⊕_{i=1}^{k} (1/g_i) 1'_{g_i} yields the vector of cluster means, ȳ = T y. Hence, the new model is given by:

ȳ = X̄ β + ē. (2A)

The model in (1A) is equivalent to the model in (2A), with X̄ = T X and ē ~ N(0, σ² D), where D = diag{[1 + (g_i − 1)ρ]/g_i}.
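The variance result underlying Theorem 1 can be checked numerically. The sketch below (our illustration; the parameter values are arbitrary) simulates a random-intercept cluster and confirms that the empirical variance of the cluster mean matches σ²[1 + (g − 1)ρ]/g, the corresponding diagonal element of σ²D.

```python
import random
from math import sqrt

def cluster_mean_var(g, sigma_b, sigma_e, reps=20000, seed=7):
    """Empirical variance of the mean of a size-g cluster under the
    random-intercept model y_ij = b_i + e_ij (illustrative check)."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        b = rng.gauss(0.0, sigma_b)                          # cluster effect
        means.append(b + sum(rng.gauss(0.0, sigma_e) for _ in range(g)) / g)
    mu = sum(means) / reps
    return sum((x - mu) ** 2 for x in means) / (reps - 1)

g, sigma_b, sigma_e = 10, 0.5, 1.0
sigma2 = sigma_b ** 2 + sigma_e ** 2            # total variance sigma^2
icc = sigma_b ** 2 / sigma2                     # intra-cluster correlation rho
theory = sigma2 * (1 + (g - 1) * icc) / g       # element of sigma^2 * D
print(cluster_mean_var(g, sigma_b, sigma_e), theory)
```

With these values the theoretical variance is 0.35, and the Monte Carlo estimate lands close to it, supporting the form of D used in the weighted univariate model.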