The Behavioral Health Interdisciplinary Program - Collaborative Chronic Care Model (BHIP-CCM) cluster randomized controlled trial (ClinicalTrials.gov, NCT02543840) used a stepped wedge design (SWD) to stagger implementation of an evidence-based team care program in general mental health clinics at nine VA medical centers.(10, 11) Briefly, all sites received implementation support to establish mental health teams structured according to the evidence-based collaborative care model;(12, 13) however, sites were randomized with respect to when they received implementation support, which was rolled out in three waves of three sites each.(10)
Types of balance in controlled trials
Prior to implementation, the nine BHIP VA study sites were balanced on eight site-level characteristics over the three waves, with three sites assigned to each wave. Our method (7) allocates sites to study waves so as to minimize a pre-specified imbalance score. There are different types of balance in cluster randomized controlled trials. The most familiar, mean balance, seeks to make the mean value of each of many site characteristics the same across every time-wave of a SWD. This is difficult to achieve with a small pool of potential sites because a single extreme outlier ruins mean balance. Another type of balancing, sequential balance, seeks to minimize the time trend of continuous factors over the time-waves of a SWD. For example, a study hoping to reduce infection rates should avoid a design with all small hospitals in the first wave and all large hospitals in the last wave, as hospital size would be confounded with time (i.e., did infection rates fall over time, or did large hospitals do better at preventing infection than small hospitals?). Perfect mean balance implies perfect sequential balance, but the latter is easier to achieve. The BHIP-CCM study sequentially balanced sites, but the same central question of obtaining balanced patient-level characteristics would be relevant had the study mean-balanced sites.
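To make the distinction concrete, the sketch below contrasts a mean-balance term with a sequential-balance (time trend) term for a single continuous site characteristic. The data and both scoring terms are illustrative assumptions only and do not reproduce the study's published imbalance score.(7)

```python
# Illustrative contrast of mean balance vs. sequential balance for one
# continuous site characteristic (hypothetical hospital size, in beds)
# across three waves of three sites each.
import numpy as np

def mean_imbalance(wave_values):
    """Spread of the wave means: zero only if every wave has the same mean."""
    means = [np.mean(v) for v in wave_values]
    return max(means) - min(means)

def sequential_imbalance(wave_values):
    """Absolute slope of the wave means over wave order (a monotone time trend)."""
    means = [np.mean(v) for v in wave_values]
    slope = np.polyfit(range(1, len(means) + 1), means, 1)[0]
    return abs(slope)

# A confounded layout (small hospitals first, large hospitals last) versus
# a layout that mixes hospital sizes within each wave.
confounded = [[100, 120, 110], [300, 310, 290], [600, 620, 580]]
mixed = [[100, 620, 300], [120, 580, 310], [110, 600, 290]]

print(mean_imbalance(confounded), sequential_imbalance(confounded))  # both large
print(mean_imbalance(mixed), sequential_imbalance(mixed))            # both near zero
```

Under either term the confounded layout scores poorly and the mixed layout scores well; the practical difference, as noted above, is that a sequential-balance criterion can still be driven near zero with a pool of only nine sites, whereas exact mean balance generally cannot.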
For the initial site balancing we drew on the concepts behind the method of fine balance proposed by Silber, Rosenbaum, and Ross for medical outcome analyses that profile and rank hospital performance.(14) With an abundance of patient data, their method can simultaneously balance up to 60 factors. We also considered the stable balancing weights proposed by Zubizarreta et al.(15) These methods complement the propensity score methods put forth by Rosenbaum and Rubin.(8) However, they require a large pool of potential controls, whereas we had a pool of only nine sites.
Balancing sites in the BHIP-CCM study
The BHIP-CCM research team, which consisted of healthcare system operational leaders and health services and implementation researchers, began with a list of over 20 site characteristics and winnowed it down (after removing redundant or correlated characteristics) to a set of eight factors used to balance sites across the three waves: urban/rural location; site complexity (a 5-level index incorporating characteristics such as the array of available services, inpatient care intensity, and the size of training programs); VA geographic region; proportion of care received via telephone (an index of the use of non-traditional treatment methods); percentage of primary care patients seen in integrated primary care / mental health teams (an index of system redesign experience); number of BHIP-CCM teams established at the site prior to the study; VA employee psychological safety score (a critical component of both team-building and system redesign efforts); and number of patients seen in general mental health clinics in the prior year (an indicator of size and flow through the mental health programs).
Detailed information about the site balancing algorithm used for the current study has been published previously.(7) Measuring aggregate imbalance over many characteristics requires an imbalance score with a weight and an imbalance term for each characteristic. All characteristics were categorized into tertiles. We generated n = 1,680 distinct site assignments from 20,000 permutations of the nine sites, assigning the first three sites of each permutation to wave 1, the next three to wave 2, and the last three to wave 3. After computing the imbalance score for all 1,680 assignments, we restricted attention to the 34 assignments with the lowest overall imbalance scores and randomly chose one of them. Table 1 shows the imbalance scores among all 1,680 site assignments and among the 34 site assignments with minimal scores. The primary set of factor representations used equal weights and re-expressed all characteristics as categorical variables.
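The sketch below illustrates the kind of search this describes. It enumerates all 1,680 distinct assignments of nine sites to three ordered waves of three sites each (rather than deduplicating 20,000 sampled permutations, as in the study), scores each assignment with a tertile-based imbalance term and equal weights, retains the 34 lowest-scoring assignments, and randomly selects one. The site data and the exact form of the imbalance term are hypothetical; the published algorithm (7) remains the authoritative description.

```python
# Illustrative sketch of a wave-assignment search; data and scoring term are hypothetical.
import itertools
import random
import numpy as np

random.seed(2015)

n_sites, wave_size, n_waves = 9, 3, 3
# Hypothetical site-by-characteristic matrix, already coded into tertiles (0, 1, 2).
site_tertiles = np.random.default_rng(1).integers(0, 3, size=(n_sites, 8))
weights = np.ones(site_tertiles.shape[1])  # equal weights across the 8 characteristics

def imbalance_score(assignment):
    """Sum over characteristics of how unevenly tertile levels are spread across waves."""
    score = 0.0
    for j, w in enumerate(weights):
        counts = np.zeros((n_waves, 3))
        for wave in range(n_waves):
            for s in assignment[wave * wave_size:(wave + 1) * wave_size]:
                counts[wave, site_tertiles[s, j]] += 1
        # Penalize each wave's deviation from the average tertile distribution.
        score += w * np.abs(counts - counts.mean(axis=0)).sum() / n_waves
    return score

# Enumerate every distinct split of 9 sites into three ordered waves of three:
# 9! / (3! * 3! * 3!) = 1,680 assignments.
assignments = []
for wave1 in itertools.combinations(range(n_sites), wave_size):
    rest = [s for s in range(n_sites) if s not in wave1]
    for wave2 in itertools.combinations(rest, wave_size):
        wave3 = tuple(s for s in rest if s not in wave2)
        assignments.append(wave1 + wave2 + wave3)

scored = sorted(assignments, key=imbalance_score)
best_34 = scored[:34]                 # assignments with the lowest imbalance scores
chosen = random.choice(best_34)       # randomly pick one low-imbalance assignment
print(len(assignments), imbalance_score(chosen))
```

Direct enumeration is practical here because 9!/(3!)^3 = 1,680; with more sites or waves, sampling permutations, as the study did, scales more gracefully.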
Table 1
Imbalance scores among all 1,680 site assignments and among the 34 site assignments with minimal scores
Site characteristic (categorical) | All site assignments (n = 1,680), mean (SD) | Site assignments with minimal scores (n = 34), mean (SD) |
VA Employee Psychological Safety Score | 0.40 (0.42) | 0.12 (0.11) |
Number of BHIP teams | 0.38 (0.40) | 0.15 (0.13) |
Number of patients seen in general mental health clinics in the prior year | 0.36 (0.34) | 0.09 (0.07) |
Primary Care - Mental Health integration | 0.40 (0.42) | 0.10 (0.07) |
Proportion of care via telephone | 0.38 (0.40) | 0.13 (0.11) |
VA geographic region | 0.40 (0.42) | 0.09 (0.07) |
Urban / Rural | 0.40 (0.42) | 0.18 (0.15) |
Site Complexity | 0.38 (0.40) | 0.13 (0.11) |
Overall Imbalance Score | 3.10 (0.91) | 0.99 (0.32) |
BHIP-CCM study patient population
Having reduced imbalance in site characteristics over time, we then explored, for the current manuscript, differences in patient demographics, hospitalizations, and diagnoses across the facilities in each study wave (n = 3 sites in each of k = 3 study waves). The eligible study population consisted of Veterans who had at least two visits in the prior year (with at least one visit within the past three months) to a general mental health clinic at one of the nine Veterans Affairs Medical Centers (VAMCs) (n = 5,596).(10, 11, 16) A subset of up to 500 individuals from each VAMC was selected to complete a telephone-administered survey battery at baseline, 6, and 12 months; a total of 1,050 Veterans completed an interview (n = 1,436 total interviews). Women were oversampled to increase gender balance. Two weeks prior to calls, potential participants received mailed study information, including instructions for opting out if they chose not to be interviewed.
Statistical analyses
We assessed differences in demographic and clinical variables across the three study waves via chi-squared tests for categorical variables and analysis of variance for continuous variables. The study waves were sequentially balanced on site-level variables (as discussed above), whereas we tested for the more stringent mean balance in patient-level variables across these waves. Effect sizes were obtained from Cramér's V for categorical variables and from eta squared (i.e., the between-groups sum of squares divided by the total sum of squares) for continuous variables.(17, 18) For Cramér's V, an effect size of 0.1 is considered 'small,' 0.3 'medium,' and 0.5 'large.'(19) For eta squared, 0.01 is considered 'small,' 0.06 'medium,' and 0.14 'large.'(19) Statistical significance was assessed at the p < .05 and p < .0001 levels. We also compared the number of patient-level variables with a medium or large effect size difference among all 1,680 site assignments versus the 34 site assignments with minimal scores identified by our site balancing algorithm.(7) All data analyses were performed using SAS software, Version 9.4. All study procedures were approved by the VA Central Institutional Review Board.
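As a supplement to the description above, the sketch below shows one way to compute Cramér's V and eta squared in Python with SciPy. The contingency table and continuous data are hypothetical, and the study's own analyses were conducted in SAS 9.4.

```python
# Hypothetical example of the effect-size calculations described above
# (chi-squared / Cramer's V for a categorical variable; one-way ANOVA /
# eta squared for a continuous variable) across three study waves.
import numpy as np
from scipy import stats

# Categorical variable: hypothetical 2 x 3 table (e.g., sex by study wave).
table = np.array([[120, 135, 110],
                  [310, 290, 330]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Continuous variable: hypothetical ages for patients in each wave.
rng = np.random.default_rng(0)
groups = [rng.normal(55, 12, 400), rng.normal(56, 12, 380), rng.normal(54, 12, 420)]
f_stat, p_anova = stats.f_oneway(*groups)
pooled = np.concatenate(groups)
grand_mean = pooled.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((pooled - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total  # between-groups SS divided by total SS

print(f"Cramer's V = {cramers_v:.3f} (p = {p_chi2:.3f}); eta^2 = {eta_squared:.4f}")
```

The resulting values can then be compared against the small/medium/large thresholds cited above.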