Prevalence and Methodological Characteristics of Subgroup Analyses in Stepped Wedge Cluster Randomised Trials: Protocol for a Systematic Review

Background: The stepped wedge cluster randomised trial is an increasingly common trial design. The design can be useful for informing real-world clinical decision-making, including decisions about the effectiveness of interventions in particular subgroups. However, there is little guidance on how to perform subgroup analyses in the stepped wedge design. We aim to determine the prevalence of subgroup analyses and to describe the statistical methods used to perform them in stepped wedge cluster randomised trials.
Methods: We will conduct a systematic review following the methodology recommended in the Cochrane Handbook for Systematic Reviews of Interventions. We report this protocol according to the PRISMA-P checklist. The protocol has been registered in the Open Science Framework. We will search for terms related to ‘stepped wedge’ in PubMed, Embase, PsycINFO, Web of Science, CINAHL, Cochrane Library, and Current Controlled Trials Register up to 16 October 2020. Studies will be eligible if they are written in English, involve human participants and are primary or secondary reports of planned or completed stepped wedge cluster randomised trials. Two reviewers will first screen the titles and abstracts, then the full texts, to select studies for inclusion in the review. Disagreements will be resolved by consensus through discussion with a third reviewer. We will extract data on study characteristics, including the presence or absence of subgroup analyses, the characteristics of the subgroup variables examined, the statistical methods used to perform subgroup analyses, and adherence to the recommendations most consistently suggested for subgroup analyses in general, including in clinical trials. We will perform a qualitative synthesis of the extracted data.
Discussion: This protocol offers a reproducible and transparent procedure for a systematic review of the literature. The review will provide a portrait of the frequency and types of subgroup analyses performed in stepped wedge cluster randomised trials. These results will inform the development of recommendations for subgroup analyses in such trials.

contributes observations under both control and intervention conditions, improving its statistical efficiency when making comparisons [5]. Because all clusters ultimately receive the intervention, the design may also be more socially acceptable. These advantages, among others [6], may explain the increasing use of the SW-CRT design in recent years [7]. However, the design also has disadvantages. It raises numerous methodological complexities, such as potential confounding by time [8][9][10]; possible within-cluster contamination [11]; possible time-varying treatment effects [12]; potential cluster-treatment heterogeneity [13]; and complex correlation structures [14][15][16][17]. These make data analysis of SW-CRTs more complex than that of other designs [10]. Statistical methods for data analysis of SW-CRTs have been proposed [8,18,19] and others are still in development [9,20]. The design can be useful to inform real-world clinical decision-making, including decisions about the effectiveness of interventions in particular subgroups [21][22][23][24]. For this reason, the CONSORT extension for SW-CRTs recommends (item 18) reporting the results of any subgroup analyses performed [25].
Subgroup analyses are used to examine whether the effect of one variable (e.g. exposure) on another (e.g. outcome) varies across strata (subgroups) of a third (e.g. demographic characteristics of patients) [26,27]. Such analyses are performed to inform individualized treatment decisions [28,29] or to investigate the consistency of the trial conclusions among different subpopulations [30,31]. When appropriately performed, subgroup analyses can lead to more targeted clinical recommendations, better informed clinical decision-making and improved patient care [32]. Existing recommendations for subgroup analyses, derived from methodological papers and systematic reviews, mainly focus on design, analysis, reporting and interpretation [33][34][35]. Regarding design, subgroup analyses should 1) be based on strong biological reasoning, previous empirical evidence or current scientific theory to reduce susceptibility to bias [36]; 2) be pre-specified (the plan should be laid out in the protocol) rather than post hoc to reduce the risk of spurious findings [37]; 3) include a power calculation (only if the subgroup analyses are related to the primary trial objectives) to ensure the trial sample size is adequate to detect an interaction [38]; 4) use subgroup variables measured prior to randomization so that they cannot be affected by treatment response; 5) be supported by randomization stratified on the subgroup variables to allow for balanced treatment assignments within subgroups [35,39,40]; and 6) test a small number (≤ 5) of subgroup hypotheses to reduce false positives [34,40]. Regarding analysis, subgroup analyses depend on the statistical methods used to assess primary outcome effects [41,42]. To determine whether there is a subgroup effect, it is recommended to 1) use a formal statistical test for the interaction [35,41,42]; 2) adjust for multiple subgroup hypothesis testing by applying a correction (e.g. the Bonferroni method) [42][43][44][45]; and 3) check the subgroups for comparability of prognostic factors [46].
Regarding reporting, it is suggested that studies should report 1) all the subgroup analyses performed [33]; and 2) the scale (additive or multiplicative) on which subgroup effects were assessed [33,47]. Regarding interpretation, experts strongly suggest that results of subgroup analyses should be interpreted with caution to prevent mis- or overinterpretation [48][49][50]. Of all the methodological recommendations made from the 1980s to 2018, five basic recommendations have remained consistent and are frequently suggested for subgroup analyses in general (Table 1) [26,34,40,42,51,52].
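As an illustration of the multiplicity correction mentioned above, the following minimal Python sketch applies the Bonferroni method to a set of interaction-test p-values. The function name `bonferroni`, the p-values and the overall significance level of 0.05 are assumptions made for the example only; they are not taken from any trial or from the cited recommendations.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction for m subgroup hypotheses.

    Returns the adjusted p-values (p * m, capped at 1) and, for each test,
    whether the raw p-value is at or below the per-test threshold alpha / m.
    """
    m = len(p_values)
    if m == 0:
        return [], []
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p <= alpha / m for p in p_values]
    return adjusted, significant


# Hypothetical p-values from four subgroup-by-treatment interaction tests.
interaction_p = [0.004, 0.03, 0.20, 0.012]
adjusted_p, significant = bonferroni(interaction_p)

for raw, adj, flag in zip(interaction_p, adjusted_p, significant):
    print(f"raw p = {raw:.3f}, adjusted p = {adj:.3f}, significant = {flag}")
```

With four hypotheses the per-test threshold becomes 0.05 / 4 = 0.0125, so an interaction with a raw p-value of 0.03 would no longer be declared significant.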
More recommendations are related to standard randomised controlled trials (RCTs) than to cluster randomised trials (CRTs). Subgroup analyses are rare in CRTs, which limits their ability to shed light on the extent of benefits or risks for the treatments tested [35]. Subgroup analyses are most straightforward in trials with a single measurement from each participant or cluster. However, multiple-period trials such as SW-CRTs have different underlying modeling assumptions, which can complicate the conduct of subgroup analyses [53]. Due to the inherent confounding of the treatment effect with time, analysis of a SW-CRT should always account for secular trends [9,54]. Subgroup analysis in SW-CRTs could require introducing into the statistical model two interaction terms involving the subgroup variable: one with treatment and one with time [31,41]. To the best of our knowledge, there is no recommendation about how to handle this issue. Recommendations might depend on whether the subgroup variable is a cluster-level or individual-level variable, and whether differences in the secular trend are anticipated across the subgroups. As a first step in developing methodological recommendations for subgroup analyses in SW-CRTs, it will be useful to know how they have been handled so far in published trials and protocols of SW-CRTs. Therefore, we aim to review reports of SW-CRTs to identify and describe the statistical methods used to perform subgroup analyses. Specific objectives are to: 1) determine the prevalence of reporting subgroup analyses in SW-CRTs; 2) describe the characteristics of the subgroup variables examined; 3) identify and describe the statistical methods used to perform subgroup analyses in SW-CRTs; and 4) determine the prevalence of adherence to the recommendations most consistently suggested for subgroup analyses in general, including in clinical trials.
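The two interaction terms discussed above (subgroup by treatment and subgroup by time) can be written out explicitly. The following is an illustrative sketch only, not a model recommended by this protocol: it extends the linear mixed model of Hussey et al. [17] with a binary subgroup variable. All notation is assumed for the example: $Y_{ijk}$ is the outcome of individual $k$ in cluster $i$ at period $j$, $X_{ij}$ indicates whether cluster $i$ is under the intervention at period $j$, and $Z_{ijk}$ is the subgroup indicator (written $Z_i$ if the subgroup variable is defined at cluster level).

```latex
Y_{ijk} = \mu + \beta_j + \theta X_{ij} + \gamma Z_{ijk}
          + \delta \, X_{ij} Z_{ijk} + \eta_j Z_{ijk}
          + \alpha_i + e_{ijk},
\qquad \alpha_i \sim N(0, \tau^2), \quad e_{ijk} \sim N(0, \sigma^2),
\qquad \beta_1 = \eta_1 = 0 .
```

In this sketch, $\beta_j$ captures the secular trend, $\theta$ is the treatment effect in the reference subgroup, $\delta$ is the subgroup-by-treatment interaction (a formal interaction test corresponds to $H_0\colon \delta = 0$), and $\eta_j$ lets the secular trend differ between subgroups. Setting all $\eta_j = 0$ gives the simpler model in which the secular trend is assumed to be common to both subgroups.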

Methods
We will conduct a systematic review. We report this protocol according to the PRISMA-P guidelines [55] and provide the checklist in an additional file. We registered this protocol with the Open Science Framework (Registration ID: https://osf.io/2kwrz).

Eligibility criteria
Studies will be eligible if they a) are written in English, b) involve human participants, and c) are protocols or primary or secondary reports of planned or completed SW-CRTs. Papers with their protocols will be considered as one completed original study. We will include only studies that use cluster randomisation with a minimum of two sequences and three periods, or three sequences and two periods. Designs in which not all clusters start in the control condition and end in the intervention condition, or in which data are incomplete, will remain eligible [56]. We will place no restrictions on interventions, comparators or outcomes. We will consider both healthcare and non-healthcare settings. We will exclude a) individually randomised trials; b) bidirectional crossover trials; c) non-randomised stepped wedge designs; and d) trials retrospectively analysed as a stepped wedge design when the study was not originally designed as one [54]. We will also exclude systematic reviews, editorials, design manuscripts, and letters.

Information sources and search strategy
We will use an adaptation of three previously published search strategies of systematic reviews on the stepped wedge design [7,19,57,58]. We will search the following databases up to 16 October 2020: PubMed, Embase, PsycINFO, Web of Science, CINAHL, Cochrane Library and Current Controlled Trials Register. Our search terms are "stepped wedge", "step wedge", "experimentally staged introduction", "oneway crossover", "one directional crossover", "one way cross-over", "SW-CRT" as well as the 28 combinations of the terms "incremental", "phased", "staggered", "staged", "stepwise", "step wise" and "delayed" with the terms "recruitment", "introduction", "implementation", "intervention". An information specialist will first run the search strategy in PubMed and then translate it for the other databases. A second information specialist will review the initial search strategy with the Peer Review of Electronic Search Strategies (PRESS) tool [59]. Our search strategy in PubMed is described in Table 2. We will report on the search process following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram [60]. The flow diagram is presented in Fig. 2. We will also search the grey literature by contacting stepped wedge design experts in relevant networks such as the Ottawa Methods Centre (OMC) at the Ottawa Hospital Research Institute, Canada [61], the Pragmatic Clinical Trials Unit (PCTU) in London, UK [62] and the National Institutes of Health, USA [63].
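The 28 combined terms described above are the 7 × 4 pairings of the two term lists. The short Python sketch below enumerates them. Wrapping each pairing in quotation marks and joining the pairings with OR are assumptions made for illustration; the exact database syntax is the one given in Table 2.

```python
from itertools import product

# The two term lists given in the search strategy.
modifiers = ["incremental", "phased", "staggered", "staged",
             "stepwise", "step wise", "delayed"]
targets = ["recruitment", "introduction", "implementation", "intervention"]

# Every modifier paired with every target: 7 x 4 = 28 phrases.
combined_terms = [f'"{m} {t}"' for m, t in product(modifiers, targets)]

# Assumed presentation only: a single OR-joined fragment of a search query.
query_fragment = " OR ".join(combined_terms)

print(len(combined_terms))
print(query_fragment)
```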

Data management
We will first merge the citations identified from the seven electronic databases mentioned above in EndNote software. We will then identify and remove the duplicates. Unique citations will be considered for the selection process.

Study selection process
We will complete the screening in four steps. First, we will randomly select a small sample (10%) of the unique records identified. Two reviewers will independently screen these studies using our inclusion/exclusion criteria. We will assess the agreement between screeners to ensure that the eligibility screening process is reproducible and reliable. We will describe the observed proportions of articles on which pairs of screeners agree or disagree in their eligibility decisions and calculate Kappa statistics [64]. Disagreements will be resolved through discussion. If necessary, we will modify the instructions and re-test on another sample to improve the agreement between screeners [65]. Second, once agreement is acceptable, the two reviewers will independently screen the remaining titles and abstracts. Third, we will obtain full-text articles for all the studies that passed the title and abstract screening and confirm eligibility. In addition, we will examine the reference lists of relevant systematic reviews and eligible papers for additional papers that the search strategy may have missed. Papers with their protocols will be considered as one completed original study. Studies found not to meet the eligibility criteria will be excluded and we will document the reason for exclusion. Finally, from full-text screening, we will identify SW-CRTs in which subgroup analyses were performed. We will email study authors when relevant information for the selection decision is missing or unclear. Discrepancies between the two reviewers will be resolved by consensus through discussion with a third reviewer.
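The agreement check in the first step can be sketched as follows. This minimal Python example computes the observed proportion of agreement and Cohen's kappa for two screeners. The function name `cohen_kappa` and the screening decisions are hypothetical, and the choice of Cohen's (unweighted, two-rater) kappa is an assumption, since the text above says only that Kappa statistics will be calculated.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Return (observed agreement, Cohen's kappa) for two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
    of agreement and p_e the agreement expected by chance from each
    rater's marginal proportions.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same non-empty set of records.")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if expected == 1.0:
        # Both raters used a single identical category: kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical eligibility decisions of two screeners on ten records.
screener_1 = ["include", "include", "exclude", "exclude", "include",
              "exclude", "include", "exclude", "exclude", "exclude"]
screener_2 = ["include", "exclude", "exclude", "exclude", "include",
              "exclude", "include", "include", "exclude", "exclude"]

agreement, kappa = cohen_kappa(screener_1, screener_2)
print(f"observed agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this made-up sample the screeners agree on 8 of 10 records, chance agreement is 0.52, and kappa is therefore (0.80 − 0.52) / (1 − 0.52) ≈ 0.58.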

Data extraction
We will develop a data extraction form (an Excel spreadsheet) by examining items reported in other relevant systematic reviews on subgroup analyses or on SW-CRTs [19,40,54,66,67] and developing new items specific to our review. The data extraction form will be pilot tested on a 10% sample of trials (for which full-text screening is complete) to refine the extraction items and to ensure that data will be collected consistently. Two reviewers will extract data from these trials and discrepancies will be identified and resolved through discussion. The data collection form will be improved if necessary. Once consensus is reached, the form will be finalized, and the trials will be randomly divided among the reviewers with two reviewers assigned to each trial. Reviewers will meet periodically (e.g. after every five trials have been completed) to review discrepancies and come to a consensus. If consensus cannot be obtained, a third investigator will resolve differences. Kappa statistics will be calculated to determine the agreement between the independent reviewers. We will extract data in multiple domains related to our objectives. First, we will extract information on study characteristics: first author, title, year of publication, type of article (e.g. primary report or secondary report of a completed trial), country, setting (e.g. non-health care), type of design (e.g. cross-sectional), number of clusters, number of sequences, number of periods, number of participants per step, step length, trial duration, and presence of subgroup analyses.
Second, we will extract characteristics of the subgroup variables examined: number of subgroup variables, whether subgroup variables were defined at cluster level or individual level, type of each subgroup variable (e.g. categorical), number of categories, how subgroup variables were measured during the data collection phase (e.g. continuous), and how the cut-off was justified (if categorised). Third, we will extract detailed information about the statistical methods used to perform subgroup analyses in SW-CRTs. We will consider a) methods used for the outcome on which subgroup analyses were based: outcome variable (e.g. primary outcome), type of outcome variable (e.g. continuous), unit of analysis (e.g. individual-level analysis), distribution (e.g. binomial), statistical models (e.g. Generalised Linear Mixed Model), association measure (e.g. risk ratio), assumptions about the correlation structure (e.g. model suggested by Hussey et al. [17]), whether time was modelled as discrete or continuous, whether the analysis adjusted for time, whether there was any interaction between time and treatment, and whether the effect of the intervention on the outcome of interest was statistically significant. We will then consider b) statistical methods specific to subgroup analyses: methods used to assess subgroup effect (e.g. test for interaction), scale on which interaction was assessed (e.g. multiplicative), whether the test for interaction was significant, which interaction terms specifically were included in the statistical model, what statistical methods were used to adjust for multiple subgroup hypothesis testing, and whether authors claimed a subgroup effect (reported the treatment effect by subgroup instead of the overall treatment effect) [40]. Fourth, we will examine adherence to the recommendations most consistently suggested for subgroup analyses in general, including in clinical trials (Table 1): whether a rationale for subgroup analyses was provided, type of subgroup analysis (e.g. prespecified), whether a formal test for interaction was used to assess subgroup effect, whether multiple subgroup analyses were performed, and whether a correction for multiplicity was applied when multiple subgroup analyses were performed.

Quality assessment
As our study is a systematic review of the statistical methods used to perform subgroup analyses in SW-CRTs, we will not assess the quality of the included trials. In addition, the quality of subgroup analyses in SW-CRTs will not be assessed, since no methodological tool has been designed for this purpose.

Data synthesis
Kappa statistics will be calculated to determine the agreement between reviewers for the main items. We will consider agreement as "fair" when kappa values are between 0.40 and 0.59, "good" when values are between 0.60 and 0.74, and "excellent" when values are greater than or equal to 0.75 [68]. Our results will be presented separately according to whether the reports are protocols or completed original studies. We will perform a descriptive analysis in four steps. First, we will describe the characteristics of all included SW-CRTs. We will use the median (interquartile range) to describe continuous variables and frequency (percentage) to describe categorical variables. We will then identify the number of completed original studies (or protocols) that performed (or planned to perform) subgroup analyses, overall and in each category of study. Second, we will describe the characteristics of the subgroup variables examined. Third, we will identify and describe the statistical methods used to perform subgroup analyses in SW-CRTs. Fourth, we will determine, for each of the five items (Table 1), the prevalence of adherence of the included trials to the recommendations most consistently suggested for subgroup analyses in general, including in clinical trials. Analyses will be performed using version 9.4 of SAS software.
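The kappa bands and the descriptive summaries described above can be illustrated with a short Python sketch. It is not the SAS code that will be used for the analyses. The function names and example data are hypothetical, the bands are treated as half-open intervals (so that a value such as 0.595 falls in "fair"), and the label returned for values under 0.40 is a placeholder, because the text does not name that band. The quartile method is the default of Python's `statistics.quantiles` and may differ from the SAS default.

```python
import statistics
from collections import Counter


def kappa_band(kappa):
    """Map a kappa value to the agreement labels used in this protocol."""
    if kappa >= 0.75:
        return "excellent"
    if kappa >= 0.60:
        return "good"
    if kappa >= 0.40:
        return "fair"
    return "below fair"  # placeholder label: this band is not named in the text


def describe_continuous(values):
    """Return (median, first quartile, third quartile)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), q1, q3


def describe_categorical(values):
    """Return {level: (frequency, percentage)}."""
    n = len(values)
    return {level: (count, 100 * count / n)
            for level, count in Counter(values).items()}


# Hypothetical extracted data: clusters per trial and presence of subgroup analyses.
clusters_per_trial = [12, 20, 8, 30, 16]
has_subgroup_analysis = ["yes", "no", "yes", "yes"]

median, q1, q3 = describe_continuous(clusters_per_trial)
print(f"clusters: median {median} (IQR {q1} to {q3})")
print(describe_categorical(has_subgroup_analysis))
print(kappa_band(0.58))
```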

Discussion
This protocol offers a reproducible and transparent procedure for a systematic review of the literature. We aim to determine how often subgroup analyses are performed in SW-CRTs and what statistical methods are used. We hope the review will achieve the following. First, it will fill gaps in the literature on both subgroup analyses (rationale) and SW-CRTs (methods). Second, it will provide information on the characteristics of the subgroup variables examined in SW-CRTs. Third, it will help researchers to perform such analyses, as recommended in the extension of the CONSORT guidelines for SW-CRTs [24,25].
We hypothesize that subgroup analyses are rare in SW-CRTs, as they are rare in health-related CRTs in general [35]. We also hypothesize that the prevalence of adherence to the first two of the five recommendations most consistently suggested for subgroup analyses, i.e. justifying the subgroup analyses and specifying them a priori (Table 1), will be very low, as Moreira et al. found [42]. Finally, by reviewing the statistical methods used to perform subgroup analyses in SW-CRTs [24], this review will lay the foundation for the development of specific recommendations for such analyses.
Potential limitations of our study relate to the selection of articles. First, we will not systematically search the grey literature (unpublished articles). We plan to contact authors of relevant papers, but the rate of response in this situation could be low [69,70]. In addition, data from unpublished studies can introduce bias [71]. They may reflect an unrepresentative sample of all unpublished studies and be of lower methodological quality than published studies. Furthermore, as the selection criteria focus strongly on methodology, we consider it unlikely that a validated new statistical method for performing subgroup analyses would be found only in unpublished studies. Second, our results will depend mainly on the information reported in trial publications. Authors may have conducted subgroup analyses but failed to report them. We plan to contact authors, but because of the low rate of response [69,70], the prevalence of reporting subgroup analyses in SW-CRTs may be underestimated. Third, we could also miss some studies reported in languages other than English. However, since English remains the preferred language of scientific communication [72,73], we anticipate that only a small number of studies will be missed.
Ethical approval is not required because we will use only the literature as our data source. For the same reason, informed consent is not required.

Consent for publication
Our manuscript contains no individual person's data in any form.

Availability of data and materials
Data and materials used at this step are available in our protocol. We provide the PRISMA-P checklist as an additional file. All data and materials that will be used during the review will be available from the corresponding author.

Competing Interests
The authors declare that they have no competing interests. The funders had no role in developing the protocol.

Authors' contributions
Évèhouénou Lionel Adisso made substantial contributions to the study conception, design and analysis, drafted the paper and substantively revised it; Monica Taljaard made substantial contributions to the study conception, design and analysis and substantively revised the paper; Hervé Tchala Vignon Zomahoun made substantial contributions to the study conception, design and analysis, and substantively revised the paper; Louis-Paul Rivest made substantial contributions to the study conception and design; Pierre Jacob Durand made substantial contributions to the study conception and design; France Légaré made substantial contributions to the study conception, design and analysis, substantively revised the paper and is the guarantor of the review.

Table 1. Recommendations most consistently suggested for subgroup analyses in general
1) Rationale for subgroup analyses
The first step in evaluating a subgroup analysis is to determine its logical sense [34]. Credibility is higher if there is a compelling causal rationale explaining the subgroup effect (biologic rationale, clinical rationale, other mechanism), and lower if not [26,34,40,51,52].
2) A priori specification of subgroup analyses
Subgroup analyses that are performed to test hypotheses generated before the study has started should be clearly distinguished from those identified after the main trial analyses are performed [34]. Credibility is higher if investigators stated a hypothesis prior to performing the study, and lower if an explanation arose post hoc (confirmatory vs. exploratory; hypothesis testing vs. hypothesis generating) [26,34,40,42,51,52].
3) An explicit test for interaction should be used to assess subgroup effect
To determine whether treatment efficacy differs between subgroups, it is recommended to use a formal test for interaction [26,34,42]. Credibility is higher if an interaction test (test of homogeneity, test of heterogeneity) suggests a small likelihood of a chance finding, rather than compatibility with chance or no interaction test at all [40,51,52].
4) Authors limit the number of subgroup analyses to be performed
Authors should perform a small number of subgroup analyses [26,34,42]. Credibility is higher if only a small number of subgroup effects have been tested [40,51,52].
5) Indicate potential effect on type I errors (false positives) due to multiple subgroup analyses and report methods used to address this effect
The greater the number of simultaneous subgroup analyses performed, the greater the probability of a false-positive finding [34]. Therefore, the significance of within-subgroup treatment effects should be adjusted for multiplicity [26,42]. Credibility is higher if investigators accounted formally or informally for multiplicity [40,51,52].