Choosing a trial design
Eleven out of 28 documents (39%) provided guidance relating to trial design. See D1 in Supplementary Table 1.
The choice of trial design depends on the unit of randomisation and the intervention of interest. The key aspects of relevant designs are briefly summarised here. Many design options, and their associated limitations, were discussed, but no single document provided a comprehensive summary.
In individually randomised trials, patients are the unit of randomisation. [Craig, 2019] When conducting these trials in surgery, differential expertise between the treatments being investigated can raise issues; these can be alleviated by defining eligibility criteria for centres and surgeons, such as years in practice or number of interventions performed previously. [Boutron, 2008; National Institute for Health Research website, 2016] However, applying criteria that are too strict may reduce the generalisability of trial results. [Zwarenstein, 2008] Instead, a statistical analysis of inter-rater reliability between individual centres and surgeons can quantify any impact of expertise. This type of analysis can be useful when considering rolling out the interventions into routine healthcare; see Analysing a trial with clustering and learning. [National Institute for Health Research website, 2016]
In cluster randomised trials, groups of patients are the unit of randomisation. These designs are less common and generally less efficient than individually randomised studies. They require more surgeons and, even after inflating the sample size to account for the intraclass correlation coefficient (ICC), introduce the potential for the treatment comparison to be confounded with its delivery. [Craig, 2019; Ergina, 2009; Campbell, 2012]
Expertise-based designs are a half-way house between individually and cluster randomised trials. Patients are individually randomised to a surgeon, who treats all of their patients with a single intervention. This can be the surgeon’s preferred technique, or is an intrinsic feature of trials comparing interventions delivered by different specialties. [Boutron, 2008; Ergina, 2009] This design shares the limitations of cluster trials; in addition, when a surgeon only performs their preferred technique, shared waiting lists [Ergina, 2009] and understanding how the treatment can be rolled out into routine healthcare can be a challenge. As a result, this design is relatively uncommon. [Conroy, 2019; Conroy, 2019]
Tracker designs, proposed by Ergina et al, allow new or evolving interventions to be developed within a single randomised study, with the incremental changes to the intervention tracked within the analysis; in practice, however, they would be very challenging to deliver. [Ergina, 2009]
Considering who will deliver the intervention
Thirteen out of 28 documents (46%) discussed the importance of deciding who will deliver the intervention. See D2 in Supplementary Table 1.
Some variation in delivery will depend, in part, on the skill and training of those delivering the intervention. [Boutron, 2008; Ergina, 2009; Jackson, 2010] As such, the selection of centres and treatment providers was a critical element of design discussed by a number of guidance documents. [Campbell, 2012; ICH E9, 1999; ICH E6, 1996; MHRA, 2021] Any eligibility criteria for participating centres and treatment providers, along with a description of the degree to which they are typical, should be reported. [Boutron, 2008; Zwarenstein, 2008; Boutron, 2017]
Suggested selection criteria for centres related to caseload for the procedure under investigation, sufficient numbers of the target population, and the facilities needed to deliver the trial. [Boutron, 2008; MHRA, 2021]
No guidelines provided advice on selecting treatment providers. Treatment providers could be a limited group or all professionals offering the intervention. [Elias, 2019] If a limited group, the guidance on selecting centres, and the associated reporting requirements, may serve as a proxy for triallists deciding how to select treatment providers, for example by caseload or specific qualifications. [Boutron, 2008; MHRA, 2021; Boutron, 2017]
The results of the main trial should report on the number of centres and treatment providers performing each intervention. [Boutron, 2017]
Ensuring that the intervention is standardised
Fifteen out of 28 documents (54%) discussed the importance of standardising the intervention. See D3 in Supplementary Table 1.
Variation in delivery can be reduced by standardising all, or aspects of, the intervention of interest. Limiting variation in treatment delivery may be more desirable in an efficacy trial than in a pragmatic, effectiveness study. [Craig, 2019; McCulloch, 2009] In pragmatic trials, standardisation might consist of simply instructing treatment providers to perform the treatment as usual. [Boutron, 2008] Regardless of the stage, trial delivery should be similar at all centres [ICH E9, 1999] and designed such that a clear description of the procedures performed can be provided. [Zwarenstein, 2008; Vanhie, 2016] Investigator meetings to prepare investigators and standardise performance were suggested by one document. [ICH E8, 1998]
Monitoring treatment adherence was an important aspect across documents. [Boutron, 2008; ICH E9, 1999; McCulloch, 2009; ICH E8, 1998; ICH E3, 1995] Suggested methods included reviewing case report forms, videotapes and audiotapes, extending to decertifying and excluding surgeons not submitting a videotape rated acceptable by an independent committee. [Boutron, 2008]
Reporting in-depth details of the intervention, and the comparator, was required by a number of documents. Required aspects included technical procedures; full details of preoperative, intraoperative and postoperative care; and the extent to which delivery was permitted to vary between participants, treatment providers and centres. [Boutron, 2008; Zwarenstein, 2008; ICH E3, 1995]
Anticipating changes over time
Eight out of 28 documents (29%) discussed considering changes in delivery of the intervention over time. See D4 in Supplementary Table 1.
Delivery may still vary irrespective of training, experience and other steps to enforce standardisation. The amount of variation will depend on the stage and technicality of intervention development. [Craig, 2019; Boutron, 2008; McCulloch, 2009; Bilbro, 2021] An important aspect of surgical evaluation across the guidelines was that delivery may change over time for pragmatic reasons, changes in external factors, or as a result of expertise developing during the study. [Craig, 2019; Ergina, 2009; McCulloch, 2009]
Expertise can develop over a very long time, so requiring a set expertise level can slow the delivery of surgical trials. [Ergina, 2009] Some guidelines discussed evaluating the learning curve within the trial, [McCulloch, 2009] and highlighted that this is particularly important in earlier phase trials. [Bilbro, 2021] In trials comparing more established techniques, the statistical advantages and gain in ‘internal validity’ need to be weighed against the loss of generalisability, or ‘external validity’, of placing too much emphasis on the learning curve. [Craig, 2019]
Reporting learning curve assessment results was required by one document but this was limited to early phase studies. [Bilbro, 2021]
Estimating the sample size
Eight out of 28 documents (29%) discussed sample size. See D5 in Supplementary Table 1.
A number of guidance documents highlighted the impact on the sample size and power calculation of failing to reduce variation within trial arms by standardising the intervention; typical calculations assume that differences between the treatments across centres, or treatment providers, are unbiased estimates of the same quantity. [Craig, 2019; ICH E9, 1999] In the presence of multilevel data structures, where variability in individual-level outcomes can reflect higher-level processes, calculations are more complicated. [Jackson, 2010; ICH E9, 1999; Cook, 2004] To avoid the associated imprecision in results, the sample size should be adjusted for any clustering effects, as estimated by the intraclass correlation coefficient (ICC), and this should be reported in the main results paper. [Boutron, 2008; Boutron, 2017] Conversely, two documents that discussed sample size did not comment on adjusting for clustering. [National Institute for Health Research website, 2016; MHRA, 2021]
Ensuring balance of treatment within centre and treatment provider
Six out of 28 documents (21%) discussed ensuring that treatment allocations are equally distributed within centre. See D6 in Supplementary Table 1.
Balancing treatment groups with respect to prognostic factors enhances trial credibility. [MHRA, 2021; EMA, 2021] Ensuring balance of patients within centre was highlighted as important within many of the guidance documents, [ICH E9, 1999; MHRA, 2021; EMA, 2021] and similar reasoning would lead surgical triallists to extend this to treatment provider, which was not discussed within any document.
Balance can be achieved by stratifying the randomisation, and stratifying by centre was a common topic, particularly when centre is expected to be confounded with other prognostic factors. [ICH E9, 1999; MHRA, 2021; EMA, 2021] When there are too few patients per centre, stratifying by a larger unit, such as country or region, may be warranted. [EMA, 2021] Although stratifying by treatment provider was not specifically addressed within the documents, in some circumstances it may be desirable to stratify by both centre and treatment provider, or by treatment provider alone, where numbers allow. [EMA, 2021] The use of more than two stratification factors is rarely necessary. [ICH E9, 1999]
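To make the stratification idea concrete (a generic sketch, not a scheme prescribed by any of the documents; all names, strata and block sizes are illustrative), stratified randomisation is commonly implemented with permuted blocks generated separately for each stratum, so that allocations stay balanced within every centre or treatment provider:

```python
import random

def stratified_block_allocations(n_per_stratum, strata, block_size=4, seed=2024):
    """Generate a permuted-block allocation list per stratum (e.g. centre,
    or centre x treatment provider). Each block contains equal numbers of
    arms "A" and "B" in random order, so balance holds within stratum."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    lists = {}
    for stratum in strata:
        allocations = []
        while len(allocations) < n_per_stratum:
            block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
            rng.shuffle(block)
            allocations.extend(block)
        lists[stratum] = allocations[:n_per_stratum]
    return lists

# Hypothetical two-centre trial, 12 patients per centre:
lists = stratified_block_allocations(12, ["centre_1", "centre_2"])
print(lists["centre_1"].count("A"))  # 6: arms balanced within the stratum
```

Because each stratum draws from its own block sequence, no centre can end up with a large imbalance between arms, which is the property the guidance documents are asking for.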
Analysing a trial with clustering and learning
When the randomisation was stratified
Two out of 28 documents (7%) provided guidance on adjusting the analysis following stratification. See A1 in Supplementary Table 1.
Stratifying randomisation and subsequently adjusting the analysis, are complementary methods of accounting for prognostic factors, unless the stratification factor was chosen for administrative reasons only. [ICH E9, 1999; EMA, 2021]
Two documents discussed the issue of adjusting for too many, or too small, strata in the analysis, for which there is no ideal solution. [ICH E9, 1999; EMA, 2021] Ignoring centres that were included in the randomisation scheme, or adjusting for a large number of small centres, might lead to unreliable estimates of the treatment effect and p-values. [EMA, 2021] At the least, an unadjusted analysis should be supported by sensitivity analyses indicating that trial conclusions are not affected by this choice. [EMA, 2021] As above, the statistical justifications for adjusting for centre could be considered to extend to treatment provider in surgical trials, but no guidance specifically addressed this.
When analysing the primary outcome
Two out of 28 documents (7%) provided guidance on adjusting the primary outcome analysis. See A2 in Supplementary Table 1.
Unexplained differences between treatments, for example between adjusted and unadjusted analyses, can jeopardise the trial results. [EMA, 2021] For this reason, when the primary outcome is expected to be influenced by centre or treatment provider, an adjustment should be planned. When the potential value of an adjustment is in doubt, for example when there is little prior knowledge, the primary analysis should be unadjusted, supported by an adjusted analysis. [ICH E9, 1999; EMA, 2021] In general, larger datasets support adjustment for more factors than smaller ones, and results based on simpler models are more numerically stable, rest on assumptions that are easier to validate, and generalise better. [EMA, 2021]
Analysing multi-centre trials
Six out of 28 documents (21%) provided guidance on analysing multi-centre trials. See A3 in Supplementary Table 1.
Investigations into heterogeneity of the main treatment effect across centre and/or treatment provider were covered by a number of documents. [Boutron, 2008; ICH E9, 1999; McCulloch, 2009; ICH E3, 1995; Bilbro, 2021] Further, the main trial publication should report methods to adjust for, and the results of investigations into, clustering by centre or treatment provider. [Boutron, 2008; Boutron, 2017] These investigations are critical when a positive treatment effect is found and there are appreciable numbers of subjects per centre. [ICH E9, 1999] In the simplest multi-centre trial, a single investigator recruits and is responsible for all patients within a single hospital, such that centre is identified uniquely by hospital. When the definition of centre is ambiguous, for example when a single investigator recruits from several hospitals or a clinical team recruits from numerous clinics, the protocol should provide a definition. [ICH E9, 1999; ICH E3, 1995]
Quantitative approaches may comprise graphical displays of the results of individual centres, such as forest plots, or analytical methods, such as a significance test, although the latter generally has low power. [ICH E9, 1999] One document stated that investigations should use a model which allows for centre differences but no interaction terms. [ICH E9, 1999] Fixed or mixed effects models can be used, although mixed models are especially relevant when there is a large number of centres. [ICH E9, 1999; ICH E3, 1995]
Methods for investigating the learning curve
Four out of 28 documents (14%) provided guidance on analysing the learning curve within centre and/or treatment provider. See A4 in Supplementary Table 1.
Reporting of continuous quality control measures can be useful in all phases of trial, particularly early phase surgical trials. [McCulloch, 2009; Bilbro, 2021] Time series and longitudinal models, or multilevel models, can be used to analyse long and short sequences of data respectively. [Craig, 2019; Jackson, 2010] Simpler exploratory methods, such as cusum plots, enable centres or surgeons to be compared against themselves, which surgeons may find preferable. [McCulloch, 2009; Bilbro, 2021]
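As a sketch of how a cusum chart tracks a learning curve (a generic Bernoulli log-likelihood-ratio cusum; the rates, outcome sequence and function name are illustrative assumptions, not values from the guidance documents), each operation moves a running statistic up on a failure and down on a success, with the statistic bounded below at zero:

```python
import math

def cusum_failures(outcomes, p0=0.05, p1=0.15):
    """Cusum statistic for a sequence of binary surgical outcomes
    (1 = failure/complication, 0 = success). Scores are Bernoulli
    log-likelihood ratios comparing an acceptable failure rate p0
    against an unacceptable rate p1; the statistic resets at zero,
    so only sustained runs of failures accumulate."""
    w_fail = math.log(p1 / p0)                    # positive increment on failure
    w_success = math.log((1 - p1) / (1 - p0))     # negative increment on success
    stat, path = 0.0, []
    for y in outcomes:
        stat = max(0.0, stat + (w_fail if y else w_success))
        path.append(stat)
    return path

# Hypothetical early-phase sequence for one surgeon:
path = cusum_failures([0, 0, 1, 0, 1, 1, 0])
# A run of failures pushes the statistic up; crossing a pre-set
# threshold h would flag performance worth reviewing.
```

Because the statistic compares each surgeon only against the chosen reference rates, it supports the within-surgeon comparison that the documents note may be more acceptable than league tables across surgeons.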
Methods for investigating clustering
Five out of 28 documents (18%) provided guidance on investigating clustering due to centre and/or treatment provider. See A5 in Supplementary Table 1.
Hierarchically structured data, such as patients within surgeon, can be analysed using multilevel models or generalised estimating equations (GEEs). [Craig, 2019; Boutron, 2017] Multilevel models are subject-specific models, whereas GEEs are population-average models. For multilevel models, fixed, random or mixed effects can be specified to account for clustering, [Boutron, 2017] and different types of these models allow for flexible data structures. [Jackson, 2010]
For ordinary linear models, the treatment effect estimate is likely to be similar, but not necessarily identical, for adjusted and unadjusted models. Adjusted analyses are more efficient, so a less significant result from the unadjusted analysis should not be a concern. For generalised linear or non-linear models, adjusted and unadjusted treatment effects may not have the same interpretation and may give different results. [EMA, 2021]
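The last point can be seen with a small worked example (illustrative numbers, not from the cited documents): odds ratios are non-collapsible, so even with two equally sized centres sharing an identical within-centre odds ratio and no confounding, the pooled (unadjusted) odds ratio is attenuated towards 1.

```python
def odds_ratio(p_treat, p_ctrl):
    """Odds ratio comparing a treated-arm risk with a control-arm risk."""
    return (p_treat / (1 - p_treat)) / (p_ctrl / (1 - p_ctrl))

# Hypothetical risks (treated, control) in two equally sized centres,
# each with the same conditional odds ratio of 9:
risks = {"centre_1": (0.5, 0.1), "centre_2": (0.9, 0.5)}
for p_treat, p_ctrl in risks.values():
    print(round(odds_ratio(p_treat, p_ctrl), 1))  # 9.0 in both centres

# Pooling patients across centres averages the risks...
p_treat_all = (0.5 + 0.9) / 2   # 0.7
p_ctrl_all = (0.1 + 0.5) / 2    # 0.3
# ...and the unadjusted odds ratio is ~5.44, not 9, despite no confounding.
print(round(odds_ratio(p_treat_all, p_ctrl_all), 2))
```

This is why, for generalised linear models, the adjusted and unadjusted estimands answer different questions (a within-centre versus a population-average effect), and why a discrepancy between them does not by itself indicate an analysis error.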