Remote expert elicitation to determine the prior probability distribution for Bayesian sample size determination in international randomized controlled trials: Bronchiolitis in Infants Placebo Versus Epinephrine and Dexamethasone (BIPED) Study

Background: Bayesian methods are increasing in popularity in clinical research. The design of Bayesian clinical trials requires a prior distribution, which can be elicited from experts. Current elicitation approaches use either face-to-face sessions or expert surveys. In diseases with international differences in management, the elicitation exercise should recruit internationally, which requires either expensive face-to-face sessions or surveys that suffer from low response rates. To address this, we developed a remote, real-time elicitation exercise to construct prior distributions. These elicited distributions were then used to determine the sample size of the Bronchiolitis in Infants with Placebo Versus Epinephrine and Dexamethasone (BIPED) Study, an international randomized controlled trial in the Pediatric Emergency Research Network (PERN). The BIPED study aims to determine whether the combination of epinephrine and dexamethasone, compared to placebo, is effective in reducing hospital admission for infants presenting with bronchiolitis to the emergency department.

Methods: We developed a web-based tool to support the elicitation of the probability of hospitalization for infants with bronchiolitis. Experts participated in online workshops to specify their individual prior distributions, which were aggregated using the equal-weighted linear pooling method. The Average Length Criterion determined the BIPED sample size.

Results: Fifteen paediatric emergency medicine clinicians from Canada, USA, Australia and New Zealand participated in three workshops to provide their elicited prior distributions. In the aggregate distribution, the elicited probability of admission for infants with bronchiolitis was slightly lower for those receiving epinephrine and dexamethasone than for those receiving supportive care. There were substantial differences in the individual beliefs but limited differences between North America and Australasia. From this aggregate distribution, a sample size of 410 patients per arm results in an average 95% credible interval length of less than 9% and a relative predictive power of 90%.

Conclusion: Remote expert elicitation is a feasible, useful and practical tool to determine a prior distribution for international randomized controlled trials. Bayesian methods can then determine the trial sample size using these elicited prior distributions. The ease and low cost of remote expert elicitation mean that this approach is suitable for future international randomized controlled trials.

Trial Registration: clinicaltrials.gov Identifier: NCT03567473

Background

Bayesian statistical methods use Bayes' theorem to combine information from observed data with previous evidence, characterised in a prior distribution, to make inferences about the parameters in a statistical model (1). Bayesian methods have become increasingly popular in clinical research for several reasons, as concern about frequentist methods has increased (2). They can formally incorporate external evidence into the trial conclusions, rather than making definitive conclusions based on evidence from a single trial (3). They also provide a more natural interpretation of uncertainty (4), and easily allow for frequent monitoring and adaptive designs (5).

To take advantage of Bayesian methods, the sample size for the proposed trial must be determined. Bayesian methods for sample size determination (SSD) have several advantages over frequentist SSD methods.
First, Bayesian SSD methods incorporate the statistical uncertainty that is inherent in the estimates of key quantities (6). This contrasts with frequentist SSD methods, where fixed values for several key quantities, such as size and the target difference (7), must be specified and the required sample size is highly sensitive to the values selected for these quantities (8). Second, frequentist SSD methods do not consider clinicians' current beliefs about a treatment, meaning that trial results that contradict strongly held beliefs are often not convincing enough to change clinical practice (9). Finally, sample sizes calculated using frequentist methods are often hard to achieve or even infeasible in rare diseases (10). In this setting, Bayesian SSD methods can reduce the required sample size by combining trial data with other information, such as expert knowledge or earlier studies, to provide a similar level of scientific certainty (11).

To utilize Bayesian SSD methods, a "prior distribution" must be defined to represent the currently available evidence about the key parameters of interest (12). This prior distribution can be defined in several ways, including using historical empirical data (13), expert knowledge or a combination of the two (8). To use expert knowledge as the basis for a prior distribution, this knowledge must be converted into a quantitative expression. This is commonly achieved through a structured "elicitation process" (14) in which experts are assisted in converting their knowledge into a prior distribution through a series of steps that are viewed as formal data acquisition processes based on validated methodologies (15).

Expert elicitation in clinical trials is becoming more frequent. A recent literature review identified 42 studies relating to clinical trial design and analysis from 460 studies discussing Bayesian prior elicitation (16).
Elicitation has been used, for example, in randomised controlled trials (RCTs) that compare treatments for trauma resuscitation (17), bacterial corneal ulcers (18) and rare diseases (19). However, these elicitation studies required experts to meet in person, which can be difficult to arrange and extremely expensive, especially in international studies, and is currently restricted due to the COVID-19 pandemic. An alternative approach to in-person meetings is to undertake a survey (20). However, survey-based elicitation exercises often have low response rates and only allow for limited assistance during the expert elicitation session (21). Furthermore, experts are not able to discuss and calibrate their beliefs, which is key to most elicitation frameworks (22,23).

As the goal of an RCT is to gather robust empirical evidence that could change clinical practice and health outcomes, the prior for the key parameters in an international RCT should robustly represent the beliefs of experts in all health systems where the trial results would be implemented. This representation is particularly important in diseases where there are regional (international) differences in clinical practice and presentation patterns. Therefore, an alternative, efficient, remote elicitation process is required to generate representative priors to support Bayesian SSD for international RCTs.

Bronchiolitis, a viral infection of the small and medium airways and the most common reason for infants less than one year of age to be admitted to hospital in the developed world, is a disease with strong regional differences in clinical practice (24). Currently recommended management of bronchiolitis is predominantly the provision of parenteral fluids for hydration and oxygen for hypoxemia, called "supportive care" (25,26,27,28,29,30).
Despite a lack of high-quality evidence, use of additional pharmacotherapy such as nebulized epinephrine, albuterol, hypertonic saline or oral corticosteroids varies by region, with odds ratios for the use of any of these of 11.5 in Canada and 6.8 in the United States, relative to Australia and New Zealand (24). While the provision of pharmacotherapy is not supported by most guidelines, exploratory evidence suggests that the combination of inhaled epinephrine and oral corticosteroids has the potential to reduce hospital admission by a third in infants presenting to emergency departments (EDs) with bronchiolitis (31).

The Bronchiolitis in Infants with Placebo versus Epinephrine and Dexamethasone (BIPED) study is an international RCT, taking place in Canada, New Zealand and Australia, that compares inhaled epinephrine and oral dexamethasone (a corticosteroid) to placebo in the treatment of infants presenting to EDs with bronchiolitis, with admission to hospital as the primary outcome. Given the regional differences in bronchiolitis management and the geographical spread of BIPED sites, we set out to develop a remote, real-time elicitation exercise to overcome the limitations of the currently used elicitation exercises. From this exercise, we were able to provide a well-justified, representative prior to be used in the SSD and analysis of the BIPED study. This paper describes our novel approach to remotely elicit expert opinions for the BIPED study and the resulting Bayesian SSD.

The BIPED study

The BIPED study is a phase III, multi-centre, randomized, double-blind, placebo-controlled trial within the Pediatric Emergency Research Network (PERN) (32) to determine whether the combination of inhaled epinephrine and oral dexamethasone (EpiDex) is successful at reducing hospitalisation within the seven days following an initial presentation to an ED with bronchiolitis. The BIPED study is enrolling participants across 12 international sites: 6 in Canada, 3 in New Zealand and 3 in Australia. The study will enrol infants aged between 60 days and one year who present to the ED with an episode of wheezing or crackles, alongside signs of an upper respiratory tract infection, during the peak season for respiratory syncytial virus (RSV). The active treatment, to be compared with a placebo control, is two treatments of epinephrine (either via nebulisation (3 mg) or via metered dose inhaler and spacer (625 mcg)) given 30 minutes apart in the ED and two doses of once-daily oral dexamethasone (0.6 mg/kg per dose, up to a maximum of 10 mg). Participants will be randomised in a 1:1 ratio to either the placebo or the EpiDex combination therapy. The BIPED study aims to provide the requested additional evidence (33,34) after a previous study unexpectedly found that EpiDex had reduced symptoms sufficiently to decrease hospitalization within 7 days of an ED visit by one-third (31).

The BIPED study was approved by Health Canada and the local research ethics committee at each study site prior to enrollment. The remote elicitation exercise was approved by the Hospital for Sick Children research ethics committee. Implied consent was used for the remote elicitation exercise, meaning that by partaking in the elicitation exercise, the experts agreed that their data could be used for research purposes.

Designing the Remote Elicitation Exercise
Key Parameters and Clinical Setting. The primary outcome in the BIPED study is admission to hospital within 7 days following initial presentation to ED with bronchiolitis, which can be modelled using a binomial distribution. The key parameter of interest in the BIPED study is the probability of hospital admission within 7 days for each arm, placebo and EpiDex, denoted $\theta_1$ and $\theta_2$, respectively. Beta distributions are commonly used to model beliefs about probabilities as the beta distribution is constrained between 0 and 1, has a flexible shape and is conjugate to the binomial likelihood (35). Thus, in our elicitation exercise, we assume that each expert's prior can be expressed as a beta distribution.

To enable the elicitation, we developed a clinical case study (see Supplementary Material) of an infant with bronchiolitis who would meet the inclusion/exclusion criteria of the BIPED study and was likely equivocal with respect to admission into hospital (i.e., EpiDex could potentially improve infant prognosis if prior beliefs supported benefit). Experts were then asked to determine the expected number of patients, out of 100, with characteristics similar to this patient who would be admitted to hospital within 7 days under two different treatment options (EpiDex and supportive care). The goal of the elicitation exercise was to determine prior distributions for the BIPED Bayesian SSD and analysis. However, we decided that there was limited available expertise on the probability of admission under placebo and therefore focussed on eliciting the probability of admission under supportive care. We then assumed that the outcomes under supportive care would be similar to placebo in our Bayesian SSD.

Developing an Online Elicitation Tool. Our remote elicitation exercise was based on the Sheffield Elicitation Framework (SHELF) methodology (17,23).
Online tools have been developed to support the use of the SHELF framework (36) and we adapted these tools to support our elicitation about the number of hospitalizations for infants with bronchiolitis. For our remote elicitation exercise, we built a web-based interactive elicitation tool using R software and the shiny package (37,38) (https://phebelan.shinyapps.io/Elicitation/). In the online tool, experts were first asked to provide the lower and upper plausible values that subjectively described their beliefs about the number of infants with bronchiolitis who would be hospitalised within 7 days. We assumed that the lower and upper plausible values represented the limits of the 95% central credible interval in the beta distribution. Experts then provided their "Best" estimate for the number of hospitalizations, which we assumed was the mode of the beta distribution. Within the interface, we restricted the value for the mode to be within the plausible interval. By asking for the plausible range before the best estimate, we aimed to prevent experts from anchoring to their initial selection and thereby underestimating uncertainty (22). Within the online tool, experts were provided with a real-time individual beta distribution plot and a quantitative summary of their beliefs to help adjust their estimates if the fitted beta distribution did not represent their beliefs (see Supplementary Material).
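As a rough illustration of the fitting step described above, the sketch below converts a lower plausible value, a best estimate (mode) and an upper plausible value into beta parameters. It is not the code behind the online tool: the function names, the example numbers (20, 35 and 50 admissions out of 100) and the fitting rule are all assumptions. Because a beta distribution with a fixed mode has only one remaining free parameter, the sketch chooses the concentration so that the interval between the two plausible values carries 95% probability in total, rather than exactly 2.5% in each tail.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x; used here only with a > 1 and b > 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def coverage(lower, upper, a, b, steps=2000):
    """P(lower <= theta <= upper) under Beta(a, b), via Simpson's rule."""
    h = (upper - lower) / steps
    total = beta_pdf(lower, a, b) + beta_pdf(upper, a, b)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * beta_pdf(lower + i * h, a, b)
    return total * h / 3


def fit_beta(lower, mode, upper, level=0.95):
    """Return (a, b) with the given mode and `level` mass on [lower, upper]."""
    if not 0.0 <= lower < mode < upper <= 1.0:
        raise ValueError("need 0 <= lower < mode < upper <= 1")
    if upper - lower >= level:
        raise ValueError("interval too wide: a uniform prior already covers it")

    def shapes(k):
        # Any concentration k > 0 keeps the mode at (a - 1) / (a + b - 2).
        return 1 + mode * k, 1 + (1 - mode) * k

    k_low, k_high = 0.0, 1.0
    while coverage(lower, upper, *shapes(k_high)) < level:
        k_high *= 2  # widen the bracket until the interval is over-covered
    for _ in range(60):  # bisection on the concentration
        k_mid = (k_low + k_high) / 2
        if coverage(lower, upper, *shapes(k_mid)) < level:
            k_low = k_mid
        else:
            k_high = k_mid
    return shapes(k_high)


# Hypothetical expert: 20 to 50 admissions per 100 infants, best guess 35.
a, b = fit_beta(20 / 100, 35 / 100, 50 / 100)
print(f"Beta({a:.2f}, {b:.2f})")
```

A tool that matches the two tail probabilities separately, or that minimises a distance between elicited and fitted quantiles, would give somewhat different parameters for skewed judgements.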
While the online tool supported the elicitation process, the Research Electronic Data Capture (REDCap) application collected the elicited distributions from each expert. REDCap is a web-based application designed to support secure data capture for research studies (39,40). Once we had developed the online elicitation tool and REDCap database, we piloted the elicitation workshop three times internally (AP, SD, [...] workshops remotely to ensure they could be delivered seamlessly and were an efficient use of experts' time.
Selecting the Experts. The BIPED study is being conducted in 12 sites across Canada, Australia and New Zealand. Therefore, we aimed to recruit experts from Canada, the United States, Australia and New Zealand to determine representative aggregate priors across the regions in the study, avoiding selection bias. To be eligible for the study, participants had to be (i) individuals identified as experts in bronchiolitis and its treatment and (ii) clinicians with experience in pediatric emergency medicine (PEM). Participants were excluded if they had extensive prior involvement with the BIPED study, i.e., serving as a site principal investigator. Potential participants were invited by email to volunteer. We aimed to recruit between 10 and 20 experts to ensure a breadth of experience in terms of geography and speciality (14,41).

[...] a study dossier to read before attending the workshop. The goal of this dossier was to introduce the concept of an elicitation exercise and the currently available literature on treatments for bronchiolitis (22). Our study dossier included a published elicitation study (17) and four published studies presenting the use of epinephrine and/or dexamethasone as a treatment for bronchiolitis (31,44,45,46). The goal of including a previous elicitation study was to introduce the experts to the concept of elicitation, while the other studies were included to complement the experts' knowledge with the current literature.

Remote Expert Elicitation Workshop. We conducted three remote elicitation workshops using Zoom, a cloud-based video conferencing platform (47). Three facilitators from the BIPED study team with statistical and medical expertise attended each workshop. The workshop began with an introduction to the BIPED study and the rationale of Bayesian statistics.
To familiarise experts with the elicitation procedure, an example using our online elicitation tool was then shown. Experts were then asked to provide their personal beliefs about the chance of hospitalisation for the patient identified in the case study.

The elicitation exercise was structured over two rounds with a group discussion between the two rounds (22,23). In the first round, experts used our online elicitation tool to provide their individual prior distribution for the probability of hospitalisation with supportive care and EpiDex. The facilitator (JL) then generated a deidentified boxplot (shown in Figure S1) to display all the individual-level priors and support the group discussion. The group discussion allowed the experts to adjust and calibrate their responses but did not aim to reach a consensus (22). Thus, the group discussion began with the facilitator interpreting the individual boxplots before the experts were encouraged to share their beliefs and discuss their thoughts around the observed variations in beliefs across experts. When the group discussion no longer resulted in an exchange of information, the facilitator would manage the discussion to help promote critical thinking (48). Following the group discussion, experts were again asked to use the online elicitation tool to characterise their beliefs, and these results then generated the individual prior distributions to be pooled.
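The abstract states that the individual priors were aggregated by equal-weighted linear pooling. A minimal sketch of that operation follows; the three (a, b) pairs are made-up expert priors and the function names are illustrative, not taken from the study. Under a linear pool the aggregate density is the simple average of the experts' densities, so a draw from it can be generated by picking an expert uniformly at random and then sampling from that expert's beta distribution.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x (boundary values treated as 0, fine for a, b > 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def pooled_pdf(x, experts):
    """Equal-weighted linear pool: the average of the individual densities."""
    return sum(beta_pdf(x, a, b) for a, b in experts) / len(experts)


def pooled_mean(experts):
    """Mean of the pool, which is the average of the individual beta means."""
    return sum(a / (a + b) for a, b in experts) / len(experts)


def sample_pooled(experts, rng):
    """Draw from the pool: choose an expert uniformly, then sample their beta."""
    a, b = rng.choice(experts)
    return rng.betavariate(a, b)


# Hypothetical individual priors for the probability of admission.
experts = [(8.0, 14.0), (5.0, 20.0), (12.0, 18.0)]
rng = random.Random(2021)
draws = [sample_pooled(experts, rng) for _ in range(20000)]
print(round(pooled_mean(experts), 3), round(sum(draws) / len(draws), 3))
```

Note that the pooled distribution is a mixture of beta distributions and is generally not itself a beta distribution, which is one reason simulation-based methods are convenient downstream.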

To determine the sample size in the BIPED study, we use the average length criterion (ALC) for Bayesian SSD (49). This method selects the smallest sample size for which the average length of a specified posterior credible interval is below a given threshold. The ALC uses a preposterior analysis where the length of the posterior credible interval is estimated across the range of potential studies, as estimated by the prior-predictive distribution of the data (49). To achieve this, we simulated the number of hospitalisations within 7 days under the two treatments, drawing the probabilities of hospitalisation from the priors from the expert elicitation exercises and the outcomes from a binomial likelihood. These simulated data were combined with our aggregated prior to determine the posterior for the two probabilities of hospitalisation, using Markov chain Monte Carlo (MCMC) methods. We then calculated the 95% highest posterior density credible interval for the difference in the probability of admission across the two treatments, placebo and EpiDex. We estimated the average posterior credible interval length for sample sizes between 400 and 630 using 1500 simulations from the prior-predictive distribution and 5000 simulations from the posterior. We selected the smallest sample size for which the average length is below 0.09.

In the BIPED study, we will declare that EpiDex is superior to placebo if the posterior probability that the probability of hospitalisation under EpiDex is lower than the probability of hospitalisation under placebo exceeds a threshold $c$; that is, if $P(\theta_2 < \theta_1) > c$, where $\theta_1$ and $\theta_2$ are the probabilities of hospitalisation under placebo and EpiDex. To select the threshold $c$, we simulated data assuming the probability of hospitalisation is equal to 0.35 for both EpiDex and placebo and selected a threshold such that $P(\theta_2 < \theta_1) > c$ in at most 5% of the simulated studies.
For the fixed value of $c$, we then computed the frequentist power of the study as the proportion of simulated studies with $P(\theta_2 < \theta_1) > c$ when the probabilities of hospitalisation are 0.35 and 0.27 for placebo and EpiDex, respectively, representing a target difference of 8 percentage points. Finally, we computed the relative predictive power of the decision rule, defined as the proportion of simulated studies from the prior-predictive distribution for which $P(\theta_2 < \theta_1) > c$, standardised by the prior probability that EpiDex is superior to placebo. These three calculations were based on 8000 simulated trials with 5000 simulations from the posterior. All Bayesian analyses were performed using JAGS through R (38,50).
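The average length criterion described above can be sketched as follows. This is an illustration under simplifying assumptions, not the study's implementation: each arm gets a single hypothetical beta prior instead of the pooled expert prior, the posterior is obtained by beta-binomial conjugacy rather than MCMC in JAGS, an equal-tailed 95% interval stands in for the highest-density interval, and the simulation sizes are far smaller than the 1500 and 5000 used in the study. The threshold of 0.11 in the usage lines is chosen only so that these toy priors give an answer inside the grid.

```python
import random


def binomial(n, p, rng):
    """Number of successes in n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def interval_length(post_placebo, post_epidex, rng, draws=500, level=0.95):
    """Length of the equal-tailed interval for theta_placebo - theta_epidex."""
    diffs = sorted(
        rng.betavariate(*post_placebo) - rng.betavariate(*post_epidex)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    low = diffs[round(tail * draws)]
    high = diffs[round((1 - tail) * draws) - 1]
    return high - low


def average_length(n, prior_placebo, prior_epidex, rng, sims=100):
    """Average interval length over trials drawn from the prior predictive."""
    total = 0.0
    for _ in range(sims):
        # Prior-predictive step: draw the true rates, then the trial data.
        y_p = binomial(n, rng.betavariate(*prior_placebo), rng)
        y_e = binomial(n, rng.betavariate(*prior_epidex), rng)
        # Conjugate update: Beta(a, b) prior -> Beta(a + y, b + n - y) posterior.
        post_p = (prior_placebo[0] + y_p, prior_placebo[1] + n - y_p)
        post_e = (prior_epidex[0] + y_e, prior_epidex[1] + n - y_e)
        total += interval_length(post_p, post_e, rng)
    return total / sims


def smallest_n(candidates, threshold, prior_placebo, prior_epidex, seed=1):
    """Smallest per-arm sample size whose average length is below threshold."""
    for n in sorted(candidates):
        rng = random.Random(seed)  # same random stream for every candidate
        if average_length(n, prior_placebo, prior_epidex, rng) < threshold:
            return n
    return None


# Hypothetical priors (means 0.35 and 0.30) and a hypothetical threshold.
print(smallest_n(range(400, 631, 50), 0.11, (7.0, 13.0), (6.0, 14.0)))
```

With a pooled mixture prior the conjugate shortcut no longer applies directly, which is consistent with the study's use of MCMC for the posterior.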

Baseline Characteristics

We invited 25 PEM clinicians from Canada, the United States, Australia and New Zealand to participate in our three remote elicitation workshops. In total, 15 of these experts agreed to participate in the study: 9 from North America (NA) and 6 from Australia and New Zealand (ANZ). The three workshops contained 5 (2 NA; 3 ANZ), 4 (4 NA) and 6 (3 NA; 3 ANZ) participants, respectively. Table 1 [...] were less certain about the size of this reduction in the second round, demonstrating that the group discussion led the experts to be more conservative.

We explore the pooled prior distributions separately for each workshop (Figure S2 [...]

We computed the average length of the 95% highest posterior density credible interval for the difference in admission probability between the two arms (Figure 3). From these results, we specify a sample size of 410 participants per arm for the BIPED study, ensuring the average 95% credible interval is shorter than 9%. Adjusting for an expected 5% loss to follow-up, the sample size of the BIPED study is 432 per arm. The average 95% credible interval would be shorter than 8% if the BIPED study recruited 610 participants per arm.
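The adjustment for loss to follow-up is consistent with dividing the per-arm sample size by the expected retention and rounding up. The short check below assumes that this is how the figure of 432 was obtained, since the text does not spell out the calculation; the function name is illustrative.

```python
import math


def inflate_for_dropout(n_per_arm, dropout_rate):
    """Participants to enrol per arm so that about n_per_arm complete follow-up."""
    return math.ceil(n_per_arm / (1 - dropout_rate))


print(inflate_for_dropout(410, 0.05))  # 432
```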

Bayesian Sample Size Determination
With 410 participants per arm, we select $c = 0.99$ as the threshold for declaring that EpiDex is superior to placebo. With this threshold, we incorrectly conclude that EpiDex is superior to placebo when there is no effect in 4.6% of the simulated trials (type I error). This threshold then results in correctly concluding that EpiDex is superior in 81% of the simulated trials with a target difference of 8% (frequentist power), and in a relative predictive power of 90%.
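Operating characteristics of this kind can be approximated by simulation, as in the sketch below. It assumes flat Beta(1, 1) priors in both arms (the elicited pooled prior is not reproduced here), uses conjugate beta posteriors instead of MCMC, counts EpiDex as superior when its admission probability is lower than placebo's, and runs far fewer simulations than the study. It will therefore not reproduce the reported threshold of 0.99; it only illustrates how a threshold can be calibrated under no effect and how power is then read off under the 0.35 versus 0.27 scenario. All function names are illustrative.

```python
import math
import random

PRIOR = (1.0, 1.0)  # assumption: flat prior in both arms, not the elicited prior


def binomial(n, p, rng):
    """Number of successes in n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def prob_epidex_better(y_placebo, y_epidex, n, rng, draws=400):
    """Monte Carlo estimate of P(theta_epidex < theta_placebo | data)."""
    a0, b0 = PRIOR
    wins = 0
    for _ in range(draws):
        theta_p = rng.betavariate(a0 + y_placebo, b0 + n - y_placebo)
        theta_e = rng.betavariate(a0 + y_epidex, b0 + n - y_epidex)
        wins += theta_e < theta_p
    return wins / draws


def simulate(n, p_placebo, p_epidex, rng, trials=200):
    """Posterior probabilities of superiority across simulated trials."""
    return [
        prob_epidex_better(
            binomial(n, p_placebo, rng), binomial(n, p_epidex, rng), n, rng
        )
        for _ in range(trials)
    ]


def calibrate_threshold(null_probs, alpha=0.05):
    """Observed value c such that at most alpha of the null trials exceed it."""
    ordered = sorted(null_probs)
    return ordered[math.ceil((1 - alpha) * len(ordered)) - 1]


def rejection_rate(probs, c):
    """Share of simulated trials that would declare EpiDex superior."""
    return sum(p > c for p in probs) / len(probs)


rng = random.Random(42)
null = simulate(410, 0.35, 0.35, rng)         # no treatment effect
alternative = simulate(410, 0.35, 0.27, rng)  # target difference
c = calibrate_threshold(null)
print(c, rejection_rate(null, c), rejection_rate(alternative, c))
```

An informative prior that favours EpiDex shifts the null distribution of the posterior probability upwards, so a higher threshold is then needed to hold the type I error at 5%.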

We developed a remote elicitation framework, which offers a practical and convenient method for expert elicitation. Expert belief, elicited using this framework, can then form the basis of a Bayesian SSD and analysis for an international RCT, where the prior distribution should represent the diverse beliefs across the regions enrolling patients. Our remote framework allowed us to practically obtain diverse opinions by running a synchronous online exercise with a reasonably large number of diverse experts.
We enrolled 15 experts from 4 countries (Canada, United States, New Zealand and Australia) within a relatively short time frame, on a limited budget and under COVID-19 related travel restrictions, and determined a pooled prior distribution that represents the diversity of perspectives in an international trial. As our elicitation exercise involved a relatively short time commitment, we were able to achieve high response rates, resolving issues seen with survey-based elicitation (21). Finally, we were able to hold multiple elicitation workshops assisted by a facilitator to further broaden the range of experts who could attend.

Another advantage of our remote elicitation framework, compared to survey-based elicitation methods, is that we were able to have real-time facilitation and a group discussion. This facilitated expert interaction and allowed us to identify issues within the workshops. As can be seen from the differences in the distributions between the two rounds, the group discussion was critical in calibrating the experts' beliefs. In particular, the experts raised external factors that would influence the decision to admit an infant with bronchiolitis, such as hospital resources and family circumstances. Experts also shared their thoughts and clarifications related to the design of the elicitation exercise and their understanding of the elicitation task. We were also able to respond to any technical issues and ensure that all enrolled experts were able to provide responses.

The biggest challenge we encountered was scheduling the workshops to maximise attendance. Challenges included large differences in time zones between the countries and accommodating the shift patterns of practicing PEM clinicians working in the ED.
We decided to run multiple workshops so that a greater number of experts could participate, and aimed to include experts from each region in each workshop and to ensure there were enough participants to allow a fruitful group discussion. While we were largely successful, we found the scheduling of these workshops to be a significant challenge and highly recommend inviting a higher number of experts than required, as some schedules may be incompatible, especially when working across multiple time zones.

[...] workshop was scheduled for 90 minutes and the experts were invited to read five manuscripts before attending the workshop as preparation, estimated to take another 90 minutes. This minimal time commitment, compared to day-long meetings and travel, allowed us to recruit a range of experts to our study and was key to enrolling practicing PEM physicians. However, the fixed 90-minute meeting slot did limit the time available for presenting the theory behind elicitation, which could have impacted the quality of our elicited prior distribution.

To overcome challenges associated with standard methods for trial SSD and analysis and to enable successful application of Bayesian methods, we developed a remote elicitation framework that offers a comprehensive, practical, affordable approach to obtaining prior distributions for a Bayesian analysis of an international RCT. In such a trial, the current state of knowledge about the key parameters across the jurisdictions where the trial results will be implemented should be incorporated into the analysis. This prior distribution can be used to determine the appropriate sample size for a proposed Bayesian analysis of the completed RCT. Thus, our proposed remote elicitation process promotes the use of Bayesian methods in randomized controlled trials.

Figure 3. Average 95% posterior credible interval length for "admission probability difference" between placebo and EpiDex plotted across the BIPED clinical trial sample sizes increasing between 400 and 630 in increments of 5 (solid black line). Average Length Criterion (ALC) thresholds of 0.09 and 0.08 are plotted as dashed black lines (see text).
