In this section, we evaluate the performance of the proposed technique through simulations. No systematic comparison with typical Bayesian approaches is provided, as there always exist combinations of the tuning parameters for the model-based Bayesian methods that could match the results of the proposed method.
Let X follow a normal distribution with a mean of 0.3 and a standard deviation (SD) of 1, denoted as X ~ N(0.3, 1). We generate 100 such random samples to represent the current data and varying numbers of additional random samples, with varying means, to represent the external data. The cap on borrowing from the external data is set at 50. The hypothesis test is:
$$H_{0}: \mu \le 0 \quad \text{vs.} \quad H_{1}: \mu > 0, \qquad (3)$$
where \(\mu\) is the population mean. The nominal significance level is set at one-sided 0.025. To evaluate the type I error of the above scheme, the mean for the current data is set to 0 while the mean for the external data varies from 0.1 to 0.6, a moderate range relative to the SD of 1. For the power evaluation, the mean of the current data is set at 0.3. To put things into perspective, with a frequentist design (without borrowing), the statistical power at a mean value of 0.3 is 85% with a total sample size of 100 and 95% with a total sample size of 150. The simulations were conducted with 20,000 runs for type I error and 5,000 runs for power.
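The quoted frequentist benchmarks can be checked with a standard one-sample power calculation in R; the sketch below uses power.t.test() from base R (a z-test gives essentially the same numbers) and is only a sanity check of the no-borrowing design, not part of the proposed scheme.

```r
# Frequentist (no borrowing) power at a true mean of 0.3, SD = 1,
# one-sided alpha = 0.025; roughly 85% for n = 100 and 95% for n = 150.
power.t.test(n = 100, delta = 0.3, sd = 1, sig.level = 0.025,
             type = "one.sample", alternative = "one.sided")
power.t.test(n = 150, delta = 0.3, sd = 1, sig.level = 0.025,
             type = "one.sample", alternative = "one.sided")
```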
The first type of simulation discounts the external data only through the I² statistic, as described in the method section. Once the external data have been discounted, the normal conjugate for the location parameter is used to combine the two sources of data in order to test the hypothesis in (3). This method is denoted as the CJGI2 (conjugate plus I²) scheme. The external data are randomly generated with an SD of 1 given a specific mean. Of note, for a binary endpoint, the beta-binomial conjugate could be used instead.
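To make the scheme concrete, below is a minimal sketch of a single CJGI2 simulation run. It assumes that the I² statistic comes from a two-study meta-analysis of the current and external sample means via metagen() from the R package meta (with the external standard error reflecting the borrowing cap), that the factor (1 − I²) scales the cap of 50, and that the known-variance normal conjugate combines the two sources; the helper name one_run_cjgi2 and the exact bookkeeping are illustrative rather than the authors' implementation.

```r
# A minimal sketch of one CJGI2 run (assumptions as noted above).
library(meta)

one_run_cjgi2 <- function(mu_cur, mu_ext, n_cur = 100, n_ext = 500, cap = 50) {
  cur <- rnorm(n_cur, mean = mu_cur, sd = 1)   # current data
  ext <- rnorm(n_ext, mean = mu_ext, sd = 1)   # external data
  # Two-study meta-analysis of the sample means; the external SE reflects
  # the borrowing cap rather than the full external sample size.
  m <- metagen(TE = c(mean(cur), mean(ext)),
               seTE = c(1 / sqrt(n_cur), 1 / sqrt(cap)))
  n_eff <- (1 - m$I2) * cap                    # discount the cap by I^2 (0-1 scale assumed)
  # Normal conjugate with known SD = 1: precision-weighted combination.
  post_mean <- (n_eff * mean(ext) + n_cur * mean(cur)) / (n_eff + n_cur)
  post_sd   <- 1 / sqrt(n_eff + n_cur)
  # Reject H0: mu <= 0 when the posterior probability of H0 is below 0.025.
  pnorm(0, mean = post_mean, sd = post_sd) < 0.025
}

# Type I error at an external mean of 0.3: mean(replicate(20000, one_run_cjgi2(0, 0.3)))
# Power at a current mean of 0.3:          mean(replicate(5000, one_run_cjgi2(0.3, 0.3)))
```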
Table 1
Type I error and power for the CJGI2 scheme with varying external means (column headers give the mean for the external data)

| Total external data points | | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | Type I error | 0.079 | 0.094 | 0.106 | 0.085 | 0.062 | 0.042 |
| | Power | 0.873 | 0.912 | 0.928 | 0.925 | 0.889 | 0.886 |
| 500 | Type I error | 0.051 | 0.116 | 0.147 | 0.088 | 0.040 | 0.031 |
| | Power | 0.849 | 0.938 | 0.971 | 0.921 | 0.880 | 0.865 |
| 2000 | Type I error | 0.052 | 0.108 | 0.158 | 0.084 | 0.036 | 0.032 |
| | Power | 0.870 | 0.950 | 0.969 | 0.923 | 0.891 | 0.869 |
As one can see, for different amounts of external data, the type I error control is generally reasonable. The type I error is relatively higher near an external mean of 0.3 and decreases in both directions. This is because, at an external mean of 0.3, the evidence against the null mean of 0 is relatively strong and the borrowing is not heavily discounted, owing to the only moderate disagreement between 0 and 0.3. Toward the lower external means, the greater borrowing is offset by weaker evidence against 0; toward the higher external means, the stronger evidence against 0 is offset by heavier discounting due to the stronger disagreement. The total amount of external data may also interact with the type I error control and power performance, as it influences the stability of the estimates of the mean and variance for the complete set of external data.
The second type of simulation discounts the external data by leveraging both the I² statistic and the heterogeneity p-value, as also described in the method section. This method is denoted as the CJGI2PV scheme. The external data are again randomly generated with an SD of 1 given a specific mean.
Table 2
Type I error and power for the CJGI2PV scheme with varying external means (column headers give the mean for the external data)

| Total external data points | | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | Type I error | 0.060 | 0.071 | 0.076 | 0.058 | 0.048 | 0.034 |
| | Power | 0.864 | 0.882 | 0.917 | 0.915 | 0.902 | 0.868 |
| 500 | Type I error | 0.052 | 0.089 | 0.093 | 0.058 | 0.034 | 0.0285 |
| | Power | 0.854 | 0.922 | 0.946 | 0.918 | 0.894 | 0.872 |
| 2000 | Type I error | 0.050 | 0.092 | 0.097 | 0.059 | 0.036 | 0.026 |
| | Power | 0.868 | 0.932 | 0.961 | 0.943 | 0.902 | 0.865 |
Based on Table 2, one can see that incorporating the heterogeneity p-value into the discounting better controls the type I error without sacrificing much power.
The operating characteristics of the CJGI2PV scheme in Table 2 are reasonably good, given that the frequentist powers are 85% and 95% with sample sizes of 100 and 150, respectively, at a mean value of 0.3.
One may argue that in real-life situations the external data are known, which means there is no variability. In Table 3, which is produced only for the CJGI2PV method, the external data are therefore fixed at the specified mean with a standard deviation of 1. The results are similar to those in Table 2 with a large number of external data points. As the mean and the standard deviation for the external data are fixed in Table 3, random numbers need not be generated for the external data, and the total number of external data points no longer matters.
Table 3
Type I error and power for the CJGI2PV scheme with the external mean and SD fixed (column headers give the mean for the external data)

| | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 |
| --- | --- | --- | --- | --- | --- | --- |
| Type I error | 0.053 | 0.095 | 0.105 | 0.058 | 0.035 | 0.031 |
| Power | 0.862 | 0.916 | 0.944 | 0.923 | 0.898 | 0.876 |
Under the normal distribution, the ideal discount function should be associated with the disagreement in the location parameter adjusted by the variance. Putting the location difference and the variance together, a natural candidate for measuring disagreement is the coefficient of variation (CV). Figure 1, generated with R, illustrates the inefficiency of using the I² statistic alone: essentially no discounting occurs for CV up to 1 because of the statistic's inertia to discount when the two sources of data are in fair proximity. On the other hand, the heterogeneity p-value from the meta-analysis discounts aggressively: approximately 70% of the external data has been discounted away at a CV of 1. The discounting controlled by the average of the I² statistic and the heterogeneity p-value is more reasonable. In general, the discounting can be controlled by \(w\times (1-I^{2})+(1-w)\times p_{h}\), where w varies between 0 and 1. For serious adverse events such as death, w should be chosen smaller so that the discount is more aggressive in case of disagreement. If the discount induced by the p-value is still not sufficient, a customized discount function more aggressive than the bottom curve in Fig. 1 may be warranted. Interestingly, the I² and p-value discounts almost converge for large disagreement; the partnership between the I² statistic and the p-value is not a serendipitous finding. Based on Fig. 1, as long as the discount function is connected to the CV, which is a standardized and unified scale for disagreement, there is a wide class of discount functions to serve clinical needs under specific contexts. In the later examples, we will see that this combination of the I² statistic and the p-value works quite well without further fine tuning, though it may not hit the bull's eye.
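As a small illustration of this family, the sketch below implements the retained-borrowing fraction \(w\times (1-I^{2})+(1-w)\times p_{h}\); with w = 0.5 it reduces to the average used by the CJGI2PV scheme. The helper name retained_fraction is ours, and the numeric example uses the Case 1 values discussed below.

```r
# Retained fraction of the borrowing cap as a weighted blend of (1 - I^2)
# and the heterogeneity p-value; w = 0.5 gives the CJGI2PV average.
retained_fraction <- function(I2, p_h, w = 0.5) {
  w * (1 - I2) + (1 - w) * p_h
}

retained_fraction(I2 = 0, p_h = 0.3232)           # 0.6616, as in Case 1 below
retained_fraction(I2 = 0, p_h = 0.3232, w = 0.3)  # smaller w leans on the more aggressive p-value discount
```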
Automating the discount function, or finding a good guiding value for it, is a major component of the current Bayesian scheme. Haddad et al. used the Weibull cumulative distribution function, with the p-value as an input parameter, to model the power prior parameter; their method comes close to the CV linkage. Although the Weibull approach can control a range of type I error and power, it requires fine tuning in a two-dimensional space due to the location and shape parameters of the Weibull model, and it is less intuitive than the proposed method. Hobbs's commensurate method considers the difference in the study parameter yet still misses the intuitive and unified CV scale.
A systematic comparison between this method and the existing Bayesian methods is beyond the author's depth, as it may require years of practice to reach the required level of competency.
Examples
Two real-life examples are provided below to illustrate the utility of the proposed method.
Case 1: Campbell (2017) provided a review of Bayesian applications in medical device clinical trials. One example in his paper focuses on estimation: the historical data have a 20% event rate (50/250), while the current data have a 17% event rate (85/500). In such a case it is appropriate to use 0% as the reference to derive the "treatment effect". Under the CJGI2 approach, the borrowing is full, while under the CJGI2PV method, the effective borrowing is close to that from the hierarchical model described in Campbell's paper. If a performance goal (PG) were used in his example and the objective were to meet the PG, then the "treatment effect" estimation should be performed against the PG instead of 0%. Using the normal approximation, the standard errors of the treatment effect for the historical and current data are 0.0253 and 0.0168, respectively. As the sample size for the historical data is 50% of that for the current data, no adjustment of the standard error is necessary. Using a generic call to the meta-analysis function metagen() in R, the I² statistic and the heterogeneity p-value are 0% and 0.3232, respectively. The effective sample size borrowed from the historical data under the CJGI2 method is (1 − 0) × 250 = 250, while under the CJGI2PV method it is (1 − 0 + 0.3232)/2 × 250 ≈ 165. This effective sample size of 165 is similar to that described in Campbell's paper (slightly above 150, as quoted), which was derived post hoc (by back engineering) from a hierarchical Bayesian model.

We can also calculate the "Bayesian" point estimate and "credible interval" from the beta-binomial conjugate, starting from a non-informative super prior Beta(1, 1). The effective historical sample size of 165 corresponds to an event count of 33 given the observed event rate of 20%, leading to an updated beta prior of Beta(34, 133). The resulting posterior distribution is Beta(119, 548), with a point estimate of 0.178 and a 95% credible interval of (0.149, 0.207), both close to the results reported by Campbell.
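The Case 1 computation can be sketched in R as follows. The calculation assumes the two-study metagen() call and normal approximation described above; the element names I2 and pval.Q are as understood from the meta package, and the non-integer effective counts are passed directly to the beta posterior.

```r
# Case 1: CJGI2PV borrowing and beta-binomial posterior (sketch).
library(meta)

p_hist  <- 50 / 250                                # historical event rate (20%)
p_cur   <- 85 / 500                                # current event rate (17%)
se_hist <- sqrt(p_hist * (1 - p_hist) / 250)       # ~0.0253
se_cur  <- sqrt(p_cur  * (1 - p_cur)  / 500)       # ~0.0168

m <- metagen(TE = c(p_cur, p_hist), seTE = c(se_cur, se_hist))
retained <- (1 - m$I2 + m$pval.Q) / 2              # (1 - 0 + 0.3232) / 2
n_eff    <- retained * 250                         # ~165 effective historical patients
x_eff    <- n_eff * p_hist                         # ~33 effective historical events

# Beta-binomial updating from the non-informative super prior Beta(1, 1).
a <- 1 + x_eff + 85
b <- 1 + (n_eff - x_eff) + (500 - 85)
round(c(estimate = a / (a + b),
        lower = qbeta(0.025, a, b),
        upper = qbeta(0.975, a, b)), 3)            # ~0.178 and (0.149, 0.207)
```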
Case 2: For the XIENCE diabetic indication, a hierarchical Bayesian model following Pennello and Thompson (2008) was employed to combine the data from four historical trials with the prospective data from two clinical centers (FDA 2015). The analysis is that of a single-arm study against a PG of 14.8%. The primary endpoint is the 1-year target vessel failure (TVF) rate, where TVF is a composite endpoint consisting of cardiac death, target-vessel myocardial infarction, and ischemia-driven target-vessel revascularization. The four historical trials have event rates of 7.8% (34/433), 11.8% (14/119), 7.3% (13/178), and 3.6% (6/169), respectively, with a total combined sample size of 899. The two clinical centers for the prospective data are combined into a single cohort for convenience, with an event rate of 8% (21/261). The differing historical event rates reflect a combination of factors, such as different populations and small sample sizes, with the latter contributing to higher variability; for example, the highest event rate of 11.8% comes from the long-lesion patients with the smallest sample size.

How to handle the historical data is subject to many considerations. For convenience of illustration, we use a random-effects meta-analysis to integrate the historical data, which gives an overall event rate of 7.3% with a 95% CI of [5.3%, 10%] and a standard error of approximately 0.012. The maximum number of patients to be borrowed is assumed to be 50% of the prospective data, which translates into 130; the standard error of the historical meta-analysis mean, adjusted to this maximum borrowing, is 0.032. The "treatment effects" against the PG for the prospective data and the maximum borrowable historical data are −0.068 (0.08 − 0.148) and −0.075 (0.073 − 0.148), respectively, with associated standard errors of 0.017 and 0.032. The generic meta-analysis between the two sources gives an I² value of 0 and a heterogeneity p-value of 0.8468, so the proportion of the borrowing retained is (1 − 0 + 0.8468)/2 ≈ 0.92, which leads to an effective sample size of 120 for the historical data. With an effective sample size of 120 and the observed rate of 7.3%, the associated number of events is approximately 9 after rounding. Combining the prospective data (21/261) and the effective historical data (~9/120) through the beta-binomial conjugate, starting from a non-informative super prior Beta(1, 1), yields a posterior beta distribution of Beta(31, 360), resulting in a Bayesian point estimate of 7.9% and an associated 95% credible interval of (5.3%, 10.6%). These results are again very close to the submitted Bayesian analysis results, which went through extensive simulations with the WinBUGS software and involved several rounds of regulatory interactions. In contrast, the computation using the proposed method is relatively intuitive and straightforward.
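The heterogeneity assessment and the effective historical sample size for Case 2 can be sketched the same way; the treatment effects and standard errors below are the values quoted above, the element names again assume the meta package, and the final beta-binomial updating follows the same pattern as in the Case 1 sketch.

```r
# Case 2: discounting of the historical meta-analysis result (sketch).
library(meta)

# "Treatment effects" vs. the PG of 14.8% and their standard errors,
# for the prospective data and the maximum borrowable historical data.
te <- c(0.080 - 0.148, 0.073 - 0.148)    # -0.068 and -0.075
se <- c(0.017, 0.032)

m <- metagen(TE = te, seTE = se)
retained <- (1 - m$I2 + m$pval.Q) / 2    # (1 - 0 + 0.8468) / 2, about 0.92
n_eff    <- retained * 130               # ~120 effective historical patients
x_eff    <- round(n_eff * 0.073)         # ~9 effective historical events
c(retained = retained, n_eff = n_eff, events = x_eff)
```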