Performance evaluation of interim analysis in bioequivalence studies.

Under current bioequivalence guidelines in Japan, it is mandatory to establish bioequivalence using a single pivotal study. Clinical trials with limited resources usually have a pre-defined maximum permissible number of participants. In this manuscript, we considered a trial design that allows bioequivalence to be evaluated at an interim analysis, with a total number of participants that takes resource constraints into account. The available options at the interim analysis are group sequential designs and adaptive designs. A comparison of the performance of the two methods under a fixed maximum participant number has not been conducted thus far, so we examined which method should be used by conducting a simulation study. Since bioequivalence is expected to be achieved at the interim analysis, a study design using a Pocock-type alpha spending function is preferable. Simulation results using a Pocock-type alpha spending function showed similar performance between group sequential and adaptive designs. Consequently, due to statistical and operational complexity, it is preferable to choose group sequential designs for bioequivalence studies in Japan.


Introduction
Bioequivalence (BE) studies are a type of clinical trial conducted to compare the bioavailability of two products that use the same active ingredient or molecule and to verify how similar they are to each other. BE studies often utilize a cross-over design for several compelling reasons. Firstly, this approach minimizes inter-subject variability by enabling each participant to serve as their own control; consequently, the influence of individual differences (inter-subject variability) on the study results is greatly reduced. Secondly, given that each subject acts as their own control, fewer participants are necessary to attain the required statistical power. Lastly, this design enhances precision by mitigating the impact of variability within the study population, which allows more accurate and dependable conclusions to be drawn regarding bioequivalence.
In the context of BE studies, one of the simplest and most commonly employed designs is the 2x2 crossover design. A common method for bioequivalence assessment involves comparing different formulations of the same drug by analyzing pharmacokinetic parameters such as the area under the blood concentration curve (AUC) and the maximum blood concentration (Cmax). The goal is to determine whether the observed differences between these formulations are statistically significant. Both the FDA and EMA recommend the use of single-dose 2-way crossover studies, typically involving at least 12 healthy volunteers, as a general practice for BE studies. In 2020, Japan implemented a significant revision to its guidelines for BE studies of generic products (PMDA, 2020). Prior to this revision, there were various opportunities for BE evaluation, including the use of add-on subject studies without the need for multiplicity adjustment. Post-revision, however, a notable shift occurred in the regulatory landscape: there emerged a stringent requirement to control the type 1 error so that it does not exceed a 5% threshold (one-sided) in a single pivotal BE study. As a consequence, it is not acceptable to use pilot study data in the BE assessment (Q-10 (A)). Similarly, analyzing the pooled data obtained in the pivotal study and add-on subject studies conducted separately from the pivotal study is also not acceptable (Q-11 (A)). However, it is acceptable to design a protocol with the acquisition of additional data based on the results of an interim analysis (Q-11 (A)). Taking these key points into consideration, the possible trial designs are either a group sequential design (GSD) or an adaptive design (AD).
Several manuscripts related to two-stage designs in BE studies have been published. Potvin et al. (2008) evaluated four methods for AD using the assumed geometric mean ratio (GMR) and the actual coefficient of variation (CV) for the BE evaluation at the interim analysis in BE studies. The two-stage designs discussed there offer insurance against an incorrect CV assumption during planning, at the cost of an approximately 20% increase in average sample size when the planned CV is accurate. Karalis and Macheras (2013) selected the best method identified in Potvin et al. (2008), but used the actual GMR and CV at the interim analysis for sample size re-estimation of the AD and set an upper limit on the sample size. Their simulation results for the two-stage design ensure that the type 1 error rate remains below 5%, likely due to the inclusion of the upper sample size limit. Fuglsang (2014) investigated the impact on type 1 error and power of the methods introduced by Potvin et al. (2008) under various futility rules. While these manuscripts focused only on the results of adaptive designs, Kieser and Rauch (2015) compared GSD with AD, and found that power in the GSD is similar to that in the AD but average sample sizes in the GSD were smaller than those in the AD. However, because the maximum sample sizes and the number of subjects targeted for the interim analysis differed between the GSD and AD they evaluated, it is necessary to compare their performance under matched conditions.
In actual clinical trials, it is anticipated that the maximum number of subjects is predefined due to limited resources. The maximum allowable number of subjects in a BE study is typically smaller than that in Phase 2 or Phase 3 studies. Our manuscript takes these circumstances into consideration and evaluates the performance of GSD and AD in BE studies, aiming to examine which method should be used. These BE study designs allow for the possibility of declaring BE and stopping the study at the interim analysis if the actual within-subject variability is lower than expected or if the GMR is closer to 1 than anticipated.
In Section 2, we describe the methodology of GSD and AD applied to BE studies. Section 3 covers extensive simulation studies that reflect the scale of typical small-scale BE studies. We provide a brief discussion in Section 4.

Methodology
When AUC and Cmax follow a log-normal distribution, the BE criterion for each parameter is defined as a ratio of the population geometric means of the test product and reference product falling within the range of 0.80 to 1.25. BE is declared for a product if the 90% confidence interval of the difference in the average values of the logarithmic parameters between the test and reference products is contained within the acceptable range of log(0.80) to log(1.25).
Let \mu_T and \mu_R denote the population geometric means, before log-transformation, of the parameter used for the evaluation of BE in the test and reference products. The hypotheses of the BE test are

H_0: \mu_T/\mu_R \le \theta_L \ \text{or}\ \mu_T/\mu_R \ge \theta_U \quad \text{versus} \quad H_1: \theta_L < \mu_T/\mu_R < \theta_U,

where the bioequivalence margins are \theta_L = 0.80 and \theta_U = 1.25. The null hypothesis can be evaluated with two one-sided t-tests at a 5% significance level as follows:

\frac{\bar{y}_T - \bar{y}_R - \log\theta_L}{\hat{\sigma}_W\sqrt{2/n}} > t_{1-\alpha}(n-2) \quad \text{and} \quad \frac{\bar{y}_T - \bar{y}_R - \log\theta_U}{\hat{\sigma}_W\sqrt{2/n}} < -t_{1-\alpha}(n-2), (1)

where \bar{y}_T and \bar{y}_R are the means of the log-transformed parameter for the test and reference products, \alpha is the significance level, n is the total sample size, and \hat{\sigma}_W is the standard deviation of the within-subject error.
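As a concrete illustration, the two one-sided tests procedure can be sketched as follows. This is a minimal sketch: the function name, the normal-theory standard error for a 2x2 crossover, and the example parameter values are our own illustrative choices, not taken from the manuscript.

```python
import numpy as np
from scipy import stats

def be_tost(diff_log, s_within, n, alpha=0.05):
    """Two one-sided t-tests (TOST) for BE in a 2x2 crossover.

    diff_log : observed difference of log-scale means (test - reference)
    s_within : estimated within-subject standard deviation
    n        : total number of subjects
    Returns True if BE is declared, i.e. the 90% CI lies within
    [log(0.80), log(1.25)].
    """
    se = s_within * np.sqrt(2.0 / n)      # SE of the mean difference
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha, df)   # one-sided 5% critical value
    lower = diff_log - t_crit * se        # 90% CI = two one-sided 95% bounds
    upper = diff_log + t_crit * se
    return bool(np.log(0.80) < lower and upper < np.log(1.25))

# Illustrative call: GMR = 0.95, CV = 0.2 (log-scale SD = sqrt(log(1 + CV^2))), n = 24
print(be_tost(np.log(0.95), np.sqrt(np.log(1 + 0.2**2)), 24))  # True
```

With these illustrative values BE is declared; shrinking n to 6 widens the interval past log(0.80) and BE fails.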
In this manuscript we consider study designs that allow the study to be terminated if the BE criteria are met at the interim analysis. In the following sections, we explain the application of GSD and AD to the study design.

Group sequential design
Given the multiple hypothesis tests conducted in GSDs, it becomes necessary to adjust the significance level of each analysis to maintain the overall significance level \alpha. These repeated analyses incorporate data from earlier interim analyses, resulting in correlated test statistics within GSDs. Various strategies exist for determining the interim-wise significance levels in GSDs; an early approach was proposed by Pocock (1977). In the alpha spending formulation, the Pocock-type function at information fraction t is

\alpha(t) = \alpha \log(1 + (e - 1)t), (2)

and the O'Brien-Fleming-type function is

\alpha(t) = 2\left(1 - \Phi\left(z_{1-\alpha/2}/\sqrt{t}\right)\right), (3)

where \alpha is the overall significance level of the study, e is Napier's constant, \Phi is the cumulative standard normal distribution function, and z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2). BE is evaluated at the time of an interim analysis using the appropriate significance level for the t-test in (1).
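The two spending functions can be evaluated directly. The sketch below assumes the Lan-DeMets-style forms given in (2) and (3) and an interim look at half the information (t = 0.5); function names are our own.

```python
import numpy as np
from scipy.stats import norm

def pocock_spend(t, alpha=0.05):
    """Pocock-type alpha spending at information fraction t, eq. (2)."""
    return alpha * np.log(1 + (np.e - 1) * t)

def obf_spend(t, alpha=0.05):
    """O'Brien-Fleming-type alpha spending at information fraction t, eq. (3)."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))

# Alpha spent at an interim look with t = 0.5:
print(round(pocock_spend(0.5), 4))  # 0.031
print(round(obf_spend(0.5), 4))     # 0.0056
```

Both functions spend the full \alpha = 0.05 at t = 1, but Pocock spends far more at the interim look, which is why an interim BE declaration is easier under the Pocock approach.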

Adaptive design
Adaptive designs allow a flexible modification of design characteristics during an ongoing study while controlling the overall type I error rate. In particular, adaptive designs offer the option of mid-course sample size recalculation based on interim results. There exist various approaches to constructing adaptive designs that control the type I error rate even in the case of sample size recalculation. One standard method is the inverse normal approach (Cui et al., 1999; Lehmacher and Wassmer, 1999; Kieser and Rauch, 2015). In essence, it transforms the p-values p_1 and p_2 from each stage of a two-stage design. When these p-values follow a uniform distribution under the null hypothesis, the inverse normal transformation converts them into standard normal random variables. For the inverse-normal combination test (Wassmer and Brannath, 2016), the test statistics at the interim analysis and at the final analysis are

Z_1 = \Phi^{-1}(1 - p_1) \quad \text{and} \quad Z_2 = \Phi^{-1}(1 - p_2),

respectively, where p_1 and p_2 are the p-values at the interim and final analyses. The overall test statistic is

Z = w_1 Z_1 + w_2 Z_2, \quad w_1^2 + w_2^2 = 1. (4)

In this manuscript, we set the weights to w_1 = w_2 = 1/\sqrt{2}. Z follows a standard normal distribution under the null hypothesis, and the critical values used in the GSD can also be used in the AD. Consequently, the probability of terminating the study at the interim analysis in the AD is the same as that in the GSD.
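A minimal sketch of the inverse-normal combination (4) with equal weights; the function name and the example p-values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def inverse_normal_combination(p1, p2, w1=1 / np.sqrt(2)):
    """Combine stage-wise one-sided p-values via the inverse normal method.

    With pre-specified weights w1, w2 satisfying w1^2 + w2^2 = 1, the
    combined statistic is standard normal under H0 regardless of any
    sample size re-estimation between the stages.
    """
    w2 = np.sqrt(1 - w1**2)
    z = w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2)
    return z, 1 - norm.cdf(z)  # combined statistic and combined p-value

# Illustrative stage-wise p-values:
z, p = inverse_normal_combination(0.10, 0.02)  # z ≈ 2.36, p ≈ 0.009
```

Note that the weights must be fixed in advance; re-weighting by the realized stage sizes would break type I error control.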
Sample size re-estimation is a characteristic feature of the AD: it adjusts the sample size during an ongoing clinical study based on the accumulating data in order to preserve or increase power. The conditional power that significance is achieved at the second stage, conditioned on the test statistic z_1 at the first stage (the interim analysis), is

CP(z_1) = 1 - \Phi\left(\frac{c - w_1 z_1}{w_2} - \frac{\delta}{\sigma_W\sqrt{2/n_2}}\right), (5)

where c is the critical value at the final analysis, \delta is the assumed log-scale drift relative to the BE margin, and n_2 is the second-stage sample size. From (5), we can regard the significance level at stage 2 as \alpha_2(z_1) = 1 - \Phi((c - w_1 z_1)/w_2). In other words, the conditional power can also be calculated using the following formula:

CP(z_1) = 1 - \Phi\left(z_{1-\alpha_2(z_1)} - \frac{\delta}{\sigma_W\sqrt{2/n_2}}\right). (6)

The GMR is estimated from the data at the interim analysis, and we search for the smallest n_2 for which the conditional power exceeds the target value and the total sample size is no larger than the maximum allowable sample size (n_max).
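The search for n_2 can be sketched as follows. This is a simplified normal-approximation sketch using a single one-sided test rather than the full TOST; the function names, the drift parameterization, and the default critical value are our own assumptions, not the authors' actual procedure.

```python
import numpy as np
from scipy.stats import norm

def conditional_power(z1, n2, delta, sigma_w, alpha=0.05, w1=1 / np.sqrt(2)):
    """Approximate conditional power of the second stage given z1.

    z1      : first-stage test statistic
    n2      : second-stage sample size
    delta   : assumed log-scale drift, e.g. log(GMR) - log(0.80)
    sigma_w : within-subject standard deviation
    Uses a fixed final critical value as a placeholder; in a real design
    it would come from the alpha spending function.
    """
    w2 = np.sqrt(1 - w1**2)
    c_final = norm.ppf(1 - alpha)
    c2 = (c_final - w1 * z1) / w2            # stage-2 hurdle given z1
    drift = delta / (sigma_w * np.sqrt(2.0 / n2))
    return 1 - norm.cdf(c2 - drift)

def reestimate_n2(z1, delta, sigma_w, target_cp=0.80, n_max=60, step=2):
    """Smallest n2 (up to n_max) whose conditional power reaches target_cp."""
    for n2 in range(step, n_max + 1, step):
        if conditional_power(z1, n2, delta, sigma_w) >= target_cp:
            return n2
    return n_max                             # cap at the maximum allowable size

# Illustrative interim result: z1 = 1.0, estimated GMR = 0.95, sigma_w = 0.2
n2 = reestimate_n2(1.0, np.log(0.95) - np.log(0.80), 0.2)
```

The cap at n_max mirrors the resource constraint discussed in the Introduction: if even n_max subjects cannot reach the target conditional power, the study proceeds with n_max rather than growing without bound.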

Simulation study design
To understand the performance of the GSD and the AD in a small-scale clinical study, we conducted an extensive simulation study. To simplify the discussion, the simulations consider only a single endpoint, i.e., AUC or Cmax. We also set GMRs at or below 1 because the simulation performance is expected to be symmetrical around 1 (i.e., a GMR of 0.8 would yield equivalent results to a GMR of 1.25). We calculated the sample size (n) to ensure 90% power under the various assumed GMRs (GMR*) and CVs (CV*) in Table 1. As mentioned in Section 1, BE must be proven in a single study, so the assumed GMR (GMR*) and CV (CV*) are estimated somewhat conservatively when the sample size is calculated. We then assume a situation where the true GMR (GMR0) is close to 1 or the true CV (CV0) is small, so that BE is expected to be achieved at the interim analysis. With this in mind, we set up three scenarios. Scenario A has an incorrect assumed CV (CV* ≠ CV0) but a correct assumed GMR (GMR* = GMR0). Scenario B has an incorrect assumed GMR (GMR* ≠ GMR0) but a correct assumed CV (CV* = CV0). Scenario C has both an incorrect assumed GMR (GMR* ≠ GMR0) and an incorrect assumed CV (CV* ≠ CV0).
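A simplified sketch of how one replicate of such a simulation might be generated. Period and sequence effects are omitted, the function name and parameter values are illustrative, and this is not the authors' actual simulation code.

```python
import numpy as np

rng = np.random.default_rng(12345)

def simulate_crossover(n, gmr, cv):
    """Simulate per-subject log-scale differences (test - reference)
    for one 2x2 crossover study, ignoring period/sequence effects."""
    sigma_w = np.sqrt(np.log(1 + cv**2))  # within-subject SD on log scale
    # Each within-subject difference of log-parameters has mean log(GMR)
    # and variance 2*sigma_w^2 (two independent within-subject errors).
    return rng.normal(np.log(gmr), np.sqrt(2) * sigma_w, size=n)

# One simulated study under a Scenario-A-like truth: GMR0 = 0.95, CV0 = 0.2
diffs = simulate_crossover(24, 0.95, 0.2)
est_gmr = np.exp(diffs.mean())  # point estimate of the GMR
```

Repeating this many times and applying the interim and final BE tests to each replicate yields the empirical power and average sample size reported below.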

Simulation results
First, we discuss the simulation results using the Pocock-type alpha spending function. Figure 1 summarizes the overall powers and average sample sizes by method (the fixed design (FD), GSD, and AD (CP = 80%, 90%)) in Scenario A (Pocock). The figure for powers at the interim analysis is given in the appendix (similar figures for the other simulation scenarios are also given in the appendix). Overall power and average sample size are highest in the FD, while those in the GSD are almost the same as those in the AD (CP = 80%, 90%). Figure 2 summarizes the overall powers and average sample sizes by method in Scenario B (Pocock). As in Fig. 1, the overall powers and average sample sizes are highest in the FD, and those in the GSD are almost the same as those in the AD (CP = 80%, 90%).
From another perspective, we consider the results when the true CV0 is smaller than the assumed CV* (e.g., CV0 = 0.2 and CV* = 0.4 in Fig. 1) or when the true GMR0 is closer to 1 than the assumed GMR* (e.g., GMR0 = 1 and GMR* = 0.9 in Fig. 2). In these cases, the powers in the FD, GSD, and AD (CP = 80%, 90%) are almost the same, but the average sample size in the FD is larger than that in the GSD and AD (CP = 80%, 90%), and the difference becomes more pronounced. In Fig. 3, which shows the overall powers and average sample sizes by method in Scenario C (Pocock), these findings are generally supported by the results under an incorrect assumed GMR and an incorrect assumed CV (e.g., CV0 = 0.2, CV* = 0.4, GMR0 = 1, and GMR* = 0.9 in Fig. 3). This is thought to be because the probability of achieving BE at the interim analysis has increased.
Next, we focus on the simulation results using the O'Brien-Fleming-type alpha spending function. Figure 4 summarizes the overall powers and average sample sizes by method in Scenario A (O'Brien-Fleming). Overall powers are highest in the FD and almost the same in the FD and GSD; the next highest is the AD (CP = 90%), followed by the AD (CP = 80%). Average sample sizes are highest in the FD, followed by the GSD, AD (CP = 90%), and AD (CP = 80%). Figure 5 summarizes the overall powers and average sample sizes by method in Scenario B (O'Brien-Fleming). The order of the overall powers and average sample sizes for each method is almost the same as in Fig. 4, and similar results were obtained when the true CV0 was smaller than the assumed CV* or when the true GMR0 was closer to 1 than the assumed GMR*. In Fig. 6, which shows the overall powers and average sample sizes by method in Scenario C (O'Brien-Fleming), these findings are generally supported by the results under an incorrect assumed GMR and an incorrect assumed CV. However, in some cases (e.g., CV0 = 0.2, CV* = 0.4, GMR0 = 1, and GMR* = 0.9 in Fig. 6), the powers in the FD, GSD, and AD (CP = 80%, 90%) are almost the same, but the average sample size in the FD is larger than that in the GSD and AD (CP = 80%, 90%) because the probability of achieving BE at the interim analysis has increased.
We now delve into the reasons for the difference in performance between the Pocock-type and O'Brien-Fleming-type alpha spending functions. The difference can be attributed to the more stringent BE criterion at the interim analysis under the O'Brien-Fleming approach compared to the Pocock approach. Consequently, in the O'Brien-Fleming framework, the GSD is more likely to proceed to the final analysis. On the other hand, when calculating the conditional power of the AD based on the interim analysis data, it becomes more attainable to achieve a conditional power of 80% or 90%, which increases the occurrence of scenarios in which the re-estimated second-stage sample size is small. Consequently, this leads to a small reduction in both the overall power and the average sample size in the AD.

Conclusion
Under current bioequivalence guidelines in Japan, it is mandatory to establish bioequivalence using a single pivotal study. In addition, there is typically a predetermined maximum allowable number of subjects in clinical trials with limited resources. In this manuscript, we set the total number of subjects in the clinical trial conservatively, taking into account resource constraints, and considered a trial design that allows bioequivalence to be evaluated at an interim analysis.
The results of our simulation study indicated that when using the O'Brien-Fleming-type alpha spending function at the interim analysis, both the overall power and the average number of subjects in the group sequential design tended to be higher than those of the adaptive design. On the other hand, since bioequivalence is expected to be achieved at the interim analysis, a study design using a Pocock-type alpha spending function would be preferred.
Simulation results for the Pocock-type alpha spending function showed little difference in performance between the group sequential design and the adaptive design. Therefore, considering the statistical and operational complexity, it may be preferable to choose a group sequential design for bioequivalence studies in Japan.

Declarations Author contributions
All of the authors listed made substantial contributions to the analysis and interpretation of the data described in this paper. All of the authors commented on and revised previous versions of this manuscript before reviewing and approving this final version. The authors agree to be accountable for all aspects of the work, ensuring that any questions related to the accuracy or integrity of the work are appropriately investigated and resolved.

Funding statement
This research was conducted without research funding from UCB Japan Co., UCB S.A., or the University of Tsukuba, to which the authors belong. The opinions expressed in this manuscript are solely those of the authors and do not express the views or opinions of these organizations.

Figure 1 Simulation results of Scenario A by methods (Pocock).

Figure 2 Simulation results of Scenario B by methods (Pocock).

Figure 3 Simulation results of Scenario C by methods (Pocock).

Figure 4 Simulation results of Scenario A by methods (O'Brien-Fleming).

Figure 5 Simulation results of Scenario B by methods (O'Brien-Fleming).

Table 1. Sample size (n) to ensure 90% power under various GMR* and CV*.