The decision to start a clinical trial to investigate a new drug or medical device is informed by preclinical studies that evaluate efficacy and safety. Depending on the medicinal product, some types of testing, such as toxicology studies, are regulated and mandatory before moving from bench to bedside, while others are specific to the disease, drug, and/or (animal) model. Here, we focus on preclinical efficacy studies, where fewer regulatory prescriptions apply. The ultimate goal of such studies is to make knowledge claims[1]. Articulated at different effect levels, these include, for example, the claim of a specific role for a protein in a physiological process, or that an intervention will cure or slow the progression of a disease. To arrive at a knowledge claim, preclinical studies are performed in a stepwise approach: hypothesis-generating exploratory studies evolve along a continuum through within-lab replications to knowledge-claiming confirmation. During this process, investigators need to continuously re-evaluate premises and refine study designs to increase validity and reliability. This includes defining Go/No-Go criteria for further studies already at early stages[2]. When it comes to detailed guidance for this transition process, information on planning, conducting, analyzing, and evaluating confirmatory studies in preclinical research is scarce. The need for such guidance is emphasized by recent initiatives investigating evidence from single studies, for example in cancer biology, which find a substantial number of experiments that do not replicate; that is, effect sizes are substantially lower than in the original study and results are no longer significant[3]. Although this is not unexpected, and science has the potential to self-correct, efficient strategies need to be devised to foster translation into the clinic and generate patient benefit. This includes the essential questions of when and how to conduct a confirmatory study.
To close this gap, biostatisticians, preclinical scientists, clinicians, and meta-researchers held a workshop to discuss the aforementioned issues for preclinical multicenter confirmatory studies (see Figure S1 for the composition of workshop participants). Whereas the collaborative conduct of a study by more than one independent study site using shared protocols is common practice in clinical trials, it is a rather recent approach in the preclinical context[4]. Most participating researchers currently conduct confirmatory studies funded by the German Federal Ministry of Education and Research[5]. Importantly, investigators aim to confirm their own previous exploratory research findings and underlying knowledge claims in a preclinical multicenter setting. The generated evidence should inform decisions to start a clinical trial. To develop guidance for conducting confirmatory studies, we reviewed and discussed current approaches to identify what strength of evidence is needed before engaging in a confirmatory study and how evidence generation can be optimized in a confirmatory study with respect to the knowledge claim. In this report we present suggestions from a transdisciplinary perspective and highlight open questions and opportunities for further research.
Towards Robust Evidence
For the decision to proceed to confirmatory experiments, a priori criteria need to be set. These criteria reflect the evidence gathered so far and address the necessarily high uncertainty and possible bias of exploratory experiments. To evaluate the robustness of evidence, two factors are of main importance: reliability and validity. Reliability refers to the characteristics of a result that reflect the level of replicability, measured for example by effect size precision or statistical significance. Importantly, a reliable experiment is not necessarily valid, as results might be replicable but not reflect the postulated underlying mechanism. Experiments therefore also need sufficient validity to substantiate the knowledge claim. Here, we recommend minimum criteria for validity and reliability to support the decision to conduct a confirmatory study.
Minimum reliability and validity criteria
In exploratory studies, low sample sizes often threaten the reliability of results. Two factors contribute to this. First, significant results do not necessarily reflect the existence of a biologically relevant effect. Second, even if they do, the estimated effect size will overestimate the actual effect. To understand the first issue, one has to consider a set of scientific hypotheses that are experimentally tested. Some of these will reflect an underlying biologically relevant effect, whereas others will not. The probability of detecting a relevant effect is closely tied to the sample size. Low sample sizes, as frequently seen in preclinical experiments, and the resulting low statistical power lead to decreased detection rates for these relevant effects[6, 7]. Additionally, and inherent to statistical test procedures, experiments also produce false positives, usually in 5% of the cases in which a biologically relevant effect does not exist. This results in a dilution of the small number of identified relevant effects by a number of false positives. That is, a significant finding derived from a low-sample-size experiment is at increased risk of not reflecting a true cause-effect relationship. The second effect caused by low sample sizes is an inflation of effect sizes for significant results. This so-called winner's curse is elicited by the applied p-value filter, wherein only large experimental effect sizes yield significant results in low-powered experiments[8]. That is, even if experiments detect relevant effects, the effect estimate carries a risk of inflation.
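The winner's curse can be illustrated with a short simulation (a hypothetical sketch, not an analysis from this work; the sample size of 5 per group and the true standardized effect of 0.5 are arbitrary illustrative choices):

```python
# Hypothetical simulation: in low-powered two-group experiments, a
# p < 0.05 filter selects for inflated effect size estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_d, n_sims = 5, 0.5, 20_000  # illustrative values

significant_d = []
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        # Cohen's d with pooled standard deviation
        pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
        significant_d.append((treated.mean() - control.mean()) / pooled_sd)

power = len(significant_d) / n_sims
print(f"empirical power ~ {power:.2f}")
print(f"mean d among significant results ~ {np.mean(significant_d):.2f} "
      f"(true d = {true_d})")
```

Under these settings only a minority of runs reach significance, and the average effect estimate among significant runs lies well above the true value, which is exactly the inflation described above.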
Consequently, when deciding whether to conduct a confirmatory study, the inflation of effect sizes and the limitations of the p-value[9] need to be considered. If uncertainty about effect estimates is still high, within-lab replications can be a viable way to substantiate exploratory findings (see section Within-lab replications as a road to rigorous evidence). Alternatively, and similar to clinical trials, investigators can set an a priori determined smallest effect size of interest that reflects biological or clinical relevance to argue for a specific mechanism of action or to predict efficacy of an intervention, respectively. Such a lower bound can be informed by published effect size distributions, discussions with clinicians about viable clinical effects, and/or available resources that will only allow a certain minimal effect size to be detected[10]. This discussion should involve biostatisticians and biomedical researchers, who need to set decision-critical a priori criteria (e.g. the smallest effect within the confidence interval (CI) of the exploratory study estimate) for progression to the next phase of experiments.
Regarding validity, the minimum set of criteria[11, 12] spans mainly three domains: internal, external, and translational validity. A high degree of internal validity is necessary already at early stages. This not only includes measures to reduce the risk of bias, such as randomization[13] and blinding[14], but also the use of validated methods that measure outcomes with low bias and high accuracy[15] (Table 1). To promote generalizability of results beyond the single experiment, external validity needs to be increased, for example by investigating or systematically introducing sources of variation through systematic heterogenization. This can be achieved by varying genetic and/or environmental conditions, for example by testing immunocompetent animal models instead of specific pathogen-free (SPF) immunocompromised strains[16, 17], or by introducing environmental variation in a multicenter approach. To what extent this is necessary and feasible already at exploratory stages is an open question. Another powerful tool that adds to external validity is triangulation, where different methods and approaches are combined to support the same claim. If different methods yield converging evidence, the validity of the generated evidence increases, at the potential cost of added complexity in the study design[18]. Additionally, within-lab replications can increase external validity (see section Within-lab replications as a road to rigorous evidence). As the ultimate goal of these experiments is clinical translation, factors that are diagnostic for the human case need to be considered and outcomes defined to facilitate interpretation in the clinical context (translational validity). In particular, (animal) models should reflect the targeted aspects of the human disease and be supported by converging evidence from different methods and contexts. We also recommend investigating bioavailability of the drug before or very early in the confirmatory stage, ideally including pharmacokinetics.
Dose-finding experiments should be performed before a large multicenter confirmation, either to start with a predefined dose or at least to narrow it down to a minimal range. Other factors are less critical for the decision to continue to a confirmatory study. For example, testing clinically relevant biomarkers and the route of administration can be part of complementary experiments in the confirmatory phase. Those complementary experiments might be exploratory in nature or considered flanking experiments to strengthen the evidence.
Table 1
Minimum criteria that need to be fulfilled/considered before starting a preclinical confirmatory multicenter trial. Best practices are based on existing (reporting) guidelines and sketch the ideal situation. However, there can be practical limitations that hinder, e.g., blinding or randomization.
Criteria | Minimum requirement | Best Practice | Restrictions/ Considerations |
Internal Validity | | | |
Blinding Concealment of group allocation from one or more investigator(s) involved in a preclinical study | Blinded outcome assessment | Blinding of treatment allocation, experiment(s), outcome assessment and analyses | Experiments in which the treatment allocation is directly linked to an obvious phenotypic difference from the start of the experiment (e.g. genetically modified mice with different fur colors) |
Randomization Using chance methods to allocate subjects to intervention and/or treatment according to a clearly defined probability distribution | Completely randomized[13] | Block design and stratification within known (not post-hoc) important predicting strata (like bodyweight) | Social transfer of e.g. pain may limit randomization options[19, 20] |
Inclusion/Exclusion Differentiate between animal attrition or drop-out and (data) outlier management | Clearly a priori defined inclusion/exclusion criteria Reporting of drop-out rate and/or animal attrition If data points are removed, it must be performed before unblinding according to a pre-defined protocol | Report full datasets and report all excluded animals with reason | Inclusion/exclusion criteria can be based on animal welfare (severity assessment and humane endpoint), on scientific outcome (e.g. three times SD) or on characteristics of the model (genotype, phenotype, stage of disease) |
Outcome | Primary outcome needs to be clearly defined (measurement unit and time point) and disease relevant (as defined involving a clinician) | Primary and secondary outcomes are clearly defined | |
Quality Management/ Assurance Including standardization (and harmonization) of protocols | Protocols/work instructions and/or standard operating procedures in place Measures to assure quality of methods and models are defined (e.g. baseline measures across laboratories) | Harmonization of protocols across laboratories prior to the multicenter study (identification of differences) Training of experimenters | Different regulatory requirements regarding animal welfare in multicenter studies performed across different legal jurisdictions |
Claim specification | Knowledge claim specification | Preregistration including specification of hypotheses (knowledge claims) and criteria for acceptance/ rejection | preclinicaltrials.eu animalstudyregistry.org osf.io |
Statistical methods | Need to be defined in advance (which methods are to be performed and which assumptions are made), including sample size calculation | Preregistration[21]; Registered reports[22] | Reach out to statistical consultants if needed |
Reliability Consistency in a measurement | Sufficient number of animals to assess the clinically or biologically meaningful effect and its associated uncertainty to inform sample size calculations | Increase sample size via within-lab replication to estimate effect size with adequate precision | Within-lab replication can happen in parallel or across time (preferred) |
Translational Validity Extent to which a scientific finding can be translated from preclinical to clinical (human) contexts | Animal model is relevant for the disease and reflects some of its characteristics Indicating context of relevance (diagnostic manuals and categorical criteria or transdiagnostic approaches) Be aware of model limitations! | Include clinically relevant biomarker(s) and/or diagnostics For medicinal products: biodistribution and/or bioavailability Animal model is highly relevant and carries many disease characteristics And/or perform experiments using different (animal or human cell-based) models/tissues with complementary characteristics (triangulation) | Experiments focusing on e.g. mechanistic understanding that do not aim directly at clinical translation |
Within-lab replications as a road to rigorous evidence
If the minimum criteria (as presented in Table 1) are not met with the first exploratory study, replication experiments can serve as a powerful validation tool before conducting a larger (multicenter) study. In this context, within-lab replications, also called mini-experiments[23], with refined experimental design and improved internal as well as external validity (by considering batch effects) are valuable. Moreover, refined animal models generate evidence to assess translational potential in this early-stage replication, e.g. moving from a low-complexity cell line-based xenograft cancer mouse model to a patient-derived xenograft model[24]. Exact within-lab replications might also be used to increase the reliability of the results via an increased sample size and/or an increased number of (smaller) batches[25]. This will decrease outcome uncertainty and aid sample size planning for confirmatory studies. Ethical constraints, e.g. regarding studies with large animals, may prohibit stand-alone exact replication experiments. However, a replication study might be integrated as positive or negative control group(s) into the experimental design of a new exploratory study.
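When several such small experiments estimate the same effect, their estimates can be combined. A minimal inverse-variance (fixed-effect) pooling sketch, with hypothetical effect estimates and standard errors (appropriate only when heterogeneity between experiments is limited):

```python
# Fixed-effect (inverse-variance) pooling of standardized effect
# estimates from several small experiments; all numbers hypothetical.
import math

def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / se**2 for se in standard_errors]
    pooled = sum(w * d for w, d in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Three hypothetical within-lab replications
d_hat, se = pool_fixed_effect([0.90, 0.60, 0.75], [0.45, 0.40, 0.50])
print(f"pooled d = {d_hat:.2f}, 95% CI +/- {1.96 * se:.2f}")
```

The pooled standard error is smaller than that of any single experiment, which is what decreases outcome uncertainty for subsequent sample size planning.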
Exploration and within-lab replication studies also have the potential to reveal effect modifiers, confounders, and colliders. This may require adjustment of the experimental design, for example by including an estimate of the drop-out rate, due either to the animal model or to the intervention, that affects sample size planning. Information on such covariates can then lead to a refinement of, e.g., the randomization scheme if body weight affects the outcome of a study. In this example, to control for the variation in body weight, the experiment could be split into smaller blocks and interventions randomized to experimental units within each weight block. Such information can also support the selection of Go/No-Go decision points prior to confirmation. Finally, the decision about the transition from exploration to confirmation needs to include all stakeholders, including preclinical and clinical researchers as well as biostatisticians.
Engaging in a confirmatory multicenter study: reality check
Irrespective of the evidence generated by an exploratory study, feasibility needs to be evaluated to decide whether a multicenter decision-enabling experiment should be conducted. This evaluation includes practical constraints such as available resources (can increased animal numbers be handled?), ethical approval (replication experiments as an area of tension[26, 27]), and medical need. According to the animal welfare act and Directive 2010/63/EU of the European Parliament[28], an animal experiment can only be justified if it generates new knowledge and if that knowledge outweighs the harm to the animals[29]. Thus, confirmatory studies need to go beyond exact replications and generate diagnostic (i.e., decision-enabling) evidence about a knowledge claim[26, 30, 31]. In general, exploratory studies provide only preliminary evidence. Building on such initial findings, confirmatory studies allow generalization beyond specific experiments, gathering support for the underlying knowledge claim. For this, investigators need to ensure that validity and scientific rigor are preserved at a high level throughout the preclinical research trajectory (Fig. 1).
Optimization of evidence generation during confirmation
The goal of the (multicenter) confirmatory study is to support a knowledge claim and potentially inform the decision to move to the clinic. Again, a clear a priori definition of Go/No-Go decision points and clearly defined primary and secondary outcomes are indispensable. Other parts of the planning process are less generalizable (Fig. 2). Some of these aspects are beyond the scope of this manuscript; we will focus solely on biometry-related issues and/or practical constraints/aspects (v-vii) (Fig. 2).
Protocols, Standardization and Systematic Heterogenization
One important step in conducting multicenter studies is the harmonization of protocols (Fig. 2 (i, v)). In this process, the laboratories involved need to decide which aspects of the experimental protocols need standardization and which will systematically vary between centers. Important aspects that need to be standardized and quality controlled include the treatment scheme, to ensure comparable dosage and the same quality of the drug. Additionally, quality control measures identified through initial baseline studies are recommended. A comparison of outcomes from control groups, for example, can identify potential problems between centers early on. Knowledge about center variability and information on factors that influence the variance of results can be gained by introducing systematic heterogenization. This includes comorbidities and the use of both sexes[32, 33]. The latter is considered a minimum requirement in a confirmatory approach, except for sex-related diseases like prostate cancer or in case of well-grounded arguments. Heterogeneity will also be introduced by each study center. To assess the replicability of results across centers, a low number of centers is already sufficient; a minimum of two participating laboratories may suffice, and the added value of additional laboratories decreases rapidly[33]. A small number of centers precludes, however, estimation of between-center heterogeneity. Here, strategies need to ensure that centers can actually be analyzed jointly. Additionally, with regard to animal experiments, husbandry conditions including food, temperature, and cage mates will most likely vary between centers and laboratories and need to be considered if they affect the outcome[20].
Primary outcomes should be complemented by evidence from other sources. The selection of partner laboratories can also be based on such complementary methods and approaches. One example is patient-derived 3D cell cultures, used to gain a deeper understanding of underlying mechanisms and to capture effects only seen in human cells. By increasing the number of donors or models supporting a research claim, the validity of an observed effect can be increased (triangulation[34]). For studies that aim at clinical translation, translational validity should be improved by including (several) biomarkers or other diagnostic tools[35, 36] in the analysis and/or experimental design. For drug efficacy testing, control groups in the confirmatory study should include a competitor drug, i.e., the clinical standard treatment, and/or other negative and/or positive control groups. Researchers should be in close contact with regulatory authorities early on to ensure that experiments already incorporate requirements for approval. To avoid increasing the sample size through additional positive and negative control groups, it can be feasible to consider historical cohorts[37, 38] or an unbalanced multi-arm design[39, 40] with smaller but more numerous control groups that can be pooled. The latter two points led to extensive discussions between the authors and should thus be viewed as controversial[41].
Sample size calculation for confirmatory studies
The basis for sample size calculation is the anticipated effect size, which can be defined in various ways[42, 43]. Herein, we refer to effect size as a mean difference divided by a measure of spread. In a typical preclinical efficacy study, that could be the difference between the mean of the primary outcome measure in an intervention group and in the control group, divided by the pooled standard deviation[44]. As mentioned earlier, the effect size estimate from exploratory studies tends to be inflated (“winner's curse”)[8]. Basing the sample size calculation of a confirmatory study on such an inflated effect size results in an underpowered study that risks missing an existing effect. This is aggravated in experiments with low internal validity[8, 45]. Sample size calculations for confirmatory studies should take this potential effect inflation into account and apply shrinkage to exploratory effect size estimates to avoid underpowered studies. This also applies to effect sizes from published studies that are exploratory. This need not be stated explicitly in the published study; we recommend treating all research that does not explicitly state its confirmatory nature as exploratory. If several prior studies are available (pilot, exploration, mini-experiments), effect sizes can be pooled via meta-analysis if heterogeneity between experiments is limited. Moreover, effect sizes do not typically extrapolate from animals to humans and are potentially smaller in humans[46]. It is thus necessary to apply shrinkage to effect sizes from exploratory studies; the exact magnitude, however, is still a matter of debate.
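As a rough illustration of how shrinkage propagates into sample size, consider the standard normal-approximation formula for a two-sample comparison of means (the shrinkage factor of 0.5 is an arbitrary placeholder, since, as noted, the appropriate magnitude is still debated):

```python
# Sketch: per-group sample size via the normal approximation
# n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, and the effect of
# applying an (illustrative) shrinkage factor to an exploratory estimate.
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

d_exploratory = 1.2             # hypothetical (likely inflated) estimate
d_shrunk = 0.5 * d_exploratory  # placeholder shrinkage factor

print(n_per_group(d_exploratory))  # planning on the raw estimate
print(n_per_group(d_shrunk))       # planning on the shrunk estimate
```

Halving the assumed effect size roughly quadruples the required sample size, which makes explicit why the chosen shrinkage magnitude has substantial resource implications.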
An alternative approach is to define a smallest effect size of interest, as outlined above. This sets a lower bound below which results are no longer considered worth pursuing. Choosing such a threshold needs to reflect knowledge of the human disease and biology, effect size distributions in previous studies using similar model systems, available resources, and feasibility considerations[10]. If the smallest effect size of interest is set too high, the experiment will not be able to detect an actually existing effect. Conversely, an unnecessarily low smallest effect size of interest potentially requires a substantial amount of resources and animals, threatening the reduction principle of the 3Rs.
Once an effect size is chosen, this has implications for statistical power. Given the ongoing discussions on the utility of p-values and the standard threshold of p < 0.05, the planning of a confirmatory trial can use a stricter bound such as p < 0.005 or an increased power of, for example, 0.9[47–49]. Again, this has to be weighed against the increased effort, and cost-benefit calculations are necessary to avoid spending resources that could be used for other complementary studies[49, 50]. In confirmatory studies, strict correction for multiple comparisons should be applied to preserve the pre-specified false positive rate. As there is considerable uncertainty about the true effect, power could be calculated across a range of plausible effect sizes[51] instead of a point estimate, to illustrate limitations for investigators. Particularly when confirmatory studies are conducted in a sequential manner[52], this may increase efficiency. Moreover, as the exploratory study has already indicated the direction of the effect, sample size calculations and the subsequent analysis can be based on one-sided tests. However, in the case of an underpowered exploratory study aiming at mechanistic understanding, a sign error (type-S error) can occur, in which the confirmatory study detects an effect estimate in the direction opposite to that of the initial experiment or of the actual effect[53].
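The suggestion to compute power over a range of plausible effect sizes, using a one-sided test, could be sketched as follows (the per-group sample size and the effect size grid are hypothetical choices):

```python
# Sketch: approximate power of a one-sided two-sample z-test across a
# range of plausible standardized effect sizes (illustrative values).
from scipy.stats import norm

def power_one_sided(d, n_per_group, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha)
    # Noncentrality of the two-sample z statistic: d * sqrt(n / 2)
    return float(norm.cdf(d * (n_per_group / 2) ** 0.5 - z_alpha))

n = 20  # hypothetical per-group sample size
for d in (0.4, 0.6, 0.8, 1.0):
    print(f"d = {d:.1f}: power ~ {power_one_sided(d, n):.2f}")
```

Reporting such a curve rather than a single number shows investigators how quickly power deteriorates if the true effect lies at the lower end of the plausible range.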
Multicenter considerations
A balanced design, where each center is allocated the same number of animals, is considered ideal, as it increases the precision of estimates under between-center heterogeneity. One advantage over clinical trials is that recruitment differences can be held to a minimum: heterogeneity between centers is not due to different patient populations with different comorbidities; instead, as outlined above, most of the heterogeneity is systematically implemented in advance. The randomization to centers should take these previously planned factors into account in a block randomization scheme across centers. That is, factors need to be stratified: centers should, for example, test equal numbers of male and female animals, and animals from similar weight categories should be allocated to treatments similarly across centers. For this, a small number of additional animals may be needed to ensure a balanced design over all centers. Notably, the impact of unequal versus equal numbers of subjects in different centers on statistical efficiency also depends on the type of estimator used (e.g., fixed vs. random effects). Finally, unbalanced numbers are not necessarily a sign of poor planning but can be a consequence of varying capacities or animal breeding[54].
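A block randomization scheme stratified by center and a planned factor such as sex could be sketched as follows (stratum names, block size, and group sizes are all hypothetical):

```python
# Sketch: block randomization stratified by center and sex. Within each
# center x sex stratum, animals are assigned in shuffled blocks so that
# treatment and control remain balanced across all strata.
import random

def stratified_block_randomization(centers, sexes, n_per_stratum,
                                   block_size=4,
                                   treatments=("control", "treatment"),
                                   seed=42):
    assert n_per_stratum % block_size == 0
    assert block_size % len(treatments) == 0
    rng = random.Random(seed)
    allocation = {}
    for center in centers:
        for sex in sexes:
            sequence = []
            for _ in range(n_per_stratum // block_size):
                block = list(treatments) * (block_size // len(treatments))
                rng.shuffle(block)  # balance within every block
                sequence.extend(block)
            allocation[(center, sex)] = sequence
    return allocation

alloc = stratified_block_randomization(["lab_A", "lab_B"], ["f", "m"], 8)
for stratum, sequence in sorted(alloc.items()):
    print(stratum, sequence)
```

Because every block contains each treatment equally often, group sizes stay balanced within every center-by-sex stratum, and hence across centers, without any additional animals.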
It is important to consider which experiments need to be performed by the initiating institute and which by the partner laboratories. If a within-lab replication has already indicated replicability of a result within the initiating institute, then this lab may not need to perform the analogous experiment and can instead proceed with triangulating evidence, a different strain, a different (large) animal model, or flanking ex vivo experiments. In agreement with the initiating lab, partner labs can consider replicating only core results to save resources. Core results refer to the assessment of the primary and important secondary outcome variables. If a costly method like single-cell sequencing has been conducted in the initiating lab, replicating it across all labs could lead to an undue increase in costs with little additional insight. With respect to the animal model, a stepwise sequence of designs is recommended (rodents -> non-rodents -> non-human primates). As sample sizes in large mammals, including non-human primates, typically need to be smaller due to ethical constraints, a smaller number of centers may be acceptable. It is an open question to what extent evidence from rodent experiments can be extrapolated to large animals and inform sample size planning. The effect size magnitude in rodents may translate neither to larger animals nor to the human case.
Table 2
Summary points and recommendations for the conduct of a confirmatory multicenter study including open questions that require further discussion and will be subject matter of future research.
Summary points | Open Questions |
Minimum validity and reliability criteria need to be fulfilled before engaging in a confirmatory multicenter study (Table 1) | Are dose-response effects a prerequisite for the confirmation? |
If uncertainty is still high, optimization of evidence via (within-lab, in-house) replication studies to (i) increase sample size, (ii) improve internal validity, (iii) introduce systematic heterogenization and/or (iv) flanking experiments | What if evidence from pilot, exploration and within-lab replication studies is contradictory (positive and negative results)?
(Standardized) protocols should be in place before starting a confirmatory study | - |
(Animal) Model(s) should be disease relevant and limitations be acknowledged | - |
Depending on the experimental objective, control groups should include positive and negative controls and/or, where available, a comparator from standard clinical care | What requirements need to be fulfilled to use historical control groups?
For planning a confirmatory study, sample size calculation should be based on the smallest effect (size) of interest (clinically/biologically relevant) or a shrinkage of the effect size(s) from exploratory studies should be considered | Field-specific effect size distributions are scarce; how can the situation be improved? What is the optimal approach to calculate the sample size?
Flanking experiments (triangulation) might be performed early on and are highly recommended for confirmatory studies | How can in vitro studies be integrated in the confirmatory study design and sample size calculation? |
Introduction of sources of variation like sex or strain (systematic heterogenization) | How to best balance standardization and systematic heterogenization? |
Multicenter considerations include (i) harmonization of protocols, (ii) skills and expertise of partner lab(s), (iii) balanced design and (iv) block randomization across centers | Which experiments should be confirmed in several laboratories? |