Most likely you have read some variation of the following statement in a clinical trial: “We found a statistically significant difference between group A and group B, with group A improving more than group B (P < 0.05). Therefore, we recommend using intervention X to treat Y condition.”
This statement should never be taken at face value without evaluating how the trial was conducted and reported. While many components influence the validity of clinical trial results, one of the simplest, yet most important, is the total number of participants studied in the trial. This is the “sample size.” Studies must enroll enough participants to provide statistical confidence that the results are valid. If an inadequate number of participants is enrolled, the results may be estimated incorrectly, leading to biased conclusions.
Planning a clinical trial with sample size in mind
When planning a trial, researchers determine how many participants they must recruit into the study. The size of the trial is influenced by three factors:
1. Effect size
The effect size is a number that measures the strength of the relationship between two variables. For clinical trials, the effect size is tied to the minimal clinically important difference (MCID). The MCID is the smallest change in outcome that is clinically important enough to change a patient’s management. If the intervention works, participants should experience an effect that is meaningful for their condition.
2. Significance level
Also known as alpha (α), the significance level is the threshold for declaring that a difference between groups is unlikely to have occurred by chance alone. It is closely tied to the P-value: the probability of observing a difference at least as large as the one found if, in truth, there were no difference between groups.
The most common significance levels are 0.05, 0.01, and 0.001. When you read that a P-value is less than (<) 0.05, this means that if there were truly no difference between groups, a difference at least this large would occur less than 5% of the time, making the observed difference more likely to reflect a real effect.
3. Statistical power
Also referred to as sensitivity, statistical power is the probability of correctly detecting a true difference between the groups. As it is often described, power is the probability of correctly rejecting the null hypothesis that there is no difference between groups.
The most common levels of power are 0.8, 0.85, and 0.9, which can be stated as 80%, 85%, and 90% power, respectively.
These three variables are used to calculate the sample size needed for a trial so that appropriate conclusions can be drawn about whether there is (or is not) a true difference between groups. Of the three variables, the effect size is the most challenging to define. The MCID may not be known, in which case an exploratory trial must be performed before a larger clinical study to determine the effect size and establish proof of concept.
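To make this concrete, below is a minimal sketch of such a calculation in Python using the statsmodels library. The inputs (a standardized effect size of 0.5, a significance level of 0.05, and 80% power) are illustrative assumptions, not values from any particular trial.

```python
# Minimal sketch: sample size for a two-arm parallel trial comparing means.
# The effect size, alpha, and power below are illustrative assumptions.
from math import ceil
from statsmodels.stats.power import TTestIndPower

effect_size = 0.5  # standardized effect (e.g., MCID divided by the outcome's SD)
alpha = 0.05       # significance level
power = 0.80       # desired statistical power

# Solve for the number of participants needed per group.
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Participants needed per group: {ceil(n_per_group)}")
```

For these inputs the calculation returns roughly 64 participants per group; in practice, the planned total is usually inflated further to allow for expected dropout.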
However, pilot trials are not always conducted. Instead, researchers often prematurely perform larger studies without knowing the MCID. When this occurs, the sample size is either arbitrarily chosen or based on a hypothesized rationale. This can bias the results and conclusions.
Underpowered studies
If the sample size chosen is too small compared to the true sample size required for the trial, the study is “underpowered.”
A trial can also end up underpowered if:
- Enough participants drop out before the study is completed that the sample size falls below the minimum required to attain power, even when the enrolled sample size was adequate
- The participants who dropped out are not retained in the analysis through a statistical technique called an “intention-to-treat” analysis
Results from an underpowered trial can still be computed. However, because the sample size is too small, the null hypothesis that there is no difference between groups can be neither rejected nor confirmed. Only a lack of evidence of an effect can be stated; in other words, an absence of evidence, which is not evidence of absence. It is not appropriate for researchers to make definitive conclusions regarding differences between groups, because statistically the researchers truly do not know whether the intervention produced a difference.
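A small simulation can illustrate why. In the hypothetical sketch below, a true effect exists (a standardized difference of 0.5), yet trials enrolling only 15 participants per group detect it a minority of the time, so a single non-significant result says little about whether the effect is real.

```python
# Hypothetical simulation: an underpowered trial often misses a true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.5   # assumed true standardized difference between groups
n_per_group = 15    # deliberately small sample
n_trials = 10_000

detected = 0
for _ in range(n_trials):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    _, p_value = stats.ttest_ind(treated, control)
    detected += p_value < 0.05

# Under these assumptions, only about a quarter of trials reach P < 0.05.
print(f"Share of trials detecting the true effect: {detected / n_trials:.0%}")
```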
Consequences of underpowered studies
Unfortunately, many clinical trials are underpowered, yet researchers still present decisive conclusions from them.
A consequence of underpowered trials is that the observed effects tend to be overestimated. The exaggeration is worse when the results are highly statistically significant, with very small P-values.
Hence, caution must be taken when interpreting the results from studies with inadequate sample sizes, especially those with a very small number of participants. Underpowered trials can also produce smaller effect estimates that are closer to the true effect; however, these are less likely to yield a statistically significant difference and may be dismissed as underpowered negative results.
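This exaggeration, sometimes called the “winner’s curse,” can be demonstrated with another hypothetical simulation: among underpowered trials that happen to reach P < 0.05, the average observed effect is far larger than the true effect that generated the data.

```python
# Hypothetical simulation: significant results from underpowered trials
# systematically overestimate the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3   # assumed true standardized difference
n_per_group = 20    # small, underpowered sample
n_trials = 10_000

significant_effects = []
for _ in range(n_trials):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < 0.05:
        significant_effects.append(treated.mean() - control.mean())

print(f"True effect: {true_effect}")
# Under these assumptions, the mean observed effect among significant
# trials is more than double the true effect.
print(f"Mean observed effect when significant: {np.mean(significant_effects):.2f}")
```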
Underpowered trials can be clinically deceiving if healthcare decisions are made, or at least strongly influenced, by their results. The scientific publication system is also affected by these inadequate studies. Positive outcome bias, a form of publication bias, results when trials with novel and statistically significant results are favored and published by journals over those with non-significant results.
Between 1990 and 2007, the number of studies reporting positive results grew by 22%, a statistically significant trend of roughly 6% per year that was consistent across countries and academic disciplines.
Between 1991 and 2008, there was a statistically significant decrease in the ratio of non-significant to significant results in published studies. These positive outcomes may be partly due to research with false-positive results.
Between 1992 and 2014, across 44 published studies, the mean statistical power was a low 0.24, and power had not increased in six decades.
How to avoid underpowered studies
The remedy for inadequately powered studies is to conduct better trials. While this may seem overly simplistic, there is no way around it.
Adequately powered studies performed for the right reasons will help reduce the amount of research reporting false-positive results that overestimate effects and bias the literature pool. It is possible that researchers are not trained in how to properly calculate sample sizes. They may choose a sample size based on prior research conducted in the same field, even when it is not clinically justified.
Improper motivations may also lead researchers to choose small sample sizes. Conducting clinical trials can be costly. While studies with smaller sample sizes are less expensive to conduct, as demonstrated, their conclusions may be deceptive if they are underpowered.
Researchers may be motivated to perform studies for the sake of advancing their careers rather than meaningfully adding quality research to the literature pool. However, clinical research should be conducted to advance knowledge that improves health and quality of life, not performed solely with career or financial motivations as the catalyst.
Pilot, or exploratory, trials that enroll a small number of participants are perfectly appropriate: they need to be conducted in young fields to establish the sample size calculations for larger subsequent studies and to determine whether the intervention is safe. However, exploratory research represents the minority of trials. It is appropriate to use a convenience sample of 24 to 50 participants in these initial trials, as this range has been recommended for estimating the standard deviation needed for the sample size calculation of subsequent large studies, as sketched below.
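As a hypothetical illustration, the sketch below feeds a standard deviation estimated from a pilot trial, together with a chosen MCID, into the same kind of power calculation shown earlier; the MCID of 4 points and pilot SD of 10 points are made-up numbers.

```python
# Hypothetical sketch: sizing a larger trial from a pilot's SD estimate.
from math import ceil
from statsmodels.stats.power import TTestIndPower

mcid = 4.0       # made-up minimal clinically important difference (points)
pilot_sd = 10.0  # made-up standard deviation estimated from a pilot trial

effect_size = mcid / pilot_sd  # standardized effect size = 0.4

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80
)
print(f"Planned participants per group: {ceil(n_per_group)}")  # about 100
```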
Final thoughts
Clinical research needs to be conducted with the highest methodological standards and the lowest risk of bias to increase the probability that the results are a faithful estimation of the truth. If researchers overlook the importance of enrolling an appropriate sample size, they waste time, resources, and funding and, most seriously, the health and hope of the individuals impacted by the study’s results. Ultimately, to optimize trust in clinical trial results, the necessity of adequate sample size calculations must be taught, understood, and implemented in practice.