Hypothesis testing is a method of testing an assumption about a population. It uses statistics to assess how compatible the observed data are with a given hypothesis by computing what is called a p value. The p value is the probability of observing a result at least as extreme as the one obtained if the null hypothesis, the hypothesis that there is no difference between populations, is true. The lower the p value, the stronger the evidence against the null hypothesis and in favor of a real difference between the populations.
- Collecting a representative sample: At the beginning of an experiment, it is good practice to estimate the sample size needed to answer the question. A power analysis estimates the number of samples needed to observe a meaningful difference.
- Defining the significance threshold (α): Before performing the statistical test, it is important to define the significance threshold or “α” that will be used to determine whether the data provide evidence for or against the hypothesis. Traditionally, p<0.05 has been used as a standard cut-off; however, this value is arbitrary. The acceptable cut-off will depend on the research question. For example, a 5% risk of a false positive (concluding that a drug works when it does not) may not be acceptable when testing the effectiveness of a drug in treating a serious illness.
- Testing the hypothesis: Once enough samples have been collected, the hypothesis can be tested. Typically, one proposes the null hypothesis, which states that the samples come from the same population. In other words, there is no difference between samples. The alternative hypothesis proposes that the samples come from different populations and there is a difference between them.
➜ The particular statistical test used to test the hypothesis should be appropriate for the data. For example, if the data are paired, a paired t-test may be most appropriate. If the data are not normally distributed, a non-parametric test such as the Mann-Whitney U test may be more appropriate.
- Interpreting the result: The statistical test outputs a probability value or “p value”, which indicates the probability of observing a result at least as extreme as the one obtained if the null hypothesis is true (i.e., if there is no difference between groups). In other words, are the data compatible with the null hypothesis? The lower the p value, the less compatible the data are with the null hypothesis.
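The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example measurements are hypothetical, the sample-size formula is the common normal approximation for comparing two group means, and a permutation test stands in for whichever test suits the data (a t-test or Mann-Whitney U test would follow the same pattern).

```python
import math
import random
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate samples needed per group (two-sided, two-group comparison).

    effect_size is the standardized mean difference (Cohen's d).
    Uses the normal approximation, so it slightly underestimates
    what an exact t-test calculation would give.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided p value for a difference in means via random relabeling.

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle them and count how often the shuffled difference is at least
    as extreme as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical example: made-up measurements for two groups.
alpha = 0.05  # significance threshold, chosen before testing
control = [4.8, 5.1, 5.0, 4.7, 5.3, 4.9, 5.2, 5.0]
treated = [5.4, 5.6, 5.1, 5.8, 5.5, 5.3, 5.7, 5.2]

print("samples per group for d = 0.5:", sample_size_per_group(0.5))
p = permutation_p_value(control, treated)
print("p value:", p, "| below alpha:", p < alpha)
```

A permutation test is used here because it makes the definition of the p value visible in the code: it is the fraction of relabelings, all consistent with the null hypothesis, that look at least as extreme as the real data.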
- Because it isn’t usually possible to measure an entire population, hypothesis testing and the resultant p values are a useful method to make inferences about the population based on a representative sample.
- One should not base scientific conclusions on the p value alone. The p value should be interpreted within the context of the entire experiment and other data presented.
- The p value does not indicate the probability of the result occurring by chance alone or the probability that the hypothesis is true. The p value indicates the likelihood of observing such an extreme value if the null hypothesis is true.
- The p value does not measure the size of an effect or the importance of a result. Smaller p values do not necessarily indicate a larger or more important effect. Even a very small effect can produce a small p value if the sample size is very large or the measurements are extremely precise. Conversely, large effects can produce large p values if the sample size is small or the measurements are imprecise.
- The p value on its own is not a good measure of evidence for or against a model or hypothesis. The p value must be interpreted within the context of other evidence.
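To make the distinction between p values and effect sizes concrete, the sketch below computes a two-sided p value for a fixed standardized effect at different sample sizes. It rests on simplifying assumptions that are not part of the text above: a two-sample z-test with known, equal variance, equal group sizes, and an observed difference that exactly equals the stated effect. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def p_value_for_effect(effect_size, n_per_group):
    """Two-sided z-test p value when the observed standardized difference
    between two equal-sized groups is exactly effect_size.

    Assumes known unit variance in both groups (a simplification).
    """
    z = effect_size * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A very small effect with a very large sample: small p value.
small_effect_big_n = p_value_for_effect(0.1, 5000)

# The same small effect with a small sample: large p value.
small_effect_small_n = p_value_for_effect(0.1, 20)

# A large effect with a tiny sample: p value still above 0.05.
large_effect_tiny_n = p_value_for_effect(0.8, 5)

for label, value in [
    ("d = 0.1, n = 5000 per group", small_effect_big_n),
    ("d = 0.1, n = 20 per group", small_effect_small_n),
    ("d = 0.8, n = 5 per group", large_effect_tiny_n),
]:
    print(f"{label}: p = {value:.3g}")
```

The effect size is identical in the first two cases; only the sample size changes the p value. This is why an effect size and its uncertainty are usually reported alongside the p value.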
What to watch for
- Consider the sample size used in the study. A statistically significant p value does not necessarily imply importance. P values are affected by the sample size. When a true difference exists, larger sample sizes tend to produce smaller p values and smaller sample sizes tend to produce larger p values. A study with a very large sample size may find a "true" difference that is so small that it may not be important.
- Was a power analysis performed prior to data collection? Adequately powered studies have determined the sample size needed to examine the desired effect. Underpowered studies are less likely to detect a true effect and are at a greater risk of being biased (i.e., they produce more false negatives and tend to exaggerate the true effects they do detect).
- Were p values adjusted for multiple comparisons? Multiple statistical comparisons increase the likelihood of obtaining at least one "significant" result by chance alone. Studies that perform multiple comparisons should adjust p values to account for this.
- Non-significant p values cannot prove a null hypothesis. In other words, the absence of evidence that two groups are different is not evidence that there is no difference between groups. A non-significant result may imply a number of things: for example, that the study was underpowered, that the appropriate study design was not implemented, or that measurements were imprecise.
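As a rough illustration of the multiple-comparisons point, the sketch below shows how quickly the chance of at least one false positive grows with the number of tests, and two common ways of adjusting p values (Bonferroni and Holm). The text above does not name a specific correction, so these two are chosen here only as examples; the function names and the list of p values are hypothetical, and the family-wise error calculation assumes the tests are independent.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted value never drops below
        # the adjusted value of a smaller raw p value.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p values from four comparisons in one study.
raw = [0.01, 0.04, 0.03, 0.005]

print("chance of a false positive in 20 tests:",
      round(familywise_error_rate(0.05, 20), 3))
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
```

Holm is never more conservative than Bonferroni, so it is often preferred when the goal is to control the chance of any false positive. Other approaches, such as false discovery rate control, address a different error criterion.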