This independent survey may be considered a validation of a group of researchers' support for the recommendation to abandon the use of the concept of statistical significance. With very few exceptions, the signatories correctly interpreted the p-value. This result is not trivial, because some studies suggest that misinterpretation of the p-value can be frequent even among academics [11–13]. Moreover, most respondents strongly agreed with abandoning the use of “statistical significance” and, for the most part, were motivated by the arguments presented against this concept.
However, regarding the feasibility of abandoning statistical significance, close to a quarter were fully convinced that they would never use this concept again. On the other hand, about 42% declared themselves neutral or said that they would likely use it in future publications. Assuming that the researchers surveyed represent those against the concept, the distribution of answers to this question suggests that fully retiring statistical significance does not seem feasible.
Because we were looking for a high response rate in our survey, we did not include questions about the reasons why signatories would use statistical significance again in future publications. In any case, using the concept of statistical significance does not mean that they will base their conclusions solely or primarily on this result. It is also possible that those continuing to use this term are motivated by compliance with the expectations of journals, reviewers, or readers rather than by their own way of interpreting the results.
The p-value will continue to be reported, and the dichotomization of results seems inevitable regardless of the criterion chosen. Despite this, based on the validation we have made, we consider that Amrhein et al. gave voice in their paper to a legitimate concern of researchers from different areas. In that sense, we agree on the importance of research findings not being based only on statistical significance [14, 15].
One aspect worth highlighting is the need to differentiate the application scenarios of statistical significance [12]. For example, there is a critical distinction between studies of causal inference and those for prediction purposes. In the latter, the interpretability of the estimates may be optional, and statistical criteria can drive decisions about whether or not to use a predictor [16, 17]. However, in studies of causal inference, the concept of statistical significance should not be a primary concern. Before looking at a p-value, the researcher must avoid biases, rely on conceptual structures to control for confounding, and consider contexts in which effects can be modified [18, 19]. After that, measures of association and impact are what must define when a result is significant in the clinical and public health scopes [20].
Therefore, it is not surprising that one of the major concerns expressed by several of the signatories is the misuse and misinterpretation of the p-value. Also, well-documented publication biases in favor of “positive results” are a consequence of the overvaluation of statistical significance [21, 22]. These are frequent concerns among editors and statistical consultants of biomedical journals. For example, The New England Journal of Medicine recently modified its guidelines for statistical reporting to require replacing p-values with estimates and confidence intervals when neither the protocol nor the analysis plan has specified methods to adjust for multiplicity [23].
We agree that the value of a result must be based on the interpretation of the spectrum of values compatible with the data, as Amrhein et al. suggested [10]. However, removing a term such as statistical significance is far from being a solution to publication bias. We fear that, even with point and interval estimates, associations compatible with the null value would undoubtedly continue to be under-reported. Conversely, the absence of a preset threshold for interpreting a p-value could reduce objectivity [24].
Faced with the seemingly inevitable use of statistical significance [21], we must give due value to statistical tests while promoting the understanding of their limitations [25]. In that sense, one critical issue is the widespread application of an arbitrary significance level (i.e., 0.05) [12, 26]. As an analogy, diagnostic tests may need different cut-offs depending on the disease prevalence to maintain high predictive values [27]. Similarly, it would be negligent to use the same significance level for all research problems. The pre-test probability of an association could help define a cut-off that increases the chance of both identifying true associations and discarding spurious ones [7]. Nevertheless, no value should become a new rule of thumb applicable to all situations.
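The analogy with predictive values can be made concrete. The following sketch (our own illustration with hypothetical numbers, not data from the survey or from any study cited here) computes the share of “significant” findings that are false positives as a function of the pre-test probability of the association, in the same way a positive predictive value depends on prevalence:

```python
# Illustrative sketch (hypothetical numbers): among findings declared
# "significant", what fraction are false positives? By analogy with the
# predictive value of a diagnostic test, this depends on the pre-test
# probability that the association is real, not only on alpha.

def false_positive_risk(alpha, power, pretest_prob):
    """Fraction of 'significant' findings that are false positives.

    alpha        -- significance level (type I error rate)
    power        -- probability of detecting a true association (1 - beta)
    pretest_prob -- prior probability that the association is real
    """
    true_pos = power * pretest_prob          # real effects, detected
    false_pos = alpha * (1 - pretest_prob)   # null effects, "significant"
    return false_pos / (true_pos + false_pos)

# Exploratory setting: low pre-test probability, conventional alpha.
low_prior = false_positive_risk(alpha=0.05, power=0.80, pretest_prob=0.10)
# Confirmatory setting: high pre-test probability, same alpha.
high_prior = false_positive_risk(alpha=0.05, power=0.80, pretest_prob=0.50)

print(f"pre-test 10%: {low_prior:.0%} of significant results are false")
print(f"pre-test 50%: {high_prior:.0%} of significant results are false")
```

Under these assumed values, roughly a third of “significant” results are false when the pre-test probability is 10%, versus only a few percent when it is 50%, which is why a single cut-off for all research problems is hard to justify.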
Reducing the significance level can reduce the false positive rate but increase the false negative rate, that is, reduce the power of a study. This can be a problem when decisions have to be based on studies with small sample sizes, such as in preliminary outbreak investigations or in research on extremely rare but severe diseases.
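This trade-off can be quantified with a standard power calculation. The sketch below (a hypothetical example with an assumed effect size and sample size, not taken from any study discussed above) shows how the power of a two-sided one-sample z-test falls as the significance level is lowered:

```python
# Illustrative sketch (hypothetical numbers): lowering alpha reduces the
# false positive rate but also the power of a small study.
from statistics import NormalDist

def z_test_power(alpha, effect_size, n):
    """Approximate power of a two-sided one-sample z-test.

    alpha       -- significance level
    effect_size -- standardized mean difference under H1 (Cohen's d)
    n           -- sample size
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = effect_size * n ** 0.5      # non-centrality under H1
    # Probability of landing beyond either critical bound under H1:
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# A small study: n = 20 with a moderate assumed effect (d = 0.5).
for alpha in (0.05, 0.01, 0.005):
    print(f"alpha = {alpha}: power = {z_test_power(alpha, 0.5, 20):.2f}")
```

With these assumptions, moving from alpha = 0.05 to 0.005 cuts the power roughly in half, illustrating why a stricter universal threshold would be costly precisely in the small-sample settings mentioned above.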
Conversely, a study aiming to replicate or confirm results from other well-designed studies need not use the same significance level, since the state of knowledge has changed. A higher significance level could be justified when previous studies suggest a high pre-test probability.
Moreover, other issues may need to be considered in each case, such as the implications of false positive and false negative results. For example, it does not seem sensible to use the same significance level to approve a drug with a high risk of adverse effects as for a low-risk educational intervention. In the former, we would probably be more interested in ruling out a type I (alpha) error. As with diagnostic tests, the cut-off point for significance should also be adjusted to increase the likelihood that our research will do more good than harm.
For all the above reasons, it is likely that we have not yet found a magic formula for choosing significance levels. Therefore, we share the frustration of decisions being guided by an arbitrary or poorly justified rule. Statistical significance may play a supporting role, but not a leading one. However, rather than trying to abolish this concept, we consider it necessary to develop strategies to justify and predefine significance levels, considering both the evidence and the implications of the errors resulting from statistical tests. Moreover, efforts to define what is clinically or epidemiologically significant may be more useful for guiding research and interventions [18, 19].