Accurately estimating the reproducibility of scientific methods is critical for guiding researchers’ methodological decisions. Our results demonstrate that estimating statistical errors by resampling with replacement from random data produces large biases when the resample size approaches the full sample size. We explain this bias fully by the compounding of sampling variability in test statistics under resampling and its knock-on effects on estimated statistical errors. We further simulate ground truth data with true effects to show that estimated statistical power is inflated when the true power of the discovery sample is low and slightly deflated when true power is high. This could lead to circular reasoning: we must assume that we have high statistical power before we can rely on the estimate that we have high statistical power. Lastly, we show that this bias is largely avoided when subsampling only up to 10% of the full sample size after Bonferroni correction. This 10% rule of thumb is consistent with the use of resampling techniques in a recent evaluation of statistical power and false discovery rates for genome-wide association studies with hundreds of thousands of participants28, as well as with recommendations for 10-fold cross-validation to reduce prediction error in machine learning29.
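The compounding of sampling variability described above can be illustrated with a minimal simulation sketch. It is not the code used for the analyses reported here. It assumes a single pair of independent Gaussian variables per null dataset, a Fisher z-test at an uncorrected two-sided 5% level, and illustrative settings (*n* = 400, 200 datasets, 200 resamples each); the function name `resampled_false_positive_rate` and all of its parameters are hypothetical.

```python
import numpy as np


def resampled_false_positive_rate(n, m, n_datasets=200, n_resamples=200,
                                  z_crit=1.96, seed=0):
    """Fraction of resamples declared significant when the true effect is zero.

    n: full sample size of each simulated null dataset (assumed value).
    m: resample size, drawn with replacement from the n observations.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_datasets):
        # Null data: x and y are independent, so the true correlation is 0.
        x = rng.standard_normal(n)
        y = rng.standard_normal(n)
        # Draw all resamples at once: one row of indices per resample.
        idx = rng.integers(0, n, size=(n_resamples, m))
        xs = x[idx] - x[idx].mean(axis=1, keepdims=True)
        ys = y[idx] - y[idx].mean(axis=1, keepdims=True)
        # Pearson correlation within each resample.
        r = (xs * ys).sum(axis=1) / np.sqrt(
            (xs ** 2).sum(axis=1) * (ys ** 2).sum(axis=1))
        # Fisher z-statistic, approximately N(0, 1) for a fresh null sample.
        z = np.arctanh(r) * np.sqrt(m - 3)
        hits += np.count_nonzero(np.abs(z) > z_crit)
    return hits / (n_datasets * n_resamples)


if __name__ == "__main__":
    n = 400
    for m in (n, n // 10):
        rate = resampled_false_positive_rate(n, m)
        print(f"resample size {m}: estimated false positive rate {rate:.3f}")
```

Under a normal approximation, each resampled statistic carries the sampling error of the original dataset in addition to its own resampling error, so the variance of the z-statistic under the null is roughly 1 + *m*/*n* rather than 1. In this sketch one would therefore expect the nominal 5% false positive rate to rise to somewhere around 17% when *m* = *n*, and to stay near 6% when *m* = *n*/10; exact values depend on the seed and settings.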

What are the implications for the results presented by Marek, Tervo-Clemmens *et al*.1? For the strictly denoised Adolescent Brain Cognitive Development (ABCD) sample (*n* = 3,928), they report around 68% power after Bonferroni correction when resampling at the full sample size (Marek, Tervo-Clemmens *et al*.1 Fig. 3d). Our true-effect simulation results indicate that this estimate could be inflated from a true average power anywhere between 1% and 40%. Furthermore, when subsampling from the UK Biobank with a full sample size of *n* = 32,572, Marek, Tervo-Clemmens *et al*.1 report around 1% power for *n* = 4,000 and *α* = 10⁻⁷. We therefore argue that the 68% power reported for the full ABCD sample (*n* = 3,928, *α* = 10⁻⁷) more likely reflects methodological bias than increased signal after strict denoising of brain data. While the largest BWAS effects may be highly reproducible with 4,000 participants, the average univariate BWAS effect is most likely not reproducible. On the other hand, our true-effect simulations (Fig. 4) also indicate that the UK Biobank estimates at the full sample size are more reliable, with an underlying power likely between 70% and 90% at *n* = 32,572 after Bonferroni correction. Ultimately, our results suggest that replicating the univariate BWAS tested in Marek, Tervo-Clemmens *et al*. requires tens of thousands of individuals.

Our results have direct implications only for mass univariate association studies; however, it is worth noting how other methodological decisions could influence reproducibility in neuroimaging. For example, inter-individual correlation studies offer “as little as 5%-10% of the power” of within-subject t-test studies with the same number of participants4. Other methodological choices, such as data modelling, should also be carefully considered. The lack of power in the univariate BWAS considered by Marek, Tervo-Clemmens *et al*. could also be influenced by the choice of a group-averaged brain parcellation30, which fails to account for individual-level variations in resting state functional connectivity31,32. Brain models33 that do account for such individual variability generalise better, as demonstrated by stronger out-of-sample prediction31,34, and could also lead to higher replication rates in null-hypothesis significance tests. Note also that the way null distributions35 of brain-wide statistics are modelled has a large influence on the resulting *P* values. With this in mind, one could consider a predictive framework rather than an explanatory one36, which could be replicable with only hundreds of participants37,38.

It is clear that investigations of the reproducibility of wider BWAS methods are required. We urge the authors of such meta-analyses to validate their own meta-analytic methods, for example with null data, so that they can reliably evaluate the reproducibility of the scientific methods used in research.