## ‘Disease subtypes’ occur in simulations using published data and UNIMODAL distributions

In microbiome rodent studies, selecting a sufficient number of human donors (N), and of mice tested per human donor (MxD), is critical to account for the effects of random sampling, which arise when the hGM induces variable disease severity in humans and rodents. Thus, to visualize the variability of disease severity (data subtypes/modes) in rodents, and the effect of N on the reproducibility of pairwise statistical comparisons between groups of hypothetical, randomly selected human donors, we first conducted a series of simulations using the mean ± SD (observed data) from hGM-FMT mice in Basson et al.[46] (note the dispersed overlapping variances, SD, in Fig. 1A). Using the observed data, we generated random datasets with functions designed to draw numbers from an inverse Gaussian distribution (a unimodal, continuous probability distribution with support (0, ∞)). We demonstrate how the random selection of donors (sampled as groups for each of three iterative datasets) influences the direction and significance level of pairwise comparative statistics (Fig. 1B).
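A minimal sketch of this sampling procedure is shown below. The group means and SDs are hypothetical placeholders (not the published Basson et al. values), and a zero-truncated normal stands in for the inverse Gaussian draw; the point is only to show how three iterative random donor selections can shift a pairwise comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mean ± SD per group (placeholders, not the published values).
groups = {"Dis1": (8.0, 4.0), "Dis2": (6.5, 4.0), "Healthy": (4.5, 4.0)}

def draw_donors(mean, sd, n, rng):
    """Draw n donor severities from a unimodal curve truncated at 0 (support (0, inf))."""
    x = rng.normal(mean, sd, size=n)
    while (x < 0).any():                        # resample negatives to stay positive
        x[x < 0] = rng.normal(mean, sd, size=int((x < 0).sum()))
    return x

pvals = []
for iteration in range(3):                      # three iterative datasets, as in Fig. 1B
    d1 = draw_donors(*groups["Dis1"], 3, rng)
    d2 = draw_donors(*groups["Dis2"], 3, rng)
    pvals.append(stats.ttest_ind(d1, d2).pvalue)
    print(f"iteration {iteration}: Dis1 vs Dis2 p = {pvals[-1]:.3f}")
```

With overlapping SDs this large, the three iterations can return markedly different p-values even though they are drawn from the same underlying distributions.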

Simulations showed that the number of MxD is important because mice have various response patterns to the hGM (i.e., disease severity, data subtypes/modes), which can be consistently detected depending on the MxD and thus on the variability introduced by random sampling. For the three group datasets (plotted as ‘Dis1’, ‘Dis2’, and ‘Healthy’), it was possible to reproducibly identify two to three unique donor disease-severity subtypes (data modes) in mice induced by the hGM (‘high’, ‘middle’, and ‘low’ disease severity). Simulation plots made it visually evident that testing < 4–5 MxD yielded mean values more likely to be affected by the intrinsic variability of random sampling, making studies with > 6 MxD more stable and preferable. Conversely, studies with 1–2 MxD are at risk of being strongly dependent on randomness. Iterative simulations showed that the mean effect (e.g., ileal histology) in transplanted mice varies minimally (i.e., stabilizes) after 7 ± 2 MxD, depending on the random dataset iterated. Beyond that point, increasing MxD becomes less cost-effective and unnecessary if the focus is the human donors (Fig. 1C).
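The stabilization of a donor's estimated effect with increasing MxD can be sketched as follows, assuming a hypothetical donor whose mice respond with mean 6.0 ± 4.0 (placeholder values, not the published histology scores):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical donor: mice respond with mean 6.0, SD 4.0 (placeholder values).
donor_mean, donor_sd = 6.0, 4.0

# Running mean of the transplant effect as mice-per-donor (MxD) grows to 12.
mice = rng.normal(donor_mean, donor_sd, size=12)
running_mean = np.cumsum(mice) / np.arange(1, 13)

# Absolute change in the estimated mean when one more mouse is added.
step_change = np.abs(np.diff(running_mean))
for mxd, delta in enumerate(step_change, start=2):
    print(f"MxD={mxd:2d}  mean={running_mean[mxd - 1]:5.2f}  change={delta:.2f}")
```

Plotting `running_mean` against MxD for several random donors reproduces the qualitative pattern described above: large swings below 4–5 mice, with the estimate settling as MxD approaches 7 ± 2.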

## Random sampling from overlapping diseases yields ‘linear patterns of analytical irreproducibility’

Often, published literature contains figures and statistical analyses conducted with 3 donors per disease group. Thus, to mimic this scenario and to examine the role of random sampling in the reproducibility of pairwise statistical results (‘significant’ vs. ‘non-significant’), we conducted, i) multiple 3-donor/group (‘trio-trio’) pairwise comparisons, and ii) a simultaneous overall analysis of the cumulative sum of all the 3-donor trios simulated for each disease group. That is, we monitored and quantified whether results for each random iteration were significant (univariate Student’s t-statistics, p < 0.05) or non-significant (p ≥ 0.05) for groups of simulated donor datasets (‘Dis1’, ‘Dis2’, and ‘Healthy’). By assessing the effect of random sampling at various N, and also as N accumulated, we were able to illustrate that pairwise trio-trio comparisons between the simulated datasets almost always produced non-significant results when iterative trios were compared (due to large overlapping SDs; see bars in Fig. 1D representing 21 sets of pairwise trio-trio p-values). However, as N increased through the cumulative addition of all (mostly ‘non-significant’) donor trios (i.e., N increasing in multiples of 3, for a range of 3 to 63 donors/group; [3, 6, 9, 12…63]), pairwise statistical comparisons between the simulated datasets still did not produce consistent results (see line plots in Fig. 1D representing p-values for the cumulative addition of donors across simulated sampling iterations).
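The trio-trio and cumulative analyses can be sketched as below. Group parameters are hypothetical (large overlapping SDs, as in Fig. 1A), and SciPy's Student's t-test stands in for the original software:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical group parameters (mean, SD) with large overlap.
dis1, dis2 = (8.0, 4.0), (6.5, 4.0)

trio_p, cumulative_p = [], []
g1, g2 = np.empty(0), np.empty(0)
for n_trios in range(21):                       # N grows 3, 6, ..., 63 donors/group
    t1 = rng.normal(*dis1, size=3)
    t2 = rng.normal(*dis2, size=3)
    trio_p.append(stats.ttest_ind(t1, t2).pvalue)        # trio-trio comparison
    g1, g2 = np.concatenate([g1, t1]), np.concatenate([g2, t2])
    cumulative_p.append(stats.ttest_ind(g1, g2).pvalue)  # cumulative comparison

print("trio-trio significant:", sum(p < 0.05 for p in trio_p), "/ 21")
print("cumulative p at N=63:", round(cumulative_p[-1], 4))
```

Re-running the loop with different seeds mimics repeated ‘recruitment’ campaigns: the trio-trio bars are mostly non-significant, while the cumulative p-value trace differs from run to run, which is the irreproducibility pattern plotted in Fig. 1D.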

Results are clinically relevant because the simulated N (63 donors/group), being much larger than the largest N tested by one of the studies reviewed by Walter et al.[7] (21 donors/group)[40], demonstrates that the analysis of randomly selected patients would not always yield reproducible results, due to the chance of sampling random sets of individuals with varying degrees of disease severity, regardless of how many donors are recruited in a study. To provide a specific example, using ‘Dis1’ as a referent, cumulative pairwise comparisons (vs. ‘Dis2’ and vs. ‘Healthy’) revealed at least five different patterns of ‘irreproducible’ statistical results as N increased from 3 to 63 per group. Figure 1D illustrates four of these variable cumulative linear patterns of analytical irreproducibility, in which, remarkably, i) ‘Dis1’ becomes significantly different vs. ‘Dis2’ and vs. ‘Healthy’ as N increases; ii) ‘Dis1’ becomes significantly different from ‘Dis2’ but not from ‘Healthy’; iii) ‘Dis1’ was significantly different from ‘Healthy’ but not from ‘Dis2’; and iv) ‘Dis1’ never becomes significantly different despite sampling up to 63 donors/group. See Supplementary Fig. 1 for complementary plots illustrating the linearity of patterns (R², mean 0.51 ± 0.23, 21 simulations). Hence, results clearly illustrate that seeking funds to recruit more donors is not a prudent statistical solution to the problem of understanding disease causality for widely variable conditions in both humans and animals. By analytical irreproducibility, herein, we refer to the inability to reproduce the direction and statistical significance of a test effect when analyses are conducted between groups created by the random selection of subjects from distributions defined based on observed (mean ± SD) data.

## 100,000 Monte Carlo simulations illustrate the effect of randomness on analytical reproducibility

To summarize the overall significance of the inconsistent patterns observed via random sampling, we computed an aggregate ‘cumulative probability of being a significant simulation’ for 50 pairwise statistical simulation sets fulfilling the four linear patterns described above. Emphasizing the concept that increasing N is not a reproducible solution, Fig. 1E shows that only 35.3 ± 4.0% of comparisons between ‘Dis1’ and ‘Dis2’, and 58.8 ± 3.3% between ‘Dis1’ and ‘Healthy’, were significant.

Expanding the validity of these inverse Gaussian simulations for N = 63 donors/group, we then conducted i) Monte Carlo-adjusted Student’s unpaired t-tests, and ii) Monte Carlo-adjusted one-way ANOVA with Tukey correction for family errors and multiple comparisons. Monte Carlo simulations used data drawn from a normal (non-inverse) Gaussian distribution around the group means with a pooled SD of ± 4, and were conducted using GraphPad, a popular statistical software in published studies. To estimate a probability closer to the real expectation (narrower confidence intervals), 100,000 simulations were performed. Supporting the observations above (based on inverse normal simulations), Monte Carlo Gaussian simulations showed that, using pairwise comparison, ‘Dis1’ would be significantly different from ‘Dis2’ (adjusted t-test p < 0.05) only 57.7% of the time (95%CI = 57.4–58.0), with 1,540 simulations producing negative (contradictory) mean differences between the groups. Compared to ‘Healthy’, ‘Dis1’ and ‘Dis2’ were significant only 9.1% (95%CI = 8.9–9.2) and 78.3% (95%CI = 78.1–78.6) of the time, respectively.
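A reduced-scale illustration of this Monte Carlo procedure is sketched below (2,000 rather than 100,000 iterations, hypothetical group means with a pooled SD of 4, and SciPy in place of GraphPad):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical means; pooled SD = 4 and N = 63/group, as in the Monte Carlo setup.
mu1, mu2, sd, n, sims = 8.0, 6.5, 4.0, 63, 2000   # 2,000 here; 100,000 in the text

hits = 0
for _ in range(sims):
    a = rng.normal(mu1, sd, n)
    b = rng.normal(mu2, sd, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits += 1

prop = hits / sims
print(f"significant in {prop:.1%} of simulations")
```

The proportion of significant runs, rather than any single p-value, is the quantity tabulated in Table 1; increasing `sims` narrows its confidence interval without changing its expected value.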

Under the ‘Weak Law of Large Numbers’[47–49] and randomization principles, it is almost always possible to detect some level of statistical significance and mean group differences when asymptotic mathematical methods based on numerous simulations are used, for example, as a surrogate for multiple experiments that are not feasible in real research settings. However, in this case, the mean differences yielded by 100,000 simulations were minuscule (1.6 for ‘Dis1’–‘Dis2’; -1.97 for ‘Healthy’–‘Dis2’; and 0.42 for ‘Healthy’–‘Dis1’). Compared to the range of disease variance for each disease, such minuscule differences may not be clinically relevant to explain disease variance at the individual level. Note that the SD was 4; in that numerical context, it is intuitive to visualize how small these differences are across greatly overlapping unimodal simulations. Correcting for family errors, one-way ANOVA corrected with 10,000 Monte Carlo simulations with N = 63/group showed that at least one of the three groups would be statistically different in only approximately 67.2% of the simulations (95%CI = 64.2–70.0), whereas in the remaining 32.8% (95%CI = 30.0–35.8) of simulations the groups would appear statistically similar (see Table 1 for estimations after 100,000 Monte Carlo simulations; note the narrower CI as simulations increase). The comparison of ‘Dis1’ vs. ‘Dis2’ in Table 1 clearly demonstrates that the percentage of cases in which a simulation could be significant depends on the degree of data dispersion. For example, simulations with an SD of 10, compared to an SD of 4, produce significant results less often, illustrating how data with larger dispersion contribute to poor statistical reproducibility, which cannot necessarily be corrected by increasing N.

Table 1

Comparative percentages of simulations that yielded significant results for two statistical approaches

| Simulation (n, statistical test) | Inverse normal Gaussian: 50 t-tests (significant cumulative linear pattern*) (95%CI) | Monte Carlo normal Gaussian: 100,000 adjusted t-tests (overall significance with N = 63/group)ᵃ (95%CI) | Monte Carlo normal Gaussian: 100,000 adjusted one-way ANOVA with Tukey multiple-comparison testᵇ (95%CI) |
| --- | --- | --- | --- |
| Dis1 vs Dis2 | 35.3% (22.9, 50.8) | 57.7% (57.4, 58.0) | 37.8% (37.5, 38.1) |
| Dis1 vs Healthy | 58.8% (43.2, 71.8) | 9.1% (8.9, 9.3) | 3.8% (3.7, 3.9) |
| Dis2 vs Healthy | ND | 78.3% (78.0, 78.6) | 59.6% (59.3, 59.9) |
| Overall one-way ANOVA | | | p < 0.05: 68.1% (67.8, 68.4); p ≥ 0.05: 31.9% (31.6, 32.2) |

Table based on randomly simulated data derived from unimodal distributions.

*Not the overall p-value at N = 63. ND, not determined.

ᵃ,ᵇ Notice that the percentage of simulations achieving significance is inflated when an analysis of three groups is conducted with t-tests (instead of ANOVA), which do not control for false positives due to family errors. Proper comparison between > 2 groups should be performed with methods that control for such family errors (e.g., ANOVA with post-hoc Tukey statistics). Note that the percentages differ from those illustrated in Fig. 1D because patterns with non-linear behavior are not considered.
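The dependence of significance rates on dispersion can be sketched directly. The block below (hypothetical means, a fixed mean difference of 1.5, N = 63/group, 2,000 iterations per condition) compares how often the comparison reaches p < 0.05 at SD = 4 versus SD = 10:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def pct_significant(sd, mu_diff=1.5, n=63, sims=2000, rng=rng):
    """Share of simulations in which two groups of n donors, whose true means
    differ by mu_diff, reach p < 0.05 in an unpaired t-test."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, sd, n)
        b = rng.normal(mu_diff, sd, n)
        hits += stats.ttest_ind(a, b).pvalue < 0.05
    return hits / sims

tight, wide = pct_significant(sd=4), pct_significant(sd=10)
print(f"SD=4: {tight:.1%} significant;  SD=10: {wide:.1%} significant")
```

The same absolute mean difference is detected far less often when the dispersion widens, which is the mechanism behind the Table 1 contrast.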

## Random sampling can lead to ‘erratic patterns of analytical irreproducibility’ as N increases

To increase the external validity of our observations, we next simulated the mean ± SD data published from a hGM-FMT study on colorectal cancer conducted by Baxter et al.[16]. In agreement with Basson et al., Baxter et al. revealed comparably bimodal colorectal cancer phenotypes in mice receiving FMT from both diseased (colorectal cancer) patients and healthy human donors (Fig. 1F). Equally important, for both Basson et al.[46] and Baxter et al.[16], we observed what we describe as the fifth ‘pattern of analytical irreproducibility’ in this report. That is, in some cases, the steady addition of donor trios/group (as simulations proceeded for increasing values of N) made it possible to identify simulations in which the statistical significance of group comparisons switched erratically, yet gradually, between significant and non-significant as more donor trios were ‘recruited’ into the simulations (Fig. 1G). Of clinical relevance, simulations reproducibly indicated that adding extra patients could at times actually invert the overall cumulative direction of the p-value, possibly due to the variable distribution and multimodal nature of human and rodent responses to experimental interventions. As such, simulations indicate that it is advisable to conduct several a priori-determined interim analyses in clinical trials to ensure that significance is numerically stable (p < 0.05), and they underscore the relevance of personalized analyses to examine disease variance in populations. Unfortunately, no guidelines or examples are available to assist in determining how many donors would be sufficient, or to visualize the effect of randomly sampling individuals from a vastly heterogeneous population of healthy and diseased subjects, once rodent preclinical data are generated.

## Violin plots and statistical methods for visualization of MULTIMODAL ‘disease data subtypes’

To visualize the underlying mechanisms that could explain the ‘linear and erratic patterns of analytical irreproducibility’ introduced by random sampling, we first used dot plots based on observed and simulated data, followed by kernel-based statistics and plots. Plot appearance and one-way ANOVA statistics showed that when N is increased, significant results, when present for largely overlapping phenotypes, are primarily due to small differences between sample means (Fig. 2A-B). Simulations that compared 3 groups of 65 donors/group almost always yielded a significantly different group; however, dot plots show that the significant differences between means are just a small fraction of the total disease variability as verified with Monte Carlo simulations above. That is, as N increases, comparisons can become significant (see plot with 65 donors in Fig. 2C). In this context, a significant difference of such a narrow magnitude may not be clinically relevant, or generalizable, to explain the presence of a disease phenotype in a population, especially for those individuals at the extreme ranges of the disease distribution.

Mechanistically, the detection of significant comparisons can be attributed to the effect that ‘increasing N’ has on the data mean and variance, which increases at a higher rate for the variance, as shown in Fig. 2D. Instead of increasing N as a general solution, we propose that scientists use violin plots over other plots commonly encouraged by publishers[50] (e.g., bar, box, and dot plots), because violin plots provide an informative approach, at the group-sample level, for making inferences about ‘disease data subtypes’ in the population (see ‘subtypes’ shown with arrows in Fig. 2E).

Violin plots are similar to box plots, as they show a marker for the data median, the interquartile range, and the individual data points[51]. However, as a unique feature, violin plots show the probability density of the data at different values, usually smoothed by a kernel density estimator. The idea of a kernel average smoother is that, within a range of aligned data points, for each data point to be smoothed (X0) there is a constant window size (λ) of choice (the kernel radius, or width), within which a weighted average of nearby data points is estimated. Weighted analysis gives data points that are closer to X0 higher weights within the kernel window, thus identifying areas of higher data density (which correspond to the disease data modes). As an example of the benefits of using violin plots, Fig. 2F illustrates that as N increases, so does the ability of scientists to subjectively infer the presence of disease subtypes. To strengthen the reproducibility of ‘subtype’ mode identification, herein we recommend the use of statistical methods to identify disease data modes (e.g., see the modes function in Methods and Discussion), because as N increases, the visual detection of modes becomes increasingly subjective, as shown in Fig. 2F.
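The kernel smoother described above can be written in a few lines. The sketch below is a minimal Gaussian kernel density estimator applied to hypothetical bimodal severity data (two ‘disease subtypes’); the bandwidth plays the role of the window size λ, and local maxima of the estimated density correspond to the data modes that a violin plot reveals:

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Minimal Gaussian kernel density estimate: each grid point x0 receives a
    weighted average of the data, with weights shrinking with distance from x0."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    weights = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return weights.sum(axis=1) / (len(data) * bandwidth)

# Hypothetical bimodal severity data (two 'disease subtypes').
rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(3, 1, 40), rng.normal(9, 1, 40)])

grid = np.linspace(0, 12, 121)
density = gaussian_kde_1d(data, grid, bandwidth=0.8)

# Interior local maxima of the density correspond to the data modes.
modes = grid[1:-1][(density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])]
print("estimated modes near:", np.round(modes, 1))
```

Shrinking the bandwidth makes mode detection noisier, while widening it can merge true modes, which is why we recommend an explicit statistical mode-finding step over visual inspection alone.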

## Kernel density violin plots help guide subtype analysis to identify biologically significant results

Violin plots and kernel density distribution curves in Fig. 3 illustrate why comparing groups of randomly sampled individuals may not yield biologically relevant information, even when statistical analysis identifies that mean values differ between the compared groups. Figure 3A illustrates the different patterns of potential donor subtypes (i.e., data modes, visualized in violin plots as disease data/curve ‘shoulders’) that would yield significant results in a single experiment depending on the donors sampled. However, the kernel density plots in Fig. 3B show that significant findings do not necessarily yield clinically relevant thresholds or parameters to differentiate between the populations (due to the overlap and inflation of data ‘shoulders’ in some subjects within the samples). To contrast with the data simulated from Basson et al., we replaced the ‘Dis1’ dataset with a Gaussian-distributed sample of random numbers (13.5 ± 3.5, labeled ‘fake disease X’; vs. 6.4 ± 4.3 and 4.5 ± 2.5 for ‘Dis2’ and ‘Healthy’, respectively) to illustrate how a kernel plot would appear when significant differences have a clinically relevant impact in differentiating disease subtypes (Fig. 3C-D).

Collectively, simulations indicate that the uneven random sampling of subtypes across a disease group would be an important factor in determining the direction of significance if studies were repeated, owing primarily to the probability of sampling data ‘shoulders’ or ‘valleys’ in both healthy and diseased populations.

## Simulation of BIMODAL diseases illustrates the mechanism of analytical irreproducibility

In our report thus far, we have used unimodal simulations to show how random sampling affects statistical results. However, there has been an increased interest in understanding data multimodality in various biological processes[52, 53] for which new statistical approaches have been proposed. Methods to simulate multimodal distributions are however not trivial, in part due to the unknown nature of multimodality in biological processes. To facilitate the understanding of the conceptual mechanisms that influence the effect of data multimodality and random sampling on statistical significance, Fig. 4 schematically contextualizes the statistical and data distribution principles that can interfere with reproducibility of statistical results when simulations are repeated.

Random simulations from unimodal distributions work on the assumption that numbers (e.g., donors’ disease severity) are drawn from a population independently of one another. That is, the probability of sampling or drawing a number from the population is not influenced by the number selected before it. While this form of random sampling is very useful in deterministic mathematics, it does not capture the dependence of events that occurs in biology, where the probability of an event depends on the nature of the preceding events. To increase the external validity of our report, we thus conducted simulations based on three strategies to draw density curves resembling bimodal distributions. Figure 5A depicts distributions derived from a ‘truncated beta’ function and from the combination of two ‘mixed unimodal’ distribution functions (e.g., two independent Gaussian curves in one plot), which are illustrative of multimodality, but not necessarily reliable methods for examining the effects of dependent random sampling in multimodality.
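Two of these independent-sampling strategies can be sketched as follows. All parameters are hypothetical, and the rescaling of a U-shaped Beta(0.5, 0.5) onto a severity range is our assumption for what a ‘truncated beta’ draw could look like:

```python
import numpy as np

rng = np.random.default_rng(6)

def mixed_unimodal(n, rng, p=0.5, m1=3.0, s1=1.0, m2=9.0, s2=1.0):
    """Bimodal sample from two 'mixed unimodal' Gaussian components: each draw
    picks a component with probability p, then samples that component."""
    pick = rng.random(n) < p
    return np.where(pick, rng.normal(m1, s1, n), rng.normal(m2, s2, n))

def beta_bimodal(n, rng, a=0.5, b=0.5, lo=0.0, hi=10.0):
    """U-shaped Beta(0.5, 0.5) rescaled onto [lo, hi]; bimodal at the edges of
    its support (an assumed stand-in for the 'truncated beta' strategy)."""
    return lo + (hi - lo) * rng.beta(a, b, n)

x_mix = mixed_unimodal(200, rng)
x_beta = beta_bimodal(200, rng)
print("mixture sample mean:", round(x_mix.mean(), 2))
```

Both generators produce bimodal histograms, yet each draw remains independent of the previous one, which is precisely the limitation that motivates the MCMC approach below.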

Thus, we used ‘random-walk Markov chain Metropolis-Hastings algorithms’ to simulate random sampling while accounting for hypothetical dependence between two different disease subtypes. To simulate the statistical comparison of these two hypothetical bimodal diseases, we i) ran Markov chain Monte Carlo (MCMC) simulations (Fig. 5B), ii) used the ‘dip test’ to determine whether the simulated data were statistically multimodal (Fig. 5C), and iii) used Student’s t-test to determine the statistical significance, mean differences, and directions for the simulated distributions, using N = 100. The MCMC simulations clearly illustrate how random sampling of two bimodal hypothetical diseases leads to inconsistent patterns of statistical results when they are compared. Notice that the data dispersion increases as N increases; see summary statistics in Fig. 5D.
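A minimal random-walk Metropolis-Hastings sampler is sketched below, targeting a hypothetical two-mode mixture density (placeholder modes at 3 and 9; the dip test is omitted, as it requires a dedicated package). Unlike the independent draws above, each new value depends on the chain's previous state:

```python
import numpy as np

rng = np.random.default_rng(7)

def bimodal_density(x):
    """Unnormalized target: mixture of two Gaussian 'disease subtypes'."""
    return np.exp(-0.5 * ((x - 3.0) / 1.0) ** 2) + np.exp(-0.5 * ((x - 9.0) / 1.0) ** 2)

def metropolis_hastings(n, step=2.5, x0=6.0, rng=rng):
    """Random-walk Metropolis-Hastings: propose a jump from the current state,
    accept it with probability min(1, density ratio); each draw depends on the
    previous one, unlike independent unimodal sampling."""
    chain = np.empty(n)
    x = x0
    for i in range(n):
        proposal = x + rng.normal(0.0, step)
        if rng.random() < bimodal_density(proposal) / bimodal_density(x):
            x = proposal                      # accept the move
        chain[i] = x
    return chain

sample = metropolis_hastings(100)             # N = 100, as in the comparisons above
print("sample mean:", round(sample.mean(), 2))
```

Short chains (small N) can linger in one mode before crossing the valley between subtypes, so repeated runs over- or under-represent either mode at random, reproducing the inconsistent t-test outcomes summarized in Fig. 5D.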

In conclusion, the MCMC illustrations emphasize that increasing N in a single study of multimodal diseases should not be assumed to provide results that can be directly extrapolated to the population; rather, they emphasize that the targeted study of data subtypes could lead to the identification of mechanisms explaining why diseases vary within biological systems (e.g., humans and mice).

## Personalized ‘data disease subtyping’ must be combined with proper ‘cage-cluster’ statistics

One important caveat to consider across animal studies is that increasing N alone is futile if clustered-data statistics are not used to control for animal cage density (> 1 mouse/cage), which our group showed contributes to ‘artificial heterogeneity’, ‘cyclical microbiome bias’, and false-positive/false-negative conclusions[2, 54]. To infer the role of scientific decisions on the need for particular statistical methods, we examined the studies reviewed by Walter et al.[7] for ‘animal density’ and ‘statistical’ content (see Methods). Of note, only one of the 38 studies (2.6%, 95%CI = 0.1–13.8%) used proper statistical methods (mixed models) to control for cage clustering[18]. Although on average studies tested 6.6 patients and 6.4 controls/group (range = 1–21), most studies were below the average (65.8%, 25/38, 95%CI = 48.6–80.4%), with 14 having < 4 donors/group (Fig. 6A). Of interest, however, the number of human donors included in a study was inversely correlated with the number of mice per donor used in the FMT experiments (Fig. 6B).
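The cost of ignoring cage clustering can be sketched with a simple simulation (hypothetical design: 4 cages of 5 mice per group, with a shared cage effect inducing intra-class correlation). A cage-mean t-test is used here as a simple stand-in for the mixed models named above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

def simulate_group(group_mean, n_cages=4, mice_per_cage=5, cage_sd=2.0, mouse_sd=1.0):
    """Mice sharing a cage also share a random cage effect, inducing ICC."""
    cages = rng.normal(group_mean, cage_sd, n_cages)
    return np.array([rng.normal(c, mouse_sd, mice_per_cage) for c in cages])

g1 = simulate_group(5.0)                                      # shape (4, 5)
g2 = simulate_group(5.0)                                      # same true mean

naive_p = stats.ttest_ind(g1.ravel(), g2.ravel()).pvalue      # ignores cage clustering
cage_p = stats.ttest_ind(g1.mean(axis=1), g2.mean(axis=1)).pvalue  # cage as the unit
print(f"per-mouse p = {naive_p:.3f}; per-cage p = {cage_p:.3f}")
```

Treating each mouse as independent inflates the effective sample size (20 vs. 4 per group here), so the naive test reaches spurious significance far more often than the clustered analysis when the two groups do not truly differ.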

Unfortunately, the majority of studies (25/38, 65.8%, 95%CI = 48.6–80.4%) did not report animal density, consistent with previous analyses[2], while 10.5% of the studies (4/38, 95%CI = 2.9–24.8%) housed their mice individually, which is advantageous because such designs are free of intra-class correlation (ICC), eliminating the need for cage-cluster statistics (Fig. 6C). Our review of the statistical methods used across the 38 studies also revealed that most scientists used GraphPad chiefly for graphics and univariate analysis of mouse phenotype data. This finding suggests an underutilization of the functions available in statistical software, for example, Monte Carlo simulations, to help understand the effect of random sampling on the reproducibility and significance of observed study results, and the likelihood of repeatability by others (Monte Carlo-adjusted 95% confidence intervals) (Fig. 6D).