Analysing microbiome intervention studies: Does choice of a statistical method affect biological interpretation?

The diet plays a major role in shaping gut microbiome composition and function in both humans and animals, and dietary intervention trials are often used to investigate and understand these effects. A plethora of statistical methods for analysing differential abundance of microbial taxa exists, and new methods are constantly being developed, but there is a lack of benchmarking studies and clear consensus on the best multivariate statistical practices. This makes it hard for a biologist to decide which method to use. We compared the outcomes of a wide range of ANOVA-like statistical methods to explore to what extent the choice of method affects the biological inferences made. The comparison is based on ve published dietary intervention studies representing different subjects and study designs. We found that the methods producing outputs at the community level were in agreement regarding both effect size and statistical signicance. However, the methods that provided ranking of operational taxonomic units (OTUs) gave incongruent results, implying that the choice of method is likely to inuence the biological interpretations.


Abstract Background
The diet plays a major role in shaping gut microbiome composition and function in both humans and animals, and dietary intervention trials are often used to investigate and understand these effects. A plethora of statistical methods for analysing differential abundance of microbial taxa exists, and new methods are constantly being developed, but there is a lack of benchmarking studies and clear consensus on the best multivariate statistical practices. This makes it hard for a biologist to decide which method to use.

Results
We compared the outcomes of a wide range of ANOVA-like statistical methods to explore to what extent the choice of method affects the biological inferences made. The comparison is based on ve published dietary intervention studies representing different subjects and study designs. We found that the methods producing outputs at the community level were in agreement regarding both effect size and statistical signi cance. However, the methods that provided ranking of operational taxonomic units (OTUs) gave incongruent results, implying that the choice of method is likely to in uence the biological interpretations.

Conclusions
We concluded that inferences at the OTU level should be made with caution, and that a combination of several statistical methods allows to interpret the outcomes with higher con dence and simultaneously account for the limitations of each method.

Background
The microbiome has emerged as an important link to health and disease [1]. Microbiome analysis methods are rapidly advancing, in particular in areas such as compositional data analysis, multi-omics, and data integration [2,3]. A clear understanding of the type of data being analysed is crucial, given the growing number of studies uncovering the key role of microbiome, its composition and functions following diet intervention or medical treatment [4]. At present, analysis of complex microbial data bene ts from adapting the multivariate statistical toolbox from ecology and environmental sciences, and a proper choice of statistical tools is becoming increasingly important [5][6][7]. However, a lack of benchmarking studies and clear consensus on the best multivariate statistical practices make comparisons across microbiome data sets di cult [3,8].
Different statistical methods have different properties, and the choice of the method should depend on the scienti c question, experimental design, data characteristics, and expected relationships among the variables. In reality, the choice is often biased by the research groups' tradition and familiarity with speci c "toolboxes". Main differences between existing statistical approaches for analysing microbiome data are related to: (1) explorative versus con rmative; (2) univariate versus multivariate; (3) parametric versus nonparametric; (4) linear versus nonlinear; (5) compositional versus non-compositional; (6) distance-based versus count/abundance-based; and (7) incorporating phylogenetic information into the analysis or not [9]. Here, we focus on statistical methods for analysing data from dietary intervention studies. In contrast to observational studies, these are usually small in sample size but performed in (semi)-controlled environments and tailored to a speci c research hypothesis. The studies often include multiple experimental factors, possibly with more than two levels, and it is therefore natural to turn to ANOVA-like methods. From a biologist's point of view, it is important that the methods are easy to interpret, both on the multivariate (microbial community) and univariate (microbial taxa or operational taxonomic units, OTUs) levels.

Distance-based methods
The distance-based methods are multivariate by de nition, since multiple variables (microbial OTUs) are used to calculate pairwise distances between samples. Among distance-based methods, permutational multivariate analysis of variance (PERMANOVA) is the most widely used and more powerful than analysis of similarities (ANOSIM) to detect changes in community structure [10][11][12]. Both methods may be implemented with any dissimilarity metric. Among abundance-based beta diversity indices, Bray-Curtis is the most common choice due to its theoretical properties and empirical accuracy [13,14]. The most widely applied phylogenetic beta diversity indices is UniFrac-type metrics [14][15][16]. However, UniFrac is unsuitable as a distance metric for studies with small sample size, which is usually the case for dietary intervention trials [17,18]. Both PERMANOVA and ANOSIM test differences at the community level but do not provide any information at the OTU level. Similarity percentage analysis (SIMPER) is a method that can be applied for post hoc testing to evaluate the relative contribution of each analysed microbial taxon (i.e. OTU) to the overall average Bray-Curtis dissimilarities by pairwise comparison of two or more groups [12]. To the best of our knowledge, no such method exists for the other distance metrics.
Distance-based methods have their strengths and weaknesses that are important to account for beforehand. ANOSIM cannot deal with multifactorial designs, and both ANOSIM and PERMANOVA have problems detecting differences unless they are present in taxa with high variability [19]. Newer methods aimed at assigning more interpretable effect sizes are under constant development [6,20]. For example, the more exible PERMANOVA-S is an extension to existing distance-based methods that can adjust for covariates and simultaneously incorporate multiple distance metrics [21]. However, these methods do not consider covariance/correlation between microbial species, and they encounter signi cant power loss if all microbial species are used for distance calculations [22].

Abundance-based methods
The abundance-based methods can be either univariate (analysing each OTU individually) or multivariate (focusing on covariance structure between OTUs). To account for compositionality in microbiome data, compositional data analysis (CoDA), based on transformation into a log-ratio coordinate space, is often used [23,24]. In this study, the centred log-ratio (clr) transformation is applied [24,25]. After transformation, any suitable univariate or multivariate method can be used. Univariate analysis of variance (ANOVA) is a well-established family of methods that can incorporate complex data structures such as random effects, longitudinal or repeated measurements, nested designs etc. Non-parametric methods (such as Wilcoxon signed-rank test) are popular for assessing differences between groups when sample size is small and normality assumptions are not met. Popular methods tailored for highthroughput sequencing microbiome data are ANOVA-like differential expression analysis (ALDEx2) and the analysis of composition of microbiomes (ANCOM) [23,26]. Both approaches use the log-ratio transformation of microbiome data prior to univariate assessment of statistical signi cance and differ in how zero values are handled [4,27]. ALDEx2 uses a Dirichlet-multinomial probability distribution to estimate abundances from counts and calculates the false discovery rate (FDR) based on Monte Carlo simulations. In ANCOM, on the other hand, the compositional nature of the data is considered by testing the log-ratio for all pairs of OTUs, and then counting the number of tests where the log-ratio is signi cantly different from zero. This number (W-stat) can be used to obtain a ranking of OTUs most likely to differ between the groups.
The drawback of univariate methods is that they treat all species as independent variables, not taking the covariance between OTUs into account. Such methods may fail to detect community-level differences [28]. The classical generalisation of ANOVA to multiple variables (MANOVA) cannot be used when the number of variables exceeds the number of samples, as it suffers from the problem of singularity of covariance matrices and assumptions that are not ful lled [29,30]. Novel statistical ANOVA-like methods include fty-fty multivariate analysis of variance (FFMANOVA) [31] and ANOVA-simultaneous component analysis (ASCA) [30]. Both methods are based on principal component analysis (PCA), they can handle multiple collinear responses and can be used on any data types. ASCA has gained popularity in metabolomics research [32][33][34]. The output of ASCA are scores and loadings related to the experimental factors, which can be visualised in the same way as for PCA to better understand covariance patterns within the data. In addition, the hypothesis testing can be performed at the multivariate level through permutation tests. Both ASCA and FFMANOVA methods have successfully been applied on microbiome data [35][36][37][38]. Linear discriminant analysis effect size (LEfSe) is a stepwise approach that combines univariate analysis with multivariate discriminant analysis [39], but it is not well adapted to experimental designs with several multilevel factors. The method has found a wide use in microbiome research due to its easy-to-use and interpret visualisation [40,41].
An overview of the different methods compared in this study is given in Fig. 1 and Table 1. As can be seen from Fig. 1, PERMANOVA and ANOSIM provide results only at the community level, while SIMPER, ANCOM and ALDEx2 report results for single OTUs only. ASCA and FFMANOVA, on the other hand, give results both at the community and OTU levels. For FFMANOVA, this is done by rotation test procedure which adjusts the p-values for multiple testing [42]. For ASCA, OTU ranking is based on ASCA loadings for each experimental factor or by partial least squares discriminant analysis (PLS-DA) for pairwise comparisons (for details see Methods section).

Method comparison
The aim of the method comparison was to investigate if the choice of a statistical method affects biological inference. At the community level, methods were compared with respect to effect sizes (expressed as percentage of explained variance) and corresponding p-values. At the OTU level, the ranking of OTUs was compared by Spearman's rank correlation and by investigation of scatterplots between the different ranking metrics.
We focused on dietary intervention trials since the link between health balance and disease is of major interest to the microbiome research community, precision medicine and food science. We used ve recently published data sets as basis for the comparisons (Table S1). The following criteria for studies to be included were considered: (1) at least two-factorial experimental design with a minimum of two factor levels; (2) either human or animal gut microbiome surveys; and (3) a taxonomic assignment at the OTU level reported. Diet is the main factor of interest in all ve studies, and we restricted our comparisons to this factor.

Community level
Four of the methods can be used to test the association between diet and the overall microbiome composition, namely distance-based ANOSIM and PERMANOVA and compositional ASCA and FFMANOVA. The results across studies are summarised in Table 2. Data set 3. In a longitudinal study by Le Sciellour et al. [44] the authors tested the effect of dietary bre content on faecal microbiota in growing-nishing pigs fed alternately a low-bre and a high-bre diet during four successive 3-week periods. In this survey, the effect of diet was small (2%) compared to the effect of diet in data sets 1 and 2. Similarly, FFMANOVA and ASCA have produced slightly lower p-values than PERMANOVA, but all three methods were in agreement on the effect of diet.
Data set 4. In a longitudinal study by Wang et al. [45] the objectives were to determine the impact of beta glucan on the composition of faecal microbiota in mildly hypercholesterolemic individuals. The individuals received for 5-week either a treatment containing 3 g high molecular weight (HMW), 3 g low molecular weight (LMW), 5 g LMW barley beta glucan or wheat and rice (control group) [45]. The effect of diet accounted for ~ 2% of the explained variation reported by FFMANOVA, ASCA and PERMANOVA. Diet was signi cant on a 5% level for PERMANOVA and ASCA (p = 0.036 and p = 0.038, respectively) and on a 10% level for FFMANOVA (p = 0.067). However, different conclusions were drawn with respect to time.
Signi cant results at the community level were obtained only for FFMANOVA (p = 0.007).
Data set 5. Birkeland et al. [46] assessed the effect of prebiotic bres or a control supplement on faecal microbiota composition in human subjects with type two diabetes. The interaction effect of treatment and day accounted for 1-2% of the explained variation according to the FFMANOVA, ASCA and PERMANOVA. Signi cant results at the community level were obtained only for FFMANOVA (p < 0.001).
To summarise, PERMANOVA, FFMANOVA and ASCA gave almost identical results regarding effect sizes and statistical signi cance across studies ( Table 2). ANOSIM provided the most different results, likely due to less resolution in the analysis since the multifactorial nature of the studies cannot be taken into consideration by this approach. Secondly, there is a considerable difference in effect sizes of the main factor of interest (diet) between animal (2-37%) and human (1-2%) dietary intervention trials, with effect sizes being very small in human studies. Lastly, three of the studies had crossover designs, allowing for estimation of interindividual variation. This variation was considerably higher for trials involving human subjects (54-74%) compared to the animal study (26%).

OTU level
Six of the methods, namely SIMPER, ASCA, FFMANOVA, LEfSe, ANCOM and ALDEx2, can be used to make biological inference for individual OTUs. The methods give different outputs which can be used to identify differentially abundant OTUs and/or rank the OTUs according to effect sizes (see Methods for details). The summary tables and scatterplots are given as Appendix 1 and Figure S1, respectively. Agreement between the methods was investigated by calculating Spearman's rank correlation between all pairs of output metrics (Fig. 2). The congruence between the methods was higher in animal studies (Moen, Lai and Le Sciellour) than in human studies (Wang and Birkeland), as expected. In general, animal studies are more controlled, and interindividual variation is smaller compared to studies involving human subjects. Interestingly, ASCA and FFMANOVA are the two methods that had highest agreement across all ve data sets, while results obtained by pairwise comparisons of ALDEx2 versus the other methods were generally less correlated (Fig. 2). LEfSe was evaluated only for the data sets Moen and Lai, since very few OTUs were agged as signi cantly different between the conditions for the remaining three data sets. A high congruence between ASCA and ANCOM results was observed only for animal studies (data sets Moen, Lai and Le Sciellour) compared to human studies (Wang and Birkeland). For easier visualisation, SIMPER was excluded as it showed low agreement with the other tested methods.
Gut microbiome data are typically represented by a few dominant OTUs (relative abundance > 1%) and a majority of low-abundant OTUs (< 0.05%). One is therefore interested in ranking OTUs independent of their average abundance, to be able to detect biologically relevant changes in low-abundant OTUs. Our results show that ranking of OTUs is highly dependent on abundance for SIMPER and ANCOM (Fig. 3 and Figure S2). Notably, ranking based on SIMPER was completely determined by abundance, since dominant OTUs had larger in uence on the Bray-Curtis distance, as had previously been reported [19]. SIMPER is therefore not suited for identifying changes in low-abundant OTUs. For ANCOM, the results indicated that highly abundant OTUs had either very high or very low W-stat, while low-abundant OTUs always had medium-to-low W-stat. To the best of our knowledge, this nding has not been reported before and shows that ANCOM, like SIMPER, is not able to identify changes in low-abundant OTUs. Tentatively, this trend could be due to different distributions of log-ratios for highly and low-abundant OTUs, but it is beyond the scope of the present study to investigate this further.

Discussion
Diet is often considered to be a driver of microbiome variation [47]. In observational population-based studies, diet consistently accounts for only a small proportion of microbiome variation, and this is partly due to large interindividual differences, small sample sizes and limitations in study designs such as potentially insu cient washout periods in crossover studies [47][48][49][50][51]. In general, higher interindividual variation is observed in gut microbiota of human subjects compared to animal species [52]. Nevertheless, use of animal models to study the causal role of gut microbiota in health and disease is an established practice although animal models lack the speci c interactions present in the complex system of human organism [53].
Univariate and multivariate analyses provide information at different levels. Biologists will often nd that the outputs of univariate analyses are easier to interpret compared to those generated by multivariate analyses, though assumptions are similar for both method types [9]. In general, multivariate methods provide a more holistic overview of differences between samples and account for correlations and interactions between the variables, whereas univariate methods are well suited to point out the differences for speci c microbial groups. The two levels therefore provide complementary information, and we think that it is generally of biological interest to report differences at both levels.
Multivariate ANOVA methods, namely FFMANOVA and ASCA, consider the underlying covariance between multiple variables in comparison to traditional ANOVA. In complex microbial data sets, where the variables (i.e. OTUs) are not independent, FFMANOVA and ASCA can serve as suitable tools to make biological inference at both the community and OTU levels. Our results show that these methods performed similarly on the example data (similar community-level effect sizes and similar ranking of OTUs; see Table 2 and Fig. 2). Distance-based PERMANOVA provided similar results at the community level, but the results should be complemented by other methods to gain insight into OTUs that are affected.
ASCA, FFMANOVA and LEfSe (through the LDA step) are based on the covariance structure between OTUs, and therefore depend on the relative scaling of the OTUs. It is usual practice in many areas to scale all variables to equal variance, giving them equal weight in the model. Another option, which retains more of the original variability, is pareto scaling. An overview of different scaling options used in metabolomics, also relevant for microbiota, can be found in [54]. To prevent that the results are dominated by a few highly-abundant OTUs, data should be scaled when analysing abundances. Clrtransformation puts the variables on comparable levels, and the need for scaling is less obvious. However, the highly-abundant OTUs might still have slightly higher variance, and scaling should be considered depending on the data characteristics.
The tools tested in the present study vary in how exible they are regarding possible comparisons, namely two-group comparisons (SIMPER, ASCA and LEfSe), possibility to adjust for confounder (ANCOM, FFMANOVA and ASCA) or to specify complex models with interactions (ALDEx2, FFMANOVA and ASCA). These aspects should be considered when selecting methods, as different study designs might require different types of statistical models and tests.
Although the Spearman's rank correlation indicated good agreement for the animal studies, little overlap between lists of "signi cant" results could be detected. There can be several reasons for this. One important aspect is that different criteria must be used to de ne "signi cance" or generally "importance". FFMANOVA and ALDEx2 provide p-values, whereas more heuristic tools must be applied with ASCA, SIMPER and ANCOM. Also note that for some methods (FFMANOVA, ALDEx2 and ANCOM), the ranking is related to all levels of the experimental factors, whereas the other methods use pairwise comparisons only. Moreover, differences in sample collection, sample preparation and sequencing contribute to additional variability, which, in turn, affects the validity of the results [55]. Hence, comparisons across different studies with similar interventions would be even more di cult.
Newly introduced statistical tools can produce contrasting results compared to the existing methods, but such comparisons tend to be biased in favour of the new approach [56]. Past benchmarking studies [57][58][59] have widely reported varying results from different tools, which was also con rmed in our study.
Currently, there is no consensus for the best existing tool for detecting differentially abundant microbial taxa, and there is no reason to believe that one single method is best in all cases. It is therefore good scienti c practice to compare and report outputs from several methods. The strategy for making inference from multiple outputs depends on whether it is more important to reduce the number of false negatives or false positives. If it is important not to miss out any possible ndings (typically in early stages of research), any OTU listed as signi cant/important by any method should be investigated further. To obtain robust results, on the other hand, OTUs should be reported as differentially abundant only if they were selected as "signi cant" by several methods [56].
Finally, microbial data at the OTU level are zero-in ated, and rare OTUs should be removed prior to downstream statistical analyses. We have observed that the threshold for ltering out OTUs can signi cantly affect the results, both at the community and OTU levels, which is a topic for further research.

Conclusions
In the present study, we compared the performance of several multivariate ANOVA-like statistical methods taking several dietary intervention microbiome data sets as examples. At the community level, the different methods came to similar conclusions. However, at the OTU level the agreement between methods varied considerably. Some of the methods provided output metrics that were dependent on average abundance, making it impossible to detect differences in low-abundant OTUs. This suggests that choice of statistical method might affect biological inference at the OTU level. Ranking of OTUs obtained with different methods correlated better for animal studies than for human studies, possibly due to lower interindividual variation in animal studies. Based on our results, we suggest that microbiome surveys should report effect sizes at the community level before making any inferences at lower taxonomic levels.
We also advise to combine several multivariate statistical methods as this allows to interpret the results with higher con dence and simultaneously account for the limitations of each method, when applied separately.

Methods
Experimental design and data characteristics. The dietary intervention data sets were selected based on the study design, with a minimum of two independent variables (for example, diet and dose; see Table S1 for details). Prior to the statistical analyses, the data were ltered to keep the OTUs that were present: (1) with relative abundance more than 0.005% in an individual, and (2) in at least 50% of individuals and in one of the groups. For each study in total 507, 561, 560, 397 and 216 OTUs passed this lter and were subsequently used in downstream analyses (Table S1 and Appendix 2). All statistical analyses were performed in R version 3.6.2 [60] unless otherwise speci ed and were run with 999 permutations.
Zero-value replacements need to be done prior to clr-transformation [24,25]. This was done by applying function cMultRepl with a setting method = 'CZM' in the R package zCompositions [61].
ANOSIM and PERMANOVA were run using the functions anosim and adonis in the package vegan [62], with a setting method = Euclidean and the clr-transformed data as input.
SIMPER was run on ltered relative abundance data using function simper from the R package vegan. The average contribution of each OTU to the Bray-Curtis dissimilarity between selected pairwise comparisons of diet levels (speci ed below) was used as ranking metric.
FFMANOVA was performed by using the function ffmanova implemented in the ffmanova R package [31] on clr-transformed and centred data. FDR adjusted p-values (setting pAdjFDR) were used as ranking metric.
ASCA was run on clr-transformed and centred data using an in-house implementation in MATLAB (R2018b, The MathWorks Inc.); p-values at the community level were calculated by permutation tests using 999 permutations. For study designs with two diet levels, the ASCA loadings were used to rank OTUs. For designs with more than two diet levels, pairwise comparisons were performed by PLS-DA [63] on the ASCA diet effect matrix plus residuals, and the regression coe cients were used as ranking metric.
LEfSe was run on clr-transformed data in the Galaxy web application and work ow framework [39]. The settings used were as follows: per-sample normalisation = yes, alpha = 0.05 for both Kruskal-Wallis and Wilcoxon signi cance tests, LEfSe threshold = 2 (on a log 10 scale), one-against-all strategy for multiclass analysis, class variable = Diet, subclass variable = Dose/ Time/ Exercise/ Period, and subject variable = Individual.
ANCOM was run by using the ANCOM 2.0 source code implemented in R [26], using ltered relative abundance data as input. The W-stat was used to rank OTUs, indicating the number of signi cantly different pairwise log-ratios while adjusting for false discoveries by applying a Benjamini-Hochberg correction at 0.05 level of signi cance.
ALDEx2 was run using the functions aldex.clr and aldex.glm from package ALDEx2 v.1.18.0. Raw counts were used as input, and we used p-values for selected factor level contrasts (speci ed below) to rank OTUs.
Factor level comparisons. For study designs with more than two diet groups, only one of the pairwise comparisons was made for the methods based on the pairwise group comparison, namely ALDEx2, ASCA, LEfSe and SIMPER. The following pairs with the most contrasting outcomes were compared: (1) BSG group vs. IN group [38]; (2) control group vs. 3 g HMW [45] and (3)

Declarations
Ethics approval and consent to participate Not applicable. The study is based on already published data sets listed in Table S1.

Consent for publication
Not applicable.

Availability of data and materials
All data analysed during this study are included as additional les. Custom R and MATLAB scripts are available from the corresponding authors upon request.

Competing interests
The authors declare that they have no competing interests.

Tables
Due to technical limitations, the tables are only available as a download in the supplemental les section. Table 1. An overview of statistical methods and their properties. Table 2. Community-level method comparison across ve microbial data sets. Distance-based ANOSIM and PERMANOVA and compositional ASCA and FFMANOVA were compared with respect to effect sizes (expressed as percentage of explained variance) and corresponding p-values. Figure 1 A diagram of statistical methods used in this study.

Figure 2
Spearman's correlation (x axis) calculated for pairwise comparison of statistical methods (y axis) for ve microbial data sets. Each point represents a Spearman's rank correlation coe cient between OTU ranking metrics from Method 1 and Method 2. Point color re ects data set: Moen (red), Lai (blue), Le Sciellour (green), Wang (yellow) and Birkeland (black). LEfSe was evaluated only for data sets Moen and Lai.