Unraveling Key Features of Microbial Alpha-Diversity Metrics and Their Practical Applications

doi:10.21203/rs.3.rs-2595260/v1

This work is licensed under a CC BY 4.0 License

Version 1

Studies of microbial communities vary widely in their analysis methods. In this exponentially growing field, the wide variety of diversity measures and the lack of consistency make it hard to compare studies. Most existing alpha-diversity metrics are inherited from other disciplines, and their assumptions are not always directly meaningful or true for microbiome data. Many existing microbiome studies apply one or several alpha-diversity metrics without justifying the choice and without a clear interpretation of the results. This work presents a theoretical, empirical, and comparative analysis of 19 frequently and less frequently used microbial alpha-diversity metrics grouped into four proposed categories, including the key features and mathematical assumptions of every analyzed metric, in order to provide a deeper understanding of the existing metrics and a practical implementation guide for future studies.

Biological sciences/Computational biology and bioinformatics

Biological sciences/Computational biology and bioinformatics/Classification and taxonomy

Biological sciences/Genetics/Microbial genetics

Biological sciences/Biological techniques/Bioinformatics

Biological sciences/Biological techniques/Software

The analysis of the composition of microbial communities has led to better insights into the roles that microbes play in human health^1–3, animal health, biodiversity, and agrotech niches^4,5. However, technical and procedural issues hinder progress in this exponentially growing area, in part because the technical features of the metrics currently in use have not been clearly characterized. Because our microbes have tremendous potential to impact our physiology, both in health and disease, a precise and reproducible characterization of the microbiome composition of each sample is fundamental^6,7. Given that microbiota composition is related to host health^8–10, the study of its intrinsic diversity, including the abundance of the microorganisms present in a sample and their distribution, has become a prerequisite in every published study.

Two common measures reported in many microbiome studies are alpha- and beta-diversity. Alpha-diversity is a general term for metrics that describe the species abundance, evenness or diversity within a sample. In contrast, beta-diversity measures compare the similarity of two or more communities¹¹. Although the term alpha-diversity is widely used, it is an ambiguous concept since it encompasses several complementary aspects: the number of microorganisms, the distribution of their abundances, and their phylogenetic relationships. These aspects should be considered separately or in combination; they are not synonymous, although they are often used as if they were.

Most existing alpha-diversity metrics are inherited from other disciplines, such as plant or insect biodiversity studies^12,13, and their assumptions are not always directly meaningful or true for microbiome data. In addition, the measurement units of some metrics have elusive biological meanings, which makes them difficult to interpret, translate, or compare. As a result, many existing microbiome studies apply one or several alpha-diversity metrics without justifying the choice and without a clear interpretation of the results.

Because microbiome research is a relatively new discipline and researchers and clinicians lack consensus and standardized processes, the methods and tools for obtaining, analyzing, visualizing, interpreting, and comparing human microbiome data pose significant challenges that must be solved robustly before moving into clinical practice^14–16. This work focuses on a theoretical and comparative analysis of several alpha-diversity metrics used in microbiome analysis, including the key features and mathematical assumptions of every analyzed metric, in order to provide a deeper understanding of the existing metrics and a practical implementation guide for future studies.

Alpha metrics used in 68 microbiome studies or in the QIIME 2 suite were grouped into four categories after a detailed theoretical analysis of each metric’s mathematical formula (see Methods section for a detailed explanation). Within each category, correlation matrices and Principal Component Analysis (PCA) were computed in order to identify correlations among metrics. The detailed correlations and PCA of each category are included in the online Supplementary Information (Suppl. Figures S1a to S1f, Table S1 and Figure S2). The proposed categories with their associated metrics are:

Abundance (also known as richness): Chao1, ACE, Fisher, Margalef, Menhinick, Observed, and Robbins.
Dominance (also known as evenness): Berger-Parker, Dominance, Simpson, ENSPIE, Gini, McIntosh, and Strong.
Phylogenetics: Faith.
Information: Shannon, Brillouin, Heip and Pielou.

The 19 selected alpha-diversity metrics were calculated for the sequence datasets obtained for a total of 4,596 stool samples included in 13 publicly available human microbiome projects (Table S3). All sequence data were reanalyzed for this paper using the same data processing pipeline.

Two key factors that strongly influence the estimation of the different alpha-diversity metrics were identified from the above-mentioned theoretical analysis: Amplicon Sequence Variants (ASVs) with only one read (also known as singletons) and the number of observed ASVs. Figure 1A shows that the values of the metrics in the Abundance category depend on these two factors. The color scale of each graph depicts the normalized value (proportion). Almost all Abundance metrics increase with the number of observed ASVs, except Robbins, which depends on the number of identified singletons. Figure 1B shows the values of the Dominance metrics against the same two factors. In this category the analysis is more complex and requires detailed consideration. Berger-Parker and ENSPIE values tend to decrease when the number of ASVs increases. Because of the way Simpson is calculated, the same trend is present but with the opposite sign (dominance values close to the origin tend to decrease). Dominance values tend to decrease when singletons increase and the number of ASVs is below 200 (for the Strong and Simpson metrics the opposite sign applies because of their formulas). Figure 1C shows the same analysis for the metrics of the Information category. As all Information metrics are built on the Shannon formula, they are expected to behave similarly. In the Shannon, Pielou and Brillouin plots, the trend identified in the Dominance category is also observed for samples with fewer than 60 singletons and fewer than 200 ASVs: in this cluster of samples, values tend to decrease. Figure 1D shows Faith values against the two factors; the value depends on each factor independently. Samples with low Faith values that do not follow these remarks (in the center of the plot) all belong to the 248_citizen project. The observations drawn from Fig. 1 as a whole are not biased by the nature of each dataset, as can be seen in Fig. 1E, which depicts the samples colored according to their original study/bioproject and the amplified 16S region.

Correlation matrices were created between all metrics of the same category and are available in Supplementary Figure S1. In the Abundance category (S1 a-b), Chao1 and ACE show the strongest linear correlation with each other, while Margalef and Robbins show some variation but are still highly correlated with Chao1 and ACE. Menhinick presents a slope variation that is dataset dependent. Robbins, as shown before, is calculated from the total number of singletons instead of the number of ASVs, and no strong correlation is found. All Abundance metrics except Robbins are highly correlated with each other and with the number of ASVs, implying that the differences in their formulas have no significant impact when applied to microbiome data.
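The pairwise comparison described above can be sketched in a few lines of Python. This is only an illustration of how a within-category correlation matrix is built: the metric values in `abundance_table` are hypothetical toy numbers, not data from the 4,596 samples, and the helper names `pearson` and `correlation_matrix` are assumptions introduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(table):
    """Pairwise Pearson correlations between the columns of a metric table."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}

# Hypothetical per-sample metric values for four samples (illustration only).
abundance_table = {
    "observed": [6, 9, 4, 12],
    "chao1": [6.5, 10.0, 4.0, 22.5],
    "menhinick": [0.84, 0.94, 0.82, 0.95],
}

matrix = correlation_matrix(abundance_table)
for row, cols in matrix.items():
    print(row, {name: round(value, 3) for name, value in cols.items()})
```

With real data, each list would hold one value per sample for a given metric, and the same matrix could be computed per dataset to check whether a relationship (such as the Menhinick slope) is dataset dependent.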

A similar behavior can be observed among the metrics of the Information category (S1 e-f), where all metrics are strongly correlated with each other. This is expected, as all Information metrics use Shannon’s entropy as a reference in their mathematical formulation; the implications are discussed further in the Discussion section.

Strong correlations with a higher degree of complexity can also be observed among the Dominance metrics (S1 c-d). Dominance and ENSPIE are variations of the Simpson metric and, as expected, are strongly correlated with each other. McIntosh and Berger-Parker are also correlated with these metrics. It is of particular interest that Berger-Parker, unlike the other metrics, has a clear biological interpretation that follows from its mathematical formula (the proportion of the most abundant taxon).

Figure 2 shows, for each metric in the Dominance category, a linear regression over all samples in which the X axis is the Berger-Parker value and the Y axis is the corresponding value of the metric. The McIntosh and Simpson models fit well. For ENSPIE, an exponential transformation fits better than the linear model. Gini does not fit this model well, nor does it fit well when a correlation model is applied to all Dominance metrics (Additional Information section, Figure S2a). Given that the Gini metric originates in economics and sociology, given the empirical results above, and considering that the variability of Gini values is low (mean = 0.99 and sd = 0.0083), we recommend not applying this metric in microbiome projects. The metric was nevertheless kept in the category because it is used in some of the current literature.
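A minimal sketch of this kind of fit, assuming ordinary least squares and the coefficient of determination as the goodness-of-fit measure, is given below. The paired values are hypothetical toy numbers rather than results from the study, and `linear_fit` is a helper name introduced here for illustration.

```python
def linear_fit(x, y):
    """Ordinary least squares fit y = a + b*x. Returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Hypothetical per-sample values (illustration only, not study data).
berger_parker = [0.40, 0.50, 0.70, 0.90]      # X axis
other_metric = [0.340, 0.375, 0.540, 0.815]   # Y axis, e.g. a Simpson-type value

intercept, slope, r2 = linear_fit(berger_parker, other_metric)
print(f"y = {intercept:.3f} + {slope:.3f} * x, R^2 = {r2:.3f}")
```

To test a transformed relationship, such as the one reported for ENSPIE, the same function can be applied after transforming one of the variables (for example, fitting the logarithm of the metric against Berger-Parker) and comparing the resulting R² values.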

Faith is the phylogenetic metric most often applied in experimental studies (see Supplementary Table S2). Observed features and singletons are confounding factors, as shown in Fig. 1D. Based on this, a local polynomial regression was fitted, for each dataset, to the relationship between Faith and observed features (the metric proposed in the Discussion section as representative of the Abundance category). Figure 3A shows that the fitted models are close to linear and that the coefficients of determination (R²) are statistically significant, except for three models that fit poorly because of the nature of their samples and, more specifically, because their number of observed features is low (for example 248_citizen, green dots). The same procedure was applied to singletons and Faith values and is shown in Fig. 3B. We therefore conclude that Faith values are strongly linked to observed features and singletons. No correlation was found between Faith and the Dominance metrics.

From these observations we developed a set of final practical recommendations to be taken into consideration for future studies:

Provide at least one metric from each of the following complementary alpha diversity categories: Abundance, Dominance, Information and Phylogenetics. The recommended eligible metrics are mentioned in the following bullets.
For the Abundance category, one of the following metrics could be reported: Observed, ACE, Chao1, Fisher, Margalef and Menhinick. Any of these metrics could be used because they are all correlated; however, we recommend reporting the number of observed features because it is simple to calculate and has an intuitive biological interpretation. Alongside the chosen metric, we also propose including the Robbins value (the probability of not observing taxa that are present in the sample).
For the Dominance category, one of the following metrics should be reported: Berger-Parker, ENSPIE, Strong and Simpson. Among these options we propose using the Berger-Parker value (the proportion of the most abundant taxon). When high dominance is reported, i.e. greater than 0.95, we recommend identifying the most preponderant microbe and also reporting its functional annotations.
For the Phylogenetic category, report the Faith value.
Report the bioinformatics tool used for denoising and report the singleton values, so that abundance and dominance metrics can be better evaluated.
Report the Shannon value, not only as a measure of entropy but also as a parameter that summarizes/weighs both the abundance and dominance categories in one value (see Discussion section).
Include a biological interpretation and conclusions for each reported metric considering the microbiome's aspect it measures, as well as an interpretation of all metrics as a whole.

Processing raw microbiome sequencing data involves several bioinformatic challenges that must be addressed to obtain unbiased results (merging reads and removing primers and barcodes, denoising, dereplication, chimera removal, and taxonomic classification, among others). In addition, the study of microbiota diversity covers broad and distinct concepts that should be considered separately, e.g. dominance and richness within alpha-diversity, and distance calculation methods and dimension reduction techniques within beta-diversity. The metrics used to characterize these concepts, and the tools used to estimate them, were inherited from other disciplines, such as the study of animal and plant communities, economics, sociology, or mining^17–19, and have been applied practically without major distinctions^20–22. Although these disciplines also rely on information obtained through sampling, microbiota data have particular characteristics that require special attention.

Based on this, we focused here on the fundamentals, assumptions, and formulas of the selected metrics. Some metrics were excluded because they were not used by the scientific community in recent microbiota studies (Esty, Jackknife, Good’s coverage of counts); because their mathematical formula is complex or derived from a previous one, with an underlying concept close to that of a metric with a simpler formula (Renyi entropy, Tails, Kempton Taylor's Q, Lladser’s point estimate, Rao, InvSimpson); or because their fundamentals and assumptions are not directly applicable to microbiota data (Hill, Esty). The selected metrics were those that mainly emphasize one aspect of alpha-diversity and whose assumptions and formula are consistent. Under these considerations, we defined four aspects (categories) that are complementary to each other and that together give a better picture of alpha-diversity. The following bullets briefly explain each proposed category along with the metrics proposed for it:

Abundance: The number of different microorganisms in a sample. In real projects, it is more interesting to estimate the number of microorganisms present in the population than the number observed (detected by the sequencing technique). The number of microorganisms present in a sample but not observed is usually estimated from singleton reads, i.e. sequences detected only once within the sample. From a bioinformatic point of view, such a read can be treated either as a candidate sequencing error or, conversely, as relevant information for inferring low-abundance microorganisms that are present in the sample but were not sequenced. Under the latter assumption, some alpha-diversity metrics (Chao1, for example) attempt to estimate the real richness from the low-abundance information. The bioinformatic treatment of singletons is therefore also a challenge, because the tools that perform the denoising step follow different algorithmic approaches. DADA2²³ and DEBLUR²⁴ differ not only in their denoising algorithms but also, significantly, in their treatment of singletons: DADA2 removes singletons and DEBLUR keeps them. If the denoising tool is not reported, this difference is a source of variability in the estimated microbial abundance of a sample. Reporting the tool also matters because the number of identified taxa is tool dependent, and alpha-diversity metrics can therefore be biased²⁵. When DADA2 is used, Chao1 and Robbins should not be used because their calculation requires singletons.
Dominance: The degree of balance (or evenness) in the distribution of the relative abundances of the microorganisms in a sample. The goal of this group of metrics is to indicate whether some microorganisms dominate others. After examining the formula of each metric grouped in the Dominance category and the results of applying them to the 4,596 samples, we found that Berger-Parker (the proportion of the most dominant microorganism relative to all microorganisms in the sample) best represents this category. As it is correlated with the ratio of the most dominant to the second most dominant microorganism (see online Supplementary Information, Figure S3), the two most dominant microorganisms can be considered confounding factors of the dominance value, whichever metric is selected to report on this category. One benefit of using Berger-Parker as the reference metric of the Dominance category is that its biological meaning is clear and easy to report, namely the proportion of the most dominant microbe over the total number of microbes present in the sample.
Phylogenetics: A measure of biodiversity that reports the phylogenetic distance between the microbes within a sample. This category, unlike the others, is not inherited from other disciplines. Faith is the only phylogenetic metric broadly adopted in experimental studies, and its value is strongly linked to the number of observed features and singletons.
Information: A measure inherited from physics that reports the entropy of a system. Shannon is the most used in experimental studies and, in fact, the other metrics included in this category (Brillouin, Heip and Pielou) use it as a reference. In microbiome studies the Shannon index is meant to reflect how many different microbes there are in a sample and how evenly they are distributed: it is a number that aims to integrate abundance and dominance information. Basically, the more microbes there are in a sample, the higher the Shannon index; and the more even the relative abundances, the higher the Shannon index too. Since the three metrics proposed as representative of the Information (Shannon), Abundance (Observed_features) and Dominance (Berger-Parker) categories are correlated (see online Supplementary Information, Figure S4), we state that, if abundance and dominance metrics are reported, Shannon does not add new information to the alpha-diversity big picture. Furthermore, the Shannon value is elusive to interpret from a biological point of view. On the other hand, if the Shannon value is the only value reported for a given study, it can be taken as a weighted, synthesized representation of the abundance and dominance categories in a single value, although the remarks about its non-straightforward biological interpretation still apply.

Selection of Alpha-diversity Metrics

The alpha-diversity metrics included in this study were either selected from a detailed examination of 68 microbiome studies conducted on human or animal samples (see online Supplementary Information, Table S2) or are implemented in the bioinformatics suite QIIME 2.

The 68 studies were selected to identify the alpha-diversity metrics most often applied in research articles. Scientific projects in humans, published from 2008 onwards, that sequenced the 16S ribosomal biomarker were included. The most recent studies (10 from 2019, 12 from 2020 and 3 from 2021) were prioritized, although three older ones (2008, 2010 and 2012) were also included because of the quality of their experimental design and their sample size. In addition, some studies involving cats or mice were included. These studies report at least one of the following alpha-diversity metrics: Chao1, Shannon, abundance, Berger-Parker, ACE, Simpson, Faith and Jost. All of these metrics are included in this work except for the Jost metric, which was designed for ecological communities. Its main assumption is based on the formula \(\alpha +\beta =\gamma\) (where \(\alpha\) is alpha diversity, \(\beta\) is beta diversity and \(\gamma\) is gamma diversity), and since \(\gamma\), the total species diversity in a landscape, is not used in microbiome diversity, we considered that the Jost metric should not be included among the selected metrics.

In addition, alpha metrics implemented in the QIIME2²⁶ suite, some of which were not used in the 68 studies mentioned above, were also included because it is one of the most widely used packages for the bioinformatic analysis of microbiota studies²⁷.

Theoretical analysis

The theoretical analysis of each metric considered which aspect of biodiversity it measures, which scientific discipline it was derived from, how often it is applied in the microbiota field, the variables that appreciably affect it, and whether it involves an estimation or reflects observed data, i.e. whether the metric introduces a new concept (such as information theory, entropy, or evenness) or is just an alias for a quantitative value intrinsic to the dataset (number of ASVs, singletons)²⁸. Synonymous metrics, metrics with very similar mathematical formulas, and significantly correlated metrics were grouped together, and one representative was chosen for its ease of calculation or the biological interpretability of its result. The 19 selected metrics were grouped into four broad categories, i.e. abundance-, dominance-, phylogeny-, or information-based. Below we provide a summary of the theoretical analysis of each selected alpha metric, grouped into these four categories (see online Supplementary Information, Table S4, for the corresponding mathematical formulas).

Abundance

These metrics report on the number of species in a sample. Since the definition of species is less clear for microbes than in, e.g., plant or insect studies, and since microbiome studies are usually based on genetic analysis, other grouping methods can be used instead of "species". Most commonly, highly similar microbial sequences are grouped into "OTUs" (operational taxonomic units) or "ASVs" (amplicon sequence variants), which roughly equate to species or strains, depending on which sequence identity cutoffs are used. Here, we use the more general term "taxa" to indicate the microbial equivalent of "species".

Chao1 ²⁹: Point estimator based on the observed abundance plus a term that infers the number of taxa present in the sample but not observed (because of constraints of the sequencing technique), according to the ratio between the number of singletons and the number of doubletons.
ACE (Abundance-based Coverage Estimator): Point estimator that follows the same strategy as Chao1: a baseline given by the observed abundance plus a term that infers the microorganisms present in the sample but not observed. The difference is that ACE applies a more sophisticated statistical technique for the inference, which includes a threshold to separate “rare” (not abundant) taxa from “not rare” ones. In the empirical evaluation of this work, Chao1 and ACE give similar results.
Fisher ³⁰: A univariate parametric estimator (\(\alpha\) value) that assumes that abundance values follow a logarithmic series model. It was widely used in entomological research. The parameter \(\alpha\) has no biological interpretation. This metric comes from other disciplines, and its parametric model therefore does not fit the nature of microbiome samples very well.
Margalef ³¹: This metric corrects the observed number of taxa for biases caused by variation in sampling, using the total number of reads.
Menhinick ³²: This metric follows the same approach as Margalef and differs only in the way the total number of reads is transformed. Margalef reports a lower abundance than Menhinick in samples with a large number of microorganisms. A constraint of both metrics (Margalef and Menhinick) is that they use the total number of reads to correct the bias, so they do not discriminate between the number of microorganisms (abundance) and how they are distributed (dominance).
Observed: The number of observed features present. Features may refer to Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs), depending on the denoising tool.
Robbins: A proportion between the number of singletons and the total number of ASVs. We therefore consider that it should be treated not as an alpha-diversity metric but as a likelihood that taxa present in the sample were not observed.
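The verbal descriptions above can be made concrete with a short Python sketch. It assumes the commonly used textbook forms of each metric (bias-corrected Chao1, natural logarithms, Fisher's \(\alpha\) solved numerically from \(S = \alpha \ln(1 + N/\alpha)\)); the exact formulas used in this work are those of Supplementary Table S4, which may differ in detail. The count vector and function names are hypothetical, and ACE and Robbins are left out.

```python
import math

def observed(counts):
    """Number of features (ASVs/OTUs) with at least one read."""
    return sum(1 for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1))."""
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

def margalef(counts):
    """(S - 1) / ln(N), with N the total number of reads."""
    return (observed(counts) - 1) / math.log(sum(counts))

def menhinick(counts):
    """S / sqrt(N), with N the total number of reads."""
    return observed(counts) / math.sqrt(sum(counts))

def fisher_alpha(counts, tol=1e-10):
    """Solve S = alpha * ln(1 + N/alpha) for alpha by bisection."""
    s, n = observed(counts), sum(counts)
    if not 0 < s < n:
        raise ValueError("Fisher's alpha needs 0 < S < N")
    def f(a):
        return a * math.log(1 + n / a) - s
    lo, hi = 1e-12, 1.0
    while f(hi) < 0:          # f is increasing in alpha; expand the bracket
        hi *= 2
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical ASV counts for one sample (illustration only).
counts = [10, 5, 2, 2, 1, 1, 1]
# Crude stand-in for a pipeline that discards singletons.
no_singletons = [c for c in counts if c != 1]

print("observed:", observed(counts), "chao1:", chao1(counts))
print("without singletons -> chao1:", chao1(no_singletons))
print("margalef:", margalef(counts), "menhinick:", menhinick(counts))
print("fisher alpha:", fisher_alpha(counts))
```

In this toy sample, dropping the count-1 features makes the Chao1 correction term vanish, so the estimate collapses onto the observed count. This is only meant to illustrate why singleton handling by the denoiser matters for such estimators; it does not reproduce how any particular denoising tool works.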

Dominance

Berger Parker ³³: This metric is the ratio between the abundance of the most abundant ASV and the total abundance of all ASVs in the sample. In this way, it reports the share of the sample taken up by the most abundant ASV. Compared with the formulas of other dominance metrics, it is easy to calculate and has a clear biological interpretation.
Simpson ³⁴: Measures the probability that two microorganisms randomly selected from a sample belong to the same species. It is commonly used in ecology research. On this scale, 0 represents infinite diversity and 1 no diversity; that is, the larger the Simpson value, the lower the diversity. As this is unintuitive, the value is usually subtracted from 1. Another option is to take the inverse value (1/Simpson). In both cases the resulting value does not have a direct biological interpretation.
Dominance: Reported by the QIIME2 package and often used in different studies. Its formula is 1-Simpson.
ENSPIE (Effective Number of Species, Probability of Interspecific Encounter): This metric is equivalent to the inverse of the Simpson metric.
Gini ³⁵: This metric measures the inequality among the values of a frequency distribution. It is usually applied in economics and, conceptually, is the area between the cumulative distribution of a perfectly even community and the real distribution curve of the sample, following the Lorenz model. It ranges between 0 and 1, where 0 corresponds to maximum homogeneity and 1 to maximum inequality. After a detailed study and a simulation (not shown in this work), we found that its assumptions and required parameter settings do not fit microbiome data well. This is consistent with the empirical results of this work, in which the mean Gini value was 0.99 and its standard deviation 0.0083.
McIntosh ¹³: This metric expresses the heterogeneity of a sample in geometric terms. The formula requires a variable (U) that represents the distance of the sample from the origin in an S-dimensional hypervolume. This variable has no easily understandable biological interpretation. The metric has been little used in the literature.
Strong: This metric follows a conceptual approach similar to Gini, and the same conclusions and remarks apply.
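The relationships among the Simpson-type metrics are easier to see in code. The sketch below assumes the common definitions in terms of relative abundances \(p_i\): Berger-Parker as \(\max p_i\), the probability of drawing two reads from the same taxon as \(\sum p_i^2\), its complement, its inverse (ENSPIE), and McIntosh's \(D = (N - U)/(N - \sqrt{N})\) with \(U = \sqrt{\sum n_i^2}\). Naming conventions for "Simpson" and "Dominance" differ between software packages, so the function names here are deliberately neutral and are assumptions of this example; Gini and Strong are not covered.

```python
import math

def berger_parker(counts):
    """Proportion of reads belonging to the most abundant taxon."""
    return max(counts) / sum(counts)

def same_taxon_probability(counts):
    """Sum of squared relative abundances: the chance that two reads
    drawn with replacement belong to the same taxon."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

def complement(counts):
    """1 minus the same-taxon probability (higher means more diverse)."""
    return 1.0 - same_taxon_probability(counts)

def enspie(counts):
    """Inverse of the same-taxon probability (an effective number of taxa)."""
    return 1.0 / same_taxon_probability(counts)

def mcintosh_d(counts):
    """McIntosh D = (N - U) / (N - sqrt(N)), with U the Euclidean length
    of the count vector."""
    n = sum(counts)
    u = math.sqrt(sum(c * c for c in counts))
    return (n - u) / (n - math.sqrt(n))

# Hypothetical count vectors (illustration only).
skewed = [5, 3, 2]
even = [4, 4, 4, 4]

for name, sample in (("skewed", skewed), ("even", even)):
    print(name, berger_parker(sample), same_taxon_probability(sample),
          complement(sample), enspie(sample), mcintosh_d(sample))
```

For a perfectly even community of four taxa, the inverse form returns exactly 4, which is why it is read as an effective number of species, while Berger-Parker returns 0.25, the share of any single taxon.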

Phylogenetic

Faith ³⁶: Measures the amount of the phylogenetic tree covered by the community. It is calculated by an algorithm on a tree, not from a closed mathematical formula. The value is the sum of the branch lengths of a phylogenetic tree connecting all species in the target assemblage. In this way, a higher value means more branches and therefore more richness. This metric does not consider species abundances and depends on the sequencing depth, denoising method, rarefaction depth and phylogenetic tree^37,38.
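The tree-based calculation can be sketched as follows. This is a minimal illustration, assuming a rooted tree stored as a child-to-(parent, branch length) mapping and the convention that the path from each observed tip up to the root is included; the toy tree, node names and the function `faith_pd` are hypothetical and do not correspond to any library API.

```python
def faith_pd(parent, observed_tips):
    """Sum the branch lengths on the union of tip-to-root paths.

    parent maps each non-root node to a (parent_node, branch_length) pair.
    Each branch is counted once, however many observed tips share it.
    """
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        # Walk upward until the root (absent from `parent`) or a branch
        # already counted for a previous tip is reached.
        while node in parent and node not in visited:
            visited.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total

# Toy rooted tree: tips A and B share internal node X; tip C hangs off the root.
toy_tree = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("root", 0.5),
    "C": ("root", 3.0),
}

print(faith_pd(toy_tree, ["A"]))
print(faith_pd(toy_tree, ["A", "B"]))
print(faith_pd(toy_tree, ["A", "B", "C"]))
```

Because only presence matters, adding a new tip can never lower the value, and adding a close relative of an observed tip adds little. This matches the observation that Faith tracks the number of observed features while ignoring abundances.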

Information

Shannon ³⁹: This metric measures the uncertainty about the identity of species in the sample, and its units quantify information. It comes from physics, where it quantifies the entropy of a system, and it has also been a popular diversity index in the ecological literature. This metric depends on the sample size and is intended to provide a weighted average, or a synthesis, of the abundance and the dominance of a sample. The other Information metrics derive from this formula. The main hurdle lies in interpreting the value, since it is not possible to know whether a given variation is due to the abundance of the sample, its dominance, or both.
Brillouin ⁴⁰: This metric has a theoretical basis and results similar to those of Shannon, but it assumes that there is no uncertainty in the population from which the sample is taken. It has its origins in ecology and is useful when the entire population is known or when the randomness of the sample cannot be guaranteed. This metric is not applicable to sequencing data because the completeness of the census cannot be guaranteed. It is difficult to calculate, and its correct biological interpretation is even more difficult.
Heip ⁴¹: This metric differs from the Shannon index in that it does not depend on the sample size: it is a proportion of entropy.
Pielou ⁴²: This metric is similar to Heip but differs in the normalization strategy. While Heip normalizes the exponential of the entropy by the total number of taxa, Pielou divides the entropy by the log of the total number of taxa.
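As a worked illustration of how these four metrics relate, the sketch below uses the commonly cited definitions: Shannon \(H = -\sum p_i \ln p_i\), Pielou \(J = H/\ln S\), Heip \((e^{H} - 1)/(S - 1)\), and Brillouin \((\ln N! - \sum \ln n_i!)/N\). The natural logarithm is an assumption made here (some software defaults to base 2), the count vectors are hypothetical, and the formulas used in this work are those listed in Supplementary Table S4.

```python
import math

def shannon(counts):
    """Shannon entropy H = -sum(p * ln p) over taxa with non-zero counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def pielou(counts):
    """Pielou evenness J = H / ln(S); requires at least two observed taxa."""
    s = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(s)

def heip(counts):
    """Heip evenness (e^H - 1) / (S - 1); requires at least two observed taxa."""
    s = sum(1 for c in counts if c > 0)
    return (math.exp(shannon(counts)) - 1) / (s - 1)

def brillouin(counts):
    """Brillouin index (ln N! - sum ln n_i!) / N, using lgamma(k + 1) = ln k!."""
    n = sum(counts)
    return (math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)) / n

# Hypothetical count vectors (illustration only).
even = [4, 4, 4, 4]
skewed = [13, 1, 1, 1]

for name, sample in (("even", even), ("skewed", skewed)):
    print(name, shannon(sample), pielou(sample), heip(sample), brillouin(sample))
```

Both toy samples have four taxa and sixteen reads, so any difference in Shannon between them comes from evenness alone. With real samples, richness and evenness change together, which is the interpretation problem noted for Shannon above.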

Selection of sequence data sets to test the metrics on

The 19 selected metrics were applied to publicly accessible raw 16S ribosomal biomarker sequencing files obtained from a total of 4,596 stool samples described in 13 publicly available human microbiome studies, published from 2008 to date (see online Supplementary Information, Table S3). When selecting projects, a study was excluded if it contained samples without associated metadata, but not if samples reported in the metadata were missing. We use the term reference cohorts (or reference control groups) for what are usually called “healthy patients”, a term we found imprecise when comparing different cohorts because of the complex nature of microbial communities and of their interactions with each other and with the host^43,44. Raw data (sequencing files) and metadata of the selected studies were downloaded from the repository of the National Center for Biotechnology Information⁴⁵ (NCBI) or the European Molecular Biology Laboratory⁴⁶ (EMBL). Because alpha-diversity is calculated within each sample, the conditions/groups of the original studies are not relevant for this work.

Bioinformatic processing

Downloaded samples were processed with the QIIME2 tool suite, installed in a conda environment on a personal computer running Ubuntu 21.04. Samples identified as paired-end that only had a single read were removed. DEBLUR was used for noise removal, dereplication, chimera removal and ASV selection. All datasets were also processed with DADA2 to verify its impact on singleton values; all further tests were performed only on the DEBLUR results. Phylogenetic trees were created using the q2-phylogeny plugin. The selected alpha-diversity metrics were calculated for each sample using q2-diversity (from the qiime2 package), combined with the metadata and aggregated across all datasets. Bash and Python scripts were created to generate the alpha metrics systematically for all studies. For all further visualizations and comparisons, Jupyter Notebooks⁴⁷ were designed in the Visual Studio Code⁴⁸ environment. Pandas⁴⁹ and Numpy⁵⁰ were used for managing data frames. Numpy⁵⁰, Scipy⁵¹ and R⁵² were used for statistical analysis and regression models. Plotly⁵³ and Seaborn⁵⁴ were used for visualizations.
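A hedged sketch of how per-metric q2-diversity calls might be generated from Python is shown below. It is not the scripts used in this work (those are in the repository listed under Data availability): the file names, output directory, helper names and the short metric list are hypothetical, and the script only builds and prints the command lines rather than running them.

```python
import shlex

# A few q2-diversity metric identifiers, used here as examples.
NON_PHYLOGENETIC_METRICS = [
    "observed_features",
    "chao1",
    "shannon",
    "simpson",
    "pielou_e",
    "berger_parker_d",
]

def alpha_command(table_qza, metric, out_dir):
    """Build a `qiime diversity alpha` call for one non-phylogenetic metric."""
    return [
        "qiime", "diversity", "alpha",
        "--i-table", table_qza,
        "--p-metric", metric,
        "--o-alpha-diversity", f"{out_dir}/{metric}.qza",
    ]

def faith_command(table_qza, tree_qza, out_dir):
    """Build a `qiime diversity alpha-phylogenetic` call for Faith's PD."""
    return [
        "qiime", "diversity", "alpha-phylogenetic",
        "--i-table", table_qza,
        "--i-phylogeny", tree_qza,
        "--p-metric", "faith_pd",
        "--o-alpha-diversity", f"{out_dir}/faith_pd.qza",
    ]

# Hypothetical artifact names for one study.
commands = [alpha_command("table.qza", m, "alpha") for m in NON_PHYLOGENETIC_METRICS]
commands.append(faith_command("table.qza", "rooted-tree.qza", "alpha"))

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line could then be executed inside the conda environment (for example with `subprocess.run(cmd, check=True)`), once per study, before the per-sample values are merged with the metadata.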

Author contributions

IC collected the data, proposed statistical analysis tools and wrote the paper. MI collected theoretical reviews and mathematical descriptors, ran the experiments, implemented tools (code and repositories) and performed the analysis of the results. JPB conceived the research, critically revised the draft for important intellectual content, and reviewed and edited the manuscript. All authors contributed to the article and approved the submitted version.

Acknowledgments

The authors thank Dr. Elizabeth Bik for her very helpful and critical review of the whole manuscript.

Competing Interests

The authors declare no competing interests.

Supplementary information

Supplementary figures, tables and information are available via the online additional file associated with this article.

Data availability

Project numbers (and DOIs) for the sequencing data of the 13 projects used in the analysis are available online in the Supplementary Information, Table S3. Computer code and datasets generated or analyzed during the current study are available at https://github.com/MauroIb/alpha-diversities. Computer code not included in the repository and QIIME 2 artifacts are available on reasonable request from the corresponding author.

References

Leviatan, S., Shoer, S., Rothschild, D., Gorodetski, M. & Segal, E. An expanded reference map of the human gut microbiome reveals hundreds of previously unknown species. Nat. Commun. 13, 1–14 (2022).
VanEvery, H., Franzosa, E. A., Nguyen, L. H. & Huttenhower, C. Microbiome epidemiology and association studies in human health. Nat. Rev. Genet. 1–16 (2022) doi:10.1038/s41576-022-00529-x.
After the Integrative Human Microbiome Project, what’s next for the microbiome community? Nature 569, 599 (2019).
Lori, M. et al. Compared to conventional, ecological intensive management promotes beneficial proteolytic soil microbial communities for agro-ecosystem functioning under climate change-induced rain regimes. Sci. Rep. 10, 1–15 (2020).
Aka, B. E. Z. et al. High-throughput 16S rRNA gene sequencing of the microbial community associated with palm oil mill effluents of two oil processing systems. Sci. Rep. 11, 1–12 (2021).
Campanaro, S., Treu, L., Kougias, P. G., Zhu, X. & Angelidaki, I. Taxonomy of anaerobic digestion microbiome reveals biases associated with the applied high throughput sequencing strategies. Sci. Rep. 8, 1–12 (2018).
Lawson, C. E. et al. Common principles and best practices for engineering microbiomes. Nat. Rev. Microbiol. 17, 725–741 (2019).
Lloyd-Price, J. et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature 550, 61–66 (2017).
Cullen, C. M. et al. Emerging Priorities for Microbiome Research. Front. Microbiol. 11, 136 (2020).
D’Argenio, V. & Salvatore, F. The role of the gut microbiome in the healthy adult status. Clin. Chim. Acta 451, 97–102 (2015).
Hughes, J. B., Hellmann, J. J., Ricketts, T. H. & Bohannan, B. J. M. Counting the Uncountable: Statistical Approaches to Estimating Microbial Diversity. Appl. Environ. Microbiol. 67, 4399–4406 (2001).
Thukral, A. K. A review on measurement of Alpha diversity in biology. Agric. Res. J. 54, 1 (2017).
McIntosh, R. P. An Index of Diversity and the Relation of Certain Concepts to Diversity. Ecology 48, 392–404 (1967).
Goodrich, J. K., Davenport, E. R., Clark, A. G. & Ley, R. E. The relationship between the human genome and microbiome comes into view. Annu. Rev. Genet. 51, 413 (2017).
Aydin, Ö., Nieuwdorp, M. & Gerdes, V. The Gut Microbiome as a Target for the Treatment of Type 2 Diabetes. Curr. Diab. Rep. 18, 1–11 (2018).
Ducarmon, Q. R., Hornung, B. V. H., Geelen, A. R., Kuijper, E. J. & Zwittink, R. D. Toward Standards in Clinical Microbiota Studies: Comparison of Three DNA Extraction Methods and Two Bioinformatic Pipelines. mSystems 5, (2020).
Kim, B. R. et al. Deciphering Diversity Indices for a Better Understanding of Microbial Communities. J. Microbiol. Biotechnol. 27, 2089–2093 (2017).
Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S. & Bähler, J. Proportionality: A Valid Alternative to Correlation for Relative Data. PLOS Comput. Biol. 11, e1004075 (2015).
Lozupone, C. A. & Knight, R. Species divergence and the measurement of microbial diversity. FEMS Microbiol. Rev. 32, 557–578 (2008).
Reese, A. T. & Dunn, R. R. Drivers of microbiome biodiversity: A review of general rules, feces, and ignorance. MBio 9, (2018).
Shen, X. J. et al. Molecular characterization of mucosal adherent bacteria and associations with colorectal adenomas. Gut Microbes 1, 138–147 (2010).
Su, X. Elucidating the Beta-Diversity of the Microbiome: from Global Alignment to Local Alignment. mSystems 6, 363–384 (2021).
Callahan, B. J. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).
Amir, A. et al. Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems 2, (2017).
Nearing, J. T., Douglas, G. M., Comeau, A. M. & Langille, M. G. I. Denoising the Denoisers: An independent evaluation of microbiome sequence error-correction approaches. PeerJ 6, e5364 (2018).
Bolyen, E. et al. QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. doi:10.7287/peerj.preprints.27295v2.
Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).
Willis, A. D. Rarefaction, alpha diversity, and statistics. Front. Microbiol. 10, 2407 (2019).
Chao, A. Nonparametric Estimation of the Number of Classes in a Population. Scand. J. Stat. 11, 265–270 (1984).
Fisher, R. A., Corbet, A. S. & Williams, C. B. The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. J. Anim. Ecol. 12, 42–58 (1943).
Margalef, R. Diversidad de especies en las comunidades naturales.
Menhinick, E. F. A Comparison of Some Species-Individuals Diversity Indices Applied to Samples of Field Insects. Ecology 45, 859–861 (1964).
Berger, W. H. & Parker, F. L. Diversity of Planktonic Foraminifera in Deep-Sea Sediments. Science 168, 1345–1347 (1970).
Simpson, E. H. Measurement of diversity. Nature 163, 688 (1949).
Gini, C. Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche.[Fasc. I.]. (Tipogr. di P. Cuppini, 1912).
Faith, D. P. Conservation evaluation and phylogenetic diversity. Biol. Conserv. 61, 1–10 (1992).
Chao, A., Chiu, C. H. & Jost, L. Phylogenetic diversity measures based on Hill numbers. Philos. Trans. R. Soc. B Biol. Sci. 365, 3599–3609 (2010).
Chao, A. et al. Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecol. Monogr. 84, 45–67 (2014).
Shannon, C. E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 27, 379–423 (1948).
Pielou, E. C. Ecological Diversity. (Wiley-Interscience, New York, 1975).
Heip, C. A New Index Measuring Evenness. J. Mar. Biol. Assoc. United Kingdom 54, 555–557 (1974).
Pielou, E. C. An introduction to mathematical ecology. (Wiley-Interscience, New York, 1969).
Moles, L. & Otaegui, D. The Impact of Diet on Microbiota Evolution and Human Health. Is Diet an Adequate Tool for Microbiota Modulation? Nutrients 12, (2020).
Rinninella, E. et al. What is the Healthy Gut Microbiota Composition? A Changing Ecosystem across Age, Environment, Diet, and Diseases. Microorganisms 7, 14 (2019).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 50, D20 (2022).
Kanz, C. et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 33, D29 (2005).
Kluyver, T. et al. Jupyter Notebooks -- a publishing format for reproducible computational workflows. in Positioning and Power in Academic Publishing: Players, Agents and Agendas (eds. Loizides, F. & Schmidt, B.) 87–90 (2016).
Balena, F. & Fawcette, J. Programming Microsoft Visual Basic 6.0. vol. 1 (Microsoft press Washington, 1999).
pandas development team, T. pandas-dev/pandas: Pandas. (2020) doi:10.5281/zenodo.3509134.
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Virtanen, P. et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 17, 261–272 (2020).
R Core Team. R: A Language and Environment for Statistical Computing. (2021).
Plotly Technologies Inc. Collaborative data science. (Plotly Technologies Inc., 2015).
Waskom, M. seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).
