Selection of Alpha-diversity Metrics
The alpha diversity metrics included in this study were either selected from a detailed examination of 68 microbiome studies conducted on human or animal samples (see online Supplementary Information, Table S2) or are implemented in the bioinformatics suite QIIME 2.
The 68 studies were selected to identify the alpha diversity metrics most frequently applied in research articles. Human studies published after 2008 that sequenced the 16S ribosomal biomarker were included. The most recent studies (10 from 2019, 12 from 2020 and 3 from 2021) were prioritized, although three older ones (2008, 2010 and 2012) were also included because of the quality of their experimental design and their sample size. In addition, some studies involving cats or mice were included. Each of these studies reports at least one of the following alpha diversity metrics: Chao1, Shannon, abundance, Berger-Parker, ACE, Simpson, Faith and Jost. All of these metrics are included in this work except for the Jost metric, which was designed for ecological communities. Its main assumption is based on the formula \(\alpha +\beta =\gamma\) (where \(\alpha\) is alpha diversity, \(\beta\) is beta diversity and \(\gamma\) is gamma diversity). Because \(\gamma\), the total species diversity in a landscape, is not used in microbiome diversity analysis, the Jost metric was not included among the selected metrics.
In addition, the alpha metrics implemented in the QIIME2 26 suite, some of which were not used in the 68 studies mentioned above, were also included because QIIME2 is one of the most widely used packages for the bioinformatic analysis of microbiota studies 27.
Theoretical analysis
The theoretical analysis of each metric considered which aspect of biodiversity is measured, which scientific discipline the metric was derived from, how often it is applied in the microbiota field, the variables that appreciably affect the metric, and whether the metric includes an estimation or reflects observed data, i.e. whether the metric generates a new concept (such as information theory, entropy or evenness) or is just an alias for a quantitative value intrinsic to the dataset (number of ASVs, singletons) 28. Synonymous metrics, metrics with highly similar mathematical formulas and significantly correlated metrics were grouped together, and one metric from each group was chosen for its ease of calculation or the biological interpretability of its result. The 19 selected metrics were grouped into four broad categories: abundance-, dominance-, phylogeny- or information-based. Below we provide a summary of the theoretical analysis of each selected alpha metric, grouped into these four categories (see online Supplementary Information, Table S4, for the corresponding mathematical formulas).
Abundance
These metrics report on the number of species in a sample. Since species are less clearly defined in microbes than in, e.g., plant or insect studies, and since microbiome studies are usually based on genetic analysis, other grouping units can be used instead of "species". Most commonly, highly similar microbial sequences are grouped into "OTUs" (operational taxonomic units) or "ASVs" (amplicon sequence variants), which roughly equate to species or strains, depending on which sequence identity cutoffs are used. Here, we will use the more general term "taxa" to indicate the microbial equivalent of "species".
-
Chao1 29: Point estimator based on the observed abundance, with an added term that infers the number of taxa that are present in the sample but not observed (due to constraints of the sequencing technique) from the ratio between the number of singletons and the number of doubletons.
-
ACE (Abundance-based Coverage Estimator): Point estimator that follows the same strategy as Chao1: a baseline given by the observed abundance, plus a term that infers the microorganisms present in the sample but not observed. The difference lies in the fact that ACE applies a more sophisticated statistical technique for the inference. It includes a threshold that separates "rare" (low-abundance) taxa from "not rare" ones. In the empirical evaluation carried out in this work, Chao1 and ACE gave similar results.
-
Fisher 30 : A univariate parametric estimator (the \(\alpha\) value) that assumes that abundance values follow a logarithmic series model. It has been widely used in entomological research. The parameter \(\alpha\) has no biological interpretation. This metric comes from another discipline, and its parametric model therefore does not fit the nature of microbiome samples very well.
-
Margalef 31 : This metric aims to correct biases caused by variations in sampling by using the total number of reads.
-
Menhinick 32 : This metric follows the same approach as the Margalef metric and differs only in how the total number of reads is transformed. Margalef reports a lower abundance than Menhinick in samples with a large number of microorganisms. An identified constraint of both metrics (Margalef and Menhinick) is that they use the total number of reads to correct the reported bias. As a result, they do not discriminate between the number of microorganisms (abundance) and how they are distributed (dominance).
-
Observed: The number of observed features present. Features may refer to Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs), depending on the denoising tool.
-
Robbins: The proportion between the number of singletons and the total number of ASVs. For this reason, we consider that this metric should not be regarded as an alpha diversity metric but rather as the likelihood that taxa are present in the sample but not observed.
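As an illustration of the count-based abundance metrics summarized above, the following minimal Python sketch computes them for a single sample. It is not the implementation used in this work (the metrics were calculated with q2-diversity). The function name `abundance_metrics` is hypothetical, and the formulas are assumptions: Chao1 in its bias-corrected form, Margalef as (S - 1)/ln N, Menhinick as S/sqrt(N), and Robbins exactly as worded in the text, which may differ from library implementations. ACE and Fisher are omitted for brevity.

```python
import math
from collections import Counter


def abundance_metrics(counts):
    """Return count-based richness metrics for one sample.

    counts: per-taxon read counts; zero entries are ignored.
    """
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)            # observed taxa (ASVs or OTUs)
    n_reads = sum(counts)          # total number of reads
    if n_reads < 2:
        raise ValueError("at least two reads are needed (ln N must be non-zero)")

    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]      # singletons and doubletons

    # Assumption: bias-corrected Chao1, which stays defined when f2 == 0.
    chao1 = s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

    return {
        "observed": s_obs,
        "chao1": chao1,
        # Margalef and Menhinick both rescale richness by the read total,
        # using ln N and sqrt(N) respectively.
        "margalef": (s_obs - 1) / math.log(n_reads),
        "menhinick": s_obs / math.sqrt(n_reads),
        # Robbins as worded in the text: singletons over the number of ASVs.
        "robbins": f1 / s_obs,
    }


sample = [10, 5, 2, 2, 1, 1, 1, 0]
metrics = abundance_metrics(sample)
print(metrics["observed"], metrics["chao1"])
```

For this toy sample there are 7 observed taxa, 3 singletons and 2 doubletons, so the Chao1 correction term adds exactly one unobserved taxon to the observed richness.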
Dominance
-
Berger Parker 33 : This metric is the ratio between the abundance of the most abundant ASV and the total abundance of all ASVs in the sample. In this way, it indicates how strongly the most abundant ASV dominates the rest of the sample. Compared with other dominance metrics, it is easy to calculate and has a clear biological interpretation.
-
Simpson 34 : Measures the probability that two microorganisms randomly selected from a sample belong to the same species. It is commonly used in ecology research. A value of 0 represents infinite diversity and a value of 1 represents no diversity. That is, the larger the Simpson metric, the lower the diversity. As this is unintuitive, the value is usually subtracted from 1. Another option is to take the inverse value (1/Simpson). In both cases the resulting value does not correspond to a biological interpretation.
-
Dominance: This metric is reported by the QIIME2 package and is often used in different studies. Its formula is 1-Simpson.
-
ENSPIE (Effective Number of Species, Probability of Interspecific Encounter): This metric is equivalent to the inverse of the Simpson metric.
-
Gini 35 : This metric measures the inequality among the values of a frequency distribution. It is usually applied in economics and, from a conceptual point of view, is the area between the cumulative distribution of a perfectly even community and the real distribution curve of the sample, following the Lorenz model. It ranges between 0 and 1, where 0 corresponds to the maximum degree of homogeneity and 1 indicates the maximum degree of inequality. After a detailed study and a simulation (not shown in this work), it was found that its assumptions and required parameter settings do not fit microbiome data well. This is consistent with the empirical results of this work, in which the mean Gini value was 0.99 with a standard deviation of 0.0083.
-
McIntosh 13 : This metric expresses the heterogeneity of a sample in geometric terms. The formula requires a variable (U) that contains the distance of the sample from the origin in an S-dimensional hypervolume. There is no easily understandable biological interpretation of this variable. It has been little used in the literature.
-
Strong: This metric is conceptually similar to Gini, and the same conclusions and remarks apply to it.
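The relationships among several of the dominance metrics above can be made concrete with a short Python sketch. It is illustrative only and is not the q2-diversity implementation. The function name `dominance_metrics` and the dictionary keys are hypothetical. The sketch assumes that Simpson is the sum of squared relative abundances (the probability described in the text), that Berger-Parker uses the total number of reads as its denominator, and that the McIntosh variable U is the Euclidean norm of the count vector. Gini and Strong are omitted.

```python
import math


def dominance_metrics(counts):
    """Return Simpson-related and other dominance values for one sample.

    counts: per-taxon read counts; zero entries are ignored.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    props = [c / total for c in counts]   # relative abundances

    # Probability that two reads drawn at random belong to the same taxon.
    simpson = sum(p * p for p in props)

    return {
        "simpson": simpson,
        "one_minus_simpson": 1 - simpson,   # the usual complement
        "inverse_simpson": 1 / simpson,     # equivalent to ENSPIE
        # Assumption: share of all reads held by the most abundant taxon.
        "berger_parker": max(counts) / total,
        # Distance of the sample from the origin in S-dimensional space.
        "mcintosh_u": math.sqrt(sum(c * c for c in counts)),
    }


sample = [50, 30, 15, 5]
d = dominance_metrics(sample)
print(round(d["simpson"], 3), d["berger_parker"])
```

In this toy sample one taxon holds half of the reads, so the Berger-Parker value is 0.5, while the three Simpson-derived values are simple transformations of the same sum of squared proportions.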
Phylogenetic
-
Faith 36 : Measures the amount of the phylogenetic tree covered by the community. It is calculated by an algorithm applied to a tree rather than by a mathematical formula. The value is the sum of the branch lengths of a phylogenetic tree connecting all species in the target assemblage. A higher value therefore means more branches and, in turn, more richness. This metric does not consider species abundances and depends on the sequencing depth, denoising method, rarefaction depth and phylogenetic tree 37,38.
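Because Faith's metric is defined by a procedure on a tree rather than by a closed formula, a small sketch may help. The Python function below, `faith_pd`, is hypothetical and is not the q2-diversity implementation. It assumes a rooted tree stored as a dictionary that maps each node to its parent and to the length of the branch leading to that parent, and it assumes the convention in which the paths from the observed tips up to the root are summed, counting each branch once.

```python
def faith_pd(parent, observed_tips):
    """Sum the branch lengths that connect the observed tips to the root.

    parent: dict mapping node -> (parent_node, branch_length);
            the root does not appear as a key.
    observed_tips: taxa present in the sample (abundances are not used).
    """
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        # Walk towards the root, stopping at the root or at a branch
        # that has already been counted for another tip.
        while node in parent and node not in visited:
            visited.add(node)
            node, length = parent[node][0], parent[node][1]
            total += length
    return total


# Toy tree: root has children X and C; X has children A and B.
tree = {
    "X": ("root", 1.0),
    "C": ("root", 3.0),
    "A": ("X", 1.0),
    "B": ("X", 2.0),
}
print(faith_pd(tree, ["A", "B"]))
print(faith_pd(tree, ["A", "B", "C"]))
```

Adding taxon C to the sample adds only its own branch, whereas A and B share the branch leading to X, which is counted once. Read counts never enter the calculation, in line with the remark that the metric does not consider species abundances.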
Information
-
Shannon 39 : This metric measures the uncertainty about the identity of the species in the sample, and its units quantify information. It comes from physics, where it quantifies the entropy of a system, and it has also been a popular diversity index in the ecological literature. This metric depends on the sample size and is intended to provide a weighted average, or a synthesis, of the abundance and the dominance of a sample. Other information metrics derive from this formula. The main hurdle lies in the interpretation of the value, since it is not possible to know whether a given variation is due to the abundance in the sample, its dominance, or both.
-
Brillouin 40 : This metric has a theoretical basis and results similar to those of the Shannon metric, but it assumes that there is no uncertainty in the population from which the sample is taken. It has its origins in ecology and is useful when the entire population is known or when the randomness of the sample cannot be guaranteed. This metric is not applicable to sequencing data because the completeness of the census cannot be guaranteed. It is difficult to calculate, and its correct biological interpretation is even more difficult.
-
Heip 41 : This metric differs from the Shannon index in that it does not depend on the sample size: it is a proportion of entropy.
-
Pielou 42 : This metric is largely similar to Heip, but the two differ in their normalization strategy. While Heip divides the entropy by the total number of taxa, Pielou divides it by the log of the total number of taxa.
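The way the other information metrics derive from Shannon can be sketched as follows. This Python example is illustrative and is not the q2-diversity implementation. The function name `information_metrics` is hypothetical. The formulas are assumptions: Shannon with the natural logarithm (libraries often default to base 2), Heip in its commonly cited form (e^H - 1)/(S - 1), Pielou as H/ln S, and Brillouin as (ln N! - sum of ln n_i!)/N.

```python
import math


def information_metrics(counts):
    """Return Shannon entropy and metrics derived from it for one sample.

    counts: per-taxon read counts; zero entries are ignored.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    s_obs = len(counts)
    if s_obs < 2:
        raise ValueError("evenness is undefined for fewer than two taxa")

    # Assumption: natural logarithm, so that exp(H) is an effective richness.
    shannon = -sum((c / total) * math.log(c / total) for c in counts)

    # lgamma(n + 1) equals ln(n!), which avoids computing huge factorials.
    brillouin = (
        math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)
    ) / total

    return {
        "shannon": shannon,
        "brillouin": brillouin,
        "heip": (math.exp(shannon) - 1) / (s_obs - 1),
        "pielou": shannon / math.log(s_obs),
    }


even = information_metrics([10, 10, 10, 10])
skewed = information_metrics([37, 1, 1, 1])
print(round(even["pielou"], 3), round(skewed["pielou"], 3))
```

Both toy samples contain four taxa and 40 reads, so they differ only in how the reads are distributed. The perfectly even sample reaches the maximum of both evenness metrics, whereas the skewed sample scores lower on Shannon, Heip and Pielou.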
Selection of sequence data sets to test the metrics on
The 19 selected metrics were applied to publicly accessible raw 16S ribosomal biomarker sequencing files obtained from a total of 4,596 stool samples described in 13 publicly available human microbiome studies published from 2008 to date (see online Supplementary Information, Table S3). When selecting projects, a study was excluded if it presented samples without associated metadata, but not if samples reported in the metadata were missing. We use the term "reference cohorts" for what are usually called "healthy patients". We found the latter term imprecise when comparing different cohorts, given the complex nature of microbial communities and their interactions with one another and with the host. The concept of a "healthy patient" is therefore not accurate in the field of microbiota, and such groups are better referred to as reference cohorts or reference control groups 43,44. Raw data (sequencing files) and metadata of the selected studies were downloaded from the repository of the National Center for Biotechnology Information 45 (NCBI) or the European Molecular Biology Laboratory 46 (EMBL). Because alpha diversity is calculated within each sample, conditions/groups are not relevant for this work.
Bioinformatic processing
Downloaded samples were processed with the QIIME2 tool suite, installed in a conda environment on a personal computer running Ubuntu 21.04. Samples identified as paired-end that had only a single read were removed. DEBLUR was used for noise removal, dereplication, chimera removal and ASV selection. All datasets were also processed with DADA2 to verify its impact on singleton values, but all further tests were performed only on the DEBLUR results. Phylogenetic trees were created using the q2-phylogeny plugin. The selected alpha diversity metrics were calculated for each sample using q2-diversity (from the QIIME2 package), combined with the metadata and aggregated across all datasets. Bash and Python scripts were created to generate the alpha metrics systematically for all studies. For all further visualizations and comparisons, Jupyter Notebooks 47 were designed in the Visual Studio Code 48 environment. Pandas 49 and Numpy 50 were used for managing data frames. Numpy 50, Scipy 51 and R 52 were used for statistical analysis and regression models. Plotly 53 and Seaborn 54 were used for visualizations.