Microbiome meets classical microbiology: quantifying sample CFU using 16S rRNA gene sequencing data

Background: Next-generation sequencing (NGS) has been extensively employed for microbiome characterization worldwide. As a culture-independent methodology, it has allowed high-level profiling of sample microbial composition. However, most studies are limited to information on relative bacterial abundances (sample proportions), ignoring scenarios in which sample microbial biomass can vary widely. Here, we develop an equivolumetric protocol for amplicon library preparation capable of generating NGS data responsive to input DNA, recovering proportionality between observed read counts and absolute bacterial abundances within each sample. Within a determined range, we show that the estimation of sample colony-forming units (CFU), the most common unit of bacterial abundance in classical microbiology, is challenged mostly by resolution and taxon-to-taxon variation. We propose Bayesian cumulative probability models to address such issues.

Results: Observed read counts were consistently proportional to input DNA, total microbial load, and bacterium-specific sample abundances, although a saturation tendency was observed as abundances increased. Using Bayesian cumulative probability models, predictive errors in sample CFU estimation remained consistently below one order of magnitude, as measured by the mean absolute log10 ratio (MALR).
For total microbial load, observed MALR was no greater than 0.2 during both cross-validation and validation on a test dataset. For observed bacteria, estimation of taxon-specific CFU showed MALR values of at most 0.5. We also performed leave-one-group-out cross-validation to assess predictive performance for previously unseen bacteria. While most bacteria showed MALR no greater than 1, such a threshold was exceeded only by Bacillus cereus.

Conclusions:
Being able to estimate sample CFU in a high-throughput fashion has a wide range of applications, from the study of built environments to public health surveillance. This study shows that equivolumetric protocols along with cumulative probability models allow sample CFU estimation from microbiome datasets. Further, our approach has the potential to generalize to previously unmodeled bacteria, an important feature in high-throughput settings. Lastly, it remains clear that NGS data are not inherently restricted to sample proportions only, and microbiome science can finally meet the working scales of classical microbiology.
Recent studies claim that the total number of reads in NGS-derived samples (the library size) is an arbitrary sum without biological relevance, rendering microbiome data necessarily compositional in nature [8,10,11]. Nonetheless, a previous study has demonstrated that library sizes need not be arbitrary, potentially holding significant correlations with input bacterial cell counts [12].
The possibility of estimating absolute microbial abundance from NGS data has major impacts for research, government agencies, and industry, empowering researchers and policy makers to address microbiological issues in common scales, such as colony-forming units (CFU), without giving up the advantages of high-throughput technology. Further, relative information alone limits decision-making in scenarios in which sample microbe biomass is known to vary widely [13][14][15]. Bacterial percentages within a sample are hardly informative in terms of surface contamination levels or even risk of microbial environmental dispersion.
Notice that, here, we refer to "relative abundance" as taxa (potentially scaled) proportions or percentages within each sample, whereas the absolute counterpart refers to the corresponding total CFU, although that ultimately refers to bacterial concentration, i.e., absolute abundance per sample volume. For instance, a sample with 50% of S. aureus and 10^2 CFU is markedly different from another with the same percentage of S. aureus but 10^5 CFU. We consider that these two samples differ in absolute abundance of S. aureus. We will keep this interpretation of the term "absolute abundance" throughout this manuscript.
We further address potential terminology issues in the Methods section "A note on terminology".
The aim of this study is to develop a library preparation strategy and statistical analysis approach that, together, are capable of estimating sample CFU using NGS data only, for both total microbial load and taxon-specific sample abundances. The approach herein described was primarily designed for the analysis of samples from indoor environments with varying total biomass (throughout this manuscript we refer to "biomass" as the sample bacterial biomass). Our method enables improved microbiome surveillance of hospitals and similar sampling sites. Potentially, it can be adapted to broader applications such as clinical evaluations and food safety management.

Equivolumetric protocol for amplicon library preparation
First, we developed a customized laboratory protocol for amplicon library preparation to recover absolute microbial abundances in each sample after NGS sequencing (Figure 1). Briefly, we adapt traditional methods to handle unnormalized inputs of DNA and amplicon. While equimolar protocols standardize samples and PCR to fixed concentrations [3,[16][17][18][19][20], we sample equal volumes of DNA or amplicon into each library preparation to keep major concentration differences intact. The PCR steps are also optimized for the same purpose, minimizing amplification cycles and stopping before most reactions plateau. Using fewer PCR cycles, we also decrease error rates and chimera formation, as previously reported [18,21].
The agarose gel check after library amplifications can still be useful for samples with sufficient biomass, though it is often the case that low biomass samples show no visible bands, hampering any useful interpretation [12]. For this reason, we do not check for amplicons in agarose gels. In our protocol, PCR pooling is also performed in an equivolumetric fashion, and DNA sequencing follows as in traditional methods using Illumina platforms. The main justification for our proposal is the fact that samples from indoor environments vary widely in terms of total biomass, generally characterizing low biomass samples [12][13][14][15], and thus render relative information less useful.

Equivolumetric protocol, input DNA, and absolute bacterial abundances
To investigate whether our approach is capable of recovering absolute abundance information, we first assessed the relationship between NGS-generated reads and the corresponding input DNA. We used a synthetic DNA molecule with known concentrations (Figure 2A) and sequenced replicated serial dilutions. A polynomial fit demonstrates the sigmoid trend, which indicates that NGS-based quantification in absolute terms may still be bounded above by methodological constraints under our protocol (e.g., the amplification plateau for highly concentrated samples). We also estimated the corresponding copy numbers and observed similar behavior (Figure 2B). Nonetheless, it is clear from this result that the total number of reads per sample (library size) increases with input DNA, which agrees with previous results showing a close relationship between NGS reads and total bacterial cells [12].
To further confirm the association between reads and sample bacterial load, the sum of the CFU of all bacteria within a sample (also referred to as total microbial load), we performed NGS sequencing on serially diluted samples of known bacterial concentrations (in terms of CFU), mimicking a surface sample collection from an indoor environment. Our results indicate that the equivolumetric protocol does recover proportionality between NGS data and microbial load in terms of both library sizes (Figure 2C) and bacterium-specific counts (Figure 2D). While PCR saturation and the availability of reads during sequencing can still impose upper quantification limits, the proportionality was consistently reproduced with respect to known DNA concentrations, DNA copies, and CFU. Therefore, our results strongly suggest that compositional constraints motivated by supposedly arbitrary library sizes are not an inherent feature of microbiome datasets, which can indeed be generated to retain sample absolute abundance information, at least within the observed range of concentrations.
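As an illustration of this saturating relationship, the sketch below (in Python, with simulated, hypothetical values; the analyses in this study were carried out in R) fits a polynomial trend to a logistic read-count response, mirroring the qualitative behavior of Figures 2A-B:

```python
import numpy as np

def logistic(x, lo, hi, mid, rate):
    """Four-parameter logistic: log10 reads rise with log10 input DNA, then plateau."""
    return lo + (hi - lo) / (1.0 + np.exp(-rate * (x - mid)))

rng = np.random.default_rng(42)

# Hypothetical serial dilution: log10 input DNA copies, in triplicate per level.
log_dna = np.repeat(np.arange(3.0, 9.5, 0.5), 3)

# Simulated log10 library sizes: exponential phase followed by PCR saturation.
log_reads = logistic(log_dna, lo=2.0, hi=5.0, mid=6.0, rate=1.2)
log_reads = log_reads + rng.normal(0.0, 0.1, size=log_reads.size)

# Cubic polynomial fit, mirroring the polynomial trend used to display the sigmoid.
coefs = np.polyfit(log_dna, log_reads, deg=3)
grid = np.arange(3.0, 9.5, 0.5)
fitted = np.polyval(coefs, grid)
```

The fitted curve increases steeply through the exponential phase but gains little per dilution step near the plateau, which is the behavior that motivates the bounded models used later.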

Using data from multiple sequencing runs
The replicates in Figures 2C-D come from four different sequencing runs, demonstrating the reproducibility of the method. As we keep total biomass differences intact, during data analysis we only correct for variations in the number of reads made available a priori for sequencing in each run (the expected sample coverage; see Methods). Yet, such variations hardly impact values observed across orders of magnitude on the log10 scale. Supplementary Material 1 describes this correction in detail, along with an application using real hospital microbiome samples replicated across 14 sequencing runs. We also demonstrate that this approach does not depend on the DNA extraction method by testing four different extraction protocols with equivalent results (Supplementary Figure 1). A key assumption is that the expected sample coverages are not underestimated for the current quantification range, so that the read counts are not censored by lack of read availability or library saturation.
While more sophisticated data transformations may be needed in other situations, when biomass varies by orders of magnitude this step is largely simplified under our protocol. Even though we can still refer to the proposed procedure as normalization, it is markedly different from the methods employed by software packages such as DESeq2 and edgeR in the context of differential expression/abundance analysis [22,23]. While we do log-transform the NGS reads, the procedure indicated above merely scales up the read counts from pools of samples with lower expected sample coverages, thereby allowing the use of data from different sequencing runs to make predictions. Importantly, we can nullify the effect of this procedure by using data from a single run or, similarly, by keeping the expected sample coverage constant across all sequencing runs. We do not assume, therefore, any dependency between this procedure and the observed proportionality between reads and absolute abundances. On the other hand, the amplification step is a well-known methodological constraint in NGS [24].

Proportionality constraints and limits of quantification
Even though we optimize this step for quantification, there is a point above which the amplification reaches a reaction plateau (Supplementary Figure 2). This is the main reason behind the assumption that our protocol quantifies sample absolute abundances only within a limited range. By limiting the PCR cycles, we can detect major differences in input DNA during the exponential phase of the reactions, but not once the plateau is reached. Our PCR optimization therefore sets the upper bound of the bacterial abundances that we are able to quantify. Above such a point, all information generated is limited to the notion that the abundance of a given taxon (or the total microbial load) is greater than or equal to the upper limit of quantification. As we retain the ability to detect bacterial proportions, despite known biases [25], the situation of such taxa is similar to that under equimolar protocols, with the added information of a particularly high sample abundance.
In practice, microbiome samples are often highly variable in terms of total microbial load. Faecal samples are generally characterized by high biomass, while hospital and indoor samples usually present low biomass [13,15,26]. Low biomass samples impose more challenges to their processing because of contamination and process inefficiencies [12]. In fact, in this study we removed E. coli sequences from our results since these were frequently detected in our negative controls. We were able to track the corresponding sequences to the DNA polymerase reagent. E. coli has been reported as a common molecular biology contaminant from recombinant enzymes such as polymerases [27]. We stress that low biomass samples should always be processed with special care, accompanied by negative controls to assess possible contaminations [28][29][30].

Modelling absolute abundance using NGS data
One way to recover absolute bacterial abundance for each sample is to associate relative information from high-throughput technology with absolute information from other methods. This has been previously done using qPCR or flow cytometry data as absolute abundance methodologies, with NGS as the provider of bacterial proportions [7,9,31]. Here, however, total microbial load gains importance for the surveillance of built environments as a measure of total contamination at the sampling site, as opposed to merely a means to project proportions onto absolute terms. In the previous section, we presented data showing that NGS reads can respond monotonically to increases in microbial load. We now turn the axes around to describe the present estimation task: given an observed library size (total sample reads), can total microbial load be predicted reliably? Figure 2E illustrates the problem.
Notice that each value of microbial load varies only in orders of magnitude but corresponds to a relatively wide range of observed library sizes. Again, a polynomial trend is fitted, demonstrating the inadequacy of standard linear regression in this case. A similar behavior is observed if we analyze the read counts for each bacterium individually, despite significant taxon-to-taxon variation (Figure 2F).
Such naive linear models ignore the monotonic, stepwise fashion in which total microbial load and absolute bacterial abundances vary conditionally on observed reads. Also, there are no prediction bounds: extrapolation towards higher library sizes yields continuously higher predicted values. This is likely unrealistic, given the plateaus observed in Figures 2A and 2B -and the inevitable PCR saturation as total sample biomass increases.
These characteristics led us to consider a cumulative probability model to robustly estimate total microbial load and absolute bacterial abundances using NGS data. In the next subsection, we briefly describe the fitted model for total microbial load (see Methods for the formal model specification). We then extend it to a hierarchical structure that allows variation across bacteria in order to predict taxon-specific absolute abundances in each sample. Supplementary Materials 2 and 3 describe both modelling strategies in further detail, including extensive prior and posterior predictive checks as well as assessment of modelling assumptions.

Cumulative probability model predicts total microbial load
Colony-forming units are continuous measures. However, in classical microbiology, the ability to quantify microbial abundance has limited resolution: observed values differ mainly in orders of magnitude, and it is difficult to state replicable differences within the same magnitude range. Nonetheless, decision-making based on such data relies mostly on logarithmic differences, and specific orders of magnitude are often the main interest, e.g., diagnosis of urinary tract infection [32], or bacterial characterization of International Space Station surfaces [19] and hospital environments [33].
In this scenario, we propose a Bayesian cumulative probability model (CPM) with a logit link, also known as the proportional odds model. We fit the model using the R package brms [37], and Figures 3A-3D show the results. Figure 3A shows the estimated class probabilities as a function of library size, and Figure 3B shows the implied expectations (black solid line, 95% credible intervals in blue) as well as the class of highest probability (CHP; red solid line).
Predictive intervals for the CHP are also shown in light grey. Notice how the red line in Figure 3B follows closely the behavior of the class probabilities in Figure 3A. The predictions generated by the ordinal model are by construction monotonic. Also note that both expectation and CHP are bounded within the observed outcome space, overcoming extrapolation issues related to the previous naive model.
In Figure 3C, a posterior predictive check indicates that the overall structure of the observed data is well captured by the posterior draws of the model. Figure 3D shows model-implied tail probabilities, herein defined as the (conditional) probability of observing at least abundance c_k, i.e., Pr(Y ≥ c_k | X = x) rather than Pr(Y > c_k | X = x).
The ability to derive lower/upper bounds with high probabilities is a major advantage of CPM. Under this model, even though distinguishing abundance values may be challenging for certain ranges of the predictor space, one can still rely on tail probabilities to guide decision-making. For instance, should CHP- and expectation-based predictions prove limited in performance, or the predictive interval prohibitively wide, one can still use the highest class c_k such that the probability of having at least c_k CFU is no less than a given probability threshold γ, i.e., for each observation find max_{1 ≤ k ≤ K} {c_k : Pr(Y ≥ c_k) ≥ γ} for some large γ (e.g., 95%).
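The tail probabilities and the decision rule above can be sketched as follows (a Python illustration; the cumulative probabilities are hypothetical stand-ins for fitted CPM output, and the models in this study were fit with brms in R):

```python
import numpy as np

# Hypothetical cumulative probabilities P(Y <= c_k | X = x) for K = 5 abundance
# classes, as a fitted CPM would output for one observation.
cum = np.array([0.05, 0.20, 0.55, 0.90, 1.00])

# Class probabilities P(Y = c_k): first differences of the cumulative curve.
probs = np.diff(np.concatenate(([0.0], cum)))

# Tail probabilities P(Y >= c_k): probability of observing at least class c_k.
tail = 1.0 - np.concatenate(([0.0], cum[:-1]))

def highest_class_at_least(tail_probs, gamma):
    """0-based index of the highest class c_k with P(Y >= c_k) >= gamma."""
    return int(np.flatnonzero(np.asarray(tail_probs) >= gamma).max())

k = highest_class_at_least(tail, gamma=0.9)
```

Here `k` picks the largest abundance class that the model still supports with probability at least γ, the conservative lower bound described in the text.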
In the next subsection, we extend the previous model to handle taxonomic information in a hierarchical fashion so that one can make predictions of absolute abundance for each bacterium individually. We then validate both models using cross-validation and prediction on held-out samples (test sets) in the following section.

Hierarchical CPM predicts absolute bacterial abundances
In order to handle taxonomic information, we formulate a similar model which incorporates taxon-specific effects in a hierarchical structure. The major difference is that the linear predictor term is parametrized with population-level parameters and group-level counterparts (for both intercept and slope), allowing predictions of many bacteria with a single model and taking advantage of partial pooling [38].
While we use seemingly weakly informative priors (see Methods for the full model specification), their joint behavior favors outer classes to improve distinguishability when dealing with classes of overlapping average numbers of reads. This is illustrated with prior predictive checks and assessment of CPM assumptions in Supplementary Material 3, which also shows the detailed workflow and visualizations for each observed bacterium. Our prior choice resulted from model comparison with approximate leave-one-out cross-validation [39,40]. Figure 3E shows the corresponding posterior predictive check, suggesting the data are well captured by the model-implied data generation process for all bacteria. B. cereus, S. aureus, and S. epidermidis may represent challenging cases, although this can be an artifact of estimating varying effects with only six bacteria (stronger priors led to worse fits during model comparison).

Model validation
We validated both models using 10-fold cross-validation (CV) and prediction on held-out test sets comprising 10% of the total number of observations. We assess performance as both classification and regression tasks, using CHP- and expectation-based predictions. Figure 4A shows the 10-fold CV results for the total microbial load model. For visualization, we have split the assessed metrics into those bounded between 0 and 1 and unbounded ones. Bounded metrics based on CHP included the observed coverage of the 95% predictive interval, Somers' Delta (a measure of ordinal association), classification accuracy, and Spearman's rank correlation. The latter was also assessed for expectation-based predictions. In general, these metrics varied well above 0.9. Notably, the predictive intervals showed 100% coverage, which is likely conservative. Nonetheless, most intervals spanned two abundance classes as in Figure 3B (see also Supplementary Materials 2 and 3), suggesting errors occur mainly within one order of magnitude of the true values. Ordinal association, as measured by Somers' Delta, was consistently greater than 0.95 for both CV and the test set.
Unbounded metrics relied on modified versions of absolute errors, for both CHP- and expectation-based predictions. MALR and MAEr denote the Mean Absolute Log-Ratio and the Mean Absolute Error relative to the true value, respectively. For predictions ŷ_i and true values y_i (i = 1, …, n), MALR = (1/n) Σ_i |log10(ŷ_i / y_i)| and MAEr = (1/n) Σ_i |ŷ_i − y_i| / y_i.
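MALR and MAEr, as defined above, can be sketched in a few lines (Python illustration; the example values are hypothetical):

```python
import numpy as np

def malr(pred, true):
    """Mean Absolute Log-Ratio: average |log10(pred / true)|."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.mean(np.abs(np.log10(pred / true))))

def maer(pred, true):
    """Mean Absolute Error relative to the true value: average |pred - true| / true."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.mean(np.abs(pred - true) / true))

# Predictions off by a constant factor of 2 on the CFU scale:
true = np.array([1e2, 1e3, 1e4])
pred = 2.0 * true
m1 = malr(pred, true)   # log10(2), i.e., about 0.30 orders of magnitude
m2 = maer(pred, true)   # 1.0: the error is as large as the true value
```

Note how a two-fold error, large in absolute terms, registers as only ~0.3 on the MALR scale, which is the scale-insensitivity the text highlights.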
MALR represents deviation in orders of magnitude and remained below 0.2 during CV for both CHP- and expectation-based predictions, a result reproduced in the test-set evaluation (Figure 4B). Perhaps more intuitively, MAEr represents absolute errors as proportions of the true values and tended to be smaller for CHP-based predictions than for expectation-based ones. During 10-fold CV and test-set validation, we did not observe MAEr values greater than 0.7.

Although not as common as mean absolute or mean squared errors, the metrics assessed herein do not penalize estimates differently across orders of magnitude, offering advantages in interpretation. A MALR value of 1 corresponds to a ratio of one order of magnitude (on the log10 scale) between predicted and observed values. A MAEr of 1 indicates a prediction whose absolute error is as large as the true value, which is still largely insignificant given the logarithmic scale of interest. Using both CHP- and expectation-based predictions, our results indicate that predictions from the total microbial load model were mostly kept within the observed orders of magnitude.

Predicting abundance of previously unseen bacteria
As the hierarchical CPM enables prediction for previously unseen bacteria, we also performed leave-one-group-out CV to assess how our model could generalize in high-throughput settings, in which one may have dozens of taxa of interest. For each bacterium, we held out its corresponding data points and trained a separate model with the remaining data. We then performed predictions for the held-out taxon, treating it as "previously unseen", i.e., not used during model fitting. Most bacteria showed MALR values no greater than 1; this threshold was exceeded only by B. cereus, which deviates greatly from the overall profiles. Thus, generalization in high-throughput settings is challenged by such atypical bacteria, yet prediction errors for most bacteria varied below one order of magnitude, suggesting the approach is promising. Improvements could also be achieved, for instance, through the inclusion of more predictors such as 16S rRNA gene copy number, Gram-like classifications, and even taxonomic information from higher ranks. The addition of more bacterial species to estimate varying effects could also be beneficial.
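The leave-one-group-out scheme can be sketched as follows (Python illustration with made-up taxon labels; the actual models were fit with brms in R):

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (group, train_idx, test_idx), holding out one group (taxon) at a time."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.flatnonzero(groups == g)
        train = np.flatnonzero(groups != g)
        yield g, train, test

# Hypothetical taxon labels for six data points from three bacteria:
labels = ["B. cereus", "B. cereus", "S. aureus",
          "S. aureus", "E. faecalis", "E. faecalis"]
splits = {g: (train, test) for g, train, test in leave_one_group_out(labels)}
```

Each held-out taxon is then predicted by a model trained only on the remaining taxa, mimicking a "previously unseen" bacterium.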

Discussion
Here we have shown that assessment of sample absolute bacterial abundance using NGS data becomes possible upon protocol alteration within specified conditions, and that the remaining challenges lie within the realm of resolution and taxon-to-taxon variation. While library sizes do depend on sequencing effort (a partially arbitrary sequencing setup), equivolumetric protocols ensure the maintenance of major input DNA variations, at least for certain ranges of absolute abundance.
Quantification of absolute abundances using 16S rRNA gene amplicon sequence data has been an object of study in the recent microbiome literature [7][8][9]. However, most of these works rely on secondary technologies, processing NGS data to yield sample proportions only, and thus potentially miss the high-throughput capabilities inherent to NGS. By fixing volumes rather than concentrations during library preparation, our strategy allows the detection of major variations in input DNA within a specified quantification range. However, estimation of sample absolute abundances remains challenged by resolution and taxon-to-taxon variation: a smooth continuum of sample CFU values is hardly observable in practice, and the expected value of the read counts given a value of CFU is notably variable across bacteria.
Bayesian cumulative probability models (CPM) address the above issues naturally by modelling the cumulative probability function of the sample CFU conditional on the observed read counts [34]. This strategy allows estimation of total microbial load as well as sample absolute abundances of observed bacteria. Further, a wealth of outputs is automatically available upon CPM fitting, e.g., conditional means, quantiles, and tail probabilities. These quantities can be readily explored to inform decision making with more than simply point estimates, e.g., estimate the probability that a certain critical sample in a hospital presents with more than a given threshold of CFU for a given taxonomy of interest.
By construction, the model can also generate continuous and (ordinal) categorical predictions [36]. Although the most likely outcome tends to agree with the estimated conditional mean, the former varies as a step function while the latter varies smoothly. The use of the Bayesian framework also enables full probabilistic quantification of uncertainty, making the interpretation of estimation intervals straightforward.
Lastly, by taking advantage of a hierarchical CPM, we show that the method potentially generalizes to previously unseen bacteria, i.e., predicts the sample CFU for taxa that were not primarily included in the modelling step. This is a crucial feature in high-throughput settings as potentially detected taxa are not known a priori. Also, modelling many taxa directly may be financially prohibitive.
Overall, our results indicate that the predictive errors for CFU do not exceed one order of magnitude (on the log10 scale) for observed bacteria. While total microbial load seems more reliably estimated, for both models the absolute errors tend to be no greater than two times the true values, in a reality of logarithmic differences. While one might doubt the importance of estimating 4 × 10^3 CFU compared to a true value of 2 × 10^3 CFU, we acknowledge our models can be improved by adding more data points, predictors, and different bacteria. A key modelling limitation is that, even though the hierarchical model has shown relative success at predicting previously unseen bacteria, it has also shown relatively poor performance at predicting sample CFU of Bacillus cereus when this taxon was held out during model fitting. This suggests that extrapolation in high-throughput settings may still require modelling more than simply a few taxonomies. Currently, the main laboratory limitation is the upper limit of quantification, likely caused by the PCR amplification step. Our method carries the potential to be adapted to address both modelling and laboratory challenges, which is left for future work.

Conclusion
This study has presented an equivolumetric protocol for library preparation prior to 16S rRNA gene amplicon sequencing, as well as a modelling strategy to predict sample CFU given observed NGS read counts. Assuming the same protocol for a set of samples of similar nature, the proposed procedures were shown to recover the proportionality between library sizes and total microbial load, and, more generally, between NGS reads and absolute bacterial abundances within each sample. Still, further research is needed to understand whether such models can generalize to high-throughput settings, in which data from a small subset of taxa are used to make predictions on previously unseen bacteria. Future challenges also involve extending the range of bacterial abundances properly captured by both the protocol and the models.
It remains clear, though, that the claims that library size is always an arbitrary sum and that NGS reads can carry only proportion-based information, often taken for granted by several previous works, are readily overcome by the methods herein proposed. Finally, our work paves the way to unleash the power of high-throughput sequencing towards the quantification of absolute bacterial abundances in terms of sample CFU, taking microbiome science one step closer to the working scales of classical microbiology.

Samples
A synthetic DNA fragment with a naturally non-occurring sequence was designed with the 16S rRNA V3/V4 primer sequences flanking its extremities. This 544 bp fragment was synthesized as a gBlocks® Gene Fragment by IDT (IA, USA). The DNA was eluted to a final concentration of 10 ng/µL in TE buffer following the manufacturer's instructions and then serially diluted from 0.56 ng/µL to 0.00000056 ng/µL in 10× steps. This serial dilution is equivalent to a range of 954,000,000 down to 954 copies of the synthetic DNA. Samples were processed in experimental triplicates.

Library preparation and sequencing
The 16S rRNA amplicon sequencing libraries were prepared using the V3/V4 primers (341F CCTACGGGRSGCAGCAG and 806R GGACTACHVGGGTWTCTAAT) [41,42] in a two-step PCR protocol. The first PCR was performed with the V3/V4 universal primers containing a partial Illumina adaptor, based on the TruSeq structure adapter (Illumina, USA), which allows a second PCR with the indexing sequences, similar to previously described procedures [17]. Here, we add unique dual indexes per sample in the second PCR. Two microliters of individual sample DNA were used as input in the first PCR reaction. The PCR reactions were carried out using Platinum Taq (Invitrogen, USA) under the following conditions: for PCR 1, 95°C for 5 min, 25 cycles of 95°C for 45 s, 55°C for 30 s, and 72°C for 45 s, and a final extension at 72°C for 2 min. For PCR 2, two microliters of the first PCR were used and the amplification conditions were 95°C for 5 min, 10 cycles of 95°C for 45 s, 66°C for 30 s, and 72°C for 45 s, with a final extension at 72°C for 2 min. All PCR reactions were performed in triplicate. The second PCR reactions were cleaned up using AMPure XP beads (Beckman Coulter, USA), and an equivalent volume of each sample was added to the sequencing library pool. In each batch of PCR, a negative reaction control (CNR) was included. The final DNA concentration of the library pool was estimated with Quant-iT PicoGreen dsDNA assays (Invitrogen, USA) and then diluted for accurate qPCR quantification using the KAPA Library Quantification Kit for Illumina platforms (KAPA Biosystems, MA). The sequencing pool was adjusted to a final concentration of 11.5 pM (for V2 kits) or 18 pM (for V3 kits) and sequenced on a MiSeq system (Illumina, USA), using the standard Illumina primers provided with the manufacturer's kit. Single-end 300-cycle runs were performed using V2x300, V2x300 Micro, V2x500, or V3x600 sequencing kits (Illumina, USA) with the sample coverages specified in Supplementary Table 1.

Bioinformatics analysis and taxonomic assignment
The sequenced reads obtained were processed using the bioinformatics pipeline described below (BiomeHub, SC, BR - hospital_miccrobiome_rrna16s:v0). First, Illumina reads are checked for the amplicon forward primer, which must be present at the beginning of the read; only one mismatch is allowed in the primer sequence, and the whole read is discarded if this criterion is not met. The primers are then trimmed and the accumulated read error is evaluated. Read quality filtering is performed by converting each nucleotide Q score into an error probability (e_i); these are summed and divided by the read length (L) to obtain the read error rate (E). Taxonomies are assigned to each oligotype using a lowest common ancestor (LCA) algorithm. If more than one reference can be assigned to the same oligotype with equivalent similarity and coverage metrics (e.g., two distinct reference species mapped to oligotype "A" with 100% identity and 100% coverage), the taxonomic assignment algorithm moves the taxonomy to the lowest level of unambiguous resolution (genus, family, order, class, phylum, or kingdom), according to previously established similarity thresholds [47].
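The quality filter described above can be sketched as follows (a simplified Python illustration of the E statistic, not the pipeline's actual implementation):

```python
def expected_error_rate(quals):
    """Read error rate E: per-base error probabilities e_i = 10 ** (-Q_i / 10),
    summed and divided by the read length L."""
    errors = [10 ** (-q / 10) for q in quals]
    return sum(errors) / len(quals)

# A 4-base read, all bases at Q30 (error probability 0.001 each):
e_q30 = expected_error_rate([30, 30, 30, 30])

# A 2-base read mixing Q10 (0.1) and Q20 (0.01):
e_mix = expected_error_rate([10, 20])
```

Reads whose E exceeds a chosen threshold would then be discarded; the threshold itself is pipeline-specific and not restated here.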

Multiple Sequencing Correction (Normalization)
The correction procedure is fully described in Supplementary Material 1. Briefly, let y_{t, i ∈ S_j} denote the normalized counts for taxon t in sample i ∈ S_j, where S_j is the set of samples from the jth sequencing run. The normalization is then simply a rescaling of the raw counts x_{t, i ∈ S_j}:

y_{t, i ∈ S_j} = x_{t, i ∈ S_j} / f_j

The size factor f_j is sequencing-specific and is calculated as follows:

f_j = r*_j / max_{j' = 1, 2, …, J} r*_{j'}

where r*_{j'} is the average number of reads per sample made available a priori in the sequencing pool of interest within sequencing run j' (expected sample coverage). The lower the relative availability, the smaller the resulting factor, and thus the greater the normalized values relative to the raw counts. Once normalized, divergences across samples from different sequencing runs, but of similar bacterial abundances, are assumed to arise mostly from differences in sequencing efficiency - yet of negligible order of magnitude.
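The rescaling can be sketched as follows (illustrative Python with hypothetical run names and coverages; the actual procedure is detailed in Supplementary Material 1):

```python
def size_factors(expected_coverage):
    """Sequencing-run size factors: each run's expected coverage
    relative to the maximum expected coverage across runs."""
    top = max(expected_coverage.values())
    return {run: cov / top for run, cov in expected_coverage.items()}

def normalize(raw_counts, factor):
    """Divide raw counts by the run's size factor, shifting counts up
    for runs whose expected coverage is below the maximum."""
    return {taxon: count / factor for taxon, count in raw_counts.items()}

coverage = {"run_1": 50_000, "run_2": 25_000}   # hypothetical coverages
factors = size_factors(coverage)                 # run_2 gets factor 0.5
normalized = normalize({"taxon_A": 100}, factors["run_2"])  # shifted up
```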
This procedure is markedly different from those implemented in software packages such as DESeq2, or from transformations in compositional data analysis, which were mostly proposed within the context of differential abundance analysis [23,48]. Here, we only shift read counts up when the corresponding expected sample coverage is below the maximum within a given dataset. While variation in relative availability does impact raw counts, it is likely innocuous with respect to the proportionality between read counts and sample absolute abundances observed and addressed herein.

Statistical Analysis
All statistical analyses were performed using R software environment version 3.6.2 [49]. We used the brms R package and Stan (v. 2.11.1 and v. 2.19.1, respectively) to perform all Bayesian analyses and the tidyverse package suite (v. 1.3.0) for data wrangling and visualization [37,50,51]. We also used the phyloseq R package (v. 1.30.0) to handle microbiome data [52]. Supplementary Table 2 lists all R packages used along with corresponding versions and references. The entire modelling strategy is further detailed in Supplementary Material 2 (total microbial load) and 3 (bacterial abundances). All models were fit within the Bayesian framework.

CPM for total microbial load estimation
We used a cumulative probability model (CPM) with a logit link, also known as a Proportional Odds (PO) model, to predict total microbial load from NGS reads. Let Y_i denote the total microbial load (in CFU scale) of the ith sample. Given our serially diluted samples, we only observe K = 5 abundance values, so Y_i takes values in {c_1, c_2, …, c_K} = {0.84 × 10², 0.84 × 10³, …, 0.84 × 10⁶}. We then compute the cumulative probabilities using ordinal logistic regression:

logit[Pr(Y_i ≤ c_k)] = ψ_ik = α_k − β N_i, for k = 1, 2, …, K − 1

where N_i denotes the library size (total number of reads) for observation i. The linear predictor ψ_ik has two sets of unknown parameters: the intercepts α_k and the slope β. We placed weakly-informative priors on both, with no prior preference for any class c_k. The intercepts α_k are often called cutpoints, as they represent the intersections between observable categories on the cumulative logit scale [54]. Notice we set the same prior for all K − 1 cutpoints. The negative sign on the slope parameter β arises naturally from the PO model derivation with a latent continuous variable motivation. It also guarantees intuitive interpretations: positive values of β indicate a positive effect towards higher categories [55].
The ordinal model also allows going beyond conditional (cumulative) class probabilities to estimate conditional expectations, quantiles, and tail probabilities [36]. This is a major advantage of CPMs over more commonly used methods such as linear and quantile regression [30]. We fitted the model using brms and Stan [37,50].
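To make the PO structure concrete, the class probabilities implied by a cutpoint vector and slope can be computed as successive differences of cumulative logistic probabilities (illustrative Python with made-up parameter values, not the fitted posterior):

```python
import math

def po_class_probs(x, cutpoints, beta):
    """Class probabilities under a proportional-odds (cumulative logit) model.

    Pr(Y <= c_k) = logistic(alpha_k - beta * x) for the K - 1 cutpoints;
    individual class probabilities are the successive differences of these
    cumulative probabilities (with Pr(Y <= c_K) = 1).
    """
    logistic = lambda z: 1.0 / (1.0 + math.exp(-z))
    cum = [logistic(a - beta * x) for a in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Made-up cutpoints and slope for K = 5 load classes; x is a
# read-count-derived predictor for one sample.
probs = po_class_probs(x=4.0, cutpoints=[2.0, 5.0, 8.0, 11.0], beta=2.5)
```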

Hierarchical CPM for absolute bacterial abundances
We developed a cumulative logit random-effects model to predict bacterium-specific abundances from observed NGS reads, which is essentially a multilevel version of the previous model. Except for the taxon subscript b, the parameters are computed as before, and the ordinal regression becomes:

logit[Pr(Y_ib ≤ c_k)] = ψ_ibk = α_kb − β_b N_ib, for k = 1, 2, …, K − 1

where N_ib is the number of reads in observation i from bacterium b. Differently from the previous model, here we have only four classes (K = 4) and hence K − 1 = 3 cutpoints. We allow both intercepts and slopes to vary across bacteria:

(α_kb, β_b)ᵀ ∼ MultivariateNormal((α_k, β)ᵀ, Σ)

Notice the mean of this two-dimensional Gaussian distribution is the vector of population-level parameters (α_k, β)ᵀ. Thus, the variance-covariance matrix Σ governs how the group-level parameters vary around their population-level counterparts:

Σ = [ σ²_{α_k}        ρ σ_{α_k} σ_β ]
    [ ρ σ_{α_k} σ_β   σ²_β          ]

We set the following prior distributions for the unknown parameters:

α_k ∼ Normal(0, 2.5) and β ∼ Normal(0, 2.5)
σ_{α_k} ∼ Exponential(1) and σ_β ∼ Exponential(1)

The LKJ prior on the correlation matrix (which describes the correlation ρ between α_kb and β_b) drives skepticism regarding extreme values near −1 and 1 [56]. Jointly, the prior distributions slightly favor the outer categories (k ∈ {1, K}) in order to improve distinguishability in cases with overlapping average numbers of reads. The prior choice was driven by model comparison using approximate leave-one-out cross-validation, as well as prior and posterior predictive checks [39,40].
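The partial-pooling structure can be sketched by drawing bacterium-specific parameters around population-level values from a correlated bivariate Gaussian (illustrative Python; all parameter values below are hypothetical, not fitted estimates):

```python
import math
import random

random.seed(7)

# Hypothetical population-level parameters
ALPHA = [1.0, 4.0, 7.0]   # K - 1 = 3 cutpoints for K = 4 abundance classes
BETA = 2.0                # population-level slope
SIGMA_A, SIGMA_B, RHO = 0.5, 0.3, -0.2  # group s.d.'s and correlation

def draw_bacterium_params():
    """Draw bacterium-specific cutpoints and slope centred on the
    population-level values, using the Cholesky factor of a 2x2
    covariance matrix so intercept and slope offsets are correlated."""
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    a_off = SIGMA_A * z1
    b_off = SIGMA_B * (RHO * z1 + math.sqrt(1.0 - RHO ** 2) * z2)
    return [a + a_off for a in ALPHA], BETA + b_off

alphas_b, beta_b = draw_bacterium_params()  # one simulated bacterium
```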

A note on terminology
Throughout this manuscript, we have termed "absolute" the bacterial abundances in each sample as measured by their corresponding CFU. As previous works often refer to observed counts and observed proportions as proxies for relative bacterial abundances, we aim to avoid confusion by clearly specifying the definitions employed herein. Despite heterogeneity in the microbiome literature [8,23,57], we call relative abundance any measure that is proportional to bacterial percentages within a sample. Recent reports have argued that NGS datasets are limited to this type of information because library sizes are arbitrary [10,58], at least under traditional (equimolar) protocols. To explicitly diverge from this scenario, we refer to the total CFU of a given taxon within a given sample as the taxon's absolute abundance in that sample - a sample-wise notion of absolute abundance that has appeared previously in the literature [31].
However, it is important to note that sample absolute abundances are still relative to the sampled units, representing a measurement of concentration (e.g., CFU/μL). In the case of a well-defined system (e.g., a cell suspension of defined volume), it is in principle only by knowing both the sample and the system sizes (e.g., sample and suspension volumes, respectively) that we could recover the absolute abundances in the system as a whole (e.g., total CFU of S. aureus in the entire cell suspension) - a potentially pernicious extrapolation we did not attempt. This makes knowing the system absolute abundances largely prohibitive in practice, as it is often unclear how to perform such size measurements in a high-throughput fashion.
As an alternative, compositional data analysis techniques have been applied to uncover characteristics of a community that are mathematically equivalent whether one considers the sample proportions or the system absolute abundances, showing promising results [8,48]. During sample collection, however, it is frequently challenging to even define what the system is. For instance, when sampling a patient bed within a hospital, the entire system is hardly definable. The whole bed? The specific parts touched by patients or staff? Anything but the mattress? Notably, the sample absolute abundance still yields insights, albeit limited to the sampling site and immediately surrounding environment, which are easily interpretable.
Although we acknowledge there is room for terminology improvement and standardization, here we merely aim to stress the difference between learning that a sample shows 50% of a given taxon (sample relative abundance) and learning that the same sample has 10⁴ CFU of that same taxon (sample absolute abundance). While the chosen terminology attempts to state the difference between taxa percentages and taxa CFU within a sample, it is clear that the terms "relative" and "absolute", when it comes to taxa abundances in general (whether in the sample or in a system as a whole), require careful interpretation.

Acknowledgements
We thank BiomeHub for the financial support and exciting scientific environment. We also thank the online data community, which has been critical for the improvement of this manuscript since its first version as a preprint on bioRxiv.

Availability of data and materials
All sequence data are deposited in NCBI BioProject PRJNA603167. Data and code to reproduce the analysis are available as a compressed folder (Supplementary material 4).

Competing interests
All authors are currently full-time employees of BiomeHub (SC, Brazil), a research and consulting company specialized in microbiome technologies. BiomeHub funded the study design, analysis, and data submission for publication.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Funding
This study was entirely funded by BiomeHub (SC, Brazil).

Figure 2. Equivolumetric protocol recovers proportionality between input DNA and NGS reads.
Synthetic DNA fragment serially diluted from 0.56 ng/L to 0.00000056 ng/L (A) or from 954,000,000 to 954 DNA copies (B), and the total number of reads obtained by NGS sequencing using the equivolumetric protocol. Total sample reads (library size) from sequencing of serially-diluted samples of a mock microbial community using the equivolumetric protocol demonstrate that the obtained read counts are proportional to total microbial load (C). A similar relationship is observed between taxon-specific counts and abundances (D). The task of estimating CFU from NGS reads is illustrated for both total microbial load (E) and taxon-specific abundances (F). Total microbial load ranged from 0.84 × 10² to 0.84 × 10⁶ CFU, while taxon abundances ranged from 2 × 10² to 2 × 10⁵ CFU. A pseudocount of 1 was added to the read counts to avoid log₁₀(0). The hierarchical CPM accounts for differences across bacteria and takes advantage of partial pooling to estimate taxon-specific abundances (F). The resulting posterior predictive check indicates no major signs of misfit.