As the cost of sequencing continues to decrease, more studies will use shotgun metagenomes to estimate the functional content of fecal microbiomes, and scientists will be faced with practical choices in the pre- and post-sequencing phases. In the current study, we found that read length, e-value threshold, choice of protein database, strategy for paired-end reads, and sequencing depth all affected the ability to detect sequences of known abundance in shotgun metagenomes of human fecal microbiomes. Furthermore, decision-making around these parameters is intertwined, because the parameters affect one another.
Previous studies have observed a relationship between read length and the accuracy of gene detection [11, 17]. Consistent with those observations, we found that overall accuracy increases with read length. More useful is translating this knowledge into guidance on which commonly available sequencing formats are appropriate. SR50 should clearly never be used; SR100 should only be used with a small customized database, provided that quality trimming does not extensively trim the reads. PE100 should only be used for the same purposes as SR100, or more broadly if the sequencing library is size-selected to enable overlapping paired ends, which can then be merged into reads longer than 100bp. When processing longer read formats, reads <100bp should simply be discarded. The Human Microbiome Project’s WGS Read Processing protocol retains all reads >60bp [18]; clearly, this threshold is too low for functional annotation of gene sequences. Given today’s read format choices, the ideal choice would be PE150 with size selection to enable overlapping reads. This can generate merged reads with a median length near 250bp (D. Lemay, unpublished observations).
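To make the merging step concrete, the following is a minimal sketch of joining a forward read to the reverse complement of its mate at an exact overlap. Real read-merging tools also weigh base qualities and tolerate mismatches; this toy version is exact-match only.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[base] for base in reversed(seq))

def merge_pair(r1, r2, min_overlap=10):
    """Merge a forward read with the reverse complement of its mate.

    Returns the merged read, or None if no exact overlap of at least
    min_overlap bases is found (the pair is then 'unmergeable').
    """
    r2rc = revcomp(r2)
    # Try the longest possible overlap first, shrinking toward min_overlap
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        if r1[-olen:] == r2rc[:olen]:
            return r1 + r2rc[olen:]
    return None
```

A 2x100bp pair sequenced from a ~188bp insert would overlap by ~12bp and merge into a single read spanning the full fragment; pairs from longer inserts fall through to `None`.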
We also investigated what to do with legacy data in which most paired-end reads are not mergeable. For one of the largest sets of fecal metagenomes from healthy adults to date [15], we found that only one third of the reads overlapped, and when they did, the mean merged read length was still rather short (just 112bp). In another dataset [16], merging reads was much more successful (~88% overlap, mean 188bp, on PE150 data). Using these two datasets, we compared three read strategies: a “merged” strategy (using only mergeable reads), an “R1” strategy (using only forward reads), and a “congruent” strategy (independently mapping both reads and keeping only those hits in which both reads map to the same protein). As expected, all three strategies are highly correlated, with “merged” and “congruent” having the highest correlation. Therefore, the “congruent” strategy is a reasonable alternative when reads are not mergeable.
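The “congruent” filter reduces to a few lines: given each mate’s best hit, a pair is retained only when both hits agree. The dictionaries below are illustrative placeholders for per-read best-hit tables parsed from the aligner output.

```python
def congruent_hits(r1_hits, r2_hits):
    """Keep a hit only when both mates of a pair map to the same protein.

    r1_hits / r2_hits: dicts of read-pair ID -> best-hit protein accession,
    one per mate. Pairs present in only one dict, or whose mates disagree,
    are discarded.
    """
    return {read_id: prot
            for read_id, prot in r1_hits.items()
            if r2_hits.get(read_id) == prot}
```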
For the design of future experiments, should the sequence library be size-selected to enable overlapping pairs of reads? If functional annotation is a primary focus, then merged reads are highly desirable. Fewer hits were obtained with the “congruent” strategy, suggesting that the sensitivity would be lower if paired end reads cannot be merged. Rare genes, such as antimicrobial resistance genes, would then be more difficult to quantitate.
Knowing that appropriate e-value thresholds depend on the search tool used as well as read length and database size [17], we investigated what thresholds would be appropriate for use with DIAMOND at different read lengths. We found that two different protein sequences had two different optimal thresholds for accurate detection. This supports the opinion of other researchers that it may not be possible to identify a single cut-off for all proteins [7, 17]. However, given that we identified and tested a “worst-case scenario”, our data suggest that it is possible to identify reasonable ranges for mean read lengths. For example, using DIAMOND in “sensitive” mode, a threshold within the range of the default 1e-3 to 1e-10 would be appropriate for read lengths of 100-150bp when a small customized database is used. A threshold within the range of 1e-10 to 1e-25 would be appropriate for read lengths of 200-250bp. For a larger database, like the SEED, we observed higher false-positive rates with increasing read length, suggesting that a threshold stricter than the default is needed for broad databases. Generally, if the research question has low tolerance for false positives, thresholds toward the stricter end of the suggested ranges should be chosen.
Because shotgun metagenomes contain reads of varying length, it may be useful to implement a binning strategy that applies e-value thresholds proportional to read length. First, shorter reads (<100bp) should be removed. Then, if read lengths vary dramatically within a sample, it may be useful to sort each sequence or merged pair of sequences into bins so that an appropriate e-value threshold can be applied to each bin.
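As an illustration, the binning logic might look like the following sketch. The cutoffs are assumptions drawn from the stricter end of the ranges discussed above, not prescribed values, and would be tuned to the database and search tool in use.

```python
def evalue_threshold(read_len):
    """Assign an e-value cutoff by read length.

    Illustrative bins based on the ranges suggested above for DIAMOND in
    'sensitive' mode with a small customized database; the exact values
    are a judgment call.
    """
    if read_len < 100:
        return None          # discard: too short for reliable annotation
    if read_len <= 150:
        return 1e-10         # stricter end of the 1e-3 to 1e-10 range
    return 1e-25             # stricter end of the 1e-10 to 1e-25 range

def bin_reads(read_lengths):
    """Group read IDs by the e-value threshold to be applied to them.

    read_lengths: dict of read ID -> length (bp) of the read or merged pair.
    Reads below the minimum length are dropped entirely.
    """
    bins = {}
    for read_id, length in read_lengths.items():
        cutoff = evalue_threshold(length)
        if cutoff is None:
            continue
        bins.setdefault(cutoff, []).append(read_id)
    return bins
```

Each bin can then be searched (or its hits filtered) with its own cutoff rather than applying one threshold to the whole sample.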
Bengtsson-Palme has suggested that a database specialized for the research question should be used and, if it does not exist, should be constructed from genes of verified function described in the literature [7]. Our data support that opinion. With the goal of detecting beta-galactosidase, the custom database for this enzyme was far superior to the other databases. NCBI and SEED tended to under-estimate the number of beta-galactosidase enzymes in the simulated metagenome, while the use of CAZy over-estimated them. The fact that beta-galactosidases exist in several distinct families in the CAZy database, and that those families in turn include some sequences that are likely not beta-galactosidases, further demonstrates the difficulty of interpreting results for specific enzymes using the CAZy database. Criteria previously suggested for customization of databases include whether the sequences are experimentally verified, the quality of the data, and the functional and/or taxonomic coverage of the data [19]. If analyzing the content of a particular enzyme of interest, the database could be focused on experimentally verified sequences of that enzyme, as we did for beta-galactosidase. By extension, if a particular pathway is the focus of the hypothesis, the protein sequences associated with that pathway should be curated for the database. Hypothesis-specific databases would be faster, more accurate, and highly interpretable.
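As a minimal sketch of database customization, the function below subsets a FASTA string to a set of accessions (e.g. experimentally verified beta-galactosidases). The accession list itself would come from the literature curation described above; the accessions shown in the example are hypothetical.

```python
def filter_fasta(fasta_text, keep_accessions):
    """Subset FASTA records to those whose accession is in keep_accessions.

    The accession is taken as the first whitespace-delimited token of the
    header line (the usual FASTA convention). Sequence lines belonging to
    excluded records are dropped along with their headers.
    """
    records, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            accession = line[1:].split()[0]
            keep = accession in keep_accessions
        if keep:
            records.append(line)
    return "\n".join(records)
```

The resulting file could then be formatted for DIAMOND with its `makedb` command (e.g. `diamond makedb --in verified.faa -d verified`).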
Investigating the effect of sequencing depth (i.e. library size) on the ability to quantitate specific genes or functional categories in human fecal metagenomes, we found that 5 million merged reads or 10 million unmerged reads would be sufficient to quantitate even rare genes. In prior work, Nayfach and Pollard surprisingly found that reducing the sequencing depth of human gut metagenomes by 95% introduced <2.5% variation in gene copy number estimates [20]. However, they also suggest that the effects differ for common versus rare genes. What is rare? Considering that there are, on average, around 500,000 unique microbial genes per individual’s gut microbiome [21], many individual genes will be rare. We found that beta-galactosidase, the enzyme that digests lactose, has about 500 copies per 5 million merged reads in an adult metagenome and ten times as many in an infant metagenome. For beta-lactamase, an antimicrobial resistance gene, there were around 100 copies per 5 million merged reads in a deeply sequenced infant metagenome. The concept of shallow sequencing (roughly 500,000 sequences per metagenome) has been put forward as an intermediate between 16S sequencing and current shotgun metagenomes [22]. With 500,000 unique microbial genes per individual’s gut microbiome [21], shallow sequencing would sample each gene only once, on average, which would not be sufficient to assign presence/absence to most genes with confidence, let alone enable comparative quantitation among samples. Addressing hypotheses about specific genes or functions narrower than, say, “carbohydrate metabolism” will require a more deeply sequenced metagenome. We suggest a minimum of 5 million merged reads of at least 150bp in length.
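The depth arithmetic here is simple proportional scaling, sketched below with the approximate counts reported above:

```python
def expected_copies(copies_per_5m, depth):
    """Scale a count observed per 5 million reads to another sequencing depth,
    assuming detections scale linearly with depth."""
    return copies_per_5m * depth / 5_000_000

# Approximate counts from the text: ~500 beta-galactosidase and
# ~100 beta-lactamase copies per 5 million merged reads.
# At a shallow-sequencing depth of 500,000 reads, the expected counts
# drop tenfold, leaving little power for rarer genes.
bgal_shallow = expected_copies(500, 500_000)   # -> 50.0
blac_shallow = expected_copies(100, 500_000)   # -> 10.0
```

A gene present at one copy per 500,000 unique genes would, at that depth, be expected roughly once per library, which is why presence/absence calls become unreliable.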
Our recommendations are limited to shotgun sequencing of human fecal samples. These samples are high biomass, with nearly all DNA from microbial sources and low risk of substantial contamination. Furthermore, host DNA is easily removed [23, 24]. Others have noted the problem of higher variation due to contamination in low-biomass samples [25], which suggests that recommendations based on human fecal samples may not extend to other sample types. It is possible that the recommendations may not even extend to stool samples of other animals. Zaheer et al. reported that >50 million reads per sample would be needed to characterize the resistome of cattle fecal samples [26]. Nearly all reads (>97%) were uncharacterizable in that study; the authors suggested that this may be due to the presence of feed-associated plants that were not in the database.
The experiments in the current study do not address how to use gene abundances to compare across metagenomes. Nayfach and Pollard review three methods: (1) gene relative abundance, the relative amount of each gene found in a sample; (2) average genomic copy number, the expected number of copies of the gene per cell; and (3) gene absolute abundance, which cannot be estimated from sequence data alone [20]. The choice should probably depend on the application. If the primary goal is to calculate the carbohydrate-active enzyme profile of a fecal metagenome without regard to whether those enzymes are derived from bacteria with a reference genome, or even from non-bacteria such as archaea or yeasts, then gene relative abundance may be expected to yield less bias than average genomic copy number. However, it is important that differential abundance testing is done using a count-adapted method, such as DESeq2 [27] or edgeR [28], which simultaneously account for differences in library sizes and the compositional nature of the data and have demonstrated applicability for metagenome data [29]. A comparison of 14 different methods demonstrated that these methods have the best performance for determining differentially abundant genes in metagenomes [30].
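The first two metrics can be sketched as follows. The average-genomic-copy-number normalization shown here is one simplified formulation of the approach reviewed in [20] (length-normalized gene depth divided by the depth of universal single-copy marker genes); the gene names and numbers are illustrative.

```python
def gene_relative_abundance(gene_counts):
    """Fraction of annotated reads assigned to each gene."""
    total = sum(gene_counts.values())
    return {gene: count / total for gene, count in gene_counts.items()}

def average_genomic_copy_number(gene_counts, gene_lengths_kb, marker_depth):
    """Expected copies of each gene per cell (simplified formulation).

    Length-normalized gene depth (reads per kb of gene) divided by the
    per-cell sequencing depth estimated from universal single-copy
    marker genes.
    """
    return {gene: (gene_counts[gene] / gene_lengths_kb[gene]) / marker_depth
            for gene in gene_counts}
```

Relative abundance needs no reference genomes, which is why it is less biased for genes from poorly characterized taxa; copy number per cell is more interpretable but depends on the marker-gene depth estimate.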
Our study is intended to be agnostic to the pipeline used, except for the use of DIAMOND as a mapping utility. An argument against using point-and-click computational pipelines is the inability to assign appropriate cut-offs for genes of interest [7]. An advantage of creating one’s own pipeline, or modifying one that is open source such as SAMSA2 [31], is the ability for users to define e-value thresholds and use customized databases. SAMSA2, although built for metatranscriptomes, also works to assess gene abundances in metagenomes with the omission of the rRNA removal step, which is unnecessary for metagenomes. Therefore, a version of the SAMSA2 pipeline with that modification was used for the experiments in this paper. For the functional annotation of fecal metagenomes, it could be argued that a broad overview with mappings to the Clusters of Orthologous Genes (COG) [32], KEGG Orthology [33], and/or SEED subsystems [14] databases is a reasonable first step, followed by tests for specific hypotheses with databases customized for those hypotheses.
An alternative to the DIAMOND/BLAST approach would be a mapping strategy that uses hidden Markov models (HMMs). A profile HMM encodes the statistical probability of each amino acid at each position of the protein family used to train the model. This can be useful to identify distant homologs that may not be represented in a database. The Pfam database is a large collection of protein families, with each family represented by a multiple sequence alignment and an HMM [34]. Pfam also groups entries into clans, which are collections of families related by profile HMM similarity. Several tools, such as HMM-GRASPx [35] and MetaCLADE [36], identify protein domains in metagenome and/or metatranscriptome sequences. These methods are likely most useful for annotating sequences when proteins are not well represented in reference databases, as in environmental sequence data [37], because protein domain information is better than no information. However, proteins with very different functions can share the same protein domain, and a single protein can contain multiple domains. For human fecal metagenomes, it is now possible to map 70-90% of reads to genes [38] using the integrated gene catalog (IGC) of the human gut metagenome [39]. Given that gene-coding regions comprise, on average, 87% of prokaryotic genomes, the mappability of reads from human gut microbiomes has approached the maximum achievable [39]. Thus, it is both possible and preferable to identify the complete protein rather than a protein domain to enable interpretation of the results. In theory, a profile HMM can be built using complete sequences from different taxa for a particular protein; whether such a strategy outperforms the DIAMOND/BLAST approach is beyond the scope of the current analysis.
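To illustrate the idea behind a profile, the sketch below builds per-column amino-acid frequencies from a toy ungapped alignment and scores a query against them. Real profile HMMs additionally model insertions, deletions, and background residue frequencies; this is only the match-state intuition.

```python
from math import log

def build_profile(alignment):
    """Per-column amino-acid frequencies from an ungapped alignment.

    A crude stand-in for a profile HMM's match-state emission
    probabilities: each column maps residues to their observed frequency.
    """
    ncols = len(alignment[0])
    profile = []
    for i in range(ncols):
        column = [seq[i] for seq in alignment]
        profile.append({aa: column.count(aa) / len(column)
                        for aa in set(column)})
    return profile

def profile_score(profile, query, pseudocount=0.01):
    """Log-probability of a query under the profile; unseen residues get
    a small pseudocount so the score stays finite."""
    return sum(log(column.get(aa, pseudocount))
               for column, aa in zip(profile, query))
```

A query resembling the training family scores much higher than an unrelated sequence, which is how distant homologs can be detected even when no close match exists in the database.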
Assignment of sequence reads from fecal metagenomes to broad functional categories gives the impression of stability [40, 41], but this impression is likely false when investigating function at the finer scale of individual genes. Publicly available human fecal metagenomes have been reanalyzed for functional content using alternate methods [42] or for meta-analyses across cohorts [38]. Legacy data should also be re-analyzed for specific hypotheses. The results of the current study suggest some guidelines for using legacy data to address hypotheses of interest: create a custom database of the gene or genes of interest, merge paired ends to create longer reads, and either adjust the e-value threshold for the median read length or bin reads by length and apply different e-value thresholds per bin.