General Usage
The EFILTER database contains 11633 species and 34 phyla (see Table B.1, Appendix B). Of the species, 72% have habitat information specified, while 45% fall into at least one pre-defined environment (see Methods). Of the phyla, 100% have habitat information specified, while 97.1% fall into at least one pre-defined environment. The pre-defined environment with the largest number of taxa is ‘soil’, which includes 17.5% of species. The next largest is ‘human’, which includes 8.4% of species. The pre-defined environments with the smallest numbers of taxa are ‘amphibian’, ‘reptile’ and ‘symbiont’, with 0.06%, 0.3% and 0.3% of species respectively (see Appendix B, Table B.2).
EFILTER accepts .txt files (including batch upload) containing lists of either Latin names or taxIDs. TaxIDs are mapped onto Latin names in order to infer habitat associations. Only well-defined taxa (i.e., taxa with entries in either Bergey’s or NCBI/IJSEM, see Methods) are included in the database. EFILTER allows Latin names or taxIDs referencing any taxonomic rank from phylum to species, including lists with mixed ranks. Taxonomic rank is automatically determined based on name or taxID. Each taxon in an input file is queried against the EFILTER database using a defined environmental filter. Environmental filters can be constructed as user-defined keywords, user-defined phrases, or pre-defined environments (see Methods). In addition, users can construct any logical combination of single filters using ‘AND’, ‘OR’ and ‘NOT’ (for example, ‘eye AND human NOT other mammal’). A video tutorial covering filter definition can be found at https://www.youtube.com/watch?v=UU1rPTUZPTE.
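The logical combination of single filters described above can be illustrated with a minimal Python sketch. This is not EFILTER's implementation: the helper names (`term`, `both`, `either`, `negate`), the habitat labels attached to each taxon, and the reading of ‘NOT’ as ‘AND NOT’ are all assumptions made for illustration.

```python
from typing import Callable, Dict, Set

# A filter is modelled as a predicate over a taxon's set of habitat labels.
Filter = Callable[[Set[str]], bool]

def term(keyword: str) -> Filter:
    """Single filter: true when the keyword is one of the habitat labels."""
    return lambda habitats: keyword in habitats

def both(a: Filter, b: Filter) -> Filter:
    """Logical AND of two filters."""
    return lambda habitats: a(habitats) and b(habitats)

def either(a: Filter, b: Filter) -> Filter:
    """Logical OR of two filters."""
    return lambda habitats: a(habitats) or b(habitats)

def negate(a: Filter) -> Filter:
    """Logical NOT of a filter."""
    return lambda habitats: not a(habitats)

# 'eye AND human NOT other mammal', with NOT read as AND NOT (an assumption).
eye_filter = both(both(term("eye"), term("human")), negate(term("other mammal")))

# Hypothetical habitat annotations; these are not EFILTER database records.
habitats: Dict[str, Set[str]] = {
    "Taxon A": {"eye", "human"},
    "Taxon B": {"eye", "human", "other mammal"},
    "Taxon C": {"soil"},
}

passing = sorted(name for name, labels in habitats.items() if eye_filter(labels))
print(passing)
```

In this toy example only "Taxon A" passes: "Taxon B" is excluded by the NOT clause and "Taxon C" has neither required label.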
Upon file upload, EFILTER outputs the number of taxa at each taxonomic rank as well as the number of taxa that were not found in the EFILTER database. Missing taxa include taxa that are not well-defined, taxa that were defined more recently than the most recent EFILTER update, or taxa that are not bacterial (for example, fungal taxa from the output of shotgun sequencing). These taxa are ignored, meaning that there is no need to curate shotgun sequencing data to include only bacteria. EFILTER also automatically reports the pre-defined environmental classes for each unique taxon in an uploaded dataset, starting from the input taxonomic rank up to the taxonomic rank of family.
The primary goal of EFILTER is to identify organisms that are consistent or inconsistent with the sampled environment (i.e., the user-selected filter). Thus, when a taxon list is queried against a chosen filter, EFILTER outputs the number of taxa with at least one habitat in the database that matches (passes) the filter. Taxa without at least one matching habitat are identified as failing the filter. In addition, for each input taxon, EFILTER reports whether the taxon passed or failed at taxonomic ranks higher than the input rank. Finally, EFILTER outputs the names of taxa that passed the filter at taxonomic ranks lower than the input rank. This latter feature can be useful, for example, to make an educated guess as to which species might be present given a list of genera from a 16S rRNA sequencing run.
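The pass/fail reporting described above can be sketched as follows. This is a hedged illustration, not the EFILTER code: the `DATABASE` contents and the `query_genera` function are hypothetical, and the rule that a genus passes when at least one of its species has a matching habitat is an assumption about how ranks are linked.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical records: genus -> species -> habitat labels (not real EFILTER data).
DATABASE: Dict[str, Dict[str, Set[str]]] = {
    "Genus1": {"Genus1 alpha": {"skin", "human"}, "Genus1 beta": {"soil"}},
    "Genus2": {"Genus2 gamma": {"fresh water"}},
}

def query_genera(
    genera: List[str], keyword: str
) -> Tuple[Dict[str, List[str]], List[str], List[str]]:
    """Split genera into passed (with their passing species), failed, and missing."""
    passed: Dict[str, List[str]] = {}
    failed: List[str] = []
    missing: List[str] = []
    for genus in genera:
        species = DATABASE.get(genus)
        if species is None:
            # Not in the database: reported as missing and otherwise ignored.
            missing.append(genus)
            continue
        # Assumption: a genus passes if any member species has a matching habitat.
        hits = sorted(name for name, labels in species.items() if keyword in labels)
        if hits:
            passed[genus] = hits
        else:
            failed.append(genus)
    return passed, failed, missing

passed, failed, missing = query_genera(["Genus1", "Genus2", "Genus3"], "skin")
print(len(passed), "passed;", len(failed), "failed;", len(missing), "not found")
```

The `passed` dictionary doubles as the lower-rank report: for each passing genus it lists the species that matched the filter.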
Samples versus Controls
As a first demonstration of EFILTER performance, we used our own data, which consisted of 16S rRNA sequencing of 375 skin microbiome samples from the foreheads of 50 individuals, along with 100 paired controls (blank swabs collected before and/or after each person was sampled). Figure 1 shows the percentage of genera passing four different filters averaged over the samples and paired controls for each person. For the ‘skin’ filter (A), more organisms passed in the samples relative to the controls for the majority (48/50 or 96%) of people. By contrast, for the ‘built’ filter (C), more organisms passed in the controls relative to the samples for the majority (44/50 or 88%) of people. As expected, for logical OR combinations of filters, the percentages of passing taxa were greater overall (compare, for example, Figures 1A and 1B or Figures 1C and 1D). However, once again, for the ‘human/mammal/body-site’ filter (B), more taxa passed in the samples relative to the controls for the majority (40/50 or 80%) of people, whereas the reverse was true for the ‘built/water’ filter (D) for the majority (30/50 or 60%) of people. In general, one would expect that a skin/human/mammal/body-site filter would perform better on human samples than on non-human (e.g., control) samples, which is what we see. Likewise, assuming that many contaminants come from room air or sequencing preparation steps, one would expect that a built/water filter would perform better on control samples than on human samples, which is again what we see.
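The per-person comparison reported above (for example, 48/50 people for the ‘skin’ filter) amounts to averaging pass percentages within each person and counting how often samples exceed controls. A minimal sketch, assuming simple unweighted means; the values in `pass_pct` and the function name `samples_beat_controls` are hypothetical:

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical per-person pass percentages for one filter (illustrative values only).
pass_pct: Dict[str, Dict[str, List[float]]] = {
    "P01": {"samples": [80.0, 75.0], "controls": [50.0]},
    "P02": {"samples": [60.0, 70.0], "controls": [72.0, 68.0]},
    "P03": {"samples": [90.0], "controls": [40.0, 60.0]},
}

def samples_beat_controls(
    data: Dict[str, Dict[str, List[float]]]
) -> Tuple[List[str], float]:
    """Return the people whose mean sample pass rate exceeds their mean control rate,
    plus that group's size as a percentage of all people."""
    wins = [
        person
        for person, groups in data.items()
        if mean(groups["samples"]) > mean(groups["controls"])
    ]
    return wins, 100.0 * len(wins) / len(data)

wins, pct_people = samples_beat_controls(pass_pct)
print(wins, round(pct_people, 1))
```

For a filter expected to favour controls (such as ‘built’), the same comparison would be run with the inequality reversed.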
Samples from Different Sources
As a second demonstration of EFILTER performance, we compared various EFILTER pre-defined environmental filters across a variety of environmental samples. Datasets were downloaded as lists of genera directly from the MG-RAST website (https://www.mg-rast.org/). Specifically, we considered hot springs (mgp5265, mgp6907, mgp5356, mgp5355, mgp5270, mgp5269, mgp5268, mgp5266), tropical soil (mgp4362), a freshwater lake (mgp19525), oceans (mgp20413), cheese (mgp14606), human nares (mgp385), and horse guts (mgp7746). All studies involved shotgun sequencing. We used the same number of samples from each environment (8, limited by the availability of hot springs samples; in all cases we selected the 8 samples with the lowest MG-RAST ID numbers), and restricted our analysis to the 25 most abundant taxa in each sample. Figure 2 shows confusion matrices for samples versus environmental filters. Focusing on the averaged confusion matrix (right panel), we see strong EFILTER performance, with red/orange (high percentage of passing taxa) concentrated along the diagonal and blue/green (low percentage of passing taxa) off-diagonal. Overall, across the seven paired environments and samples, an average of 72% of taxa passed their expected filters. Passing rates were highest (88.5%) for the anterior nares, and lowest (62.5%) for the lake.
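The sample-versus-filter matrix described above can be sketched as below. This is an illustration under stated assumptions rather than the authors' procedure: the function names, the toy abundance and habitat tables, and the choice to drop taxa missing from the habitat table after ranking by abundance are all hypothetical.

```python
from typing import Dict, List, Set

def top_taxa(abundance: Dict[str, float], n: int) -> List[str]:
    """Return the n most abundant taxa, most abundant first."""
    return sorted(abundance, key=lambda taxon: abundance[taxon], reverse=True)[:n]

def pass_matrix(
    samples: Dict[str, Dict[str, float]],
    habitats: Dict[str, Set[str]],
    filters: List[str],
    n: int = 25,
) -> Dict[str, Dict[str, float]]:
    """Percentage of each sample's top-n taxa that pass each single-keyword filter."""
    matrix: Dict[str, Dict[str, float]] = {}
    for sample, abundance in samples.items():
        # Assumption: taxa absent from the habitat table are dropped after ranking.
        taxa = [t for t in top_taxa(abundance, n) if t in habitats]
        matrix[sample] = {
            f: 100.0 * sum(f in habitats[t] for t in taxa) / len(taxa) if taxa else 0.0
            for f in filters
        }
    return matrix

# Hypothetical toy data (not MG-RAST output); n=2 stands in for the top 25.
habitats = {"T1": {"soil"}, "T2": {"soil", "fresh water"}, "T3": {"fresh water"}}
samples = {
    "soil sample": {"T1": 50.0, "T2": 30.0, "T3": 20.0},
    "lake sample": {"T3": 60.0, "T2": 30.0, "T1": 10.0},
}
matrix = pass_matrix(samples, habitats, ["soil", "fresh water"], n=2)
print(matrix)
```

In this toy case each sample scores 100% on its matched filter and 50% on the other, which is the diagonal-dominant pattern the averaged matrix in Figure 2 is described as showing.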
Filter fidelity to sample was quite high. For five out of seven samples (hot springs, soil, oceans, cheese and anterior nares), the expected filter (extreme, soil, salt water, dairy, human) performed best. Even when this was not the case, the expected filter still performed well. For the lake study, for instance, both the salt water and soil filters outperformed the fresh water filter (note that all three are in the ‘environment’ class, see Methods), though only marginally (63.5% and 62.8% of taxa passing for the salt water and soil filters, relative to 62.5% for the fresh water filter). Likewise, for the horse gut study, the human filter outperformed the other mammal filter (note that both are in the ‘animal’ class), although once again, the other mammal filter still performed well (79.4% passing for the human filter relative to 75.9% for the other mammal filter). Sample fidelity to filter was not as good. For three out of seven filters (extreme, soil and human), the highest percentages of passing taxa were identified in matched samples (hot springs, soil, and anterior nares respectively). For the remaining four filters, however, the highest percentages of passing taxa were identified in non-matched samples (the fresh and salt water filters identified the highest percentage of passing taxa in the soil sample, while the dairy and other mammal filters identified the highest percentage of passing taxa in the anterior nares).
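The two notions of fidelity used above correspond to row-wise and column-wise maxima of the sample-by-filter matrix. The sketch below makes that distinction explicit. The function names and the soil row are hypothetical; the lake row reuses the 62.8% and 62.5% figures quoted in the text, restricted to two filters for brevity.

```python
from typing import Dict, List

# sample -> filter -> % passing. Soil row is hypothetical; lake row echoes the text.
matrix: Dict[str, Dict[str, float]] = {
    "soil": {"soil": 70.0, "fresh water": 66.0},
    "lake": {"soil": 62.8, "fresh water": 62.5},
}
expected: Dict[str, str] = {"soil": "soil", "lake": "fresh water"}

def filter_fidelity(m: Dict[str, Dict[str, float]], exp: Dict[str, str]) -> List[str]:
    """Samples for which the expected filter has the highest pass rate (row maximum)."""
    return [s for s, row in m.items() if max(row, key=row.get) == exp[s]]

def sample_fidelity(m: Dict[str, Dict[str, float]], exp: Dict[str, str]) -> List[str]:
    """Filters whose highest pass rate occurs in the matched sample (column maximum)."""
    hits = []
    for sample, filt in exp.items():
        best_sample = max(m, key=lambda name: m[name][filt])
        if best_sample == sample:
            hits.append(filt)
    return hits

print(filter_fidelity(matrix, expected))
print(sample_fidelity(matrix, expected))
```

Here the lake sample fails filter fidelity because the soil filter narrowly beats the fresh water filter in its row, and the fresh water filter fails sample fidelity because its column maximum falls in the soil sample, mirroring the pattern reported above.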
Samples from Different Body Sites
As a third demonstration of EFILTER performance, we considered a shotgun sequencing dataset from the Human Microbiome Project (HMP), including 690 samples from 15 body-sites. Importantly, in this dataset all samples fall under a single filter class (‘body-site’, see Methods). As in the previous section, we only considered the 25 most abundant taxa in each sample. Figure 3 shows confusion matrices for species- and genus-level data for each sample individually (A), and as sample averages over each body region (B; notice that, unlike Figure 2, there are different numbers of samples contributing to each body region).
EFILTER performance on the HMP dataset was even higher than on the environmental dataset, with an average of 40.9% of species and 76.0% of genera passing the appropriately matched filter. At the same time, however, filter fidelity to sample was lower (compare the right panel in Figure 3B to Figure 2). That is, at the rank of both genus and species, the expected filter (gut/digestive system, oral) performed best for only two (St, To+Kg+Bm+Sa+Sb+Sp+Td) out of five body regions. For the remaining three regions, the gut/digestive system and/or oral filters performed better. The gut/digestive system filter likely performed well because it has the largest number of taxa (see Appendix B). It is less clear why the oral filter outperformed the other filters, since it has fewer taxa than the ear/nose/throat filter at the ranks of both genus and species and has approximately the same number of taxa as the skin filter at the rank of genus.
HMP sample fidelity to filter, on the other hand, was strong, at least at the rank of species. In particular, four out of five filters (skin, oral, gut/digestive system, female reproductive system) identified the highest percentage of passing taxa in their matched samples (Rc, To+Kg+Bm+Sa+Sb+Sp+Td, St, Va). Sample fidelity to filter was slightly weaker at the rank of genus, where only two out of five filters (oral, gut/digestive system) identified the highest percentage of passing taxa in their matched samples. This fidelity was similar to the fidelity observed in our environmental dataset (see Figure 2). One particularly striking feature in Figure 3 is the poor performance of all but the gut/digestive system filter on stool samples. Whereas >56% of genera from every other body region passed every body-site filter, only 17-35% of stool genera passed non-gut/digestive system filters. At the same time, however, 95% of stool genera passed the gut/digestive system filter. This suggests that stool taxa are unique to the gut environment, whereas taxa from other body sites exhibit broad body distributions.
Samples from Different Bioinformatics Pipelines
As a final demonstration of EFILTER performance, we compared EFILTER output for different bioinformatics pipelines. A common problem with microbiome studies is that different pipelines can give different taxon lists, including substitutions among distantly related organisms. EFILTER can be used to gauge the performance of different bioinformatics methods and to identify potentially spurious taxa introduced through taxonomic assignment steps. To illustrate this, we used a shotgun sequencing dataset of the human skin microbiome [28], processed using both Kraken [29] and MetaPhlAn [30], and thresholded at various read percentages. Results are shown in Figure 4, which also serves as an example of the EFILTER graphical output.
With larger thresholds, the pass:fail ratio increases for both bioinformatics pipelines. In other words, a larger fraction of the rare tail fails, regardless of the bioinformatics method used. This should not be surprising. Everything from contaminants and transient taxa to inherently rare/understudied organisms and bioinformatics errors is expected to contribute to the rare tail. In keeping with the consensus that Kraken has a tendency to over-classify [10, 31], the pass:fail ratio for Kraken is significantly lower than it is for MetaPhlAn. This is particularly true when there is no threshold, with Kraken giving pass rates of 42% and 35% for genera and species respectively, relative to MetaPhlAn’s 67% and 61% at the same ranks. With a 1% threshold, both pipelines give similar results for species (72% passing), while Kraken actually shows a higher pass:fail ratio for genera (92% versus 88%). Surprisingly, though, despite Kraken’s lower pass:fail ratio for most scenarios, MetaPhlAn usually identifies a greater absolute number of passing taxa. Thus, while it is likely that some of the failing taxa identified by Kraken (and MetaPhlAn) are truly present on skin, based on overall performance as judged by environmental consistency of assigned taxa, MetaPhlAn appears to be the better pipeline for this dataset.
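The effect of thresholding on pass and fail counts can be illustrated with a short sketch. It assumes that a taxon is kept when its share of all assigned reads is at least the threshold percentage; the exact thresholding rule is not specified above, and the read counts, pass labels and function name below are hypothetical.

```python
from typing import Dict, Tuple

def pass_fail_counts(
    read_counts: Dict[str, int],
    passes: Dict[str, bool],
    threshold_pct: float = 0.0,
) -> Tuple[int, int]:
    """Count passing and failing taxa among those at or above a read-percentage threshold."""
    total = sum(read_counts.values())
    # Assumption: the threshold is applied to each taxon's share of all assigned reads.
    kept = [t for t, count in read_counts.items() if 100.0 * count / total >= threshold_pct]
    n_pass = sum(1 for t in kept if passes[t])
    return n_pass, len(kept) - n_pass

# Hypothetical taxa: two abundant taxa that pass, two rare taxa that fail.
reads = {"A": 900, "B": 80, "C": 15, "D": 5}
passes = {"A": True, "B": True, "C": False, "D": False}

for threshold in (0.0, 1.0, 5.0):
    n_pass, n_fail = pass_fail_counts(reads, passes, threshold)
    print(threshold, n_pass, n_fail, round(100.0 * n_pass / (n_pass + n_fail), 1))
```

Because the failing taxa in this toy example sit in the rare tail, raising the threshold removes them first and the pass percentage rises, which is the qualitative trend described for both pipelines.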