Efilter: A Tool for Identifying Unexpected and Erroneous TaxIDs in Sequencing Data

Background: Multiple types of error can enter metagenomics sample analysis, including contamination during sample collection, library preparation and sequencing, as well as incorrect taxonomic assignment by bioinformatics packages. Often, such errors either go unidentified or are removed ad hoc, based on user knowledge of microbial ecology. However, because different researchers are more or less familiar with the ecologies of the various organisms in their systems, filtering is applied non-uniformly at best, with differences in the degree of filtering between studies and between taxonomic groups within a single study. Results: In this paper, we present EFILTER, a tool that capitalizes on decades of research in microbial ecology to identify suspicious or spurious taxa in metagenomics samples based on habitat information. Conclusions: EFILTER allows all microbiome researchers, regardless of background, to examine taxon lists for unusual entries, and to do so in a manner that is systematic and without bias.


Background
The ability to extract, clone and analyze DNA directly from environmental samples has revolutionized disciplines ranging from biomedical science to environmental microbiology [1][2][3]. Compared to historic, culture-based approaches, sequencing methods offer unprecedented advancements in terms of cost, effort and speed. They also improve detection of full community membership by exposing unculturable organisms [4]. However, sequencing studies are not without their own set of difficulties. One of the greatest challenges lies in interpreting sequencing results, which can be complicated by issues of contamination [5][6][7], sequencing error [8,9], and bioinformatics misidentification [10]. Although improved laboratory and bioinformatics techniques can help to alleviate these difficulties, increased focus on detecting rare taxa, along with more specific classification down to species and even strains, means that interpretation of sequencing data will remain a significant challenge for the foreseeable future.
Contamination is perhaps the largest source of confusion in sequencing data, and the most difficult to address. Whereas culture-based methods only detect organisms that can be reliably and repeatedly grown from an environmental sample, sequencing techniques identify even small pieces of non-viable DNA. While this is one of the strengths of the sequencing approach, and the main reason that it can uncover unculturable organisms, this is also why sequencing methods are far more susceptible to issues with contamination [5][6][7]. Unfortunately, contamination can occur at any step along the processing pipeline, ranging from issues during sample collection and storage [11], through reagent contamination [12][13][14][15], to room source contamination and contaminating bacteria from the mouth and skin of the researcher during sample preparation [16,17]. While it is relatively straightforward to detect contaminants in samples that should contain a single organism, for example microbial DNA in the cow genome [5], or contamination of pure bacterial cultures with other bacterial taxa [18], identifying contamination in complex, mixed microbial communities presents a more difficult, and still open, challenge.
Another major issue with sequencing data is imperfect taxonomic assignment. This challenge is also not easy to overcome, since there is an inherent trade-off between finding all taxa within a sample, and confidence/accuracy in taxon prediction [10]. Thus, bioinformatics methods that are highly sensitive are also more likely to report organisms not actually present. Meanwhile, more conservative classification methods can reduce the number of false positives, but do so at a cost of increasing false negatives.
Misclassification can be more or less problematic, depending on the microbial community involved, the type of sequencing method employed [19][20][21], the bioinformatics toolkit used [10,22,23], the reference databases available [24], and the quality (e.g., sequencing platform statistics, read depth, and read length) of the sequencing data [25][26][27]. As with contamination, even in simple systems with appropriately chosen sequencing and bioinformatics methods, classification errors can have a significant impact on conclusions, for example by massively inflating diversity or the relative importance of rare organisms.
One method for identifying potentially problematic taxIDs in sequencing data is to rely on knowledge of microbial ecology. Finding a common plant pathogen in a human gut microbiome dataset, for example, might raise a red flag. Whether consciously or not, researchers use this approach to clean their data, for instance by deciding whether or not to trust all of the output taxa from a particular experiment [11]. Unfortunately, use of such information requires extensive knowledge of microbial behavior and characteristics, knowledge that is lacking among many, if not most, researchers performing metagenomics analysis. Indeed, even for researchers with strong backgrounds in microbial ecology, the vast diversity of microbes from different environments means that it is impossible for any single researcher to be familiar with all common microbial habitats. Consequently, even with extensive knowledge of the ecologies of individual bacterial taxa, data verification based on microbial ecology is likely to be biased and ad hoc. Nevertheless, if such verification can be standardized and automated, it remains a promising method for addressing the challenges of spurious taxIDs in microbiome datasets.
In this paper, we present EFILTER (https://efilter.shinyapps.io/EFilter-app/), a computational tool that performs Environmental FILTERing by leveraging known relationships between bacterial taxa and specific habitats in order to identify taxIDs that are surprising or unlikely given the source of a metagenomics sample. EFILTER incorporates all habitat information available in Bergey's Manual of Systematic Bacteriology and National Center for Biotechnology Information (NCBI) records from the International Journal of Systematic and Evolutionary Microbiology (IJSEM). As such, EFILTER allows users to assess the likelihood of taxIDs being in their datasets based on known microbial ecology. Further, it does so without requiring an extensive background in microbial ecology, and also without introducing bias that might otherwise arise based on user experience and knowledge of specific taxonomic groups. To demonstrate the strength and performance of EFILTER, we test our tool on four datasets, which include both 16S rRNA and shotgun sequencing methods, analyzed with different bioinformatics pipelines and studied at different taxonomic ranks. Our analysis shows that EFILTER can reliably provide lists of unexpected/spurious organisms in metagenomics samples, offering a new, ecologically motivated tool for validation of microbiome datasets.

The EFilter Database
We constructed a database of microbe-environment associations using information from Bergey's Manual of Systematic Bacteriology and the National Center for Biotechnology Information (NCBI) Nucleotide database. For Bergey's, we converted PDF files of Volumes IIB, IIC, III, IV, and V as well as additional chapters on Aquificae, Chlorobi, Chloroflexi, Chrysiogenetes, Crenarchaeota, Cyanobacteria, Deferribacteres, Deinococcus-Thermus, Euryarchaeota, Nitrospirae, Thermodesulfobacteria, Thermomicrobia, and Thermotogae to .txt files using the pdfminer functions in Python. The text was then separated into segments for each taxonomic group. Text describing genera and species was mined for environmental information. We did not use text describing higher taxonomic groups, since this information tends to be too broad to be useful. To identify sentences containing environmental information within genus and species descriptions, we used the keywords in Table A.1 (see Appendix A). For the NCBI Nucleotide database, we only used entries with IJSEM[Journal] in the journal field. Our assumption was that all of these entries represent newly defined taxa. Although taxa can be defined in other journals, the International Journal of Systematic and Evolutionary Microbiology (IJSEM) represents one of the largest publishers of new taxon descriptions. Most other journals do not specialize in taxonomy, and thus may contain other types of studies where taxonomic specification is not precise/accurate. Our goal was to avoid including habitat information from inaccurate sources, since this is the very problem that EFILTER seeks to correct. Habitat information was taken from the 'isolation source' field of the NCBI entries. NCBI/IJSEM entries were only used for taxa that did not exist in Bergey's (i.e., taxa that were defined since publication of the most recent Bergey's volume for that taxonomic group).
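The sentence-mining step described above can be illustrated with a minimal sketch. The function name `habitat_sentences`, the simple punctuation-based sentence splitter, and the two example keywords are assumptions made for illustration; the actual keyword list is the one in Table A.1, and the original extraction code is not reproduced here.

```python
import re

def habitat_sentences(text, keywords):
    """Return the sentences of `text` that contain any keyword as a whole word.

    Hypothetical helper: sentence splitting on ., ! or ? is a simplification
    of whatever segmentation was actually used.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in sentences if pattern.search(s)]

# Toy description and toy keywords (not taken from Table A.1).
description = (
    "Cells are rod shaped. Isolated from marine sediment. "
    "Habitat: soil and freshwater."
)
environment_sentences = habitat_sentences(description, ["isolated", "habitat"])
```

Only the retained sentences would then be stored as the habitat description for the corresponding genus or species.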
Habitat descriptions from Bergey's Manual of Systematic Bacteriology and NCBI were stored in tab-delimited .txt files (see Appendix B), along with relevant taxonomic information, forming the basis of our environmental database. For species, all searches are performed on species-level descriptions harvested from either Bergey's or NCBI/IJSEM. For genera, searches are performed on genus-level descriptions from Bergey's, as well as all species-level descriptions for species within the genus. For higher taxonomic ranks, searches are performed on all genus-level and species-level descriptions within the particular taxonomic group.
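The rank-dependent search scope described above can be sketched as follows. The `Record` structure, the `searchable_text` function and the taxon names are hypothetical and serve only to illustrate which descriptions are pooled at each rank; they do not reflect the actual layout of the EFILTER database files.

```python
from dataclasses import dataclass

@dataclass
class Record:
    rank: str      # "genus" or "species" (only these ranks carry descriptions)
    name: str
    lineage: dict  # rank -> name, including the record's own genus
    habitat: str   # mined habitat description

def searchable_text(records, rank, name):
    """Collect the habitat descriptions searched for a taxon at a given rank."""
    if rank == "species":
        # Species: only that species' own description.
        return [r.habitat for r in records
                if r.rank == "species" and r.name == name]
    # Genus or higher: every genus- and species-level description in the group.
    return [r.habitat for r in records if r.lineage.get(rank) == name]

# Toy records with made-up taxa.
records = [
    Record("genus", "GenA", {"family": "Fam1", "genus": "GenA"}, "soil"),
    Record("species", "GenA alpha", {"family": "Fam1", "genus": "GenA"}, "human skin"),
    Record("species", "GenB beta", {"family": "Fam1", "genus": "GenB"}, "sea water"),
]
```

With these toy records, a query for the genus "GenA" pools the genus description and that of "GenA alpha", while a query for the family "Fam1" pools all three.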

User-Defined Search Terms
The entire environmental database can be queried for keywords. In order to minimize search time, EFILTER contains a look-up table linking each unique keyword from the database (see above) to the species/genera in whose descriptions that keyword appears. That is, we built a table of keywords wherein each keyword (column) maps to many rows, with each row being a specific species/genus in the database. All keywords in our look-up table are full, stand-alone words. However, in many cases, users may be interested in related words (e.g., 'dog' and 'dogs') as well. For this reason, whenever a keyword is typed into EFILTER, EFILTER automatically generates a list of all larger, stand-alone words (e.g., 'dogs', 'dogbite') containing the inputted keyword. This enables users to select the precise keyword combinations that they want to include. Thus, for the 'dog' example, the user may want to include 'dogs', 'dog-bite', and 'dogbite', but not 'peptidoglycan', 'endogenous', etc. Finally, the environmental database can be queried for phrases (e.g., 'dogs and cats'). In this case, however, there is no look-up table, and the phrase is searched for directly through the database. Consequently, searches that involve querying for phrases are somewhat slower.
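A minimal sketch of the look-up table, the keyword expansion and the slower phrase search is given below. All function names, the tokenization rule and the toy descriptions are assumptions introduced for illustration and are not the EFILTER implementation.

```python
import re
from collections import defaultdict

def build_index(descriptions):
    """Map each stand-alone word to the set of taxa whose description uses it."""
    index = defaultdict(set)
    for taxon, text in descriptions.items():
        # Assumed tokenization: lowercase words, hyphens kept inside words.
        for word in re.findall(r"[a-z][a-z\-]*", text.lower()):
            index[word].add(taxon)
    return index

def candidate_words(index, keyword):
    """List every indexed word containing the keyword, for the user to choose from."""
    return sorted(w for w in index if keyword in w)

def taxa_for(index, chosen_words):
    """Union of taxa over the words the user decided to keep."""
    hits = set()
    for word in chosen_words:
        hits |= index.get(word, set())
    return hits

def phrase_search(descriptions, phrase):
    """Phrases bypass the index and scan every description directly."""
    return {t for t, text in descriptions.items() if phrase.lower() in text.lower()}

# Toy descriptions for made-up taxa A, B and C.
descriptions = {
    "A": "Isolated from dogs and cats",
    "B": "Cell wall contains peptidoglycan",
    "C": "Found in dog-bite wounds",
}
index = build_index(descriptions)
```

Here `candidate_words(index, "dog")` offers 'dog-bite', 'dogs' and 'peptidoglycan'; the user would keep the first two and discard the third before calling `taxa_for`.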

Pre-defined Environments
Although possible, it may be hard and time-consuming for individual users to think of all potential keywords associated with a particular microbial environment. For this reason, we pre-defined keyword combinations for 46 common habitats falling within six broad classes (animal, body sites, food, plant, environment and specialized). To generate keyword combinations, we manually curated all keywords appearing in ≥5 species/genus descriptions and then went through these keywords individually to decide whether they or any of their word derivatives fell into our common environmental categories. We believe that our pre-defined environment feature will be the most useful for the majority of users. Lists of keyword combinations used for each pre-defined environment (ranging from 6 words for 'sand' to 199 words + derivatives for 'other mammals') can be found in Appendix C.

General Usage
The EFILTER database contains 11633 species and 34 phyla (see Table B.1, Appendix B). 72% of species have habitat information specified, while 45% fall into at least one pre-defined environment (see Methods). 100% of phyla have habitat information specified, while 97.1% fall into at least one pre-defined environment. The pre-defined environment with the largest number of taxa is 'soil', which includes 17.5% of species. The next largest is 'human', which includes 8.4% of species. The pre-defined environments with the fewest taxa are 'amphibian', 'reptile' and 'symbiont', with 0.06%, 0.3% and 0.3% of species respectively (see Appendix B, Table B.2). EFILTER inputs .txt files (including batch upload) in the form of lists of either Latin names or taxIDs.
Inputs in the form of taxIDs are mapped onto Latin names in order to infer habitat associations. Only well-defined taxa (i.e., taxa with entries in either Bergey's or NCBI/IJSEM, see Methods) are included in the database. EFILTER allows Latin names or taxIDs referencing any taxonomic rank from phylum to species, including lists with mixed ranks. Taxonomic rank is automatically determined based on name or taxID. Each taxon in an input file is queried against the EFILTER database using a defined environmental filter. Environmental filters can be constructed as user-defined keywords, user-defined phrases, or pre-defined environments (see Methods). In addition, users can construct any logical combination ('AND', 'OR', 'NOT', for example 'eye AND human NOT other mammal') of single filters. A video tutorial covering filter definition can be found at https://www.youtube.com/watch?v=UU1rPTUZPTE.
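Logical combinations of filters can be thought of as set operations on the taxa matched by each single filter. The sketch below assumes strict left-to-right evaluation, which the text does not specify; the function `apply_filters` and the toy taxon sets are hypothetical.

```python
def apply_filters(filters, first, *steps):
    """Combine single filters left to right.

    `filters` maps a filter name to the set of taxa it matches.
    Each step is an (operator, filter_name) pair with operator in
    {"AND", "OR", "NOT"}; "NOT" removes the named filter's taxa.
    """
    result = set(filters[first])
    for op, name in steps:
        other = filters[name]
        if op == "AND":
            result &= other
        elif op == "OR":
            result |= other
        elif op == "NOT":
            result -= other
        else:
            raise ValueError(f"unknown operator: {op}")
    return result

# Toy filters over made-up taxa t1..t4.
filters = {
    "eye": {"t1", "t2", "t3"},
    "human": {"t2", "t3", "t4"},
    "other mammal": {"t3"},
}
# 'eye AND human NOT other mammal'
selected = apply_filters(filters, "eye", ("AND", "human"), ("NOT", "other mammal"))
```

In this toy case only "t2" remains: it matches both 'eye' and 'human' and is not matched by 'other mammal'.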
Upon file upload, EFILTER outputs the number of taxa at each taxonomic rank as well as the number of taxa that were not found in the EFILTER database. Missing taxa include taxa that are not well-defined, taxa that were defined more recently than the most recent EFILTER update, or taxa that are not bacterial (for example fungal taxa from the output of shotgun sequencing). These taxa are ignored, meaning that there is no need to curate shotgun sequencing data to include strictly bacteria. EFILTER also automatically reports the pre-defined environmental classes for each unique taxon in an uploaded dataset, starting from the input taxonomic rank up to the taxonomic rank of family.
The primary goal of EFILTER is to identify organisms that are consistent/inconsistent with the sampled environment (i.e., the user-selected filter). Thus, when a taxon list is queried against a chosen filter, EFILTER outputs the number of taxa with at least one habitat in the database that matches (passes) the filter. Taxa without at least one matching habitat are identified as failing the filter. In addition, for each input taxon, EFILTER reports whether the taxon passed or failed at taxonomic ranks higher than the inputted rank. Finally, EFILTER outputs the names of taxa that passed the filter at taxonomic ranks lower than the inputted rank. This latter feature can be useful, for example, to make an educated guess as to which species might be present given a list of genera from a 16S rRNA sequencing run.
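The pass/fail logic described above, including the handling of taxa that are absent from the database, can be sketched as follows. The function `classify`, the representation of habitats as sets of keywords, and the taxon names are illustrative assumptions rather than the EFILTER implementation.

```python
def classify(taxa, habitat_words, filter_words):
    """Split an input taxon list into passing, failing and missing taxa.

    `habitat_words` maps each database taxon to the set of habitat keywords
    found in its descriptions; a taxon passes if it shares at least one
    keyword with the filter. Taxa absent from the database are reported
    separately and otherwise ignored.
    """
    passed, failed, missing = [], [], []
    for taxon in taxa:
        words = habitat_words.get(taxon)
        if words is None:
            missing.append(taxon)
        elif words & filter_words:
            passed.append(taxon)
        else:
            failed.append(taxon)
    return passed, failed, missing

# Toy database with made-up taxa and habitats.
habitat_words = {
    "GenA": {"skin", "human"},
    "GenB": {"hot", "spring"},
}
passed, failed, missing = classify(["GenA", "GenB", "GenC"], habitat_words, {"skin"})
```

In this toy example "GenA" passes the 'skin' filter, "GenB" fails it, and "GenC" is counted as not found.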

Samples versus Controls
As a first demonstration of EFILTER performance, we used our own data, which consisted of 16S rRNA sequencing of 375 skin microbiome samples from the foreheads of 50 individuals, along with 100 paired controls (blank swabs collected before and/or after each person was sampled). Figure 1 shows the percentage of genera passing four different filters averaged over the samples and paired controls for each person. For the 'skin' filter (A), more organisms passed in the samples relative to the controls for the majority (48/50 or 96%) of people. By contrast, for the 'built' filter (C), more organisms passed in the controls relative to the samples for the majority (44/50 or 88%) of people. As expected, for logical OR combinations of filters, the percentages of passing taxa were greater overall (compare, for example, Figures 1A and 1B or Figures 1C and 1D). However, once again, for the 'human/mammal/body-site' filter (B), more taxa passed in the samples relative to the controls for the majority (40/50 or 80%) of people, whereas the reverse was true for the 'built/water' filter (D) for the majority (30/50 or 60%) of people. In general, one would expect that a skin/human/mammal/body-site filter would perform better on human samples than on non-human (e.g., control) samples, which is what we see. Likewise, assuming that many contaminants come from room air or sequencing preparation steps, one would assume that a built/water filter would perform better on control samples than on human samples, which is again what we see.

Samples from Different Sources
As a second demonstration of EFILTER performance, we compared various EFILTER pre-defined environmental filters across a variety of environmental samples. Datasets were downloaded as lists of genera directly from the MG-RAST website (https://www.mg-rast.org/). Specifically, we considered hot springs (mgp5265, mgp6907, mgp5356, mgp5355, mgp5270, mgp5269, mgp5268, mgp5266), tropical soil (mgp4362), a freshwater lake (mgp19525), oceans (mgp20413), cheese (mgp14606), human nares (mgp385), and horse guts (mgp7746). All studies involved shotgun sequencing. We used the same number of samples from each environment (8, limited by the availability of hot springs samples; in all cases we selected the 8 samples with the lowest MG-RAST ID numbers), and restricted our analysis to the 25 most abundant taxa in each sample. Figure 2 shows confusion matrices for samples versus environmental filters. Focusing on the averaged confusion matrix (right panel), we see strong EFILTER performance, with red/orange (high percentage of passing taxa) concentrated along the diagonal and blue/green (low percentage of passing taxa) off-diagonal. Overall, across the seven paired environments and samples, an average of 72% of taxa passed their expected filters. Passing rates were highest (88.5%) for the anterior nares, and lowest (62.5%) for the lake.
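The construction of a sample-by-filter matrix of passing percentages can be sketched as follows. The function names, the toy abundances and the choice of denominator (all retained taxa, rather than only those found in the database) are assumptions made for illustration.

```python
def top_n(abundances, n=25):
    """Return the n most abundant taxa of one sample, most abundant first."""
    ranked = sorted(abundances.items(), key=lambda kv: kv[1], reverse=True)
    return [taxon for taxon, _ in ranked[:n]]

def pass_percentage(taxa, filter_taxa):
    """Percentage of the listed taxa matched by a filter (assumed denominator: all listed taxa)."""
    if not taxa:
        return 0.0
    return 100.0 * sum(t in filter_taxa for t in taxa) / len(taxa)

def pass_matrix(samples, filters, n=25):
    """Nested dict: sample name -> filter name -> percentage of top-n taxa passing."""
    return {
        sample: {
            name: pass_percentage(top_n(abund, n), matched)
            for name, matched in filters.items()
        }
        for sample, abund in samples.items()
    }

# Toy data: made-up taxa g1..g4, two samples, two filters, top 2 taxa per sample.
samples = {
    "soil_sample": {"g1": 50, "g2": 30, "g3": 5},
    "ocean_sample": {"g3": 60, "g4": 25, "g1": 1},
}
filters = {"soil": {"g1", "g2"}, "salt water": {"g3"}}
matrix = pass_matrix(samples, filters, n=2)
```

A matrix of this kind, with matched sample/filter pairs on the diagonal, is what a confusion-matrix heat map would display.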
Filter fidelity to sample was quite high. For five out of seven samples (hot springs, soil, oceans, cheese and anterior nares), the expected filter (extreme, soil, salt water, dairy, human) performed best. Even when this was not the case, the expected filter still performed well. For the lake study, for instance, both the salt water and soil filters outperformed the fresh water filter (note that all three are in the 'environment' class, see Methods), though only marginally (63.5% and 62.8% of taxa passing for the salt water and soil filters, relative to 62.5% for the fresh water filter). Likewise, for the horse gut study, the human filter outperformed the other mammal filter (note that both are in the 'animal' class), although once again, the other mammal filter still performed well (79.4% passing for the human filter relative to 75.9% for the other mammal filter). Sample fidelity to filter was not as good. For three out of seven filters (extreme, soil and human), the highest percentages of passing taxa were identified in matched samples (hot springs, soil, and anterior nares respectively). For the remaining four filters, however, the highest percentages of passing taxa were identified in non-matched samples (the fresh and salt water filters identified the highest percentage of passing taxa in the soil sample, while the dairy and other mammal filters identified the highest percentage of passing taxa in the anterior nares).

Samples from Different Body Sites
As a third demonstration of EFILTER performance, we considered a shotgun sequencing dataset from the Human Microbiome Project (HMP), including 690 samples from 15 body sites. Importantly, in this dataset all samples fall under a single filter class ('body-site', see Methods). As in the previous section, we only considered the 25 most abundant taxa in each sample. Figure 3 shows confusion matrices for species- and genus-level data for each sample individually (A), and as sample averages over each body region (B; notice that, unlike Figure 2, there are different numbers of samples contributing to each body region).
EFILTER performance on the HMP dataset was even higher than on the environmental dataset, with an average of 40.9% of species and 76.0% of genera passing the appropriately matched filter. At the same time, however, filter fidelity to sample was lower (compare the right panel in Figure 3B to Figure 2). That is, at the rank of both genus and species, the expected filter (gut/digestive system, oral) performed best for only two (St, To+Kg+Bm+Sa+Sb+Sp+Td) out of five body regions. For the remaining three regions, the gut/digestive system and/or oral filters performed better. The gut/digestive system filter likely performed well because it has the largest number of taxa (see Appendix B). It is less clear why the oral filter outperformed the other filters, since it has fewer taxa than the ear/nose/throat filter at the ranks of both genus and species and has approximately the same number of taxa as the skin filter at the rank of genus.
HMP sample fidelity to filter, on the other hand, was strong, at least at the rank of species. In particular, four out of five filters (skin, oral, gut/digestive system, female reproductive system) identified the highest percentage of passing taxa in their matched samples (Rc, To+Kg+Bm+Sa+Sb+Sp+Td, St, Va). Sample fidelity to filter was slightly weaker at the rank of genus, where only two out of five filters (oral, gut/digestive system) identified the highest percentage of passing taxa in their matched samples. This fidelity was similar to the fidelity observed in our environmental dataset (see Figure 2). One particularly striking feature in Figure 3 is the poor performance of all but the gut/digestive system filter on stool samples. Whereas >56% of genera from every other body region passed every body-site filter, only 17-35% of stool genera passed non-gut/digestive system filters. At the same time, however, 95% of stool genera passed the gut/digestive system filter. This suggests that stool taxa are unique to the gut environment, whereas taxa from other body sites exhibit broad body distributions.

Samples from Different Bioinformatics Pipelines
As a final demonstration of EFILTER performance, we compared EFILTER output for different bioinformatics pipelines. A common problem with microbiome studies is that different pipelines can give different taxon lists, including substitutions among distantly related organisms. EFILTER can be used to gauge performance of different bioinformatics methods and to identify potentially spurious taxa introduced through taxonomic assignment steps. To illustrate this, we used a shotgun sequencing dataset of the human skin microbiome [28], processed using both Kraken [29] and MetaPhlAn [30], and thresholded at various read percentages. Results are shown in Figure 4, which also serves as an example of the EFILTER graphical output.
With larger thresholds, the pass:fail ratio increases for both bioinformatics pipelines. In other words, a larger fraction of the rare tail fails, regardless of the bioinformatics method used. This should not be surprising. Everything from contaminants and transient taxa to inherently rare/understudied organisms and bioinformatics errors is expected to contribute to the rare tail. In keeping with the consensus that Kraken has a tendency to overclassify [10,31], the pass:fail ratio for Kraken is significantly lower than it is for MetaPhlAn. This is particularly true when there is no threshold, with Kraken giving pass rates of 42% and 35% for genera and species respectively, relative to MetaPhlAn's 67% and 61% at the same ranks. With a 1% threshold, both pipelines give similar results for species (72% passing), while Kraken actually shows a higher pass:fail ratio for genera (92% versus 88%). Surprisingly, though, despite Kraken's lower pass:fail ratio for most scenarios, MetaPhlAn usually identifies a greater absolute number of passing taxa. Thus, while it is likely that some of the failing taxa identified by Kraken (and MetaPhlAn) are truly present on skin, based on overall performance as judged by environmental consistency of assigned taxa, MetaPhlAn appears to be the better pipeline for this dataset.
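The effect of an abundance threshold on the pass rate can be illustrated with a short sketch. The function `pass_rate_at_threshold`, the inclusive comparison (taxa at exactly the threshold are kept) and the toy read percentages are assumptions; the numbers below are not taken from Figure 4.

```python
def pass_rate_at_threshold(read_percent, passing, threshold):
    """Percentage of taxa at or above a read-percentage threshold that pass the filter.

    `read_percent` maps taxon -> percentage of reads assigned to it;
    `passing` is the set of taxa that pass the chosen filter.
    Returns None if no taxon survives the threshold.
    """
    kept = [t for t, pct in read_percent.items() if pct >= threshold]
    if not kept:
        return None
    return 100.0 * sum(t in passing for t in kept) / len(kept)

# Toy taxon list: two abundant passing taxa and a rare tail with one failure.
read_percent = {"a": 50.0, "b": 30.0, "c": 0.5, "d": 0.2}
passing = {"a", "b", "c"}

no_threshold = pass_rate_at_threshold(read_percent, passing, 0.0)
one_percent = pass_rate_at_threshold(read_percent, passing, 1.0)
```

In this toy case the pass rate rises from 75% with no threshold to 100% at a 1% threshold, because the only failing taxon sits in the rare tail.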

Discussion
A wealth of microbiome research has emerged over the past decade. Most of it, however, has been plagued by contamination, sequencing errors and taxonomic misidentification [6][7][8][9][10]. This has led to a lack of reproducibility across pipelines and amongst labs. Recently, there has been a call for improved standards and validation of microbiome datasets, including efforts like the MicroBiome Quality Control (MBQC) project [15] and the Critical Assessment of Metagenome Interpretation (CAMI) initiative (http://microbiome-cosi.org/cami), as well as standards development by the National Institute of Standards and Technology (NIST) [32]. In this paper, we introduce EFILTER as a new tool for helping to assess the quality of microbiome datasets and for identifying suspicious taxIDs within them. EFILTER is unique amongst microbiome informatics approaches in that it leverages ecological information to target organisms that are unexpected based on sample source.
Ultimately, EFILTER provides lists of suspicious taxIDs. The question, of course, is what to do with this information. In general, we advocate against indiscriminate dismissal of taxa that fail a chosen filter. Rather, two approaches are possible. First, depending on the analysis being performed, the analysis can be run with and without discarding the problematic taxa in order to determine whether any conclusions change and, if they do, which and how many. In Figure 5, for example, we show how conclusions about relative sample diversity differ depending on whether analysis includes all taxa or only those taxa passing the salt water OR fresh water OR general water filter for the Tara Oceans dataset from MG-RAST (mgp20413). Notably, 93.6% of diversity orderings remain unchanged when failing taxa are ignored, suggesting that suspicious taxa are not overly problematic for the conclusions of this particular analysis with this particular dataset.
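A with/without comparison of this kind can be sketched as follows. The text does not state which diversity measure was used, so the Shannon index here is an assumption, as are the function names and the toy counts; the sketch only illustrates how pairwise diversity orderings can be compared before and after removing failing taxa.

```python
import math
from itertools import combinations

def shannon(counts):
    """Shannon diversity index of one sample (assumed metric, natural log)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

def unchanged_fraction(samples, passing):
    """Fraction of sample pairs whose diversity ordering is the same
    with all taxa and with only the taxa that pass the filter."""
    full = {s: shannon(c) for s, c in samples.items()}
    kept = {
        s: shannon({t: n for t, n in c.items() if t in passing})
        for s, c in samples.items()
    }
    pairs = list(combinations(samples, 2))
    same = sum((full[a] > full[b]) == (kept[a] > kept[b]) for a, b in pairs)
    return same / len(pairs)

# Toy counts for three samples; taxa "x" and "y" fail the filter.
samples = {
    "s1": {"a": 10, "b": 10},
    "s2": {"a": 10, "x": 10, "y": 10},
    "s3": {"a": 20},
}
fraction = unchanged_fraction(samples, passing={"a", "b"})
```

In this toy example only one of the three pairwise orderings (s1 versus s3) is unaffected by removing the failing taxa, so the fraction is one third; a value close to 1 would indicate that the diversity conclusions are robust to the suspicious taxa.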
A second approach for using EFILTER data is to carefully examine taxa that fail and to make decisions about whether to include them based on additional knowledge or further experimentation. Depending on the system, the list of failing taxa can still be large. Thus, we recommend using broad filters and focusing on particularly egregious failures, that is, either failures that involve very abundant taxa, or else failures that extend to higher taxonomic ranks. We illustrate such an approach in Table C.1 of Appendix C for our own skin microbiome dataset. Notably, although we cannot be certain about the source of any taxon failures, we can make educated guesses about a sizeable fraction. Some taxa, for example Geodermatophilus and Methylobacterium, are known contaminants of blanks from other sequencing studies, suggesting a contamination origin. Others, for example Rhodothermus and Rheinheimera, are present in samples and controls taken at the same time or are ubiquitous in controls, again suggesting a contamination origin. Our lab has cultured certain suspicious taxa, for example Enhydrobacter, directly from skin, suggesting an unreported habitat for this organism. Finally, some taxa, including Modestobacter and Hymenobacter, are found in a range of samples and in no controls, hinting that they may be real, as yet undiscovered taxa from human skin. Supporting this claim, both Modestobacter and Hymenobacter have been found in other human microbiome datasets. Modestobacter, in particular, would be interesting to explore further. Although it is currently known from extreme (desert), rock/stone and soil environments, it showed up in both 16S amplicon sequencing from our skin samples and in shotgun sequencing from a separate skin study performed in a different lab using an entirely different sequencing and bioinformatics pipeline [28]. This suggests that the source could be an as yet unknown species of Modestobacter that resides on human skin.
Although the goal of EFILTER is to identify suspicious taxa in metagenomics samples, it is worth pointing out that EFILTER also works reasonably well for determining sample source origin, at least when the possible sources are quite distinct. In Figure 2, for instance, it is possible to discriminate soil/water sources, animal sources and extreme sources based on the percentage of abundant (top 25) genera passing the associated filters. Such discrimination is more difficult for closely related sources, for example for sources from different mammals, sources from different types of water, or sources from different human body regions (see Figure 3). A future goal would be to extend EFILTER such that source discrimination is improved, possibly by leveraging additional habitat information from sequencing studies.
On a similar note, one of the downsides of the existing EFILTER database is that it only uses habitat information from validly published taxon descriptions. Although this ensures that contaminated or otherwise compromised metagenomics samples do not influence the EFILTER database, it means that EFILTER does not capitalize on the widely available, though imperfect, sequencing data currently in the public domain. A future direction would be to extend EFILTER to include these data, but to do so in a manner that incorporates confidence in the data source. This could result in higher percentages of taxa passing the appropriate filters, and also improved source discrimination.

Conclusions
Contamination, sequencing error and bioinformatics misidentifications will continue to plague microbiome research for the foreseeable future. Historically, many questionable results from sequencing studies have been addressed ad hoc by researchers who realize that particular organisms should not be present in particular samples. With EFILTER, we extend this capability to anyone involved in microbiome research, and do so in a way that allows rigorous, systematic, and repeatable dataset cleaning and validation based on ecologically relevant habitat information.

Figure 1
Average percentage of taxa passing the pre-defined (A) skin, (B) skin OR human OR other mammal OR eye OR oral OR ear/nose/throat OR female reproductive systems OR male reproductive systems OR general reproduction OR lymph circulatory system OR nervous system OR bone/muscle OR liver/urinary tract/pancreas OR gut/digestive system, (C) built and (D) built OR fresh water OR salt water OR general water filters for each of our 50 sets of samples (x-axis) and paired controls (y-axis). Samples consisted of 5-11 forensic swabs rubbed against the skin at 2 cm intervals across the entire width of each person's forehead. Paired controls consisted of forensic swabs exposed to room air before and after sampling from each person. Samples were analyzed using 16S rRNA sequencing of the V3-V4 region followed by QIIME based on 97% sequence similarity. Lists of taxa were generated by including any genus that was present at >0.1% of reads in any sample/control. A greater fraction of sample (control) taxa pass the filter for any point that lies below (above) the dashed grey line in each panel.

Figure 2
Confusion matrices showing the percentage of taxa passing each pre-defined environmental filter (x-axis) against each sample (y-axis) considering samples individually (left panel), or averaged over studies/sources (right panel). Red indicates that the majority of taxa passed the filter; blue indicates that the majority of taxa did not pass the filter. Filters are ordered such that those coming from the same broad class (e.g., environment, animal) are next to each other. Strong EFILTER performance is indicated by warm colors along the diagonal and cool colors in off-diagonal regions.

Figure 5
Pairwise combinations of samples from the Tara Oceans dataset, where black indicates that the sample with the highest diversity depends on whether taxa failing the salt water/fresh water/general water filter are included, and white indicates that it does not. Data were downloaded as lists of genera directly from the MG-RAST website.
