TaxaTarget: Fast, Sensitive, and Precise Classification of Microeukaryotes in Metagenomic Data

Microbial eukaryotes are nearly ubiquitous in microbiomes on Earth and contribute to many integral ecological functions. Metagenomics is a proven tool for studying the microbial diversity, functions, and ecology of microbiomes, but has been underutilized for microeukaryotes due to the computational challenges they present. For taxonomic classification, the use of a eukaryotic marker gene database can improve the computational efficiency, precision and sensitivity. However, state-of-the-art tools which use marker gene databases implement universal thresholds for classification rather than dynamically learning the thresholds from the database structure, impacting the accuracy of the classification process.

training, we learned the discriminatory power and classification thresholds for each 20 amino acid region of each marker gene in our database. This approach provided improved sensitivity and precision compared to other state-of-the-art approaches, with rapid runtimes and low memory usage. Additionally, taxaTarget was better able to detect the presence of multiple closely related species as well as species with no representative sequences in the database. One of the greatest challenges faced during the development of taxaTarget was the general sparsity of available sequences for microeukaryotes. Several algorithms were implemented, including threshold padding, which effectively handled the missing training data and reduced classification errors. Using taxaTarget on metagenomes from human fecal microbiomes, a broader range of genera were detected, including multiple parasites that the other tested tools missed.

Conclusion
Data-driven methods for learning classification thresholds from the structure of an input database can provide granular information about the discriminatory power of the sequences and improve the sensitivity and precision of classification. These methods will help facilitate a more comprehensive analysis of metagenomic data and expand our knowledge about the diverse eukaryotes in microbial communities.

Background
Eukaryotes, bacteria, and archaea comprise the domains of life on Earth (according to our current understanding). Well known eukaryotes, like plants and animals, belong to two small branches on a much larger phylogenetic tree of mostly uncharacterized microscopic eukaryotes, i.e., microeukaryotes.
Some microeukaryotes are well known because they benefit humans. For example, Saccharomyces cerevisiae (baker's yeast) and Aspergillus sojae are involved in the creation of fermented foods, drinks, and spices such as beer, bread, miso, and soy sauce. Others, like Plasmodium falciparum, Trypanosoma brucei, and Cyclospora cayetanensis, are infamous for causing human diseases like malaria, sleeping sickness and food poisoning outbreaks, respectively. However, studies estimate that fewer than 5% of the predicted millions of species of microeukaryotes are currently characterized [1][2][3]. Furthermore, the eukaryotic tree of life is still expanding as whole new supergroups are being discovered [4].
Microeukaryotes do not exist in isolation; rather, they are ecologically integral members of diverse microbial communities (alongside bacteria, archaea, and viruses), i.e., microbiomes, found nearly everywhere on Earth. For example, phytoplankton, a mixture of photosynthetic microorganisms floating in oceans and freshwater bodies, are responsible for nearly half of all carbon fixation and primary production on Earth, 70% of which is performed by microeukaryotic groups, i.e., Diatoms, Coccolithophores, and Chlorophytes [5]. In the human gut microbiome, some species of Blastocystis and Entamoeba, once strictly considered parasites, might promote healthy and diverse assemblages of bacterial species. Further, recent studies have shown that many parasitic microeukaryotes might be commensals, with no effect on health, or even mutualists [6][7][8][9][10][11], depending upon the microbial background. There is even evidence that the absence of some parasitic microeukaryotes might be a contributing factor to the development of autoimmune diseases [9,10].
Metagenomics, the sequencing of DNA extracted directly from microbial communities, provides a means for studying the microbial diversity, functional potential, and ecology of microbiomes.
Traditionally, metagenomic analyses involve aligning the short DNA sequences output by sequencing (i.e., reads) to a reference database of sequences (e.g., genomes or genes) with known taxonomy and/or functions, and transferring the annotations if sequence similarity is high. Although this approach has led to an explosion of knowledge about microbial communities [12], most metagenomic studies have focused on bacteria, with relatively few studies (although a growing number) focusing on microeukaryotic groups such as fungi and protists [9][13][14][15].
The metagenomic study of microeukaryotes has lagged for a variety of reasons. First, there is a general sparsity of sequenced genomes and transcriptomes, with more eukaryotic species represented in protein databases, e.g., ~40,000 species in the UniRef100 database, than genome databases, e.g., ~4,000 species in the National Center for Biotechnology Information (NCBI) Assembly database. Combined, this is still orders of magnitude fewer species than the millions estimated to exist [1,2]. Sparse reference sequences can negatively impact the performance of tools that use sequence similarity for classification.
For example, metagenomic reads from a novel species might not align to any sequences in the reference database of a tool and thus go undetected.
Microeukaryotes often occur at low abundance in microbial communities with genomes that are orders of magnitude larger than bacteria [16,17]. Ideally, all the genomes in a metagenome could be directly reconstructed (i.e., assembled) from the reads. However, the algorithmic challenges of assembly, the limitations of current sequencing technologies, and the general sparsity of eukaryotic reads per metagenome often result in few or no overlapping reads to construct longer eukaryotic sequences (i.e., contigs). Microeukaryotic contigs longer than 3,000 bp can sometimes be assembled, given sufficient depth of coverage [18], and classified as eukaryotic based upon k-mer composition with high accuracy [19]. However, current bioinformatic approaches for characterizing microeukaryotes tend to rely upon metagenomic reads for analysis.
For large eukaryotic genomes (sometimes billions of base pairs), mapping metagenomic reads directly can be computationally expensive in terms of runtime and memory. Additionally, the assemblies of eukaryotic genomes are frequently contaminated with bacteria, potentially leading to classification errors [20][21][22]. A more robust and scalable solution is the use of marker sequence databases such as the clade-specific genes used in MetaPhlan3 [23], universally present sequences such as the cytochrome oxidase (COI) gene or ribosomal subunits (as used by Metaxa2 [24]), or single copy orthologs (as used by EukDetect [22,25]). The latter have shown high specificity for eukaryotes with a strong phylogenetic signal [18,25]. A drawback for using a database of marker sequences is that the combined length of the marker sequences is usually a small fraction of any eukaryotic genome, potentially resulting in low sensitivity when employed in metagenomic data.
A general challenge for tools using sequence similarity for taxonomic classification is that sequences, and regions of sequences, can have variable levels of discriminatory power because they evolve at different rates [26]. For example, regions of the 18S ribosomal subunit (rRNA), a eukaryotic marker sequence used by Metaxa2 for taxonomic classification, are known to lack discriminatory power at the species, genus, and sometimes even the family or broader levels for some eukaryotic groups [3]. One consequence is that a metagenomic read can align equally well to a sequence in multiple species, providing insufficient evidence for classification at the species level. EukDetect addressed this problem for eukaryotic marker genes sharing conserved regions with bacteria (a common source of classification errors) by removing marker genes from its database if bacterial reads mapped, or if a Hidden Markov Model (HMM) used for eukaryotic gene annotation predicted genes in bacterial genomes [22] (note, this strategy was not extended to other groups like archaea and viruses). This reduced the number of marker gene families in the EukDetect database from 255 to 214. A drawback of discarding entire eukaryotic marker gene families, due to conserved regions in bacteria, is that all the non-conserved regions, which are potentially informative for eukaryotic taxonomic classification, will also get discarded.
It is common for taxonomic classification approaches to implement universal alignment score cutoffs as a simple method for classifying metagenomic reads mapped to database sequences. However, this also ignores that sequences can evolve at different rates and can have variable discriminatory power. EukDetect implements such an approach to classify species, as does Metaxa2, which assigns universal cutoffs for the classification of each taxonomic rank from Kingdom to species [24]. CCMetagen also assigns universal cutoffs (from phylum to species), but it implements a slightly more sophisticated approach, i.e., using classification thresholds learned from the clustering of fungal internal transcribed spacer (ITS) and ribosomal subunit marker sequences to assess the taxonomic relationship between metagenomic reads and sequences in the NCBI Nucleotide database [27]. Although the thresholds are learned for specific markers with a data-driven approach, they end up being applied to all sequences in its database, which might evolve at different rates.
As an alternative, supervised learning approaches have been developed for dynamically determining thresholds using marker genes [28] and for the discovery of protein families [29] in metagenomic data; however, these approaches were developed for bacteria using bacterial marker genes. Supervised learning methods use labeled training data to learn functions for classifying new examples. For example, MetaPhyler [28] predicts the taxonomic relatedness (from phylum to species) of metagenomic reads aligned to marker genes. It learns the cutoffs for classification thresholds by training on the scores of reads, from the genomes of known taxa, aligned to the marker genes in its database. This approach automatically assesses if sequences contain enough information for classification at different taxonomic levels. However, a limitation of MetaPhyler is that it builds classifiers for whole genes, treating all regions of a gene as having equal classification power. ROCker [29] introduced a method for assessing if metagenomic reads were sequenced from specific gene families using sliding window, region-specific classifiers. Although ROCker is a functional profiling tool, we thought the approach could be successfully adapted for taxonomic classification.
Here, we present a metagenomic taxonomic profiling tool, taxaTarget, specifically designed for microeukaryotes. We show that taxaTarget is computationally efficient, with higher sensitivity, and often higher specificity, than similar state-of-the-art tools using marker sequence databases (i.e., Metaxa2 and EukDetect; MetaPhlan3 was not included because it only had 122 eukaryotic species as of release mpa_v30), even for species not represented in the database. Using a similar database of single copy orthologs as EukDetect, we show how a supervised learning approach automates the assessment of which genes and regions are most informative for classification, reducing the need for manual curation, and retaining more sequences in the database. Further, the supervised learning approach of taxaTarget replaces universal classification thresholds (i.e., universally applied to all sequences in a database) with classifiers tuned by region in a data-driven manner. One challenge for implementing a supervised learning approach was the scarcity of training data for microeukaryotes, which initially led to many classification errors; however, we were able to implement several novel solutions which effectively handled errors caused by missing data.

Workflow and implementation
The workflow of taxaTarget is shown in Figure 1. It is written in Python3 (Python ≥ 3.6) and executed via command line. It utilizes a custom database of eukaryotic marker genes, the trained classifiers, and the NCBI taxonomy. It accepts as input raw metagenomic sequencing reads in fastq format. The reads are first mapped to the database with Kaiju (v1.8.0) [30] to quickly identify reads that might be sequenced from eukaryotic marker genes. The binned reads are then more sensitively aligned with Diamond (v2.0.11) [31] to the marker genes. The Diamond output is then provided as input to taxon-specific classifiers trained on a curated database of eukaryotic sequences. Finally, the results are aggregated and the taxonomic profile of the metagenomic sample is generated.
The taxonomic classification results are output in four tab-delimited files: 1) classified_reads.txt provides the initial classification results for individual reads that get classified as eukaryotic, which allows users to explore the raw results before filtering; 2) marker_gene_read_counts_per_taxa.txt is a matrix of read counts per marker gene for all eukaryotic lineages detected in the sample; 3) final_read_classifications.txt provides the final classification results for all reads included in the taxonomic profile; 4) Taxonomic_report.txt provides the final aggregate taxonomic profile (i.e., read counts per taxonomic lineage and the number of marker genes with mapped reads) of the sample provided.
Building the marker gene database
The database of taxaTarget was built from protein sequences extracted from the UniRef100 database of UniProt (release-2021_02) [32] to maximize the breadth of species detectable. The UniRef100 database was used, instead of UniProt, because it removes identical sequences and assigns the lowest (most recent) common ancestor to the representative sequences in the database. The ~2.0 × 10^8 protein sequences in the UniRef100 database were downloaded on April 15, 2021. Approximately 6.4 × 10^7 eukaryotic proteins were extracted and assigned a taxonomic lineage using the taxonomic identifiers provided by UniRef100 and the NCBI taxonomy. Eukaryotic proteins were excluded if the taxonomic lineage included the keywords "unclassified eukaryotes" or "environmental samples", or if the base taxonomic label was at a broader taxonomic rank than species. Eukaryotic marker genes were identified and extracted if they aligned with Diamond to the 255 eukaryotic Benchmarking Universal Single-Copy Orthologs (BUSCO) marker genes (version 5) [25] with an alignment matching or exceeding the BUSCO cutoffs for sequence length and bit score.
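The lineage-based exclusion rules described above can be illustrated with a minimal sketch. It is not the taxaTarget implementation: the function name, the list-of-names lineage format, and the rank labels are assumptions made for illustration only.

```python
# Keywords named in the text; a lineage containing either one is excluded.
EXCLUDED_KEYWORDS = ("unclassified eukaryotes", "environmental samples")

# Assumed rank labels for "broader than species" (hypothetical, NCBI-style names).
BROADER_THAN_SPECIES = {
    "genus", "family", "order", "class", "phylum", "kingdom", "superkingdom",
}


def keep_protein(lineage, base_rank):
    """Return True if a eukaryotic protein passes the lineage filters.

    lineage   -- taxon names from root to the assigned taxon (assumed format)
    base_rank -- rank of the assigned taxon, e.g. "species" (assumed labels)
    """
    lineage_text = " ".join(name.lower() for name in lineage)
    if any(keyword in lineage_text for keyword in EXCLUDED_KEYWORDS):
        return False
    # Labels at a rank broader than species are excluded.
    return base_rank.lower() not in BROADER_THAN_SPECIES
```

For example, a protein labelled at the species level with a lineage free of the excluded keywords is kept, whereas one assigned only to a genus is dropped.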

Training the taxonomic classifiers
The classifiers of taxaTarget are a set of region-specific functions that use thresholds for family, genus, and species to classify metagenomic reads. The thresholds for species, genus, and family (S(M, W), G(M, W), and F(M, W), respectively) are learned for each non-overlapping 20 amino acid window, W, of each marker gene, M, in the taxaTarget database via training. As an example, the notation for the species threshold of the 140 to 160 amino acid window of marker gene UniRef100_C0H4X5 (its UniRef100 identifier) would be S(UniRef100_C0H4X5, 140). The thresholds learned during training for each window provide the basis for the taxaTarget classifiers.
To provide an overview, the training process began by decomposing the UniRef100 database into a set of reads. Each read was aligned to the taxaTarget database, yielding one or more alignment scores depending on how many marker genes it aligned to. Per alignment, each read was assigned a label, either "species", "genus", "family", or "negative", depending on its taxonomic relationship to the marker gene. A read that aligned to a marker gene from a broader rank than family was considered a negative. The set of labelled reads that mapped within the window of a marker gene were used to determine the thresholds for that window.
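As a rough illustration of this overview, the sketch below decomposes a protein into overlapping reads and assigns a training label to an alignment. The read length of 70 amino acids and step size of 10 are the values given in the detailed procedure; the function names and the dictionary layout for taxonomy are hypothetical.

```python
def decompose(protein, k=70, step=10):
    """Split a protein sequence into overlapping k-mers ("reads")."""
    return [protein[i:i + k] for i in range(0, len(protein) - k + 1, step)]


def label_alignment(read_taxon, marker_taxon):
    """Label a read by the most specific rank it shares with the marker gene.

    Both arguments are dicts with "species", "genus" and "family" keys
    (an assumed layout). Anything related only at a broader rank than
    family is labelled "negative".
    """
    for rank in ("species", "genus", "family"):
        if read_taxon[rank] is not None and read_taxon[rank] == marker_taxon[rank]:
            return rank
    return "negative"
```

A read from Plasmodium vivax aligned to a Plasmodium falciparum marker gene would be labelled "genus" under this scheme, since the two share a genus but not a species.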
The alignment score used for training and classification was the mean bit score, i.e., the bit score of a read alignment divided by the alignment length. The mean bit score was used because there is an approximately linear relationship between the alignment bit score and alignment length [28], i.e., reads of different lengths, sequenced from the same gene, will align back to the marker gene with different bit scores, but similar mean bit scores. Therefore, training on the mean bit score allows for standardized classification of reads of different lengths.
In detail, training began by aligning all UniRef100 reads (reads are all 70 amino acid long k-mers, with step size 10, extracted from UniRef100) to the marker genes using Diamond BLASTP. The procedure, illustrated here for the 140 to 160 amino acid window of UniRef100_C0H4X5, was as follows:
1) If a read alignment to UniRef100_C0H4X5 started or ended within the 140 to 160 amino acid window (extending towards the end of the gene) with a minimum bit score of 60 and an alignment length of at least 30 amino acids,
   a) The mean bit score of the read alignment was extracted
   b) A taxonomic label ("species", "genus", "family", or "negative") was assigned depending on the taxonomic relationship of the read to the marker gene (Plasmodium falciparum for UniRef100_C0H4X5)
2) S(UniRef100_C0H4X5, 140), G(UniRef100_C0H4X5, 140), and F(UniRef100_C0H4X5, 140) were set to the minimum mean bit score of the reads mapped and corresponding to that taxonomic rank
3) Thresholds were adjusted or removed based on the following conditions
   a) A threshold was removed if there was no training data corresponding to its taxonomic rank
   b) If the mean bit score of a broader taxonomic rank was higher than the threshold of a more specific taxonomic rank, that threshold was either adjusted to that mean bit score or removed if the mean bit score was higher than the threshold of the next most specific rank
   c) The window was only kept in taxaTarget for classification if,
      i) S(UniRef100_C0H4X5, 140) had a value (in some cases reads did not map to a window)
      ii) S(UniRef100_C0H4X5, 140) > {G(UniRef100_C0H4X5, 140), F(UniRef100_C0H4X5, 140), max(negatives)}
4) The process was repeated for every window of every marker gene in the database.
At the end of the training process, the values for S(M, W), G(M, W), and F(M, W) were defined for each window, for each marker gene, subject to the filtering parameters described above. This provided the basis for the classifiers used by taxaTarget for classification.
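The threshold-learning steps for a single window can be sketched as follows. This is a simplified illustration under stated assumptions, not the taxaTarget code: it covers steps 2, 3a and 3c of the procedure and deliberately omits the adjustment rule in step 3b. The function name and the input format (pairs of mean bit score and label) are hypothetical.

```python
def learn_window_thresholds(labelled_scores):
    """Learn S, G and F for one window of one marker gene.

    labelled_scores -- iterable of (mean_bit_score, label) pairs, where label
    is "species", "genus", "family" or "negative" (assumed input format).
    Returns a dict of thresholds, or None if the window is discarded.
    """
    by_label = {"species": [], "genus": [], "family": [], "negative": []}
    for score, label in labelled_scores:
        by_label[label].append(score)

    # Step 2 and 3a: each threshold is the minimum mean bit score observed
    # for its rank; ranks without training data get no threshold.
    thresholds = {
        rank: min(scores)
        for rank, scores in by_label.items()
        if rank != "negative" and scores
    }

    # Step 3c(i): the window needs a species threshold.
    if "species" not in thresholds:
        return None

    # Step 3c(ii): S must exceed G, F and the highest-scoring negative.
    competitors = [thresholds[r] for r in ("genus", "family") if r in thresholds]
    if by_label["negative"]:
        competitors.append(max(by_label["negative"]))
    if any(thresholds["species"] <= value for value in competitors):
        return None
    return thresholds
```

Calling the function once per window of every marker gene would yield the table of region-specific thresholds that the classifiers consult.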

Taxonomic classification of metagenomic reads
Classifying individual reads
Taxonomic classification with taxaTarget begins by aligning metagenomic reads to the taxaTarget database with Diamond BLASTX, yielding alignment scores. The highest mean bit score alignment(s) of individual reads are used for classification. If a read aligns with 100% identity to multiple marker genes from different species, the read is classified at the lowest common ancestor of the respective hits. Otherwise, individual reads are classified as follows:
1) The alignment of a read to a marker gene must be at least 30 amino acids long (gaps are counted)
2) A read alignment must either start or end in the window of a classifier and extend toward the end of the marker gene. If a read alignment overlaps multiple windows, the window used for classification is window = floor( min(start, end) / 20 ). If a classifier does not exist for that window, the read is not classified.
3) The classifier for a window consists of a function that uses the learned classification thresholds for species, genus, and family, S(M, W), G(M, W), and F(M, W), respectively, where M is the specific marker gene the read aligned to and W is the window.
4) The mean bit score of the read alignment is provided as input to the classifier
5) Classification
   a. If the mean bit score ≥ S(M, W), classify the read at the species level
   b. Otherwise, if the mean bit score ≥ G(M, W), classify the read at the genus level
   c. Otherwise, if the mean bit score ≥ F(M, W), classify the read at the family level
   d. Else, the read is not classified
Generating the taxonomic profile of the metagenomic sample
Once the individual reads have been taxonomically classified, taxaTarget uses several functions to analyze the pattern of mapped reads for each taxon and to filter out false positives from the taxonomic profile. The sources of false positives were identified through experiments or reported in the literature previously [22]. Each parameter has a default setting that can also be set by the user.
1. Thresholds determined from training data that lacked examples from broader taxonomic ranks or negatives are frequently overly permissive, leading to false positive classifications. To address this factor, taxaTarget relies on "padding," a user tunable parameter, to adjust the thresholds higher (requiring a higher mean bit score in order to classify a sequence). The default padding (0.5) was determined via parameter sweep analyses detailed later in the Methods. The following padding strategy is used:
   a. By default, T_padding = T + 0.5(I - T), where T is the threshold (options are S(M, W), G(M, W), and F(M, W)) before padding, I is the mean bit score of the identity alignment for the window, and T_padding is the threshold value after padding is added
   b. Padding is added using the following series of rules. Once a rule is met, the subsequent rules are not applied. During training, if no reads aligned to a marker gene,
      i. At a rank broader than family, then padding is added to S(M, W), G(M, W), and F(M, W)
      ii. At a rank broader than genus, then padding is added to S(M, W) and G(M, W)
      iii. At a rank broader than species, then padding is added to S(M, W)
2. As reported in [22] and supported by our observations, false positives frequently entail just one or two reads mapping to a single marker gene of an organism. To avoid such errors, taxaTarget requires that at least 3 reads be classified per taxonomic unit, and that reads map to at least 3 marker genes within a taxonomic unit for it to be reported in the output.
3. If sequencing samples the genomes in a metagenome randomly, then the reads should cover a genome approximately uniformly. False positives are often characterized by many reads mapping to a small subset of the marker genes from a particular taxon, perhaps due to a conserved domain. taxaTarget relies on outlier detection to filter out marker genes that have more mapped reads than expected for each taxon. The cutoff for the maximum expected number of mapped reads for a marker gene is approximated as follows:
   a. Let X = {x_1, x_2, ..., x_N}, where X is a vector of counts and x_i is a single draw, sampled with replacement, from the set of marker genes, {m_1, m_2, ..., m_K}, where N is the total number of reads mapped to the marker genes, and K is the total number of marker genes for a taxon. C = mode(X) × 2.
   b. Any marker gene with more mapped reads than the mean of C after 100 experiments is excluded for the taxon being considered.
4. A source of error, also identified in [22], is that more species of a genus get detected than are present in the sample. Sometimes, this is due to the species lacking a complete set of marker genes in the taxaTarget database. To address such situations, taxaTarget examines the pattern of read mappings to the marker genes from the equivalent species using an extension of the strategy reported in [22]. First, a primary species is defined as the species with the most mapped reads that passes all previous filters. The primary species is assumed present, and the other species are compared to it to determine if they are present as well or false positives. For demonstration purposes, let us consider two scenarios (that taxaTarget handles) where two species, A and B, within the same genus have mapped reads, where A is the primary species.
   a. If the reads of species A and B map to non-overlapping sets of marker genes, the read counts are summed and the taxonomic profile reports both species as a single result, indicating there is not enough information to determine whether either or both are present.
   b. If the reads of species A and B map to partially overlapping sets of marker genes, species B is considered a false positive if the ratio of reads mapped (normalized by the number of marker genes) to shared genes versus not-shared genes is less than 1, with a tolerance of 10%.
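Several of the rules in this section lend themselves to a short sketch: the per-read classification steps, the padding formula, and the minimum-evidence filter of at least 3 reads on at least 3 marker genes. The code below is a minimal illustration under assumptions, not the taxaTarget implementation. All function names and the dictionary layout of the classifiers are hypothetical, the coordinate convention for alignment start and end is assumed, and the requirement that an alignment extend toward the end of the marker gene is not modelled.

```python
import math

WINDOW_SIZE = 20           # amino acids per window, as stated in the text
MIN_ALIGNMENT_LENGTH = 30  # minimum alignment length in amino acids


def pad_threshold(threshold, identity_score, padding=0.5):
    """T_padding = T + padding * (I - T), with 0.5 as the default padding."""
    return threshold + padding * (identity_score - threshold)


def window_index(start, end, window_size=WINDOW_SIZE):
    """window = floor(min(start, end) / 20)."""
    return math.floor(min(start, end) / window_size)


def classify_read(bit_score, alignment_length, start, end, classifiers):
    """Return "species", "genus", "family" or None for one read alignment.

    classifiers -- dict mapping a window index to a dict of thresholds,
    e.g. {7: {"species": 2.5, "genus": 2.0}} for one marker gene
    (assumed layout; a rank may be absent if its threshold was removed).
    """
    if alignment_length < MIN_ALIGNMENT_LENGTH:
        return None
    thresholds = classifiers.get(window_index(start, end))
    if thresholds is None:
        return None  # no classifier exists for this window
    mean_score = bit_score / alignment_length
    # Test the most specific rank first and stop at the first one passed.
    for rank in ("species", "genus", "family"):
        if rank in thresholds and mean_score >= thresholds[rank]:
            return rank
    return None


def passes_minimum_evidence(reads_per_marker, min_reads=3, min_markers=3):
    """Check the 3-read / 3-marker-gene rule for one taxonomic unit.

    reads_per_marker -- dict mapping marker gene to classified read count.
    """
    markers_hit = sum(1 for count in reads_per_marker.values() if count > 0)
    total_reads = sum(reads_per_marker.values())
    return total_reads >= min_reads and markers_hit >= min_markers
```

With a padding of 0.5, a threshold of 2.0 in a window whose identity alignment scores 3.0 moves halfway toward the identity score, to 2.5; a padding of 1.0 would require a read to match the identity score exactly.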

Simulated data sets
Four simulated datasets were created using ART [33] to generate single-end, Illumina HiSeq, 150 bp reads at 1X depth of coverage of the genomes described in Table 1. The four datasets were named non-microeukaryote, microeukaryote_in_DB, microeukaryote_not_in_DB, and sensitivity_analysis.
For the non-microeukaryote dataset, the GenBank assembly report (available at https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt) was parsed with a custom Python script to download one assembly per genus for each of the following phylogroups: metazoa (animals), embryophyta (plants), archaea, bacteria, and viruses. Per genus, the assemblies were first prioritized by assembly level, i.e., complete genome, chromosome, scaffold, contig.
Then an assembly was randomly selected from the highest assembly level available for that genus. For analysis, simulated reads were combined into metagenomes by phylogroup and randomly down sampled to approximately 10 million reads. All reads simulated from the archaea and virus assemblies were used as they were already fewer than 10 million.
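The per-genus selection could look roughly like the sketch below. The custom script used in the study is not described in detail, so this is only one plausible reading: the function name, the record layout (dicts with "genus", "assembly_level" and "accession" keys), and the fixed random seed are assumptions.

```python
import random

# Priority order given in the text, from most to least complete.
LEVEL_PRIORITY = ["complete genome", "chromosome", "scaffold", "contig"]


def pick_one_per_genus(assemblies, seed=0):
    """Pick one assembly per genus from the best assembly level available.

    assemblies -- list of dicts with "genus", "assembly_level" and
    "accession" keys (an assumed layout, not the raw report columns).
    """
    rng = random.Random(seed)
    by_genus = {}
    for record in assemblies:
        by_genus.setdefault(record["genus"], []).append(record)

    def level_rank(record):
        return LEVEL_PRIORITY.index(record["assembly_level"].lower())

    chosen = {}
    for genus, candidates in by_genus.items():
        best = min(level_rank(record) for record in candidates)
        top = [record for record in candidates if level_rank(record) == best]
        # Random choice among assemblies tied at the best level.
        chosen[genus] = rng.choice(top)
    return chosen
```

Grouping by genus before ranking keeps the selection to a single pass over the report, which matters little here but makes the tie-breaking step explicit.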
The assemblies for the microeukaryote_in_DB dataset were downloaded from the NCBI assembly database after searching for "eukaryota" and filtering for "Latest RefSeq", "Protists", "Fungi", "Complete genome", "Chromosome", "Representative." Here, we only used high quality assemblies to reduce the likelihood of contaminating sequences in the assemblies which could affect the classification results. The assemblies were further filtered to only include species that had sequences in the databases of each tool tested (taxaTarget, EukDetect and Metaxa2). The read sets were analyzed individually as well as a metagenome mixture where all the reads had been combined into a single fastq file.
The assemblies for the microeukaryote_not_in_DB were downloaded using the GenBank assembly report and a custom Python script that found species without sequences in any of the databases used by Metaxa2, EukDetect, and taxaTarget, but that did have sequences from the same genus in these databases. Each read set was analyzed individually.
The assemblies of two protist (Phytophthora sojae and Plasmodium falciparum) and one fungal (Aspergillus niger) species were downloaded from GenBank (GCA_000149755.2, GCA_000002765.2, GCA_000002855.2, respectively) to create the sensitivity_analysis dataset. Each species was represented in the database of each tool tested. Six read sets were created from each assembly, composed of 1, 10, 100, 1 × 10^3, 1 × 10^4, and 1 × 10^5 reads using a custom Python script to randomly select from the ART simulated reads. Each read set was analyzed individually.
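A minimal sketch of how such read sets might be drawn is shown below. The study's script is not available here, so the function name, the use of sampling without replacement, and the fixed seed are all assumptions.

```python
import random

# The six read-set sizes listed in the text.
READ_SET_SIZES = [1, 10, 100, 10**3, 10**4, 10**5]


def make_read_sets(reads, sizes=READ_SET_SIZES, seed=0):
    """Randomly select reads (without replacement) for each requested size.

    reads -- list of simulated reads; sizes larger than the number of
    available reads are skipped.
    """
    rng = random.Random(seed)
    return {n: rng.sample(reads, n) for n in sizes if n <= len(reads)}
```

Each read set is drawn independently from the full pool, so smaller sets are not guaranteed to be subsets of larger ones.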
Parameter sweep analysis for adjusting classification thresholds
As mentioned earlier, thresholds learned from training data that lacked examples from broader taxonomic ranks or negatives frequently allowed false positive classifications. To address this, taxaTarget relies on "padding" these thresholds, a user tunable parameter, to adjust the thresholds higher. The default amount of padding to add to thresholds was assessed via a parameter sweep analysis. The values tested for padding included (0.0, 0.1, 0.2, …, 1.0). The experiment involved three simulated datasets which were useful for assessing different performance metrics of taxaTarget.
First, the non-microeukaryote dataset was used to assess the false positive rate of taxaTarget, both with and without applying outlier detection (see above). The false positive rate was calculated as the proportion of reads from non-protist/non-fungi phylogroups (i.e., animals, archaea, bacteria, plants and viruses) that were misclassified as a protist or fungus.
Second, the metagenome of the microeukaryote_in_DB dataset was used to assess the relationship between the precision and the sensitivity of taxaTarget for species with sequences in the database. The precision was calculated using the number of true positive and false positive reads classified at the species level. The sensitivity of taxaTarget was calculated using the number of true positive and false negative species detected in the metagenome.
Third, the microeukaryote_not_in_DB dataset was used to assess the relationship between the precision and sensitivity of taxaTarget for species with no sequences in the database, but that have relatives from the same genus within the database. The precision and sensitivity were calculated as before except at the genus level.
Limit of detection analysis using simulated data
The sensitivity_analysis dataset (Table 1) was created to test the limit of detection of the tools for two protist (Plasmodium falciparum and Phytophthora sojae) and one fungal (Aspergillus niger) species across a range of depth of coverage of the genomes. These datasets were meant to mimic metagenomic samples where protist and fungal species occur with varying levels of sparse genomic coverage. The 3 species have genomes ranging in size from 2.3 × 10^7 bp to 8.3 × 10^7 bp. Every tool tested had reference sequences from these species in its database. Each read set was classified separately with the tools and the number of reads correctly classified at the genus level was recorded.
Testing the false positive rate of the tools for non-protist/non-fungi species
The non-microeukaryote dataset (Table 1) was used to test the false positive rate of the tools. The false positive rate was calculated as the proportion of reads from non-protist/non-fungi phylogroups (i.e., animals, archaea, bacteria, plants and viruses) that were misclassified as a protist or fungus. The mean false positive rate for all phylogroups was normalized by 10 million to derive the mean number of false positive reads per 10 million reads classified.
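The normalization described above amounts to a short calculation, sketched here. The function name and input layout are hypothetical, and the sketch assumes that the per-phylogroup rate uses the total number of reads in that phylogroup's metagenome as the denominator.

```python
def false_positives_per_10_million(counts):
    """Mean false positive rate across phylogroups, scaled to 10 million reads.

    counts -- dict mapping phylogroup to (misclassified_reads, total_reads),
    an assumed layout.
    """
    rates = [misclassified / total for misclassified, total in counts.values()]
    mean_rate = sum(rates) / len(rates)
    return mean_rate * 10_000_000
```

Averaging the rates, rather than pooling the reads, gives each phylogroup equal weight even though the archaea and virus metagenomes contain fewer than 10 million reads.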
Measuring the precision and sensitivity of the tools for protist and fungal species
The microeukaryote_in_DB dataset (Table 1) was used to measure the precision of the tools when classifying protist and fungal species with reference sequences in the database. The tools were first applied to each read set individually to provide a baseline for the precision of the tools and then to the combined read set to assess the precision of the taxonomic profile when multiple species were present (sometimes multiple species from the same genus). The precision was calculated separately at the taxonomic levels of family, genus and species. The sensitivity of the tools was defined at the family, genus and species levels as the proportion of the total number of species in the dataset where at least one read was classified correctly at the respective taxonomic levels.
The microeukaryote_not_in_DB dataset (Table 1) was used to measure the precision of the tools when classifying protist and fungal species with no reference sequences in the databases used by the tools (although there were reference sequences from other species within the corresponding genera). The precision and sensitivity of the tools were measured as mentioned in the previous paragraph, except only at the taxonomic levels of family and genus, as the respective species are not represented in the reference collection.
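The two metrics can be written out as a small sketch. Precision is taken here in its standard form, true positives over all positive classifications, and sensitivity follows the per-species definition given above; the function names and input layout are hypothetical.

```python
def precision(true_positive_reads, false_positive_reads):
    """Fraction of classified reads that were classified correctly.

    Returns None when no reads were classified at the rank in question.
    """
    classified = true_positive_reads + false_positive_reads
    if classified == 0:
        return None
    return true_positive_reads / classified


def sensitivity(correct_reads_per_species):
    """Proportion of species with at least one correctly classified read.

    correct_reads_per_species -- dict mapping each species in the dataset
    to its number of reads classified correctly at the rank of interest.
    """
    detected = sum(1 for count in correct_reads_per_species.values() if count >= 1)
    return detected / len(correct_reads_per_species)
```

Both functions would be evaluated once per taxonomic level, since a read that is wrong at the species level can still be correct at the genus or family level.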

Analysis of real metagenomic data
The accuracy of the tools for detecting known protist species in a microbiome was quantified using 57 Cameroonian shotgun metagenomic gut samples [34]. Polymerase chain reaction (PCR) had previously been performed on the samples to detect Blastocystis and microscopy to detect Entamoeba. Additionally, the previous study had identified several target protist species by mapping the metagenomic reads to a database of their genomes or 18S rRNA sequences [34]. We summarized and compared the taxonomic profiles of each tool with the findings of the original study. For Blastocystis, the PCR results were used as a "gold standard" for measuring the accuracy of the tools for detecting this parasite.

Database creation
The reference database of taxaTarget is composed of orthologs of the 255 BUSCO eukaryotic marker genes. These marker genes tend to have a strong eukaryotic evolutionary signal and are less likely to generate false positives when used for taxonomic classification [18,25]. Of these, 877,724 were identified as homologs of the 255 eukaryotic BUSCO marker genes. These sequences originate from 1,319 families, 2,564 genera and 4,232 species. As seen in Figure 2, many species had few marker genes (e.g., 1,152 species only had one marker gene identified); however, 2,619 species had over 50 marker genes and the median number of marker genes per species was 152.

Statistics about the trained classifiers
Classifiers were trained for all non-overlapping 20 amino acid regions (22.7x10^7 potential regions) per protein in the database. A portion of the regions (7.4x10^6) were excluded after training because the highest mean bit score alignments for family, genus, and species did not follow the pattern of species > genus > family (Figure 3). For the remaining regions, training data was often sparse, i.e., 22.7%, 64.2%, and 68.1% of the regions lacked negative-labeled, family-labeled, or genus-labeled data, respectively. The number of region-specific classifiers varied by orders of magnitude between marker genes, from 2.8x10^3 to 1.5x10^6 with a median of 4.3x10^4. There was also great variability per species, from 1 (e.g., Phytophthora andina) to 1.6x10^5 (e.g., Aureobasidium pullulans) with a median of 2.1x10^3 region-specific classifiers per species. Furthermore, there were cases where the mean bit score showed a high variability across windows of a marker gene within a single rank; the maximum mean bit score range observed was 1.94 across the windows of the Brix domain-containing protein of Hortaea werneckii (A0A3M7GIU8).
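The windowing and the ordering check described above can be sketched as follows. This is a minimal illustration, not the taxaTarget code. The function names are hypothetical, alignment start positions are assumed to be 0-based, and ranks without training data are simply skipped in the ordering check, which is an assumption rather than something stated in the text.

```python
WINDOW_SIZE = 20  # amino acids per non-overlapping region, as in the text


def window_index(alignment_start, window_size=WINDOW_SIZE):
    """Map a 0-based alignment start position to its non-overlapping window."""
    return alignment_start // window_size


def follows_rank_order(highest_mean_bits):
    """Check that mean bit scores satisfy species > genus > family.

    highest_mean_bits: dict with keys 'species', 'genus', 'family';
        a value of None marks a rank with no training data (assumption:
        such ranks are ignored when checking the order).
    """
    scores = [highest_mean_bits.get(rank) for rank in ("species", "genus", "family")]
    present = [score for score in scores if score is not None]
    # Every score must be strictly greater than the next broader rank's score.
    return all(a > b for a, b in zip(present, present[1:]))


# A region whose genus-level score exceeds its species-level score is excluded.
kept = follows_rank_order({"species": 2.0, "genus": 1.5, "family": 1.0})
excluded = not follows_rank_order({"species": 2.0, "genus": 2.5, "family": 1.0})
```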

Parameter sweep analysis: Padding thresholds trained with sparse data
The first parameter sweep analysis used the non-microeukaryote metagenome to assess how the padding of thresholds, with and without outlier detection, affected the number of non-microeukaryote reads falsely classified as microeukaryotic (protist or fungal) species (Figure 4A). Here, outlier detection refers to whether taxaTarget is set to remove marker genes with more reads mapped than expected (see Methods).
As seen in Figure 4A, irrespective of padding, taxaTarget always performed better when using outlier detection, especially at low threshold padding, i.e., 0.2 or less. Without outlier detection, the number of false positive reads per 10 million reads classified decreased from 179 to 7 as the threshold padding increased from 0 to 1. With outlier detection, the same value over the same range decreased from 26 to 4.
From 0.1 to 0.2 padding there was an unexpected increase in the number of false positive reads classified as embryophyta. This was caused by several species where the pattern of mapped reads to marker genes was not uniformly distributed at 0.1 padding, resulting in them being filtered out of the taxonomic profile by taxaTarget. However, at 0.2 padding the number of reads mapped to marker genes was greatly reduced and approximated uniform coverage, resulting in them being wrongly included in the taxonomic profile. The phylogroups that contributed the most false positives, irrespective of padding level, were the bacteria and embryophyta. For bacteria, misclassified reads were most frequently classified as fungi, chlorophyta, and choanoflagellata. For example, at 0.5 padding there were 15 misclassified reads, of which 7, 5, and 4 were assigned to fungi, chlorophyta, and euglenozoa, respectively. For embryophyta, misclassified reads were most frequently classified as chlorophyta and charophyceae (green algae). For example, at 0.5 padding there were 23 misclassified reads, of which 20 were assigned to chlorophyta (the other 3 were assigned to stramenopiles). Viruses did not produce false positives, archaea did so only at a low rate (irrespective of threshold padding), and metazoa did so only at a padding of 0.1 or lower.
The second analysis used the microeukaryote_in_DB metagenome to assess the effect of padding thresholds on the precision and sensitivity of taxaTarget at the species level for species with sequences in the reference database (Figure 4B). The sensitivity was 1.0 regardless of padding, i.e., all species were detected at the species level in the metagenome. The precision increased slightly, from 0.951 to 0.981, as padding increased from 0 to 1.
The third analysis used the microeukaryote_not_in_DB dataset to assess the effect of padding thresholds on the precision and sensitivity of taxaTarget for species without sequences in the database (Figure 4C). The precision was assessed at the genus level. Similarly, the sensitivity was calculated as the proportion of species detected at the genus level. Overall, the sensitivity was 0.952 from 0 to 0.8 padding and 0.94 thereafter. The precision increased continuously from 0.814 to 0.916 as padding increased from 0 to 1.
Together, the parameter sweep analyses led us to choose a threshold padding of 0.5 as the default setting for taxaTarget, but users can adjust this parameter.
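The effect of the padding parameter swept in this section can be pictured with a small Python sketch. The text states only that padded thresholds are adjusted upwards. The multiplicative form `threshold * (1 + padding)` used below is an assumption made purely for illustration, as are the function name and the example threshold values.

```python
RANKS = ("species", "genus", "family")  # most specific rank first


def classify_read(mean_bit_score, thresholds, padded_ranks=(), padding=0.5):
    """Return the most specific rank whose threshold the read meets, else None.

    thresholds: dict rank -> learned threshold for one window
        (a missing or None value means no threshold was learned).
    padded_ranks: ranks whose thresholds lacked supporting training data.
    padding: hypothetical fractional upward adjustment (assumed formula).
    """
    for rank in RANKS:
        threshold = thresholds.get(rank)
        if threshold is None:
            continue
        if rank in padded_ranks:
            threshold = threshold * (1.0 + padding)  # assumed padding formula
        if mean_bit_score >= threshold:
            return rank
    return None


# Hypothetical window thresholds; species and genus thresholds are padded.
window_thresholds = {"species": 2.0, "genus": 1.2, "family": 0.9}
unpadded_call = classify_read(2.1, window_thresholds)
padded_call = classify_read(2.1, window_thresholds, padded_ranks=("species", "genus"))
```

With these illustrative values, a read that would be called at the species level without padding is called at the genus level once the species and genus thresholds are padded, which mirrors the more conservative behavior seen at higher padding.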

Limit of detection analysis
The limit of detection for each tool was tested by classifying read sets simulated from genomes, where the number of reads varied by 6 orders of magnitude (1, 10, 100, 1x10^3, 1x10^4, 1x10^5) (Figure 5). The datasets were simulated from the genomes of two protist species (Plasmodium falciparum and Phytophthora sojae) and one fungal species (Aspergillus niger). The genome sizes were 23, 34, and 83 Mbp for Plasmodium falciparum, Aspergillus niger and Phytophthora sojae, respectively. Each species had reference sequences in the database of each tool. The sensitivity of a tool was measured as the number of reads correctly classified at the genus level.
Amongst the tools, taxaTarget showed the greatest sensitivity, only requiring 100 reads to detect Aspergillus and 1,000 reads to detect Phytophthora and Plasmodium. EukDetect required 10,000 reads to detect Aspergillus and Phytophthora and 100,000 reads to detect Plasmodium. Metaxa2 showed the least sensitivity: it was unable to detect Phytophthora given a million reads, and it required 100,000 reads to detect Aspergillus and 10,000 reads to detect Plasmodium.

False positive rate when classifying simulated metagenomes composed of non-fungi/non-protist phylogroups
Microbiomes can be composed of hundreds or thousands of microbial species, of which usually only a small proportion are microeukaryotes (mostly protists and fungi) [17,35]. The non-microeukaryote dataset was used to quantify the false positive rate of the tools when classifying non-protist/non-fungal phylogroups.
As seen in Table 2

Classification performance for known fungal and protist species
Here we assess the precision of the tools when applied to data derived from genomes found in the databases of all tools.
The precision of the tools was tested under two conditions: first, classifying read sets simulated from individual genomes (Table 3); second, classifying the metagenome of the combined read sets (Table 4).
The first condition helped establish a baseline of precision for each tool when only a single species was present. The second condition tested the precision of the taxonomic profile of the tools when there were many species in a sample, sometimes multiple from the same genus (24 of the 54 species came from a genus with multiple species).
When classifying read sets simulated from individual genomes, all tools showed high precision (> 95%) at genus and species levels, but only taxaTarget had a sensitivity of 100%, i.e., detected all species (Table 3). In contrast, EukDetect had a sensitivity of 90.7%, misclassifying 5 species as other species. Metaxa2 had the lowest sensitivity, only detecting 31.5% of the species and classifying 160 reads in total at the species level. In terms of precision, taxaTarget and Metaxa2 performed best at the species level, at 97.3% and 97.4%, respectively, with EukDetect slightly lower at 95.3%.
When classifying the simulated reads combined into a single metagenome, all tools still showed high precision (≥ 95%). Metaxa2 and taxaTarget still had the highest precision at the species level, 97.0% and 97.3%, respectively, and taxaTarget still had a sensitivity of 100% (Table 4). In contrast, EukDetect showed a substantial loss in sensitivity (only detecting one species per genus when multiple species were present), detecting 68.5% of the species, in contrast to 90% when data from each species was provided separately. Further, Metaxa2 detected the same number of species as previously but with even fewer classified reads, 130 in total.

Classification performance for unknown fungal and protist species
We sought to test whether the tools could detect species that were not represented in their databases, but that had reference sequences from the same genus. We tested the tools using the microeukaryote_not_in_DB dataset consisting of genomic read sets simulated at 1X coverage for 85 protist/fungal species that had no reference sequences in the databases of any tool, but that had sequences from species related at the genus level (Table 5).
In terms of sensitivity, taxaTarget detected the most species in the sample at the family and genus levels, 97.6% and 95.2%, respectively. Meanwhile, EukDetect detected about 50% of the species at the family and genus levels, and Metaxa2 detected 28% or fewer. In terms of precision, all tools detected the species at the family level with ≥90% precision. However, EukDetect was the most precise in detecting the species at family and genus levels with 99.9% and 90.4% precision, respectively. taxaTarget showed slightly lower precision, detecting species at the family and genus levels with 94.5% and 89.1% precision, respectively. In contrast, Metaxa2 had a precision of only 41.6% for detecting species at the genus level.

Detection of target species in the Cameroonian gut metagenomes
The accuracy of the tools for detecting known protist species in a microbiome was quantified using 57 Cameroonian shotgun metagenomic gut samples (Table 6). PCR had previously been performed on the samples to detect Blastocystis and microscopy to detect Entamoeba. Additionally, several target protist species had been detected by mapping the metagenomic reads to a database of their genomes or to their 18S rRNA [34].
For Blastocystis, the PCR results were used as the "gold standard" for the analysis. The accuracy of the tools was assessed at the genus level because the databases of the tools did not contain all the Blastocystis species detected by PCR. Metaxa2 had the highest agreement with the PCR results, 96.5%, with taxaTarget and EukDetect showing slightly less agreement with the PCR results, 89.5% and 82.5%, respectively. All the tools detected Blastocystis in one sample that was PCR negative. For the 5 PCR positive samples where taxaTarget did not detect Blastocystis, three samples only had one or two classified reads; taxa with fewer than 3 classified reads are removed from the taxonomic profile by taxaTarget. The other two samples had three and seven reads, respectively, but they mapped to fewer than three marker genes, the minimum number required by taxaTarget to be included in the taxonomic profile.
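The two reporting filters mentioned above and the agreement calculation can be sketched in Python. The minimum of three reads and three marker genes comes from the text. The function names and the toy inputs are hypothetical and do not reflect the taxaTarget code.

```python
def keep_taxon(n_reads, n_marker_genes, min_reads=3, min_marker_genes=3):
    """Apply the two minimum-evidence filters described in the text."""
    return n_reads >= min_reads and n_marker_genes >= min_marker_genes


def agreement(tool_calls, gold_calls):
    """Fraction of samples where the tool call matches the gold standard."""
    matches = sum(1 for tool, gold in zip(tool_calls, gold_calls) if tool == gold)
    return matches / len(gold_calls)


# A taxon with seven reads on only two marker genes is not reported.
reported = keep_taxon(n_reads=7, n_marker_genes=2)

# Toy example: 4 samples, the tool disagrees with PCR on one of them.
toy_agreement = agreement([True, False, True, True], [True, False, False, True])
```

For taxaTarget, the five PCR positive samples without a Blastocystis call and the one PCR negative sample with a call give six disagreements in 57 samples, i.e., 51/57 ≈ 89.5%, consistent with the agreement reported above.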
Previously, microscopy of the Cameroon samples had been found less sensitive for Entamoeba than mapping the metagenomic reads to reference genomes: 19 and 36 samples were positive using microscopy and read mapping, respectively [34]. For our analysis, Metaxa2 and taxaTarget were the most sensitive tools, detecting Entamoeba in 21 and 13 samples, respectively. All Entamoeba positive samples had also been found positive for Entamoeba by the previous study. However, only 67% (Metaxa2) and 69% (taxaTarget) of the Entamoeba positive samples agreed with the microscopy results. In contrast, EukDetect did not detect Entamoeba in any samples.
The previous analysis had also detected Endolimax, Enteromonas, and Giardia [34]. None of the tools detected the genus Endolimax, and Metaxa2 and EukDetect did not detect Enteromonas or Giardia. In contrast, taxaTarget detected either Giardia or Hexamitidae (the family of Giardia) in the two samples where it was previously identified. Additionally, despite not having any Enteromonas sequences in the database, taxaTarget detected Hexamitidae (the family of Enteromonas) in 3 of 4 samples where Enteromonas had been identified previously.

Comparison of the taxonomic profiles for the 57 Cameroon gut samples
The total diversity of microeukaryotes was compared for the three tools (Figure 6). EukDetect, taxaTarget, and Metaxa2 detected 12, 20, and 22 microeukaryote genera, respectively. Amongst these genera, only Blastocystis was detected by all three tools; it was also the most frequently identified. Amongst the tools, taxaTarget had the greatest overlap with the other tools, sharing 9 genera with EukDetect and 3 with Metaxa2. The three additional genera that EukDetect identified, but not taxaTarget, were either not in the database of taxaTarget (Kodamaea), were not detected and would not have been reported in the results for having too few marker genes (i.e., 3) in the database (Volvariella), or were detected at too low an abundance and were filtered out of the taxonomic profile (Bactrocera). Similarly, of the 18 genera uniquely detected by Metaxa2, ten were not in the database of taxaTarget, two had three or fewer marker genes in the database, and four genera either had fewer than three mapped reads per sample or the reads mapped to a single marker gene. Further, although Aplanochytrium was not detected at the genus level, its family was detected. Only the genus Amphora showed no signs of detection at the family or genus levels despite having sufficient marker genes in the taxaTarget database to be reported in the taxonomic report.
Consistent with other studies of the human gut eukaryome [35], no genus was found in all samples and most genera were rare. For perspective, there were only three and five genera identified in five or more samples by EukDetect and Metaxa2, respectively, and five for taxaTarget. For these genera, taxaTarget contained all detected by EukDetect and all but one by Metaxa2, Korotnevella, which is not in the taxaTarget database.
Of the 20 genera detected by taxaTarget in the 57 Cameroonian gut samples, 16 had been observed in the human gut microbiome before (Table 8). Of the remaining four, Chaetoceros, Perkinsus, and Wallemia potentially could have been transiently introduced to the gut microbiome if a person consumed foods that carried them. Amongst the genera identified by taxaTarget were parasites with clinical relevance to humans such as Trichinella, Schistosoma, Strongyloides, Digenea, Necator, Trypanosoma, and Giardia.

Computational resource usage of the tools
The runtime and maximum random access memory (RAM) usage (Figure 7) of the tools were recorded for the non-microeukaryote and microeukaryote_in_DB simulated metagenomic datasets as well as one of the Cameroonian gut metagenomes, a total of ~91 million reads. All tools were run using 12 CPUs on a 256 gigabytes RAM node (Intel® Xeon® central processing unit, E5-2650, version 3, at 2.3 gigahertz). In terms of computational efficiency, taxaTarget and EukDetect had the shortest total runtime, 102 minutes for both, and the smallest maximum memory requirement (2 gigabytes and 1.7 gigabytes, respectively). In contrast, although Metaxa2 did not have a substantially longer total runtime, 130 minutes, its memory usage ranged widely from a minimum of 46 megabytes to a maximum of 12 gigabytes depending on the dataset being analyzed.

Discussion
Metagenomics is a powerful tool for exploring the complex microbial communities of bacteria, archaea, viruses, and eukaryotes that compose microbiomes. However, despite the many tools available, we are still not able to comprehensively assess the taxonomic composition of metagenomes. Here we report an important step forward for the identification of microeukaryotic species in metagenomic data. taxaTarget is a fast and computationally efficient supervised learning method that learned region-specific classification thresholds from the UniRef100 database. Our results show this approach can match and often outperform other state-of-the-art, marker-gene-based tools in terms of precision and sensitivity.
These gains applied to species with sequences in the database and to species which only have close relatives in the database, extending the predictive power of the scarce sequence data available for microeukaryotes. Although the use of a protein database for taxaTarget could be expected to have high sensitivity for divergent sequences, due to the redundancy of the genetic code, it was not expected to be more precise than tools using nucleotide databases (and thus might be even more precise with a nucleotide database).
Perhaps the greatest challenge for developing taxaTarget was the scarcity of reference sequence data versus their estimated species diversity [36]. It is imperative that more genomes/transcriptomes/proteomes are sequenced for eukaryotic species to increase the breadth of detectable taxa in metagenomes. As seen in the analysis of the Cameroonian gut metagenomes, although taxaTarget identified more microeukaryotes than the previous study of the dataset, including clinically relevant parasites, the biased representation of species in the databases of the tools acutely affected the results. EukDetect and taxaTarget, which use a similar marker gene database, detected an almost non-overlapping set of genera compared to Metaxa2, which uses a custom database of small and large ribosomal subunits as well as COI genes. For reference, the small ribosomal subunit 18S rRNA database in Metaxa2 (which includes SILVA and PR2) currently has nearly an order of magnitude more species represented than in protein or genome databases. As an aside, we did not include the 18S rRNA sequences in taxaTarget because the additional species would only have had one marker sequence, which is insufficient for handling classification errors with our current approach.
Scarce data is also problematic for supervised learning because the accuracy of classification is often dependent upon the number and diversity of labelled training data. For taxaTarget, the classification thresholds of many regions were learned using five or fewer data points, preventing the use of more sophisticated machine learning approaches. Some regions lacked some or all the thresholds because there was no training data. Together, this prevented us from using neighboring windows to impute missing thresholds. However, as more data becomes available, future work should test imputation methods (a common solution for handling missing data in supervised learning [37]) as they might increase sensitivity/precision by including more regions for classification.
The main solution, in taxaTarget, for handling classification errors due to missing data was to pad thresholds that lacked training data at broader taxonomic ranks, an approach that was highly effective for reducing the number of classification errors. However, even at high threshold padding, false positives from non-microeukaryotes persisted, highlighting the need for more comprehensive training data. One persistent source of classification errors was pileups of false positive reads from non-microeukaryotes in specific regions of specific marker genes. Provided sufficient depth of coverage, these events might be simply filtered using sequencing coverage statistics [38]; however, it was common to observe 100 or fewer mapped reads per microeukaryotic taxon in the fecal metagenomes we analyzed, which is far less than 1X depth of coverage. The simple and effective solution developed for taxaTarget used a variant of the "urn problem" to detect and remove outlier marker genes. The marker genes of a taxon were treated like marbles with different colors in an urn and the number of reads mapped to the marker genes were treated as draws, with replacement, from the urn. Averaged over 100 simulations, the count of the marker gene drawn most often provided a useful baseline for the highest expected variant.
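The urn-based baseline described above can be illustrated with a short simulation. This is a minimal sketch under stated assumptions, not the taxaTarget implementation: every marker gene is assumed to be equally likely to be drawn, and a marker gene is flagged when its observed read count exceeds the simulated baseline. The function names, the fixed random seed, and the flagging rule are hypothetical.

```python
import random
from collections import Counter


def expected_max_count(n_marker_genes, n_reads, n_sims=100, seed=0):
    """Average, over simulations, of the count of the most-drawn marker gene.

    Each simulation draws n_reads marker genes with replacement, with every
    marker gene equally likely (an assumption of this sketch).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sims):
        draws = Counter(rng.randrange(n_marker_genes) for _ in range(n_reads))
        total += max(draws.values(), default=0)
    return total / n_sims


def outlier_marker_genes(read_counts, n_marker_genes, n_sims=100, seed=0):
    """Return marker genes whose read count exceeds the simulated baseline.

    read_counts: dict marker_gene_id -> number of reads mapped to it.
    n_marker_genes: number of marker genes the taxon has in the database.
    """
    n_reads = sum(read_counts.values())
    baseline = expected_max_count(n_marker_genes, n_reads, n_sims, seed)
    # Assumed rule: flag any gene with more reads than the expected maximum.
    return {gene for gene, count in read_counts.items() if count > baseline}


# Toy example: one gene holds a pileup of 40 reads, the others one read each.
toy_counts = {"gene_a": 40, "gene_b": 1, "gene_c": 1, "gene_d": 1}
toy_outliers = outlier_marker_genes(toy_counts, n_marker_genes=100)
```

Because the baseline is simulated from the actual number of mapped reads, it stays usable even when a taxon has far fewer reads than would be needed for coverage-based statistics.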
Another source of error, related to missing data (observed here and previously [22]), was caused by closely related species with partially overlapping gene sets in the database. For example, if the marker genes of species A are a subset of those found in species B, both from the same genus, there is the chance that some metagenomic reads sequenced from A will get classified as B. EukDetect partially addressed this problem by filtering out species where more reads mapped to unshared versus shared marker genes with another species in the same genus. taxaTarget implemented a modified version of this solution and found it effective, although some errors persisted. To further address this source of error, EukDetect applied an additional filter (which taxaTarget does not use) that removes species if the global percent identity of aligned reads across genes shared with the primary species is less than that for the primary species. It was acknowledged in [22] that this might result in only detecting the primary species of a genus, but this bias was observed for EukDetect even when using a simulated metagenome where each species had 1X depth of coverage. Future work should explore if statistical methods for assessing ambiguity in taxonomic classification, like ATLAS [39], can be adapted for the classification of microeukaryotes in metagenomic data (ATLAS in its current form was not usable because it depends on a database with good species representation of the metagenome being analyzed).
A major benefit of using a supervised learning approach was that training provided granular information about the discriminatory power of each region of the database sequences (something Metaxa2 and EukDetect do not provide), which sometimes varied widely for adjacent regions (this information is provided for download with taxaTarget and might be useful for the study of sequence evolution). Where EukDetect removed 41 of the 255 eukaryotic BUSCO marker gene families from its database for regions conserved in bacteria, taxaTarget was able to retain all 255, by only excluding the regions that lacked discriminatory power.
Previously, the most expensive step for creating a tool like taxaTarget was the training process. To be able to classify metagenomic reads of variable length, MetaPhyler and ROCker were trained with a minimum of three read sets of different length [28,29]. However, we found that the longest read set was sufficient for training because the mean bit scores of the alignments were similar to those of the shorter read lengths. This resulted in a threefold reduction in training time. A potential future research area is exploring the replacement of the read mapping step of training with directly using the pairwise alignments of the sequences in the database.
Future work should test the methods of taxaTarget when extended to the marker genes of other phylogroups such as bacteria, archaea, and viruses. Furthermore, the methods should be tested when extended to entire sequence databases, which might improve sensitivity and specificity. Ideally, given sufficient data, training could identify which regions of a genome or proteome are useful for taxonomic classification. However, the use of sequences other than marker genes might not be feasible without the development of additional algorithms for handling missing data and the resulting classification errors. Lastly, despite highlighting the performance gains of data-driven methods, taxaTarget still implemented multiple user-defined thresholds, which also might be optimized or replaced with data-driven methods.

Conclusions
Our results show that the taxonomic classification of microeukaryotes in metagenomic data can be improved by utilizing data-driven methods that learn classification thresholds from the structure of an input database. These gains applied to species with sequences in the database and to species which only have close relatives in the database. The scarcity of available reference sequences still presents many challenges for the metagenomic taxonomic classification of microeukaryotes. More genomes/transcriptomes/proteomes need to be sequenced and methods for handling missing data and the resulting classification errors need to be further explored. The implementation of these methods will help facilitate a more taxonomically comprehensive analysis of metagenomic data and expand our knowledge about the roles and diversity of eukaryotes in microbial communities.

Data availability
The source code and a tutorial for installing and running taxaTarget can be found on GitHub at https://github.com/SethCommichaux/taxaTarget. The taxaTarget database can be downloaded from https://obj.umiacs.umd.edu/taxatarget/archive.tar.gz.
All data used for analysis in this study are publicly available. The Cameroonian gut metagenomes can be found under NCBI BioProject PRJEB27005. The NCBI Assembly accessions for the non-microeukaryote, microeukaryote_in_DB, and microeukaryote_not_in_DB datasets can be found in the supplementary materials.

Tables
Table 1 The composition of the simulated metagenomic datasets
Table 3 Precision of tools for individually classifying simulated genomic (1X depth of coverage) data from 54 protist/fungal species with reference sequences in the database of each tool.
Table 6 Detection of Blastocystis in 57 Cameroonian gut samples. PCR(+) indicates that PCR detected Blastocystis in the sample. Tool(+) indicates the tool detected Blastocystis in the metagenomic sample.
Table 7 The 20 genera detected by taxaTarget in the 57 Cameroonian gut microbiome samples
Aspergillus (4%) | yes | diverse genus of mold consisting of several hundred species [43]
Digenea (4%) | yes | parasitic flatworm (fluke) [49]
Schistosoma (4%) | yes | parasitic flatworm (fluke) [49]
Giardia (2%) | yes | protist parasite [49]
Wallemia (2%) | no | fungi commonly found in food; can cause skin infections; might transiently occur in human gut if colonized food is consumed [50]
Perkinsus (2%) | no | molluscan parasite; might transiently occur in human gut if colonized molluscan is consumed [51]
Fusarium (2%) | yes | mycotoxin-producing fungi commonly found on food [43]
Necator (2%) | yes | parasitic nematode [49]
Trepomonas (2%) | no | free-living, environmental protist whose ancestor might have been parasitic [52]
Brettanomyces (2%) | yes | yeast commonly found on fruit [45]
Trichinella (2%) | yes | parasitic nematode that can affect microbial composition of

Figure 1
Workflow diagram for taxaTarget: A) Metagenomic reads are input to taxaTarget in fastq format. The reads are first mapped to the marker genes with Kaiju, a fast, k-mer approach. The subset of reads that mapped to the marker genes with Kaiju are then more sensitively aligned to the marker genes with Diamond. The mean bit score and start position of the Diamond read alignments are used for classification. B) An example marker gene that a read mapped to. Most windows (every non-overlapping 20 amino acid region) have classification thresholds (Species, Genus, and Family) learned from training, but some regions lacked training data and thus lack thresholds. C) Classification of a read for a window where threshold padding is applied. The start position of the read alignment is used to determine the window used for classification. During training, no negative labelled reads aligned to the window at a rank broader than genus and so the thresholds for Genus and Species get padded (i.e., adjusted upwards). Here, the read gets classified as the genus of the marker gene because the mean bit score of its alignment is above Genus with padding.

Figure 4
Results of parameter sweep analysis for taxaTarget. A) The non-microeukaryote metagenome was used to assess how the padding of thresholds affected the number of non-microeukaryote reads falsely classified as microeukaryotic (protist or fungal) species. Outlier detection refers to whether taxaTarget is set to remove marker genes with more reads mapped than expected, under the assumption that sequencing randomly samples the genomes in a metagenome and that read coverage should approximate a uniform distribution. B) The microeukaryote_in_DB metagenome was used to assess the effect of threshold padding on the precision and the number of detected species (n = 54) at the species level. C) The microeukaryote_not_in_DB dataset was used to assess the effect of threshold padding on the precision and the number of detected species (n = 83) at the genus level.

Figure 5
The sensitivity of each tool was tested with 3 simulated genomic datasets varying by 6 orders of magnitude (1, 10, 100, 1x10^3, 1x10^4, 1x10^5) of reads. The datasets were simulated from the genomes of 2 protist species (Plasmodium falciparum and Phytophthora sojae) and 1 fungal species (Aspergillus niger). The genome size of each species is in parentheses beside the species name.

Figure 7
Runtimes and maximum RAM usage for each tool when run on several metagenomic datasets.