Genomic data
For each of four major taxonomic divisions (bacteria, fungi, animals, and plants), we downloaded 20 genomes from GenBank (32). Within each of these divisions, we included genomes from a total of 22 classes, 46 orders, and 58 families (Figure 1; details in Supp. Tab. 1). To select these genomes we first constructed a filtered list of genomes that were represented at the chromosome level or greater in GenBank. Within divisions we then selected genomes randomly from this list (without replacement, such that no species was represented more than once).
Read simulation
We simulated Nanopore reads using NanoSim 2.0.0 (33) with the default error parameters for E. coli R9 1D data. This method uses a mixture model to produce simulated reads with indel and error rates similar to real datasets. The error model is applied equally to all parts of a read, and the read lengths are drawn from a distribution approximating real data. To create simulated read data of specific lengths, we truncated the simulated reads after the relevant number of basepairs using a custom bash script (i.e. to simulate 100bp Nanopore reads, we truncated all reads in a simulated dataset to 100bp (see example command below).
simulator.py linear -r Reference -c ecoli -n 1000 -o Output --min_len Length --max_len 8000
cat Output.fasta | awk -v 'RS=>' 'NR>1{print ">" $1; printf("%.Lengths\n", $2)}' > Output_trimmed.fasta
We simulated PacBio reads using SimLORD. This method assumes that the number of passes along a read is chi-squared distributed and dependent on read length, as would occur in a standard SMRT run. Thus, shorter reads have more passes, with very short reads (e.g. < 1000bp in length) having well over 10 passes on average; reads 2000 basepairs in length having a median of over four passes; and reads 3000 basepairs in length having a median of more than three passes. Thus, short reads are almost exclusively CCS (“HiFi”) reads. For both Nanopore and PacBio, we simulated reads for lengths varying from 100 bp to 4,000 bp at 100 bp intervals, simulating 1,000 reads per interval for all taxa (a total of 40,000 reads for each taxon per read simulation program, and 3.2 million reads for all taxa and read lengths per read simulation program). (See example command below)
simlord --read-reference Reference -n 1000 -f1 Length Output
We simulated Illumina data using dwgsim 0.1.12 (34) with the following options:
dwgsim -e 0.0001 -E 0.0001 -N 2000 -1 100 -2 100 -r 0.0001 -R 0.01 -y 0.000 -c 0
This implements errors to mirror those in Illumina data, with constant error rates of 1e-4 and no indels (which are extremely rare in Illumina data). We generated 1,000 reads for each genome, at three read lengths: 100 bp, 150 bp, and 300 bp (a total of 240,000 reads across all taxa and lengths), and used only single end reads for all analyses.
Sequence Classification
We used BLAST 2.7.1 (35) and Kraken2 (19) for sequence classification. We created a local custom database consisting of the NCBI nt database (downloaded on Feb 8 2019) and the genomes of the 80 taxa that we used to test classification success. We used the default alignment parameters for BLAST (word_size = 11, match/mismatches scores =2,-3, gap costs – existence = 5, extension =2, filter = low complexity regions), except for implementing a maximum e-value of 0.1. We used the match with the highest bit score for all downstream analyses. For Kraken2 analyses we used the default parameters (in which the k-mer length is 35 bp and default minimiser length is 31 bp). For Kraken2 we used the taxon assigned by the lowest common ancestor (LCA) algorithm employed in Kraken2 for downstream analyses.
Accuracy metrics
To assess the effects of read length on classification accuracy we focus our analysis only on how often a read is assigned to the correct taxon. For our simulated reads there are three possible outcomes when querying a database (Table 1).
We expect that taxa that are well represented in the database, and which have few closely related taxa, will have high rates of true matches. Taxa with many close relatives in the database will have many false matches. Taxa that are poorly represented in the database will have high rates of failed queries. Both of these latter results are in a class usually referred to as false negatives: we falsely infer taxon A is absent. However, they largely arise from different mechanisms. Importantly, as genomic databases become more complete, we expect the fraction of failed queries will decrease. At the same time we expect that the fraction of false matches may increase, as more and more closely related taxa become present in the database. The exact nature of this tradeoff is not well explored. Novel statistical approaches, such as Bayesian re-estimation of species frequencies, may mitigate the problem (21); however, improved methods are required to address this problem (36).
There are other aspects of classification success that we do not focus on here. The first of these is the notion of a true negative: a sequence that is known to not arise from any taxa, should not return a match to any taxa. This is not a biologically realistic situation (all sequences arise from a taxon), although this aspect is useful when trying to assess the performance of different classifiers (37) and presenting the full truth table. The second aspect we do not consider here are false positives: if a read query matches taxon A, but does not arise from taxon A. We would thus falsely interpret taxon A as being present in a community. This metric is intrinsic to the composition of the community rather than just each taxon and the database. For example, if taxon A dominates the community, then it cannot have a high fraction of false positives relative to true positives simply because the vast majority of read queries from the community will be from taxon A and thus true positives. Conversely if taxon B is extremely rare, there will be a large number of false positives relative to true positives, as very few read queries will be from taxon B, resulting in a very small fraction of true positives.
Thus, we use a simplified set of metrics (see Table 1) that are not intrinsically related to community composition: true matches, false matches, and failed queries. We used our simulated genomic sequence reads from 80 taxa to quantify these three outcomes at both the genus and family level. To assign genus and family from species, we used the NCBI taxonomy database (38) (which is used by BLAST as the default taxon classifier).
We calculate two ratios from the three metrics in Table 1. The first is the fraction of true positives classified correctly (i.e. recall):
Recall = Mtrue/(Mtrue + Mfalse + Mfail)
The second is the ratio of true matches to false matches. This simply excludes failed queries from the equation. We term this second metric classification success.
Classification Success = Mtrue/(Mtrue + Mfalse)
The critical difference between these metrics is that taxa which are poorly represented in the database may nevertheless have high rates of classification success, although recall will necessarily be low. However, as the fraction of failed queries approaches zero (which we expect as genomic databases grow), these two metrics become equivalent.