Testing the advantages and disadvantages of short- and long-read eukaryotic metagenomics using simulated reads
Background: The first step in understanding ecological community diversity and dynamics is quantifying community membership. Metagenomics is an increasingly common method for doing so. Because of the rapidly increasing popularity of this approach, a large number of computational tools and pipelines are available for analysing metagenomic data. However, most of these tools have been designed and benchmarked using highly accurate short-read data (i.e. Illumina), and few studies have benchmarked classification accuracy for long, error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities.
Results: Here we compare simulated long reads from Oxford Nanopore and Pacific Biosciences with high-accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that, in general, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus). We then show that, for two popular taxonomic classifiers, long reads can significantly increase classification accuracy, and that this increase is most pronounced for non-microbial communities.
Conclusions: This work provides insight into the expected accuracy of metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Posted 13 May, 2020