Background: Most current metagenomic classifiers and profilers employ short reads to classify, bin and profile microbial genomes that are present in metagenomic samples. Many of these methods adopt techniques that aim to identify unique genomic regions of genomes so as to differentiate them. Because of this, short-read lengths might be suboptimal. Longer read lengths might improve the performance of classification and profiling. However, longer reads produced by current technology tend to have a higher rate of sequencing errors, compared to short reads. It is not clear if the trade-off between longer length versus higher sequencing errors will increase or decrease classification and profiling performance.
Results: We compared performance of popular metagenomic classifiers on short reads and longer reads, which are assembled from the same short reads. When using a number of popular assemblers to assemble long reads from the short reads, we discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Specifically, across most classifiers, we observed a significant increase in precision, while recall remained the same, resulting in higher overall classification performance. On real metagenomic data, we observed a similar trend that classifiers made fewer predictions. This suggested that they might have the same performance characteristics of having higher precision while maintaining the same recall with longer reads.
Conclusions: This finding has two main implications. First, it suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. This suggested that they might have the same performance characteristics of having higher precision while maintaining the same recall as shorter reads. Second, this finding suggests that it might be a good idea to consider utilizing long-read technologies in species classification for metagenomic applications. Current long-read technologies tend to have higher sequencing errors and are more expensive compared to short-read technologies. The trade-offs between the pros and cons should be investigated.