We compared the performance of several taxonomic classifiers: Kaiju, Kraken2, MetaPhlAn 3, MetaPhlAn 4, and a custom version of Kraken2 using a database derived from GTDB-TK genomes on our in-silico samples. The analysis aimed to determine the bacterial, archaeal, and fungal composition of the in-silico shotgun metagenomic samples.
The DNA-to-Marker technique, applied by MetaPhlAn3 measures taxon genome counts relative to the total detected genomes. In contrast, Kraken2 operates on a DNA-to-DNA basis, and Kaiju utilizes a DNA-to-Protein approach. Both aim to estimate the proportion of assigned sequences, thus emphasizing sequence abundance over taxonomic abundance. Each classifier, while covering bacteria and archaea universally, relies on databases with unique taxonomic inclusions. Kaiju's “nr-euk” database integrates viruses, microbial eukaryotes, and eukaryota; MetaPhlAn3 includes Eukaryota; and Kraken2 “plus-pf'' adds viruses, plasmid, Univec_core (for vector contamination screening), and protozoa. Our custom.Kraken database further incorporates phyla Basidiomycota, Ascomycota, and Mucoromycota to reflect our in-silico community and to partially capture the fungal diversity in soil. For a concise overview of each classifier's characteristics and performance metrics, see Table 1. In this section, we delve into the implications of relative abundance threshold, classifier selection, and contig assembly in taxonomic classification.
Table 1
Comparison of Taxonomic classifiers: performance metrics, database composition and classification accuracy
Classifier | MetaPhlAn | Kraken and Bracken | Kaiju |
Version | MetaPhlAn3 | MetaPhlAn4 | Kraken2/2.1.1 and Bracken/2.2 | Kaiju/1.7.4 |
Database | CHOCOPhlAn 201901 | CHOCOPhlAnSGB 202103 | plus-pf | custom database | nr-euk |
Organisms included in the database | Bacteria, Archaea, Eukaryota | Bacteria, Archaea | Bacteria, Archaea, Eukaryota, plasmid, human, Univec_core, Protozoa | Bacteria, Archaea, Eukaryota | Bacteria, Archaea, Eukaryota, Virus, Microbial Eukaryotes |
Database size | 2.4 GB | 23 GB | 61 GB | 1.2 TB | 144 GB |
Processing time (h:m:s) | 3:02:38 | 4:24:14 | 0:43:29 | 2:20:08 | 11:23:26 |
Species F1 score (optimal threshold) | 0.26 ± 0.02 | 0.41 ± 0.02 | 0.74 ± 0.01 | 0.68 ± 0.0 | 0.48 ± 0.01 |
Species F1 score (No threshold) | 0.26 ± 0.02 | 0.42 ± 0.02 | 0.5 ± 0.01 | 0.63 ± 0.01 | 0.11 ± 0.01 |
Unclassified | 94.5% | 90.6% | 79.3% | 0.46% | 37.43% |
Classified | 5.5% | 9.4% | 20.7% | 99.54% | 62.57% |
Influence of relative abundance threshold
Initially, we evaluated whether the introduction of a minimum relative abundance threshold impacted the overall performance of each classifier at species and genus level classification, using classic model performance metrics such as considering the F1 score, sensitivity, and precision.
In assessing classifier performance, Kraken2 and MetaPhlAn's F1 scores remained stable at a low relative abundance threshold for both species and genus levels (Figure. 3). Most classifiers demonstrated optimal performance at a relative abundance threshold of 0.001% with custom.Kraken registering an F1 score of 0.68 +/- 0 and MetaPhlAn3 at the lower end with 0.26 +/- 0.02. Kraken2, however, reached its highest performance at 0.005% with an F1 score of 0.74 +/- 0 .01. Beyond these optimal points, F1 scores began to diminish. Overall this indicates that an optimal relative abundance threshold varies by classifier and database combination. This is important as certain classifiers and databases excel at identifying species that are present in low quantities, which can be crucial in soil metagenomics, where low-abundance organisms can play fundamental roles in the dynamics of microbial communities and ecosystems.
To investigate the factors affecting F1 scores, we examined precision and sensitivity by species. Among classifiers, custom.Kraken, Kraken2, and Kaiju often provided a high number of false positives, resulting in reduced precision. However, their robust sensitivity, characterized by minimal false negatives (see Fig. 3.(c) and (d), and Supplementary Fig. 1(a) and (b)), compensates for this shortfall in precision. On the other hand, MetaPhlAn 3 and 4 performed with acceptable precision but poor sensitivity, placing them at a relative disadvantage when compared to the other classifiers (Fig. 3).
We identified significant differences in the performance metrics F1 score across different classifiers and different thresholds (p-value < 0.001). These differences became particularly noticeable when we grouped the thresholds and compared the performance of classifiers against each other. Dunn's post-hoc test indicated that MetaPhlAn's performance was significantly different to other classifiers. A higher z-value means that the two groups are further apart in terms of standard deviation. In our results, Kraken2 vs. MetaPhlAn3 had a z-value of 9.05 (p-value < 0.001), custom.Kraken vs. MetaPhlAn3 with a z-value of 7.63 (p-value < 0.001), Kraken2 vs. MetaPhlAn4 with a z-value of 5.6 (p-value < 0.001), and custom.Kraken vs. MetaPhlAn4 z = 4.2 (p-value < 0.001). Comparing MetaPhlAn3 and MetaPhlAn4, a difference with z = 3.4 (p-value < 0.01) was observed. In contrast, no significant differences were noted between Kaiju and both MetaPhlAn3 and 4. These findings highlight MetaPhlAn's markedly different performance from the other classifiers evaluated. (Table 1).
Regarding relative abundance thresholds, a post hoc Dunn test - applied across all classifiers and samples to isolate the impact of threshold values - indicated that the values at extreme thresholds tested, notably 0, 0.05%, 0.1%, and 0.5%, exhibited significant variations in F1 score when compared to other threshold settings (Supplementary Table 2). Balancing precision and sensitivity necessitates the selection of optimal classifiers and thresholds for soil shotgun metagenomic data. Our analysis identifies the optimal relative abundance threshold of 0.001% for custom.Kraken, Kaiju, and MetaPhlAn, and 0.005% for Kraken2 at the species level (Fig. 3).
Role of taxonomic classifiers in estimating species-level relative abundance
By using the optimal relative abundance threshold, we examined the community structure and tested whether species' relative abundances were deviating from expected values. To do this, we employed Multidimensional Scaling (MDS) to identify which classifiers could accurately discern the in-silico community structure and report relative abundance values that align with expected values. Applied to the compositional data, MDS confirmed marked variations between different classifiers (PERMANOVA: p-value = 0.001, R2 = 0.79) (Fig. 4.(a).). The classifier accounted for approximately 80% of the observed variance in microbial community composition.
In assessing classifier accuracy at the species level using the Euclidean distance between observed and expected outputs, custom.Kraken consistently demonstrated superior performance in comparison to other classifiers. The Kruskal-Wallis tests, conducted to compare distance values among different classifiers across the runs, consistently pointed to significant differences. Specifically, for each run, the effect size (H-value) was large, ranging from approximately 0.902 to 0.950. The p-values for all runs varied between 0.05 to 0.001, emphasizing the robustness of these results (Supplementary Table 3a). A post-hoc Dunn test with Bonferroni adjustment demonstrated that custom.Kraken differed significantly from Kaiju, MetaPhlAn4, and MetaPhlAn3 across all 10 runs (p-value range < 0.001–0.01], H-value range: [3.92 -8 .6]; Supplementary Table 3b). In these runs, Kaiju also significantly varied from MetaPhlAn3 (p-value range < 0.001, H-value range: [3.97–4.47]) and exhibited differences from MetaPhlAn4 in one run (p-value: < 0.01, H-value [2.93]). Kraken2's distinctiveness from both MetaPhlAn3 and MetaPhlAn4 was evident throughout the ten runs. Figure 4.(b) provides a visual representation of these average distances for each classifier in the form of a box plot.
Supplementary Table 3a. Kruskal-Wallis Analysis Examining the Influence of Tools and Thresholds on Euclidean distance per sample. Data was grouped by 'Run' and 'Classifier'. Each row indicates differences in Euclidean distance for specific groupings. The H-values, derived from the Kruskal-Wallis test statistic, consistently indicate significant differences between groups. Effect sizes are denoted by the eta2[H] values, which fall within an approximate range of 0.902 to 0.950. All associated p-values are less than 0.0001.
Run | .y. | n | effsize | method...5 | magnitude | statistic | df | p-value | method...10 |
1 | Distance | 100 | 0.95019781 | eta2[H] | large | 94.2687921 | 4 | < 0.0001 | Kruskal-Wallis |
10 | Distance | 100 | 0.92816258 | eta2[H] | large | 92.1754455 | 4 | < 0.0001 | Kruskal-Wallis |
2 | Distance | 100 | 0.92945701 | eta2[H] | large | 92.2984158 | 4 | < 0.0001 | Kruskal-Wallis |
3 | Distance | 100 | 0.92952579 | eta2[H] | large | 92.3049505 | 4 | < 0.0001 | Kruskal-Wallis |
4 | Distance | 100 | 0.90159125 | eta2[H] | large | 89.6511683 | 4 | < 0.0001 | Kruskal-Wallis |
5 | Distance | 100 | 0.92614403 | eta2[H] | large | 91.9836832 | 4 | < 0.0001 | Kruskal-Wallis |
6 | Distance | 100 | 0.92939698 | eta2[H] | large | 92.2927129 | 4 | < 0.0001 | Kruskal-Wallis |
7 | Distance | 100 | 0.93110787 | eta2[H] | large | 92.4552475 | 4 | < 0.0001 | Kruskal-Wallis |
8 | Distance | 100 | 0.93064763 | eta2[H] | large | 92.4115248 | 4 | < 0.0001 | Kruskal-Wallis |
9 | Distance | 100 | 0.92005836 | eta2[H] | large | 91.4055446 | 4 | < 0.0001 | Kruskal-Wallis |
Understanding the unique methodologies employed by each classifier is crucial. While MetaPhlAn3 utilizes a DNA-to-Marker technique, emphasizing taxon genome counts, Kraken2 and Kaiju underscore sequence abundance over taxonomic distinction via their DNA-to-DNA and DNA-to-Protein strategies, respectively. Notably, when no cut-off was applied, MetaPhlAn3's results substantially deviated from expected values. Nevertheless, the Euclidean distances for MetaPhlAn3 remained consistent, regardless of the relative abundance threshold of 0.001. Such uniformity of MetaPhlAn suggests the relative abundance threshold limited influence on the taxonomic abundance precision, which is supported by comparable F1 scores at both levels (Table 1)
Comparative Efficiency of Classifiers in metagenomic short-read utilization
In our analysis comparing the efficiency of various taxonomic classifiers in read classification, we found that on average, across ten separate runs, the classification rate was 99.54% with custom.Kraken, 63% with Kaiju, 21% with Kraken2 and 5–10% with MetaPhlAn (Table 1) in diverse simulated sequencing libraries. This underscores the superior efficiency of custom.Kraken in maximizing sequencing output value. While one might assume a high percentage of unclassified reads indicates a vast array of unidentified species, it's essential to recognize that this can often be a consequence of the selected classifier and database. Therefore, these figures might not truly reflect the sample's diversity, but rather the limitations or specificity of the analytical tools used. Although custom.Kraken provides the closest values (very low Euclidean distance) to the expected relative abundance values at the species level, it is not statistically significant from Kraken2. Despite their similar performance in identifying relative abundance values, custom.Kraken vastly outperforms Kraken2 by classifying 99% of reads compared to Kraken2's 21%. This underscores a distinct discrepancy in their ability to analyse sequencing data, clearly attributed to the underlying databases each employs, given the constant classifier and algorithm.
Assembled vs. Unassembled Reads: Evaluating the Optimal Approach
Contigs, with their longer sequences and assembly from multiple reads, promise enhanced specificity in taxonomic classification. Their historical relevance and analytical efficiency further positioned them as potential candidates for taxonomy assignments. To analyse this we selected five random runs from the sequenced data to assemble with Spades, followed by classification using the three top-performing classifiers; this facilitated a detailed comparison between the default unfiltered taxonomic profiles, those adjusted to their optimal relative abundance thresholds, and the taxonomic profile of the assembled contigs. Using the Kruskal-Wallis test, we compared the F1 scores of taxonomic classifications on assembled contigs versus trimmed reads (at optimal relative abundance threshold for species-level classification). For versions of the Kraken2 classifier implementation, the observed F1 score was lower for assembled contigs than trimmed reads with and without a relative abundance threshold filter. Notwithstanding this, the relative abundance threshold in conjunction with Kaiju favoured contigs. From the perspective of classifier performance, we observed significant variations in F1 scores between the two read types: Kaiju (p-value < 0.01), Kraken2 (p-value < 0.01), and custom.Kraken (p-value < 0.01). However, upon applying optimal relative abundance threshold filters, trimmed reads consistently outperformed contigs across all classifiers in terms of F1 scores and precision. Notably, sensitivity declined at these optimal relative abundance thresholds, suggesting an increased likelihood of false negatives (Supplementary Fig. 2). In summary, the use of assembly-based methods to assess taxonomic diversity resulted in a reduction in classification accuracy.
Evaluating taxonomic misclassifications across different classifiers
While our efforts have been directed towards ascertaining the superiority of specific classifiers under given conditions, it is equally crucial to investigate what remains undiscovered or erroneously identified in the course of our investigations.
The chi-square statistical analysis substantiated a significant divergence in the family-level taxonomic classifications generated by the four evaluated classifiers, reflecting notably distinct patterns in the identification of true positives, true negatives, false positives, and false negatives across the families picked by each classifier (χ² statistic = 135543 − 4395.4, p-value < 0.001) [detailed in Supplementary Table 4 and illustrated in Supplementary Fig. 3]. The divergence suggests that the observed taxonomies at the family level are profoundly influenced by the choice of classifier, rather than being an accurate representation of the actual sample profile. This finding emphasizes the need for researchers to treat the interpretation of taxonomic data with caution, as the choice of classifier can have a significant impact on the biological interpretations and conclusions that are drawn.
Supplementary Table 4. Chi-squared Analysis of Classification Outcomes by Different Tools at Family Level, Showing All Outcomes (FN, FP, TN, TP) with Significant P-values (< 0.001).
Classification Outcome | Chi-Square Statistic | P-value |
False Positive | 135543 | < 0.001 |
False Negative | 21751 | < 0.001 |
True Positive | 13672 | < 0.001 |
True Negative | 4395.4 | < 0.001 |
To further examine the distribution of false positives (FP’s) and false negatives (FN’s) at higher taxonomic categories like family level, we employed Venn diagrams and heatmaps. The Venn diagrams depict an interesting inverse relationship between FPs and FNs. Notably, while custom.Kraken, Kraken2, and Kaiju are major contributors to FPs, their influence on FNs is considerably smaller, often overlapping with MetaPhlAn4. Conversely, MetaPhlAn demonstrates fewer unique FPs but a larger number of unique FNs (Fig. 5.(a). and 4.(b).).
Heatmaps emphasize the divergence amongst classifiers in terms of false positive detection. Kaiju adeptly identified Fungi without registering any FNs, yet provided a significant number of FPs specifically for fungal families. Conversely, Kraken2 exhibited the opposite trend with fewer fungal FPs but a notable number of FNs. An increase in FPs can be observed when the underlying database has sequences outside the mock community's makeup, leading to misclassifications in kingdoms like Viridiplantae, Chromista, Metazoa, and Viruses. Kaiju's misclassifications and Kraken2's consistent errors in the Chromista and Hominidae families further illustrate this. To improve accuracy in sequence data analysis, it might be beneficial to use databases mainly consisting of sequences from the targeted microbes, specific to the research environment (for instance, excluding terrestrial mammal sequences in marine microbiome studies). This approach could enhance classification precision, although it may narrow the scope of the investigation. custom.Kraken, however, strikes a balance with the fewest FPs but misses certain families like Bdellovibrionaceae and Rhodobacteraceae (See Supplementary Fig. 2 and Fig. 3).
In terms of true positives, MetaPhlAn4 predominantly detected common family hits, while Kraken2, custom.Kraken, and Kaiju had additional shared detections. The varied counts among classifiers, with custom.Kraken and Kaiju spotting unique families, highlight the differential efficiency of these tools in true positive identifications (supplementary Fig. 1.(d).).
Refining Biological Insights with Optimized Taxonomic Classification
In our analysis of the NovaSeq soil dataset, we utilized the custom.Kraken classifier, informed by our earlier in-silico evaluations. The real metagenome dataset used for this study was obtained from a previous study by Mantri, S.S., et al. 2021[1] encompassing seven soil metagenomes from samples collected from soil horizons of multiple forest sites in Germany. Our comparative analysis with the original study, which employed maxikraken, revealed a notable difference in both the identified phyla and the classification of reads. Our analysis resulted in an average of 58% classified reads, a marked improvement from the 47.04% reported in the original study. This variation in read classification percentages underscores the inherent challenge of microbial identification in complex soil matrices. Nevertheless, both methods identified prominent soil phyla such as Planctomycetota, Actinomycetota, Chloroflexota, Pseudomonadota, and others.
Two deviations were prominent. Cambisol_B had a higher number of reads assigned to Chloroflexota presence with custom.Kraken compared to maxikraken results. Despite similar levels of Actinomycetota and Pseudomonadota, we detected increased proportions of Acidobacteriota and Ascomycota across all samples. Our reanalysis, utilizing the custom.Kraken also enabled the identification of several phyla – Gemmatimonadetes, Candidatus Calescamantes and Nitrospirota – which notably emerged among the top phyla in our study, but were not identified as prominent phyla in the original study (Fig. 6).
Figure 6. Comparative microbial composition across three sampling sites and soil horizons at the phylum Level. (a) Bar plot representing the taxonomic profile of the shotgun dataset using custom.Kraken. (b) Bar plot illustrating the taxonomic profile of the shotgun dataset as derived from the original study. Both plots showcase the top 14 phyla with distinct colours, while the remaining phyla are collectively grouped as “Remainder”. Consistent colour schemes are used across both plots for the same phyla. The x-axis delineates the three sampling sites (Podsol, Cambisol, and Stagnosol) and is further segmented into the three soil horizons (O, A, and B) for each site.
The beta diversity of microbial communities across seven soil samples reaffirmed our in-silico observations. On examining species-level distinctions, the NMDS plot revealed that the Kaiju and custom.Kraken tools tended to cluster together, potentially due to the shared utilization of Metagenome Assembled Genomes (MAGs) in their respective databases, as depicted in (Supplementary Fig. 7.(a)). Additionally, PERMANOVA (adonis2) analysis indicated that both the sample site (R2 = 12.3%, p < 0.01) and the taxonomic tool used (R2 = 85.09%, p < 0.001) influenced diversity at the species level. Additional PERMANOVA (adonis2) analysis was conducted to determine whether this was also observed at the genus level. Consistent with the species-level findings, the results affirmed the significant impact of both the sample site and the taxonomic tool on genus-level diversity, explaining 18.46% (p = 0.001) and 77.5% (p = 0.01) of the variance, respectively, as illustrated in (Supplementary Fig. 7. (b)). Clearly, classifiers have a stronger influence on microbial profiles than the inherent soil composition, once again emphasizing the importance of benchmarking.