Spearman correlation is not a reliable measurement for the prediction accuracy of gene contents
We compared the predictions of PICRUSt, PICRUSt2 and Tax4Fun to the results of shotgun metagenome sequencing on publicly available datasets for which both metagenome and 16S rRNA sequences were available (Table S1). As we would expect from previous literature [4], gene content estimations from these tools were robustly correlated with gene contents from metagenome sequencing with Spearman correlations in the range of 0.53 to 0.87 (Fig. 1). For example, in one soil sample (Fig. 1B), there is a clear correlation between the relative abundance of each gene from PICRUSt and the relative abundance from metagenome sequencing (Spearman’s rho = 0.85). However, if we independently permute each gene’s abundances across samples (Fig. 1A) and then compare the gene composition from metagenome sequencing to PICRUSt predictions of this sample, the correlation that was observed is not substantially impacted (Spearman’s rho = 0.84) (Fig. 1C).
The likely explanation for this observation is that across environments, there is less variation between metagenome functional profiles of samples than their taxonomic profiles (Fig. 2), an observation that has been previously made for human samples in the Human Microbiome Project [19]. In the datasets examined, the relative abundance of genes from prediction tool estimates were highly correlated with that from metagenome sequencing, with correlation coefficients always higher than 0.5, and this was true for both permuted and unpermuted samples for PICRUSt (Fig. 1D), PICRUSt2 (Fig. 1E) and Tax4Fun (Fig. 1F). The correlations were often only marginally higher on the unpermuted data than those permuted, with perhaps the gorilla dataset as an exception (Fig. 1D, E and F). However, even in the gorilla samples, the largest difference between Spearman coefficients for permuted and unpermuted data was only 0.12. For the 2 soil datasets, the Spearman coefficients for the unpermuted data were not significantly different from those for the permuted ones with all three prediction tools (Fig. 1D, E and F).
Inference from metagenome prediction tools showed higher consistency with metagenome sequencing in human samples than non-human samples
As an alternative evaluation to Spearman’s correlation of gene composition, we examined how the inference of predicted gene compositions compared to that of shotgun metagenome sequencing in each of our datasets. For this purpose, we formed a null hypothesis for each gene in each dataset that there is no difference in the mean of that gene’s distribution of relative abundance between the two groups in the dataset. For example, for each of the 5,574 genes detected by both PICRUSt and metagenome sequencing in the Human_KW dataset, we used a Wilcoxon test to generate P-values for the difference in gene composition between rural and urban samples. Across all the genes, there was a reasonable correlation (rho = 0.46) of P-values from Wilcoxon tests run on real metagenome sequencing data and those predicted with PICRUSt data. Unlike our results for Spearman correlation of gene composition, this inference correlation is sensitive to data permutation, as when we repeated this procedure on permuted data (Fig. 3A), the correlation between P-values generated from metagenome sequencing and those from prediction tools approached zero (Fig. 4). We calculated the inference correlation coefficients for the estimates from PICRUSt, PICRUSt2 and Tax4Fun on all 7 datasets. We saw a similarly robust correlation for the other human dataset (Human_TY) evaluating a null hypothesis comparing US and non-US samples. However, when we extended this analysis to non-human datasets (using the null hypotheses for each study listed in Table S1), the inference produced by metagenome prediction tools showed a much lower similarity to inference produced by metagenome sequencing (Fig. 3C).
To determine whether sample sizes contributed to the differences in performance across datasets, we randomly sub-sampled each larger dataset (without replacement) to 10 samples (5 per group) and re-calculated the comparison of P-values between metagenome prediction tools and metagenome sequencing (Fig. S1). Even at a smaller size, data from the human studies showed greater concordance than those from other environments. We conclude that the difference in sample sizes between datasets does not explain the variability of metagenome prediction tools’ accuracy between different samples types in our study. Likewise, the effect sizes of the associations with metadata, measured as R2 in a PERMANOVA test, were not substantially higher in human samples (Table S1). It therefore also seems unlikely that effect size alone can explain the better concordance we observed between inference results from metagenome prediction tools and metagenome sequencing for human samples.
We further investigated the consistency of metagenome prediction tools and metagenome sequencing by examining how many genes were missed or incorrectly detected by metagenome prediction tools. For some datasets, such as the Human_KW dataset, metagenome prediction tools failed to predict many genes that were detected by metagenome sequencing (Table S2). For other datasets, such as the soil datasets, many genes predicted were not detected in metagenome sequencing and there were also many genes seen in metagenome sequencing but not in metagenome prediction tools (Table S2). For the chicken dataset with an average metagenome sequencing depth of 31 million reads/sample and the gorilla dataset of 27 million reads/sample, 39.5% and 36.9% of predicted genes could not be detected by metagenome sequencing. In addition, the metagenome sequencing of the Human_KW dataset with an average sequencing depth of 10 million reads/sample detected 13,880 genes and metagenome prediction tools missed 59.1% of them.
Metagenome prediction tools performs differently for different functional categories
We next investigated the discrepancy between metagenome prediction tools and metagenome sequencing for inference in different functional categories (Fig. 5, Fig. S2-4). When comparing the inference from metagenome prediction tools to inference from metagenome sequencing, some functional categories performed better than others in the human gut samples, including those related with genetic information processing such as Replication and repair, Translation, Folding, sorting and degradation, and metabolism related functions including Glycan biosynthesis and metabolism, Nucleotide metabolism and Amino acid metabolism. Some functional categories performed less well, including Biosynthesis of other secondary metabolites, Xenobiotics biodegradation and metabolism, and functions related with Environmental information processing and Signaling and cellular processes, such as Signal transduction, Membrane transport and cell growth and death. For the genes only detected by one method, most of the genes missed by metagenome prediction tools belong to Signal transduction, Signaling molecules and interaction, and functions related with Genetic information processing, while metabolism-related functions were more likely to be predicted (Fig. S3 and Table S3c). Among the genes predicted by metagenome prediction tools but not detected by metagenome sequencing, most of them belong to Signaling molecules and interaction, Metabolism of terpenoids and polyketides and Xenobiotics biodegradation and metabolism (Fig. S4 and Table S3d).