Genome-resolved metagenomics enables draft bacterial genomes to be reconstructed from DNA sequence data derived from complex microbial mixtures 1. The metagenome-assembled genomes (MAGs) derived from such a process can be annotated to predict their functional toolbox, upon which microbiome-level functional analyses can be conducted 2,3. One of the main issues with this methodology is that MAGs usually have different levels of genome completeness, i.e., the entirety of a microbe’s DNA is not always captured in the reconstructed genome 4. Genome completeness is primarily estimated through the presence of single-copy core genes (SCCGs), which are expected to be found in most bacteria 5. It is common to use MAGs with completeness values as low as 70% for the functional analyses of microbial communities 6. However, if a genome is estimated to be 70% complete, it is probable that many of the functions encoded in the actual genome will not be captured in the MAG, and thus the functional capacity of the genome will be underestimated 3,7. Not accounting for the level of completeness of MAGs could therefore lead researchers to incorrect interpretations of results, such as an artifactual deficit of functions being misinterpreted as a real biological signal.
A major challenge of metagenomic research is correcting or accounting for biases in statistical analyses and modelling. However, it is currently unknown how the loss of functional capacities correlates with genome completeness, and whether this relationship is constant or variable across microbial phylogeny and metabolic domains. To address these issues, we investigated the relationship between estimated genome completeness and metabolic function fullness (defined as the proportion of biochemical reactions enabled by the genes present in a genome to accomplish a metabolic function) using 11,842 genomes from diverse origins publicly accessible in the GTDB 8 (Supplementary Table S1). Genome completeness was estimated using CheckM 5, while the function fullness of KEGG modules was estimated using DRAM 2,9. To ensure robust statistical modelling based on unbiased data, only MAGs belonging to the four most diverse bacterial phyla were considered; namely, Actinobacteriota, Bacteroidota, Firmicutes and Proteobacteria (around 3,000 genomes each). Genomes were evenly distributed across the 70-100% completeness range, with each 1% window containing ca. 100 genomes from each phylum (Fig. 1a), and only genomes with contamination/redundancy values under 10% were considered (Fig. 1b). We also filtered the KEGG modules used for the modelling by only considering the functions represented in at least 5% of the genomes (i.e., a minimum representation of 592 data points).
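The fullness definition and the filtering criteria described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the record layout, the function names and the toy values are hypothetical, and treating a module as "represented" in a genome when its fullness is above zero is an assumption made here for illustration.

```python
from collections import Counter


def module_fullness(steps_present, steps_total):
    """Proportion of a module's steps enabled by the genes present in a genome."""
    if steps_total <= 0:
        raise ValueError("steps_total must be positive")
    return steps_present / steps_total


def filter_genomes(genomes, min_completeness=70.0, max_contamination=10.0):
    """Keep genomes with completeness >= 70% and contamination under 10%."""
    return [
        g for g in genomes
        if g["completeness"] >= min_completeness
        and g["contamination"] < max_contamination
    ]


def filter_modules(genomes, min_prevalence=0.05):
    """Keep modules represented in at least `min_prevalence` of the genomes.

    Assumption: a module counts as represented when its fullness is > 0.
    """
    counts = Counter()
    for g in genomes:
        for module, fullness in g["fullness"].items():
            if fullness > 0:
                counts[module] += 1
    threshold = min_prevalence * len(genomes)
    return sorted(m for m, n in counts.items() if n >= threshold)


# Toy records (hypothetical values, not taken from the study).
genomes = [
    {"id": "g1", "completeness": 98.2, "contamination": 1.1,
     "fullness": {"M00001": 1.0, "M00002": 0.5}},
    {"id": "g2", "completeness": 65.0, "contamination": 0.5,
     "fullness": {"M00001": 0.8}},
    {"id": "g3", "completeness": 81.4, "contamination": 12.0,
     "fullness": {"M00001": 0.9}},
    {"id": "g4", "completeness": 74.9, "contamination": 3.3,
     "fullness": {"M00001": 0.7, "M00002": 0.0}},
]

kept = filter_genomes(genomes)
# A high prevalence cut-off is used only because the toy data set is tiny.
modules = filter_modules(kept, min_prevalence=0.75)
```

With the 5% threshold applied to 11,842 genomes, the same rule yields the minimum of 592 data points per module stated above.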
We employed generalised linear models with binomial distribution to understand the association between genome completeness and function fullness in a filtered set of 195 KEGG metabolic modules across 11,842 genomes (Fig. 1c; details in the Online Methods). The models estimated a positive relationship between genome completeness and metabolic function fullness for 94% of the studied modules, spanning all functional domains and levels of complexity (i.e., number of enzymatic steps). Overall, the increase of completeness from 70% to 100% was associated with a 15±10% (mean±sd) increase in module fullness. This relationship remained largely constant across the completeness gradient, with a slight tendency for its slope to increase with completeness (Fig. 2A). This indicates that, while raising the threshold to exclude MAGs with low completeness from functional analyses minimises the issue, the problem persists even when only considering ‘high quality’ (>90%) MAGs. We also found evidence for significant differences in the fullness-completeness relationship among bacterial phyla. Considering all KEGG modules analysed, Proteobacteria showed the overall strongest fullness-completeness relationship, followed by Firmicutes, Actinobacteriota and Bacteroidota (Fig. 2B).
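To make the modelling step concrete, the following is a minimal sketch of a binomial generalised linear model relating the fullness of a single module to genome completeness. It is not the implementation used in this study (see the Online Methods for that): the logit link, the single predictor, the Newton-Raphson fitting routine and the synthetic data are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def sigmoid(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_binomial_glm(x, successes, trials, n_iter=50, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on the binomial likelihood.

    x         : predictor (here, completeness as a proportion)
    successes : module steps present in each genome
    trials    : total steps of the module
    Returns (b0, b1) on the original scale of x.
    """
    x_mean = sum(x) / len(x)
    xc = [xi - x_mean for xi in x]  # centring improves numerical conditioning
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, y, n in zip(xc, successes, trials):
            p = sigmoid(b0 + b1 * xi)
            r = y - n * p            # score contribution
            w = n * p * (1.0 - p)    # information weight
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0 - b1 * x_mean, b1  # undo the centring of the intercept


# Synthetic, noise-free data for one hypothetical 10-step module.
true_b0, true_b1 = -2.0, 4.0
completeness = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00]
trials = [10] * len(completeness)
successes = [n * sigmoid(true_b0 + true_b1 * c)
             for c, n in zip(completeness, trials)]

b0, b1 = fit_binomial_glm(completeness, successes, trials)

# Predicted gain in fullness when completeness rises from 70% to 100%.
gain = sigmoid(b0 + b1 * 1.00) - sigmoid(b0 + b1 * 0.70)
```

A positive `b1` corresponds to the positive fullness-completeness relationship reported above, and `gain` is the per-module analogue of the 70% to 100% increase in fullness. In practice, an established statistical package would normally be preferred over a hand-written fitting routine.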
As with the differences among phyla, the fullness-completeness relationship also varied across metabolic domains. The fullness of the modules belonging to the ‘nucleotide metabolism’ and ‘biosynthesis of other secondary metabolites’ domains was the most affected by completeness, while ‘energy metabolism’ showed the weakest fullness-completeness association (Fig. 2C). In addition, the complexity of the modules was negatively associated with the strength of the fullness-completeness relationship (Fig. 2D). This suggests that the modules with the fewest steps are the ones whose fullness is most severely affected by genome incompleteness.
Our results highlight the need to consider genome completeness when comparing functional capacities among microbial genomes or metagenomes. We argue that completeness biases should be accounted for in functional diversity analyses, analogously to how DNA sequencing depth biases are considered in diversity modelling approaches 10. At the very least, our observations urge scientists to revisit whether the functional differences observed across contrasting treatments or environments are driven by differences in genome completeness. Ideally, our results should serve as a baseline to account for genome completeness in statistical modelling and, when enough information is available, to reconstruct missing functions in MAGs through functional imputation. Only through the correction and mitigation of the functional biases introduced by uneven genome completeness will researchers be able to robustly characterise, model, and assess the functional capabilities of microbial communities.