The aim of the present study was to provide a first screening of UMGS and isolate genomes for a minimal subset of genetic functions (CMF) that is necessary, but not sufficient, to sustain bacterial life. To generate the CMF, two publicly available minimal genomes were downloaded from the NCBI website: the JCVI-syn 3.0 genome generated by Hutchison et al. (11) and the C. Eth-2.0 genome generated by Venetz et al. (12). The two genomes were annotated, and only the genes assigned with certainty (i.e. not prefixed by "putative" or "hypothetical") and present in both genomes were retained and used as the reference set for the CMF. The CMF covered 91% and 84% of the JCVI-syn 3.0 and C. Eth-2.0 genomes, respectively. The CMF mostly includes genes involved in genetic information processing and cytosolic metabolism (Additional file 1). In particular, of the 190 genes included in the CMF (Additional file 2), 143 were assigned by KEGG orthology to genetic information processing pathways. Functions involved in translation were highly represented, with 115 genes, of which 44 encode ribosomal subunits, 20 encode aminoacyl-tRNA ligases, and 24 encode tRNAs. Replication and repair functions were also highly represented in the CMF list, with 16 genes encoding DNA polymerases, gyrases, and topoisomerases, among others. Conversely, 35 out of 190 genes are devoted to metabolic functions, especially carbohydrate metabolic pathways (e.g. glycolysis and gluconeogenesis, galactose metabolism, starch and sucrose metabolism), energy metabolism (including all subunits of ATP synthase), and nucleotide metabolism. Only two genes included in the CMF are exclusively devoted to environmental information processing, and two others to cellular processes.
However, 7 out of 190 genes showed multiple functionalities according to their KEGG orthology. For instance, enolase is involved in metabolism, genetic information processing and environmental information processing; phosphoglycerate kinase is involved in both metabolism and environmental information processing; and two protein translocase subunits (SecA and SecY) are involved in genetic information processing, environmental information processing and cellular processes.
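The selection rule used to build the CMF (genes confidently annotated in both minimal genomes) can be illustrated with a minimal sketch. The function names and the toy annotation lists below are hypothetical and are not the pipeline used in this study; the sketch also assumes that genes are matched by their annotated product name.

```python
def confident_genes(annotations):
    """Keep product names that are not prefixed by 'putative' or 'hypothetical'."""
    uncertain = ("putative", "hypothetical")
    return {
        name.strip()
        for name in annotations
        if not name.strip().lower().startswith(uncertain)
    }


def build_cmf(annotations_a, annotations_b):
    """Return the genes confidently annotated in both genomes (set intersection)."""
    return confident_genes(annotations_a) & confident_genes(annotations_b)


# Toy annotation lists (illustrative only, not the real genome annotations).
syn3_toy = [
    "Enolase",
    "Phosphoglycerate kinase",
    "hypothetical protein",
    "Protein translocase subunit SecA",
    "putative lipoprotein",
]
ceth_toy = [
    "Enolase",
    "Protein translocase subunit SecA",
    "Putative transporter",
    "DNA gyrase subunit A",
]

cmf_toy = build_cmf(syn3_toy, ceth_toy)
```

In this toy case only the two products that are confidently annotated in both lists are retained.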
Next, we scanned both UMGS and isolate NCBI genomes for the presence of the genes included in the CMF. To this aim, 10,000 human gut metagenome-assembled UMGS were randomly downloaded from the UMGS database generated by Almeida et al. (8), including the 1,175 high-quality and 893 medium-quality UMGS. In contrast, the 400 NCBI genomes were selected to include a panel of isolates from the human gut that approximates the overall phylogenetic diversity of the ecosystem. Phylogenetic information about the genomes included in this study, and the species they represent, is reported in Additional file 3. Each genome set was then annotated and, for both the NCBI and UMGS genomes, the presence or absence of each gene included in the CMF was verified, generating a binary matrix of CMF presence/absence profiles. For each tested genome, the percentage of adherence to the CMF and the absolute number of missing entries were also computed. Our analysis revealed that the NCBI and UMGS genomes differ significantly in CMF representation (P < 0.001, Kruskal-Wallis test), with the NCBI genomes showing a higher mean representativeness and a lower standard deviation than the UMGS (mean ± standard deviation: 93.2% ± 2.9 and 67.9% ± 9.5 for NCBI and UMGS genomes, respectively) (Figure 1A). In particular, when the percentage of adherence to the CMF was compared among UMGS with a CheckM score equal to or greater than 90% (high quality), UMGS with a CheckM score below 90%, and NCBI genomes, the NCBI genomes showed significantly higher CMF representation (P < 0.001, Kruskal-Wallis test). As expected, high-quality UMGS showed a higher percentage of adherence to the CMF than low-quality UMGS (P < 0.001, Kruskal-Wallis test) (Additional file 4). The overall profile of the missing CMF genes in NCBI and UMGS genomes is reported in Figure 1B.
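The presence/absence matrix and the two per-genome summaries described above (percentage of adherence and number of missing entries) can be sketched as follows. This is an illustrative example under stated assumptions: the gene symbols, genome identifiers and function names are hypothetical, and each genome is represented simply as the set of gene names found by annotation.

```python
def presence_matrix(genomes, cmf):
    """Build binary CMF profiles.

    genomes: dict mapping genome id -> set of annotated gene names
    cmf:     set of reference gene names
    Returns the ordered gene list and a dict genome id -> list of 0/1 values.
    """
    genes = sorted(cmf)
    matrix = {
        genome_id: [1 if gene in annotated else 0 for gene in genes]
        for genome_id, annotated in genomes.items()
    }
    return genes, matrix


def adherence(profile):
    """Return (percentage of CMF genes present, number of CMF genes missing)."""
    present = sum(profile)
    return 100.0 * present / len(profile), len(profile) - present


# Toy data (hypothetical gene symbols and genome ids).
cmf_demo = {"rpsA", "gyrA", "eno", "atpA"}
genomes_demo = {
    "NCBI_1": {"rpsA", "gyrA", "eno", "atpA"},
    "UMGS_1": {"rpsA", "eno", "xyz"},
}

genes_demo, matrix_demo = presence_matrix(genomes_demo, cmf_demo)
summary_demo = {gid: adherence(row) for gid, row in matrix_demo.items()}
```

Genes outside the CMF (such as `xyz` in the toy data) are ignored, so the adherence value depends only on the reference set.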
No differences in CMF hits were observed among UMGS genomes assigned at different phylogenetic levels (data not shown). The CMF genes were generally less represented in UMGS than in the NCBI isolates, with a total of 45 genes missing from more than 50% of the analyzed genomes. Among the missing genes, the 16S rRNA gene, fundamental for bacterial life, was not retrieved in 8,034 out of 10,000 UMGS. However, this is probably inherent to the structure of the 16S rRNA gene, which consists of conserved and variable regions and can show an overall similarity of up to 97% between two different bacterial species, making its assembly from community metagenomic data very difficult (13). Focusing on specific functionalities, we observed a systematic absence of the genes encoding ATP-dependent metalloprotease FtsH, DNA topoisomerase subunit 4, Dihydrolipoyllysine-residue acetyltransferase and Histidine biosynthesis protein HisB, each of which was missing from more than 85% of the UMGS genomes. This group of 4 genes represents a set of biologically diverse functions. The ATP-dependent metalloprotease FtsH gene encodes a metalloprotease that plays a crucial role in the control of membrane protein integrity and regulates LPS biosynthesis (14), whereas the DNA topoisomerase subunit 4 gene encodes a protein that is crucial for chromosome segregation, as it decatenates newly replicated chromosomes (15). In particular, the lack of the latter gene can leave a bacterium unable to resolve DNA supercoils, compromising the viability of the organism (16). On the other hand, Dihydrolipoyllysine-residue acetyltransferase is a component of the pyruvate dehydrogenase complex, and its absence from the UMGS genomes may suggest a strictly anaerobic lifestyle. Finally, Histidine biosynthesis protein HisB is involved in steps 6 and 8 of the sub-pathway that synthesizes L-histidine from 5-phospho-alpha-D-ribose 1-diphosphate.
The function of this protein is crucial for bacterial life, since histidine is required for multiple biological processes (17). However, the lack of genes involved in histidine metabolism may indicate species variants that are auxotrophic for histidine, possibly occurring in a highly syntrophic ecosystem such as the human gut (17, 18).
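The identification of genes that are missing from most genomes (for example, from more than 85% of the UMGS) amounts to a column-wise summary of the binary matrix. The sketch below is a minimal illustration; the function names, the gene symbols and the toy profiles are hypothetical, and the threshold is passed as a fraction.

```python
def missing_fraction(profiles):
    """Fraction of genomes lacking each gene (one value per matrix column)."""
    n_genomes = len(profiles)
    n_genes = len(profiles[0])
    return [
        sum(1 - row[j] for row in profiles) / n_genomes
        for j in range(n_genes)
    ]


def frequently_missing(genes, profiles, threshold=0.85):
    """Genes absent from more than `threshold` of the genomes."""
    fractions = missing_fraction(profiles)
    return [gene for gene, frac in zip(genes, fractions) if frac > threshold]


# Toy matrix: 10 genomes x 3 genes (hypothetical symbols and values).
genes_toy = ["ftsH", "eno", "hisB"]
profiles_toy = [[0, 1, 0] for _ in range(9)] + [[1, 1, 1]]

flagged_toy = frequently_missing(genes_toy, profiles_toy)
```

In the toy matrix, `ftsH` and `hisB` are absent from 9 of 10 genomes and are therefore flagged, whereas `eno` is present everywhere.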
Clustering analysis and PCA of the presence/absence profiles of CMF genes in NCBI and UMGS genomes showed a segregation between the two groups of genomes (PCA based on Euclidean distances and logistic PCA are provided in Figure 2A-B and Additional File 5, respectively). Clustering analysis clearly separated the genomes into two groups, marking a difference between the two types of genomes (P < 0.001, Fisher's exact test), with UMGS grouped on the left and right sides of the heatmap, flanking the NCBI genomes (Figure 2A). The same figure shows that UMGS genomes systematically lack more genes than NCBI genomes (on average, a single CMF gene was missing from 32% ± 34.5 of the genomes in the UMGS set and from 11.5% ± 24.2 of the genomes in the NCBI set). The PCA carried out using the binary Euclidean metric showed a separation of the genomes in the two-dimensional plane (P < 0.001, permutation test with pseudo-F ratio), with NCBI genomes less dispersed than UMGS, indicating a more homogeneous representation of the CMF genes within the NCBI group (Figure 2B).
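As an illustration of the distance underlying this ordination, the sketch below assumes that the "binary Euclidean metric" denotes the ordinary Euclidean distance computed on 0/1 profiles; under that assumption the squared distance between two genomes equals the number of CMF genes at which they differ. The dispersion measure (mean distance to the group centroid) is likewise an assumption chosen for illustration, and all names and toy values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of positions at which two binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def dispersion(profiles):
    """Mean Euclidean distance of each profile to the group centroid."""
    n = len(profiles)
    centroid = [sum(column) / n for column in zip(*profiles)]
    return sum(euclidean(row, centroid) for row in profiles) / n


# Toy profiles (hypothetical): a homogeneous and a heterogeneous group.
tight_group = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
loose_group = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]

d_example = euclidean([1, 1, 0, 1], [1, 0, 0, 0])
```

A group whose members share most CMF genes, like `tight_group`, yields a smaller dispersion than a heterogeneous one, which is the pattern described for NCBI versus UMGS genomes.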
The genome quality of the 10,000 UMGS was obtained by retrieving the respective CheckM scores. Results were superimposed on the same PCA (Figure 2C), where UMGS are color-coded according to the corresponding CheckM value. Interestingly, our analysis highlighted a gradient of CheckM variation across the two-dimensional space, indicating decreasing genome completeness along the PC2 component. Confirming this observation, the CheckM score was significantly and negatively correlated with the PC2 component (P < 0.0001), showing that the completeness of the UMGS gradually decreases as the PC2 coordinate increases. Finally, a positive correlation between the CMF hits and the CheckM score in the 10,000 UMGS was observed (Kendall’s correlation tau = 0.6, P < 0.001) (Figure 2D).
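The rank correlation between CheckM score and CMF hits can be illustrated with a small pure-Python computation of Kendall's tau. The sketch assumes the tau-b variant, which corrects for ties; the variant actually used in this study is not stated. The function name and the toy values are hypothetical, and the quadratic pairwise loop is meant for illustration rather than for 10,000 genomes.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pairwise loop)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from all counts
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = math.sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Toy values (hypothetical): CheckM completeness and number of CMF hits.
checkm_toy = [55, 70, 80, 92, 98]
hits_toy = [100, 120, 118, 150, 170]

tau_toy = kendall_tau_b(checkm_toy, hits_toy)
```

In the toy data, 9 of the 10 genome pairs are concordant and 1 is discordant, so tau is (9 - 1) / 10 = 0.8.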