The aim of the present study was to provide a first screening of UMGS and isolate genomes for a minimal subset of genetic functions (CMF) that are necessary, but not sufficient, to sustain bacterial life. To generate the CMF, two publicly available minimal genomes were downloaded from the NCBI website: the JCVI-syn 3.0 genome generated by Hutchison et al. (11) and the C. Eth-2.0 genome generated by Venetz et al. (12). The two genomes were annotated, and only the genes that were assigned with certainty (i.e. not prefixed by "putative" or "hypothetical") and present in both genomes were retained and used as the reference set for the CMF. The CMF covered the JCVI-syn 3.0 and C. Eth-2.0 genomes at 91% and 84%, respectively. The CMF mostly includes genes involved in genetic information processing and cytosolic metabolism (Additional file 1). In particular, of the 183 genes included in the CMF (Additional file 2), 143 were assigned by KEGG orthology to genetic information processing pathways. Functions involved in translation are highly represented, accounting for 115 genes, of which 44 encode ribosomal subunits, 20 encode aminoacyl-tRNA ligases, and 24 encode tRNAs. Replication and repair form another highly represented group of functions in the CMF list, with 16 genes encoding DNA polymerases, gyrases, and topoisomerases, among others. Conversely, 35 out of 183 genes are devoted to metabolic functions, in particular carbohydrate metabolic pathways (e.g. glycolysis and gluconeogenesis, galactose metabolism, starch and sucrose metabolism), energy metabolism (including all subunits of ATP synthase), and nucleotide metabolism. Only two genes included in the CMF are exclusively devoted to environmental information processing, and two others to cellular processes.
However, 7 out of 183 genes showed multiple functionalities according to their KEGG orthology. For instance, enolase is involved in metabolism, genetic information processing and environmental information processing; phosphoglycerate kinase is involved in both metabolism and environmental information processing; and two protein translocase subunits (SecA and SecY) are involved in genetic information processing, environmental information processing and cellular processes.
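The selection rule used to build the CMF (keep only confidently annotated genes, then intersect the two minimal genomes) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the annotation tables, gene names, and function names below are hypothetical, and the denominator used for the coverage percentage (confidently annotated genes of each genome) is an assumption, since the text does not state how coverage was computed.

```python
# Hypothetical toy annotations: gene name -> product description.
# Real annotation output would be parsed from the annotation tool's files.
UNCERTAIN_PREFIXES = ("putative", "hypothetical")

def confident_genes(annotations):
    """Keep gene names whose product is not prefixed by an uncertain label."""
    return {
        gene
        for gene, product in annotations.items()
        if not product.strip().lower().startswith(UNCERTAIN_PREFIXES)
    }

def build_cmf(annotations_a, annotations_b):
    """Return the genes that are confidently annotated in both genomes."""
    return confident_genes(annotations_a) & confident_genes(annotations_b)

def coverage(cmf, annotations):
    """Percentage of a genome's confidently annotated genes found in the CMF.

    Assumption: the denominator is the confidently annotated gene set.
    """
    kept = confident_genes(annotations)
    return 100.0 * len(cmf & kept) / len(kept) if kept else 0.0

genome_a = {
    "eno": "Enolase",
    "pgk": "Phosphoglycerate kinase",
    "secA": "Protein translocase subunit SecA",
    "geneX": "hypothetical protein",
    "secY": "Protein translocase subunit SecY",
}
genome_b = {
    "eno": "Enolase",
    "pgk": "Phosphoglycerate kinase",
    "secA": "putative protein translocase subunit SecA",
    "geneY": "Hypothetical protein",
    "geneZ": "DNA gyrase subunit A",
}

cmf = build_cmf(genome_a, genome_b)
print(sorted(cmf))
print(coverage(cmf, genome_a), coverage(cmf, genome_b))
```

In this toy case, secA is dropped because one of the two annotations is prefixed by "putative", which mirrors the requirement that a gene be assigned with certainty in both genomes.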
We scanned both UMGS and NCBI isolate genomes for the presence of the genes included in the CMF. To this aim, 10,000 human gut metagenome-assembled UMGS were randomly downloaded from the UMGS database generated by Almeida et al. (8), including the 1,175 high quality and 893 mid quality UMGS. In contrast, the 400 NCBI genomes were selected to form a panel of isolates from the human gut that approximates the overall phylogenetic diversity of the ecosystem. Phylogenetic information about the genomes included in this study, and the species represented among the selected genomes, are reported in Additional file 3. Each genome set was then annotated and, for both the NCBI and UMGS genomes, the presence or absence of each gene included in the CMF was verified, generating a binary matrix of CMF presence/absence profiles. For each tested genome, the percentage of adherence to the CMF and the absolute number of missing entries were also computed. Our analysis revealed that the NCBI and the UMGS genomes differ significantly in CMF representation (P < 0.001, Kruskal-Wallis test), with the NCBI genomes showing a higher average representativeness value and a lower standard deviation than the UMGS (93.2% ± 2.9 and 67.9% ± 9.5, mean ± standard deviation, for NCBI and UMGS genomes, respectively) (Figure 1A). In particular, when comparing the percentage of adherence to the CMF among UMGS with a CheckM score equal to or greater than 90% (high quality), UMGS with a CheckM score below 90% (low quality) and NCBI genomes, the latter showed significantly higher CMF representation (P < 0.001, Kruskal-Wallis test) and, as expected, high quality UMGS showed a higher percentage of adherence to the CMF than low quality UMGS (P < 0.001, Kruskal-Wallis test) (Additional file 4). The overall profile of the missing CMF genes in NCBI and UMGS genomes is reported in Figure 1B.
No differences in CMF hits were observed among the UMGS genomes assigned at different phylogenetic levels (data not shown). The CMF genes were generally less represented in UMGS than in the NCBI isolates, with a total of 45 genes missing in more than 50% of the analyzed genomes.
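As an illustration of the presence/absence screening described above, the following sketch builds a binary matrix, computes the percentage of adherence and the number of missing entries per genome, and lists the CMF genes absent from more than half of the genomes. The genome names, gene sets, and function names are hypothetical toy values, not data from this study.

```python
def presence_matrix(genomes, cmf):
    """Rows = genomes, columns = CMF genes (sorted); 1 = present, 0 = absent."""
    genes = sorted(cmf)
    matrix = {
        name: [int(gene in found) for gene in genes]
        for name, found in genomes.items()
    }
    return genes, matrix

def adherence(row):
    """Return (percentage of CMF genes present, number of CMF genes missing)."""
    present = sum(row)
    return 100.0 * present / len(row), len(row) - present

def frequently_missing(genes, matrix, threshold=0.5):
    """CMF genes absent from more than `threshold` of the genomes."""
    rows = list(matrix.values())
    return [
        gene
        for j, gene in enumerate(genes)
        if sum(1 - row[j] for row in rows) / len(rows) > threshold
    ]

# Hypothetical toy input: annotated gene names found in each genome.
cmf = {"eno", "pgk", "secA", "secY"}
genomes = {
    "isolate_1": {"eno", "pgk", "secA", "secY"},
    "umgs_1": {"eno", "pgk"},
    "umgs_2": {"eno"},
}

genes, matrix = presence_matrix(genomes, cmf)
for name, row in matrix.items():
    percent, missing = adherence(row)
    print(name, row, f"{percent:.1f}%", missing)
print(frequently_missing(genes, matrix))
```

The per-genome adherence values produced this way are the quantities that a rank-based test such as Kruskal-Wallis would then compare between genome groups.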
Clustering analysis and PCA of the presence/absence profiles of CMF genes in NCBI and UMGS genomes showed a segregation between the two groups of genomes (PCA based on Euclidean distances and logistic PCA are provided in Figure 2A-2B and Additional file 5, respectively). Clustering analysis clearly separates the genome set into two parts, marking a difference between the two types of genomes (P < 0.001, Fisher's exact test), with UMGS grouped on the left and right sides of the heatmap, flanking the NCBI genomes (Figure 2A). The same figure shows that UMGS genomes systematically lack more genes than NCBI genomes (on average, a single CMF gene was missing from 32% ± 34.5 and 11.5% ± 24.2 of the genomes in the UMGS and NCBI sets, respectively). The PCA carried out using the binary Euclidean metric showed a separation of the genomes in the two-dimensional plane (P < 0.001, permutation test with pseudo-F ratio), with NCBI genomes less dispersed than UMGS, indicating a more homogeneous representation of the CMF genes within the NCBI group (Figure 2B).
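For readers unfamiliar with a permutation test based on a pseudo-F ratio, the following is a minimal sketch in the style of PERMANOVA, computed on squared Euclidean distances between binary profiles. It is offered under stated assumptions and is not necessarily the implementation used for Figure 2B: the toy profiles, the function names, the number of permutations, and the random seed are all hypothetical.

```python
import random

def sq_euclidean(a, b):
    """Squared Euclidean distance between two binary profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pseudo_f(profiles, labels):
    """Pseudo-F ratio: among-group vs within-group sums of squared distances."""
    n = len(profiles)
    groups = sorted(set(labels))
    ss_total = sum(
        sq_euclidean(profiles[i], profiles[j])
        for i in range(n) for j in range(i + 1, n)
    ) / n
    ss_within = 0.0
    for group in groups:
        idx = [i for i, lab in enumerate(labels) if lab == group]
        ss_within += sum(
            sq_euclidean(profiles[i], profiles[j])
            for k, i in enumerate(idx) for j in idx[k + 1:]
        ) / len(idx)
    if ss_within == 0:
        return float("inf")  # no within-group variation at all
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permutation_test(profiles, labels, n_perm=199, seed=0):
    """Return the observed pseudo-F and a permutation P value."""
    rng = random.Random(seed)
    observed = pseudo_f(profiles, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign group labels at random
        if pseudo_f(profiles, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy profiles over six CMF genes.
ncbi = [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 1], [1, 1, 1, 1, 1, 1]]
umgs = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]]
profiles = ncbi + umgs
labels = ["NCBI"] * len(ncbi) + ["UMGS"] * len(umgs)

f_obs, p_value = permutation_test(profiles, labels)
print(f"pseudo-F = {f_obs:.2f}, P = {p_value:.3f}")
```

The P value is the fraction of label permutations whose pseudo-F is at least as large as the observed one, with the observed labelling counted once, so it can never be exactly zero.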
The genome quality of the 10,000 UMGS was obtained by retrieving the respective CheckM scores. Results were superimposed on the same PCA (Figure 2C), where UMGS are color-coded according to the corresponding CheckM value. Interestingly, our analysis highlighted a gradient of CheckM variation across the two-dimensional space, indicating decreasing genome completeness along PC2. Confirming this observation, the CheckM score is significantly (P < 0.0001) and negatively correlated with PC2, showing that the completeness of the UMGS gradually decreases as the PC2 coordinate increases. Finally, a positive correlation between the CMF hits and the CheckM score in the 10,000 UMGS was observed (Kendall's correlation tau = 0.6, P < 0.001) (Figure 2D).
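The rank correlation reported above can be illustrated with a minimal sketch of Kendall's tau-b, which counts concordant and discordant pairs and corrects for ties. The CheckM scores and CMF hit counts below are hypothetical toy values, and the function name is illustrative. Because this pairwise loop scales quadratically, a library routine such as `scipy.stats.kendalltau` would be more practical for 10,000 genomes.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b from pairwise concordance, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue            # tied in both variables: ignored
            if dx == 0:
                ties_x += 1         # tied only in x
            elif dy == 0:
                ties_y += 1         # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical toy values: CheckM completeness (%) and CMF hits per genome.
checkm = [95.0, 88.5, 72.0, 60.1, 91.2]
cmf_hits = [170, 150, 120, 125, 160]

tau = kendall_tau_b(checkm, cmf_hits)
print(f"tau = {tau:.2f}")
```

A positive tau means that genomes with higher CheckM scores tend to carry more CMF genes, which is the direction of the association reported for the UMGS.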