The aim of the present study was to provide a first screening of UMGS and isolate genomes for a minimal subset of genetic functions (CMF) that are necessary, but not sufficient, to sustain bacterial life. To generate the CMF, two publicly available minimal genomes were downloaded from the NCBI website: the JCVI-syn 3.0 genome generated by Hutchison et al. (11) and the C. Eth-2.0 genome generated by Venetz et al. (12). The two genomes were annotated, and only the genes that were assigned with certainty (i.e. not prefixed by "putative" or "hypothetical") and present in both genomes were retained and used as the reference set for the CMF. The CMF mostly includes genes involved in genetic information processing and cytosolic metabolism (Additional file 1). In particular, of the 190 genes included in the CMF (Additional file 2), 143 were assigned by KEGG orthology to genetic information processing pathways. Translation functions are highly represented, with 115 genes, of which 44 encode ribosomal subunits, 20 encode aminoacyl-tRNA ligases, and 24 encode tRNAs. Replication and repair functions are also highly represented in the CMF list, with 16 genes encoding DNA polymerases, gyrases, and topoisomerases, among others. Conversely, 35 out of 190 genes are devoted to metabolic functions, mainly carbohydrate metabolic pathways (e.g. glycolysis and gluconeogenesis, galactose metabolism, starch and sucrose metabolism), energy metabolism (including all subunits of ATP synthase), and nucleotide metabolism. Only two genes included in the CMF are exclusively devoted to environmental information processing, and another two to cellular processes.
However, 7 out of 190 genes were assigned to multiple functional categories according to their KEGG orthology. For instance, enolase is involved in metabolism, genetic information processing and environmental information processing; phosphoglycerate kinase is involved in both metabolism and environmental information processing; and two protein translocase subunits (SecA and SecY) are involved in genetic information processing, environmental information processing and cellular processes.
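The construction of the CMF described above (keep only the confidently annotated genes, then intersect the two minimal genomes) can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the function names `confident_genes` and `build_cmf`, the toy product lists, and the matching of genes by exact product name are all assumptions made for the example.

```python
# Hypothetical sketch: derive a shared gene set from two annotated genomes.
# Genes are matched by exact product name, which is an assumption.
UNCERTAIN_PREFIXES = ("putative", "hypothetical")


def confident_genes(products):
    """Return product names that are not prefixed by an uncertainty label."""
    return {
        name.strip()
        for name in products
        if not name.strip().lower().startswith(UNCERTAIN_PREFIXES)
    }


def build_cmf(genome_a, genome_b):
    """Keep only genes confidently annotated in both genomes."""
    return confident_genes(genome_a) & confident_genes(genome_b)


# Toy annotation outputs (illustrative only, not real annotation results).
syn3_products = [
    "Enolase",
    "Phosphoglycerate kinase",
    "hypothetical protein",
    "Protein translocase subunit SecA",
]
ceth_products = [
    "Enolase",
    "putative Phosphoglycerate kinase",
    "Protein translocase subunit SecA",
    "Hypothetical protein",
]

cmf = build_cmf(syn3_products, ceth_products)
print(sorted(cmf))
```

In this toy case phosphoglycerate kinase is dropped because it is only confidently annotated in one of the two genomes, which mirrors the "present in both genomes" criterion.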
Next, we scanned both UMGS and NCBI isolate genomes for the presence of the genes included in the CMF. To this aim, a total of 10,400 genomes were selected and downloaded, covering the most represented phyla in the gastrointestinal tract for both UMGS and NCBI genomes (see Additional file 3 for the distribution of the genomes at phylum level). In particular, 400 genomes of isolated species were obtained from NCBI, covering a wide array of bacterial species, and 10,000 genomes were selected from the UMGS in the database generated by Almeida et al. (8). For more information about the genomes included in this study and the species they represent, see Additional file 4. Each genome set was then annotated and, for both the NCBI and UMGS genomes, the presence or absence of each gene included in the CMF was verified, generating a binary matrix of CMF presence/absence profiles. For each tested genome, the percentage of adherence to the CMF and the absolute number of missing entries were also computed. Our analysis revealed that the NCBI and UMGS genomes differ significantly in their CMF content (P < 0.001, Kruskal-Wallis test), with the NCBI genomes showing a higher average representativeness value and a lower standard deviation than the UMGS genomes (mean ± standard deviation: 93.2% ± 2.9 and 67.9% ± 9.5 for NCBI and UMGS genomes, respectively) (Figure 1A). Figure 1B reports the overall profile of the missing CMF genes in NCBI and UMGS genomes. The CMF genes were generally less represented in UMGS than in the NCBI isolates, with a total of 45 genes lacking in more than 50% of the analyzed genomes. Among the missing genes, the 16S rRNA gene, fundamental for bacterial life, was not retrieved in 8,034 out of 10,000 UMGS.
This is due to the nature of the 16S rRNA gene, which consists of conserved and variable regions and has an overall similarity that can be up to 97% between two different bacterial species, making its assembly from metagenomic data of microbial communities very hard. Analyzing the data at a higher level of detail showed that the genes encoding ATP-dependent metalloprotease FtsH, DNA topoisomerase subunit 4, Dihydrolipoyllysine-residue acetyltransferase and Histidine biosynthesis protein HisB were the most frequently absent from the UMGS genomes analyzed, i.e. systematically missing in more than 85% of the UMGS genomes. This group of 4 genes represents a set of biologically diverse functions. The ATP-dependent metalloprotease FtsH gene encodes a metalloprotease that plays a crucial role in the control of membrane protein integrity and regulates LPS biosynthesis (13). The DNA topoisomerase subunit 4 gene encodes a protein that is crucial in the chromosome segregation process, decatenating newly replicated chromosomes (14). A lack of this gene can make it impossible for a bacterium to resolve DNA supercoils, compromising the viability of the organism (15). On the other hand, the lack of Dihydrolipoyllysine-residue acetyltransferase, a component of the pyruvate dehydrogenase complex, can confirm the strictly anaerobic propensity of the newly identified bacterial species. Finally, Histidine biosynthesis protein HisB is involved in steps 6 and 8 of the sub-pathway that synthesizes L-histidine from 5-phospho-alpha-D-ribose 1-diphosphate. The function of this protein is crucial for bacterial life, since histidine is required for multiple biological processes (16).
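The presence/absence screening, the per-genome adherence values and the identification of frequently missing genes described above can be sketched as follows. This is a hedged illustration under simplifying assumptions, not the code used in the study: the function names, the gene symbols and the three toy genomes are hypothetical, and each annotated genome is reduced to a plain set of gene names.

```python
# Hypothetical sketch of the CMF presence/absence screening.
def presence_matrix(genomes, cmf_genes):
    """Binary profiles: one row per genome, one column per CMF gene (1 = present)."""
    return {
        genome_id: [1 if gene in annotated else 0 for gene in cmf_genes]
        for genome_id, annotated in genomes.items()
    }


def adherence(row):
    """Return (percentage of CMF genes present, absolute number missing)."""
    present = sum(row)
    return 100.0 * present / len(row), len(row) - present


def frequently_missing(matrix, cmf_genes, threshold=0.5):
    """CMF genes absent from more than `threshold` (a fraction) of the genomes."""
    n_genomes = len(matrix)
    missing_genes = []
    for j, gene in enumerate(cmf_genes):
        n_missing = sum(1 - row[j] for row in matrix.values())
        if n_missing / n_genomes > threshold:
            missing_genes.append(gene)
    return missing_genes


# Toy data (illustrative only).
cmf_genes = ["ftsH", "hisB", "eno", "pgk"]
genomes = {
    "isolate_1": {"ftsH", "hisB", "eno", "pgk"},
    "umgs_1": {"eno", "pgk"},
    "umgs_2": {"eno"},
}

matrix = presence_matrix(genomes, cmf_genes)
for genome_id, row in matrix.items():
    pct, n_missing = adherence(row)
    print(genome_id, row, f"{pct:.1f}%", n_missing)
print(frequently_missing(matrix, cmf_genes))
```

The per-genome percentages of the two groups could then be compared with a rank-based test; `scipy.stats.kruskal` is one available implementation of the Kruskal-Wallis test, although the source does not state which software was used.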
Clustering analysis and PCA of the presence/absence profiles of CMF genes in NCBI and UMGS genomes showed a segregation between the two groups of genomes (Figure 2). Clustering analysis clearly separates the genomes into two parts, marking a difference between the two types of genomes (P < 0.001, Fisher's exact test), with UMGS grouped on the left and right sides of the heatmap, flanking the NCBI genomes (Figure 2A). The same figure shows that UMGS genomes systematically lack more genes than NCBI genomes (on average, a single CMF gene is missing in 32% ± 34.5 of the UMGS genomes and in 11.5% ± 24.2 of the NCBI genomes). Finally, the PCA carried out using the binary Euclidean metric showed a separation of the genomes in the two-dimensional plane (P < 0.001, permutation test with pseudo-F ratio), with NCBI genomes less dispersed than UMGS, indicating a greater homogeneity in the representativeness of the CMF genes within the NCBI group (Figure 2A).
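On binary presence/absence profiles, the Euclidean distance between two genomes is the square root of the number of CMF genes on which they differ. The sketch below computes this distance and uses the mean pairwise distance within a group as a simple stand-in for dispersion. It is only an illustration of the metric: it does not reproduce the PCA or the permutation test, and the function names and toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def binary_euclidean(u, v):
    """Euclidean distance between two 0/1 profiles of equal length."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_pairwise_distance(profiles):
    """Average distance over all pairs in a group (0.0 if fewer than two profiles)."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    return sum(binary_euclidean(u, v) for u, v in pairs) / len(pairs)


# Toy profiles (illustrative only): rows are genomes, columns are CMF genes.
ncbi_profiles = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]]
umgs_profiles = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]

ncbi_dispersion = mean_pairwise_distance(ncbi_profiles)
umgs_dispersion = mean_pairwise_distance(umgs_profiles)
print(f"NCBI: {ncbi_dispersion:.3f}  UMGS: {umgs_dispersion:.3f}")
```

In this toy example the isolate-like profiles are nearly identical and therefore less dispersed than the UMGS-like profiles, which is the qualitative pattern reported for the real data.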