The aim of the present study was to provide a first screening of UMGS and isolate genomes for a minimal subset of genetic functions (CMF) necessary, but not sufficient, to sustain bacterial life. To generate the CMF, two publicly available minimal genomes were downloaded from the NCBI website: the JCVI-syn 3.0 genome generated by Hutchison et al. (11) and the C. Eth-2.0 genome generated by Venetz et al. (12). The two genomes were annotated, and only the genes that were assigned with certainty and present in both genomes were retained as the reference set for the CMF. The CMF mostly includes genes involved in genetic information processing and cytosolic metabolism (Additional file 1). In particular, of the 190 genes included in the CMF (Additional file 2), 143 were assigned by KEGG orthology to genetic information processing pathways. Translation was highly represented, with 115 genes, of which 44 encode ribosomal subunits, 20 encode aminoacyl-tRNA ligases, and 24 encode tRNAs. Replication and repair was another highly represented group, with 16 genes encoding DNA polymerases, gyrases, and topoisomerases, among others. By contrast, 35 of the 190 genes are devoted to metabolic functions, mainly carbohydrate metabolism (e.g. glycolysis and gluconeogenesis, galactose metabolism, and starch and sucrose metabolism), energy metabolism (including all subunits of ATP synthase), and nucleotide metabolism. Only two genes in the CMF are exclusively devoted to environmental information processing, and another two to cellular processes.
However, 7 of the 190 genes have multiple functional assignments according to their KEGG orthology. For instance, enolase is involved in metabolism, genetic information processing, and environmental information processing; phosphoglycerate kinase is involved in both metabolism and environmental information processing; and two protein translocase subunits (SecA and SecY) are involved in genetic information processing, environmental information processing, and cellular processes.
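The construction of the CMF described above amounts to intersecting the confidently annotated gene sets of the two minimal genomes. The following Python sketch illustrates that step under stated assumptions: the function name `build_cmf`, the input layout (a list of gene name and confidence flag pairs per genome), and the toy gene lists are hypothetical and do not reproduce the annotation pipeline used in the study.

```python
def build_cmf(annotations_a, annotations_b):
    """Return the genes that are confidently annotated in both genomes.

    Each argument is a list of (gene_name, is_confident) pairs; this
    layout is an assumption made for illustration only.
    """
    confident_a = {gene for gene, is_confident in annotations_a if is_confident}
    confident_b = {gene for gene, is_confident in annotations_b if is_confident}
    # Keep only the genes shared by the two minimal genomes.
    return confident_a & confident_b


# Toy annotations (illustrative values, not data from the study).
syn3_annotations = [("eno", True), ("pgk", True), ("secA", False)]
ceth2_annotations = [("eno", True), ("pgk", True), ("secA", True), ("dnaA", True)]

cmf = build_cmf(syn3_annotations, ceth2_annotations)
print(sorted(cmf))
```

In this toy case `secA` is excluded because its assignment is uncertain in one genome, and `dnaA` is excluded because it appears in only one of the two lists.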
Next, we scanned both the UMGS and the isolate genomes from NCBI for the presence of the genes included in the CMF. To this aim, a total of 800 genomes were randomly selected and downloaded. In particular, 400 genomes of isolated species, covering a wide array of bacterial species, were randomly obtained from NCBI, and 400 were selected from the UMGS in the database of Almeida et al. (8). For more information about the genomes included in this study, and the species represented in the 400 randomly selected NCBI genomes, see Additional file 3. Each genome set was annotated and, for both the NCBI and the UMGS genomes, the presence or absence of each gene included in the CMF was computed, generating a binary matrix of CMF presence/absence profiles. For each tested genome, the percentage of adherence to the CMF and the absolute number of missing entries were also computed. Our analysis revealed that the NCBI and the UMGS genomes differ in their percentage of representativeness of the CMF (P < 0.001, Kruskal-Wallis test), with the NCBI genomes showing a higher average representativeness and a lower standard deviation than the UMGS (mean ± SD: 93.2 ± 2.9 % and 78.2 ± 11.5 % for NCBI and UMGS genomes, respectively) (Figure 1A). Figure 1B reports the overall profile of the missing CMF genes in the NCBI and UMGS genomes. The CMF was generally less well represented in the UMGS than in the NCBI isolates, with a total of 45 genes missing from more than 200 of the analyzed genomes. Interestingly, among the missing genes, the 16S rRNA gene, which is fundamental for bacterial life, was not retrieved in 237 out of 400 UMGS. Clustering analysis and PCA of the presence/absence profiles of the CMF genes in the NCBI and UMGS genomes showed a segregation between the two groups of genomes (Figure 2). The clustering analysis split the genomes into two parts, neatly separating the two types of genomes (P < 0.001, Fisher's exact test), with the UMGS grouped on the left side of the heatmap (Figure 1A).
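The binary presence/absence matrix and the per-genome adherence values described above can be sketched as follows. This is a minimal Python illustration, assuming each annotated genome is reduced to a set of gene names; the function names `presence_matrix` and `adherence`, the genome labels, and the toy gene sets are hypothetical and are not taken from the study.

```python
def presence_matrix(cmf_genes, genomes):
    """Build a binary presence/absence profile for each genome.

    cmf_genes: ordered list of CMF gene names (defines the column order).
    genomes:   dict mapping genome name -> set of annotated gene names.
    Returns a dict mapping genome name -> list of 0/1 values.
    """
    return {
        name: [1 if gene in annotated else 0 for gene in cmf_genes]
        for name, annotated in genomes.items()
    }


def adherence(profile):
    """Return (percentage of CMF genes present, number of CMF genes missing)."""
    present = sum(profile)
    return 100.0 * present / len(profile), len(profile) - present


# Toy data (illustrative only).
cmf_genes = ["eno", "pgk", "secA", "rrs"]
genomes = {
    "isolate_1": {"eno", "pgk", "secA", "rrs", "dnaA"},
    "umgs_1": {"eno", "secA"},
}

matrix = presence_matrix(cmf_genes, genomes)
for name, profile in matrix.items():
    percent, missing = adherence(profile)
    print(name, profile, percent, missing)
```

The two lists of adherence percentages (one per genome group) could then be compared with a rank-based test such as the Kruskal-Wallis test, for example with `scipy.stats.kruskal`; whether that particular implementation was used in the study is not stated.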
In the same heatmap, the UMGS genomes systematically lack more CMF genes than the NCBI genomes: on average, each CMF gene was missing from 87.1 ± 89.0 genomes in the UMGS set and from 27.1 ± 54.2 genomes in the NCBI set. Finally, the PCA carried out using the Euclidean metric showed a separation of the two groups of genomes in the two-dimensional plane (P < 0.001, permutation test with pseudo-F ratio), with the NCBI genomes less dispersed than the UMGS, indicating greater homogeneity in the representation of the CMF genes within the NCBI group (Figure 1A).
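The per-gene summary reported above (the number of genomes missing each CMF gene, given as mean ± SD) can be derived from the columns of the binary matrix. The sketch below is a hypothetical Python illustration: the function name `missing_per_gene` and the toy profiles are assumptions, and the use of the sample standard deviation is also an assumption, since the text does not specify which estimator was applied.

```python
from statistics import mean, stdev


def missing_per_gene(profiles):
    """Count, for each CMF gene (column), how many genomes lack it.

    profiles: list of 0/1 lists, one per genome, all in the same gene order.
    """
    return [sum(1 - value for value in column) for column in zip(*profiles)]


# Toy presence/absence profiles for three genomes and four CMF genes
# (illustrative values only).
umgs_profiles = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]

counts = missing_per_gene(umgs_profiles)
print(counts)

# Mean and sample standard deviation of the per-gene missing counts.
print(mean(counts), stdev(counts))
```

Applying the same function to the profiles of each group separately gives one distribution of missing counts per group, which is the quantity summarized in the comparison between the UMGS and NCBI sets.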