Map of diazotrophs with a complete Nif core
We obtained 118 references of organisms with the preliminary systematic review from the academic literature, of which 81 and 37 represented bacterial and archaeal genomes, respectively. RAFTS³G was used to identify 22 classes of Nif proteins from these 118 genomes: NifH, NifD, NifK, NifE, NifN, NifB, NifX, NifU, NifS, NifW, NifT, NifZ, NifM, NifJ, NifF, NifP, NifI1, NifI2, NifA, NifL, NifO, and NifX-associated. As a result, we generated 336 groups (a total of 2,156 Nif sequences) that constituted the Nif protein database. Data mining using the SWeeP method and Nif clusters obtained with RAFTS³G was developed to predict N2-fixing species from bacterial ‘proteomes’ (the ‘proteome’ as defined in [29]) from the National Center for Biotechnology Information (NCBI) GenBank database. As a result, we enriched our potential diazotrophic list with 711 prokaryotic proteomes. Among them, 544 were identified a complete Nif core (defined as the minimum set composed of NifH, NifD, NifK, NifE, NifN and NifB) and 167 had partial or no similarity with some Nif proteins from the Nif core set (Additional File 2 - Table S1). Among the proteomes without the Nif core set, 87 (approximately 52%), such as Asaia bogorensis NBRC 16594, Rhizobium sp Y9, Synechococcus sp PCC 6312, and Agrobacterium tumefaciens, were unable to identify any Nif sequence similarity. In proteomes with some Nif core sets, 12 proteomes (approximately 7%) were identified as Nif proteins with significant similarities with NifD (e.g. Moorella thermoacetica) and with NifN (e.g. Aminobacter sp. MSH1) and 33 proteomes with NifH (approximately 20%), such as Clostridium botulinum CDC1436, Fibrobacter succinogenes subsp. succinogenes S85, Fusobacterium varium, Desulfitobacterium dehalogenans ATCC 51507, Megasphaera elsdenii 1414 among others. In proteomes with partial Nif core sets assembled, we highlight Thermacetogenium phaeum DSM 12270 (containing NifH, NifD, NifK, NifE, and NifB) and Clostridium drakei, Roseiflexus castenholzii DSM 13941, Roseiflexus sp RS1, Moorella thermoacetica, Caldicellulosiruptor kronotskyensis 2002, Candidatus Desulforudis audaxviator MP104C and Syntrophothermus lipocalidus DSM 12680 (containing NifH, NifD, NifK, and NifB) as described in Table 1.
Finally, merging the 544 species with a complete Nif core set and 118 identified from the literature, we obtained 662 archaeal and bacterial proteomes containing the Nif core set. Of these, we identified 52 species without literature linking the species (or genome reference) to BFN (Table 2). Additional File 2 - Chart S1 contains the list of diazotrophic organisms identified in this study, the genome accession number from GenBank and the bibliographic references for each organism identified as diazotrophic.
RAFTS³G was utilized in 662 proteomes (a total of 3,143,063 protein sequences) and we generated 1,021,121 clusters: 327,558 sequences clustered with at least two sequences and 693,563 of the protein sequences were single. With the grouped sequences, we identified the major clusters containing Nif proteins (Additional File 2, Table S2), totaling a new database with 10,076 Nif sequences. The total number of Nif core sequences exceeds the number of proteomes analyzed: NifH (1,001), NifD (680), NifK (681), NifE (686), NifN (667), and NifB (669) (Additional File 2, Table S3). Additionally, we identified Skermanella pratensis and Skermanella sp. TT6 as the complete Nif core set in addition to the Nif protein accessories: NifA, NifW, NifP, NifS, NifU, NifX, NifXa, NifT and NifJ (Additional File 2, Chart S2) and for Asaia bogorensis NBRC 16594 [MTCC 4041] had no sequences of Nif core sequences observed in silico.
The k indicated to Nif core suggests that 10 diazotrophic groups
We determined the best number of clusters (k) for the Nif core proteome dataset using various statistical methods. From the 662 proteomes with a complete Nif core, we constructed SWeeP vectors for each species (Additional File 1, Figure S3). The k values were identified by comparing Hierarchical Clustering on Principal Components (HCPC) results with the Gap statistic, Elbow method, and Silhouette coefficient results. The Elbow method suggested optimal k values of 5, 10, 13, and 16; the Silhouette coefficients indicated 10, 13, and 18; and the Gap statistics suggested k = 8 (Additional File 1, Figure S4). In HCPC, the best k values were greater than 8 clusters (Additional File 1, Figure S5). We examined the k values 8, 10, and 13 and compared each cluster's consistency with the Nif core ordered dissimilarity heatmap (Figure 1 and Additional File 1, Figure S4). For k = 13, cyanobacteria species were subdivided into two groups (groups 6 and 7) and in heatmap were unique (Figure 1 A, in blue). For k = 8, all Paenibacillus species (Figure 1A, in orange) were grouped with Azotobacter spp. (group 3), but were proximity with Geobacter group in Heatmap. At k = 10, all cyanobacteria species were grouped into group 6, and Paenibacillus spp. were grouped in group 4 with Geobacter spp. (Figure 1, in red). Thus, k=10 was adopted for a more cohesive grouping. Group-phyla correlations are presented in Table 3.
The PCA-HCPC yields ten groups and four with great variance
PCA results of the Nif core proteins mapping the 10 groups were visualized in 2D and 3D (Additional File 1, Figure S7 A and B). PC1 and PC2 in PCA 2D explained 7.43% and 6.01% of the variance, respectively, indicating insufficient explained variance (PC < 10%) [30]. Therefore, a 3D visualization map was used (Figure 2). Groups with the highest auto values were identified in G1, G3, G6, and G10 (Figure 2 groups delimited in colorscolor).
The groups presented distinct phylogenetic and functional characteristics with significant synteny patterns
The 10 groups were arranged from the innermost to the outermost layer, consistent with the phylogenetic order (Figure 3), as follows: G3, G4, G6, G5, G2, G1, G7, G8, G9, and G10. These groups presented distinct phylogenetic and functional characteristics, with significant synteny patterns observed in several groups (Table 4 and Additional File 1, Figures S1-S16).
Group 3 (G3) constituted the majority of obligate anaerobic species of Archaea (Methanosarcina spp. division), Firmicutes (Clostridium spp. division), Fusobacteria, Spirochaetes, Verricumicrobia (Coraliomargarita akajimensis), Chloroflexi, Chlorobi, Bacteroidetes and Deltaproteobacteria (Desulfovibrio spp. division). We determined that this species had the most compact pattern of the general constitution of the nif genes (nifHI1I2DKENB) (Additional File 1 - Figure S8). Accessory structures such as NifX and NifV occur near the nif core (e.g., Ilyobacter polytropus DSM2926). C. akajimensis, the unique aerobic microorganism of G3, contained the positive regulator nifA, constituting the nifHDKENBVA cluster. The nifA sequences were observed simultaneously with the regulators anaerobic nifI1I2 in some species of Bacteroidetes, Spirochaetes, Verrucomicrobia and δ-proteobacteria.
Group 4 (G4) was composed of obligate anaerobes of Euryarcheota (Methanobacterium spp. division), Firmicutes (bacilares), Nitrospirae, and some facultative anaerobes or microaerobic organisms, such as Deferribacteres, Aquificales, Actinobacteria, Chrysogenetes, Epsilon and Deltaproteobacteria (Geobacter division). We identified an increase in nif genes constitution compared to the G3 (Additional File 1 - Figure S9). Notably in Methanotermococcus sp. IH1 and Heliobacterium modesticaldum Ice1 (nifI1I2HDKENXB), Kyrpidia spormannii EA1 (nifBVSHDKENXXaHesA), Paenibacillus spp. (nifBHDKENXXahesAV), Dissulfurispira thermophila T55J and Thermodesulfovibrio yellowstonii DSM 11347 (nifHDKENXI1I2B), Hydrogenobacter thermophilus TK6 and Thermocrinis albus DSM 14484 (nifI1I2TBWZVSAHDKXaENX), Denitrovibrio acetiphilus DSM 12809 (nifAZHDKTENXXaVB), Calditerrivibrio nitroreducens DSM 19672 (nifVBAZENXXaHDKT), Desulfurispirillum indicum S5 (nifVBASUPHDKTZENX), Frankia alni EA1 (nifVHDKENXXaWZBhesAhesB), Desulfuromonas soudanensis WTL (nifHDKENXI1I2B), Anaeromyxobacter sp. K (nifBHDKENXA), Geobacter pickeringii G13 (nifVHDKENXB), Sulfurospirillum barnesii SES3 and Arcobacter nitrofigilis DSM 7299 (nifVBHDKTAENXXaFWZQ). As presented, we identified variable-sized Nif cores in this group including different loci patterns to nifI1I2 in obligately anaerobic/facultative and nifA in facultative/aerobic organisms. Additionally, we identified the protein associated with NifX (we called the gene nifXa) and hesA and hesB genes in facultative/aerobic species.
Group 6 (G6) was composed of all the Cyanobacteria species. We identified the composition of the nif core with a relative pattern in cyanobacteria: nifPVZTBSUHDKENXXaW (Additional File 1, Figure S10). Proteins, such as fdxB, hesA, hesB, and Mo-nitrogenase C-terminal (we called the gene monC) have been reported to play significant roles in BNF presented in these cores.
Group 5 (G5) was composed of Alpha (Magnetococcus marinus MC-1 and Martelella endophytica YC6887), Beta (Azoarcus spp. division) and Gamma (Azotobacter spp. division) proteobacteria (Additional File 1 - Figure S11). In M. marinus, the pattern is the most compact compared with the other organisms in the group (nifABQHDKTYENXXaVZ), with the integration of the isc cluster. We identified the NifL protein in only one Alphaproteobacteria - M. endophytica (nifUSVWZMBHDKTYENXXaQLA). The Betaproteobacteria in this group were unique in that they contained the nifL. Among Halorhodospira spp., the only nitrogen-fixing species of Gammaproteobacteria is notable for not containing it (e.g., H. halophila SL1, nifENXXaABQVWZMHDKY). Clusters of genes adjacent to the nif elements that integrate other systems, such as iscAUS and rnfABCDEH, related to Fe-S incorporation and electron transfer respectively, which may also play a role in BNF, have been reported.
Group 2 (G2) was represented by gamma proteobacteria (Dickeya division). In G2, the genomic synteny of the species Tolumonas auensis, Brenneria goodwinii, Pectobacterium atrosepticum JG10-08, Dickeya fangzhongdai ND14b, and Kosakonia oryzae Ola 51 is represented in Additional File 1 Figure S12. We identified a cluster similar to Klebsiella variicola strain 342 [31] (nifJHDKTYENXUSVWZMFLABQ). In addition, we also identified genes related to the BNF process, namely fdxN, fdxB, and iscA (iscN).
Group 1 (G1) was a mixture of Gammaproteobacteria (Klebsiella division). The genomic synteny of Klebsiella quasivariicola KPN1705 and Raoultella ornithinolytica S12 is represented in Additional File 1, Figure S13. Among these three species and all the other species cataloged in G1 (Additional File 2, Table S5), we identified the synteny pattern nifJHDKTYENXUSVWZMFLABQ.
Group 7 (G7) was composed of alpha (Bradyrhizobium division), beta (Burkholderia division) and Gamma (Halothiobacillus division) proteobacteria, Acidithiobacilia, Nitrospirae (Leptospirillum spp.) and Verrucomicrobia (Methylacidiphilum spp.) (Additional File 1 - Figure S14). The composition of the nif genes revealed a pattern of nifABZ(Z)THDKENXXaQUSVPW. In most cases, there was a duplication of the nifZ gene (e.g., Azoarcus sp. KH32, nifABZ1Z2STHDKENXXaQW). Gene clusters such as isc, composed of hscA, iscA, iscN, iscU, iscS; rnf, composed of rnfABCDEH; and fix, composed of fixABCX, have been reported.
Group 8 (G8) was composed of Alphaproteobacteria (Rhodobacter spp. division) (Additional File 1 - Figure S15). Generally, the nif element pattern is presented as nifABZTHDKENXXaQUSVW. isc (iscANUS) and fix (fixABCX) clusters have also been reported near the nif elements.
Group 9 (G9) included the Alphaproteobacteria Gluconacetobacter diazotrophicus PA1 5, Hartmannibacter diazotrophicus, Methylobacterium nodulans ORS 2060, Mesorhizobium erdmanii NZP2014, Rhizobium leguminosarum GLR17, and Sinorhizobium meliloti. The nif genes were similar in constitution, represented by nifABZTHDKENXXaQUSV(P)W. Only the nif genes in Neorhizobium galegae, Sinorhizobium fredii, Ensifer sojae, and other species of Rhizobium spp. were impracticable because these organisms have BNF genes in plasmid organelles. A cluster of fix and nod genes was found adjacent to the nif cluster. Several hypothetical sequences identified by RAFTS³G showed homology with IS group insertases (e.g., iss3).
G10 constituted the largest number of Rhizobium species and the Sinorhizobium americanum (CCGM7/CFNEI 73) and Sinorhizobium fredii NXT3 (Additional File 1 - Figure S16). The nif sequences of this group were identified exclusively in plasmids and the nif genes are constituted by nifABZTUSWHDKENXXa. For Rhizobium species (i.e. Rhizobium sp. TAL182) there were three structural components of the NifH forming the nifHDK (listed as NifH2, NifD2 and NifK2) and another copy of nifH (NifH3) generally with nifQ borderer, constituting the nifQH3H2D2K2ABZTUSWHDKENXXa cluster. Additional proteins include Nod proteins, Fix proteins, insertase-related proteins (e.g. ISS) and cytochrome P (CYP450).
Duplication of the Nif core in some Cyanobacteria species and Azoarcus sp KH32C
The tree obtained from nitrogen-fixing cyanobacteria was divided into non-heterocyst-forming (NFH) and heterocyst-forming (HF) clades (Figure 4). In NFH, there are diazotrophs of the orders Oscillatoriales, Chroococcidiopsidales, Pleurocapsales and Chroocococcales. The HF cluster was composed of Nostocales, Stigonematales and Synechococcales.
We report the duplication of Nif core proteins in Aulosira laxa NIES-50 , Azoarcus sp KH32C, Calothrix brevissima NIES-22, Nostoc carneum NIES -2107, Tolypothrix tenuis PCC 7101 and Trichormus variabilis ATCC 29413 suggesting the existence of different loci with two distinct nif clusters related with the Nif core or “Nif core2” - Nif core2 was adopted to call the second Nif core set identified (Additional File 2, Chart S3). A comparison of the Nif core and Nif core2 distributions in the phylogenetic tree revealed different topologies of Aulosira laxa, Calothrix brevissima, Nostoc carneum, Tolypothrix tenuis and Trichormus variabilis. The tree topology for these species, with the Nif core2, suggests the BNF related to HF characteristics.
The divergence observed in the Nif core tree compared to the proteome tree highlights that species like Azoarcus olearius, Azoarcus communis, and Azoarcus sp. BH72 were distinct from Azoarcus sp. KH32C and Azoarcus sp. CIB (Figure 5). Similarly, Azospira oryzae PS and Azospira sp. I09 formed distinct groups from Azospira restricta DSM 18626. In conforming, the HCPC-PCA suggests that the species of Azoarcus and Azospira should be divided into distinct groups—specifically, Azoarcus sp. KH32C and Azoarcus sp. CIB were grouped into G8—Azoarcus sp. KH32C (with Nif core2) was grouped into G7 (Figure 4). A. olearius (DQS4 and BH72 strains) and A. comunis (TSNA42 and TSPY31 strains) were grouped into G5. In parallel, the Azospira restricta DSM 18626 was grouped in G5, while A. oryzae OS and Azospira sp. I09 were grouped in G8. Based on the nif gene cluster annotation, Azoarcus spp. and Azospira spp. grouped in G5, G7, and G8 exhibited similar intragroup nif gene compositions (Additional File 2, Chart S4).