Cataloguing microbiome genes from the topsoil of China’s eco-regions
Topsoil samples were collected from n = 41 locations, covering six distinctive geographical regions (Fig. 1a, Table S1). Altogether, 579.3 gigabases (Gb) of paired-end sequence data were generated, averaging 93.6 million paired reads per sample. De novo assembly of that sequence data yielded a non-redundant gene catalogue for the 41 locations. The total length of this non-redundant assembly was 24.6 Gb (mean contig N50 length of 543 bp), from which 54.7 million partial genes > 100 bp were predicted. After removing redundancy by clustering the genes with a shared identity > 90% and a coverage > 90% of the shorter gene, a total of 25.4 million non-redundant genes were deposited in our metagenomic libraries. Bacterial genes were clearly dominant, comprising 97.69% of all sequences, followed by 2.22% for archaea, 0.07% for eukaryota, and only 0.02% for viruses (Fig. 1b, Table S2).
The Chao1 index values for taxonomic (species) and functional (KOs, KEGG Orthology) traits for all microbial groups in China’s six eco-regions are shown in Fig. 1c and Fig. 1d, respectively. For taxonomic traits, the Chao1 richness of the entire soil microbiome ranged from 12 383.95 (at AS group) to 13 569.53 (at HC group), with an overall mean (± SD) and median of 12 961.03 ± 551.81 and 12 938.09, respectively (Table S3, S5). For functional traits, whole microbiome’s Chao1 richness based on the KOs ranged from 6540.04 (at LP group) to 7505.37 (at HC group), having an overall mean of 6875.65 ± 476.89 and a median of 6771.20 across the six regions (Table S4, S6). In addition, the ACE and Simpson indices for α-diversity based on species and KOs are summarized in Tables S3 and S4, respectively. Significant differences arose between regions for their soil microbiome’s Chao1 index when based on species or KOs (Table S5, S6).
Principal coordinates analysis (PCoA) revealed significant differences in the β-diversity of taxonomic traits (species) (PERMANOVA, R2 = 0.415, p = 0.001; Fig. 1e) as well as functional traits (KOs) (PERMANOVA, R2 = 0.432, p = 0.001; Fig. 1f) between different regions. Evidently, whether taxonomic or functional, the microbiome traits of five regions (KS, AS, QT, LP, and FG) were geographically separated along the first principal coordinate from those of HC.
Taxonomic analysis identified a total of 105 phyla of soil microorganisms: 82 for bacteria, 12 for archaea, 10 for eukaryota, plus 1 virus phylum (Fig. 2a, Table S7). Most of the metagenomes were dominated by Actinobacteria (18.01–62.09%) and Proteobacteria (14.66–50.31%), followed by Acidobacteria (0.61–13.10%). In addition, Chloroflexi (1.37–8.01%), Firmicutes (1.19–3.78%), Bacteria_unclassified (0.96–5.42%), Gemmatimonadetes (0.65–5.49%), Cyanobacteria (0.89–21.54%), Planctomycetes (0.72–2.54%), Euryarchaeota (0.19–27.94%), Verrucomicrobia (0.23–3.85%), Bacteroidetes (0.47–6.38%), Thaumarchaeota (0.05–3.90%), Candidatus_Rokubacteria (0.05–3.45%), Candidatus_Tectomicrobia (0.14–2.36%), Nitrospirae (0.15–1.70%), and Deinococcus-Thermus (0.22–1.51%), together accounted for ca. 98% of all metagenomic sequences derived from the topsoil samples (Table S7).
The RPKM abundance data were screened for KEGG pathways significantly enriched in the six eco-regions, using the linear discriminant analysis (LDA) effect size (LEfSe) method (LDA > 2.5, p < 0.05). Collectively, 76 KEGG pathways (18.5% of total) were differentially abundant among the six regions, with 4, 27, 34, 4, 7, and 0 pathways enriched in the LP, KS, HC, FG, AS, and QT, regions respectively (Fig. 2b, Table S8). Further analysis revealed that the 76 KEGG pathways could be classified into 23 classes that mainly dealt with seven biological processes (Fig. 2c, Table S8). (1)Carbohydrate metabolism: e.g., pentose phosphate pathway, amino sugar and nucleotide sugar metabolism, fructose and mannose metabolism, galactose metabolism, starch and sucrose metabolism, and glycolysis/gluconeogenesis. (2)Amino acid metabolism: e.g., valine, leucine and isoleucine biosynthesis; glycine, serine and threonine metabolism; arginine biosynthesis; arginine and proline metabolism; phenylalanine, tyrosine and tryptophan biosynthesis; and lysine biosynthesis. (3)Xenobiotics biodegradation and metabolism: e.g., aminobenzoate degradation, benzoate degradation, caprolactam degradation, chloroalkane and chloroalkene degradation, metabolism of xenobiotics by cytochrome P450, nitrotoluene degradation, polycyclic aromatic hydrocarbon degradation, and styrene degradation. (4)Energy metabolism: e.g., nitrogen metabolism, sulfur metabolism, methane metabolism, carbon fixation pathways in prokaryotes, and oxidative phosphorylation. (5)Metabolism of cofactors and vitamins: e.g., folate biosynthesis, one carbon pool by folate, porphyrin and chlorophyll metabolism, thiamine metabolism, as well as ubiquinone and other terpenoid-quinones’ biosynthesis. (6)Replication and repair: e.g., base excision repair, DNA replication, homologous recombination, mismatch repair, and nucleotide excision repair. (7)Lipid metabolism: e.g., steroid hormone biosynthesis, glycerophospholipid metabolism, sphingolipid metabolism, and fatty acid degradation (Fig. 2c, Table S8). Further, the RPKM-normalized read counts differed markedly (more than five-fold) across six regions, being 7.29×106, 6.27×106, 1.22×107, 8.63×106, 6.44×106, and 6.37×106 in LP, KS, HC, FG, AS, and QT, respectively (Table S8).
Profiling of functional genes involved in metabolism essential for the cycling of C, N, and S
To detect the functional genes conferring traits involved in C, N, and S cycling, the obtained soil metagenomic reads were annotated using the databases of CAZy [23] and KEGG [24]
For carbohydrate metabolism, the 41 soil samples contained six critical CAZyme-encoding genes involved in SOC decomposition and biosynthesis (Fig. 3, Table S9). The genes were distributed unequally among glycoside hydrolases (GHs), glycosyl transferases (GTs), polysaccharide lyases (PLs), carbohydrate esterases (CEs), auxiliary activities (AAs), and carbohydrate-binding modules (CBMs), these generally considered to reflect the microbial potential of decomposition and biosynthesis of SOC [25]. Evidently, the genes encoding the GHs (organic carbon decomposition) and GTs (organic carbon biosynthesis) enzymes were the most abundant in all samples across the six regions (Fig. 3), followed by those coding for CEs and AAs, and least for CBMs and PLs. In addition, there is considerable variation in the total abundances of genes participating in SOC decomposition and biosynthesis across all 41 samples from the different regions, ranging from 61 267.39 to 77 190.28 (Fig. 3, Table S9).
For nitrogen metabolism, biogeochemical N cycling between inventories is often attributed to the six classical N-transformation processes: nitrogen fixation, nitrification, denitrification, nitrate reduction, nitrogen transport/nitrate assimilation, and organic nitrogen metabolism [3, 26] (see the N-cycling model in Fig. 4). Here, 41 KOs encoding key enzymes responsible for N cycling were detected from all samples across the six regions; these genes could be categorized into seven individual gene clusters (Fig. 4, Table S10). (1)Nitrogen fixation: The conversion of N2 to biologically available ammonia (NH4+) is carried out by the nitrogenase complex [26]. Three key genes (nifD, nifH, nifK) encoding the nitrogenase complex were identified in four regional groups (KS, AS, QT, FG samples), but the nifH and nifK were not identified in either the LP or HC group. The highest and lowest abundance of genes involved in nitrogen fixation were found in the QT group (7.43) and HC group (0.135), respectively, the former nearly 55 times greater than the latter. Overall, the abundance of these three genes was measurable yet consistently low across all soil samples from the 41 sites.
(2) Nitrification
Microbial enzymes (ammonia monooxygenase) catalyze the process whereby ammonia (NH3) is oxidized to nitrite (NO2−) and subsequently to nitrate (NO3−) [27]. Four key genes (pmoA-amoA, pmoB-amoB, pmoC-amoC, hao) encoding ammonia monooxygenase were identified in all six regions. The genes involved in nitrification were most abundant in the FG group (43.50) and least so in the HC group (15.57), the former 2.79 times higher than the latter. Of them, pmoB-amoB (80.28) and pmoC-amoC (58.67) were two most abundant genes participating in the nitrification process.
(3) Denitrification
The conversion of NO3− to N2 proceeds via four intermediate steps (NO3−→NO2−→NO→N2O→N2), producing several nitrogenous compounds with notable roles as air polluting gases (N2O and NO) [28]. Eleven marker genes (nirS, nirK, norB, norC, nosZ, narG-nxrA, narH-nxrB, narI, napA, napB, napC) encoding key enzymes for denitrification were identified in all six regions. These genes attained their highest abundance in the QT group (754.21), being lowest in the HC group (296.37), with the former 2.54 times the latter. The three most abundant genes involved in the denitrification process were narG-nxrA (747.29), nirK (649.48), and norB (631.83).
(4) Nitrate reduction: The reduction of NO3− to NH4+ ultimately leads to the incorporation of N into microbial biomass [26]. Of these, DNRA (dissimilatory nitrate reduction to ammonia) is an anaerobic process in which NO3− serves as an electron acceptor to oxidize and release energy from organic carbon. It is mediated by nitrate reductases that form NO2− and nitrite reductases that convert NO2− to NH4+ [29]. DNRA is a novel a biological pathway of N-cycling, and the shortest, in terrestrial ecosystems where NO3− is reduced to NH4+ in soils [30]. Four key genes (nrfA, nrfH, nirB, nirD) encoding key enzymes for DNRA were identified in all six regions. The highest abundance of these genes involved in the DNRA pathway was found in the LP group (656.28) and the lowest in the FG group (329.02), the former being about double the latter. Of these, nirB (1988.84) was the gene present in greatest abundance. Compared with the DNRA pathway, ANRA (assimilatory nitrate reduction) pathway is an energetically costly process that depends on different families of nitrate reductases and nitrite reductases. A total of six marker genes (nasA, nasB, narB, nirA, NR, NIT-6) encoding key enzymes for ANRA pathway were detected across all six regions. The first four of those genes were present in all six regions, with nasA (1607.44) and nirA (1252.56) being the most abundant, where NR and NIT-6 had negligible abundances in three regional groups (LP, FG, and HC) and the FG group, respectively. Further, the highest abundance of genes participating in the ANRA pathway occurred in the HC group (667.09) and the lowest in the FG group (406.41), with the former 1.64 times higher than the latter.
(5) Nitrogen transport/nitrate assimilation: Five marker genes (NRT, nrtA, nrtC, nrtB, nrtD) encoding key enzymes for nitrogen transport/nitrate assimilation were identified in all six regions. These genes were found most and least abundant in the KS (702.81) and HC (358.00) groups, respectively, the former being 1.96 times higher than the latter.
(6) Organic nitrogen metabolism
Eight marker genes (gltB, gltD, gdhA, gdhB, glnA, ureA, ureB, ureC) encoding key enzymes for organic nitrogen metabolism were identified in all six regions. The highest abundance of these genes was found in the LP group (7014.40) and the lowest in the HC group (4937.93); the former was 1.42 times the latter. Of these, glnA (16 187.50) and gltB (10 596.90) were two most abundant genes involved in the organic nitrogen metabolism process.
Overall, then, we uncovered significant differences in the total gene abundances of N metabolism among the different eco-regions, these indicating the soil microbiome of LP harbors the greatest gene abundance for N metabolism (9203.69), followed by that of QT (8976.28), KS (8471.59), and AS (8363.10), with FG (7113.39) and HC (6673.01) having the lowest abundance (Fig. 4, Table S10).
For sulfur metabolism, biogeochemical cycling of S between inventories is often attributed to the three distinct sulfate-transforming processes (Fig. 5): sulfur assimilation, anaerobic sulfate respiration, and sulfide oxidation [31] (see the S-cycling model in Fig. 5). In this study, a total of 21 KOs encoding key enzymes responsible for S cycling were detected from all soil samples across the six regions. These genes could be categorized into three individual gene clusters (Fig. 5, Table S11). (1)Sulfate reduction: The redox reaction in which the divalent anion sulfate (SO42−) is reduced to sulfide (S2−) occurs after the addition of eight electrons. This reaction can be coupled to the oxidation of other chemical species, including a variety of organic compounds (e.g., acetate, pyruvate, methane) or inorganic compounds such as hydrogen (H2) [32]. Sulfate reduction can happen via assimilatory sulfate reduction (ASR) and dissimilatory sulfate reduction (DSR). Almost all microbes use the ASR pathway to reduce SO42− or sulfite (SO32−) to build biomolecules. Here, 10 marker genes (PAPSS, sat, cysNC, cysN, cysD, cysC, cysH, cysJ, cysI, sir) encoding key enzymes participating in the ASR pathway were identified in all six eco-regions. The highest abundance of these genes was found in the LP group (2845.11) and the lowest in the HC group (2017.40). Of these, cysNC (2530.80), cysD (2425.40), cysI (2168.48), and cysH (2041.78) were four most abundant genes involved in the ASR pathway, followed by sat (1718.80), sir (1429.45), cysC (860.12), cysN (770.90), cysJ (659.60), and PAPSS (9.71). In contrast to the ASR pathway, many microbial phyla in anoxic habitats rely on sulfur (either as [thio-]sulfate, sulfite, or elemental S0) as a terminal electron acceptor, hence the so-called “DSR” [33, 34]. Like the ASR pathway, hydrogen sulfide (H2S) is also produced in the DSR pathway, but in much larger quantities and as a ‘waste product’ rather than a precursor [35]. Here, five marker genes (sat, aprA, aprB, dsrA, dsrB) encoding key enzymes crucial to the DSR pathway were identified across six regions. Of these, sat (1718.80), aprA (91.21), and aprB (53.69) were three most abundant genes present and evenly distributed among the six regions. In addition, a very low abundance of dsrA (0.34) and dsrB (1.03) was respectively found in two (AS and FG) and three (KS, AS, and QT) regional groups.
(2) Thiosulfate oxidation by the SOX complex
In chemotrophic and phototrophic sulfur oxidizers that do not form sulfur deposits, a periplasmic thiosulfate-oxidizing multienzyme complex (SOX complex) has been described as responsible for sulfate’s formation from thiosulfate [36]. Here, six marker genes (soxA, soxB, soxC, soxX, soxY, soxZ) encoding key enzymes for thiosulfate oxidation were identified in all six regions. The highest and lowest of abundance of these was in the KS (808.33) and HC group (74.73) group, respectively, the former being 10.8 times higher than the latter. Furthermore, soxC (1106.70) was the most abundant gene involved in the thiosulfate oxidation process, followed by soxY (472.73), soxA (379.14), soxB (331.35), soxZ (262.72), and finally soxX (215.41).
Altogether, we detected significant differences in the total abundances of genes related to S metabolism among different regions. Specifically, the KS (3552.47) and LP (3545.35) regional groups have the highest gene abundance for N metabolism, followed by QT (3396.34) and AS (3394.62), with FG (2955.29) and HC (2404.06) having the lowest abundance (Fig. 5, Table S11).
Effects of environmental factors to microbial genes involved in the cycling of C, N, and S
To understand the relationship between the respect abundances of microbial functional genes involved in C, N, and S cycling vis-à-vis environmental factors across the six eco-regions in China, a Mantel analysis using Bray–Curtis’s dissimilarity measures (geographic distance) was firstly conducted. For carbohydrate functional traits, our results showed that gene abundances for GHs, GTs, PLs, CEs, AAs, and CBMs had positive relationships with multiple environmental factors (Fig. 6a, Table S12). Evidently, across all six regions, NDVI was the most critical environmental factor that individually explained the gene abundance involved in carbon metabolism, except with respect to PLs (soil C:N ratio). Soil pH and MAP were also contributing environmental factors driving the geographic distribution of the genes’ abundance. For N-cycling functional traits, the abundances of genes for nitrogen fixation, nitrification, denitrification, DNRA, ANRA, nitrogen transport, and organic N metabolism were correlated with either a single factor or a suite of environmental factors (Fig. 6b, Table S13). Evidently, MAT and MAP were the most critical environmental factors that individually explained the gene abundance of nitrogen fixation (r = 0.253, p = 0.017) as well as nitrification (r = 0.310, p = 0.004) and DNRA (r = 0.264, p = 0.004), respectively. Meanwhile, the gene abundance of denitrification was more positively correlated with MAT (r = 0.328, p = 0.002) than either SOC (r = 0.177, p = 0.007) or MAP (r = 0.160, p = 0.026), while gene abundance of ANRA was modestly positively correlated with the NDVI (r = 0.300, p = 0.001), soil pH (r = 0.236, p = 0.002), and soil EC (r = 0.236, p = 0.055). For nitrogen transport, its gene abundance was strongly positively correlated with MAP (r = 0.551, p = 0.001) and NDVI (r = 0.470, p = 0.001), as what that of organic N metabolism with NDVI (r = 0.480, p = 0.001) and soil C:N ratio (r = 0.480, p = 0.001). For S-cycling functional traits, the abundances of genes for ASR, DSR, and thiosulfate oxidation (by SOX complex) had positive relationships with multiple environmental factors (Fig. 6c, Table S14). Evidently, ASR’s gene abundance was moderately positively correlated with MAP (r = 0.354, P = 0.001), NDVI (r = 0.333, p = 0.001), and soil C:N ratio (r = 0.306, p = 0.001); the DSR’s gene abundance was moderately positively correlated with MAP (r = 0.413, p = 0.001), NDVI (r = 0.385, p = 0.001), and SOC (r = 0.300, p = 0.001); finally, the gene abundance of thiosulfate oxidation was strongly positively correlated with MAP (r = 0.649, p = 0.001), NDVI (r = 0.585, p = 0.001), and to a lesser extent, soil pH (r = 0.400, p = 0.001). Overall, these Mantel tests revealed that the functional gene abundance of microbes involved in C, N, and S cycling across all six eco-regions were each associated with the NDVI, soil properties, and climatic variables.
Furthermore, random forest model was used to estimate the importance of environmental factors and presented their importance value on the abundances of microbial functional genes involved in C, N, and S cycling (Fig. 7). Our random forest model indicates that the individual functional traits within carbohydrate metabolism were regulated by different predictors. Among them, SOC and Latitude (p < 0.05) were the two most important predictors in regulating entire carbohydrate functional traits (Fig. 7a); MAP (p < 0.01) was the most important predictor in regulating N-cycling functional traits, followed by C: N ratio, SOC, and Latitude (Fig. 7b); SOC, MAP, and Latitude (p < 0.01) were the three most important predictors in regulating S-cycling functional traits, followed by TN, Longitude, and C:N ratio (p < 0.05) (Fig. 7c).