General features of the metagenome
The metagenome sequencing experiment of five yak fecal samples produced approximately 312 million paired reads and 92 Giga base pairs (Gbps) in total (Additional file 1). After de novo assembly using pooled sequence data from all samples, the resulting metagenome was composed of 1,676,522 contigs, with the average GC% content of 44.3% and the N50 value of 2,153 bp. Among these contigs, the longest one was 377,952 bp. About 68% of the high-quality reads can be recruited back to the assembled contigs greater than 1,000 bp, and the mean sequencing depth of these contigs was 26-fold, giving adequate coverage for the assembly of metagenomic reads. Gene calling based on the contig assemblies predicted 4,570,557 coding sequences (CDSs) with an average length of 698 bp. In this catalog of microbial genes, 44% (2,013,063 genes) possessed complete open reading frames with a mean length of 737 bp. The protein sequence similarity analysis showed that 70.9% (3,241,667) of all the CDSs were annotated by the entries in the NCBI NR database, 51.7% (2,363,314) annotated by the COG database, 46.8% (2,136,681) annotated by the KEGG database and 61.6% (2,815,543) annotated by the Pfam database. Besides, classification of all CDSs based on the COG functional categories indicated that 11.5% were associated with information storage and processing, 10.7% with cellular processes and signaling, 17.7% with the metabolism of various biopolymers (e.g. carbohydrates, amino acids, nucleotides, coenzymes, lipids, and inorganic ions), and 0.7% with the mobile genetic materials like transposons and prophages (Additional file 2).
Taxonomic composition of the yak gut microbiota
To understand the community structure of the yak fecal microbiome, taxonomic distribution based on the pooled reads from all samples was analyzed using protein-level sequence classification. The taxonomic profile of the microbial community consisted of twenty phyla and 120 genera (≥ 0.1% abundance) (Additional file 3). Firmicutes and Bacteroidetes were the most predominant bacteria, accounting for over three quarters (75.7%) of the whole microbial community (Fig. 1A). Both phyla are also the predominant bacterial populations in the fecal microbiota of cattle [12, 15]. The other bacterial phyla with moderate abundance were Proteobacteria (7.3%), Actinobacteria (4.0%), and Spirochaetes (1.6%). For the archaeal domain, Euryarchaeota (3.0%) was the major phylum dominated in the yak fecal microbiome. At the family level, 103 families were detected and the highly abundant taxa with more than 1% abundance are displayed in Fig. 1B. It was noted that eight Firmicutes families were highly abundant, including Lachnospiraceae (11.7%), Ruminococcaceae (6.9%), Clostridiaceae (5.6%), Hungateiclostridiaceae (2.6%), Oscillospiraceae (2.5%), Bacillaceae (2.0%), Paenibacillaceae (1.8%), and Peptococcaceae (1.1%). A substantial diversity of the Bacteroidetes organisms was also found, which was well represented by five abundant families Bacteroidaceae (6.5%), Prevotellaceae (2.8%), Rikenellaceae (2.8%), Flavobacteriaceae (2.2%) and Tannerellaceae (1.0%). Additionally, the taxonomic profiles of individual fecal samples were also summarized in Additional file 3. As shown in Additional file 4, it seemed that the community structures of different samples were similar to each other. Based on the ANOSIM test, there was no significant difference for the microbial communities between the two study sites (R = 0.75, P = 0.10).
Novel CAZymes in the yak gut microbiome
To explore the enzyme repertoire for the breakdown of complex polysaccharides, the genes encoding carbohydrate-active enzymes (CAZymes) present in the yak fecal microbiome was further detected using dbCAN2 [16]. It resulted in 119,926 putative CAZyme sequences assigned to 268 enzyme families, accounting for ~2.6% of the total genes in the catalog. To estimate the novelty of the annotated CAZymes, the protein sequences were searched against the NCBI NR database and the results were summarized in Additional file 5. A small fraction (16.2%) of all the predicted CAZymes were relatively conserved proteins that shared more than 70% identity with the best-hitting homologs. It suggested that 100,543 of the predicted carbohydrate-metabolizing enzymes may be novel, especially for 16,546 proteins that had less than 40% identity with the known proteins in the NR database.
All the detected genes coding for CAZymes were further assigned into six functional classes: 71,908 glycoside hydrolases (GHs), 27,163 glycosyltransferases (GTs), 2,367 polysaccharide lyases (PLs), 14,932 carbohydrate esterases (CEs), 5,389 carbohydrate-binding modules (CBMs), and 204 auxiliary activity enzymes (AAs), respectively. The sequence conservation of these CAZymes was also evaluated through binning their identity percentages with the best matches in the NCBI NR database and the overall identity distribution is displayed in Fig. 2. It was apparent that the GHs were the most abundant, representing the majority (60.0%) of all the CAZyme genes. On the contrary, the AAs (0.2%) were very scanty in the community, and they were relatively conserved compared to the publicly available sequences, with a mean identity of 76%. Notably, the low abundant PLs (2.0%) exhibited the highest genetic divergence with a mean identity of 44%. Besides, the identity percentages for the other four classes were 58% (GTs), 56% (GHs), 56% (CEs), and 51% (CBMs), respectively.
Diversity of carbohydrate-degrading enzymes in the microbiome
GHs (EC 3.2.1.-) are prominent enzymes for hydrolyzing the glycosidic bonds of carbohydrate substrates such as plant cell walls, starch particles, and mucin [4, 10]. Currently, sequence similarity-based family classification of CAZymes has produced 167 GH families (http://www.cazy.org/), many of which group together enzymes with different substrate activities [5]. In the yak fecal microbiome, a total of 71,908 GHs were allocated to 120 CAZy families (Additional file 6). The top 11 abundant families (i.e. GH13, GH2, GH3, GH78, GH43, GH20, GH109, GH29, GH25, GH77, and GH36) possessed 36,283 genes, accounting for about half of the total number of the GH-related sequences. GH13, which is a main α-amylase family that hydrolyzes the internal α-1, 4-glucosidic linkages of starch-related carbohydrates [17], was the largest family with a relative abundance of 8.8%. In addition, the sets of genes encoding four categories of lignocellulolytic enzymes (i.e. cellulases, endo-hemicellulases, debranching enzymes, and oligosaccharide degrading enzymes) from 26 GH families were identified in the fecal microbiome of yak (Table 1). The oligosaccharide degrading enzymes (27.0%) were the most dominating, followed by debranching enzymes (6.1%), endo-hemicellulases (4.1%), and cellulases (2.6%). The cellulases responsible for hydrolyzing β-1,4 linkages in cellulose chains were mainly represented by the genes belonging to the GH5 (cellulases) and GH9 (endoglucanase). The genes coding for endo-hemicellulases were distributed in six families GH8, GH10, GH11, GH26, GH28, and GH53. Of these, GH28 (polygalacturonase), GH10 (endo-1,4-β-xylanase), GH53 (endo-β-1,4-galactanase), and GH26 (xyloglucanase) were more abundant, accounting for nearly 97% of total endo-hemicellulases. Besides, the genes encoding debranching enzymes were mostly assigned to the families GH78 (α-L-rhamnosidase) and GH51 (α-L-arabinofuranosidase), with 3,585 and 726 genes, respectively. High numbers of genes encoding different oligosaccharide degrading enzymes, e.g. β-galactosidase, β-glucosidase, β-xylosidase, α-L-fucosidase, and α-Mannosidase, were found in the families GH1, GH2, GH3, GH29, GH35, GH38, GH39, GH42, GH43, and GH94. Of these, GH2, GH3, and GH43 were the predominant enzyme families, with a relative abundance of 7.5%, 5.9%, and 4.9%, respectively.
Furthermore, the density of the GH genes in the yak fecal microbiome was 20.5 GHs per million base pairs of the assembled contigs. The comparison of GH frequencies with those present in the other herbivore microbiomes implicated that the density of GHs in yak gut was comparable to that (20.4 GHs/Mbp) of termite gut but relatively higher than that in the elephant gut (18.1), cow gut (17.6) and rumen (12.5) (Table 1). The highest density of GHs was found in the camel rumen (24.2). Meanwhile, the number of different GH families predicted in the above herbivore metagenomes was 118 in elephant gut, 112 in camel rumen, 111 in cow rumen, 97 in cow gut, and 57 in termite gut, respectively. Besides, the analysis also found that the GH genes were significantly overrepresented in 18 families (i.e. GH1, GH4, GH20, GH24, GH29, GH33, GH37, GH38, GH39, GH78, GH79, GH84, GH85, GH109, GH110, GH123, GH141, and GH163) in the fecal microbiome of yak comparing to the rumen microbiome of cow and camel (p-value < 0.01; Additional file 6). The evidence for high-density GHs and diversified enzyme families present in the fecal microbiome of yak revealed that its intestinal microbiota likely had strong potential to breakdown various plant-derived polysaccharides in vivo.
PLs (EC 4.2.2.-) are the enzymes that cleave uronic acid-containing polysaccharides using an β-elimination mechanism [18]. These enzymes can target PCW polysaccharides (e.g. pectin and pectate) and/or animal glycans (e.g. chondroitin, heparin, and hyaluronan) [4]. Here we identified 2,367 genes encoding PLs fell into 25 families. Among these PLs, the common enzymatic activities related to degradation of animal glycan were hyaluronate lyase, gellan lyase, chondroitin lyase, and heparin lyase [19], which were represented by the prominent families PL35, PL33, PL12, PL8 and PL21 in the yak fecal microbiome. PL35 (447 genes) and PL33 (401 genes) were the most abundant families, both of which were significantly overrepresented in the fecal microbiota of yak comparing to the rumen microbiota of cow and camel (Additional file 6). The lower frequencies of the PL genes encoding pectin lyase, pectate lyase and rhamnogalacturonan endolyase were found in the families PL1, PL11, and PL9, which have been reported to be involved in the breakdown of pectin and pectate that are common ingredients of PCW polysaccharides [7].
CEs are a class of esterases that catalyze the O-de- or N-deacylation of substituted saccharides and cooperate with GHs to break down PCW polysaccharides [8]. According to the CAZy database, CEs have been segregated into 17 CAZy families. The esterases in the families CE1-7 and CE16 have been supposed to deacetylating enzymes for the breakdown of acetylated plant hemicellulose [20]. In the present study, the set of the predicted CEs belonged to 15 families. Among these, CE1 (3,399 genes) and CE4 (3,048 genes) were the most abundant families, both representing the enzymic activity to degrade acetyl xylan. In addition, moderate abundances were also observed in the families CE2, CE3, CE6, CE7, and CE12 associated with degradation of acetylated plant hemicellulose (Additional file 6).
Major carbohydrate-degrading genes originated from Firmicutes and Bacteroidetes
To find out the major microbial populations contributing to the digestion of complex carbohydrates, taxonomic profiles of the genes encoding carbohydrate-degrading enzymes represented by GHs, CEs, and PLs, respectively, were determined by the LCA algorithm using MEGAN [21]. As shown in Fig. 3A, the majority (> 90%) of all carbohydrate-degrading enzymes were mainly derived from the microbes affiliated to Firmicutes and Bacteroidetes. Moreover, the largest cohort of microbes contributing to the gene repertoire of GHs (56.7%) and CEs (62.7%) was the Firmicutes bacteria. By contrast, Bacteroidetes was the most dominant among the putative microbial producers for PLs (56.7%). A further view at the lower taxonomic level revealed that the microbial species belonging to the families Bacteroidaceae, Ruminococcaceae, Rikenellaceae, Clostridiaceae, and Prevotellaceae were frequently observed in all three classes of CAZymes (Fig. 3B). The proportions of PLs originated from Rikenellaceae (21.1%) and Paenibacillaceae (19.5%) were much higher than those of GHs and CEs. It was apparent that the members of Catabacteriaceae carried the genes encoding CEs (3.3%) alone. Additionally, the carbohydrate-degrading genes excluding the PLs were detected in the families Akkermansiaceae, Erysipelotrichaecea, Spirochaetaceae, and Acetobaceraceae. Notably, a substantial number of CEs (27.1%) were found in the cohort of unclassified bacterial species from the taxonomic clades Firmicutes, Clostridiales, Lentisphaerae, and Bacteroidales in which less abundant PLs and none of GHs were observed. The high proportion of the unclassified taxa indicated that many microbes with special metabolic potential were undiscovered in the gut community of yak.
Recovery of the uncultured genomes and their lignocellulolytic potential
In the present study, 68 metagenome-assembled genomes (MAGs) with completeness ≥ 80% and contamination ≤ 10% were recovered to further explore genome biology of individual lignocellulolytic species in the yak digestive tract. The characteristics of the genome assemblies and the predicted taxonomy were summarized in Additional file 6. The sizes of the MAGs were ranged from ~0.85 to ~3.53 Mb with an average N50 of 26,918 bp. Additionally, these MAGs harbored a varied GC% content between 22.9% to 64.1%, representing a broad range of diverse microbes (Additional file 7).
The analysis of taxonomic inference for MAGs indicated that all the putative genomes were assigned to six bacterial phyla and a single archaeal phylum (Fig. 4). Among the MAG-representing microbial populations, the most frequently observed taxa were the species affiliated to Firmicutes (30 MAGs), followed by Bacteroidetes (24 MAGs). The other MAGs were taxonomically assigned to the phyla Verrucomicrobia (7 MAGs), Proteobacteria (4 MAGs), Fibrobacteres (1 MAG), Spirochaetes (1 MAG), Euryarchaeota (1 MAG). All MAGs of Bacteroidetes belonged to the order Bacteroidales, and the majority (24 out of 30) of Firmicutes MAGs were affiliated to the class Clostridia. As shown in Fig. 4A, the number of genomes that could be taxonomically classified was decreased sharply at the genus level. Approximately a quarter (14 out of 68) of the MAGs were classified at the genus level, and none of the genomes were assigned with the taxonomic identifiers at the species level (Additional file 7). It suggested that most of the uncultured genomes were novel species firstly discovered in the yak gut microbial community.
On the other hand, based on phylogenetic reconstruction with 400 highly conserved prokaryotic proteins, the whole-genome phylogeny for the uncultured genomes together with some public reference genomes from the closely related species is illustrated in Fig. 4B. The genetic relationships of the MAGs were consistent with their taxonomic classifications at the phylum level. For instance, a single Euryarchaeota MAG (#32) clustered with Methanobrevibacter smithii was assigned to the genus Methanobrevibacter, which has been identified as the dominant methanogen in the large intestine of finishing pigs [22].
In terms of the genes coding for GHs in the recovered genomes, the potential of lignocellulosic degradation was evaluated. The amount and gene density of all GH-encoding genes present in the genomes belonging to different phyla is shown in Table 2. A lot of GHs were observed in the genomes of Bacteroidetes and Firmicutes, which possessed 897 and 857 genes, respectively. The highest gene density, 27.0 GHs/Mbp, was observed in the Fibrobacteres MAG (#42) that was assigned to the family Fibrobacteraceae. The relatively high density of GHs was found in the bacterial genomes from the phyla Bacteroidetes (19.5), Firmicutes (17.6), and Verrucomicrobia (17.8), respectively (Table 2). Conversely, the GHs are scarce in the single Euryarchaeota MAG, which encodes two GH genes only. Besides, the distribution of the GHs involved in the degradation of PCW polysaccharides across the MAGs is displayed in Fig. 5. The genes encoding lignocellulolytic enzymes were frequently distributed in the following families: cellulases (GH5), endo-hemicellulases (GH10 and GH28), debranching enzymes (GH51 and GH78), and oligosaccharide degrading enzymes (GH2, GH3, GH29, GH38, GH39, and GH43). It was worth noting that 15 MAGs were derived from novel bacteria species with carbohydrate-digestive capacity, each with more than 20 genes encoding lignocellulolytic enzymes. They were seven Clostridia MAGs (#15, #25, #33, #48, #52, #55, and #68) belonging to the Firmicutes, five Bacteroidia MAGs (#04, #07, #29, #44, and #59) belonging to the Bacteroidetes, and two Kiritimatiellae MAGs (#41 and #46) belonging to the Verrucomicrobiota, and the remaining MAG (#42) belonging to the Fibrobacteres. Two Bacteroidetes MAGs (#29 and #07) harbored the most abundant (hemi)cellulose-degrading genes, with 54 and 49 genes, respectively. MAG07 possessed abundant genes encoding cellulases (18 GH5, 4 GH9, and 1 GH44) and endo-hemicellulases (5 GH11, 3 GH8, 3 GH10, and 1 GH53). By contrast, more genes encoding debranching (5 GH51, 1 GH67, and 1 GH78) and oligosaccharide degrading enzymes (20 GH43, 8 GH2, 5 GH3, 3 GH29, and 1 GH35) were detected in MAG29.