Identification of semi-tryptic peptides in metaproteomes
We analyzed two large-scale published datasets including 447 fecal (PXD008675, [7]) and 176 MLI metaproteomes (PXD007819, [15]). Both datasets were generated using high-resolution MS/MS, which allowed searching a large sequence space with a low FDR for semi-tryptic peptide identification. To increase microbial taxonomic coverage, we assembled sequences from a variety of culture-dependent sources such as UniProtKB, NCBI, CGR [20] and culture-independent sources such as IGC [18]. A total number of 12,828,005, 3,133,023, 2,948,562, and 2,757,990 MS/MS spectra were searched for the feces, ascending colon (AsC), descending colon (DeC), and terminal ileum (TI) MLI metaproteomes respectively, from which 3,804,903 (29.66%), 2,035,847 (64.98%), 1,917,761 (65.04%), and 1,808,732 (65.58%) peptide-spectrum matches (PSMs) were identified using PEAKS (1% FDR) in the first step large database search. To increase the sensitivity and confidence of large sequence space based metaproteomic analysis, we employed the two-step database searching strategy [14] and multiple search engines. Proteins identified with at least one peptide in the first step were reserved for the second round search using MaxQuant, PEAKS, and pFind. Only peptides identified by all three softwere were kept for further analysis, resulting in 125,494, 103,170, 106,243, and 92,784 peptides identified in the fecal, AsC, DeC, and TI metaproteomes respectively (Additional file 1: Tables S3-S18), among which 108,784 (86.68%), 76,325 (73.97%), 77,341 (72.79 %), and 65,002 (70.06%) peptides were assigned as microbial unique peptides (not shared by human or food sequences). Using UniPept, a total of 85,126 (78.25%), 67,288 (88.16%), 70492 (91.14%), and 57955 (89.16%) microbial peptides could get taxonomic and/or functional annotations in the fecal, AsC, DeC, and TI metaproteomes respectively. Despite of the comprehensiveness of IGC, which were frequently employed in previous metaproteomics studies, 11,540 (10.61%), 9,025 (11.82%), 9,129 (11.80%), and 7308 (11.24% TI) microbial peptides were only captured by UniProt/NCBI/CGR in the fecal, AsC, DeC, and TI metaproteomes respectively. This was probably because the UniProt/NCBI/CGR database are largely based on the translation of a completely sequenced single microorganism genome, the depth and assembly quality of which are significantly increased compared with that of gut microbial metagenomes. Among all identified microbial peptides, 28,525 (26.22%), 10,650 (13.95%), 9,357 (12.10%), and 10,614 (16.33%) peptides were semi-tryptic in the fecal, AsC, DeC, and TI metaproteomes respectively (Additional file 1: Tables S3 and S5-S7).
We identified 7,969 (6.35%), 14,869 (19.48%), 15,128 (14.24%), and 15,360 (16.55%) human specific peptides in the fecal, AsC, DeC, and TI metaproteomes respectively, among which, 5,104 (64.05%), 5,724 (38.50%), 5,254 (34.73%), and 6,825 (44.43%) peptides were semi-tryptic (Additional file 1: Tables S4 and S8-S10). Gene ontology (GO) analysis reveals that 84.13%, 79.97%, 81.74%, and 80.18 % of human semi-tryptic peptides were derived from potential extracellular proteins while only 1.16%, 0.80%, 0.73%, and 0.76% of microbial semi-tryptic peptides were assigned to potential extracellular proteins in the fecal, AsC, DeC, and TI metaproteomes respectively. The higher percentage of extracellular proteins, which are more susceptible to the gut luminal and mucosal proteases, contributed to the higher proportion of semi-tryptic peptides for human proteins.
Because many food resource such as pig, bovine, and other mammals share a large number of sequences with humans, the number of food unique semi-tryptic peptides were negligible, generally below 50 per sample (thus was not considered for further analysis), after excluding peptides shared by human. In addition, food semi-tryptic peptides represent complex proteolytic events across the entire digestive system because food proteins could be extensively proteolyzed by gastric pepsin, pancreatic proteases, and small intestinal exopeptidases before they reaches the large intestine (colon), where most gut microorganisms live.
Fecal metaproteomics reveals altered microbial proteolysis at different levels
We comprehensively investigated gut microbial proteolysis in terms of the taxonomic and functional distributions as well as the cleavage motif (Fig. 1 and 2). We first calculated the normalized relative abundance of semi-tryptic peptides (NRASP) to determine the relative degree of proteolysis of different microbial and functional groups. Fig. 1 illustrated the NRASP of 20 major taxonomic sub-groups, 35 major biological processes, and 32 enzyme sub-classes (Additional file 1: Tables S19-S21) identified in at least 75% of the 447 fecal metaproteomes from CD (n=204), UC (n=123), and control (n=120) groups.
The median values of NRASP of two dominant phyla Firmicutes and Bacteroidetes, two dominant classes Bacteroidia and Clostridia, two major orders Bacteroidales and Clostridiales, family Bacteroidaceae, as well as genus Bacteroides were around 1 with a very low individual variation, suggesting the relative abundance of the corresponding semi-tryptic peptides were comparable to that of fully tryptic peptides (Fig. 1a). However, the median of NRASP increased to approximately 1.25 for families Lachnospiraceae and Ruminococcaceae, and 1.5 for genera Roseburia and Prevotella as well as two abundant species Faecalibacterium prausnitzii and Prevotella copri, respectively, but reduced to approximately 0.5 for phylum Actinobacteria and order Bifidobacteriales (including its single family member, Bifidobacteriaceae). Four out of 20 major taxonomic sub-groups exhibited significant inter-group difference (Dunn-Bonferroni post hoc, P < 0.05) in their NRASP (Additional file 1: Tables S19): family Ruminococcaceae and species Prevotella copri in CD have increased NRASP compared with the control group; genus Faecalibacterium and its sole known species Faecalibacterium prausnitzii in UC have reduced NRASP compared with the CD group (Fig. 2a).
The median of NRASP of most biological processes also fluctuated around 1 (Fig. 1b). However, these values increased to 1.75-2 for isoleucine biosynthetic process, valine biosynthetic processes, bacterial-type flagellum-dependent cell motility, protein transport, carboxylic acid metabolic process, fucose metabolic process, and glucose metabolic process, and further increased to 2.5 for fatty acid metabolic process and L-threonine catabolic process to glycine, but reduced to approximately 0.75 for polysaccharide catabolic process, carbohydrate transport and transmembrane transport, and further reduced to 0.3 for metabolic process, respectively. Six out of 35 major biological processes exhibited significant inter-group difference (Dunn-Bonferroni post hoc, P < 0.05) in their NRASP (Additional file 1: Tables S20): bacterial-type flagellum-dependent cell motility, polysaccharide catabolic process, anaerobic cobalamin biosynthetic process, and fructose 1,6-bisphosphate metabolic process in CD have increased NRASP compared with the control group, and translation and glycolytic process in UC have reduced NRASP compared with the control and UC group, respectively (Fig. 2b).
At enzyme level, 3-hydroxybutyryl-CoA dehydrogenase which is involved in butyrate metabolism showed the highest NRASP (median value > 3), followed by 3-hydroxyacyl-CoA dehydrogenase involved in fatty acid beta-oxidation, glycine C-acetyltransferase involved in L-threonine degradation, phosphoenolpyruvate carboxykinase (ATP) involved in gluconeogenesis, ketol-acid reductoisomerase (NADP(+)) involved in the biosynthesis of branched-chain amino acids (BCAA), and superoxide dismutase involved in tolerance to oxidant stress (median NRASP of 2-3, Fig 1c). Four out of 32 major enzyme sub-classes exhibited inter-group difference (Dunn-Bonferroni post hoc, P < 0.05) in their NRASP (Additional file 1: Table S21): superoxide dismutase in UC, protein-synthesizing GTPase in CD, and short-chain acyl-CoA dehydrogenase in both CD and UC have increased NRASP compared with the control group (Fig 2c).
We further investigated the microbial proteolytic cleavage site motif by calculating the amino acid frequencies at P6 - P6′ position (Fig. 2d). Generally, alanine and valine were the most abundant amino acids in P1 position in different samples, while amino acids in other positions exhibited higher individual variation. The fecal metaproteome revealed a global alteration of microbial proteolytic motif in IBD (Fig. 2e and Additional file 1: Table S22). In the unsupervised hierarchical clustering, CD and UC clustered together (separated from control), sharing many alterations such as increased frequencies of leucine (P5 and P5′) and tryptophan (P3 and P4′) as well as decreased frequencies of alanine (P1′), methionine (P1′ and P2′), aspartic acid (P3), and serine (P4). Cleavage motif of CD and UC also showed their unique alteration patterns in many cases. For instance, increased frequencies of leucine (P3, P2′, and P4′) and phenylalanine (P6, P3′, and P5′) were only observed in CD, while increased frequencies of histidine (P1 and P5′), serine (P1′), and arginine (P4) as well as decreased frequencies of valine (P2) and cysteine (P3′ and P6′) were only observed in UC. However, the large inter-individual variation did not allow for clear separation between groups in the principal component analysis (PCA, Fig. 2f) or partial least squares discriminant analysis (PLS-DA, Fig. 2g) of cleavage motif.
We also performed principal coordinates analysis (PCoA) of semi-tryptic peptide LFQ abundance using Bray-Curtis distance or Jaccard-based dissimilarity but did not result in clear group separation either (Additional file 1: Fig. S1). Although we used the ‘‘match between runs’’ of option MaxQaunt to increase transferred identification between separate LC-MS runs, the percentage of semi-tryptic peptides identified in more than 75% of samples were less than 0.05%.
Altered cleavage motif of human proteins were also observed (Additional file 1: Fig. S2 and Table S23). Similar to microbial cleavage motif, CD and UC clustered together and separated from control for human protein cleavage motif. Microbial proteins and human proteins can exhibit similar or reversed alteration trends in certain motif positions around the cleavage site in IBD (Additional file 1: Fig. S3). For instance, frequencies of glutamic acid, histidine, and three aromatic amino acids (phenylalanine, tryptophan and tyrosine) were increased in P5′ in IBD for both microbial and human proteins.
MLI metaproteomics reveals location-specific alterations of microbial proteolysis
The fecal microbiome serve as a proxy for the gut luminal microbiota but is not fully representative of the mucosa-associated microbiota at the site of disease. Important complementary knowledge was acquired by systematically characterizing the MLI metaproteomes from different sites, which revealed remarkable location-specific alterations of microbial proteolysis signatures. A total of 57, 58, and 51 biological processes were identified in at least 75% of samples in AsC, DeC, and TI MLI metaproteome, respectively, among which 7 (12.28%), 10 (17.86%), and 10 (19.61%) biological processes exhibited significant inter-group difference in their NRASP (Additional file 1: Tables S24-S26). In the AsC metaproteomes, NRASP of regulation of translation, carbohydrate transport, DNA repair, protein secretion, generation of precursor metabolites and energy, and carboxylic acid metabolic process significantly increased in CD and/or UC (Fig. 3a). In the DeC metaproteomes, most alterations occurred in UC, including increased NRASP of ribosome biogenesis, terpenoid biosynthetic process, 'de novo' UMP biosynthetic process, cell division, and translational termination, as well as reduced NRASP of gluconeogenesis (Fig. 3b). In contrast, in the TI metaproteomes, most alterations occurred in CD, including increased carbohydrate transport, L-fucose catabolic process, translational termination, and ATP synthesis coupled proton transport as well as reduced NRASP of transcription, translation, protein folding, and carbohydrate metabolic process (Fig. 3c).
The microbial cleavage motif also revealed remarkable location-specific alterations in MLI metaproteomes (Fig. 4a-c). Similar to NRASP, microbial cleavage motif differed more in DeC and TI than AsC, where 41, 57 and 32 amino acid frequencies at a specific site exhibited significant inter-group differences, respectively (P < 0.05, Kruskal-Wallis and Dunn-Bonferroni test, Additional file 1: Tables S27-S29). In the unsupervised hierarchical clustering, CD and UC clustered together, separated from control in the ascending colon (Fig. 4a); whereas UC and control clustered together, and separated from CD in the terminal ileum (Fig. 4c). The supervised partial least squares discriminant analysis (PLS-DA) also revealed that, on PC1 axis, UC partially separated from CD and control in the descending colon and CD clearly separated from the other two groups in the terminal ileum (Fig. 4d-f). Altered cleavage motif of human proteins were also observed in the MLI metaproteomes (Additional file 1: Fig. S4 and Tables S30-32). Similar to cleavage motif of microbial proteins, cleavage motif of human proteins in CD separated from the other two groups in the terminal ileum. Microbial proteins and human proteins can exhibit similar or reversed alteration trends in certain motif positions in IBD (Additional file 1: Fig. S5). For instance, increased frequencies of valine and leucine (P1) and reduced frequencies of two aromatic amino acids phenylalanine and tyrosine (P1) in IBD were observed in both microbial and human motif in MLI metaproteomes from different locations.
Association with microbial composition and proteases
To explore the potential relevance of gut microbial community structure (Additional file 1: Table S33) to proteolysis pattern, we associated the top three principal coordinates (PCo1-PCo3) of microbial semi-tryptic peptide LFQ intensity and the top three microbiome β-diversity principal coordinates (PCo1-PCo3, Fig. 5a) computed using Bray-Curtis distance. We found low correlations between semi-tryptic peptide LFQ intensity (PCo1) and β-diversity (PCo2 and PCo3) at different taxonomic levels (-0.40 < Spearman's rank correlation coefficient (R) < 0.42, P < 5.6e-9 for all pairwise associations, n = 272, Fig. 5c).
To investigate the association with microbial proteases, we resorted to the transcriptional abundance of microbial protease/peptidase (Fig. 5b, Additional file 1: Table S34) because protein-level abundance was inaccessible due to the limited sensitivity of current metaproteomics methodology. The microbial protease/peptidase transcriptome at feature (PCo3), species (PCo1), genus (PCo2), and family (PCo2) levels showed moderate correlations with semi-tryptic peptide LFQ intensity (PCo1), stronger than those of β-diversity (Fig. 5d, -0.55 < R < 0.54, P < 2.6e-11, n = 184). However, there was only low correlations at higher taxonomic levels (order, class, and phylum).
Association with host protease inhibitors andimmunoglobulins
In addition to microbial variables, we also investigated the involvement of host factors. Human endogenous protease inhibitors are particularly present in the intestinal tract. Four human protease inhibitors (serpin A1, A3, B1, and B6) were identified in fecal metaproteomes. Serpin A1, A3, and B6 exhibited negative correlations (-0.41 < R < -0.25, P < 7.9e-8, n = 447) with the semi-tryptic peptide LFQ intensity (PCo1 and PCo2). To further investigate the effect of host factors on gut microbial proteolysis, we also analyzed the correlations between gut microbial proteolysis pattern and host immunoglobulins. IgG1, IgG4, and IgM were negatively correlated with PCo1 (-0.44 < R < -0.25, P < 8.4e-8, n = 447), and IgA was positively correlated with PCo2 (R = 0.33, P < 8.4e-13, n = 447) of semi tryptic peptide LFQ intensity, respectively.