Patients were enrolled on an IRB approved (MDACC 2014 − 0543) study for longitudinal sampling of the gut microbiome (Table S1). For all 41 patients enrolled on study, median age was 49 (range 29 to 72 years), and median BMI was 28.6 (range 17.5 to 46.7). Most patients were Non-Hispanic White (44%) or Hispanic (44%) and had squamous carcinoma (74%) and stage II disease (54%). Median tumor size was 5.4cm (range 1.8cm to 11.5cm). More information on clinicopathological characteristics of the patients is provided in Table S2.
Functional richness of MA metagenome associates with larger tumor size and with advanced CC stage
We first explored associations of total number of predicted genes and functions with patient and tumor characteristics (Table S3, Fig. 1a). A higher numbers of unique functions was positively correlated with increased tumor size (R = 0.41, P = 0.008) (Fig. 1b). Importantly, a large number of unique KEGG orthologous groups (KOs) was significantly associated not only with large tumor size (P = 0.02), but also with advanced stages (III/IV) of cervical cancer (P = 0.005; Fig. 1a).
Unsupervised hierarchical clustering reveals microbial communities with 2 distinct profiles of molecular functions
Next, we explored whether specific genes or functions were associated with patient and tumor characteristics using unsupervised hierarchical clustering of abundances. We selected 2,396 KOs that were found in greater than 30 samples. These KOs clustered into 2 large clusters and 4 sub-clusters (Fig. 1c). In general, Cluster 2 was comprised of smaller tumors (“ST cluster”), when compared with Cluster 1 (“LT cluster”; t-test P = 0.06). Within the LT cluster, Sub-cluster 1.1 was dominated by younger patients (t-test P = 0.05) with large tumors (t-test P = 0.03).
The KEGG pathway analysis of the ST cluster and the LT cluster using KEGG mapper  also identified a set of pathways enriched in each of the clusters (Fig. 1c). The ST Cluster was enriched with pathways associated with rapid microbial cell proliferation, including ribosome biogenesis (Additional File 1: Figure S3a), DNA repair (Additional File 1: Figure S3b), oxidative phosphorylation (Additional File 1: Figure S3c), and synthesis of peptidoglycan and lipopolysaccharides (Fig. 1d). Almost all enzymes involved in synthesis of biotin (vitamin B7) were also found in the ST cluster (Fig. 1d).
The LT cluster (Cluster 1 in Fig. 1c) was enriched with pathways associated with bacterial stress response, as indicated by activation of quorum sensing (Additional File 1: Figure S4a), sporulation, and degradation of glycan. The glycan degradation pathways included high activity of the KEGG pathway, “Other glycan degradation” (Fig. 1e). In addition, pathways involved in utilization of glycan degradation byproducts, such as fructose and mannose (Additional File 1: Figure S5a), and especially galactose (Additional File 1: Figure S5b) were enriched. Galactose is the major constituent (~ 85%) of glycans in normal human gastric mucus  and was also enriched in the LT cluster. Both branches of the pentose phosphate pathway, the oxidative branch maintaining redox balance in stress conditions and non-oxidative branch that supply glycolysis with intermediates derived from pentoses , were also more active in the LT cluster (Additional File 1: Figure S6). Most enzymes involved in utilization of ethanolamine (Fig. 1e), a breakdown product of human and bacterial cell membrane , and enzymes involved in production of ornithine (Fig. 1e), a precursor for synthesis of polyamines, such as putrescine [47, 48], were found enriched in the LT cluster. The biological processes of flagellar assembly and chemotaxis (Additional File 1: Figure S7) were also up-regulated in the LT cluster, revealing the importance of mobility for bacteria residing in the metagenomes .
Supervised comparison of metagenomes associated with largest and smallest tumors
To confirm the association of large tumor size with glycan degrading microbial communities, we divided patient samples into “Large Tumor” (LT-group) and “Small Tumor” (ST-group) groups. Two groups of metagenomes were selected to quantify abundances of the communities. The LT-group included 14 metagenomes of patients with the largest tumors, and the ST-group had 14 metagenomes of patients with the smallest tumors (Fig. 2a). The LT-group were also significantly more enriched in patients with stage III/IV of CC then ST-group (chi-squared contingency table tests p-values 0.02), since tumor size is roughly correlated with stage in the 2009 FIGO staging system used in the study. The diversity and richness of molecular functions was significantly higher in LT versus ST-group (Fig. 2b and Fig. 2c respectively). There were 1,496 differentially abundant KOs between the groups with most of the KOs (75%) were more abundant in the LT-group. The density plots of the differentially abundant KOs were also different between the groups. The KO abundances were significantly more scattered around the mean (standard deviation is 2.6) in LT-group when compared with ST-group (standard deviation is 1.3). The KO abundances mean value was almost twice as low in LT-group than in ST-group (Fig. 2d and Fig. 2e respectively). The observed density plots indicated that 2 distinct populations (sets) of molecular functions, referred to as LT-abundant and ST-abundant, might coincide in each metagenome but be most representative in either LT- or ST-group of samples. This conclusion was confirmed by density plots of all KOs found in LT-group (Fig. 2f) and in ST-group (Fig. 2g). By fitting parameters of the normal distributions from Fig. 2d and Fig. 2e to density plots in Fig. 2f and 2g using multiple linear regression, we found that although both, LT- and ST- abundant functions were present in LT- and ST-group of metagenomes, the fitting parameter of LT-abundant population was increased in LT-group by 1.2 and the fitting parameter of ST-abundant population was decreased by 1.5 in the group.
CAZyme Enzyme families associated with small tumors
To associate the LT-abundant KO population with the glycan degradation we compared abundances of carbohydrate-active enzymes (CAZyme) families in this population and in the ST- abundant KO population using CAZy database . The database classifies CAZymes according to their functions, such as synthesis of complex carbohydrates or their hydrolysis.
Glycoside hydrolases GH, which are enzymes involved in hydrolysis of glycosidic bonds between carbohydrates or between a carbohydrate and a non-carbohydrate moiety, were highly enriched in LT abundant KO population (Fig. 2h). There were 16 GH among KOs in the population, and none in the ST abundant population, which included only 3 Glycosyl Transferases (GT); the enzymes that are mainly involved in biosynthesis of disaccharides, oligosaccharides, and polysaccharides. We further compared ST- and LT-abundant KO populations in terms of abundances of glycosidases, enzymes annotated by EC 3.2.1- involved in hydrolyzes of O- and S-glycosyl compounds. There were 32 glycosidases in 28 samples, and many of them were significantly more abundant in metagenomes of patients with large tumors (Fig 2i). Only one enzyme, lysozyme, was found to be significantly more abundant in the small tumors group of patients. Lysozyme is a known component of two-component cell lysis cassette in bacteriophages .
Biological processes associated with small and large tumors
The KEGG pathway analysis of differentially abundant KOs was further used to identify specific biological processes represented by LT-abundant and ST-abundant KOs (Fig. 2j). The representative pathways revealed by the analysis were consistent with results of the supervised analysis of 41 samples. Very active proliferation in ST-group of metagenomes were indicated by a high pathway enrichment score (ST- versus LT-abundant population) for DNA replication, ribosome, t-RNA biosynthesis, homologous recombination and mismatch repair. Stress response and degradation of the mucus layer in LT-group were indicated by enrichment of biofilm formation pathway, galactose metabolism (Fig. 2j, Additional File 1: Figure S8), and by a significant increase in the abundance of trehalose-specific PTS components (Fig. 2k) among LT- versus ST-abundant KOs. Ornithine biosynthesis (Additional File 1: Figure S9), putrescine and spermidine metabolisms (Additional File 1: Figure S8; Fig. 2k) were also enriched in LT- abundant KOs, while enzymes involved in vitamin B12 production were enriched among ST-abundant KOs (Fig2 k).
Association of KEGG pathway with clinicopathological characteristics and recurrence free survival (RFS)
None of the studied clinical characteristics showed significant (p<0.05) association with RFS. This analysis was limited by the low number of events in the study cohort. Among KEGG pathways enriched in either LT- or ST-group of MA metagenomes, only a trend of negative association with RFS was found for high activity of glycan degradation pathway and of Phosphotransferase system (Fig. 3e and Fig. 3f). Both pathways were enriched in LT-group of MA metagenomes. Vice versa, significant positive association with RFS was found for high activity of Ribosome biogenesis and DNA repair pathways enriched in ST- group of MA metagenomes (Fig. 3g and 3h). Associations with RFS were also observed for stage and positive nodal status (Additional File 1: Figure S10A and S10B). These positive associations are well documented in the literature as very significant in studies of large number of CC patients (>10,000) . In this study, patients with positive nodes also had significantly increased activity of glycan degradation pathway (p=0.005) and ornithine biosynthesis (p=0.05) in MA metagenomes (Fig.3a and Fig. 3b). Activity of the pathways was also significantly higher in patients with stage III/IV tumors (Fig. 3c and Fig. 3d), while activity of ribosome biogenesis was higher in metagenomes of stage I/II patients (Additional File 1: Figure S11).
Taxonomic associations with tumor size
Difference in taxonomic structure of MA metagenomes in LT-group versus ST-group can be seen at different taxonomic levels (Fig. 4a, Additional File 1: Figure S12-S14) with dominance of Phylum Firmicutes (Class Clostridia and Order Clostridiales) and Phylum Proteobacteria in LT- group and Phylum Bacteroidetes (Class Bacteroidia, Families Prevotellaceae and Ruminococcaceae in ST-group. Like the differentially abundant molecular functions, the density plots of differentially abundant species were also different between the groups (Fig. 4b) with a greater median abundance of species in ST-group. No difference, however, was seen between LT- and ST-groups of metagenomes in terms of species richness, diversity, and evenness (Fig. 4c).
Taxonomic annotation of most abundant contigs (OTUs) in each group and comparison of their abundances between groups revealed that many of the contigs belonged to Order Bacteroidales, and they were significantly more abundant in ST-group. There were 5 putative species of Genus Porphyromonas (P. endodontalis, P. sp COT_239OH1446, P. bennonis, P. sp HMSC065F10 and P. uenonis) and 2 species of Genus Prevotella (P. Timonensis and P. Buccalis) among Bacteroidales (Fig. 4d, Additional File 2: Table S3). None of the genera was significantly abundant in LT-group. Contigs annotated by Class Clostiridia and other levels of taxonomy in the Class were abundant in L-group including 12 putative species of Family Ruminococcaceae. Only 1 species of Ruminococcaceae (AF41_9) was found more abundant in ST group. The observations were consistent with results obtained by comparison of LT- and ST-groups using LEfSe (Fig. 4e and Additional File 1: Figure S15). The latter analysis has also identified less abundant taxa enriched in one of the groups, such as Class Tissierellia, Genera Ezakiella, Murdochiella and Hungatella, in ST-group, as well as Families Ruminococcaceae, Lachnospieraceae, Enterobacteriaceae, and Order Enterobacterales in LT group (Additional File 1: Figure S16).
Taxonomic associations with pathways
To identify taxa utilizing the glycan degradation pathway in each metagenome we selected contigs that encode enzymes of the pathway and quantified abundances of taxa representing these contigs. Overlay of the enzyme abundances across metagenomes with the relative abundances of taxa involved in glycan degradation (Fig. 4f) showed that major taxa degrading glycan are different between MA metagenomes of patients with small and large tumors. Clostridiales is more likely to encode enzymes of the glycan degradation in the LT-group, while Bacteroidales is the major taxon that encode the enzymes in ST-group. The result is consistent with the difference in taxonomic structure of MA metagenomes described in the previous paragraph. Further correlation analysis revealed a significant positive association of the glycan degradation pathway score with Bacteroidales, Clostridiales, and with non-classified taxa. A significant negative correlation was found with Actinomycetales, Chlamydiales, and Tissierellales (Additional File 1: Figure S17). In general, there was significant variation across taxa involved in glycan degradation. The variation was especially dramatic in MA metagenomes of patients with large tumors.