Increased human DNA contents (HDCs) in CRC patients
We first focused on CRC. As expected, HDCs were significantly higher in the feces of CRC patients than in those of healthy controls in all seven CRC datasets (Fig. 1a, Tables S1 and S2). We then identified a total of 26 species that were significantly correlated with HDCs in at least two datasets (Spearman's rank correlation, p-value < 0.05, Fig. 1b; see Methods and Table S3). Of these, thirteen overlapped with the CRC microbial signatures, that is, species with markedly differential abundances in at least two datasets (adjusted p-value < 0.05, see Methods). These included twelve CRC-enriched species (Fig. 1b), such as Fusobacterium nucleatum, Bacteroides fragilis and Peptostreptococcus stomatis, which were also reported in two recent meta-analyses of CRC [15, 16]. Microbial colonization varies along the colon, partly because of differences in the thickness of the mucus layer. Previous studies showed that B. fragilis, which can degrade glycoproteins and produce toxins, could penetrate the protective mucus layer, suggesting that the bacterium accelerates injury of the gut barrier, triggers inflammation and induces tumorigenesis [28–30]. We also identified forty metabolic pathways that were significantly correlated with HDCs in at least two datasets (Table S4), sixteen of which were previously identified as metabolic pathway biomarkers for CRC (see Methods). Most of the HDC-related functions that decreased in at least three datasets were related to carbohydrate degradation for the production of energy and short-chain fatty acids, such as D-galactose degradation and sucrose degradation (Fig. 1c).
In addition, HDC correlated negatively with the degradation pathways of several monosaccharides and monosaccharide derivatives, including fucose, mannose, galactose and UDP-N-acetyl-D-glucosamine (Table S4), which are known building blocks of gut mucus glycans. These results indicated decreased concentrations of these monosaccharides and derivatives, further supporting the conclusion that the intestinal barrier is compromised. Together, our results suggested that CIB, as indicated by HDCs that can be directly quantified from gut metagenomics data, is associated with gut microbiota dysbiosis at both the taxonomic and functional levels.
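The selection rule used above (a species counts as HDC-correlated when its Spearman correlation with HDC is significant in at least two datasets) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy cohorts and the species labels are hypothetical, and a permutation test stands in for the analytic p-value that a statistics library (for example `scipy.stats.spearmanr`) would normally provide, so that the example needs only the Python standard library.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for Spearman's rho (stand-in for the analytic test)."""
    rng = random.Random(seed)
    rx, ry = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rx, ry))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ry)
        if abs(pearson(rx, ry)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def recurrent_hdc_species(datasets, alpha=0.05, min_datasets=2):
    """Species whose correlation with HDC is significant in >= min_datasets datasets."""
    counts = {}
    for data in datasets.values():
        for species, abundance in data["species"].items():
            if permutation_p(data["hdc"], abundance) < alpha:
                counts[species] = counts.get(species, 0) + 1
    return sorted(s for s, c in counts.items() if c >= min_datasets)


# Toy data (made up for illustration): ten samples per cohort.
hdc = list(range(1, 11))
noise = [5, 1, 4, 2, 8, 3, 9, 0, 7, 6]          # weakly related to HDC
rising = [2 * v for v in hdc]                    # increases with HDC
falling = [11 - v for v in hdc]                  # decreases with HDC
toy = {
    "cohort_A": {"hdc": hdc, "species": {"sp1": rising, "sp2": noise}},
    "cohort_B": {"hdc": hdc, "species": {"sp1": falling, "sp2": noise}},
    "cohort_C": {"hdc": hdc, "species": {"sp1": noise, "sp2": falling}},
}
selected = recurrent_hdc_species(toy)
print(selected)
```

In this toy example only `sp1` is retained, because `sp2` is significant in a single cohort. Note that the sketch, like the criterion as stated in the text, does not require the correlation to have the same sign in every dataset.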
Combination of HDC and microbiome contributed significantly to patient stratification
We next tested whether HDC and its correlated species and pathways (referred to as HDC-species and HDC-pathways, respectively) could contribute to patient stratification in CRC. Similar to Wirbel et al. and Thomas et al., we performed a leave-one-dataset-out (LODO) analysis, in which random forest classifiers were trained on the combined data of all but one dataset and tested on the left-out dataset; we repeated this for each dataset in turn. As shown in Fig. 2a and 2c, for models trained on species and pathway abundances, including HDCs improved prediction performance. More importantly, HDC ranked among the most important features, namely 4th and 1st in the taxonomic (Fig. 2b) and functional (Fig. 2d) models, respectively. Interestingly, both HDC-related models performed better than models based on altered markers, even though the taxonomic and functional features overlapped (Fig. 2a, 2b). These results indicated that the HDC-correlated features could contribute substantially to patient stratification and disease diagnosis (Fig. 2).
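The LODO scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: to keep the example dependency-free, a simple centroid-difference scorer replaces the random forest classifier, and the sample values, dataset names (`D1` to `D3`) and feature name `sp1` are invented. In practice one would plug a random forest (for example from scikit-learn, together with its `LeaveOneGroupOut` splitter) into the same loop.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_dataset_out(samples):
    """Yield (held_out_name, train, test) for every dataset in turn."""
    names = sorted({s["dataset"] for s in samples})
    for held_out in names:
        train = [s for s in samples if s["dataset"] != held_out]
        test = [s for s in samples if s["dataset"] == held_out]
        yield held_out, train, test


def fit_centroid_model(train, features):
    """Stand-in model: per-feature weight = case mean minus control mean."""
    cases = [s for s in train if s["label"] == 1]
    controls = [s for s in train if s["label"] == 0]
    model = {}
    for f in features:
        mean_case = sum(s[f] for s in cases) / len(cases)
        mean_ctrl = sum(s[f] for s in controls) / len(controls)
        model[f] = (mean_case - mean_ctrl, (mean_case + mean_ctrl) / 2)
    return model


def score(model, sample):
    """Higher scores mean the sample looks more like a case."""
    return sum(w * (sample[f] - mid) for f, (w, mid) in model.items())


def lodo_auc(samples, features):
    """Train on all datasets but one, test on the held-out one, for each dataset."""
    results = {}
    for held_out, train, test in leave_one_dataset_out(samples):
        model = fit_centroid_model(train, features)
        scores = [score(model, s) for s in test]
        results[held_out] = auc(scores, [s["label"] for s in test])
    return results


# Toy data (made up): (dataset, label, hdc, sp1); label 1 = CRC, 0 = control.
rows = [
    ("D1", 1, 5.0, 0.30), ("D1", 1, 4.0, 0.10), ("D1", 0, 1.0, 0.20), ("D1", 0, 0.5, 0.05),
    ("D2", 1, 6.0, 0.25), ("D2", 1, 3.5, 0.15), ("D2", 0, 1.5, 0.20), ("D2", 0, 0.8, 0.10),
    ("D3", 1, 4.5, 0.35), ("D3", 1, 3.0, 0.05), ("D3", 0, 1.2, 0.15), ("D3", 0, 0.6, 0.10),
]
samples = [dict(dataset=d, label=y, hdc=h, sp1=s) for d, y, h, s in rows]

species_only = lodo_auc(samples, ["sp1"])
species_plus_hdc = lodo_auc(samples, ["sp1", "hdc"])
print(species_only)
print(species_plus_hdc)
```

Comparing the two dictionaries mirrors the comparison in Fig. 2a and 2c: the same held-out datasets are scored once without and once with HDC as an input feature. The toy values were chosen so that HDC separates cases from controls; they say nothing about the real effect size.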
Similar results were found in CD
We then checked whether similar results could be found in CD. A previous study reported elevated fecal HDCs in pediatric CD patients compared with healthy controls; the authors quantified HDCs by quantitative polymerase chain reaction (qPCR) targeting human beta-tubulin coding sequences. The authors also calculated HDCs from the metagenomics data and reported that the qPCR results were positively correlated with the metagenomics-derived HDC values (Pearson's r = 0.81, p = 9.3 × 10^-11). We recalculated the HDCs using our methods and found that they were highly correlated with theirs (Pearson's r = 0.978, p < 2.2e-16; Table S5). These results further validated the reliability and accuracy of metagenomics-derived HDCs.
We identified 46 HDC-correlated species (Control + Baseline group, Spearman correlation, p-value < 0.001), most of which were also differential species (31 of the 47 CD-signature species), that is, species that showed significant abundance changes between healthy controls and untreated patients (Control + Baseline group, Wilcoxon rank sum test, adjusted p-value < 0.05, Fig. 3a, Table S6 and Table S7). Akkermansia muciniphila and Bacteroides caccae, both mucus-degrading commensal species, decreased with increasing HDCs as expected, because an impaired gut secretes insufficient mucus. Another control-enriched bacterial marker, Eubacterium ventriosum, was previously found to be negatively associated with fundamental components of eukaryotic cell membranes. Similarly, differential pathways partly overlapped with HDC-related pathways, including those involved in carbohydrate, protein and glycogen metabolism, whose decreased abundances are known to be associated with nutrient deficiency and intestinal dysfunction (Table S8 and Table S9) [31, 35, 36].
We also built random forest classifiers using species and pathway abundances for CD and evaluated them with 10-fold cross-validation repeated 10 times. As in CRC, adding HDC to the input data improved prediction performance (AUC increased from 0.94 to 0.95 with species profiles and from 0.90 to 0.92 with pathway profiles; Figure S1). Also as in CRC, HDC ranked as a top feature (1st in this case), and the majority of the top ten features were HDC-correlated (Fig. 3b). Interestingly, although they overlapped significantly, these species differed markedly from those in CRC (Table S10) in terms of their changes and importance in patient stratification (Fig. 3b), likely because of differences in disease location and microenvironment: CD commonly occurs in the terminal ileum and presents an inflammatory habitat for microbes, whereas CRC occurs in the colorectum and presents a tumor microenvironment [37, 38]. Nonetheless, it appears that elevated HDC is a common feature of intestinal diseases, while different diseases can be distinguished by their distinct gut dysbiosis profiles.
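The evaluation scheme mentioned above, 10-fold cross-validation repeated 10 times, can be sketched as a splitter that keeps the case/control ratio roughly constant in every fold. Whether the original analysis stratified its folds is not stated, so stratification is an assumption here, as are the function name, the seed and the toy cohort of 30 cases and 20 controls. An equivalent ready-made splitter exists in scikit-learn as `RepeatedStratifiedKFold`.

```python
import random


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) with class-balanced folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for label in sorted(by_class):
            members = by_class[label][:]
            rng.shuffle(members)  # a fresh shuffle for every repeat
            # Deal the shuffled members round-robin across the folds.
            for position, idx in enumerate(members):
                folds[position % n_splits].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Toy cohort (made up): 30 patients (label 1) and 20 controls (label 0).
labels = [1] * 30 + [0] * 20
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 10 repeats x 10 folds
```

A classifier would be trained on `train_idx` and scored on `test_idx` for each of the 100 splits, once with and once without HDC among the input features; averaging the resulting AUCs gives the kind of comparison reported in Figure S1.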
HDC and related dysbiosis signified clinical treatment outcomes
The CD patients we analyzed were treated with diet intervention or anti-TNF antibodies; the outcomes were evaluated by fecal metagenomic sequencing at weeks 1, 4 and 8 after the interventions. We found that HDCs decreased significantly over time (Fig. 4a). As expected, HDC correlated significantly with fecal calprotectin (FCP; Pearson's r = 0.498, p < 2.2e-16, Figure S2), a clinical indicator of intestinal inflammation that is released by neutrophils. However, FCP concentrations were associated with only three altered species in CD, suggesting that HDC is a better dysbiosis-related biomarker than FCP. Strikingly, 23 of the HDC-correlated species showed coordinated changes with HDC: species that were positively correlated with HDC in the Control + Baseline group decreased as HDCs decreased, and negatively correlated species increased (Kruskal-Wallis rank sum test, adjusted p-value < 0.05, Figure S3). This suggests that the interventions that reduced fecal HDCs could globally reverse the gut dysbiosis in a species-specific manner. This conclusion was further supported by the observation that the correlations between HDC and some of the species were consistent across the Control + Baseline, Week1, Week4 and Week8 groups (Fig. 3a).
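The notion of a coordinated change can be made concrete with a small sketch. It assumes, as reported above, that HDC decreases during treatment, and it checks only the direction of change by comparing medians at the first and last time points; the Kruskal-Wallis significance test used in the analysis is deliberately left out. The function name, species labels and abundance values are hypothetical.

```python
from statistics import median


def is_coordinated(baseline_rho, abundance_by_week):
    """True if a species moves the way its baseline HDC correlation predicts.

    baseline_rho: correlation with HDC in the Control + Baseline group.
    abundance_by_week: {week: [abundances]}, with week 0 as baseline.
    Assumes HDC itself decreases between the first and last time point.
    """
    weeks = sorted(abundance_by_week)
    first = median(abundance_by_week[weeks[0]])
    last = median(abundance_by_week[weeks[-1]])
    change = last - first
    if baseline_rho > 0:
        return change < 0   # positively correlated species should decrease
    if baseline_rho < 0:
        return change > 0   # negatively correlated species should increase
    return False


# Toy abundances (made up) at baseline and weeks 1, 4 and 8.
species_a = {0: [0.30, 0.28, 0.35], 1: [0.25, 0.22, 0.27], 4: [0.15, 0.18, 0.12], 8: [0.08, 0.10, 0.05]}
species_b = {0: [0.02, 0.01, 0.03], 1: [0.04, 0.03, 0.05], 4: [0.07, 0.08, 0.06], 8: [0.12, 0.10, 0.15]}

print(is_coordinated(0.6, species_a))   # positively correlated and decreasing
print(is_coordinated(-0.5, species_b))  # negatively correlated and increasing
print(is_coordinated(0.6, species_b))   # positively correlated but increasing
```

Counting the species for which this check holds, after filtering for a significant change over time, would give a tally comparable to the 23 species reported in Figure S3.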
We then investigated how well classifiers based on HDC and the gut microbiome predicted response to CD therapy (see Methods). As expected, adding HDC to the models improved performance (Fig. 4b, Figure S4); again, models based on HDC-related species performed better than models based on altered species. These results suggest that the practice of considering only significantly altered species as patient biomarkers should be revisited, because some species whose alterations did not reach the significance threshold (e.g. FDR < 0.05) nevertheless showed a trend. In addition, based on the accuracies of the pathway-based classifiers, we hypothesized that the microbial functional network did not change much during treatment, even when the patients improved (Figure S4). To test this hypothesis, we collected another metagenomics dataset of CD patients for external validation. Interestingly, when built on HDC and species, the HDC-related classifier had the highest AUC (AUC = 0.71, Fig. 4c), whereas the models built on pathway profiles showed the same trend but none of their AUCs exceeded 0.65, probably because microbial metabolism was incompletely restored (Table S11). Most of the key features of the HDC-related classifier were consistent with the foregoing results that several HDC-related species tended to recover when patients were under treatment (Figure S5). The performance of the classifiers supported our inference that HDC-related features have the potential to serve as signatures for classifying therapeutic response.