A Computational Framework to Quantify Host-Microbiome Interactions in Clostridioides difficile Infection

Background: Clostridioides difficile infection (CDI) is the most common cause of healthcare-associated infection and an important cause of morbidity and mortality among hospitalized patients. A comprehensive understanding of CDI pathogenesis is crucial for disease diagnosis, treatment and prevention. To achieve that, a quantitative study of host-microbiome interactions in CDI is a prerequisite. Yet, an effective computational framework to quantify host-microbiome interactions in CDI has been lacking.
Methods: Here, we characterized gut microbial compositions and a broad panel of immunological markers in a comprehensive clinical cohort of 243 well-characterized human subjects with four different C. difficile infection/colonization statuses (CDI, Asymptomatic Carriage, Non-CDI Diarrhea, and Control). Based on microbial and immunological features, we developed a computational framework to detect CDI status using random forest and symbolic classification models.
Results: First, by calculating the correlations between microbial compositions and the circulating levels of host immune markers for each of the four phenotype groups, we found that the interactions between gut microbiota and host immune markers are very sensitive to the status of C. difficile colonization and infection. Second, we demonstrated that incorporating both gut microbiome and host immune marker data into random forest classifiers can better distinguish CDI from other groups than can either type of data alone. Finally, we performed symbolic classification using selected features from random forest classifiers to derive simple mathematical formulas that explicitly model the interactions between gut microbiome and host immune markers.
Conclusions: Overall, this study provides an effective computational framework to quantify the role of the intricate interactions between gut microbiota and host immune markers in CDI pathogenesis. This framework may inform the design of future diagnostic and therapeutic strategies.


Introduction
Clostridioides difficile infection (CDI) is the most common cause of healthcare-associated infection and an important cause of morbidity and mortality among hospitalized patients 1-3. Exposure to toxinogenic C. difficile can lead to clinical outcomes ranging from asymptomatic colonization to mild diarrhea and more severe disease syndromes such as pseudomembranous colitis, toxic megacolon, bowel perforation, sepsis, and death 4,5. Asymptomatic C. difficile carriage is characterized by C. difficile colonization in the absence of symptoms of infection. The diagnosis of CDI is based on clinical signs and symptoms in combination with laboratory testing, including enzyme immunoassays (EIA) for TcdA and TcdB, nucleic acid amplification tests (NAAT), selective toxinogenic culture, cell cytotoxicity neutralization assay, and glutamate dehydrogenase EIA 6-8. However, currently available approaches do not accurately differentiate CDI from diarrhea with another cause in a patient colonized with toxinogenic C. difficile.

Current treatment strategies for CDI, including vancomycin, metronidazole and fidaxomicin, have inconsistent cure rates, and treatment failure or CDI recurrence may occur in approximately one third of cases 9,10. Antibiotic exposure is considered the most important factor predisposing patients to CDI 11,12. In fact, treatments with antibiotics have a tremendous impact on the composition and functionality of the gut microbiota, and accordingly are associated with reduced colonization resistance against pathogens such as C. difficile 13-15. It has been reported that several gut commensal bacteria may contribute to the prevention of C. difficile colonization and infection 16,17. Once it has colonized the gut, C. difficile can produce toxins that mediate a robust inflammatory response. Toxin A (TcdA) and toxin B (TcdB) are the primary virulence factors of C. difficile 18 and act on intestinal epithelial cells first, inducing pro-inflammatory cytokines, loss of tight junctions, cell detachment and an impaired mucosal barrier 19-21, leading to further exposure of immune cells to toxins. The innate and adaptive immune responses to CDI play crucial roles in disease onset, expression, severity, progression, and overall prognosis 22,23. The innate immune defense mechanisms against C. difficile and its toxins include the commensal intestinal flora, mucosal barrier, intestinal epithelial cells, and mucosal immune system 24,25. TcdA and TcdB have multiple effects on the innate immune system, including inducing expression of numerous pro-inflammatory mediators (e.g., cytokines, chemokines and neuroimmune peptides) and the recruitment and activation of a variety of innate immune cells 26,27. Adaptive immunity is also sufficient to provide some protection from CDI, likely via antibody-mediated neutralization of TcdA and TcdB 28-31.

The role of the immune response, combined with the knowledge that a balanced microbiota can prevent colonization and infection, demonstrates the importance of considering both gut microbiota and host immune markers in understanding the pathogenesis of CDI. Yet, an effective computational framework to quantify the role of the intricate interactions between gut microbiota and host immune markers in human CDI pathogenesis has been lacking. Here we leverage tools from machine learning to develop such a framework. Machine learning has had a great impact in many areas of medical research, as it offers a principled approach for developing sophisticated, automatic, and objective algorithms for analysis of complex data. Indeed, previous studies indicate that supervised learning can be successfully employed for clinical disease assessment for diverse disorders 32-35. In our previous work, we found that specific immune markers, particularly G-CSF, can be used to distinguish adults with CDI from other groups including asymptomatic carriers and NAAT-negative patients with and without diarrhea 36. Here, we leverage machine learning tools to integrate the host immune marker data and newly obtained gut microbiome data from subjects of the same cohort to identify collections of bacteria and immune markers that can be associated with CDI. Our aim is to quantify the role of intricate interactions between gut microbiota and immune response in CDI pathogenesis, which can inform the design of future diagnostic and therapeutic strategies.

Microbial community structure
To compare the overall microbial community structure of the four groups, we first calculated the alpha diversity (i.e., the within-sample taxonomic diversity) of each sample at the genus level using four different measures: taxa richness (the observed number of different taxa present in the sample), Chao1 (an abundance-based estimator of taxa richness), evenness (the uniformity of the population sizes of the taxa present in the sample), and the Shannon diversity index (an estimator of taxa richness and evenness, with more weight on richness). As shown in Fig. 1a-d, we found that taxa richness and Chao1 did not differ significantly among these groups. The gut microbiota of Non-CDI Diarrhea subjects showed lower evenness than that of the Control group. Shannon diversity was significantly lower in the Non-CDI Diarrhea and CDI groups than in the Control group.

To determine whether the gut microbial compositions of participants are affected by C. difficile infection/colonization status, we performed Principal Coordinates Analysis (PCoA) at the genus level using Bray-Curtis dissimilarity (a beta diversity measure that quantifies the between-sample compositional dissimilarity). We found no distinct clusters corresponding to the four different phenotype groups, implying that the gut microbial compositions of participants from the four groups overlap substantially in ordination space (Fig. 1e). Interestingly, by directly comparing the beta diversity of each group, we did find that the CDI group displays higher beta diversity than the other groups (Fig. 1f), indicating that the microbial compositions of participants within the CDI group vary more than those within the other groups. Permutational multivariate analysis of variance (PERMANOVA) showed that the overall bacterial composition differed significantly among groups based on the CDI status (P < 0.001; Table S2), whereas other host factors such as age, sex, race and ethnicity had no significant effect on the microbiome composition.
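The diversity measures described above can be illustrated with a minimal, standard-library Python sketch. It is not the pipeline used in the study (the Methods state that the vegan package in R was used). The function names are hypothetical, evenness is assumed to be Pielou's index, Chao1 is assumed to be the bias-corrected form, and Bray-Curtis is computed here on relative abundances.

```python
import math

def richness(counts):
    """Observed number of taxa with a non-zero count."""
    return sum(1 for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1 (assumed form): S_obs + F1*(F1-1) / (2*(F2+1))."""
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

def shannon(counts):
    """Shannon index with natural logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's evenness (assumed definition): Shannon / ln(richness)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples, on relative abundances."""
    pu = [x / sum(u) for x in u]
    pv = [x / sum(v) for x in v]
    return sum(abs(a - b) for a, b in zip(pu, pv)) / 2.0

# Toy genus-level count vectors (illustrative values, not study data).
sample_a = [10, 1, 1, 2, 0]
sample_b = [3, 3, 3, 3, 3]
print(richness(sample_a), chao1(sample_a), round(shannon(sample_b), 3))
print(round(bray_curtis(sample_a, sample_b), 3))
```

A within-group beta diversity comparison like the one in Fig. 1f would then collect `bray_curtis` values over all sample pairs inside each group.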

To identify microbiome markers (i.e., taxa with very high discriminatory ability) that differentiate the phenotype groups, we performed differential abundance analysis. In particular, we used ANCOM 37 (analysis of composition of microbiomes) with a Benjamini-Hochberg correction, and adjusted for age and sex. We found that the abundances of 15 genera were significantly different between the CDI and Asymptomatic Carriage groups (Fig. 2a and Table S3). Among these 15 genera, 4 (Veillonella, Enterobacter, Granulicatella and Dialister) were enriched in the CDI group, while the other 11 genera (including Lactococcus, Dorea, Moryella, Stenotrophomonas and Agathobacter) were enriched in the Asymptomatic Carriage group. We also found 16 differentially abundant genera between the Non-CDI Diarrhea group and the CDI group (Fig. 2b and Table S4). Of these, 10 genera (including Clostridioides, Enterobacter, Dialister, and Veillonella) were enriched in the CDI group, and the other 6 genera ([Eubacterium]_hallii_group, Collinsella, Agathobacter, Dorea, Stenotrophomonas and Streptococcus) were enriched in the Non-CDI Diarrhea group. ANCOM analysis also enabled us to identify 40 genera (including Clostridioides and Veillonella) with significantly different abundances between the CDI group and the whole Non-CDI group (Fig. 2c and Table S5). Note that a total of 6 genera were differentially abundant in all three comparisons. Among them, Veillonella, Enterobacter and Dialister were enriched in the CDI group, while Dorea, Stenotrophomonas and Agathobacter were depleted in the CDI group.

Microbial correlation networks
To compare the microbial communities of the four groups at the network level, we constructed the genus-level microbial correlation network for each group using SparCC 38. We found that the microbial correlation network of the CDI group has a markedly different structure from those of the other groups (Fig. 3). More precisely, it has fewer nodes and edges and a lower average degree, but higher modularity (Table S6). These differences indicate that the overall microbial correlations in the CDI group are much weaker than those in the other groups.
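As a rough illustration of the network summaries compared above (number of nodes, number of edges, average degree), the following Python sketch thresholds a set of pairwise correlations into an undirected network. It is only a sketch. The correlations would come from SparCC in the study, whereas here they are supplied by hand, and the function name and the cutoff of 0.3 are assumptions, not values taken from the paper. Modularity is omitted because it requires a community detection step.

```python
def network_summary(correlations, cutoff=0.3):
    """Summarize an undirected correlation network.

    correlations: dict mapping (genus_a, genus_b) -> correlation coefficient.
    cutoff: hypothetical absolute-correlation threshold for keeping an edge.
    """
    edges = {
        frozenset(pair)
        for pair, r in correlations.items()
        if pair[0] != pair[1] and abs(r) >= cutoff
    }
    nodes = {genus for edge in edges for genus in edge}
    # Each edge contributes to the degree of two nodes.
    average_degree = 2 * len(edges) / len(nodes) if nodes else 0.0
    return {"nodes": len(nodes), "edges": len(edges), "average_degree": average_degree}

# Toy correlations (illustrative values, not study data).
toy = {
    ("Veillonella", "Dialister"): 0.52,
    ("Dialister", "Dorea"): -0.41,
    ("Veillonella", "Dorea"): 0.10,
}
print(network_summary(toy))
```

Under this definition, a group whose correlations are mostly weak keeps fewer edges and nodes and has a lower average degree, which is the pattern reported for the CDI group.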

To analyze these patterns in more detail, we used NetShift 39 to identify potentially important "driver" taxa responsible for the change of microbial correlations. This analysis revealed 24 potential driver taxa linked with the change of microbial correlations between the CDI and Asymptomatic Carriage groups (Fig. S1). The top driver taxa were Alistipes, Clostridioides, Desulfovibrio, Eggerthella, Erysipelatoclostridium, Klebsiella, Odoribacter, Proteus, [Ruminococcus]_torques_group, Streptococcus, Vagococcus and Veillonella. We then identified 24 genera as potential driver taxa underlying the change of microbial correlations between the CDI and Non-CDI Diarrhea groups (Fig. S2)

Host immune markers and CDI
To determine the systemic levels of proinflammatory cytokines in CDI, we measured the … and toxin B as previously reported 36. We previously demonstrated specific markers of innate and adaptive immunity that can distinguish CDI from each of the other three groups 36. In the current study, we are particularly interested in comparing the CDI group and the combined Non-CDI group. Based on the Mann-Whitney U test, we identified a total of 11 immune markers that displayed significantly different concentrations in these two groups: G-CSF, IL-4, IL-6, IL-8, IL-10, IL-15, TNF-α, MCP1, IgA anti-toxin A and B, and IgG anti-toxin A in blood (Table S7). All of these immune markers had higher concentrations in the CDI group than in the Non-CDI group. Host immune marker variations between samples were evaluated using Principal Component Analysis (PCA) (Fig. 1g). The PCA plot showed no clear clustering of subjects based on immune marker concentrations. However, boxplots of the Euclidean distances between immune marker profiles showed higher within-group variation in CDI patients than in each of the other three groups (Fig. 1h).
PERMANOVA indicated that the immune homeostasis was significantly different among groups based on the CDI status (P = 0.016; Table S2).

Interactions between gut microbiome and host immune markers
To reveal the interactions between the gut microbiome and the host immune system, we calculated the correlations between microbial compositions and the circulating levels of host immune markers for each of the four groups. The results are shown in Fig. 4 and Fig. S4. For the Control group, the most significant correlations were as follows: Christensenellaceae R-7 group was negatively correlated with TNF-α, Bifidobacterium was positively correlated with VEGFA and IL-13, Rothia was positively correlated with IL-15, and Veillonella was positively correlated with IL-4 (Fig. 4a and Fig. S4). For the Non-CDI Diarrhea group, Ruminococcaceae UCG-011 was negatively correlated with IL-8 and IL-6, Defluviitaleaceae UCG-011 was positively correlated with IL-1b, and Blautia was negatively correlated with MCP1 levels (Fig. 4b). For the Asymptomatic Carriage group, we found that Lactobacillus was negatively correlated with VEGFA, Akkermansia was positively correlated with IL-6, and Enterococcus was positively correlated with TNF-α (Fig. 4c). For the CDI group, negative correlations involved Akkermansia and IL-10, and Lactococcus and G-CSF, while positive correlations involved Lactobacillus and both IgG and IgA anti-toxin B (Fig. 4d). Interestingly, none of these most significant correlations was universally present across the different groups. This indicates that the interactions between gut microbiota and host immunological markers can be very sensitive to the status of C. difficile colonization and infection. Although this rudimentary correlation analysis cannot reveal any nonlinear interactions between gut microbiota and host immune markers, the result suggests that integrating gut microbiota and host immune markers might be useful for highly accurate classification of CDI.

Classification of CDI using host immune markers and gut microbiota
To determine whether host immune markers or gut microbiota could serve as biomarkers to classify subjects into different groups, we constructed a multi-class classifier based on random forests (RF). One of the most popular performance metrics of a classifier is the Area Under the receiver operating characteristic Curve (AUC). The performance of a multi-class classifier is measured by both micro-average and macro-average AUCs. We considered three different feature types in our classification analysis: (1) host immune marker concentrations alone; (2) gut microbial compositions alone; and (3) the integration of (1) and (2). To eliminate confounding effects, we excluded the genus Clostridioides from our classification analysis. The immune marker-based classifier achieved macro-average AUC ~ 0.827 and micro-average AUC ~ 0.828 (Fig. S5a), which are quite comparable to the performance of the microbiota-based classifier (Fig. S5b). Interestingly, integrating immune markers with gut microbiota yielded much better classification performance (macro-average AUC ~ 0.926 and micro-average AUC ~ 0.869) (Fig. S5c).
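The distinction between macro-average and micro-average AUC can be made concrete with a short Python sketch. This is an illustration under stated assumptions, not the study's evaluation code. The function names are hypothetical, AUC is computed with the pairwise (rank-based) definition, macro-averaging is taken as the unweighted mean of one-vs-rest AUCs, and micro-averaging pools all (sample, class) indicator-score pairs before computing a single AUC.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_micro_auc(y_true, proba, classes):
    """Return (macro-average AUC, micro-average AUC) for a multi-class classifier.

    y_true: list of class labels; proba: per-sample list of class probabilities,
    ordered as in `classes`.
    """
    per_class, pooled_labels, pooled_scores = [], [], []
    for k, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]  # one-vs-rest indicator
        scores = [row[k] for row in proba]
        per_class.append(binary_auc(labels, scores))
        pooled_labels += labels
        pooled_scores += scores
    macro = sum(per_class) / len(per_class)
    micro = binary_auc(pooled_labels, pooled_scores)
    return macro, micro

# Toy predictions (illustrative values, not study data).
classes = ["CDI", "Asymptomatic Carriage", "Control"]
y_true = ["CDI", "Asymptomatic Carriage", "Control", "CDI"]
proba = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
print(macro_micro_auc(y_true, proba, classes))
```

Because the macro-average weights every class equally while the micro-average weights every sample-class pair equally, the two can diverge when group sizes are unbalanced, as they are in this cohort.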

We further performed binary classifications to distinguish CDI subjects from Asymptomatic Carriage, Non-CDI Diarrhea, and Non-CDI subjects, using different feature types (Fig. 5). The goal of this analysis was to assess whether any single taxon or immune marker could reliably differentiate CDI status. In the classification of CDI vs. Asymptomatic Carriage, we found that G-CSF and Moryella were the most important immune and microbial features, respectively (Fig. S6a-b). However, classification based on G-CSF or Moryella alone did not yield very high performance: mean AUC ~ 0.817 and 0.701, respectively (Fig. 5a1-a2). When we used all the immune markers or all the genera as features, we achieved mean AUC ~ 0.867 and 0.805, respectively (Fig. 5a3-a4). Interestingly, when we integrated all the host immune markers and gut microbial composition data, we achieved a higher performance, with mean AUC ~ 0.900 (Fig. 5a5). To select a subset of features that is as discriminatory as the whole set, we followed the "1-SE" rule (i.e., one chooses the model with the fewest features such that its classification performance is less than one standard error away from that of the model with all the features), and selected the following 4 features for classifying the CDI and Asymptomatic Carriage groups: 2 bacterial genera (Moryella and Veillonella) and 2 immune markers (G-CSF and IL-6) (Fig. S6g-j). The RF classifier with those selected features displayed an outstanding classification performance, with mean AUC ~ 0.916 (Fig. 5a6). Note that a significant negative correlation between Moryella and G-CSF was found in the Asymptomatic Carriage group (Fig. 4c), which might contribute to the outstanding performance of the RF classifier with Moryella and G-CSF as selected features.

In the classification of CDI vs. Non-CDI Diarrhea groups, we found that G-CSF and [Eubacterium]_hallii_group are the top immune and microbial features, respectively ( … features. Note that Enterococcus was found to be significantly associated with G-CSF in the Non-CDI Diarrhea group (Fig. 4b). This might partially explain the outstanding performance of the RF classifier with Enterococcus and G-CSF as selected features.

In the classification of CDI vs. Non-CDI groups, we found that G-CSF and …

Derive nonlinear interactions between gut microbiota and host immune markers using symbolic classification
As mentioned earlier, traditional correlation analysis cannot reveal any nonlinear interactions between gut microbiota and host immune markers. This fact and the outstanding classification results based on well-selected features prompted us to derive simple mathematical models to quantify the intricate interactions between gut microbiota and host immune markers. To achieve that, we leveraged symbolic classification (SC) 40,41, a genetic programming technique that automatically searches the space of mathematical expressions to find the model that best fits a given dataset. The fitness function in SC is maximized, and the number of generations is chosen based on the saturation of the fitness score (Fig. S7). Using the same set of selected features and trained with the entire dataset, the SC model outperformed logistic regression (LR) in differentiating CDI (see Table 2).

Indeed, as shown in Table 2, we derived a simple SC model with selected features that reached a very high accuracy (0.896) in distinguishing CDI subjects from Asymptomatic Carriage. For each subject, we calculate a diagnostic score that is used for CDI diagnosis: the subject is classified as CDI if the score is > 0, and as Asymptomatic Carriage if the score is ≤ 0. Similarly, we derived SC models with accuracies of 0.900 and 0.882 in distinguishing CDI from Non-CDI Diarrhea and from Non-CDI, respectively, with the corresponding diagnostic scores shown in Table 2. To ensure that the SC models learned from the entire dataset are not overfitting, we performed cross-validation. With different training sets, SC derives different mathematical formulas (i.e., diagnostic scores). However, the SC models learned from different training datasets demonstrated quite robust performance in terms of Accuracy, Precision, Recall and F1-score (see Table S8). More importantly, even when trained with less data, the SC models still outperformed LR models learned from the entire dataset.

As shown in Table 2, in the formulas of the diagnostic score, we colored the gut microbiota features in red and the host immune marker features in blue. Any potential interactions between gut microbiota and host immune markers are completely ignored in the formulas derived from LR. In the formulas derived from SC, by contrast, nonlinear interactions between gut microbiota and host immune markers can be clearly seen. We emphasize that those nonlinear interactions are not always pairwise. Those explicit interaction terms could inform mechanistic studies that further reveal the role of intricate interactions between gut microbiota and host immune markers in CDI pathogenesis.
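The decision rule built on the diagnostic score, and the way Accuracy, Precision, Recall and F1-score depend on which class is treated as positive, can be sketched in Python as follows. The actual SC formulas are those in Table 2 and are not reproduced here. The `toy_score` function is a purely hypothetical placeholder with one microbial and one immune input, included only to show a nonlinear interaction term, and all other names are assumptions as well.

```python
def toy_score(microbe, immune):
    """Hypothetical diagnostic score with an interaction term (NOT a Table 2 formula)."""
    return immune * (1.0 - microbe) - 0.5

def classify(score, positive="CDI", negative="Asymptomatic Carriage"):
    """Sign rule from the text: positive class if score > 0, otherwise the negative class."""
    return positive if score > 0 else negative

def metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy subjects as (microbial feature, immune feature, true class); not study data.
subjects = [
    (0.1, 0.9, "CDI"),
    (0.6, 0.9, "CDI"),
    (0.8, 0.4, "Asymptomatic Carriage"),
    (0.7, 0.2, "Asymptomatic Carriage"),
]
y_true = [s[2] for s in subjects]
y_pred = [classify(toy_score(s[0], s[1])) for s in subjects]
print(metrics(y_true, y_pred, positive="CDI"))
print(metrics(y_true, y_pred, positive="Asymptomatic Carriage"))
```

Accuracy is the same under either choice of positive class, whereas precision, recall and F1-score generally change, which is why both sets of values are worth reporting.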

Consistent with previous studies 42-45, we found that the gut microbiota of CDI patients was characterized by lower Shannon diversity than that of the Control group. Interestingly, we observed an increased variation of both immune markers and gut microbial compositions in the CDI group with respect to the other studied groups. This suggests that CDI is characterized by a significantly less stable microbiome and immune homeostasis. Our findings are in line with the Anna Karenina principle, which suggests that CDI-linked changes in the microbiome and immune homeostasis are likely stochastic, leading to community instability 46-48.

We were able to identify several candidate driver taxa (e.g., Desulfovibrio, Klebsiella, Streptococcus and Veillonella) that played a key role in driving the changes of microbial correlation networks between CDI and Asymptomatic Carriage (or Non-CDI Diarrhea, Non-CDI) groups. Among those driver taxa, Streptococcus has previously been shown to produce lactate, thereby affecting C. difficile TcdA expression and alleviating CDI 49. A previous study indicated that Desulfovibrio has a pathogenic role in ulcerative colitis due to its ability to generate sulfides 50. Klebsiella bacteria have been increasingly shown to develop antimicrobial resistance, most recently to the class of antibiotics known as carbapenems 51,52. It is thus possible that CDI pathogenesis is further reinforced by the enrichment of antagonistic bacteria present in the gut microbiome of CDI subjects.

We developed classification models aimed at differentiating CDI status based on host immune markers and gut microbiome data. We were able to identify specific immune and microbial features that could accurately distinguish CDI subjects. In addition, most of the features identified by feature selection were also differentially abundant genera or differentially expressed immune markers. From the classification of CDI and Asymptomatic Carriage, we were able to select a few features with outstanding discriminability, including Veillonella and Moryella. Interestingly, a positive relationship between Veillonella and CDI has been identified in recent studies 53-56. An important role for Veillonella in CDI is supported by the fact that Veillonella species were associated with low coprostanol levels that correlated strongly with CDI 53. A similar negative relationship between Moryella species and CDI has previously been observed 57. Enterococcus, a feature selected from the classification of CDI vs. Non-CDI Diarrhea, has been reported to be associated with CDI due to vancomycin resistance 58. Consistent with the findings from previous reports 59,60, Epulopiscium was significantly enriched in the CDI group and played an important role in this comparison. Among the features selected from the classification of CDI and Non-CDI groups, Enterobacter and Fusobacterium have been considered opportunistic pathogens involved in multiple diseases 61,62.

Machine learning methods have the potential to identify biomarkers and aid in the diagnosis of many diseases. However, the learnt relationships between predictors and outcome are typically non-transparent, especially for non-linear methods (e.g., decision tree learning) 63 … validate the mechanism underlying the observed associations between these biomarkers and CDI. Finally, further external validation of the classification models and derived formulas needs to be performed on an additional cohort with the same inclusion criteria as the current cohort.

Utilizing this well-characterized cohort and leveraging machine learning tools, we proposed an effective computational framework to quantify the role of the intricate interactions between gut microbiota and host immune markers in CDI pathogenesis. We believe this framework can inform the design of future diagnostics of CDI, as well as therapeutic strategies for its prevention and treatment.

Study cohort
The background and design of this cohort have been described in detail previously 67 … and consistency of stools, consultation with treating clinicians, and detailed chart review (requiring mention of "diarrhea", "loose stools", and/or increased frequency, in notes written by multiple providers). Patients for whom there was any doubt about the presence of diarrhea, or who had chronic diarrhea, were excluded. (2) Asymptomatic Carriage (n=40): Eligible patients were inpatients ≥18 years old, admitted for at least 72 hours, who had received at least one dose of an antibiotic within the past 7 days, and did not have diarrhea in the 48 hours prior to stool specimen submission. Patients with 2 or more loose stools within 24 hours were excluded; patients who had 1 loose stool were included only if they had recently received a laxative. Patients were excluded if they had a colostomy; received oral or intravenous metronidazole, oral vancomycin, oral rifaximin, and/or oral fidaxomicin for >24 hours within the prior 7 days; had been diagnosed with CDI in the past 6 months; or had tested negative for C. difficile within the past 7 days. Stool specimens were collected prospectively under verbal informed consent. A discarded serum sample from within 24 hours of the stool specimen was also captured. NAAT …

Microbial diversity and differential abundance analysis
The diversity measures and permutational multivariate analysis of variance (PERMANOVA) were calculated using the vegan package in R (see Supplementary Methods for details). For differential abundance analysis, we used ANCOM 37 (analysis of composition of microbiomes), with a Benjamini-Hochberg correction at the 5% level of significance, and adjusted for age and sex. The Mann-Whitney U test was used to compare immune marker levels between different groups.
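The two statistical building blocks named above, the Mann-Whitney U test and the Benjamini-Hochberg correction, can be sketched in standard-library Python. This is a simplified illustration, not the study's code. The function names are hypothetical, the p-value uses a normal approximation without tie or continuity correction (an assumption that is only reasonable for larger samples), and applying the Benjamini-Hochberg adjustment to a set of Mann-Whitney p-values is shown purely as an example of the procedure.

```python
import math

def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x, y) pairs with x > y, ties counting 0.5."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)

def mann_whitney_p(x, y):
    """Two-sided p-value via normal approximation (no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (mann_whitney_u(x, y) - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy marker concentrations in two groups (illustrative values, not study data).
cdi = [12.0, 15.5, 20.1, 18.3, 25.0]
non_cdi = [8.2, 9.9, 11.0, 7.5, 13.1]
print(mann_whitney_u(cdi, non_cdi), round(mann_whitney_p(cdi, non_cdi), 4))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In practice, a tested library implementation is preferable, since exact p-values and tie corrections matter for small groups.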

Microbial correlation network and microbiome-immune association analysis
The microbial correlation networks were constructed using SparCC 38 (sparse correlations for compositional data, https://github.com/luispedro/sparcc) (see Supplementary Methods for details). We also used NetShift 39 (https://web.rniapps.net/netshift) to identify potential "driver" taxa underlying the differences of microbial correlation networks associated with CDI and Asymptomatic Carriage (or Non-CDI Diarrhea, and Non-CDI). The key driver taxa were identified based on the neighbor shift (NESH) score, Jaccard Index and delta betweenness (ΔB). Associations between the gut microbiota and host immune markers were quantified by Spearman correlation coefficients in combination with Benjamini-Hochberg FDR correction to account for multiple hypothesis testing (significance threshold 0.05). All included genera were required to be detected in 50% of all samples in each group.
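The prevalence filter and the Spearman correlation described above can be sketched in standard-library Python as follows. The function names are hypothetical, and reading the prevalence requirement as "detected in at least 50% of samples" is an assumption. Spearman's coefficient is computed as the Pearson correlation of mid-ranks, which handles ties. The p-values and the Benjamini-Hochberg step are omitted here.

```python
import math

def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of mid-ranks."""
    return pearson(midranks(a), midranks(b))

def is_prevalent(abundances, min_fraction=0.5):
    """True if the genus is detected (abundance > 0) in at least min_fraction of samples."""
    detected = sum(1 for v in abundances if v > 0)
    return detected >= min_fraction * len(abundances)

# Toy data for one genus and one immune marker within a group (not study data).
genus = [0.00, 0.02, 0.05, 0.10, 0.08]
marker = [30.0, 25.0, 14.0, 9.0, 11.0]
if is_prevalent(genus):
    print(round(spearman(genus, marker), 3))
```

Running this per group, for every retained genus and every immune marker, yields the kind of correlation matrix that is then corrected for multiple testing.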

Classification with Random Forests model
To build a classification model capable of testing the overall contribution of immunological or microbial data in distinguishing the CDI status, we developed a multi-class random forests (RF) classifier. The data were split into a training set and a test set, with 70% of the data forming the training set and the remaining 30% forming the test set. The performance of the multi-class model was measured by micro-average and macro-average AUC. To determine whether more specific host immune markers or gut microbial taxa could differentiate CDI subjects from Asymptomatic Carriage, Non-CDI Diarrhea and Non-CDI groups, we constructed binary classifiers based on RF models with integrated immune markers and microbiome data (see Supplementary Methods for details).
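The "1-SE" rule used to select features for the binary classifiers can be expressed as a short Python sketch. It assumes that each candidate model is summarized by its number of features, its mean AUC, and the standard error of that AUC, and that the standard error of the all-features model defines the tolerance. The function name, this choice of standard error, and the numbers in the example are assumptions for illustration, not values from the study.

```python
def one_se_rule(results):
    """Pick the smallest model whose mean AUC is within one SE of the full model.

    results: list of (n_features, mean_auc, se_auc) tuples, one per candidate model.
    The model with the most features is treated as the all-features reference.
    """
    full = max(results, key=lambda r: r[0])
    threshold = full[1] - full[2]  # full-model AUC minus one standard error
    # ">=" keeps the reference model eligible even if its SE is zero.
    eligible = [r for r in results if r[1] >= threshold]
    return min(eligible, key=lambda r: r[0])

# Hypothetical cross-validation summary: (number of features, mean AUC, SE).
candidates = [
    (2, 0.80, 0.03),
    (4, 0.89, 0.02),
    (10, 0.90, 0.02),
    (60, 0.90, 0.02),
]
print(one_se_rule(candidates))
```

The rule trades a small, statistically negligible loss in AUC for a much shorter feature list, which is what makes the subsequent symbolic classification formulas compact.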

Symbolic classification with genetic programming
We employed Karoo GP 69, a genetic programming application suite written in Python that supports both symbolic regression (SR) and symbolic classification (SC) analysis, to derive simple formulas for CDI diagnosis. With different training sets, SC derives different formulas, but their classification performances are quite comparable (Table S8). The formulas shown in Table 2 were derived based on the whole dataset (for details see Supplementary Methods). To demonstrate the advantage of SC, for each classification task (i.e., CDI vs. Asymptomatic Carriage, CDI vs. Non-CDI Diarrhea, and CDI vs. Non-CDI), we also performed logistic regression (LR) using the same set of selected features as used in SC ( … Veillonella. In particular, for each classification task (regardless of using SC or LR), the selected features were: (1) CDI vs. Asymptomatic Carriage: 1, 4, 13 and 15; (2) CDI vs. Non-CDI Diarrhea: 1, 2, 9, 10, and 11; (3) CDI vs. Non-CDI: 1, 3, 4, 5, 6, 7, 8, 12, 14 and 15. Note that in the calculation of precision, recall and F1-score, we can treat either CDI or the comparison group (Asymptomatic Carriage, Non-CDI Diarrhea, or Non-CDI) as the positive class. Results shown in parentheses represent the latter case.