Accurate Five-category Classication for Colorectal Cancer Using Gut Microbiome 16S rRNA Sequencing

The association between the and 23 remains unclear. We performed 16S rRNA sequencing of the V3-V4 amplicon from 999 24 samples from subjects at various stages of CRC development and constructed an accurate 25 predictive random forest model for CRC development. In the testing set, our five- 26 category CRC prediction classifier had accuracies of 0.84 and 0.74 using the relative 27 operational taxonomic unit (OTU) abundances and relative genus abundances, 28 respectively. Specifically, the OTU-based classifier had a sensitivity of 0.97 and 29 specificity of 0.97 for CRC samples, and the genus-based classifier had a sensitivity of 30 0.97 and specificity of 0.95 for CRC samples. Meanwhile, the gut microbiota was found 31 to differ at all stages of CRC development. The differential abundances of closely related 32 bacteria were used to accurately classify the five stages of CRC development. Additionally, both and OTUs played important roles in classifier modelling. Our work not only provides valuable 16S rRNA sequencing data from 35 patients and healthy individuals on a large scale but also identifies reproducible gut microbiome biomarkers for CRC staging and highlights their potential applications as noninvasive microbiome biomarkers for diagnosis and as predictive CRC screening tests.

Colorectal cancer (CRC) is the 3rd most common cancer in the world and the 2nd most 41 common cancer in China, and it accounts for more than 0.57 million deaths worldwide 1 . 42 Most sporadic CRCs develop through polypoid adenomas to an intramucosal carcinoma 43 (high-grade dysplastic adenoma), which then progresses to malignant forms over 5-10 44 years. This is known as the adenoma-carcinoma process, which is induced by a multistep 45 mechanism that involves intestinal flora, metabolism, inflammation, gene mutations, 46 etc. 2-7 . Because the cancerization process can last for decades, the detection of 47 precancerous-stage adenoma or early-stage CRC could bring the 5-year relative survival 48 rate to up to 90%, alleviate the mortality rate, and reduce the economic burden. 49 Previous investigations have shown that the intestinal flora is different between healthy 50 individuals and patients with diseases, including CRC [8][9][10] . Recent studies have shown that 51 from nodular polyps to early cancer to metastatic disease, the intestinal flora changes at 52 all stages of CRC development 11,12 . With in-depth research advances, increasing evidence 53 shows that the intestinal flora and its related metabolites are closely related to the 54 occurrence of malignant tumours through the development of chronic inflammation and 55 immune surveillance disorders 13 . Changes in the type and number of intestinal flora 56 constituents significantly affect the occurrence and development of CRC 10,14-16 . For 57 example, certain strains of Escherichia coli and Bacteroides fragilis have a significant 58 promotion effect on CRC 17-21 . Nevertheless, the relative abundance of any one kind of 59 microorganism may not be able to distinguish between healthy individuals and CRC 60 patients 22 . Alternatively, metagenomic profiling of faecal microbiomes was used to 61 discover microbial biomarkers between CRC patients and healthy individuals 23 . The 62 7/31 second principal components (Fig. 1b). The CRC samples were significantly lower than 114 in other groups of samples in the first principal component (upper boxplots). According 115 to the second principal component test (left boxplots), the most significant difference was 116 found between the NAA and AA groups and between the NAA and HI samples (P < 117 0.001). The CRC and AA, NAA, and HI samples were significantly different, with a P 118 value < 0.05. Meanwhile, significant differences between the NAA and PL samples and 119 between the HI and PL samples were also found. These results indicated that the gut 120 microbiotas of each stage of CRC progression share similar bacterial components but 121 could be significantly separated from those of other stages. 122 To identify stage-specific gut microbial signatures with CRC progression, the sequences 123 of 561 OTUs were annotated and analysed at the phylum (Extended Data Fig. 1b), 124 family, and genus levels. The OTUs were divided into 17 families, and only 16 genera 125 were found to have significant differences among serial CRC stages using a two-sided 126 Mann-Whitney U test (Fig. 2) gradient-boosted trees), and Random Forest Classifier). As a result, the accuracy of 152 classification in both the training set and testing set was apparently affected by using 153 different algorithms, and the ensemble classifiers generally outperformed the weak 154 classifiers in accuracy using either relative OTU or genus abundances ( Table 1). 155 For weak classifiers using the relative OTU abundances, the classifier using the decision 156 tree algorithm had the highest accuracy (0.616) in the testing set, followed by using naïve 157 Bayes algorithms (0.512). Using the relative genus abundances, the accuracy rank was 158 consistent with the ranking by the relative OTU abundance modelling but with a smaller 159 9/31 accuracy value. Compared with that of the weak classifiers, the classification accuracy of 160 the ensemble classifiers was considerably improved. The ensemble classifier constructed 161 using the relative OTU abundances and relative genus abundances with the highest 162 classification accuracy was the random forest classifier, with testing accuracies of 0.845 163 and 0.743, respectively. Altogether, the relative abundances using OTUs have a better 164 classification performance than that of the genera, and the random forest model has the 165 best classification effect. These findings highlighted that the relative taxonomic 166 abundances from 16S rRNA gene sequencing data can be accurately associated with 167 different stages of CRC using the random forest classifier. 168 Assessment of CRC staging classification using the microbiome random forest 169 model 170 In addition to the benchmarking of classification accuracy using various classifiers, we 171 assessed the prediction ability of our random forest classifier for the 297 samples in the 172 testing set. The sensitivity (true-positive rate) and specificity (1− false-positive rate) of 173 the classifier were evaluated using the relative OTU abundances and the relative genus 174 abundances ( Table 2). Using the relative OTU abundances, the sensitivity and specificity 175 of the classification model for CRC were both 0.97. As demonstrated in the matrix of 176 confusion (Fig. 3a), the OTU-dependent random forest classifier divides the 59 samples 177 from CRC patients into 57 CRCs, 1 AA, and 1 NAA. Meanwhile, the 59 healthy control 178 samples were divided into 55 HIs, 2 PLs, 1 NAA and 1 AA. Together, our trained 179 random forest classifier performed better in the classification of CRC and healthy stages 180 with a higher sensitivity and specificity than other stages of CRC development. For 181 classification using the relative genus abundances (Fig. 3b), the performance of the 10/31 classifier was generally worse than that of using the relative OTU abundances ( Table 2). 183 The genus-dependent classifier had a sensitivity value of 0.97 with a specificity of 0.95 184 for CRC samples, but the sensitivity for heathy samples was only 0.71. These results 185 suggested that the trained random forest classifier using the relative genera abundances 186 will have a high false positive rate among healthy samples and thus is impractical for 187 clinical application, especially in CRC early detection. Therefore, these results suggest 188 that a faecal microbiome predictive test using 16S rRNA gene sequencing with relative 189 OTU abundances with our trained random forest classifier could serve as a promising 190 clinical tool for large-scale CRC screening. 191

Correlations among critical taxonomic biomarkers 192
Based on the outstanding ability of our random forest classifier to discriminate each stage 193 of CRC progression from the training set to the testing test, we tried to explain the 194 correlation among biomarkers in the OTU-and genus-based classification model. The 195 relative abundances of the top 5 important OTUs and OTU-annotated genera were 196 extracted, and calculations were performed (Fig. 3c, see Extended Data Table 1 for  197 OTUs not in the database). As a result, the top five OTUs were all highly correlated with 198 absolute values greater than 0.2 (Fig. 4f). For instance, OTU511 (Dialister) and OTU325 199 (Bacteroides) have a positive correlation with a correlation coefficient of 0.41, and 200 OTU153 (Bacteroides) and OTU202 (Prevotella) have a negative correlation with a 201 correlation coefficient of -0.37. These results highlighted the importance of each OTU 202 even from the same genus. However, the top five genera generated from the relative OTU 203 abundances were found to be less correlated (Fig. 4l). For example, Clostridium sensu 204 stricto was found to have a negative correlation between Bacteroides with a correlation 205 11/31 coefficient of -0.23 and a positive correlation between Prevotella (0.21). Using the 206 relative OTU abundances, the correlations among Clostridium sensu stricto, Diaslister,207 and Fusobacterium were all less than 0.1, which demonstrates the weak correlation of 208 genera annotated from OTU data in the random forest classifier. Meanwhile, the top 5 209 important features from the model using relative genus abundances were also highly 210 correlated (Fig. 3d, Extended Data Fig. 2). Considering the false positive rate of the 211 genera model, the importance of each genus was perturbed by the neighbouring groups 212 because of the inappropriate taxonomic level. 213 In addition to the correlation analysis among features in the classification models, the 214 distribution of critical OTUs and genera showed that the relative abundances of important 215 OTUs and genera had diverse distribution patterns between stages of CRC development 216 ( Fig. 4a-e, g-k). For example, any one of the 5 OTUs had a significant difference 217 between CRC and other stages (P < 0.0001, Wilcoxon test), and both OTU325 (Fig. 4a) 218 and OTU153 (Fig. 4c) could effectively distinguish AA from other stages (P < 0.001, 219 Wilcoxon test). In the distribution of genus abundances, the abundances of Bacteroides 220 ( Fig. 4k), Prevotella (Fig. 4h), Clostridium sensu stricto (Fig. 4i), and Dialister (Fig. 4g) 221 in the CRC samples were significantly different from those in the other samples, while 222 the abundances of Fusobacterium were significantly different only between CRC and 223 NAA or PL (Fig. 4j). According to the distribution divergence of OTUs and genera, these 224 results supported that the random forest classifier using the significantly different 225 distributed relative OTU abundances could have a better classification accuracy. 226

Microbiome functional changes during CRC progression 227
To evaluate the correlation of signalling pathways between normal samples and abnormal 228 samples, the functional potentials of the microbiome were inferred using PICRUSt2 32 . A 229 total of 29 signalling pathways were found to be highly correlated with different statuses 230 of colorectal diseases (Extended Data Fig. 3)   In previous studies, researchers mainly constructed a two-class model that distinguishes 282 CRC/adenoma and healthy controls 35-37 . This study included all the stages of CRC 283 development ranging from HI, PI, NAA, AA to CRC and constructed a five-category 284 classification model using 11 algorithms. The classifiers using ensemble algorithms are 285 superior to the weak classifiers. The relative OTU abundances achieved the best 286 classification performance on the random forest model with accuracies above 0.8. 287 Furthermore, using OTUs as a feature gave better results than using genera as a feature 288 for classification. This probably resulted from the interference of the addition of OTUs 289 without classification value into relative genus abundances (Extended Data Table 3). 290 Together, it is recommended to use the relative OTU abundances as features when using 291 15/31 16S rRNA sequencing data to construct a classification model. Since more valuable 292 classification factors among categories can be retained using the relative OTU 293 abundances, the classification model can be more accurate than using a genera model. 294 Overall, the testing results of the 297 samples exhibit the robustness of our trained five-295 category random forest classifier. 296 In addition to evaluating the performance of our OTU-based classification model, we 297 tried to explain the correlation of critical features that contribute to the model. Using

Sample grouping 324
Samples were divided into five groups, HI, PL, NAA, AA, and CRC, according to the 325 following histopathological criteria: HI was defined as normal without hyperplasia or 326 neoplastic findings under colonoscopy; PL was defined as non-adenomatous hyperplasia; 327 NAA was defined as having adenoma with a diameter <10 mm; AA was defined as 328 adenoma with high-grade dysplasia, adenoma with a diameter ≥10 mm, villous 329 differentiation in ≥25% of the visible area under a microscope, or a serrated lesion with a 330 diameter ≥10 mm; and CRC was defined as all stages of CRC (I-IV). 331

17/31
Library preparation and sequencing 332 Total genomic DNA from faecal samples was extracted and purified using nucleic acid 333 extraction and purification kits (New Horizon Health Technology Co., Ltd., China). The 334 DNA concentration was measured using a NanoDrop2000 (Thermo), and the quality was 335 assessed on a 1% agarose gel (1%, w/v). Each DNA sample was diluted to 1 ng/μL with 336 sterile water before PCR amplification. The V3 and V4 hypervariable regions of the 16S 337 rRNA gene were amplified using the primer pair 341F (5'-CCTAYGGGRBGCASCAG- cut-offs, respectively. Then, the reads were input into the UPARSE program 46 for 355 chimaeric read filtering and cluster (OTU) generation with 99% similarity. Next, the 356 abundances of the OTUs in each sample were stored in a matrix, with each row denoting 357 an OTU. For each sample, the total abundances of the OTUs were normalized to 50,000. 358 Moreover, the OTUs were classified and annotated by SINTAX 47 using the taxonomic 359 information in the RDP 48 and SILVA 49 databases. Finally, the relative OTU abundances 360 were compared among groups using linear discriminant analysis effect size (LEfSe) 50 . 361

Data preprocessing 362
Data preprocessing was performed on OTU features and samples. In feature screening, 363 OTUs with normalized abundances less than 10 in more than 90% of the samples were 364 excluded from further analysis. Then, the sum of the remaining OTU abundances in each 365 sample was normalized to 100 for standardization. As a result, the OTU abundances of 366 each sample were standardized, and the sequencing data were converted into 367 characteristic ratio data. Samples with a standard deviation of the relative OTU 368 abundances greater than 3 were treated as abnormal samples and excluded from the 369 following analysis. To solve the problem of interbatch differences between samples in the 370 sequencing process, we normalized the samples. Specifically, under each OTU sequence, 371 the relative abundance ratio of each sample corresponding to the OTU was calculated and 372 used as the characteristic expression value of the model. That is, in the OTU relative 373 abundance matrix, the samples are normalized to eliminate batch-to-batch problems 374 during the sequencing process. was set using a step size of 0.1 when the weight was updated in each iteration. The total 392 number of iterations was 10, the depth of the tree was 5, and the minimum loss function 393 reduction value required for node splitting was set to 0. When training each tree, the ratio 394 of the used data to the total training set was 0.8, and the ratio of the used features to the 395 total features was 0.8. The objective function was set to logistic, and the weight of the 396   The phylogenetic tree of 561 OTUs was divided into 17 families. The genus nodes of 586 different gates are marked with different colors, and the outmost purple squares indicate 587 nodes has a relative abundance greater than 0.005. A two-sided Mann-Whitney U test 588 revealed 16 genera with a significance level of less than 0.05 in serial CRC stages, and 589 the box plot shows the distribution of these genera. The y axis for each box plot 590 represents the relative abundance. Each plot shows boxes for the five groups (HI, PL, 591 NAA, AA and CRC) in order from left to right. 592