Sample and dataset collections
We downloaded the single-cell RNA-seq gene expression data from GEO (Gene Expression Omnibus, https://www.ncbi.nlm.nih.gov/geo/) with accession number GSE125527, GSE150115, GSE12527, GSE150516, GSE162335, GSE132465, GSE144735, GSE188711, GSE178318, GSE164522, GSE146771, GSE178341, GSE200997, Single Cell Portal(https://singlecell.broadinstitute.org/single_cell) with accession number SCP259, and Gut Cell Survey(https://www.gutcellatlas.org/). Colorectal cancer bulk transcriptome data and CMS typing information are from CRCSC(Colorectal Cancer Subtyping Consortium, https://www.synapse.org/#!Synapse:syn2634742). Bulk RNA-seq data of UC were downloaded from the GEO with accession number GSE109142, GSE128682, and GSE150961.
Bioinformatics analysis
Gene set variation analysis(GSVA). Firstly, we carried out quality control on the single cell data of 148 cases of colorectal cancer, eliminated 8 samples of poor quality, and converted the remaining samples into pseudo-bulk data. Then we performed “gsva” to derive the enrichment score of UC-CRC signature in pseudo-bulk data using a R package called GSVA(v.1.38.2).
Finite mixture models(FMM). We assume that some of the colorectal cancer samples are CAC and the other is sCRC, then the UC-CRC score calculated by GSVA will have different distributions. In order to get the best cut off value, we used finite mixture model(FMM) by using R package Flexmix (v.2.3–18)[73]. FMM is a statistical model that assumes the presence of unobserved groups, called latent classes, within an overall population. Each latent class can be fit with its own regression model, which may have a linear or generalized linear response function. We choose the best model according to Bayesian information criterion(BIC).
BIC = kln(n)-2ln(L)
Where k is the number of model parameters, n is the number of samples, and L is the likelihood function. Model evaluation criteria: the lower the BIC, the better. According to the optimal number of classifications recommended by the FMM, we classify the 14 samples with the highest GSVA scores into one category
K-means clustering algorithm. We use k-means clustering algorithm(KCA) to verify the cut off obtained by FMM. KCA is an iterative clustering analysis algorithm. Its step is to divide the data into K groups and randomly select K objects as the initial clustering centers, then calculate the distance between each object and each seed clustering center and assign each object to the nearest clustering center. Cluster centers and the objects assigned to them represent a cluster. For each sample assigned, the cluster center of the cluster will be recalculated according to the existing objects in the cluster. This process will be repeated until a termination condition is met. The termination condition can be that no (or minimum number of) objects are reassigned to different clusters, no (or minimum number of) clustering centers change again, and the sum of square errors is locally minimum. Under the recommended optimal number of classifications, we also classify the 14 samples with the highest GSVA scores into one category
ScRNA-seq: Quality control and doublet removal. As for quality control of scRNA-seq data, we set the cut off min.cells as 10 in the CreateSeuratObject function in Seurat (v.4.1.0)[74]. The data were filtered to remove cells with fewer than 200 unique genes per cell or greater than 2500 genes per cell, and remove cells with mitochondrial genes greater than 5%. To remove the potential doublets, we perform doublet remove by using DoubletFinder (v.2.0.3)[75].
The data were then processed with Seurat’s standard pipeline. First, NormalizeData was run using the method LogNormalize and scale.factor of 10,000. We scaled data with top 2000 most variable genes by using FindVariableFeatures function in R package Seurat (v.4.1.0). We used variable genes for principal component analysis (PCA), used FindNeighbors in Seurat to get nearest neighbors for graph clustering based on PCs, and used FindCluster in Seurat to obtain cell subtypes, and visualized cells with the uniform manifold approximation and projection (UMAP) algorithm. To eliminate the batch effect, we performed harmony algorithm in Harmony (v.0.1.0)[76] to remove batch correction before clustering analysis, and applied FindNeighbors and FindCluster in Seurat to obtain cell subtypes. The resolution was selected variably according to the number of cells. The major cell types were annotated by comparing the canonical marker genes and the differentially expressed genes (DEGs) for each cluster. We defined 33 immune cell types in CRC based on the gene signatures of each cell type and known lineage markers.
Differential-expression analysis. We used the “FindAllMarkers” function in Seurat to identify genes that are differentially expressed between clusters with the following parameters: min.pct = 0.25, logfc.threshold = 0.25. The non-parametric Wilcoxon rank-sum test was used to obtain p-values for comparisons, and the adjusted p-values, based on Bonferroni correction, for all genes in the dataset. We used heatmap to visualize DEGs based on gene expression after the log-transformed and scaling.
Cell pseudotime trajectory analysis. Monocle3(v.1.0.0)[77] was used to construct the pseudotime trajectory. In detail, the “cell_data_set” was built from the Seurat object of all the tumor cells using the data slot of the integrated assay. Dimension reduction was done by UMAP, “learn_graph” and “order_cells” functions were used to establish the trajectory. Each single cell was projected on the tree and formed the trajectory by DDRThree. The “root_state” of this trajectory tree was manually appointed in cells from healthy samples.
Using "graph_test" function to find differential genes of pseudotime trajectories(we set 1e-10 as q_value cut-off value) and calculate the co-expression gene module of the differential genes(Extended Data Figs. 4a).
Pathway enrichment analysis and Gene set variation analysis(GSVA). We select the co-expression gene modules that gradually increase with the state of the disease. We next performed the pathway enrichment analysis of those differential expressed genes by using cluterProfiler (v.3.18.1) (enricher function, KEGG gene sets or cancer hallmark genesets from msigdb). For GSVA analysis, e performed “gsva” to derive the enrichment score of these differentially expressed genes in the UC bulk transcriptome data using a R package GSVA(v.1.38.2).
Interaction map in each patient sample using CellPhoneDB. To analyze receptor-ligand pairing in healthy, UC, CAC-like samples, we used CellphoneDB(v.3.0.2)[78], a cell-cell interaction network tool for scRNA-seq datasets. The interactions between distinct cell subtypes via putative ligand-receptor pairs were visualized by a heatmap.
Simplified the UC-CRC signature VOSLassobag. To streamline the UC-CRC signature, we used the VOSLassobag to shrink genes, and further selecting the intersection of genes with a frequency greater than 1 and the UC-CRC signature.
Statistical Analysis.
Statistical analysis was performed in R v.4.0.3. Comparisons of numerical variables according to disease state were carried out using the Wilcox test and the ANOVA test. Survival analysis was performed by R package. Survival (V3.3-1) and survminer (V0.4.9) were adopted to generate the Kaplan-Meier plot and the P-value was calculated through log-rank test. Hazard ratio (HR) was calculated by Cox proportional hazards model and 95% CI was reported.