Data acquisition
The scRNA-seq data and bulk RNA-seq data of human CRC samples were included in this study. The scRNA-seq profiles included 10,398 cells from 10 human CRC samples (accession number GSE146771), which were obtained from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database. This database contains 5169 cells from tumor cores, 2400 cells from paracancerous tissues, and 2829 cells from peripheral blood, performed using the SMART-seq2 platform. Normalized matrix files for the dataset were downloaded. The bulk RNA-seq data of CRC samples, including 398 tumor samples and 39 normal samples, were obtained from the TCGA database (https://portal.gdc.cancer.gov/).
We excluded samples with an overall survival (OS) time < 7 days or insufficient clinical information regarding age, gender, or TNM stage.
Processing of the CRC scRNA-seq data and bulk RNA-seq data
A total of 10 samples and 10,398 cells were included in this study. The Seurat package in R 4.0.3 was used for quality control (QC). The quality standards were as follows: 1) genes detected in < 3 cells were excluded; 2) cells with < 50 total detected genes were excluded, and; 3) cells with ≥ 5% of mitochondria-expressed genes were excluded. For the remaining cells, cell-cycle scores were calculated using Seurat’s CellCycleScoring function since the cell cycle phase effect was observed. Batch effects among the patients had already been eliminated by the data donator. The gene expression matrices were further normalized to RNA counts, mitochondrial percentages, and cell cycle scores using the top 3000 variable genes. PCA was used to calculate the significantly available principal components (PCs). We then applied the t-distributed stochastic neighbor-embedding (tSNE) algorithm for dimensionality reduction with 20 initial PCs to perform cluster classification analyses across all cells.
Cell type recognition
We performed differential expression analysis among all genes within cell clusters using Seurat’s FindAllMarkers function to identify the marker genes in each cluster. An adjusted P-value < 0.05, expression percentage > 0.25, and | log2 [fold change (FC)] | > 0.25 were considered cutoff criteria for identifying marker genes. Subsequently, different cell clusters were determined and annotated by the singleR package according to the composition patterns of the marker genes and were then manually verified and corrected with the CellMarker database. The malignant cells were annotated by correlation with the data donator’s cell annotation of malignant cells. The corresponding genes of cell surface markers for the annotation of cell clusters are listed in Supplementary Table 1.
Pseudotime trajectory analysis
Single-cell pseudotime trajectories were constructed using the Monocle 2 algorithm, an R package designed for single-cell trajectories by Qiu et al. to reveal the immune cell differentiation process. This algorithm applies a machine learning technique to reduce the given high-dimensional expression profiles to a low-dimensional space, visualized as a tSNE plot. Single cells were projected onto this space and ordered into a trajectory with branch points. The dynamic expression heatmaps were constructed using the plot pseudotime heatmap function. In addition, differential expression analysis between branches was performed using the plot_genes_branched_heatmap function.
Functional enrichment analysis
DEGs analysis was performed using Seurat’s FindMarkers function. The following cutoff threshold values were used: adjusted P-value < 0.05 and |log2 [FC]| >1. The DEGs were loaded into Metascape (http://metascape.org), a tool for gene annotation and gene list enrichment analysis.
GSVA was performed to explore correlating pathways of different cell clusters. We downloaded “c2.all.v7.4.symbols.gmt” and “c5.all.v7.4.symbols.gmt” from the Gene Set Enrichment Analysis website (http://www.gsea-msigdb.org), which were used for further enrichment analyses. The GSVA analysis was performed in R 4.0.3 to calculate the enrichment score of the pathways in each cell.
Cell-cell communication analysis
CellChat is a versatile toolkit used to infer and analyze intercellular communication networks from scRNA-seq data quantitatively. Based on the ligand-receptor interactions database for humans and mice and using network analysis and pattern recognition approaches, CellChat can predict major signaling inputs and outputs for cells and establish how those cells and signals coordinate their functions. As described above, 10,398 single cells were clustered into 28 cell types. CellChat was then applied to analyze the molecular interaction among the 28 cell types. Ligand-receptor pairs with a P-value < 0.05 were filtered to evaluate the relationship between different cell types.
Gene regulatory network analysis
We used SCENIC (Aibar et al., 2017), an algorithm package that can reconstruct transcriptional states and regulatory networks from scRNA-seq data, to access the gene regulatory networks relating to TFs and regulons in individual cells of the CRC samples. The gene expression matrix of each cell type was input into SCENIC, and a co-expression network was constructed using GENIE3. Direct binding by DNA-motif analysis was identified based on a motif dataset (hg19-500bp-upstream-7species.mc9nr.feather, hg19-tss-centered-10kb-7species.mc9nr.feather) to construct regulons for each transcription factor. Finally, regulon activity was analyzed using AUCell (Area under the Curve), where a default threshold was applied to binarize the specific regulons (“0” = “off” for TFs, and “1” = “on”). Regulon modules were identified based on the Connection Specificity Index (CSI) to identify specific associating partners. Hierarchical clustering with Euclidean distance was performed based on the CSI matrix to identify different regulon modules. We then used 0.65 as a cutoff to construct the regulon association network, to investigate the relationship between different regulons.
Correlation with bulk RNA-seq data
CIBERSORTx is a new machine learning method developed from CIBERSORT for estimating the abundance of cell clusters in bulk RNA-seq data. This tool was used to digitally purify the transcriptome of individual cell clusters from the bulk data without isolating single cells. Based on the scRNA-seq data identified previously, we applied CIBERSORTx to create a signature matrix. Next, we used CIBERSORTx to estimate the fraction of each cell cluster in TCGA CRC expression matrix, which was first normalized to transcripts per million (TPM) values. Next, multivariate Cox regression and stepwise regression analyses were applied to select the optimal coefficient for each cell cluster to construct the risk model. The risk scores were then divided into “high risktype” and “low risktype” according to the median risk score. The formula for the model is as follows:
Then we incorporated the risktype, TNM stage, gender, and age to construct a new clinical risk model using multivariate Cox regression. The clinical risk scores were divided into “high clinical risktype” and “low clinical risktype” according to the median clinical risk score. The formula for the clinical model is as follows:
The associations of risktype and clinical risktype with OS were analyzed using KM curves with ROC curve analysis to verify the sensitivity and specificity of the model for the training cohort.
Statistical analyses
Statistical analyses were conducted using R software (version 4.0.3; R Foundation for Statistical Computing, Vienna, Austria). All statistical tests were two-sided, with P-values < 0.05 considered statistically significant.