Development and Validation of a Unique Gignature of Cancer Stemness-related Genes for Predicting the Overall survival of Colorectal Cancer Patients

Background: Cancer stem cells (CSCs), which are characterized by self-renewal and plasticity, are highly correlated with tumor metastasis and drug resistance. To fully understand the role of CSCs in colorectal cancer (CRC), we evaluated the stemness traits and prognostic value of stemness-related genes in CRC. Methods: In this study, the data from 616 CRC patients from The Cancer Genome Atlas (TCGA) were assessed and subtyped based on the mRNA expression-based stemness index (mRNAsi). The correlations of cancer stemness with the immune microenvironment, tumor mutational burden (TMB) and N6-methyladenosine (m6A) RNA methylation regulators were analyzed. Weighted gene co-expression network analysis (WGCNA) was performed to identify the crucial stemness-related genes and modules. Furthermore, a prognostic expression signature was constructed using Lasso-penalized Cox regression analysis. The signature was validated via multiplex immunouorescence staining of tissue samples in an independent cohort of 48 CRC patients. Results: This study suggests that high mRNAsi scores are associated with poor overall survival in stage (cid:0) CRC patients. Moreover, the levels of TMB and m6A RNA methylation regulators were positively correlated with mRNAsi scores, and low mRNAsi scores were characterized by increased immune activity in CRC. The analysis identied 2 key modules and 34 key genes as prognosis-related candidate biomarkers. Finally, a 3-gene prognostic signature (PARPBP, KNSTRN and KIF2C) was explored together with specic clinical features to construct a nomogram, which was successfully validated in an external cohort. Conclusions: There is a unique correlation between CSCs and the prognosis of CRC patients, and the novel biomarkers related to cell stemness could accurately predict the clinical outcomes of these patients.


Background
Colorectal cancer is among the most common and lethal cancers of the digestive system [1]. Although neoadjuvant chemo-radiotherapy and immunotherapy offer good prospects for operable colorectal cancer cases, the 5-year survival rates remain low in cases of advanced disease [2]. With advances in individualized tumor treatment, the remarkable tumor heterogeneity was shown to be closely associated with chemoresistance, radiosensitivity and patient survival [3]. Cancer stem cell (CSC) traits, a crucial part of cancer heterogeneity, are considered to be crucial drivers of the prognosis and response to therapy [3,4].
Mounting evidence suggests the existence of CSC in colorectal cancer, with studies revealing their roles in metastasis, drug resistance, and continually helping the cancer adapt to the changing microenvironment [5,6]. It has been demonstrated that the accumulated epigenetic and genetic variability allows CSC to evolve and thereby continue tumor growth and maintenance, which is closely associated with the alteration of the tumor microenvironment (TME) [7,8]. Recent studies demonstrated that colon cancer cells with stem-cell-like properties can promote drug resistance, migration and invasion, which is regulated by the TME and not a fully autonomous behavior of individual cells [9,10]. Cancer stemness encompasses both the stemness phenotype with bona de CSCs and the intrinsic potential for differentiation into colon cancer cells, which is de ned as a fundamental underlying property of malignancy [11,12]. Studies investigated the levels of cancer cell stemness markers CD133 and CD44 d in cell culture models in vitro revealed that a large number of cytokines, cellular and molecular mechanisms play signi cant roles in mediating cancer cell stemness and CSC properties. These include cancer-associated broblasts, IL13, exosomes and c-Jun signaling in colorectal cancer [13,14]. However, most of theories haven't been con rmed in vivo or translated into clinical research, because of the integrated and complex cancer ecosystem.
The interaction between the immune system and CSCs is still controversial, as the increased tumorigenicity of CSCs reveals that they can promote oncogenic immunomodulation in colon cancer [15,16]. Moreover, cancer cells and CSCs with high stemness can decrease the expression of major histocompatibility complex (MHC) molecules, thereby promoting immune evasion and decreasing the activity of anti-tumorigenic immune cells [17]. However, an integrated understanding of colorectal cancer stemness, including its interactions with the tumor immune microenvironment, still requires further research. To evaluate the role of stemness in tumor pathogenesis and the vital factors leading to dedifferentiation and acquisition of stem-cell-like properties in colorectal cancer, arti cial intelligence and bioinformatic methods could be employed to further understand cancer stemness [18,19]. One-Class Logistic Regression (OCLR) can be sued to extract epigenetic and transcriptomic features from normal stem cells and their differentiated progeny, including induced pluripotent stem cells and embryonic stem cells with different level of stemness. Notably, this approach could also be used to identify the stem cell signatures and quantify cancer stemness via a multi-omics analysis.
In this study, we hypothesized that cancer stemness may confer immunosuppressive properties on tumors and mediate the prognosis. To verify it, we developed the immune score construct and identi ed the proportions of immune cells using vector regression. We then assessed of correlations the stemness indices with molecular features, and identi ed a stemness-related gene signature that might be helpful in evaluating the prognosis of colon and rectal cancer patients.

Study design
To explore the mRNAsi in colon and rectal cancer, we obtained the transcriptome pro ling for the gene expression and clinical information of colon and rectal cases from the TCGA database, and the analysis work ow was shown ( Figure 1). We employed the "Cell type Identi cation by Estimating Relative Subsets of RNA Transcripts (CIBERSORT)" algorithm and "Estimation of Stromal and Immune cells in Malignant Tumors using Expression data (ESTIMATE)" to develop the immune score model construction and identify the proportions of immune cells with support vector regression. And we have analyzed the alteration spectrum reversible N6-methyladenosine (m6A) RNA methylation regulators in different level of mRNAsi colon and rectal cancer group. Besides we assessed stemness-related genes sets with WGCNA.
We identi ed signatures from a key gene module, and their prognostic role in colon and rectal cancer. Our studies applied some multi-omics approaches to identify the features of cancer stemness and de ne the roles of stemness-related genes in colorectal cancer.

Data sources
Within the The Cancer Genome Atlas (TCGA) database (https://tcga-data.nci.nih.gov/tcga/), we identi ed the transcriptome pro ling by RNA-sequencing (RNA-seq), single nucleotide variants (SNV) and the corresponding prognostic and clinicopathological information of the colon and rectal cancer set. And the Ensembl IDs were converted to gene symbols based on the Ensemble database (http://asia.ensembl.org/ index.html). The RNA-seq results of 433 tissues and 408 colon cancer samples were obtained from COAD, and 221 tissues and 208 rectal cancer samples were obtained from READ with the FPKM method [20]. The SNV results were obtained from COAD and READ with the VarScan2 Variant Aggregation and Masking.

Colorectal cancer patients
The colorectal cancer specimens who underwent the surgical resection from January 2006 through December 2012 were approved by the Pathology Department of Cancer Institute and Hospital, Chinese Academy of Medical Sciences. All of the patients have gotten informed consent. Patients who met the following criteria were included: (A) no getting neoadjuvant radiotherapy or chemotherapy before the surgery; (B) with complete clinicopathological information, including the sex, age, tumor location, TNM classi cation, follow-up time and so on. And the specimens in this study were followed up every three months until 1 January 2018. Forty-eight tumor tissues and adjacent normal tissues were collected from colorectal cancer patients with untreated stage to stage IV. mRNAsi in subtypes Unsupervised learning, as a machine learning technique, is employed in the eld of medical data mining, and Wiznerowicz et al. [19] draw two stemness indexes based on re ected epigenetic regulation features and transcriptome, respectively, named mRNAsi and mDNAsi with OCLR. The mRNAsi is used to identify cancer stemness features and assess the degree associated with oncogenic dedifferentiation, which could be a quantitative form of cancer CSCs. Higher mRNAsi scores are related with cancer biological processes in CSCs and more tumor dedifferentiation based on the histopathological grades. Besides these studies further evaluated annotated cancer stemness in nearly 12,000 samples, covering 33 tumor types from TCGA. And the stemness feature was performed employing colon and rectal cancer specimens from TCGA-COAD and -READ to get the comprehensive molecular subtyping of each specimen [19]. And we divided colon and rectal cancer patients into high-mRNAsi group and low-mRNAsi group based on the cut-off value determined by the median mRNAsi index.

Differentially expressed genes (DEGs)
The "limma" packages in R language was employed to identi ed differentially expressed genes in the expression data from the TCGA, and Wilcox Test was used in the analysis processing [21]. The |log2-fold change|>1 and false discovery rate (FDR)<0.05 were considered to be the cut-offs criteria for DEGs selection between colon and rectal cancer and normal sets. The volcano plot and heat-map were drawn with the "pheatmap" package in R.

Estimation of immune microenvironment and in ltrating immune cells
We employed the ESTIMATE to evaluate the immune score, immune cell in ltration level, for each colon and rectal cancer samples from TCGA [22]. And then, we evaluated the enrichment levels of the 29 immune signatures in each colon and rectal cancer specimen by the single-sample gene-set enrichment analysis (ssGSEA) score [23][24][25]. The colon and rectal cases could be classi ed into three subgroups based on the ssGSEA score and hierarchical clustering. Besides, we evaluated the 22 human immune cell subsets of every colon and rectal cancer samples with CIBERSORT web portal (http://cibersort. stanford.edu/) and 1000 permutations [26]. The CIBERSORT deconvolution algorithm output had a p-value<0.05 was considered accurate and successful deconvolution. And the output estimates would be normalized for each sample to add up to one, enabling their direct interpretation as cell fractions for comparison across different groups. The package "Gene lter" R was employed to identify each sample.
Weighted Gene Co-expression Network Analysis (WGCNA) We used the "WGCNA" R package to establish the co-expression network of differentially expressed genes [29]. We employed Pearson's correlation matrices, co-expression similarity matrix and average linkage method to evaluate the correlations among the included genes. We used the function Amn=|Cmn|β (Cmn=Pearson's correlation between gene-m and gene-n; Amn=adjacency between gene m and gene n; β representing soft thresholding parameter) to distinguish the strength of correlations and build a weighted adjacency matrix with a scale-free co-expression network. We used a topological overlap matrix (TOM) to evaluate the connectivity and dissimilarity of the co-expression network established with an appropriate β value. Based on the TOM dissimilarity measurements, the average hierarchical linkage clustering could be established, and we set the minimum genome to 30 to build module dendrograms.
In order to con rm the key modules and genes, we set the module membership (MM) and gene signi cance (GS) to be the measure used to identify the correlation between genes and mRNAsi and epigenetically regulated mRNAsi (EREG-mRNAsi). The module eigengenes (MEs) were de ned as the signi cant components of the principal component analysis (PCA) for each gene module, where the expression level of every gene could be grouped with a distinct feature. We used a log10 transformation of the p-value (GS=lgp) for the linear regression of correlations between clinical phenotypes and gene expression. We used module signi cance (MS) to represent the correlation between clinical traits and gene expression calculated using the average GS in the module. We used a cut-off of < 0.25 to merge highly similar modules, which could help cluster the key genes. The thresholds for the screening of key module genes were set as cor. gene GS > 0.5 and cor. gene MM > 0.6. Gene set, ontology and pathway enrichment analysis The package "clusterPro ler" in R was employed to evaluate the enrichment analysis of Kyoto Encyclopedia of Genes and Genomes (KEGG) and gene ontology (GO) to reveal the key biological functions of the module genes [30]. We set p-value < 0.05 and an FDR < 0.05 as the statistically criteria to output. The whole transcriptome of all tumor samples was employed for GSEA, and only gene sets with NOM p < 0.05 and FDR q < 0.05 were set as statistically criteria.

Protein-Protein Interaction (PPI) Network
We used Search Tool for the Retrieval of Interacting Genes (STRING) Version 11.0 (https://string-db.org/) to establish the PPI network, in order to determine the co-expression relationship between key module genes [31].
Oncomine and Cbioportal database Oncomine (http://www.oncomine.org/) was employed to identify differences in transcriptomic RNAsequencing data of module genes between tumors and normal tissues in colorectal cancer. The threshold limits were as follows: p-value, 0.05; fold change, all; gene level, 10%; data type, mRNA. We compared the transcriptome level of key module genes in clinical cancer specimens were determined by Student's t test. And Cbioportal (http://www.cbioportal.org/) was used to identi ed the somatic mutations of module genes in colorectal cancer.

Multiplex immuno uorescence staining and image analysis
To evaluate the expression and distribution of three key genes related to tumor stemness in colorectal cancer and normal tissues, we performed multiplex immuno uorescence staining using the PANO 7-plex IHC kit (cat. No. 0004100100; Panovue, Beijing, China) and Tyramide Signal Ampli cation Fluorescence Kit (Panovue, Beijing, China) [32]. We established the colorectal cancer tissue microarrays (TMAs) consisting of primary tumor, metastatic tumor and matched normal tissue from cancer patients who had been con rmed by pathological examination with hematoxylin and eosin (H&E) staining [6]. Each of these tissues was cut into pieces of 1.0 mm and attached to the slides (5 mm thick) from the TMAs. The TMAs were incubated with anti-PARPBP (ab211634; 1:100; Abcam; Cambridge, UK), anti-KNSTRN (ab122769; 1:100; Abcam; Cambridge, UK) and anti-KIF2C (12139-1-AP; 1:200; Proteintech, Rosemont, IL) antibodies at 4°C overnight, and then with horseradish peroxidase-conjugated secondary antibody and tyramide. A microwave was uesd to heat-regenerate the TMAs after each Tyramide Signal Ampli cation step. We used 4′,6-diamidino-2-phenylindole (DAPI) to counter-stain the cell nuclei.
The Mantra System (PerkinElmer, Waltham, Massachusetts, US) was used to capture the multispectral immuno uorescence images with the uorescence spectra at 20-nm wavelength intervals from 420 to 720 nm with the same exposure time, which were then composited to establish a single stack image. To capture images of sections without auto uorescence, we extracted the spectrum of auto uorescence of TMAs and each uorescein from the images of unstained and single-stained sections, which were used to establish the spectral library for multispectral unmixing using inForm image analysis software (PerkinElmer, Waltham, Massachusetts, US). Two independent pathologists analyzed and counted singlepositive cells and the expression of the three genes in each tissue of TMAs at 200× magni cation in a blinded manner. The nucleated stained cells were quanti ed and expressed as the number of cells in TMAs. The positive rate of single index and single index intensity score were employed to evaluate the expression and distribution of the identi ed key genes in colorectal cancer, which were calculated by multiplication of the multiplex immuno uorescence staining intensity (percentage of single index %= Number of positive cells/ Total number of cells; The 25% staining was taken as the threshold of the strength score; 25-49%, strength I; 50-74%, strength II; 75-100%, strength III; single index strength score= [(strength I*positive rate of single index) *1+(strength II*positive rate of single index) *2+(strength III*positive rate of single index) *3] *100) [33].

Statistical analysis
In this study, the colon and rectal cancer specimens were allocated into training and testing group randomly through the 2:1 ratio with the package "caret", in order to promote the generalization ability of model. And we established the LASSO-penalized Cox regression model to identify the most signi cantly survival-related gene signatures with the patients' overall survival (OS). We set 10-fold cross validation as the criteria to prevent over tting with the penalty parameter lambda.1se [34]. Then the time dependent receiver operating characteristic (ROC) curve and the area under the curve (AUC) were employed to identify the prognostic accuracy of the three gene signature model in training and testing group with the package "survival ROC" [35]. The median of risk scores was set as the cut-off value to the separate patients into high risk and low risk score group. We employed the Kaplan-Meier survival analysis and the log-rank test to evaluate the difference in OS between different groups. The nal (forward and backward elimination methods) Multivariate Cox regression analysis was employed to evaluate the independence of predictors and established three signatures and nomogram. The we validate the performance of the Cox model internally and externally with bootstrap method. Bootstrap-corrected OS rates were calculated by averaging the Kaplan-Meier estimates based on 2000 bootstrap samples.

Stemness properties of colon and rectal cancer cells
Colon and rectal cancer patients could be subtyped using the mRNAsi The mRNAsi of colon and rectal cancer tissues was signi cantly higher than that of normal tissues (Wilcoxon rank sum test, p<0.05) (Figures 2A and H), which suggested that the level of stemness is associated with tumor development. The mRNAsi among different stages of colon and rectal cancer did not show signi cant differences (Mann-Whitney U test, p<0.05) (Figures 2B, C, I and J), which suggested the mRNAsi may not be closely related to the clinical stage in many tumor types [36,37]. In our study, the colon and rectal cancer patients were divided into high-mRNAsi and low-mRNAsi groups based on the cut-off value determined by the median mRNAsi index. Stage colon cancer patients with higher mRNAsi had a signi cantly shorter OS than those with lower mRNAsi (Log-rank test. P < 0.05, Figures 2D, G, K, and  N). These results suggested that the mRNAsi of colon and rectal cancer is closely associated with the prognosis of stage colon cancer patients.

Differences in cell stemness and mutations among the mRNAsi subtypes
We compared the mutational landscape between colon and rectal cancer patients in the high-and low-mRNAsi groups. A higher proportion of APC (80.7%) and TP53 (61.9%) mutations were found in the high-mRNAsi group than the proportion of APC (70.7%) and TP53 (51.2%) mutations in the low-mRNAsi group of colon cancer patients, and the difference was statistically signi cant ( Figure 3, Supplementary Figure  1). Furthermore, oncogenic KRAS mutations, signature factors of the receptor tyrosine kinase (RTK) pathway, were found in similar proportions in the two groups ( Figure 3, Supplementary Figure 1).

Differences of cell stemness and immune microenvironment between the mRNAsi subtypes
We used 29 immune-associated gene sets to quantify the enrichment levels of different immune cell functions, pathways and types in colon and rectal cancer samples based on ssGSEA scores [23]. The ssGSEA scores of the 29 gene sets were employed for hierarchical clustering of immune subtypes into high, moderate and low groups ( Figure 4A, 4B). We also evaluated the immune score using ESTIMATE, and the immune scores were higher in the high-subtype than those in the low-subtype [23]. The mRNAsi in the high-subtype was higher than in the low-subtype for colon cancer patients. However, there is no signi cant difference for rectal cancer patients (Kruskal-Wallis test, Wilcoxon rank sum test, P < 0.001) ( Figures 4C and F). Moreover, we found that the Immune Score was lower in the high-mRNAsi group than in the low-mRNAsi group for both colon and rectal cancer (Wilcoxon rank sum test, P < 0.001) (Figures 4D and G). In addition, when comparing the Tumor Mutational Burden (TMB) in groups with different levels of stemness index, we observed the opposite trend, with the level of TMB increasing from the low-mRNAsi group to the high-mRNAsi group (low-mRNAsi < high-mRNAsi) (Wilcoxon rank sum test, P < 0.001) ( Figures 4E and H).
We investigated the proportions and differences of tumor in ltrating immune cell subsets between the high-and low-mRNAsi subgroups of colon and rectal cancer patients using the CIBERSORT algorithm and the LM22 gene signature (Figures 5A and E). To further con rm the relationships among 22 tumorin ltrating immune cell types, we performed correlation analysis. The results revealed a positive correlation between CD8+ T cells and M1 macrophages in the high-mRNAsi subgroup. The CD8+ T cells were more strongly negatively correlated with mast cells and resting memory CD4+ T cells in the high-mRNAsi group than the low-mRNAsi group of colon cancer patients ( Figures 5B, C, F, and G). Eight types of tumor-in ltrating immune cells were correlated with the mRNAsi in colon cancer and ve types in rectal cancer (Wilcoxon rank sum, P<0.05) (Figures 5D and H). Among them, six types of tumor-in ltrating immune cells were positively correlated with high mRNAsi in colon cancer, including CD8+ T cells, resting NK cells, activated memory activated CD4+ T cells, follicular helper T cells, as well as resting and activated dendritic cells. Two types of immune cells were positively correlated with low mRNAsi, including regulatory T cells and M1 macrophages.
Therefore, the high-mRNAsi group exhibited higher levels of TMB and a lower immune score, which indicated that tumor stemness may be negatively correlated with tumor immunity and positively correlated with TMB, which is related closely with the production of neoantigens in tumors.  (Figure 6 A-B). To further understand the relationship among this HLA genes, we performed correlation analysis in high and low-mRNAsi groups for colon and rectal cancer patients (Figure 6 C-F). All these results revealed that the level of HLA could be associated negatively with the cancer stemness.
Cell stemness and the expression of m6A RNA methylation regulative factors in subtypes Accumulated evidence suggested that m6A mRNA methylation is signi cant for tumor development, selfrenewal and regulating the development of cancer stem cells [38,39]. High-mRNAsi was signi cantly associated with increasing gene expression, such as METTL3, WTAP, YTHDF1, YTHDF2, RBM15, ALKBH5, HNRNPC and FTO, which are involved in m6A mRNA methylation in colon and rectal (Wilcoxon rank sum test, *p < 0.05; **p < 0.01; ***p < 0.001) ( Figure 6 G-H). Correlation analysis was performed in high and low-mRNAsi subgroups of colon and rectal cancer patients to further investigate the relationship among this m6A RNA methylation regulative factors ( Figure 6 I-L). The result suggested that the high cancer stemness could be signi cantly with the high expression of m6A RNA methylation regulative factors and regulated their correlation.

The mRNAsi-related gene signature
The screening of DEGs and identi cation of mRNAsi-related modules We screened DEGs in datasets of colon and rectal cancer tissues and normal tissues. We identi ed 6498 DEGs in colon cancer, 4528 of which were up-and 1970 downregulated ( Figure 7A). In rectal cancer, 2072 DEGs were up-and 1776 were downregulated ( Figure 7F).
A gene co-expression network was established to classify the genes into biologically signi cant modules based on the average linkage hierarchical clustering strongly associated with the mRNAsi in colon and rectal cancer. In this model, we set β=6 (scale-free R2=0.95) as the soft threshold to construct the scalefree network (Supplementary Figure 2-3), which yielded 9 gene modules in colon cancer and 14 in rectal cancer (Figures 7B, C, G, and H). To further identify the relationship between key gene models and mRNAsi, we set MS as the overall gene expression level of a certain module to identify correlations with stemness phenotypes. We selected the yellow module with a correlation of more than 0.7 in colon cancer and brown module with a correlation of almost 0.6 in rectal cancer, for subsequent analyses (Figures 7B,  D, G, and I). We set cor. MM > 0.6 and cor. GS > 0.5 as the selection threshold, and identi ed 61 key genes signi cantly related to mRNAsi in colon cancer, as well as 41 key genes in rectal cancer. GO and KEGG pathway enrichment analyses were performed to evaluate the principal biological functions of the key genes in colon and rectal cancer (Supplementary Figure 4-5). The results showed that the keys gene are related to DNA replication, ATPase activity, and condensed chromosome kinetochores in colon cancer (Supplementary Figure 4-5). The key genes were found to be involved in regulation of chromosome segregation, condensed chromosomes, catalytic activity and DNA helicase activity in rectal cancer (Supplementary Figure 4-5). Additionally, we established a PPI network to validate protein interactions using the STRING database, based on the key gene selection in both colon and rectal cancer (Supplementary Figure 4-5). We also analyzed the interactions among these key genes in colon and rectal cancer ( Figures 7E and J).

Analysis of key gene expression and mutational landscape
By overlapping the mRNAsi-related genes in colon cancer ( Figure 7E) and those in rectal cancer ( Figure   7J), we obtained 34 overlapped genes and established the PPI network ( Figure 8A, supple Figure). We employed two databases, Cbioportal and Oncomine, to systematically understand the key gene mutational landscape and expression. The results showed that most of genes selected were signi cantly upregulated in more than one cancer type in addition to colorectal cancer with the analysis of Oncomine (Figure 8 B). We also investigated the mutational landscape of key genes in colorectal cancer with Cbioportal ( Supplementary Figure6).

Establishment of a three-gene prognostic signature
Prognostic value of genes in the mRNAsi-related modules in colon and rectal cancer The colon and rectal cancer patients were divided into training and validation cohorts, and 34 key genes were selected to established a prognostic stemness risk score model using the LASSO algorithm and nal (forward and backward elimination methods) multivariate Cox analysis in the training cohort ( Figure 9A,  Supplementary Figure7). As shown in Figure 8A, the poly (ADP-ribose) polymerase 1 binding protein (PARPBP), kinetochore-localized astrin/SPAG5 binding protein (KNSTRN) and kinesin family member 2C (KIF2C) were correlated with a poor prognosis in patients from the training cohort according to the multivariate Cox regression analysis. The risk scores were calculated based on the sum of loci β values and the risk coe cient in the risk prediction model with discrete clinical outcomes with regards to OS ( Figure 9B). The prognostic index formula for colon and rectal cancer patients in the training cohorts was as follows: Validation of the three-gene signature in stage IV colorectal cancer GSEA was conducted to analyze potential biological characteristics of the three signature genes in colon and rectal cancer patients. As shown in Figure 10A, according to HALLMARK collection de ned by MSigDB, the genes in the high-risk group were mainly enriched in cancer stemness-related pathways, such as DNA repair, glycolysis, MYC targets and PI3K-Akt-mTOR signaling. According to GO collection de ned by MSigDB, the genes were enriched in functions related to tumor development such as methylation, mitotic nuclear division, signal transduction by P53 class mediators and DNA damage ( Figure 10B). According to KEGG collection de ned by MSigDB, the genes were enriched in apoptosis, cell cycle, cysteine and methionine metabolism and P53 signaling pathway ( Figure 10C). According to the immunological gene set collection de ned by MSigDB, multiple immune-function gene sets were enriched in the high-risk group ( Figure 10D). To further explore the relationship between the expression of the three signature genes and the tumor immune microenvironment, we analyzed the corrections between the three genes and 22 types of immune cell in ltration pro les (Supplementary Figure9).

Establishment and validation of the nomogram
A Cox regression model was applied to the training cohort to identify the predictors of OS. Univariate analyses indicated that age, stage-T, stage-N, stage-M, and cancer stemness risk score group were associated with OS in colorectal cancer patients (p < 0.1 in all cases) ( Table 1). Next, the nal (forward and backward elimination methods) multivariate Cox analyses found that age, stage-T, stage-M, and cancer stemness risk score group were independent risk factors for OS (Table 1). A nomogram for predicting the 1-, 3-and 5-year OS was established using these independent variables (Fig. 11A). Because age, stage-T, stage-M, and cancer stemness risk score group were predictive for OS in multivariate analysis, these variables were further included in the nomogram. A weighted total score was calculated from these factors, which was applied to predict the 1-, 3-and 5-year OS of colorectal cancer patients.
The nomogram for predicting the 1-, 3-and 5-year OS of colorectal cancer patients was developed based on the multivariate model. The model showed good accuracy for predicting the OS, and internal validation was performed using the training cohort with a C-index of 0.738. Calibration curves for the probability of OS at 1, 3 and 5 years indicated satisfactory consistency between actual observation and nomogram-predicted OS probabilities in both the training cohort and validation cohort (Fig. 11B, Supplementary Fig. 10). Furthermore, the Decision Curve Analysis (DCA) results of the nomograms also con rmed their clinical applicability for predicting the OS, with superior performance compared to AJCC TNM stage. Thus, the results showed that the nomogram expressed a higher net prognostic bene t than the TNM staging system (Fig. 11C, Supplementary Fig. 10).
Evaluation of the expression and distribution of the three key tumor stemness-related genes and performance of a stemness-related genetic model Next, we detected the expression and distribution of PARPBP, KNSTRN and KIF2C in colorectal cancer, and performed a multiplex immuno uorescence staining in TMAs (Fig. 12A). We assessed the single index and single index strength score of the three genes in cancer tissue and matched normal tissue samples, which showed that the expression of these genes was higher in cancer tissues based on the single index strength score (Student's two-tailed t-tests, ***p < 0.001) (Fig. 12B).  (Table 2), and the Kaplan-Meier OS curves of the two groups were signi cantly different (log-rank test, p < 0.05) (Figs. 12C). Additionally, we also established calibration curves for the probability of OS at 1, 3 and 5 years, which indicated satisfactory consistency between actual observation and nomogram-predicted OS probabilities in this cohort (Fig. 12D, Supplementary Fig. 11).

Discussion
Colorectal cancer is a highly aggressive disease, with an urgent need for novel therapeutic strategies and biomarkers for its prognosis and early detection in clinical practice. According to the recently developed cancer stem-cell (CSC) hypothesis, tumor cells are suggested to originate from a stem cell population with self-renewal capacity [8]. These CSCs have been reported to be involved in resistance to cytotoxic conditions, promoting the propagation and recurrence of cancer [40]. Identifying key genes driving the transformation of tumor CSC and underlying biological mechanisms in colorectal cancer may uncover unprecedented therapeutic targets. In this study, we rstly revealed the association between the mRNAsi and prognosis of colorectal cancer patients. The results showed that stemness was signi cantly higher in colon and rectal cancer tumor tissues than normal tissues containing intestinal stem cells.
Recent studies indicated that cancer CSCs may be a dynamic population continuously in uenced by cooperating forces such as microenvironmental, epigenetic and genetic factors [41][42][43]. The stem cells in normal colonic crypts are continuously and random replaced by other homologous cells, which can provide an advantage for the accumulation of oncogenic mutations through complex stem-cell dynamics [44][45]. Poulsom et al. [46] demonstrated that oncogenic mutations accumulated in stem cells may trigger the rapid development of aggressive sub-clones in colon adenomas. Accumulating evidence suggests that the cancer stem cells can compete with normal stem cells during the early phases of tumor growth, and this competition then continues among neoplastic stem cells during cancer progression, which would be promoted genetic mutations as well as environmental pressures (including radiotherapy and chemotherapy) [47]. We analyzed the mutational landscape in different mRNAsi groups of colon and rectal cancer. The results revealed a higher proportion of APC and TP53 mutant cells in the high-mRNAsi group than in the low-mRNAsi group of colon cancer patients. Previous research showed that mutations of TP53 can shut down its tumor suppressor function, promoting the self-renewal, reprogramming and differentiation of cancer stem cells [48,49]. Mutant TP53 can inhibit cancer development and promote the acquisition of stem-cell like phenotypes and features to augment the invasion, metastasis and cell death resistance of cancer cells [50].
Analysis of tumor in ltration by different immune cells populations in the high-and low-mRNAsi groups indicated increased in ltration by CD8 + and CD4 + T cells in the high-mRNAsi group of colon cancer.
Cancer in ltration by CD8 + T cells may predict higher sensitivity to immunotherapy and better prognosis. However, CD8 + T cells depend on several transcriptional networks to enhance their maintenance and terminal differentiation, as well as the production of stem cell-like central memory cells [51,52]. A recent study revealed stem-like CD8 + T cell populations that are able to proliferate and produce high levels of checkpoint molecules under stimulation, with the ability to clonally expand to functional effector T cells and self-renew [53][54][55]. Tumors with high levels of stemness may have higher levels of in ltration by immune cells such as CD8 + and CD4 + T cells, as well as having the potential to produce neoantigens that sensitize them to treatment with immune checkpoint inhibitors. Although TME and stemness were both identi ed as signi cant features of cancer in recent years, their covariation across cancers has not been systematically investigated. We found that the TMB is higher in the high-mRNAsi group than in the low-mRNAsi group of both colon and rectal cancer. The TMB can be de ned as the amount of nonsynonymous mutations in protein coding regions, which may promote the production of neoantigens by tumor cells [56]. Recent studies established the signi cance of TMB as a predictive biomarker for the success of treatment with immune checkpoint inhibitors (ICIs) such as anti-programmed cell death (PD)-1 and anti-programmed death-ligand 1 (PD-L1) therapy [57][58][59]. However, there is accumulating evidence that cancer patients with a high TMB may have worse survival than patients with low TMB in the absence of treatment with ICIs [60,61]. In this study, we found that tumors with a high mRNAsi may potentially be more easily recognized by the immune system, and therefore more sensitive to treatment with immune checkpoint inhibitors.
M6A, methylation at the N6 position of adenosine, is one of the most abundant and common internal modi cations found in long noncoding RNAs (lncRNAs) and mRNAs [62]. It can regulate RNA stability, alternative splicing, decay, abundance, translation and nuclear export [63]. The m6A binding proteins ("readers"), demethylases ("erasers") and methyltransferases ("writers") can remove or add m6A sites, and can therefore participate in regulating every aspect of RNA metabolism [64]. Our results show that a number of m6A RNA methylation regulators were expressed at higher levels in the high-mRNAsi groups of both colon and rectal cancer patients. YTHDF1, a "reader" of m6A methylation, plays a signi cant role in regulating the colon cancer stem cell-like features via the Wnt/β-catenin pathway and regulating the translation of selectively recognized mRNAs [65]. Furthermore, ALKBH5 was found to promote tumor formation and increase cancer cell stemness by regulating mRNA stability of pluripotency factors via m6A demethylation in hypoxic environments [66]. Additionally, FTO plays a signi cant role in immune evasion and self-renewal of cancer stem cells, and FTO upregulation can protect cancer cells from T cell cytotoxicity [67].
To further explore the prognostic value and biological mechanisms of potential therapeutic targets, we built a cancer stemness-related prognostic model to provide novel insights into treatment options for colon and rectal cancer. The prognostic index is based on the fractions of three genes identi ed among the cancer stemness-related key module genes, which need to be further investigated to uncover their potential molecular biological mechanisms. PARPBP, a signi cant inhibitor of homologous recombination (HR), has been demonstrated to be related to increased AFP levels, proliferation, differentiation, as well as poor prognosis of lung and hepatocellular carcinoma patients [68,69]. PARPBP was shown to inhibit the formation of DNA double-strand break repair protein (RAD51)-DNA structures, and thereby decrease the recombination of replicating chromosomes, promoting genomic instability and DNA damage hypersensitivity in pancreatic cancer [70]. PARPBP promotes tumor cell migration and invasion by enhancing mutagenic replication, extravasation, anoikis resistance and self-renewal in lung cancer [71].
KNSTRN, an important component of the mitotic spindle, was found to regulate chromosome segregation and anaphase onset during mitosis in cancer cells [72]. Furthermore, recent studies demonstrated that accumulation of KNSTRN mutations may be an early event in cancer development that accelerates tumor growth in cutaneous squamous cell carcinoma and melanoma [73,74]. KIF2C is an important regulator of chromosome segregation, bipolar spindle formation and microtubule depolymerization during mitosis, and it may be related with poor prognosis in non-small cell lung cancer [75]. Additionally, KIF2C was found to inhibit apoptosis and cellular senescence, promoting the proliferation and cell cycle progression of lung and hepatocellular cancer cells via a P53-depended pathway [76][77][78]. Moreover, KIF2C was shown to act as a tumor antigen that can elicit spontaneous and frequent CD41 T cell responses of the Th1-type in colorectal cancer, in a process that is in uenced by peripheral T regulatory cells [79].
Furthermore, we established a nomogram to better predict the survival of colon and rectal cancer patients and visualize the prediction results, which can further improve the compliance and therapeutic e cacy for patients. Additionally, we compared the prognostic accuracy of the TNM stage and nomogram model with DCA, which showed that the nomogram model consisting of stemness-related genes and clinical phenotype could have a higher prognostic ability than the TNM stage. The present results suggest that the model based on three stemness-related genes may have reliable prognostic accuracy in conjunction with the clinical phenotype. However, some of data in this retrospective analysis released in publicly available datasets may be limited, and the incomplete clinicopathological information may cause potential bias that would in uence the evaluation of the prognostic ability. Data from TCGA are from western countries, and all of the datasets consisted of a mutational landscape and transcriptome, which may hinder the clinical translation and generalization of the prognostic model. Consequently, we performed multiplex immuno uorescence staining in TMAs of colorectal cancer patients from China, and veri ed the prognostic ability of the nomogram model in an advanced colorectal cancer cohort. As shown in Fig. 2, stage colon cancer patients with higher mRNAsi values had a lower apparent survival probability than those with lower mRNAsi values, and the difference was statistically signi cant, which suggested a correlation between CSCs and advanced tumor. The evaluation of CSCs-related genes may be a prognostic marker for selecting patients at high-risk of metastasis from CRC, that are likely to bene t from treatment. Immuno uorescence results could be employed to evaluate the expression and distribution of the three key stemness-related genes identi ed by proteomics, with the potential to make this model more convenient and reliable in clinical practice.
Several limitations in this study should be kept in mind. The stemness-related genes and corresponding downstream signaling pathways need further functional validation, since lacking of in vitro or in vivo experiments greatly reduces the reliability of present research. Although the risk score and nomogram based on the multivariate Cox proportional hazards regression analysis were used widely to evaluate signi cant factors related with clinical prognosis, some machine learning algorithms might have a better prognostic ability. Prominent examples of such algorithms include Naïve Bayes, Random Forest and Decision Tree, which we will test in our next stage of research. Many molecular mechanisms contributing to cancer stemness remain to be explored in the future, and will help elucidate the development of colorectal cancer.

Conclusions
Taken together, our study highlights a robust correlation between the level of cancer stemness and traits related to tumor heterogeneity, including the immune microenvironment, TMB, and the expression of m6A RNA methylation regulatory factors in colorectal cancer cells. The prognostic signature based on mRNAsi may contribute to personalized prognosis of clinical outcomes in colorectal cancer and act as a potential prognostic biomarker for responses to differentiation therapies in clinical practice. The proposed stemness-related genetic model could provide great assistance in formulating e cient therapeutic strategies for the personalized management of colorectal cancer.