A stemness-related seven-gene signature for predicting prognosis and treatment response in muscle invasive bladder cancer

Background: Muscle invasive bladder cancer (MIBC) is an aggressive cancer characterized by therapeutic resistance and poor prognosis, which are possibly due to the existence of cancer stem cells (CSCs). In this study, we aimed to characterize the expression of cancer stemness-related genes and develop a multi-gene risk signature to predict clinical outcome and treatment response in MIBC. Methods: The mRNA expression data and clinical data of MIBC patients were collected from The Cancer Genome Atlas (TCGA) database and Gene Expression Omnibus (GEO) database, which included the TCGA training cohort (n = 333) and three GEO validation cohorts, GSE13507 (n = 165), GSE32548 (n = 127), and GSE48075 (n = 72). A list of 166 stemness-related genes were obtained from the Cancer Single Cell State Atlas (CancerSEA) database and prognostic genes for overall survival (OS) were identified by univariate Cox analysis. Then, the least absolute shrinkage and selection operator (LASSO) regression and stepwise multivariate Cox regression were performed to generate a multi-gene risk signature. Kaplan-Meier curve, time-dependent receiver operating characteristic (ROC) curve, multivariate analysis, and stratification analysis were used to evaluate the performance of the gene signature. We also explored the relationship between risk score and response to chemotherapy and radiotherapy in MIBC patients. Moreover, independent prognostic factors for OS were combined together into a nomogram to improve predictive performance. Results: Firstly, a total of 25 prognostic genes were identified. Then, a seven-gene risk signature (EGFR, FOXA2, HES1, MME, RBM6, SMOC2, and TFRC) was constructed and it could robustly classify MIBC patients into high -risk and low-risk groups with different clinical outcomes. ROC curves showed that the seven-gene signature had a robust predictive accuracy in four cohorts. Besides, high risk score was significantly associated with advanced clinical stage and treatment failure. As an independent risk factor for OS, the stemness-related seven-gene signature could achieve better prognostic accuracy when integrated with clinical factors. Conclusions: We developed and validated a robust stemness-related gene signature

4 levels or genetic levels. Yang et al. conducted single-cell RNAseq on 59 cancer cells from three BC specimens and confirmed clonal homogeneity of BCSCs and identified 21 key altered genes in BCSCs [12]. The Cancer Single Cell State Atlas (CancerSEA) database, which store and reanalyze the published single cell RNA-seq data of cancers, has been established so as to explore the functional heterogeneity of cancer cells at single-cell resolution [13]. Up to date, this database has summarized critical genes related with 14 functional states (e.g. metastasis, stemness, hypoxia) by reanalyzing the sequencing data of 41,900 cancer cells from 25 cancer types.
In the present study, we aimed to utilize cancer stemness-related genes from CancerSEA database to develop a prognostic multi-gene risk signature in MIBC. With bioinformatics methods and public transcriptome data, we established a novel seven-gene signature, which is superior to TNM staging method in prognosis prediction. As an independent risk factor for overall survival, the seven-gene signature could robustly predict prognosis in MIBC patients. By emphasizing the genetic biomarkers of cancer stemness, our findings might provide new insights into cancer stem cells in bladder cancer.

Development of stemness-related gene signature
To develop a multi-gene risk signature, we firstly conducted univariate Cox survival analysis to identify the prognostic stemness-related genes for the overall survival (OS) of MIBC patients in training cohort. Then, we used two steps of regression analysis to select out the optimal gene signature, which were the least absolute shrinkage and selector operation (LASSO) regression and stepwise multivariate Cox regression. In LASSO regression, the penalty parameter "lambda" was adjusted by 100000 times cross validations [14]. Then a risk model was generated using a linear combination of expression values and Cox regression coefficients of signature genes. Patients were categorized into high-and low-risk group by the median risk score. The survival differences between two groups was plotted in Kaplan-Meier survival curve and compared by log-rank test. We also compared prognostic values of stemness-related gene signature and clinical factors using timedependent receiver operating characteristic (ROC) curve and area under the curve (AUC).

Independence of stemness-related gene signature
To assess whether the stemness-related gene signature was independently associated with OS, univariate as well as further multivariate Cox proportional hazards regression analyses were performed. Age was divided into two subgroups (<= 65 years old and > 65 years old). The risk score of stemness-related gene signature was used as a continuous variable while clinical parameters including age, gender, and clinical stage were used as categorical variables. We further performed stratification analysis to test the prognostic significance of stemness-related gene signature in different clinical subgroups.

Validation of stemness-related gene signature
To determine the robustness of the stemness-related gene signature, we tested it in three GEO validation cohorts. Risk scores of patients in these cohorts were also generated using a linear combination of expression values and regression coefficients of signature genes. We used the median risk score in training cohort as a cut-off value for classifying low-risk and high-risk group. The performance of the gene signature in validation cohorts were also evaluated by Kaplan-Meier curve and Time-dependent ROC curve.

Association between stemness-related risk signature and clinical features
We further analyzed the correlation between risk score and clinical features including age, gender, overall stage, T stage, and N stage. In TCGA training cohort, treatment response to primary chemotherapy and radiotherapy of 185 patients were recorded, which included 121 patients with complete response (CR), 12 patients with partial response (PR), 20 patients with stable disease (SD), and 32 patients with progression disease (PD). Therefore, we also investigated whether the gene 7 signature-based risk score could predict treatment response in MIBC patients.

Comparation between stemness-related gene signature and mRNA stemness indices
In 2018, Malta et al. put forward a novel mRNA expression-based stemness indices (mRNAsi), which ranges from 0 to 1, to measure cancer stemness and tumor purity [15]. Therefore, we compared our stemness-related gene signature with mRNAsi in TCGA training cohort.
The prognostic value of mRNAsi was evaluated by Kaplan-Meier survival curve and time-dependent ROC curve. We also explored the correlation between the mRNAsi and our stemness-related gene signature.

Integrated nomogram by combining the stemness-related gene signature with clinical factors
In order to improve the predicative accuracy for OS, a prognostic nomogram was established basing on the independent prognostic factors. A stepwise multivariate Cox regression method was used to determine the weight score and coefficients of each prognostic factor. The calibration curve was used to evaluate the differences in 1-, 3-, 5-, 10-years survival possibilities between nomogram prediction and the actual observation. The time-dependent ROC curve and Harrell's concordance index (C-index) were utilized to assess the predicative accuracy of nomogram.

Gene Set Enrichment Analysis
In TCGA training cohort, Gene Set Enrichment Analysis (GSEA) was performed using GSEA 4.0.1 software and referring to the Molecular Signatures Database (MSigDB) so as to identify the significantly enriched biological processes in high-risk and low-risk groups. Gene sets with a normalized p value < 0.05 and a false discovery rate (FDR) q value < 0.05 were considered to be significantly enriched. The most enriched biological processes were selected out by ranking of the 8 normalized enrichment score (NES).

Statistical analysis
All statistical analyses in our study were performed using R language, version 3.6.0 and Perl language, version 5.28.2. R packages used in our study are as follows: (1) "limma" R package (for normalization of TCGA data). (2) "survival" and "survminer" R package (for univariate Cox analysis, drawing Kaplan-Meier curve, and calculating Hazard ratio and C-index). (3) "glmnet" R package (for LASSO Cox regression analysis). (3) "timeROC" R package (for plotting time-dependent ROC curve).
(4) "beeswarm" R package (for clinical correlation analysis). Survival analyses were performed using the Kaplan-Meier method and compared using a log-rank test by the 'survival' package. Differences of risk scores among different variables were tested using two samples t-test. A p value < 0.05 was considered statistically significant in our study.
In training cohort, 333 patients were classified into high-risk group (n = 167) and low-risk group (n = 166) by the median risk score 1.051. Kaplan-Meier curve exhibited a significant difference in OS between the two groups as high-risk group of patients had poorer clinical outcome than low-risk group (p < 0.001, Fig. 1c). The time-dependent ROC curve suggested that the AUCs of the stemness-related signature were 0.673, 0.732, 0.760 at 1-year, 3-years, and 5-years respectively (Fig. 1d). In addition, the seven-gene signature had better sensitivity and specificity than clinical factors in predicting 1- year, 3-years, and 5-years OS in MIBC patients (Additional file 3: Figure S1). clinical subgroups and the results were significant (Fig. 2). Taken together, the novel gene signature could independently predict OS in MIBC patients. HR hazard ratio, CL: confidence interval.

Validation of stemness-related gene signature
We further verified the performance of the seven-gene signature in three validation cohorts. Patients in GSE13507 (high risk = 69, low risk = 96) and GSE32548 (high risk = 62, low risk = 65) were categorized by the cut-off value 1.051 in training cohort. While in GSE48075, only 6 patients were low-risk and the remaining 66 patients were all high-risk according to the cut-off value 1.051.

Association between stemness-related gene signature and clinical features
The relationship between the stemness-related gene signature and clinical factors was analyzed. As expected, the risk score was significantly elevated along with cancer progression, including advanced T stage (p < 0.001, Fig. 4a), N stage (p = 0.018, Fig. 4b), and overall stage (p < 0.001, Fig. 4c).
However, no difference in risk score was observed between different subgroups of age and gender.
Moreover, the seven-gene risk signature was closely linked to response to primary chemotherapy and radiotherapy as the risk score was sequentially increasing in patients with CR, PR, SD, and PD (p = 0.002, Fig. 4d). These findings suggested that high risk score based on seven-gene signature was generally associated with unfavorable clinical manifestations.

Comparation between stemness-related gene signature and mRNA stemness indices
The mRNAsi of TCGA bladder cancer patients were obtained from Malta et al.'s study. We matched the samples between clinical data and mRNAsi by sample id number. As a result, a total of 401 patients were provided with mRNAsi. We used the median mRNAsi to divide 401 patients into two groups. Kaplan-Meier survival curve showed that patients with higher mRNAsi had significantly poor long-term survival than patients with lower mRNAsi (p < 0.001, Fig. 4e). The time-dependent ROC curve demonstrated that the AUCs of mRNAsi were 0.559, 0.548, 0.597 at 1-year, 3-years, and 5-years respectively (Fig. 4f). Furthermore, we also identified a significant correlation between mRNAsi and our seven-gene signature. The high-risk group patients identified by our seven-gene signature had significantly higher mRNAsi (Fig. 4g).

Integrated nomogram by combining the stemness-related gene signature with clinical factors
We subsequently developed a nomogram in training cohort to achieve better predictive performance (Fig. 5a). Three independent risk factors for OS including age (HR = 1.969, p < 0.001), overall stage (HR = 1.582, p = 0.046), and the seven-gene risk signature (HR = 1.925, p < 0.001) were included in the nomogram (Table 3). No collinearity was observed between three risk factors. The AUCs of the nomogram were 0.761, 0.785, 0.776 at 1-, 3-, 5-years respectively (Fig. 5b). The calibration plot exhibited high degree of agreement between predicted and actual survival probabilities (Fig. 5c-e).

Gene Set Enrichment Analysis
Using GSEA, we explored the signaling pathways which were associated with our gene signature-   [18]. However, genetic biomarkers associated with cancer stemness features were seldom studied in bladder cancer.
In the present study, we developed and validated a novel stemness-related seven-gene signature for predicting prognosis and treatment response in MIBC patients, which is different from previously reported gene signatures. Firstly, the 166 stemness-related genes in CancerSEA database were derived from single cell RNAseq data. Therefore, the resource of candidate genes is novel and reliable. Secondly, we used one TCGA patient cohorts to construct the gene signature and three GEO clinical outcome. Taken together, the novel seven-gene risk signature is distinct, robust, and reliable.
Among seven genes, several genes have been uncovered to maintain cancer stem cells. EGFR, which encodes the epidermal growth factor receptor (EGFR) family of receptors, is frequently overexpressed or mutated in solid tumors and acts as driver gene and therapeutic target. EGFR plays an important role in bladder cell dedifferentiation as it maintains a less differentiated state of basal layer urothelial cells [19]. In bladder cancer, EGFR is widely reported to be overexpressed, which is more evident in MIBC than in NMIBC [20,21]. As members of EGFR families, ERBB2 and RBB2 expression correlates with tumor recurrence and metastasis in MIBC [21]. The roles of EGFR signaling pathway in CSCs have been implicated in breast cancer [22], and glioblastoma [23]. However, the roles of EGFR in BCSCs are still unclear. FOXA2, a member of fork head box (FOX) superfamily, is considered as a urothelial differentiation marker in human bladder development as co-expression of FOXA2 and uroplakins (UPK) initiate the transformation of epithelial cells into a urothelial lineage. Expression of FOXA2 also activates the expression of bladder-specific genes, which are required for developing urothelium to give rise to other cell types [24]. In ovarian cancer, FOXA2 is responsible for maintaining CSCs properties, and knockdown of FOXA2 impairs the self-renewal ability of ovarian CSCs [25]. HES1 encodes the Hes1 protein which is an important molecular component for maintaining the undifferentiated state of normal stem cells and CSCs due to its transcriptional repression ability [26,27]. Hes1 and its upstream Notch signaling pathway have been implicated to be associated with CSCs in colon cancer [28], pancreatic cancer [29], and breast cancer [30]. In oral cancer, loss of Hes1 decreases the tumor sphere-forming and self-renewal capacities of cancer cells [31]. In BC, absences of Notch-Hes1 pathway promotes epithelial-mesenchymal transition (EMT) and tumor cells with low levels of Hes1 present more mesenchymal features and more aggressive phenotype [32]. MME, also known as CD10, encodes a cell surface zinc-dependent metallopeptidase that helps bladder cancer cells invade the adjacent matrix [33]. High CD10 expression in tumor cells was significantly associated with advanced tumor stage and poorer prognosis in BC [34,35]. CD10 is reported to maintain CSCs in breast cancer and head and neck squamous cell carcinoma (HNSC) [36,37]. In breast cancer, expression of CD10 is involved in progression and dissemination and correlates with the expression of cancer stem cell markers such as CD44 and ALDH1 significantly [36]. In HNSC, upregulated expression of CD10 is related with resistance to cisplatin, fluorouracil and radiation and an increased ability to form spheres and higher level of expression of the CSC marker OCT3/4 [37].
RBM6 (RNA binding motif protein 6) is an RNA-binding protein involved in nucleic acid binding and RNA binding. RBM6 mRNA expression level has shown to be upregulated in breast cancer [38], and pancreatic cancer [39] when compared with normal tissues. However, the role of RBBM6 in CSCs and bladder cancer is seldom reported. Another member of RBM family, RBM5, is an RNA-binding protein that modulates cell cycle and apoptosis and acts as tumor suppressor in cancers [40]. In laryngeal carcinoma, expression of RBM6 is downregulated in tumor tissues and cancer cell lines. RBM6 could repress tumor growth and progression, which is possibly mediated by reducing the expression of EGFR, ERK and downregulation of EGFR-associated signaling pathways [41]. SMOC2 encodes the secreted modular calcium binding protein-2, which is a matricellular protein [42]. Overexpression of SMOC2 could significantly arrest cell at the G0/G1 phase so as to repress cell growth [43]. In HCC, downregulated expression SMOC2 was significantly associated with aggressive biological behaviors including advanced TNM stage and distant metastasis and reduced overall survival and disease-free survival [43]. TFRC (transferrin receptor), also known as CD71, is an important regulator for iron metabolism in cancers. CD71 is frequently upregulated and indicate aggressive phenotype as well as poor clinical outcome in esophageal squamous cell carcinoma [44] and breast cancer [45]. Expression of CD71 was closely related to the staining of proliferative marker Ki-67 and CD71-suppressed tumor cells exhibit cell growth inhibition and cell cycle arrest at S phase [44]. CD71 is also regarded as a phenotypic marker of cancer stem-like cells in cervical cancer [46].
GSEA analysis uncovered that the most enriched biological processes in high risk group were dominantly associated with extracellular adhesion, development of soft tissues, and angiogenesis, indicating that high risk group of patients might possess more prominent stemness characteristics than low risk group. CSCs are believed to reside in niches, which constitute the tumor environment and maintain stem-like properties in cancer cells [47]. The extracellular matrix is a major component of the tumor microenvironment and plays crucial roles in cell differentiation and determination of the balance between self-renewal and differentiation in stem cells [48]. The presence of CSCs are also associated with angiogenesis and tumor vasculature as CSCs could transdifferentiate into endothelial cells [49] or produce several cytokines and chemokines [50] to regulate the tumor vasculature and from their own vascular niches.
Despite the promising results, limitations of our study should be noted. Firstly, although the novel seven-gene signature, which was built in training cohort, could be confirmed in three validation cohorts. clinical features of patients in three validation cohorts were not intact and we were therefore unable to verify the independence of novel gene signature in three validation cohorts. Secondly, our seven-gene signature could effectively predicate treatment resistance. However, the detailed information of chemotherapy and radiotherapy were not recorded. Last but not least, our study was based on retrospective view of transcriptome data and clinical data. Prospective clinical trials with large cohorts are required before application of the seven gene signature into diagnosis and treatment. Biological and molecular experiments are also needed to explain the relationship between these stemness-related genes and cancer stem cell biology in bladder cancer.

Conclusion
In conclusion, we developed and validated a novel and robust seven-gene risk signature related with cancer stemness that could predict prognosis and treatment response in MIBC patients. Our findings could contribute to better understanding of CSCs and identification of novel therapeutic targets in bladder cancer. Figure 2 Kaplan-Meier analysis of seven-gene risk signature when bladder cancer patients were stratified by age, gender, and clinical stage. The calibration curve of the nomogram in predicting 1-, 3-, 5-, 10-years survival possibilities.

Declarations
28 Figure 6 Enrichment plots of top ten altered biological processes in high risk group patients.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.