Construction and validation of a novel eight-gene risk score to predict the progression and prognosis of bladder cancer

Background: The progression from non-muscle-invasive bladder cancer (NMIBC) to muscle-invasive bladder cancer (MIBC) increases the risk of death. It is therefore important to nd new relevant molecular models that will allow for effective prediction of the progression of bladder cancer (BC). Methods: Using RNA-Sequence data of 49 BC patients in our center and weighted gene co-expression network analysis methods, a co-expression network of genes was developed from which three key modules with prognostic value were selected using Univariate Cox regression in The Cancer Genome Atlas Program (TCGA). Subsequently, an eight-gene risk score was established using the Least absolute shrinkage and selection operator Cox model. Results: A novel eight-gene risk score was closely related to the malignant clinical features of BC and could predict the prognosis of patients in the training dataset (TCGA) and two validation sets (GSE3289 and GSE13507). Further, a nomogram for predicting the overall survival of patients was designed. The nomogram showed good calibration with clinical value through decision curve analysis. Lastly, we found that the mRNA and protein expression level of the eight genes were found to be differentially expressed in cell lines and tissue. Conclusions: In our study, we established a novel eight-gene risk score which could predict the progression and prognoses of BC patients.


Introduction
Globally, bladder cancer (BC) is ranked 4 th as the most commonly diagnosed cancer in men [1]. Among BC patients, 75% are diagnosed with non-muscle invasive bladder cancer (NMIBC) whereas 25% are diagnosed with muscle-invasive bladder cancer (MIBC). However, >30% of patients with NMIBC develop recurrence and progresses to MIBC within 5 years after diagnosis [2]. Moreover, three-quarters of MIBC patients develop distant metastases with their long-term survival rate being 15% [3].
Cystoscopy is considered the gold-standard diagnostic tool for BC [4]. However, its ability to predict disease progression had been determined [5]. Although some models have been developed to predict recurrence and progression of BC, the accuracy of such models is less than 60% [6,7]. Although some biomarkers and risk models are available for predicting clinical outcome of patients with BC, their speci city and/or sensitivity is unsatisfactory [2,8]. For this reason, it is important to identify valuable models or biomarkers for evaluating the prognosis and monitor the progression of BC.
In this study, a new prognostic risk model based on eight differently expressed genes between MIBC and NMIBC is established which can predict the progression of BC. The performance of the risk model was veri ed using gene expression data from the Gene Expression Omnibus (GEO). A prognostic nomogram based on the risk score was built to predict the prognosis of BC. Moreover, real-time quantitative (RT-qPCR) and immunohistochemistry (IHC) were conducted to investigate the expression levels of eight genes in cell lines and BC samples. and April 2020, were included in the training set. The following criteria were used to enroll patients in the training set were: (1) histologically con rmed BC; (2) availability of freshly collected tissue from surgery; and (3) availability of clinical data. Informed consent was obtained and ethical approval was granted by the Ethics Committee of Shanghai Tenth People's Hospital. Total RNA was isolated from BC tissues using Trizol reagent (Invitrogen, Thermo Scienti c, Shanghai, China). The cDNA libraries were constructed using a customized protocol and the Illumina Hiseq 2500 sequencer (Sangon Biotech, Shanghai, China).

Public data processing
Gene expression data from GSE32894 (n=224) and GSE13507 (n=165) datasets were downloaded from GEO (http://www.ncbi.nlm.nih.gov/geo) as a series of matrix le format that was earlier processed by the original authors using the MAS 5.0 algorithm. The latest TCGA data containing clinical features and follow-up information was downloaded using GDC API. The clinical data of the training sets (STPH and TCGA datasets) and validation sets (GSE32894 and GSE13507) are shown in Table 1. The representative immunohistochemistry (IHC) of genes in both normal human bladder tissues and tumor tissues were performed using data from the Human Protein Atlas (HPA) (http://www.proteinatlas.org/).
WGCNA was performed as follows: (1) Outlier samples were omitted to increase the reliability of the weighted gene co-expression network, (2) An appropriate β value was selected using 0.85 as the degree of independence (R 2 ), (3) A weighted adjacent matrix was transformed into a topological overlap matrix (TOM) to determine the network connectivity of the genes, (4) Genes with similar expression pro les were classi ed into gene modules based on the average linkage hierarchical clustering following the TOMbased dissimilarity measure, (5) All genes were represented by the expression of module eigengenes (MEs) in a given module. Modules that were highly correlated with NMIBC/MIBC subtype (|r| ≥0.3) were selected for further analyses, and (6) The genes in the selected modules (brown, turquoise, and yellow) were extracted.
2.3.2 Construction of a novel eight-gene risk score Based on hub genes extracted from WGCNA, univariate Cox regression analysis was conducted to select genes associated with overall survival (OS) in the TCGA dataset [10]. Thereafter, the genes were subjected to Least absolute shrinkage and selection operator (LASSO) analysis with ten-fold cross-validation to construct a novel eight-genes risk score using "glmnet" package. Genes with minimal in uence on the patient's overall survival were removed whereas those with non-zero coe cients were selected. The risk score for each patient was calculated as follows: risk score = Coef 1 × expression of gene 1 + Coef 2 × expression of gene 2 + … + Coef m × expression of gene m. Coef denotes the corresponding coe cient of the gene. Using these gene expression data, the 'survminer' R package v4.6 was used to evaluate the optimal cutoff value of the risk score in each cohort. The established optimal cutoff value was used to divide patients into low-and high-risk groups accordingly.

Gene sets enrichment analysis (GSEA)
GSEA (http://www.broadinstitute.org/gsea/index.jsp) was conducted to compare pathways between patients in high-and low-risk groups. Signaling pathways with P<0.05 and a false discovery rate <0.25 were regarded as statistically signi cant.

Gene Set Cancer Analysis
Gene Set Cancer Analysis (GSCALite, http://bioinfo.life.hust.edu.cn/web/GSCALite/) provides a single nucleotide variation (SNV) module through the maftools [11]. Herein, we employed the SNV module to analyze and visualize the SNV of the eight genes in BC.

Eight-Gene Risk score validation
To evaluate the generalizability of the model, the training sets (STPH and TCGA dataset) and validation datasets (GSE3289 and GSE13507) were analyzed using similar formula and coe cients. The Kaplan-Meier curves of the validation datasets were separately drawn before analyzing the prognostic performance of the novel eight-genes risk score. Furthermore, the receiver operating characteristic (ROC) curve and Violin plot were used to verify the relationship between the risk score and clinical features of MIBC.

Development of the nomogram
A nomogram was developed from the factors that were signi cant in the nal multivariate Cox regression analyses. The ROC, concordance index (C-index) and calibration plots were conducted to assess the performance of the nomogram. The nomogram, C-index and calibration plots were generated using the 'Rms' R package. Furthermore, a decision curve analysis (DCA) was employed to ascertain the net bene ts of nomogram and other crucial prognostic factors.

Cell culture
The human normal bladder epithelial cell line SV-HUC-1 and BC cell line T24, UMUC3, J82, 5637, EJ obtained from Shanghai Cell Bank of the Chinese Academy of Sciences (Shanghai, China) were used for in vitro experiments. The SV-HUC-1 cell lines were cultured in F12K containing 10% FBS and 1% penicillin/streptomycin (P/S). BC cell lines were cultured in RPMI-1640 (Gibco; Thermo Fisher Scienti c, Inc.) supplemented with 10% FBS (Gibco; Thermo Fisher Scienti c, Inc.) and 1% P/S (Gibco; Thermo Fisher Scienti c, Inc.). Cell culture reagents were bought from Gibco and the cells were cultured at 37˚C and 5% CO 2 .

RT-qPCR
TRIzol reagent (Invitrogen, Thermo Scienti c, Shanghai, China) was used for extracting RNA from the cell lines. The cDNA Synthesis SuperMix Kit (Cat No.11141ES60; Yeasen, Shanghai. China) was used to synthesize cDNA which was subjected to qPCR using qPCR SYBR Green Master Mix KIT (Cat No. 11203ES03; Yeasen, Shanghai. China). The gene expression level was normalized to the expression of GAPDH mRNA. Primer sequences used are presented in Table SI.

Statistical analysis
The Wilcoxon rank-sum test was used to identify the differential expressed genes between NMIBC and MIBC. The Student's t-test or one-way ANOVA was used to evaluate the association of the risk score with clinical characteristics. Kaplan-Meier and log-rank tests were employed to compare the OS of BC patients in the high-and low-risk groups. The prediction performance of the risk score in NMIBC/MIBC subtype was determined from ROC curves. A heatmap and correlation matrix were generated using 'Pheatmap' R package v1.0.12 and 'corrplot' R package v0.84, respectively. SPSS 22.0 (SPSS, Armonk, NY, USA), R v3.6.1 (https://www.r-project.org/) and Graphpad Prism V7 (GraphPad Software, Inc.) were used for data analysis. A two-sided P value<0.05 was considered signi cant.

Data acquisition and DEGs selection
A ow diagram of the study is shown Figure 1. A total of 49 BC patients (35 NMIBC and 14 MIBC) from STPH were enrolled for the study. From our ndings, 2725 differentially expressed genes were identi ed between NMIBC and MIBC.

WGCNA analysis based on DEGs between NMIBC and MIBC
After the hierarchical clustering analysis, one sample was removed as an outlier. A co-expression network was constructed using 48 BC samples with complete clinical T stage and NMIBC/MIBC subgroups ( Figure 2A). With the chosen power of β=13 (scale-free R 2 =0.85) as the soft-thresholding ( Figure 2B), seven modules were obtained ( Figure 2C). The highest module-trait association was found between three modules and the NMIBC/MIBC subgroups (brown, turquoise, and yellow) ( Figure 2D). Subsequently, 1351 genes from three key modules were selected for further analysis.

Construction and veri cation of the novel eight-genes risk score
Among the 1315 genes, 159 were statistically signi cant (P-values < 0.05) in the univariate Cox analysis, hence were selected for LASSO Cox regression analysis. The Lambda.1se penalty parameter was selected based on the ten-fold leave-one-out cross-validation in accordance with the minimum criteria. In this way, eight genes (CD96, PDCL3, IP6K2, TRIM38, U2AF1L4, DDB1, KCNJ15 and CTU1) with nonzero coe cients were obtained ( Figure 3A-C). The prognostic performance of the genes was assessed individually. Results showed that their prognostic values were statistically signi cant ( Figure  S1A). According to the aforementioned formula, risk scores for each patient were calculated. Kaplan-Meier analysis revealed that the high risk group showed signi cantly poorer OS than BC patients in the low risk group (P<0.001) ( Figure 3D). We further validated the prognostic value of the risk score in two validation sets. Interestingly, the risk scores still had the prognostic ability in GSE3289 (P=0.035) and GSE13507 (P=0.037) ( Figure 3E, F). Among the 411 BC samples with complete sequencing data in TCGA, mutations of the eight genes were only found in 29 independent samples in TCGA dataset. Notably, CD96 and DDB1 recorded the highest rates of mutations (28%) ( Figure S1B). GSEA analysis showed that high risk score was associatd with the 'Focal adhersion', 'pathway in cancer', 'WNT signal', 'bladder tumor', 'ECM receptor' and 'adherens junctions pathway' (Figure S2).  Figure 4A). A dot plot based on the survival status of TCGA set indicated that patients with high-risk scores had a higher mortality rate than those with low-risk scores ( Figure 4B). The ROC revealed that the risk score could effectively differentiate NMIBC from MIBC with an AUC of 0.903 ( Figure 4C). The distribution of risk scores in training and validation sets are presented in detail in Figure 4D. It can be observed that MIBC has higher score than NMIBC (all P<0.05).

Construction and calibration of the nomogram
The univariate Cox analyses revealed that T stage, histological grade, age and risk score were signi cantly related to OS in the TCGA dataset ( Figure 5A). After multivariate Cox analysis, risk score, age, T stage, and pathological stage retained their prognostic value. A nomogram was constructed for predicting the 1-, 3-and 5-year OS based on the signi cant factors in multivariate Cox analyses ( Figure  5B). The AUC values of ROC curves were 0.764 (1-year ROC), 0.796 (3-year ROC) and 0.805 (5-year ROC), whereas the C-index of the nomogram was 0.75±0.02 (mean±SD) for OS ( Figure 5C). Besides, calibration plots proved that the prediction results of the nomogram were largely in agreement with the actual 1-, 3-and 5-year OS ( Figure 5D). Furthermore, compared with clinical features, the nomogram achieved the best clnical net bene t via DCA plot ( Figure 5E).

Validation of the novel eight-genes score through in vitro experiments and public database
The mRNA expression pattern of the eight genes in ve human BC cell lines (T24, UMUC3, 5637, J82 and EJ) and one human normal bladder epithelial cell line (SV-HUC-1) was determined using RT-qPCR. Results showed that ve of eight genes (CTU1, IP6K2, KCNJ15, PDCL3 and U2AF1L4) were signi cantly different in transcriptional levels between BC and normal bladder cells ( Figure 6). The expression level of IP6K2 and KCNJ15 was lower in all cancer cell lines compared with normal cell lines. The expression of CTU1, PDCL3 and U2AF1L4 was varied in different BC cell lines. IHC images of normal and tumor tissues from HPA revealed that DDB1, CD96, IP6K2, TRIM38 and KCNJ15 had higher IHC staining intensity in normal tissues compared with tumor tissues. PDCL3 and U2AF1L4 showed medium to high in of IHC staining intensity in normal and tumor tissues. Neither normal tissues nor tumor tissues expressed the CUT1 protein ( Figure 6B).

Discussion
Recent developments in the eld of high-throughput sequencing technologies have resulted in the generation of many bioinformatics methods to nd biomarkers related to the progression of malignant tumors [12,13]. So far, several prediction models have been developed for BC [14][15][16]. However, these models are based on public datasets without validation set or data from their clinical centers, which limits the generalizability and reproducibility of the models. Through an in-depth exploration of the RNAsequence data from our center and TCGA, we constructed an eight-genes risk score using WGCNA and LASSO regression analyses The results showed that the eight-genes risk score had novel performance in prognostic strati cation in training (TCGA) and validation (GSE3289 and GSE13507) data sets, in which BC patients in the high risk group showed signi cantly poorer OS than BC patients in the low risk group. Taken together, the risk score presented an oncogenic function in malignant progression and was considered as a useful prognostic tool for prognostic prediction in BC. GSEA analysis con rmed that high risk scores were highly enriched in 'WNT signal' and 'bladder tumor' pathways, suggesting the important role of the risk score in tumorigenesis and malignant progression. The GSCA results further indicated that these genes rarely mutated, implying that potential dysfunction in these genes may not be as a result of genetic mutations, but rather due to aberrant alterations at the transcriptional level [17]. The genes in the risk score were reported to play important roles in tumor in previous studies. For example, high expression of CD96 in BC tissue, that was discovered in 1992, indicates high immune cells in ltration and hence killing of cancer cells [18,19]. Damagespeci c DNA binding protein 1 (DDB1), as the only oncogene in our model, is a multifunctional protein.
Previous studies showed that DDB1 interacts with several chaperone that regulate DNA repair mechanisms, cell cycle and gene transcription [20]. DDB1 can directly interact with Cullin4B (CUL4B) to form the CUL4B-DDB1 complex which promotes ubiquitination and degradation of a variety of tumor suppressor factors hence enhance progression of osteosarcoma [21]. The tripartite motif protein 38 (TRIM38), a member of the TRIM family, was found to confer protection against various cellular processes such as cell differentiation, proliferation, apoptosis, and antiviral defense. On the other hand, TRIM38 cause abnormal activation of NF-κB pathway thereby inhibiting cancer [22,23].
We then integrated the risk score and other independent prognostic factors to establish a nomogram for clinical practice. A nomogram that comprised the age, T stage, pathological stage and risk score was drawn to predict the OS. ROC showed that the nomogram could accurately predict the 1-, 3-and 5-year OS of BC. Moreover, calibration plots and C-index con rmed the prognostic signi cance and predictive superiority of the nomogram. DCA revealed that compared with clinical features, the nomogram had the optimal enhanced clinical net bene t with wider threshold probabilities. All of these ndings indicated that the risk score-based nomogram may be a reliable evaluation tool to perform mortality risk identi cation in BC patients.
The results of IHC assay in HPA and RT-qPCR revealed that hub genes were differentially expressed between normal and BC tissues or cells. Nevertheless, the mRNA expression level (RT-qPCR in cell lines) and protein intensity (IHC in tissues) may not match completely due to the following reasons: (a) the tumor microenvironment between tumor tissues and cell lines was different. The occurrence and development of tumors is as a result of complex interactions [24,25]. (b) The regulation of mRNA and protein expression is complex, and there is no clear relationship between mRNA and protein expression levels.
Some limitations exist in the present study. The sample size was small, and more samples with prognostic information are needed to validate our ndings. Further researches are needed to explore the mechanisms of the eight genes in tumor progression.

Conclusion
This study demonstrates that a novel eight-genes risk score can reliably predict the malignant progression and prognosis of BC patients. The constructed nomogram based on the eight-genes risk score could accurately predict the survival of BC patients. Dysregulated expression of the eight genes was validated in BC samples and cell lines by IHC and RT-qPCR, respectively. Further functional studies and mechanistic experiments should be conducted to better understand the clinical value of the eight hub genes in BC.

Ethics declarations
The present study was approved by the Ethics Committee of Shanghai Tenth People's Hospital (Shanghai, China). The research was conducted in accordance with the World Medical Association Declaration of Helsinki. All patients provided written informed consent.