2.1ǀ Clinical samples and RNA sequencing
A total of 49 BC patients (35 NMIBC and 14 MIBC) in Shanghai Tenth People’s Hospital (STPH) who underwent either transurethral resection of bladder tumor or radical cystectomy between November 2019 and April 2020, were included in the training set. The following criteria were used to enroll patients in the training set were: (1) histologically confirmed BC; (2) availability of freshly collected tissue from surgery; and (3) availability of clinical data. Informed consent was obtained and ethical approval was granted by the Ethics Committee of Shanghai Tenth People’s Hospital. Total RNA was isolated from BC tissues using Trizol reagent (Invitrogen, Thermo Scientific, Shanghai, China). The cDNA libraries were constructed using a customized protocol and the Illumina Hiseq 2500 sequencer (Sangon Biotech, Shanghai, China).
2.2ǀPublic data processing
Gene expression data from GSE32894 (n=224) and GSE13507 (n=165) datasets were downloaded from GEO (http://www.ncbi.nlm.nih.gov/geo) as a series of matrix file format that was earlier processed by the original authors using the MAS 5.0 algorithm. The latest TCGA data containing clinical features and follow-up information was downloaded using GDC API. The clinical data of the training sets (STPH and TCGA datasets) and validation sets (GSE32894 and GSE13507) are shown in Table 1. The representative immunohistochemistry (IHC) of genes in both normal human bladder tissues and tumor tissues were performed using data from the Human Protein Atlas (HPA) (http://www.proteinatlas.org/).
2.3ǀ Establishment of prediction models and bioinformatics analyses
2.3.1ǀWeighted gene co‑expression network analysis
The “limma” R package was adopted to screen for differentially expressed genes (DEGs) between 35 NMIBC and 14 MIBC samples in STPH. A weighted gene co-expression network analysis (WGCNA) was constructed  using the ‘WGCNA’’ R package v1.68 based on the DEGs (P-value<0.05 and |log2FC|≥0.5). WGCNA was performed as follows: (1) Outlier samples were omitted to increase the reliability of the weighted gene co-expression network, (2) An appropriate β value was selected using 0.85 as the degree of independence (R2), (3) A weighted adjacent matrix was transformed into a topological overlap matrix (TOM) to determine the network connectivity of the genes, (4) Genes with similar expression profiles were classified into gene modules based on the average linkage hierarchical clustering following the TOM-based dissimilarity measure, (5) All genes were represented by the expression of module eigengenes (MEs) in a given module. Modules that were highly correlated with NMIBC/MIBC subtype (|r| ≥0.3) were selected for further analyses, and (6) The genes in the selected modules (brown, turquoise, and yellow) were extracted.
2.3.2ǀConstruction of a novel eight-gene risk score
Based on hub genes extracted from WGCNA, univariate Cox regression analysis was conducted to select genes associated with overall survival (OS) in the TCGA dataset . Thereafter, the genes were subjected to Least absolute shrinkage and selection operator (LASSO) analysis with ten-fold cross-validation to construct a novel eight-genes risk score using “glmnet” package. Genes with minimal influence on the patient’s overall survival were removed whereas those with non-zero coefficients were selected. The risk score for each patient was calculated as follows: risk score = Coef1 × expression of gene1 + Coef 2 × expression of gene2 + … + Coefm × expression of gene m. Coef denotes the corresponding coefficient of the gene. Using these gene expression data, the ‘survminer’ R package v4.6 was used to evaluate the optimal cutoff value of the risk score in each cohort. The established optimal cutoff value was used to divide patients into low- and high-risk groups accordingly.
2.3.3ǀGene sets enrichment analysis (GSEA)
GSEA (http://www.broadinstitute.org/gsea/index.jsp) was conducted to compare pathways between patients in high- and low-risk groups. Signaling pathways with P<0.05 and a false discovery rate <0.25 were regarded as statistically significant.
2.3.4ǀGene Set Cancer Analysis
Gene Set Cancer Analysis (GSCALite, http://bioinfo.life.hust.edu.cn/web/GSCALite/) provides a single nucleotide variation (SNV) module through the maftools . Herein, we employed the SNV module to analyze and visualize the SNV of the eight genes in BC.
2.4ǀEight-Gene Risk score validation
To evaluate the generalizability of the model, the training sets (STPH and TCGA dataset) and validation datasets (GSE3289 and GSE13507) were analyzed using similar formula and coefficients. The Kaplan-Meier curves of the validation datasets were separately drawn before analyzing the prognostic performance of the novel eight-genes risk score. Furthermore, the receiver operating characteristic (ROC) curve and Violin plot were used to verify the relationship between the risk score and clinical features of MIBC.
2.5ǀDevelopment of the nomogram
A nomogram was developed from the factors that were significant in the final multivariate Cox regression analyses. The ROC, concordance index (C-index) and calibration plots were conducted to assess the performance of the nomogram. The nomogram, C-index and calibration plots were generated using the ‘Rms’ R package. Furthermore, a decision curve analysis (DCA) was employed to ascertain the net benefits of nomogram and other crucial prognostic factors.
The human normal bladder epithelial cell line SV-HUC-1 and BC cell line（T24, UMUC3, J82, 5637, EJ）obtained from Shanghai Cell Bank of the Chinese Academy of Sciences (Shanghai, China) were used for in vitro experiments. The SV-HUC-1 cell lines were cultured in F12K containing 10% FBS and 1% penicillin/streptomycin (P/S). BC cell lines were cultured in RPMI‑1640 (Gibco; Thermo Fisher Scientific, Inc.) supplemented with 10% FBS (Gibco; Thermo Fisher Scientific, Inc.) and 1% P/S (Gibco; Thermo Fisher Scientific, Inc.). Cell culture reagents were bought from Gibco and the cells were cultured at 37˚C and 5% CO2.
TRIzol reagent (Invitrogen, Thermo Scientific, Shanghai, China) was used for extracting RNA from the cell lines. The cDNA Synthesis SuperMix Kit (Cat No.11141ES60; Yeasen, Shanghai. China) was used to synthesize cDNA which was subjected to qPCR using qPCR SYBR Green Master Mix KIT (Cat No. 11203ES03; Yeasen, Shanghai. China). The gene expression level was normalized to the expression of GAPDH mRNA. Primer sequences used are presented in Table SI.
The Wilcoxon rank-sum test was used to identify the differential expressed genes between NMIBC and MIBC. The Student’s t-test or one-way ANOVA was used to evaluate the association of the risk score with clinical characteristics. Kaplan-Meier and log-rank tests were employed to compare the OS of BC patients in the high- and low-risk groups. The prediction performance of the risk score in NMIBC/MIBC subtype was determined from ROC curves. A heatmap and correlation matrix were generated using ‘Pheatmap’ R package v1.0.12 and ‘corrplot’ R package v0.84, respectively. SPSS 22.0 (SPSS, Armonk, NY, USA), R v3.6.1 (https://www.r-project.org/) and Graphpad Prism V7 (GraphPad Software, Inc.) were used for data analysis. A two-sided P value<0.05 was considered significant.