Data processing and research design
We obtained the raw RNA sequencing and clinical data of TCGA-BLCA (n=412), GSE13507 (n=165), GSE32548 (n=146) and GSE32894 (n=308). R package "edge" was used to standardize the raw RNA expression matrix of the four cohorts. We utilized data only from patients associated with RNA sequencing data, survival time, survival status, and primary tumor for further analysis. The basic clinical information of the remaining patients is summarized in Table 1, the sample sizes of these four cohorts are all greater than 100. The grade of bladder cancer is closely related to recurrence and invasive behavior. Two grading methods were used in these four cohorts. TCGA-BLCA and GSE13507 used the WHO grading standard of 2004, which was divided into PUNLMP (Papillary urothelial neoplasms of low malignant potential), low grade, and high grade. GSE32548 and GSE32894 used the WHO grading standard of 1999, which was divided into grade 1 (G1), grade 2 (G2), and grade 3 (G3).
The genes related to the prognosis of bladder cancer were screened from each dataset, and the identified genes were combined to obtain a set of genes that were common to the four datasets. We used some of these genes to construct a prognostic model in the TCGA-BLCA cohort, and then tested the model with five cohorts for verification. We also used the co-expression network, ESTIMATE, and CIBERSORT algorithm to explore the relationship between the model and tumor microenvironment. The research process is shown in Fig. 1.
Construction of prognostic model
Univariate Cox proportional hazard analysis was carried out in four cohorts, and genes affecting prognosis were obtained. There were 3301 genes in TCGA-BLCA, 440 genes in GSE13507, 2404 genes in GSE32548, and 2768 genes in GSE32894 that met the criteria of hazard ratio (HR > 1 and p < 0.05). Combining the four datasets allowed identification of twenty-four risk genes as shown in the Venn diagram (Fig. 1). There were 4743 genes in TCGA-BLCA, 414 genes in GSE13507, 1717 genes in GSE32548, and 3248 genes in GSE32894 that met the criteria of (HR < 1 and p < 0.05). Combining the four datasets allowed identification of ten protective genes (Fig. 1). Together, these 24 risk genes and 10 protective genes constitute the prognostic gene set. Because of the large sample size of TCGA-BLCA, we used this cohort to build the prognostic model. First, genes with univariate Cox p-values less than 0.01 in TCGA-BLCA were selected. The 24 genes were analyzed by Lasso regression analysis (Fig. 2a), when the number of genes in the model was 11, the deviance was the smallest (Fig. 2b). According to the Lambda value, the corresponding regression coefficients (coef) of the selected 11 genes could be determined. The prognostic model could then be constructed by using the corresponding coef of the 11 genes. The model has good predictive performance and includes a small number of genes with the Lasso regression analysis. Finally, we successfully constructed a prognostic module: Risk score = SERPINE2*0.02 + PRR11*0.13 + FABP6*(-0.000318) + C16orf74*(-0.0564) + DSEL*0.107 + DNM1*0.0142 + COMP*0.0223 + TNK1*(-0.0972) + ELOVL4*0.00152 + RTKN*0.126 + MAPK12*0.0304. The basic information and coef values of the 11 genes are listed in (Additional fle 1: Table S1). The results of univariate regression analysis of these 11 genes in 4 cohorts are shown in (Additional fle 2: Table S2).
Kaplan-Meier analysis of 11 genes
Eleven genes were taken Kaplan-Meier analysis in 4 cohorts. Using the heat map to show the results of the study (Fig. 2c), except for DSEL in GSE32894 and C16orf74 in GSE32548, the other analyses were statistically significant (p < 0.05). SERPINE2, RTKN, PRR11, MAPK12, ELOVL4, DSEL, DNM1, and COMP showed that the prognosis of patients with high expression was worse, and the analysis of ELOVL4 in TCGA-BLCA was taken as an example (p < 0.001, Fig. 2d). TNK1, FABP6, and C16orf74 showed that the prognosis of the low expression group was worse, and the analysis of FABP6 in TCGA-BLCA was taken as an example (p < 0.001, Fig. 2e).
Verification of the prognostic model
The prognostic model was used to calculate the risk scores of each patient in the training set (TCGA-BLCA) and three test sets (GSE13507, GSE32548, and GSE32894). The function "res.cat" of R package "survminer" was used to find the best cut-off value for Kaplan-Meier analysis. We identified the best cut-off value with a risk score of 2.40 for TCGA-BLCA. Using this method, the cut-off values of GSE13507 / GSE32548 / GSE32894 were 2.56 / 1.96 / 1.92. To observe the predictive ability of risk scores in their respective cohorts, we divided the samples into high-risk and low-risk groups according to the cut-off value of each cohort, and constructed the Kaplan-Meier curves. The results showed that the prognosis of patients with high-risk was significantly worse than that of patients with low-risk in the four cohorts (p<0.001, Fig. 3a-d). The Receiver Operating Characteristic (ROC) curves of the four cohorts were drawn: the 1 / 3 / 5 year Area Under the Curve (AUC) values for the TCGA-BLCA cohort were 0.686, 0.665, and 0.666, respectively (Fig. 3e); those for the GSE13507 cohort were 0.800, 0.742, and 0.697, respectively (Fig. 3f); those for the GSE32548 cohort were 0.826, 0.792, and 0.763, respectively (Fig. 3g); and those for the GSE32894 group were 0.781, 0.831 and 0.839 (Fig. 3h). Additional fle 3: Figure S1 shows the risk score distribution, gene expression values, and survival status of patients in both the high-risk group and the low-risk group.
The clinical factors and risk scores of the four cohorts were analyzed by univariate Cox and multivariate Cox regression analysis (Table 2). The results of univariate analysis showed that T stage was more effective in predicting prognosis among the clinical factors, and three cohorts had statistical significance. The risk scores were statistically significant in all four cohorts, and the p value of three cohorts was lower than that of the T stage. In multivariate Cox analysis, risk scores were statistically significant in three cohorts, indicating that the three cohorts were independent of other clinical factors in predicting prognosis. In this analysis, only two cohorts of T stage had statistical significance, so it is obvious that T stage is not as strong as risk score to predict the prognosis. Finally, we compared the risk scores for different grades and T stages in the four cohorts, and found that the risk scores increased with the increase of grade and T stage (p < 0.001, Additional fle 4: Figure S2a, b). In the GSE32548 cohort, we compared the risk scores of FGFR3, PIK3CA, and TP53 (or with the MDM2 alteration) for wild type and mutant type, and found a lower risk score of mutant type than that of wild type for the FGFR3 and PIK3CA groups (p < 0.001, Additional fle 4: Figure S2c). In TP53 (or with the MDM2 alteration), the score of mutant type was higher than that of wild type (P < 0.001, Additional fle 4: Figure S2c).
Seven genes and model were successfully verified in GSE48705
In order to further verify the robustness of this model, RNA sequencing and clinical data of GSE48075 were obtained from the GEO database. We selected the samples (n = 73) with transcriptional data, survival time, and survival status data for the following analysis. Firstly, the survival prediction ability of 11 genes was verified in the GSE48075 cohort. We used the R package "survminer" to find the best cut-off value for 11 genes, and the samples were divided into two groups for Kaplan-Meier analysis according to the best cut-off values. The results showed that SERPINE2, RTKN, PRR11, MAPK12, ELOVL4, DSEL, and COMP were statistically significant (p < 0.05, Additional fle 5: Figure S3), and the prognosis was worse in the high expression group which was consistent with the analysis result of the previous four cohorts. The risk score of patients was calculated according to the model, and the risk score was analyzed by Kaplan-Meier analysis. The prognosis of the high risk-score group was terrible, and the difference was statistically significant (p = 0.0028, Fig. 4a). We drew the ROC curves of the risk score, and the AUC value of 1/3/5 year was 0.676/0.630/0.755 (Fig. 4b). Fig. 4c showed the risk score distribution, gene expression values, and survival status of patients between high and low-risk groups.
Weighted co-expression network analysis and enrichment analysis
A co-expression network was constructed with the TCGA-BLCA data. The co-expression network was constructed with 3844 coding genes (with Cox regression analysis p-values less than 0.05) and 403 samples. First, the expression matrix was transformed into a topological overlap matrix according to β = 4. Then, the genes were divided into different modules (Fig. 5a) using the dynamic pruning tree method. Next, the association analysis of clinical traits and modules (Fig. 5b) showed a high correlation between the turquoise module and risk score (cor = 0.76, p = 2E-74). There was also a high correlation between the turquoise module and survival status (cor = 0.25, p = 9E-07) / grade (cor = 0.3, p = 1E-09) / stage (cor = 0.32, p = 1E-10). We selected genes in the turquoise module for further analysis and selected 128 key genes (Fig. 5c) according to gene-significance > 0.5 and module-membership > 0.8. To explore the potential function of these key genes, pathway and process enrichment analysis of these key genes were performed, as shown in Fig. 5d. The three most highly significantly enriched terms were extracellular matrix organization, collagen fibril organization, and ECM proteoglycans, all related to the tumor microenvironment (TME).
Immune cells can be combined with risk scores for prognostic analysis
We used CIBERSORT to calculate the infiltration ratio of immune cells in 22 TCGA-BLCA samples and used a bar chart to show the infiltration of high and low-risk groups (Fig. 6a). Then, the Wilcoxon test was used to compare the difference between high and low-risk groups. The results showed that B cells naive, Macrophages M0, and Macrophages M1 showed high infiltration in the high-risk group; B cells memory, Dendritic cells resting, and Dendritic cells activated showed high infiltration in the low-risk group (P < 0.001, Fig. 6b). Furthermore, we took the risk score and the infiltration degree of these six kinds of immune cells for joint prognostic analysis. The samples were divided into four clusters for Kaplan-Meier analysis according to the median value of the risk score and immune cell infiltration degree. The results showed that these groups could also be used for prognostic analysis (p < 0.05, Fig. 6c-h). Among them, the prognostic ability of B cells memory is the best. When the degree of B cells memory infiltration is low, and the risk score is high, the prognosis of this cluster is significantly worse than that of other clusters. We used ESTIMATE to calculate the immune-score of the samples, and a high immune-score indicates a high degree of immune cell infiltration. We still used the immune score and risk score for prognostic analysis. The results showed that the cluster with low immune-score and high risk-score had the worst prognosis (Additional fle 6: Figure S4).