Data processing and research process
We obtained the raw RNA sequencing and clinical data of TCGA-BLCA (n=412), GSE13507 (n=165), GSE32548 (n=146) and GSE32894 (n=308). We utilized data only from patients associated with RNA sequencing data, survival time, survival status, and primary tumor for further analysis. The basic clinical information of the remaining patients is summarized in Table 1, the sample sizes of these four cohorts are all greater than 100. The grade of bladder cancer is closely related to recurrence and invasive behavior. Two grading methods were used in these four cohorts. TCGA-BLCA and GSE13507 used the WHO grading standard of 2004, which was divided into PUNLMP (Papillary urothelial neoplasms of low malignant potential), low grade, and high grade. GSE32548 and GSE32894 used the WHO grading standard of 1999, which was divided into grade 1 (G1), grade 2 (G2), and grade 3 (G3). The research process is shown in Fig. 1.
Construction of prognostic model
Univariate Cox proportional hazard analysis was carried out in TCGA-BLCA, GSE13507, GSE32548, and GSE32894 cohorts. There were 3301 genes in TCGA-BLCA, 440 genes in GSE13507, 2404 genes in GSE32548, and 2768 genes in GSE32894 that met the criteria (HR > 1 and p < 0.05). There were 4743 genes in TCGA-BLCA, 414 genes in GSE13507, 1717 genes in GSE32548, and 3248 genes in GSE32894 that met the criteria (HR < 1 and p < 0.05). Combining the four datasets allowed identification of 24 risk genes and 10 protective genes (Fig. 1). Because of the large sample size of TCGA-BLCA, we used this cohort to build the prognostic model. First, 24 genes with univariate Cox p-values less than 0.01 in TCGA-BLCA were selected. Then the 24 genes were analyzed by Lasso regression analysis (Fig. 2a), when the number of genes in the model was 11, the deviance was the smallest (Fig. 2b). According to the lambda value, the corresponding regression coefficients (coef) of the selected 11 genes could be determined. The prognostic model could then be constructed by using the corresponding coef of the 11 genes. To see more intuitively whether these genes are collinear, we analyze the co-expression of these genes. As shown in the Fig. 2c, the co-expression index of none of these two genes is greater than 0.5. Finally, we successfully constructed a prognostic module: Risk score = SERPINE2*0.02 + PRR11*0.13 + FABP6*(-0.000318) + C16orf74*(-0.0564) + DSEL*0.107 + DNM1*0.0142 + COMP*0.0223 + TNK1*(-0.0972) + ELOVL4*0.00152 + RTKN*0.126 + MAPK12*0.0304. The basic information and coef values of the 11 genes are listed in (Additional file 1: Table S1). The average expression values of genes in the four cohorts are greater than 1, which is of practical significance for detection (Additional file 2: Table S2). The results of univariate regression analysis of these 11 genes in 4 cohorts are shown in additional file 3: Table S3.
Kaplan-Meier analysis of 11 genes
Eleven genes were taken Kaplan-Meier analysis in 4 cohorts. Using the heat map to show the results of the study (Fig. 2d), except for DSEL in GSE32894 and C16orf74 in GSE32548, the other analyses were statistically significant (p < 0.05). SERPINE2, RTKN, PRR11, MAPK12, ELOVL4, DSEL, DNM1, and COMP showed that the prognosis of patients with high expression was worse, and the analysis of ELOVL4 in TCGA-BLCA was taken as an example (p < 0.001, Fig. 2e). TNK1, FABP6, and C16orf74 showed that the prognosis of the low expression group was worse, and the analysis of FABP6 in TCGA-BLCA was taken as an example (p < 0.001, Fig. 2f).
The degree of DNA methylation of TNK1 and C16orf74 was negatively correlated with gene expression
DNA methylation can regulate gene expression. We explored the relationship between expression and methylation of these 11 genes (Additional file 4: Figure S1a-k). The results showed that there was a negative correlation between TNK1 gene methylation and gene expression (Spearmen cor = -0.51, p = 1.44e-28), so did as C16orf74 (Spearmen cor = -0.52, p = 8.97e-29). Then, we took TNK1 and C16orf74 methylation data combined with clinical data for Kaplan-Meier analysis. We found that the degree of methylation of these two genes can predict the prognosis of bladder cancer (p < 0.05, Additional file 4: Figure S1i, m), and the prognosis is worse in the case of hypermethylation. The expression of TNK1 and C16orf74 is inhibited by hypermethylation, which leads to a worse prognosis of bladder cancer.
Verification of the prognostic model
The prognostic model was used to calculate the risk scores of each patient in the training set (TCGA-BLCA) and three test sets (GSE13507, GSE32548, and GSE32894). We identified the best cut-off value with a risk score of 2.40 for TCGA-BLCA. Using this method, the cut-off values of GSE13507 / GSE32548 / GSE32894 were 2.56 / 1.96 / 1.92. The Kaplan-Meier curves showed that the prognosis of patients with high-risk was significantly worse than that of patients with low-risk in the four cohorts (p<0.001, Fig. 3a-d). The Receiver Operating Characteristic (ROC) curves of the four cohorts were drawn: the 1 / 3 / 5 year Area Under the Curve (AUC) values for the TCGA-BLCA cohort were 0.686, 0.665, and 0.666, respectively (Fig. 3e); those for the GSE13507 cohort were 0.800, 0.742, and 0.697, respectively (Fig. 3f); those for the GSE32548 cohort were 0.826, 0.792, and 0.763, respectively (Fig. 3g); and those for the GSE32894 group were 0.781, 0.831 and 0.839 (Fig. 3h). Additional file 5: Figure S2 shows the risk score distribution, gene expression values, and survival status of patients in both the high-risk group and the low-risk group.
The clinical factors and risk scores of the four cohorts were analyzed by univariate Cox and multivariate Cox regression analysis (Table 2). The results of univariate analysis showed that T stage was more effective in predicting prognosis among the clinical factors, and three cohorts had statistical significance. The risk scores were statistically significant in all four cohorts, and the p value of three cohorts was lower than that of the T stage. In multivariate Cox analysis, risk scores were statistically significant in three cohorts, indicating that the three cohorts were independent of other clinical factors in predicting prognosis. In this analysis, only two cohorts of T stage had statistical significance, so it is obvious that T stage is not as strong as risk score to predict the prognosis. Finally, we compared the risk scores for different grades and T stage in the four cohorts, and found that the risk scores increased with the increase of grade and T stage (p < 0.001, Additional file 6: Figure S3a, b). In the GSE32548 cohort, we compared the risk scores of FGFR3, and TP53 (or with the MDM2 alteration) for wild type and mutant type. Additional file 6: Figure S3c shows that a lower risk score of mutant type than that of wild type for the FGFR3 groups (p < 0.001). In TP53 (or with the MDM2 alteration), the score of mutant type was higher than that of wild type (P < 0.001, Additional file 6: Figure S3c).
Seven genes and model were successfully verified in GSE48705
We evaluated the prognostic ability of 11 genes and models in GSE48075 (n=73). The results showed that SERPINE2, RTKN, PRR11, MAPK12, ELOVL4, DSEL, and COMP were statistically significant (p < 0.05, Additional file 7: Figure S4), and the prognosis was worse in the high expression group which was consistent with the analysis result of the previous four cohorts. The risk score of patients was calculated according to the model, and the risk score was analyzed by Kaplan-Meier analysis. The prognosis of the high risk-score group was worse, and the difference was statistically significant (p = 0.0028, Fig. 4a). We drew the ROC curves of the risk score, and the AUC value of 1 / 3 / 5 year was 0.676 / 0.630 / 0.755 (Fig. 4b). Fig. 4c showed the risk score distribution, gene expression values, and survival status of patients between high and low-risk groups.
Weighted co-expression network analysis and enrichment analysis
The co-expression network was constructed with 3844 coding genes and 403 samples in TCGA-BLCA cohort. First, the expression matrix was transformed into a topological overlap matrix according to β = 4. Then, the genes were divided into different modules (Fig. 5a) using the dynamic pruning tree method. Next, the association analysis of clinical traits and modules (Fig. 5b) showed a high correlation between the turquoise module and risk score (cor = 0.76, p = 2E-74). There was also a high correlation between the turquoise module and survival status (cor = 0.25, p = 9E-07) / grade (cor = 0.3, p = 1E-09) / stage (cor = 0.32, p = 1E-10). We selected 128 key genes (Fig. 5c) in the turquoise module according to the standard. To explore the potential function of these key genes, pathway and process enrichment analysis of these key genes were performed, as shown in Fig. 5d. The three most highly significantly enriched terms were extracellular matrix organization, collagen fibril organization, and ECM proteoglycans, all related to the tumor microenvironment (TME).
Immune cells can be combined with risk scores for prognostic analysis
We used CIBERSORT to calculate the infiltration ratio of 22 immune cells in TCGA-BLCA samples and used a bar chart to show the infiltration of high and low-risk groups (Fig. 6a). Then, the Wilcoxon test was used to compare the difference between high and low-risk groups. The results showed that B cells naive, Macrophages M0, and Macrophages M1 showed high infiltration in the high-risk group; B cells memory, Dendritic cells resting, and Dendritic cells activated showed high infiltration in the low-risk group (P < 0.001, Fig. 6b). Furthermore, we took the risk score and the infiltration degree of these six kinds of immune cells for joint prognostic analysis. The samples were divided into four clusters for Kaplan-Meier analysis according to the median value of the risk score and immune cell infiltration degree. The results showed that these groups could also be used for prognostic analysis (p < 0.05, Fig. 6c-h). Among them, the prognostic ability of B cells memory is the best. When the degree of B cells memory infiltration is low, and the risk score is high, the prognosis of this cluster is significantly worse than that of other clusters. We used ESTIMATE to calculate the TCGA-BLCA cohort's immune score and then combined with the risk score for Kaplan-Meier analysis. The results showed that the cluster with low immune-score and high risk-score had the worst prognosis (Additional file 8: Figure S5).