Prognostic signature generation
In this study, in order to identify candidate prognostic genes that are significantly associated with OS, we performed univariate Cox proportional hazard regression analysis on each data. Using P <0.05 and HR <1 as the cutoff criteria, 1797 genes in GSE26085, 895 genes in GSE42568, 450 genes in GSE20711, and 666 genes in GSE88770 were identified as candidate protect genes. Using P <0.05 and HR> 1 as the cutoff criteria, there were 2528 genes in GSE26085, 1771 genes in GSE42568, 766 genes in GSE20711, and 1292 genes in GSE88770 were identified as candidate danger genes. Finally, the common genes in four datasets were retained as prognostic genes. Two protect genes (STXBP3, PKN2) and seven danger genes (TCAP, STARD3, CDR2L, PNMT, GPR4, ANGPT2, CAPN5) were finally obtained. Both them are mRNAs, and general information for these genes is shown in Table 1. The prognostic correlations of the 9-genes in each data set with respect to overall survival are shown in Table 2 (HR value, 95% confidence interval, P value).
9-gene prognostic signature validation
Regression coefficient weighted prognostic gene expression values were used to construct a risk score for each patient，and a prediction model was constructed to predict overall survival of patients. Calculate the risk score for each patient, and then divide patients into high-risk and low-risk groups based on the median risk score. In Figure 2, the ranking is based on the risk score values of 9 gene signatures from low to high, the risk score distribution, risk gene expression and patient survival status of the four data sets GSE20685, GSE42568, GSE20711, and GSE88770 are shown respectively.
In Figure 3, Kaplan-Meier survival curves and ROC curves generated from 9-gene signature in four data sets are shown. Outcome data indicate that the overall survival of patients in the high-risk group is shorter than in the low-risk group(GSE20685: HR=2.68(1.67-4.29), P value=1.99e-05 ; GSE42568: HR=9.75(3.42-27.79), P value=1.82e-07; GSE20711: HR=4.38(1.74-11.02), P value=6.38e-04; GSE88770: HR=3.26(1.42-7.51), P value=3.35e-03) (Figure 3 Left panel). For further verification, we divided patients into three groups: high, medium, and low risk based on the value of the risk score. The results still show that the higher the risk score, the worse the patient's overall survival (Figure 3 Middle panel). Finally, we constructed a ROC curve based on 9-gene signature. The results show that, as time went on, the AUC values of the four datasets have remained at a relatively satisfactory value, can effectively predict overall survival (Figure 3 Right panel).
Of these 9 genes, two of them are protect genes (STXBP3, PKN2; HR <1) and seven of them are danger genes (TCAP, STARD3, CDR2L, PNMT, GPR4, ANGPT2 and CAPN5; HR> 1). Figure 4 showed the difference in expression of these 9 prognostic genes in the low-risk and high-risk groups. The results showed that high-risk group patients had higher danger gene expression, while low-risk group patients had lower protect gene expression.
9-gene prognostic signature is independent of other clinicopathological factors
Multiple Cox regression analysis was used to assess whether 9 gene markers could be used as independent prognostic factors. The covariates in dataset GSE20685 include T_stage, M_stage, lymph node metastasis (nodes), in dataset GSE42568 include T_stage, nodes, grade, ER status, in dataset GSE20711 include T_stage, nodes, grade, ER status, HER2 status, and in dataset GSE88770 include nodes, grade, ER status, Ki67. Multivariate COX regression results showed that in four independent datasets, the 9-gene signature can be used as an independent prognostic factor, and its predictive ability has nothing to do with other clinicopathological factors(GSE20685: HR= 2.234(1.3668-3.651), P value= 0.001; GSE42568: HR= 8.388(2.908-24.196), P value= 8.33e-05; GSE20711: HR= 3.857(1.522-9.772), P value= 0.012; GSE88770: HR= 2.860(1.239-6.600), P value= 0.014)(Table 3). In addition, in four independent datasets, lymph node metastasis status can also be used as an independent prognostic factor.
Nomogram development and validation
In four independent datasets, we constructed a nomogram including statistically significant clinicopathological factors and 9-gene prognostic signature, to better quantitatively predict the three-year and five-year survival rate(Figure 5. Top panel And Supplementary figure 1. Left panel). The calibration curve (Figure 5. Lower panel And Supplementary figure 1. Right panel)and C index (GSE20685: Concordance= 0.735 (se=0.028); GSE42568: Concordance= 0.828(se=0.034); GSE20711: Concordance= 0.726(se=0.047); GSE88770: Concordance= 0.76(se =0.055)) indicate that the prediction results of the nomogram shows high accuracy.
In the above-mentioned multiple Cox regression analysis, some clinicopathological factors were identified as independent prognostic factors. In order to determine whether the 9-gene signature can be used to predict the overall survival of patients within the same clinical factor subgroup, we combined the four datasets for stratification analysis. We analyzed the patients in the entire cohort according to the lymph node metastasis status (nodes), ER status, T_stage, grade (due to the number of patients in grade I was too samll, only grade II and grade III were analyzed), and divided the patients into high-risk and low-risk groups. The results of the Kaplan-Meier survival curve show that in the same clinical subgroup, the overall survival of patients in the low-risk group is higher than that in the high-risk group(nodes(-): HR= 3.72(2.47-5.60), P value= 1.93e-11; nodes(+): HR= 2.09(1.05-4.13), P value= 3.10e-02; ER(+): HR= 4.80(2.50-9.21), P_value= 2.08e-07; ER(-): HR= 4.73(1.96-11.44), P value= 1.45e-03; T_stage I : HR= 2.32(1.21-4.44), P value= 9.22e-03; T_stage II: HR= 4.82(2.75-8.46), P value= 1.41e-09; Grade II: HR= 4.52(1.94-10.54), P value= 1.32e-04; Grade III: HR= 5.75(2.58-12.83), P value= 1.39e-06)(Figure 6).
Relationship between the 9-gene signature and Disease-Free Survival
We used disease-free survival (DFS) data in GSE42568 and GSE20711 datasets and distant recurrence-free survival (DRFS) data in GSE88770 to determine the role of the 9-gene prognostic signature in predicting disease relapse. The results showed that the disease-free survival (DFS) of BRCA patients in the high-risk group was significantly lower than in the low-risk group (GSE42568: HR = 3.71 (1.91-7.21), P = 3.28e-05; GSE20711: HR = 1.82 (0.95- 3.51), P = 6.88e-02; and GSE88770: HR = 2.59 (1.09-6.17), P = 2.59e-02) (Figure 7. Left panel). The time-varying ROC curve also further verified that the 9-gene prognostic signature has a substantially effective performance in predicting disease-free survival（Figure 7. Right panel）.
9-gene prognostic signature external validation
We used the GSE16446 dataset for external verification. The KM survival curves for OS and DFS showed that the 9-gene prognostic signature have good predictive ability, and the ROC working curve also showed that the gene signature has good working efficiency(Figure 8.).
Gene Set Enrichment Analysis
Finally, we used GSEA enrichment analysis to better determine the biological function of the 9-gene prognostic signature. The top 5 KEGG pathways enriched in high-risk and low-risk sample groups are shown according to the FDR <0.05 cut-off criteria: bladder cancer, glycosaminoglycan biosynthesis chondroitin sulfate, nicotinate and nicotinamide metabolism, steroid biosynthesis, and steroid hormone biosynthesis (Figure 9.).