Filtering the target genes related to DRFS
The GSE25055 and GSE25065 datasets were used as discovery cohorts, meanwhile GSE32603 as a validation cohort. Batch effects were removed with the help of “sva” R package (Fig. 2a,b). First of all, we performed univariate Cox regression analysis in GSE25055 for acquiring 2,753 genes that associated with DRFS, with the standard of P༜0.05, and in GSE25065 was 2,002. In order to remove confounding bias and sort out independent prognostic indicator for breast cancer patients, we conducted multivariate Cox regression analysis on the basis of the result of univariate Cox regression analysis in GSE25055 with the condition of screening of P༜0.05 for acquiring 889 genes as the predictors of DRFS of patients independently, and in GSE25065 was 642. The 95 genes were finally confirmed to be associated with DRFS both in GSE25055 and GSE25065.
Gene Ontology (GO) enrichment analysis of genes for identifying potential pathways and functional annotations
To discover the potential pathways and functional annotations of the genes associated with DRFS for further study. We performed the GO functional analysis separately in GSE25055 and GSE25065 (Fig. 2c,d) with P < 0.05 was statistical difference. From the cellular component (CC), biological process (BP), and molecular function (MF), and we found the genes were mainly concerned with cell division, cell-cycle progression, controlling cell proliferation, apotosis and DNA repair. Based on the above study, we found that these functional annotations were closely related to genes of E2F family.
Screening out the target genes in GSE32603
After the GO functional analysis, E2F family were chosen for further study. Two hundreds target genes of E2F family were extracted in hallmark E2F targets gene set from the GSEA database for explaining the relationship between genes related to DRFS and the Hallmark E2F target genes, thus we obtained 3 intersection genes CDC25B,MTHFD2༌PTTG1 (Fig. 3a). We conducted heatmaps of the three genes expression profiles in GSE25055 and GSE25065 individually (Fig. 3b,c). Furthermore, we chose GSE32603 as a validation cohort for screening the target genes. And Kaplan-Meier curve analysis of CDC25B, MTHFD2, PTTG1 was also performed.
The Kaplan-Meier (K-M) survival curve showed that in GSE32603 the DRFS of high CDC25B, MTHFD2, PTTG1 expression patients was lower than the low ones with P=0.012, P=0.008, P༜0.001 using log-rank tests (Fig. 4a). Then we conducted multivariate Cox regression analysis in GSE32603. It indicated the expression level of PTTG1 (P=0.029) and the pathological complete response (P༜0.001) could serve as independent DRFS predictors, besides, PTTG1 was a high-risk gene with HR=1.396, 95%CI: 1.035-1.883. However, CDC25B (P=0.093) and MTHFD2 (P=0.101) were considered to be confounding factors which couldn’t predict DRFS independently (Fig. 4b).
The above results proved that PTTG1 was a gene related to DRFS,and high expression of PTTG1 indicated a worse DRFS. Therefore, PTTG1 was sorted out for further study.
Discovery of E2F-related genes and enriched pathway connected with PTTG1 in breast cancer
We chose the proteins interacted with PTTG1 in the three subsets, and constructed a PPI network which is centered on PTTG1 (Fig. 5a). We used a 100,000 iterations permutation test to identify the reliability of the network, both the P value of the nodes and edges were < 0.001. The PPI network indicated that the most important biological processes were Nuclear division, Negative regulation of cell cycle process, Nuclear chromosome segregation, Regulation of chromosome organization. The results of GO and KEGG enrichment analysis seen in Supplementary Fig. 1,2. We separated patients into PTTG1 high expression and low expression groups, and additionally employed a GSEA functional annotation approach. The result showed PTTG1 high expression patients were significantly enriched in E2F family target genes that relative to patients from PTTG1 low expression patients (P =0.016), normalized enrichment score (NES) > 1.5 and false discovery rate (FDR)=0.007 (Fig. 5b), indicated the high expression of PTTG1 was associated with E2F pathways in breast cancer.
Using ROC curve to evaluate the accuracy and specificity of PTTG1 for predicting the DRFS of breast cancer patients, the AUC of PTTG1 was 0.682 that indicated a relative poor prediction performance (Fig. 5c).
Establishment of a nomogram for better DRFS performance
In order to predict DRFS better, a nomogram was established, including clinical characteristics and expression level of PTTG1 in GSE25055 and GSE25065 for predicting 1-year, 3-year, 5-year DRFS of breast cancer patients receiving taxane and anthracycline-based NACT (Fig. 6a).
The patient owned his total score by adding a corresponding score for each prognostic factor in the nomogram. Patients were separated into a high-risk cancer group and a low-risk cancer group that based on the risk scores. Patients with a high-risk score exhibited significantly worse DRFS that compared with the low-risk score (P༜0.001), the same result in theK-M survival curve (Fig. 6b).
The calibration plots for 1-year, 3-year and 5-year distant relapse-free survival of patients that indicated the nomogram-predicted outcome, showed high concordance with the actual outcome (Fig. 6c,d,e). The C-index of predicted relapse-free survival was 0.805 which showed its high prediction accuracy and a high sense of coherence. For the further assessment of the accuracy, specificity and sensitivity of the nomogram, we performed receiver operating characteristic (ROC) curve analysis for the evaluation of the prediction power of the model. The area under ROC curve (AUC) of the risk nomogram was 0.849, while AUC of the expression level of PTTG1 was 0.682, and AUC (pathologic_response) = 0.409, AUC (grade) = 0.593, AUC (ajcc_stage) = 0.696, AUC (N stage) = 0.698, AUC (T stage) = 0.701, AUC (status of PR) = 0.332, AUC (ER) = 0.306, AUC (age) = 0.490 (Fig. 6f).