In search of target genes related to DRFS
The GSE25055 and GSE25065 datasets were used as discovery cohorts, meanwhile GSE32603 as a validation cohort. Batch effects were removed with the help of “sva” package in R 4.0.2 (Fig. 1a,b). First of all, we performed univariate Cox regression analysis in GSE25055 for acquiring 2753 genes associated with DRFS, with the standard of P༜0.05, and in GSE25065 was 2002. Then, in order to remove confounding bias and sort out independent prognostic indicator for breast cancer patients, we conducted multivariate Cox regression analysis on the basis of the result of univariate Cox regression analysis in GSE25055 with the condition of screening of P༜0.05 for acquiring 889 genes as the predictors of DRFS of patients independently, and in GSE25065 was 642. And 95 genes were finally confirmed to be associated with DRFS both in GSE25055 and GSE25065.
Gene Ontology (GO) enrichment analysis of genes for identifying potential pathways and functional annotations
To discover the potential pathways and functional annotations of the genes associated with DRFS for further study. We performed the GO functional analysis separately in GSE25055 and GSE25065 (Fig. 1c,d) with P value < 0.05 was statistical difference (Table. S1,2). From the cellular component (CC), biological process (BP), and molecular function (MF),we found the genes were mainly concerned with cell division, cell-cycle progression, controlling cell proliferation, apotosis and DNA repair. Based on the current study, we found that these functional annotations were closely related to genes of E2F family.
Screening out the target gene in GSE32603
After the GO functional analysis, E2F family were chosen for further study. 200 target genes of E2F family were extracted in hallmark E2F targets gene set from the GSEA database for explaining the relationship between genes related to DRFS and the Hallmark E2F target gene set so that we obtained 3 intersection genes CDC25B ,MTHFD2 ༌PTTG1(Fig. 2a). Then we conducted heatmaps of the three genes expression profiles in GSE25055 and GSE25065 individually (Fig. 2b,c).
Next, we chose GSE32603 as a validation cohort for screening the target gene. First of all, Kaplan-Meier curve analysis of CDC25B, MTHFD2, PTTG1 was performed.
The Kaplan-Meier (K-M) survival curve showed that in GSE32603 the DRFS of high CDC25B, MTHFD2, PTTG1 expression patients was lower than the low ones using log-rank tests with P = 0.012, P = 0.008, P༜0.001 (Fig. 3a). Then we conducted multivariate Cox regression analysis in GSE32603. The result shows the expression level of PTTG1 (P = 0.029) and the pathological complete response (P༜0.001) could serve as independent DRFS predictors, besides, PTTG1 was a high-risk gene with HR = 1.396, 95%CI: 1.035–1.883. However, CDC25B (P = 0.093) and MTHFD2 (P = 0.101) were considered to be confounding factors which couldn’t predict DRFS independently (Fig. 3b).
The results above proved that PTTG1 was a gene related to DRFS,and PTTG1 high expression indicated a worse DRFS. Therefore, PTTG1 was sorted out for further study.
Discovery of E2F-related genes and enriched pathway connected with PTTG1 in breast cancer
Then we chose the proteins interacted with PTTG1 in the three subsets, and constructed a PPI network which is centered on PTTG1 (Fig. 4a). We used a permutation test to identify the reliability of the network, the P value of all the nodes and edges were less than 0.001. As is shown in the PPI network, the most important biological processes were Nuclear division, Negative regulation of cell cycle process, Nuclear chromosome segregation, Regulation of chromosome organization. The results of GO and KEGG enrichment analysis could be seen in Fig. S1,2. Then We separated patients into PTTG1 high expression and low expression groups, and additionally employed a GSEA functional annotation approach. The result showed PTTG1 high expression patients were signifificantly enriched for E2F family target genes relative to patients from PTTG1 low expression patients with P value = 0.016, normalized enrichment score (NES) > 1.5 and false discovery rate (FDR) = 0.007 (Fig. 4b) (Table. S3), indicating high PTTG1 was associated with E2F pathways in BC.
Using ROC curve to evaluate the accuracy and specificity of PTTG1 for predicting the DRFS of breast patients, the result showed the AUC of PTTG1 was 0.682 indicating a relative poor prediction performance (Fig. 4c).
Establishment of a nomogram for better DRFS performance
In order to predict DRFS better, a nomogram was established, including clinical characteristics and expression level of PTTG1 in GSE25055 and GSE25065 for predicting 1-year, 3-year, 5-year DRFS of breast cancer patients receiving taxane and anthracycline-based NACT (Fig. 5a).
Each patient owned his total score by adding a corresponding score for each prognostic factor in the nomogram. Patients were separated into a high-risk group and a low-risk group based on riskscores. Patients with a high-risk score exhibited significantly worse DRFS compared to the ones with a low-risk score (P༜0.001), as was shown in the Kaplan-Meier (K-M) survival curve (Fig. 5b).
The calibration plots for 1-year, 3-year, 5-year distant relapse-free survival of patients indicated the nomogram-predicted outcome showed high concordance with the actual outcome (Fig. 5c,d,e). The C-index of predicted relapse-free survival was 0.805 which showed its high prediction accuracy and a high sense of coherence. For the further assessment of the accuracy, specificity and sensitivity of the nomogram, we performed receiver operating characteristic (ROC) curve analysis for the evaluation of the prediction power of the model. The area under ROC curve (AUC) of the risk nomogram was 0.849, while AUC of the expression level of PTTG1 was 0.682, AUC of pathologic_response was 0.409, AUC of grade was 0.593, AUC of ajcc_stage was 0.696, AUC of N stage was 0.698, AUC of T satge was 0.701, AUC of the status of PR was 0.332, AUC of ER was 0.306, AUC of age was 0.490 (Fig. 5f).