2.1 Patient characteristics
All 335 patients were females. All clinical and pathological data of the patients were complete, including age, tumor size, axillary lymph node status, lymphovascular invasion status, histological grade, pathological type, ER/PR/HER2/Ki67/P53 status, and surgical and postoperative adjuvant treatment (specified as chemotherapy, targeted therapy, radiotherapy, and endocrine therapy). Patient characteristics and differences in characteristics by survival outcome are detailed in Table I. Among 335 patients, 22, 267, and 46 had histologic grade I, II, and III tumors, respectively [6]. According to the 2017 American Joint Committee on Cancer staging system [6, 7], 206, 126, 3, and 0 patients had stage T1, T2, T3, and T4 disease, respectively, while 180, 98, 41, and 16 patients had stage N0, N1, N2, and N3 disease, respectively.
2.2 Prognostic factors
First, the expression of Ki67, ER and PR in cancerous tissue areas was quantified as 1-100, which was the percentage of tumor nuclei positive (positive nuclei) over all the tumor nuclei (positive nuclei and negative nuclei) by manual counting. The prognostic factors included in this study were divided into categorical or continuous factors. Appendix I lists the prognostic factors included in this study. Second, to determine the cutoff values for these continuous variables (Ki67, ER, PR, and age), we transformed each continuous factor (Ki67, ER, PR, and age) into a series of binary variables using each observed value as a partition point. Appendix II provides the possible partition points for the four continuous factors. Each partition point converted the continuous variable into one binary variable. Taking Ki67 as an example, point 60% converted Ki67 into one binary variable named “Ki67 status (60%)”, which equals 1 if the immunochemical status of Ki67 was no less than 60% and equals 0 otherwise. Additionally, it was mandatory that each sample size of each binary variable was not less than 10. For each patient case, 95 variables were determined, including categorical factors (histological grade, histological type, stage, T, N, P53, and lymphovascular invasion), continuous variables (Ki67, ER, PR, and age) and a series of binary variables. Based on a previous clinic study and clinical practice [23], we introduced four potential interaction terms: T and N, ER and PR, age and ER, and age and PR. Therefore, we finally determined 1664 variables with the addition of interaction terms.
2.3 Analysis of patient survival times
On September 15, 2016, 311 of the 335 patients were still being followed, and 24 had been lost to follow-up (including one noncancer death). Among the 311 patients, 28 patients had died from breast cancer. The clinicopathologic characteristics of the 28 patients who died from breast cancer are presented in Appendix III. The median follow-up duration was 370 weeks, with a censoring rate of 91.6%. The longest survival time of patients who died was 341 weeks. The survival probability at 341 weeks (approximately six and a half years) of breast cancer patients was 91.2% (95% confidence interval 88.1%-94.3%) (Fig. 2).
2.4 Prognostic model development
To determine the cutoff values for continuous variables (Ki67, ER, PR, and age), we transformed each continuous factor (Ki67, ER, PR, and age) into a series of binary variables using each observed value as a partition point. We calculated the hazard ratios and their 95% confidence limits for these binary variables (see Appendix IV). It was meaningful to look for significant cutoff values for these four factors during modeling; we achieved this objective and focused on the specific prognostic models below.
2.4.1 Cox proportional hazards model without interaction terms
We first introduced 97 variables without interaction items into the model, and a SCAD variable selection method [19] was used to develop the Cox proportional hazard model (named Model 1).
\(\text{h}\left(t\right)={h}_{0}\left(t\right)\text{e}\text{x}\text{p}\{1.38{x}_{1}+1.01{x}_{2}+1.643{x}_{3}-0.949{x}_{4}+1.04{x}_{5}\) } Eq. (1)
The prognostic factors, beta coefficients, hazard ratios for prognostic factors, and P values of Z tests are provided in Table II.
The multivariable Cox regression analysis for BCSM without interaction terms, Model 1, showed that histological grade, N status, Ki67 status, PR status, and age were statistically significant predictors of patient survival. The cutoff value for Ki67 was 60%, the cutoff value for PR was 20%, and the cutoff value for age was 55 years old.
2.4.2 Cox proportional hazards model with interaction terms
Second, 1664 variables including interaction terms were used in the model to evaluate the interaction effect between prognostic factors. A SCAD variable selection method [19] was also used to develop the Cox proportional hazard model (named Model 2).
\(\text{h}\left(t\right)={h}_{0}\left(t\right)\text{exp}\left\{1.392{x}_{1}+0.995{x}_{2}+1.595{x}_{3}+1.202{x}_{5}-1.157{x}_{6}\right\}\) Eq. (2)
The prognostic factors, beta-coefficients, hazard ratios for prognostic factors, and P values of Z tests are provided in Table III.
Model 2 showed that there might be an interaction effect between PR (20%) and age (41 years). However, PR (20%) and age (41 years) were not statistically significant predictors of patient survival in Model 2. These results indicated that Model 2 may not be self-sufficient.
2.4.3 Extended Cox prognostic model
To further optimize the two above Cox models, we first included all variables in Model 1 and Model 2, which were histological grade, N status, Ki67 status (60%), PR status (20%), age (55 years), age (41 years) and PR-age (age ≥ 41years, PR ≥ 20%), in the Cox model. The outcome indicated that PR status (20%)、age (41 years old) and PR-age were not statistically significant. Furthermore, taking previous clinical experience into account, we were aware that the interaction between PR and age might vary according to the different time periods after surgery and that the model could be divided into two parts by a specific time point. To determine the optimal time point, we considered both death and censored time periods. We developed a new Cox Model 3 (named the “extended Cox prognostic model”) with a time threshold of 164 weeks after surgery, based on the smallest AIC value.
\(\text{h}\left(\text{t}\right)=\left\{\begin{array}{c}{h}_{0}\left(t\right)\text{exp}\left\{A\right\}, t \le 164weeks\\ {h}_{0}\left(t\right)\text{exp}\left\{A+1.594{x}_{8}\right\}, t > 164weeks\end{array}\right.\) Eq. (3)
$$\text{A}=1.329{x}_{1}+0.972{x}_{2}+1.65{x}_{3}-1.802{x}_{4}+1.378{x}_{5}-1.563{\text{x}}_{7}$$
The prognostic factors, beta-coefficients, hazard ratios for prognostic factors, and P values of Z tests are provided in Table IV and Table V.
The extended Cox prognostic model determined the cutoff values for multiple continuous prognostic factors and the interaction effect between the factors. First, the cutoff values of the prognostic factors were determined by Cox prognostic model with the SCAD variable selection method [19]. Among 1730 predictors, histological grade and N status were considered categorical factors. However, the model showed that Ki67, PR and age were also categorical factors and the prognostic model automatically determined their reasonable cutoff values. For Ki67 expression, a cutoff value of 60% was selected to distinguish between low expression (< 60%) and high expression (≥ 60%). For PR expression, a cutoff value of 20% was selected to distinguish between low expression (< 20%) and high expression (≥ 20%). Regarding age, we categorized patients into three groups based on age at the time of surgery: the older group, who were 55 years or older; the younger group, who were less than 41 years old; and the middle-aged group, who were between 41 and 55 years old. Second, for the interaction effect, after 164 weeks postoperation, there was an interaction between age and PR: the older the patients with ER/PR+, HER2-, PR ≥ 20% were, the lower survival and more likely to recur and metastasize exceeding 164 weeks (more than 3 years) after surgery.
In our study, we found that the hazard ratio for BCSM increased 2.78 times as the histological grade increased by one level. If N statuses increased one level, the hazard ratio for BCSM increased 1.64 times. The patients with high Ki67 expression had a hazard ratio for BCSM that was 4.21 times higher than that of the patients with low Ki67 expression. Within 164 weekspostoperation, the hazard ratio for BCSM of the patients with low PR expression was 4.88 times higher than that of the patients with high PR expression. Within 164 weeks postoperation, patients aged < 41 had the highest hazard ratio for BCSM, followed by patients aged ≥ 55, while patients aged 41 to 55 showed the lowest hazard ratio for BCSM. After 164 weeks postoperation, there was an interaction effect between age and PR for patients aged ≥ 41 years and PR ≥ 20%. The hazard ratio for BCSM in patients aged ≥ 41 years and PR ≥ 20% was elevated after 164 weeks postoperation. For patients with high PR expression, age was positively correlated with mortality after 164 weeks postoperation. For patients with high PR expression after 164 weeks postoperation, the patients aged 41 to 55 had nearly the same hazard ratio for BCSM as those aged < 41, while the hazard ratio for BCSM of patients aged ≥ 55 years was 3.09 times higher than that of patients aged < 41 years.
2.5 Utilizing the nomogram
To improve the practicability of our model, we established a nomogram of 1-year, 3-year and 5-year survival probability based on our model (Fig. 3). The nomogram consisted of eleven rows. The first “points” row was the point assignment for each variable. For an individual patient, each variable (rows 2–7) was assigned a point value by drawing a vertical line between the exact variable value and the first “points” row. Subsequently, the total points could be obtained by summing all of the allotted points for the six variables (rows 2–7). The total point sum was found in the ‘‘total points’’ row. Finally, the 1-, 3-, and 5-year survival probability could be predicted by drawing a vertical line between the ‘‘total points’’ row and the “predicted probability” rows (rows 9–11, respectively).
2.6 Model discrimination and calibration
As expected, this extended Cox prognostic model showed good discrimination. Figure 4 shows the ROC curves for BCSM at 1 year, 3 years, and 5 years postoperatively. The AUCs for BCSM at 1 year, 3 years, and 5 years postoperatively are shown in Table VI. The AUC values for BCSM at 1 year, 3 years and 5 years postoperatively were all larger than 0.80. The AUC value for BCSM at 3 years postoperation was as high as 0.94, with a 95% CI from 0.89 to 0.99. Therefore, our extended Cox prognostic model showed good discrimination.
The extended Cox prognostic model was also well calibrated by the GOF test [21]. We grouped the risk scores into 10 groups and then calculated the GOF statistic 7.37, with a P-value of 0.6 (> 0.05) [22]. These results indicated that our extended Cox prognostic model fit was good.