In the present study, we analyzed EMR data to identify significant prognostic factors for breast cancer survival and recurrence using a combination of ML approaches and the Cox proportional hazards model. History of recurrence and SEER stage were related to breast cancer prognosis overall. The type of surgery and the diagnostic route were prognostic factors for survival, whereas a history of diabetes and the presence of a breast lump were prognostic factors for recurrence.
Among the diagnostic routes, an incidental diagnosis of breast cancer (i.e., one made outside formal cancer screening or surveillance and in the absence of tumor-related symptoms [23]) was associated with a significantly higher risk of death than a diagnosis made via breast screening. Because few relevant studies exist, it is difficult to draw definite conclusions about the effect of the diagnostic route on survival time. However, detection by regular health screening has been reported to be an independent prognostic factor for breast cancer and to be associated with a more favorable survival rate [24, 25]. Future research should build on these findings and explore potential reasons for the prognostic differences between screening-detected and incidentally detected breast cancer. Regarding recurrence, our findings are consistent with previous research reporting that breast cancer patients who experience recurrence tend to report breast lumps [26]. Although the clinical significance of a breast lump for the risk of recurrence has already been established, our findings, based on big data and an ML approach, provide confirmatory evidence that a breast lump is a key prognostic factor.
Notably, a history of diabetes increased the risk of early recurrence of breast cancer in this study. Previous prognostic analyses have emphasized the need to consider underlying diseases. For example, meta-analyses have revealed that pre-existing diabetes is associated with poorer overall and disease-free survival among breast cancer patients [27]. Furthermore, co-occurring cardiovascular and pulmonary diseases are likely to increase the risk of death in breast cancer patients [28]. Although the effect of diabetes on the early recurrence of breast cancer has been extensively investigated [27], our study provides additional evidence, based on ML and big data analysis, of the vital role that underlying health conditions play in breast cancer prognosis.
A strength of this study was the use of data from “scalable” hospital EMRs to identify key prognostic factors for breast cancer. Conventional prognostic predictions for breast cancer rely on data that are restricted to particular populations or hospitals, or that are drawn from randomized controlled trials [6]. In contrast, EMR systems are comprehensive databases that help physicians document patient care and that contain detailed lists of patient symptoms, which expands the scope and accuracy of prognostic predictions. EMR systems also have disadvantages, including redundant, missing, or inaccurate data and internally inconsistent progress notes [29], all of which pose a threat to patient safety. Thus, EMR systems require meticulous updating and maintenance.
Several limitations of this study should be discussed. First, we analyzed only part of the available big data on breast cancer. Analysis of full datasets remains a challenge, as libraries of all breast cancer cases have not yet been compiled. Second, although hormone therapy, for example with estrogen or progesterone, is a key prognostic factor for breast cancer, it was not analyzed in the present study. Third, oversampling was used because of the relatively small number of patients available for the study, and the elimination of variables with a large amount of missing data may have substantially affected the results. Nonetheless, the hybrid LASSO-Cox method used in this study was effective for identifying the most pertinent prognostic factors and improved the precision of the Cox proportional hazards model by omitting redundant features [14]. It is also notable that we applied LASSO regression, which is well suited to feature selection but has not been used extensively in breast cancer prognostic studies. Finally, our study demonstrated the immense potential of big data and ML techniques for obtaining new insight into breast cancer; in this manner, we have obtained a deeper understanding of the clinical course of the disease [30].
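For readers unfamiliar with how a LASSO penalty omits redundant features from a Cox model, the standard textbook formulation can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the study: $x_i$ is the covariate vector of patient $i$, $\delta_i$ is the event indicator (1 if death or recurrence was observed, 0 if censored), $R(t_i)$ is the set of patients still at risk at time $t_i$, and $\lambda \ge 0$ is a tuning parameter, typically chosen by cross-validation. Tied event times are ignored for simplicity.

```latex
% Cox log partial likelihood (no tied event times assumed)
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \left[ x_i^{\top}\beta
         - \log \sum_{j \in R(t_i)} \exp\!\left( x_j^{\top}\beta \right) \right]

% LASSO-penalized estimate: the L1 penalty shrinks some coefficients exactly to zero
\hat{\beta}_{\lambda} = \arg\max_{\beta}
  \left\{ \ell(\beta) - \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert \right\}
```

Under this formulation, covariates whose estimated coefficients are exactly zero are dropped, and the remaining covariates can then be entered into an unpenalized Cox model. Whether the study followed exactly this two-step procedure is not stated in this section.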