Modeling The Factors Associated with Mortality in Patients with Breast Cancer: A Machine Learning Approach

Background: Breast cancer (BC) was the fifth leading cause of death worldwide in 2015 and the second leading cause of death in Iran in 2012. This study aimed to model the factors associated with mortality in patients with BC using a machine learning approach. Methods: We used data on patients with primary BC during 2007-2016 in Tabriz, Iran. The data were analyzed using decision tree (DT), boosted tree (BT), random forest (RF), k-nearest neighbors (KNN) and generalized additive model (GAM) with the inverse probability of censoring weighting (IPCW) technique to assess the risk factors of mortality. The models were compared using diagnostic accuracy measures. Results: Accuracy of the models ranged from 76.0 to 93.0%, with sensitivity of 82.5-98.8% and specificity of 72.2-99.4%. The GAM fit the data best, with accuracy of 93.0% (95% CI: [90.5, 95.0]), sensitivity of 98.8% (95% CI: [96.9, 99.7]) and specificity of 84.3% (95% CI: [78.8, 88.9]); the non-linear effects of age (p-value = 0.006), grade (p-value = 0.024) and time to event (p-value < 0.001) on mortality were significant. Conclusion: The GAM seems to be an optimal model for classifying mortality in patients with BC. Considering time to event, age and grade, the prognostic factors identified by the GAM, more accurate prevention planning may be designed.


Background
Breast cancer (BC) is a prevalent cancer among women and a common cause of mortality from cancer. 1 The prevalence of BC is highest in the USA and Western Europe and lowest in East Asia. 1 Although the prevalence and mortality rates of BC are decreasing in many countries, the global prevalence is increasing, with an annual increase of 12% in age-standardized incidence rates. [1][2][3][4] The mortality and prevalence of BC are also increasing in Iran, including in East Azerbaijan province. Although the BC incidence rate in East Azerbaijan is lower than in Western Europe and the USA, the rise in BC incidence rates needs further study. 5,6 BC incidence and survival are affected by factors such as environment, hormones, nutrition, heredity, number of pregnancies, tumor size, morphology, age, and tumor grade and stage. [7][8][9][10] Identification of the main risk factors and high-risk groups could support proper planning to reduce BC incidence and mortality.
Various analyses are used to investigate the relationships between the survival of patients with BC and its risk factors, such as the log-rank test, the Kaplan-Meier approach, the Cox regression model, and parametric survival models. 7-10 These techniques require certain assumptions (such as proportional hazards and linear relationships) to be satisfied, which may not hold in many practical situations. Given the importance of more accurate risk prediction, different machine learning techniques have been developed to predict with high precision and accuracy. 11 In addition, some approaches can be combined with these techniques to provide more accurate predictions. Inverse probability of censoring weighting (IPCW) is an approach that reduces the role of censored data and the bias of risk estimation by giving less weight to censored participants. 11 Many machine learning techniques are used for prediction and classification, [12][13][14][15][16][17] but it is necessary to determine which technique has higher precision and accuracy. Based on an extensive search of the literature, no study has assessed the machine learning approach for classifying mortality in patients with BC. In this study, we used several machine learning methods with the IPCW approach, including decision tree (DT), boosted tree (BT), random forest (RF), k-nearest neighbors (KNN) and generalized additive model (GAM), to best model the relationships of mortality with common prognostic factors in patients with BC.
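As an illustration of how IPCW can be computed, the following Python sketch assigns a weight to each participant at a fixed time horizon. It is a minimal sketch under stated assumptions, not the procedure used in this study: the censoring distribution is estimated with a Kaplan-Meier estimator, participants censored before the horizon receive weight zero, and the remaining participants are weighted by the inverse probability of being uncensored. The function names, the horizon `tau`, and the toy follow-up times are hypothetical.

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t) = P(censoring time > t).

    times  : observed follow-up times
    events : 1 if death was observed, 0 if the participant was censored
    Returns a list of (time, G just after that time) steps.
    Assumption: at tied times, deaths are taken to precede censoring.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    g = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_tied = 0
        n_censored = 0
        while i < len(data) and data[i][0] == t:
            n_tied += 1
            n_censored += 1 - data[i][1]
            i += 1
        if n_censored > 0:
            g *= 1.0 - n_censored / n_at_risk
            steps.append((t, g))
        n_at_risk -= n_tied
    return steps


def g_before(steps, t):
    """Left limit G(t-): uses only censoring times strictly before t."""
    g = 1.0
    for step_time, value in steps:
        if step_time < t:
            g = value
        else:
            break
    return g


def ipcw_weights(times, events, tau):
    """Weight for each participant when classifying death status at tau."""
    steps = censoring_survival(times, events)
    weights = []
    for t, d in zip(times, events):
        if d == 0 and t < tau:
            # Censored before tau: status at tau is unknown, so weight 0.
            weights.append(0.0)
        else:
            # Death observed, or still followed at tau: inverse probability
            # of remaining uncensored up to min(t, tau).
            weights.append(1.0 / g_before(steps, min(t, tau)))
    return weights


# Hypothetical toy data (years of follow-up), not taken from the study.
times = [2, 3, 4, 5, 10, 12]
events = [1, 0, 1, 0, 0, 1]
weights = ipcw_weights(times, events, tau=9)
print(weights)
```

In this toy example the two participants censored before the horizon contribute nothing, while the others are up-weighted so that the weights still sum to the sample size. The resulting weights could then be passed as case weights to a classifier that supports them.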

Statistical analyses
Data were expressed as mean (SD) and frequency (percentage) for numeric and categorical variables, respectively. Mortality from BC, the primary outcome of this study, was coded as 1 if the patient had died and 0 otherwise.
Calculating the IPCWs requires the value of τ to be defined. 11 The IPCWs were computed for censored event time at τ=9 years. The dataset was split randomly into training and validation datasets, and five machine learning techniques, DT (using the CART algorithm), BT, RF, KNN and GAM, were fitted to them. These techniques were used for classifying the patients with BC using sex, age, grade, morphology and survival time. The fitted models were compared using diagnostic criteria comprising accuracy, sensitivity, specificity, ROC area, likelihood ratio +, likelihood ratio -, odds ratio, positive predictive value, and negative predictive value. All analyses were performed using STATISTICA 12 (Statistica, StatSoft, Texas, USA).
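The diagnostic criteria listed above can all be derived from a 2×2 confusion matrix, except the ROC area, which requires predicted scores. The following Python sketch shows the standard formulas; the function name and the example counts are hypothetical and are not results from this study.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Diagnostic accuracy measures from a 2x2 confusion matrix.

    tp, fp, fn, tn : counts of true positives, false positives,
                     false negatives and true negatives.
    Note: a zero cell can make some ratios undefined (division by zero);
    this sketch does not handle that case.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "likelihood_ratio_plus": sensitivity / (1 - specificity),
        "likelihood_ratio_minus": (1 - sensitivity) / specificity,
        "odds_ratio": (tp * tn) / (fp * fn),
    }


# Hypothetical counts for illustration only.
measures = diagnostic_measures(tp=90, fp=20, fn=10, tn=80)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

When IPCW is used, the counts could be replaced by sums of participant weights in each cell; that extension is not shown here.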

The mean age of participants was 50.4 (SD 12.5) years and 1132 individuals were female (98.1%). The details of the demographic and background characteristics of the study participants are provided elsewhere. 8 Table 1 shows diagnostic statistics for comparing DT, BT, RF, KNN and GAM. The results indicate that most of the diagnostic statistics are similar across the investigated models; however, GAM has the highest sensitivity, accuracy and negative predictive value.
The results of DT showed that the most important predictor of mortality from BC was the survival time. The other independent variables had very low importance (Figure 1).
The results of BT indicated that the survival time was the most important predictor of mortality from BC. However, the other independent variables had low importance (Figure 2).
The results of RF showed that the survival time had high importance for mortality from BC, whereas the other independent variables had low importance (Figure 3). Figure 4 shows cross-validation accuracy against the number of nearest neighbors in the KNN method. According to this figure, k=34 was chosen as the optimal number of nearest neighbors for predicting mortality status in patients with BC.
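The selection of k by cross-validation can be sketched as follows. This is a minimal illustration in Python, assuming Euclidean distance, majority voting and 5-fold cross-validation, none of which is specified in this study; the function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Squared Euclidean distance gives the same neighbor ordering as Euclidean.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_folds=5, seed=0):
    """Cross-validated accuracy of KNN for a given k."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct += sum(
            knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in fold
        )
    return correct / len(xs)


def choose_k(xs, ys, candidates, n_folds=5):
    """Return the candidate k with the highest cross-validated accuracy."""
    scores = {k: cv_accuracy(xs, ys, k, n_folds) for k in candidates}
    # Ties are broken in favor of the smaller k (an arbitrary choice here).
    best = max(scores, key=lambda k: (scores[k], -k))
    return best, scores


# Hypothetical toy data: two well-separated groups on one axis.
xs = [(i * 0.1, 0.0) for i in range(10)] + [(5 + i * 0.1, 0.0) for i in range(10)]
ys = [0] * 10 + [1] * 10
best_k, scores = choose_k(xs, ys, candidates=[1, 3, 5])
print(best_k, scores)
```

Plotting `scores` against the candidate values of k would give a curve of the kind described for Figure 4.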
The results of GAM are presented in

Discussion
In this study, several machine learning techniques were used and compared to classify patients with BC. Five machine learning methods, DT, BT, RF, KNN and GAM, were applied to predict mortality status.
The IPCW approach was used to reduce the role of censored data in the classification of mortality in patients with BC. The IPCW is an approach that reduces the bias of risk estimation by giving less weight to censored participants. This approach can be used with many machine learning methods. 11 Interestingly, in our data, the GAM using the IPCW approach had the best performance, with the highest accuracy, sensitivity and negative predictive value among the investigated machine learning methods. This model can therefore be used to achieve accurate classifications. In line with our finding, GAM has outperformed other machine learning techniques in various fields such as classification of functional setting, detection of Diplodia sapinea (which inflicts severe damage upon pine trees) and forecasting of postoperative complications. 19 Using the GAM, time to event, age and grade had significant nonlinear effects on the survival of patients with BC, such that patients with shorter time to event, higher age and higher grade had higher mortality.
Similarly, previous studies have revealed that the survival of patients with BC is directly related to time and inversely related to age and grade. 8,9,12,24−28

Limitations And Strengths
Clearly, this study has some limitations. The GAM does not provide statistical significance for each of the predictive variables; however, examination of nonlinear relationships compensates for this problem. Some important clinical information, such as tumor size, progesterone receptor, estrogen receptor, and tumor-node-metastasis stage, was not available to consider in the analyses. Furthermore, in Iran, the National Cancer Registry of Iran and EA-PBCR have not yet been expanded. To avoid missing data, all patients diagnosed with primary BC were registered from all over the province, and data were collected via a combined active and passive follow-up protocol in the EA-PBCR. However, as an advantage, non-linear relationships between the predictive variables and the response variable are investigated in GAM.

Declarations
Ethics approval and consent to participate
The study protocol was approved by the institutional review board of Tabriz University of Medical Sciences (IRB no.: IR.TBZMED.REC.1397.986).

Availability of data and materials
The data that support the findings of this study are available from MAJ, but restrictions apply to the availability of these data, which were used under license for the current study and are not publicly available. Data are, however, available from the authors upon reasonable request by MAJ.

Consent for publication
All authors have given consent for publication.

Conflict of interest statement
This research has been conducted in the absence of any potential conflict of interest, in other words in the absence of any commercial or financial relationships.

Funding
No funding to declare.