Machine Learning Model to Identify Prognostic Factors in Glioblastoma: A SEER-Based Analysis


The aim of this study is to create a competing risk model to identify prognostic factors in glioblastoma (GB). The study included 31663 patients diagnosed with GB between 2007 and 2018. The data were taken from the Surveillance, Epidemiology, and End Results (SEER) database. Overall survival (OS), age, race, gender, primary site, laterality, surgery and tumor size at the time of diagnosis, vital status, and follow-up time (months) were selected for the analyses. The median OS of the patients was 9.00±0.09 months. All examined variables except gender were statistically significant risk factors for survival. Surgery, age, laterality, primary site, tumor size, race, and gender were used as independent variables, and vital status was used as the dependent variable, for the machine learning (ML) analysis. Among the ML methods, the hybrid model gave the best results according to the accuracy, F-measure, and Matthews correlation coefficient (MCC) performance criteria. According to the hybrid model, the alive/dead classification can be interpreted as correct in approximately 85 and 74 out of 100 patients for 1- and 2-year survival, respectively. Recognition of the fundamental ideas will allow neurosurgeons to understand Big Data (BD) and help evaluate the extraordinary amount of data within the associated healthcare field.


Introduction
Science and industry produce an extraordinary amount of data in our age. Traditional statistical approaches are not sufficient for the analysis and interpretation of Big Data (BD). Machine learning (ML) and artificial intelligence methods have become essential for making sense of these data [17]. BD analysis supports the storage, classification, and analysis of patient information in the healthcare field and improves disease identification, treatment evaluation, surgical planning, and outcome prediction [50].
Hidden patterns in large datasets can be revealed by BD analysis [15].
In adults, the most common primary malignant brain tumor is glioblastoma (GB) [33]. Surgical resection followed by adjuvant external beam radiation therapy with concurrent and adjuvant temozolomide is the standard management of newly diagnosed high-grade gliomas (HGGs) [43,44]. The median survival of patients treated with this protocol was 14.6 months [44], and 5-year survival is 5% despite aggressive therapies [19,31,47]. Age, preoperative performance status, and tumor size have been confirmed as independent prognostic factors for progression-free survival (PFS) and overall survival (OS) [10]. MGMT promoter methylation was added to these factors in a recent systematic review [14].
This study extracted 31663 patients with histologically confirmed GB from the Surveillance, Epidemiology, and End Results (SEER) database. It aims to create a competing risk model to identify prognostic factors in GB.

Study Design
The study included 31663 patients diagnosed with GB between January 2007 and December 2018, and all patient data were analyzed. The data were taken from the SEER database. These data, published by the National Cancer Institute, are a compilation of the databases of 18 SEER cancer registries in the USA. The SEER program summarizes data from patients' medical records. It is estimated that more than 95% of all cancer cases in areas under surveillance are detected and included in this database [45]. The duration of follow-up is calculated in months from the date of diagnosis to whichever of the following occurs first: 1) date of death, 2) date last known to be alive, or 3) December 2018 (the follow-up cutoff date used in our analysis). Since all patient data were obtained with the permission of SEER and contain no personal patient information, ethics committee approval was not required for this research.
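The follow-up rule described above can be illustrated with a minimal Python sketch. The function names, the whole-month counting convention, and the use of full calendar dates are assumptions made for illustration only; they are not taken from the SEER documentation or from the analysis pipeline used in this study.

```python
from datetime import date
from typing import Optional, Tuple

# Follow-up cutoff stated in the text (December 2018); the exact day is an assumption.
CUTOFF = date(2018, 12, 31)

def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end (assumed convention)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not yet complete
    return max(months, 0)

def follow_up(diagnosis: date,
              death: Optional[date],
              last_alive: Optional[date],
              cutoff: date = CUTOFF) -> Tuple[int, bool]:
    """Return (follow-up in months, death_observed).

    Follow-up ends at whichever comes first: death, date last known
    alive, or the study cutoff. Only a death on or before the other
    dates counts as an observed event; otherwise the case is censored.
    """
    candidates = []
    if death is not None:
        candidates.append((death, True))   # listed first so ties count as events
    if last_alive is not None:
        candidates.append((last_alive, False))
    candidates.append((cutoff, False))
    end, event = min(candidates, key=lambda c: c[0])
    return months_between(diagnosis, end), event
```

For example, a hypothetical patient diagnosed on 15 March 2017 who died on 20 January 2018 would contribute 10 months of follow-up with an observed event, whereas a patient whose death occurred after the cutoff would be censored at December 2018.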
The primary outcome of the study was OS in years (with censored observations), defined from the date of diagnosis to the date of death or, for living patients, to the last follow-up date. In addition to survival, the other variables selected for the analyses were age, race, gender, primary site, laterality (unilateral/bilateral), surgery and tumor size at the time of diagnosis, vital status, and follow-up time (months). Surgical methods, radiotherapy, and chemotherapy techniques were not included in the study because of missing data.
In this study, in addition to the classical ML methods, we created a hybrid model that combines existing methods. Such hybrid models have been increasingly preferred in recent years because they combine ML methods and draw on the strongest aspects of each. For the 2-year survival prediction model, we used J48, Multilayer Perceptron, and Naïve Bayes to create the hybrid model.

Statistical Analysis
SPSS 11.5 and Weka 3.7 were used to analyze the data. Mean±standard deviation and median (minimum-maximum) were used as descriptors for quantitative variables, and the number of patients (percentage) for qualitative variables. Survival analyses on qualitative variables were performed using the Kaplan-Meier method, and significant differences between groups were determined using the [...]
General descriptors of the variables in the data set are given in Table 1. According to these descriptors, 1.1% of the patients were 19 years old or younger, 7.0% were in the 20-44 age range, 42.3% were in the 45-64 age range, and 49.6% were 65 years old or older. While 88.8% of the patients were White, 5.8% were Black, and 5.3% were of other races. The male-to-female ratio was 58.4% / 41.6%. Table 1 also shows the primary site, laterality, and surgery information of the patients, together with the grouped tumor sizes, vital status, and follow-up periods.
When survival is evaluated by primary site, the lowest median survival time is found in the group classified as ventricle, cerebellum, and overlapping brain lesion, followed by the brain stem, parietal, frontal, occipital, and temporal lobes, respectively. Survival statistics for laterality, tumor size, and surgery are also given in Table 2.
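To make the survival estimates concrete, the Kaplan-Meier method named above can be sketched in a few lines of Python. This is an illustrative implementation of the standard product-limit estimator, not the SPSS procedure used in the study; the function names and the toy data in the usage note are assumptions.

```python
from typing import List, Optional, Tuple

def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Product-limit estimate: return [(t, S(t))] at each distinct death time.

    times  : follow-up time of each patient (e.g. in months)
    events : True if death was observed, False if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored cases both leave the risk set
    return curve

def median_survival(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached
```

With the hypothetical follow-up times `[2, 3, 3, 5, 8, 9, 12]` and event flags `[True, True, False, True, True, False, True]`, the estimated survival first falls below 0.5 at 8 months, so `median_survival` returns 8.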
The Gain Ratio Attribute Evaluation and Information Gain Attribute Evaluation attribute selection methods in WEKA were used. With these methods, the importance of the variables and their contribution to the data set were examined for the last 2 years (2017-2018). A total of 8 variables (7 independent variables and 1 dependent variable) from the data set were used: surgery, age, laterality, primary site, tumor size, race, gender, and vital status. The percentages of variable importance with respect to the dependent variable, vital status, are given in Figure 1A. For the 1-year data set, the same 8 variables (7 independent variables and 1 dependent variable) were used: surgery, age, laterality, primary site, tumor size, race, gender, and vital status. The percentages of variable importance with respect to the dependent variable, vital status, are given in Figure 1B.
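As a rough illustration of what these two attribute evaluators measure, the sketch below computes information gain and gain ratio for a categorical attribute against a class label. It follows the usual textbook definitions (information gain as the reduction in class entropy, gain ratio as that reduction divided by the entropy of the attribute itself); the function names and toy values are assumptions, and the sketch does not reproduce WEKA's handling of numeric or missing values.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence

def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(attribute: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """H(class) - H(class | attribute)."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def gain_ratio(attribute: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain normalized by the entropy of the attribute."""
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0  # attribute takes a single value and carries no information
    return information_gain(attribute, labels) / split_info

# Hypothetical toy data: vital status is fully determined by surgery, not by gender.
status = ["alive", "alive", "dead", "dead"]
surgery = ["yes", "yes", "no", "no"]
gender = ["m", "f", "m", "f"]
```

In this toy example `information_gain(surgery, status)` is 1 bit and `information_gain(gender, status)` is 0, which mirrors how the evaluators rank an informative variable above an uninformative one.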
The performance criteria of the ML methods for the 2-year survival prediction model are given in Table 3. The hybrid model gave the best results according to the accuracy, F-measure, and MCC performance criteria, which are the most widely accepted criteria in the literature. Considering these three criteria, the hybrid model is followed by J48, Naïve Bayes, Logistic Regression, Bagging, and Multilayer Perceptron, respectively. According to the hybrid model, the alive/dead classification can be interpreted as correct in 74 out of 100 patients. In other words, when the hybrid model classifies a patient as alive or dead, the accuracy of this classification is 74.1%.
The performance criteria of the ML methods for the 1-year survival prediction model are given in Table 4. The hybrid model again gave the best results according to the accuracy, F-measure, and MCC performance criteria. Considering these three criteria, the hybrid model is followed by J48, Naïve Bayes, Logistic Regression, Bagging, and Multilayer Perceptron, respectively. According to the hybrid model, the alive/dead classification can be interpreted as correct in 85 out of 100 patients. In other words, when the hybrid model classifies a patient as alive or dead, the accuracy of this classification is 84.9%.
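For reference, the three performance criteria can be computed from a binary confusion matrix as in the minimal sketch below. The function name and the example counts are hypothetical and are not taken from Tables 3 or 4; tools such as WEKA may additionally report class-weighted averages, which this single-class sketch does not do.

```python
from math import sqrt
from typing import Dict

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, F-measure, and Matthews correlation coefficient (MCC)
    from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}

# Hypothetical counts for 100 patients, with "dead" as the positive class.
example = binary_metrics(tp=60, fp=10, fn=5, tn=25)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one class (here, death within the prediction window) is much more frequent than the other.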

Discussion
Many studies [5, 6, 9, 11, 13, 22, 23, 26, 28, 30, 36, 40-42, 48, 54] have investigated prognosis and survival in GB using the SEER database. The main difference of our study is that it processes data created following the last two World Health Organization (WHO) classifications and creates a high-performance model that predicts 1- and 2-year survival using ML.
The overall median survival in our study was 9.00±0.09 months. This is short compared with the literature, mainly because 49.6% of the patients in our study were 65 years or older. Less than 20% of elderly GB patients survive up to 1 year, with median survival between 5 and 9 months [21,38]. Survival may differ according to race and ethnicity in patients diagnosed with GB [3]. The incidence of GB was higher in the White population than in others in our study, which is consistent with previous publications [29,32,34,44]. Survival in White patients was lower than in patients of other races, as in the analysis by Ostrom et al. [34]. Although some publications state that survival is higher in females [34,44,48], no significant relationship was found between gender and survival in our study.
There is no consensus on whether tumor location is a prognostic factor. In a recent study [12], survival of GBs in the central core (basal ganglia, corpus callosum) and the left temporal lobe pole was less than six months, whereas survival of GBs in the dorsomedial right temporal lobe was more than 24 months. In our study, survival was highest for temporal lobe tumors, but no comparison was made between the right and left hemispheres. The prognosis of ventricular [4,25,51], brainstem [24], and bilateral hemispheric [8] HGGs is poor, and the results of our study are similar. Although different authors report that the prognosis of cerebellar GBs is worse than, comparable to, or better than that of supratentorial ones [1,2,6,18,27], cerebellar GBs had significantly lower survival in our study.
Liu et al. [26] stated that a tumor size over 5.4 cm is an independent risk factor for GB-related death in patients over 65 years of age in the SEER database between 2007 and 2016. A larger FLAIR-T2 hyperintensity volume correlates with worse OS and PFS [35]. In our study, survival was shortest for tumors larger than 5 cm.
Despite the existence of different treatment modalities, the management of GBs remains a challenge [20].
Although there is no consensus in the literature on the limits of surgery [16,20], maximal surgical resection of abnormal tissue (including the FLAIR signal) optimizes patient survival when it is safe [52]. In our study, the survival of patients who underwent surgical resection was significantly higher.
Various survival prediction models created with ML methods have been published [7,37,39,46,49,53], and a recent systematic review reported that the accuracy of these studies was in the range of 0.66-0.98 [46]. The accuracy of our model in predicting 1- and 2-year survival was 0.849 and 0.741, respectively.
There are some limitations to this study. Data stored in online databases contain many subclassifications for each variable. Authors who process the data can combine or narrow these subsets to the extent they choose for the years they evaluate. For this reason, different results can be obtained from the same database. The groupings we created in our study are subject to the same limitation.

Conclusions
Age, race, gender, tumor site/laterality/size, and surgical resection are independent survival risk factors in the analysis performed on 31663 patients between 2007 and 2018 in the SEER database. The model created by ML was 84.9% and 74.1% successful in predicting 1- and 2-year survival in GB patients, respectively.
Recognition of the fundamental ideas will allow neurosurgeons to understand BD and help assimilate and evaluate the extraordinary amount of data within the associated healthcare field.