Deep-learning Model for Predicting the Survival of Rectal Adenocarcinoma Patients based on the SEER Database


Background: We collected information on patients with rectal adenocarcinoma in the United States from the Surveillance, Epidemiology, and End Results (SEER) database. We used this information to establish a model that combined deep learning with a multilayer neural network (the DeepSurv model) for predicting the survival rate of patients with rectal adenocarcinoma.
Methods: We included patients with rectal adenocarcinoma in the United States who were older than 20 years and had been added to the SEER database from 2004 to 2015. We divided these patients into training and test cohorts at a ratio of 7:3. The training cohort was used to develop a seven-layer neural network based on the analysis method established by Katzman and colleagues to construct a DeepSurv prediction model. We then used the C-index and calibration plots to evaluate the prediction performance of the DeepSurv model.
Results: The 49,275 patients with rectal adenocarcinoma included in the study were randomly divided into the training cohort (70%, n=34,492) and the test cohort (30%, n=14,783). There were no statistically significant differences in clinical characteristics between the two cohorts (p>0.05). We applied Cox proportional-hazards regression to the data in the training cohort, which showed that age, sex, marital status, tumor grade, surgery status, and chemotherapy status were significant factors influencing survival (p<0.05). Constructing the DeepSurv model with the training cohort yielded a C-index of 0.824, while verifying the model with the test cohort yielded a C-index of 0.821. These values show that the predictions of the DeepSurv model for the test-cohort patients were highly consistent with those for the training-cohort patients.
Conclusion: The seven-layer neural-network DeepSurv prediction model that we have established can accurately predict the survival rate and survival time of rectal adenocarcinoma patients.

advanced metastatic rectal cancer. [7][8][9] Developments in surgical techniques and the combined use of radiotherapy and chemotherapy in recent years have greatly improved the treatments applied to patients with rectal cancer, but their mortality rate remains as high as 40%. [10,11] Current treatment decisions and prognoses of rectal cancer patients are mainly based on the AJCC TNM staging system. [8] Different patients in the same stage of rectal cancer who receive similar treatments can exhibit large differences in treatment effects and survival rates. [12] Some studies have found that certain prognostic factors such as age, sex, and race might crucially affect survival predictions in individual patients. [11][12][13][14] Previous studies have used multiple types of assessment models to assess the survival rate of cancer patients, including the AJCC TNM staging system, logistic regression analysis, and the Cox proportional-hazards model. [15][16][17][18] The AJCC TNM staging system is currently the most commonly used tumor staging system worldwide, [19] and it classifies cancer patients based on tumor and lymph node metastasis when evaluating and predicting their survival rate. [20] However, this method has disadvantages of a short evaluation time and data loss. [21] Logistic regression analysis identifies risk factors that affect different outcomes. [22] However, this method has the disadvantage of losing temporal information that affects the ending event, which reduces its prediction ability. [23] The Cox proportional-hazards model includes survival outcomes and survival time as dependent variables. This model can be used to simultaneously analyze the impact of multiple factors on survival time, and it is widely used to predict outcome events without knowledge of the survival distribution of the analyzed data. [24,25] A nomogram is a widely used method for combining and quantifying various important clinical characteristics of patients when calculating the probabilities of outcome events occurring based on the Cox proportional-hazards model. [26] However, an assumption underlying the Cox proportional-hazards model is that each predictor variable has the same impact throughout the follow-up period, which ignores differences in the impact of predictor variables on individual patients at different times. [24] Therefore, a new method is needed that has a higher accuracy in predicting the survival rate of cancer patients.
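The proportional-hazards assumption mentioned above can be made explicit with the standard form of the Cox model. The notation here is ours rather than the source's: $h_0(t)$ is an unspecified baseline hazard, $x$ is a patient's covariate vector, and $\beta$ is the vector of regression coefficients.

```latex
\begin{align}
  h(t \mid x) &= h_0(t)\,\exp\!\left(\beta^{\top} x\right), \\
  \frac{h(t \mid x_a)}{h(t \mid x_b)} &= \exp\!\left(\beta^{\top}(x_a - x_b)\right).
\end{align}
```

The second line does not depend on $t$: the hazard ratio between any two patients $a$ and $b$ is constant over follow-up, and the covariates enter only through the linear combination $\beta^{\top} x$.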
Developments in computer and information technology over recent years have made it possible to improve the accuracy of predictions of the survival rate of cancer patients. [27] Deep learning is a new research direction in the field of machine learning that involves discovering the distributed characteristics of sample data by learning the underlying laws and representation levels. [28] Deep learning is essentially a statistical model that includes an input layer, hidden layer, and output layer, which can be used to solve multifactor and nonlinear problems. The continuous developments in deep-learning research methods and the availability of biomedical big data have led to machine learning being used to predict the clinical outcomes of patients. [29] Liu et al. reported that an artificial neural-network model can be applied to clinical information to predict the survival rate of patients with nasopharyngeal cancer. [30] Katzman et al. combined deep learning with a multilayer neural network (the DeepSurv model) to develop a system for personalized treatment recommendations. [31] The present study collected data on patients with rectal adenocarcinoma in the United States from the Surveillance, Epidemiology, and End Results (SEER) database and applied the DeepSurv model to investigate their survival rates.

Data source
All of the patients with rectal adenocarcinoma included in this study were selected from the SEER "18 Regs Custom Data Nov 2017 Sub (1973-2015 varying)" data set with additional treatment fields (http://seer.cancer.gov). The SEER database contains data on cancer patients from 18 regions of the United States, and covers around 28% of the country's total population. [32] This database contains a considerable amount of relevant information on patients, including demographic data, tumor data, and information on causes of death and survival times. We used SEER*Stat software (version 8.3.6) to identify patients in the data set who had rectal adenocarcinoma in the United States from 2004 to 2015. We obtained permission to access the database by signing the SEER Research Data Agreement form and submitting it via email.

Inclusion and exclusion criteria for the study population
We identified patients with rectal adenocarcinoma using primary site code C20.9 of the International Classification of Diseases for Oncology, third edition (ICD-O-3), along with morphology codes 8140, 8210-8221, 8261-8263, 8480, and 8490. The inclusion criteria for the study population were being diagnosed during 2004-2015 and being aged > 20 years, while the exclusion criteria were the first tumor not being rectal adenocarcinoma and unknown tumor grade, survival time, race, marital status, or surgery status. We screened 49,275 patients with rectal adenocarcinoma and collected the following information from the SEER database: sex, age, marital status, race, tumor grade, AJCC TNM stage, tumor size, tumor location, degree of tumor invasion, surgery status, radiotherapy status, chemotherapy status, survival time, and cause of death. We divided the collected rectal adenocarcinoma patients into the following four groups based on ICD-O-3 morphology codes: papillary adenocarcinoma (code 8140), tubular adenocarcinoma (codes 8210-8221 and 8261-8263), mucinous adenocarcinoma (code 8480), and signet-ring-cell carcinoma (code 8490). We recoded marital status into married and unmarried, where the latter status included single, unmarried, widowed, separated, and divorced. We subsequently randomly divided the patients into training and test cohorts at a ratio of 7:3. Figure 1 shows the screening procedure applied to identify patients with rectal adenocarcinoma.
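The random 7:3 division described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study (which reports using sklearn); the function name `split_cohorts`, the seed, and the use of integer identifiers in place of patient records are all assumptions made for the example.

```python
import random


def split_cohorts(patient_ids, train_parts=7, total_parts=10, seed=0):
    """Randomly split identifiers into training and test cohorts (7:3 by default)."""
    shuffled = list(patient_ids)
    # A seeded generator makes the (hypothetical) split reproducible.
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic: the training cohort gets the floor of 70% of the records.
    n_train = len(shuffled) * train_parts // total_parts
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 49,275 screened patients.
train_ids, test_ids = split_cohorts(range(49275))
print(len(train_ids), len(test_ids))  # 34492 14783
```

With 49,275 records, taking the floor of 70% reproduces the cohort sizes reported in this study (34,492 and 14,783).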

Design and analysis of deep-learning models
DeepSurv is a deep feedforward neural network that can be used to predict the effects of patient covariates on patient survival. The structure of this network includes large numbers of simulated neurons that are divided into three main types of layer: input, hidden, and output layers. There can only be one input layer and one output layer, while there can be multiple hidden layers (Fig. 2). We performed deep-learning calculations based on the DeepSurv calculation method described by Katzman et al. [31] to predict the survival outcome of patients with rectal adenocarcinoma. The training-cohort data were used to develop a DeepSurv model of a seven-layer neural network. We then applied the DeepSurv model to the test-cohort data to evaluate the effectiveness of the model and predict the survival rate of patients with rectal adenocarcinoma. Finally, we used Harrell's C-index and calibration plots to evaluate the prediction performance in the training and test cohorts.
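To make the evaluation metric concrete, the following is a minimal sketch of how Harrell's C-index can be computed from survival times, event indicators, and predicted risk scores. It is an illustrative pure-Python implementation rather than the routine used in this study; the function name and the toy data are hypothetical, and pairs with tied survival times are simply skipped.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a patient with an observed event
        for j in range(n):
            if times[i] < times[j]:  # patient i died first; j survived longer
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher risk for the earlier death
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: survival time in months, event flag (1 = died, 0 = censored), risk score.
times = [5, 10, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
c_index = harrell_c_index(times, events, risks)
print(c_index)
```

In this toy example four of the five comparable pairs are ordered correctly. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the C-index values reported in this study should be read.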

Statistical analysis
Python software (version 3.7.6) was used to perform all computations and analyses in this study. We first used the Pandas library to perform a basic statistical analysis of the data. Kaplan-Meier analysis and log-rank testing were then performed using the Python lifelines survival analysis module. Meanwhile, sklearn was used to randomize the data and to standardize the variables to zero mean and unit variance. k-fold cross-validation (k = 10) was used in the model training process to ensure its accuracy. We finally used Python combined with the deep-learning framework Theano to complete the simulations. All tests were two-sided, and the significance criterion was set to p < 0.05.
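For readers unfamiliar with Kaplan-Meier analysis, the following dependency-free sketch shows what the product-limit estimator computes. It is not the lifelines implementation used in this study; the function name and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []
    idx = 0
    while idx < len(order):
        t = times[order[idx]]
        deaths = 0
        leaving = 0
        # Group all patients who share this follow-up time.
        while idx < len(order) and times[order[idx]] == t:
            if events[order[idx]]:
                deaths += 1
            leaving += 1
            idx += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve


# Toy data: follow-up in months and event flag (1 = died, 0 = censored).
km_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for month, prob in km_curve:
    print(month, round(prob, 3))
```

Censored patients contribute to the number at risk until their last follow-up but never reduce the survival estimate themselves, which is why the curve only steps down at observed deaths.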

Results
Baseline characteristics of the patients
The 49,275 included patients with rectal adenocarcinoma comprised 29,504 male patients (59.9%) and 19,771 female patients (40.1%). The basic clinical characteristics in the two study cohorts are listed in Table 1, which indicates that none of the clinical characteristics differed significantly between the cohorts (p>0.05). The patients were aged 62.6±13.5 years (mean±SD), and most of them were white (81.3%), had grade II tumors (76.2%), and had papillary adenocarcinoma (74.2%). The maximum follow-up time for patients was 143 months, with a mean of 47 months. During the study period from 2004 to 2015, 14,078 (28.5%) patients died of rectal adenocarcinoma.

Table 1 Analysis of the main characteristics of patients with rectal adenocarcinoma.

Variables
Overall N (%)

Table 2). The C-index for the Cox proportional-hazards regression model was 0.788. We produced calibration plots of the Cox proportional-hazards model for the 3-, 5-, and 10-year survival of rectal adenocarcinoma patients in the training cohort, which revealed some discrepancies between the predictions of the Cox proportional-hazards regression model and the actual events (Figure 3). The C-index obtained when using the training-cohort data to construct the DeepSurv model was 0.824. The graph of the training-cohort C-index and loss function is shown in Figure 4. The calibration plot of the DeepSurv model for the survival of training-cohort patients at 3, 5, and 10 years also revealed discrepancies between the predictions of the DeepSurv model and the actual events (Figure 5). However, the predictions of the DeepSurv model were better than those based on the Cox proportional-hazards regression model.

Calibration and verification of the DeepSurv model in the test cohort
Applying the variables selected by the Cox proportional-hazards regression model of the training cohort to the test cohort with the DeepSurv model showed that the latter had a good predictive effect, with a C-index of 0.821. The calibration curves for the survival of patients in the test cohort at 3, 5, and 10 years are presented in Figure 6, which shows that the predictions of the DeepSurv model for the test-cohort patients are highly consistent with the prediction results for the training-cohort patients.
Comparison between the DeepSurv model and the AJCC TNM staging system
The AJCC TNM stages were dichotomized into stages I-III and stage IV based on the presence of distant metastasis, which corresponded to no distant metastasis and distant metastasis, respectively. Figure 7 shows that the survival rate was significantly higher for patients at stages I-III than for those at stage IV. That figure also shows that the DeepSurv model predicted that the survival risk was lower than for patients classified as AJCC TNM stages I-III, and higher than for those classified as AJCC TNM stage IV. Moreover, the survival curve was smoother for the DeepSurv model than for the AJCC TNM staging system. The area under the receiver operating characteristic (ROC) curve (AUC) was larger for the DeepSurv model than for the AJCC TNM staging system, and the ROC curve for the DeepSurv model was located above and to the left of that for the AJCC TNM staging system. The results showed that the DeepSurv model was more accurate in predicting the survival prognosis of rectal adenocarcinoma patients compared with the AJCC TNM staging system.
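The AUC comparison above can be illustrated with a small sketch. The source does not state how the ROC curves were computed, so this example makes simplifying assumptions: the outcome is treated as a binary label (for instance, death by a fixed time point), censoring is ignored, and the AUC is computed as the probability that a patient with the event receives a higher score than a patient without it. The function name and all data are hypothetical toy values, not study data.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a positive case outscores a negative one (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy outcome labels (1 = died by the chosen time point).
died = [1, 1, 0, 0, 1, 0]
# Hypothetical continuous risk scores, as a neural network might produce.
continuous_risk = [0.9, 0.7, 0.3, 0.6, 0.5, 0.2]
# Hypothetical dichotomized staging (1 = stage IV, 0 = stages I-III).
stage_iv = [1, 0, 0, 0, 1, 0]

auc_continuous = roc_auc(died, continuous_risk)
auc_staging = roc_auc(died, stage_iv)
print(round(auc_continuous, 3), round(auc_staging, 3))
```

A dichotomized predictor produces many tied scores, each of which contributes only half a win, which is one reason a continuous risk score can achieve a larger AUC than a two-level staging variable.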

Discussion
Rectal adenocarcinoma is a malignant tumor that is common in developed countries, including those in North America and Europe. [3,4] Tumor metastasis is reportedly present in more than 50% of newly diagnosed patients, which is due to the atypical clinical symptoms of early-stage rectal adenocarcinoma. [7] Effective methods for the early detection and early treatment of rectal adenocarcinoma would therefore be of great significance for improving the prognosis of affected patients. Various risk factors affecting the prognosis of these patients have been reported in recent years, including age, sex, histological type, tumor stage, and tumor differentiation status. [33,34] With the aim of improving the accuracy of survival-time predictions for patients with rectal adenocarcinoma, various methods have been used to establish prediction models, including the AJCC TNM staging system, logistic regression analysis, and the Cox proportional-hazards model. [15][16][17][18] Each of these prediction models has certain advantages and disadvantages, and different models produce different predictions of patient survival. The Cox proportional-hazards model is currently one of the most widely used models for prognostic predictions, [26] but such models require each predictor variable to be a linear factor, and therefore ignore the impacts of any significant nonlinear factors on outcome variables. It is well known that the development of tumors and changes therein are affected by many factors, and so traditional linear models are highly unlikely to accurately predict the prognosis of cancer patients. This situation makes it necessary to develop new methods that can combine linear and nonlinear factors in the construction of prediction models.
The ongoing developments in computer and information technology can facilitate the construction of the required novel predictive models. For example, Katzman et al. implemented the DeepSurv analysis method by combining deep learning with a multilayer neural network. [31] The DeepSurv method uses a network structure comprising three types of layer: input, hidden, and output layers. [29] The input layer includes each linear or nonlinear predictor variable, the hidden layer has a multilayer structure for variable conversion, and the output layer is the converted target variable. The DeepSurv method uses deep-learning technology to convert multiple linear and nonlinear factors into a linear combination via multilevel fusion and transformation to predict outcome events. The DeepSurv approach is being gradually applied in various fields related to biomedical research. Multiple research results have shown that the predictions made using the DeepSurv model are better than those made using traditional linear prediction models. [35][36][37] She et al. used a DeepSurv model to provide non-small-cell lung-cancer-specific survival and prognosis predictions as well as treatment recommendations, and found that its prediction effect was significantly better than that of the traditional AJCC TNM staging system. [38] Biglarian et al. demonstrated that the DeepSurv model is superior to the Cox proportional-hazards model in predicting distant metastasis in patients with rectal cancer. [39] Rau et al. found that the predictions of a DeepSurv model for liver cancer were superior to those obtained using a logistic regression model. [40] This study constructed a DeepSurv model of the survival rate of rectal adenocarcinoma patients using data on affected patients living in the United States from the SEER database.
We first conducted a Cox proportional-hazards regression analysis of 34,492 patients with rectal adenocarcinoma in the training cohort to identify risk factors for their prognosis. These risk factors were age, race, sex, marital status, tumor grade, AJCC TNM stage, surgery status, chemotherapy status, tumor size, and degree of tumor invasion (p < 0.05) (Table 1). We then developed a seven-layer neural-network DeepSurv prediction model based on the analytical method established by Katzman et al. [31] The C-index when applying the new prediction model was 0.821 for the test cohort and 0.824 for the training cohort. These values show that the predictions of the DeepSurv model for the test-cohort patients are highly consistent with those for the training-cohort patients. The results obtained for the calibration curves of the patients in the test cohort at 3, 5, and 10 years further support this conclusion. The DeepSurv model was also found to provide more accurate predictions of the prognosis of patients with rectal adenocarcinoma compared with the Cox proportional-hazards model, which is consistent with the results of some previous studies of cancer prognoses. It has also been shown previously that the DeepSurv model provides powerful variable-processing capabilities. [35,41] Finally, we compared the DeepSurv prediction model with the AJCC TNM staging system, and found that the AUC was higher for the former (AUC = 0.800) than the latter (AUC = 0.755). Meanwhile, the survival curve was smoother for the DeepSurv model than for the AJCC TNM staging system. The superior results for the survival prognosis of patients with rectal adenocarcinoma obtained by applying the DeepSurv model are due to it transforming linear and nonlinear predictive variables into a linear combination by utilizing a multilevel neural network. [31] Deep learning can be used to solve nonlinear problems involving multiple factors, and so the DeepSurv model has particular advantages over other models when dealing with large samples, multiple variables, and nonlinearity.
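For reference, the training objective of DeepSurv as described by Katzman et al. [31] can be written as a regularized negative log partial likelihood. The notation below is supplied for illustration and is not taken from the present study: $\hat{h}_\theta(x)$ is the network output (the log-risk) for covariates $x$ with weights $\theta$, $E_i$ is the event indicator, $T_i$ is the observed time, $\mathcal{R}(T_i)$ is the set of patients still at risk at $T_i$, $N_{E=1}$ is the number of observed events, and $\lambda$ is an $\ell_2$ regularization strength whose value in this study is not reported.

```latex
\begin{equation}
  \ell(\theta) = -\frac{1}{N_{E=1}} \sum_{i \,:\, E_i = 1}
    \left( \hat{h}_\theta(x_i)
      - \log \sum_{j \in \mathcal{R}(T_i)} \exp\!\left(\hat{h}_\theta(x_j)\right) \right)
    + \lambda \,\lVert \theta \rVert_2^{2}
\end{equation}
```

If $\hat{h}_\theta(x)$ is restricted to a linear function of the covariates, the first term reduces to the usual Cox partial likelihood, so the network generalizes the Cox model by allowing a nonlinear log-risk function.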
The present study was subject to some limitations. First, some information that might affect survival was missing for the patients with rectal adenocarcinoma collected from the SEER database, such as whether tumors were surgically removed, the type of chemotherapy applied, medications, the psychological status, religious beliefs, and education of the patients, and their familial tumor history. Second, our study only included data for patients with rectal adenocarcinoma living in certain parts of the United States, and the established DeepSurv prediction model was not validated using external data. The accuracy of the DeepSurv approach could be further assessed using patients with rectal adenocarcinoma living in other countries. Third, the DeepSurv model has its own inherent limitations during the construction process. The existence of hidden layers in the black-box model means that we cannot exactly understand the calculations performed during the model construction process, or the associated limitations. Future studies are needed to resolve the above-mentioned problems.

Conclusions
This study used Cox proportional-hazards regression analysis to identify the risk factors affecting the prognosis of rectal adenocarcinoma patients, which include age, sex, tumor grade, tumor size, degree of tumor invasion, surgery status, and chemotherapy status. We constructed a seven-layer neural-network DeepSurv prediction model that has been demonstrated to provide good predictions of the prognosis of patients with rectal adenocarcinoma. This novel DeepSurv model can be used to accurately predict the survival time of patients with rectal adenocarcinoma.

Declarations
Ethics approval and consent to participate: The data for this study come from the SEER database. The SEER database is a tumor-related database developed by the National Cancer Institute of the United States that provides research data to researchers free of charge. All patients participating in the study received the ethical approval sought by the National Cancer Institute. Informed consent was obtained from all patients or, if patients were under 18, from a parent and/or legal guardian.

Consent for Publication:
Consent for publication was obtained from all participants.
Availability of data and materials:

Figure 1 The flow diagram of the selection of patients with rectal adenocarcinoma.
Figure 4 The plots of the training-cohort C-index and loss function.
Figure 5 Calibration plots of the survival rate of the training cohort in the DeepSurv model.
Figure 6 Calibration plots of the survival rate of the test cohort in the DeepSurv model.