A New Survival Prediction Model for Patients with Synchronous Colorectal Carcinomas Based on SEER

Introduction: A nomogram for postoperative prediction of overall survival (OS) in patients with synchronous colorectal carcinomas (SCC) was developed and validated using LASSO regression combined with Cox regression.
Methods: Data were obtained from the SEER database for patients diagnosed with colorectal cancer (CRC) more than once between 2004 and 2013. Cut-off points for continuous variables were identified with the K-adaptive partitioning algorithm and X-tile software. Using LASSO regression combined with Cox regression, a model for predicting the OS of patients with SCC was built, internally and externally validated, and assessed with calibration curves, C-index, AIC, BIC, IDI, NRI, timeROC, timeAUC, and decision curve analysis (DCA); the results were compared with those of a model developed by Cox regression alone.
Results: Patients with SCC were found to be older, more often men, and likely to have T3 depth of invasion. There were no significant differences between the model developed by LASSO regression combined with Cox regression and the model developed by Cox regression in the calibration curve, C-index, AIC, BIC, IDI, NRI, timeROC, or DCA, whereas the LASSO-based model performed better in the timeAUC. Moreover, the model developed by LASSO regression combined with Cox regression showed good calibration, C-index, AIC, BIC, IDI, NRI, timeROC, and timeAUC, and had a larger net benefit than both first-time TNM staging and the combination of first- and second-time TNM staging.
Discussion: This study indicates that older patients, male patients, and patients with T3 tumors should be followed up closely. LASSO regression combined with Cox regression reduces the number of variables in the model, avoids overfitting and collinearity, and has clinical significance.


Introduction
Colorectal cancer (CRC) is the third most common cancer and ranks third as a cause of cancer-related death for males in America [1,2]. Multiple primary colorectal carcinomas (MPCC) are defined as the presence of 2 or more primary invasive adenocarcinomas diagnosed in a patient. Synchronous colorectal carcinomas (SCC) are defined as a second invasive adenocarcinoma diagnosed within 6 months of the first [3]. Metachronous colorectal carcinomas (MCC) are defined as a second invasive adenocarcinoma diagnosed more than 6 months after the first [4]. Synchronous colorectal carcinomas account for 1-8% of patients with colorectal cancer [5].
Patients with familial adenomatous polyposis, hereditary nonpolyposis colorectal cancer (HNPCC), serrated polyps/hyperplastic polyposis and inflammatory bowel diseases (ulcerative colitis and Crohn's disease) have a higher risk of synchronous colorectal carcinomas. These predisposing factors account for slightly more than 10% of synchronous colorectal carcinomas [6]. Compared to solitary colorectal cancer, synchronous colorectal carcinomas are more common in the right colon and sigmoid colon. On pathological examination, synchronous colorectal carcinomas are usually found to be mucinous adenocarcinomas. Most patients with synchronous colorectal carcinoma have two cancers, but cases with up to six have been reported. Compared to patients with a solitary colorectal carcinoma, patients with synchronous colorectal carcinoma have a higher percentage of microsatellite instability (MSI). Synchronous colorectal carcinomas are also enriched with MSI-high (MSI-H) tumors, particularly those arising from sessile serrated adenomas (SSAs). Moreover, MSI contributes to improved overall survival in patients with synchronous tumors. Notably, SSA-associated SCC has a predilection for elderly women. SSA is also associated with a favorable prognosis and is more likely to be MSI-H and BRAF V600E positive [3]. Furthermore, survival does not differ appreciably between patients with synchronous tumors and those with a single neoplasm. In contrast, individuals with metachronous carcinomas have been observed to show a poor clinical outcome after the development of the second carcinoma [7]. Therefore, patients with colorectal cancer should undergo a complete endoscopic examination [8].
Currently, the literature consists mostly of small series (< 80 patients) that describe epidemiology and clinicopathology [3][4][5][6][7][9]. There are few reports on the factors that affect overall survival in synchronous colorectal carcinoma or on the formulation of prognostic models. In this study, we evaluated the factors that affect overall survival in synchronous colorectal carcinomas and built a prognostic model with a large cohort of patients.
Our aim was to develop and validate a nomogram based on treatment variables, surgical variables, clinical characteristics and tumor characteristics to predict the survival of patients with synchronous colorectal carcinomas. The data were obtained from the population-based Surveillance, Epidemiology, and End Results (SEER) database, which contains a large sample and has a long follow-up time. However, a prediction model that includes the first- and second-time treatment variables, surgical variables and tumor characteristics is prone to multicollinearity. Furthermore, incorporating too many variables may cause over-fitting of the prediction model. For these reasons, we selected the least absolute shrinkage and selection operator (LASSO) method to deal with the above concerns. In order to determine whether the prediction model fitted with LASSO combined with Cox was better, we compared the LASSO model (fit

Data Source
We identified the survivors from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) 13-registry database by analyzing patients diagnosed from 2004 to 2013. SEER is a publicly available, nationally representative, population-based cancer database that contains more than 8 million cancer cases, with data that spans 4 decades and covers 28% of the United States population. It is considered a valid source of cancer incidence and survival data in the United States.
In addition, SEER has developed and maintained high-quality, validated data on causes of death among cancer survivors, providing insight into relative and cause-specific deaths in this population [10,11]. Our study was determined to be exempt from institutional review at Colorectal Surgery Union Hospital in Fuzhou because publicly available de-identified data were used. Data were retrieved using SEER*Stat 8.3.5 (Surveillance Research Program, National Cancer Institute, Bethesda, MD).

Patients
Patients over 18 years old who were diagnosed with colorectal carcinoma between 2004 and 2013 and underwent surgery were initially analyzed. Patients diagnosed with colorectal carcinoma fewer than twice were excluded so that the analysis focused on synchronous colorectal carcinomas. Patients whose survival time was unknown were excluded so that the epidemiology, pathogenesis and factors influencing survival in synchronous colorectal carcinomas could be explored. Patients with an unknown tumor grade, unknown T stage, unknown N stage, or unknown M stage were excluded to allow comparison of the feasibility of the TNM model and TTNNMM model. Patients with unknown prognostic characteristics (including race, tumor size and location) were also excluded. The clinicopathologic variables were then collected from the SEER 13 database, including race, sex, delta t, months survived and, for the first and second diagnoses, age, marital status, tumor location, TNM staging [12], histologic grade, number of lymph nodes examined, number of positive lymph nodes, tumor size, radiation sequence, chemotherapy, and surgery-related variables. Then, patients who had colorectal cancer more than 3 times or multiple metachronous primary carcinomas were excluded. Lastly, we excluded patients who survived less than 1 month or had other unknown variables (Figure 1).
In the construction of the survival prediction model, the internal cohort included patients from the SEER database, while the external validation cohort consisted of patients from the Colorectal Surgery Union Hospital in Fuzhou database.

Statistical Analysis
Statistical analysis was carried out with R software (version 3.4.2; http://www.Rproject.org) and SPSS (Statistical Product and Service Solutions, version 22.0). The packages in R used in this study are as follows. All statistical tests were two-sided, with statistical significance set at 0.05.

The least absolute shrinkage and selection operator (LASSO) method, which is suitable for the regression of high-dimensional data [13,14], was used to select the most useful predictive variables from the primary data set. The "glmnet" package was used to perform the LASSO Cox regression analysis [15]. To compare the Cox method with the LASSO combined with Cox method, we constructed separate models with each. The COX model was constructed in the internal cohort by backward Cox analysis using Akaike's information criterion (AIC) as the selection criterion, and the model with the lowest AIC was selected [16][17][18]. The LASSO model was constructed in the internal cohort by LASSO combined with Cox regression. The TNM model was established in the internal cohort from the first-time T, N and M staging, and the TTNNMM model was established in the internal cohort from the first- and second-time T, N and M staging [19].
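How the LASSO penalty removes variables can be illustrated with the soft-thresholding operator, the update that coordinate-descent solvers such as the one in glmnet apply to each coefficient in turn. The following is a minimal Python sketch for illustration only, not the analysis code of this study. The coefficient values and the penalty `lam` are hypothetical, and applying the operator directly to unpenalized estimates is exact only in the special case of uncorrelated, standardized predictors in least squares; in a Cox model it is just the inner step of an iterative fit.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def shrink_coefficients(coefs, lam):
    """Apply soft-thresholding to a dict of unpenalized coefficient estimates."""
    return {name: soft_threshold(beta, lam) for name, beta in coefs.items()}


# Hypothetical unpenalized log hazard ratios (NOT taken from Table 2).
example = {"age_first": 0.62, "sex": 0.15, "grade_second": 0.04, "size_first": -0.03}

# Variables whose coefficient survives the penalty are the "selected" predictors.
shrunk = shrink_coefficients(example, lam=0.05)
selected = {name: beta for name, beta in shrunk.items() if beta != 0.0}
```

In this toy example the two weak predictors are set exactly to zero, which is the mechanism by which LASSO yields a model with fewer variables than unpenalized Cox regression.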

Compare Models
The LASSO model (LASSO combined with Cox regression) and the COX model (backward Cox analysis) were first internally validated in the internal cohort using a bootstrap method (1,000 bootstrap resamples) and then externally validated in the external cohort. The 3- and 5-year OS calibration of the LASSO model and COX model was assessed by comparing the observed survival with the predicted survival in the internal and external cohorts. Next, to assess the discriminative ability of the LASSO model and COX model, time-dependent receiver operating characteristic (timeROC) curves were estimated for the two cohorts by inverse probability of censoring weighting estimators (Kaplan-Meier weights) at 3 and 5 years [20,21]. Sequential AUCs were compared among the four models using independent and identically distributed representations of the AUC estimators [22]. The overall prognostic performance of the four models was also assessed using the Bayesian Information Criterion (BIC) via bootstrap-resampling analysis. Lastly, the four models were evaluated with AIC, C-index [23], net reclassification improvement (NRI) [24,25] and integrated discrimination improvement (IDI) [26].
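As an illustration of what Harrell's C-index measures, the sketch below counts concordant pairs among all comparable pairs of patients. It is a simplified Python illustration under stated assumptions, not the implementation used in this study: tied survival times are treated as not comparable, and the function name and the example data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the patient who died earlier
    was assigned the higher predicted risk (ties in risk count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:  # patient i died before patient j's last follow-up
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: survival in months, event flag (1 = death), predicted risk.
months = [5, 10, 15, 20]
died = [1, 1, 0, 1]
risk = [0.9, 0.2, 0.4, 0.1]
c_index = harrell_c_index(months, died, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.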

Clinical Use
The net benefit and clinical usefulness of the four models above were estimated with decision curve analysis (DCA) throughout the whole cohort [27] .
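The net benefit plotted in a decision curve can be sketched as follows. This Python illustration assumes a binary outcome at a fixed time horizon (for example, death within 3 years) and ignores censoring, which a survival-based DCA would need to handle; the function names and example data are hypothetical.

```python
def net_benefit(predicted_risk, outcome, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold.

    True positives count as benefit; false positives are weighted by the
    odds of the threshold probability, threshold / (1 - threshold).
    """
    n = len(predicted_risk)
    tp = sum(1 for p, y in zip(predicted_risk, outcome) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(predicted_risk, outcome) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcome, threshold):
    """Reference strategy: every patient is treated regardless of the model."""
    prevalence = sum(outcome) / len(outcome)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical predicted risks and observed outcomes (1 = event by the horizon).
risks = [0.8, 0.6, 0.3, 0.1]
events = [1, 0, 1, 0]
nb_model = net_benefit(risks, events, threshold=0.2)
nb_all = net_benefit_treat_all(events, threshold=0.2)
```

The treat-none strategy has a net benefit of 0 by definition, so a model is clinically useful at a given threshold when its net benefit exceeds both 0 and the treat-all value.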

Nomogram for a visualization model
For the purpose of illustration and clinical applicability, we created a nomogram based on the LASSO model. The nomogram displays model-based score points for each category of each predictor variable, which have to be summed for any individual patient. The corresponding predicted survival probabilities can then be read easily from the resulting total number of points.
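The way a nomogram turns regression coefficients into points can be sketched as below. This Python illustration assumes the common convention that the predictor with the largest effect range spans the full point scale and all other predictors are scaled proportionally; the coefficients, category names and the 0-10 scale are hypothetical inputs, not values from the LASSO model.

```python
def build_point_table(coefs, max_points=10.0):
    """Convert log hazard ratios per category into nomogram points.

    coefs maps predictor -> {category: log hazard ratio vs. reference}.
    The predictor with the widest coefficient range receives max_points
    for its highest-risk category; the rest are scaled proportionally.
    """
    ranges = {name: max(lv.values()) - min(lv.values()) for name, lv in coefs.items()}
    scale = max_points / max(ranges.values())
    return {
        name: {cat: (beta - min(lv.values())) * scale for cat, beta in lv.items()}
        for name, lv in coefs.items()
    }


def total_points(point_table, patient):
    """Sum the points for one patient's category in each predictor."""
    return sum(point_table[name][category] for name, category in patient.items())


# Hypothetical coefficients (NOT from Table 2).
coefs = {
    "age_first": {"<=65": 0.0, ">65": 1.0},
    "sex": {"female": 0.0, "male": 0.5},
}
table = build_point_table(coefs)
score = total_points(table, {"age_first": ">65", "sex": "male"})
```

The total score is then mapped to a predicted survival probability through the fitted baseline survival, which is the step read off the bottom axes of the nomogram.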

Clinical Characteristics
Between 2004 and 2013, 4616 patients with synchronous colorectal carcinomas (SCC) were identified in the SEER database. Patient characteristics are shown in Table 1. Between the first and second synchronous colorectal carcinomas, age was significantly correlated, and pN, pM, examined lymph nodes, Surg Prim Site, and chemotherapy were slightly correlated. Patients with SCC were mostly older (>65 years), more often men, and likely to have T3 depth of invasion. Tumors were mostly situated in the cecum, ascending colon and sigmoid colon.
The variables selected by Cox regression and by LASSO combined with Cox regression are listed in Table 2. By Cox regression, age at the first SCC diagnosis, sex, first-time size, first-time surgery, second-time marital status, second-time grade, second-time chemotherapy, first- and second-time pT, pN, pM and regional nodes examined, and site of disease were significantly associated with overall survival (OS). By LASSO combined with Cox regression, age at the first SCC diagnosis, sex, second-time chemotherapy, and first- and second-time pT, pN, pM and regional nodes examined were significantly associated with OS.
The relationships between first- and second-time pT, pN, pM, grade and regional nodes are listed in Table 1.

Predictive Variable Selection
On the basis of the 4616 patients in the internal cohort, 31 variables were reduced to 11 potential predictors by LASSO combined with Cox regression and to 16 by Cox regression (Figure 2A, 2B, 2C). These were the variables with nonzero coefficients in the LASSO Cox regression model or those retained at the minimum AIC in the Cox regression model (Figure 2D).
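The information criteria used for model selection and comparison trade goodness of fit against the number of parameters. The sketch below shows the standard definitions in Python; the log-likelihood values are hypothetical, and the choice of `n_obs` in the BIC (number of patients or number of events for a Cox model) is left as an assumption for the reader.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 log L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def best_by_aic(models):
    """Return the name of the candidate with the smallest AIC.

    models maps name -> (log_likelihood, n_params).
    """
    return min(models, key=lambda name: aic(*models[name]))


# Hypothetical fits: a larger model with a slightly better likelihood
# versus a smaller model (values are NOT from this study).
candidates = {"16_variables": (-1000.0, 16), "11_variables": (-1003.0, 11)}
chosen = best_by_aic(candidates)
```

Because the BIC penalty grows with the sample size, it favors smaller models more strongly than the AIC in a large cohort.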

Development of COX Model and LASSO Model
Age, sex, marital status, race, site, pT, pN, pM, radiation, chemotherapy, surgery, nodes examined, etc. were entered into the multivariable Cox regression after variable selection by LASSO Cox regression or Cox regression. Hazard ratios with 95% CIs for the included covariates are shown in Table 2.

Apparent Performance of the LASSO Model or COX Model in the Internal Cohort
Calibration curves comparing the predicted and observed probabilities of overall survival (OS) at 3 and 5 years in the internal cohort (Figure 3A, 3B, 3C, 3D) were plotted to assess the calibration of the COX model and LASSO model, accompanied by the Hosmer-Lemeshow test (a significant test statistic implies that the model does not calibrate perfectly).

Validation of the LASSO Model and COX Model
Internal validation was performed in the internal cohort, and external validation was performed in the external cohort. The LASSO model was built in the internal cohort and applied to all the patients of the external cohort. The calibration curves at 3 and 5 years (Figure 4A, 4B) were derived on the basis of the regression analysis.

C-index and AIC
To quantify the discrimination performance of the COX model, LASSO model, TNM model, and TTNNMM model, Harrell's C-index and AIC were applied (Table 3). The C-index for the COX model, Table 4.

Predictive Accuracy of COX Model and LASSO Model
According to the survROC curves for 1-, 3- and 5-year overall survival (OS) for the COX model, LASSO model, TNM model, and TTNNMM model (Figs 5A, 5B, 5C, 5D), the area under the ROC curve (a general measure of predictiveness) was greater at 3 and 5 years than at 1 year.

Difference in TimeAUC Performance Between the LASSO Model and COX Model
Time-dependent ROC curves were generated to compare the sequential trends of the LASSO, COX, TNM and TTNNMM models for OS. The time-dependent AUC of the LASSO model was consistently superior to that of the COX model, TNM model and TTNNMM model (Figure 6).

BIC
The prognostic performance of the LASSO, COX, TNM, and TTNNMM models was compared using BIC, which measures the goodness of fit of an estimated statistical model while also accounting for the number of parameters included in the model. As shown in Figure 7

Net Reclassification Improvement (NRI) and Integrated Discrimination Improvement (IDI)
The discriminative ability of the LASSO model, COX model, TNM model, and TTNNMM model was calculated using NRI and IDI (Table 5). Compared to the TNM model and TTNNMM model, the LASSO model had higher discrimination and reclassification indices (integrated discrimination improvement 0.072 and 0.064; p < 0.001; net reclassification improvement 0.525 and 0.466) (Table 4). In addition, compared to the COX model, the LASSO model did not significantly decrease the discrimination and reclassification indices (integrated discrimination improvement -0.002, p = 0.058; NRI -0.009) (Table 5).
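To make the two indices concrete, the sketch below computes the IDI and a category-free (continuous) NRI for a new model against an old one. It is a simplified Python illustration assuming a binary outcome at a fixed time horizon without censoring; the survival-based versions used for time-to-event data additionally account for censoring. Function names and example risks are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _split_differences(old_risk, new_risk, outcome):
    """Risk change (new - old) for patients with and without the event."""
    with_event = [n - o for o, n, y in zip(old_risk, new_risk, outcome) if y == 1]
    without_event = [n - o for o, n, y in zip(old_risk, new_risk, outcome) if y == 0]
    return with_event, without_event


def idi(old_risk, new_risk, outcome):
    """Mean risk increase among events minus mean risk increase among non-events."""
    with_event, without_event = _split_differences(old_risk, new_risk, outcome)
    return _mean(with_event) - _mean(without_event)


def continuous_nri(old_risk, new_risk, outcome):
    """Net proportion moved up among events plus net proportion moved down among non-events."""
    def net_up(diffs):
        up = sum(1 for d in diffs if d > 0)
        down = sum(1 for d in diffs if d < 0)
        return (up - down) / len(diffs)

    with_event, without_event = _split_differences(old_risk, new_risk, outcome)
    return net_up(with_event) - net_up(without_event)


# Hypothetical predicted risks from an old and a new model (1 = event).
old = [0.5, 0.5, 0.5, 0.5]
new = [0.7, 0.6, 0.4, 0.5]
y = [1, 1, 0, 0]
idi_value = idi(old, new, y)
nri_value = continuous_nri(old, new, y)
```

Positive values of either index favor the new model, and values close to zero indicate that the two models discriminate and reclassify similarly.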

Clinical Use
Decision curve analysis was conducted to determine the clinical usefulness of the LASSO model by quantifying the net benefits at different threshold probabilities. We also plotted the decision curves for the four models at 3 and 5 years (Figure 8A, 8B).

Visualization of SCC Survival Prediction Model
The nomogram for the survival prediction model was established from the factors selected by LASSO combined with Cox regression (Figure 9). The nomogram showed that first-time age contributed most to prognosis, followed by first- and second-time T stage, N stage, metastases and

Discussion
In this study, we developed and validated a prognostic model for SCC based on a large SEER dataset by combining LASSO regression and Cox regression. Our results show that OS is associated with age, sex, second-time chemotherapy and first- and second-time pT, pN, pM.
Notably, the second-time examined lymph nodes did not show sufficient predictive strength in the Cox regression, and a common strategy would be to exclude such a variable from model development. However, important predictors may be rejected because of nuances in the data set or confounding by other predictors [17,28], so the lack of a statistically significant association with OS does not necessarily imply that examined lymph nodes are unimportant. In addition, a larger number of examined lymph nodes may indicate a better quality of operation. Therefore, we kept the second-time examined lymph nodes as a candidate factor in the process of model development. For the same reason, we kept surgery, sex, the first-time tumor size and the second-time grade, pN, and the site in the COX model.
Grade, size, surgery and marital status, which may be collinear with pT, pN, pM, and age, were not included in the LASSO model. Grade, size, and surgery may be associated with TNM staging [12], and older patients were more likely to be widowed. Site was also not included in the LASSO model because of overfitting.
From Table 1, we found that patients with SCC were generally older (>65 years), more often male, and likely to have T3 depth of invasion. Men may have less protective estrogen [29], and older patients have a high probability of microsatellite instability (MSI) [30]. The predominance of T3 invasion may result from the biologic characteristics of the tumor. Therefore, older patients (>65 years), men, and patients with T3 depth of invasion should be closely monitored with postoperative enteroscopy for early detection of SCC.
Figure 6 shows that, although the LASSO model included fewer variables, it performed better than the COX model in the timeAUC (Table 6). Some studies have also indicated that males are more susceptible to SCC [5]. In addition, SCC occurs more often in the right hemicolon and sigmoid colon [4][5][6]; therefore, short-term postoperative complications in SCC may contribute to the offset within 1 year [5].
We also found that the AUC for timeROC at 3 and 5 years was larger than that at 1 year. Most patients who die within 1 year may die of postoperative complications, so we did not predict OS within 1 year.
The most important and final argument for the use of the nomogram is based on the need to interpret the individual need for additional treatment or care. However, the clinical consequences of a particular level of discrimination or degree of miscalibration cannot be adequately assessed by risk-prediction performance, discrimination, and calibration [17,31,32]. Therefore, in order to justify clinical usefulness, it is crucial to ascertain whether LASSO model-assisted decisions can improve patient outcomes. With this aim, decision curve analysis was applied in this study in place of multi-institutional prospective validation of the model. This novel method offers insight into clinical consequences on the basis of the threshold probability, from which the net benefit can be derived. (Net benefit is defined as the proportion of true positives minus the proportion of false positives, weighted by the relative harm of false-positive and false-negative results.) [17,33] The decision curve plot shows that, if the threshold probability of a patient or doctor is >5%, using the LASSO model is more beneficial than using either the TNM model or TTNNMM model and is not inferior to using the COX model (Figure 8A, 8B).
There are some limitations in the present work that should be discussed. The SEER database is collected retrospectively. There is a lack of molecular data and data on biological prognostic factors that might also influence the prognosis of SCC patients. In recent years, an increasing number of studies have proposed markers associated with SCC, such as MSI, SSA, and BRAF V600E [3,6]. In addition, excluding all patients with missing data for the collected variables may have introduced bias. The study did not incorporate detailed chemotherapy and radiation methods because the available information was inadequate and heavily biased.
Finally, although this nomogram performed well in both the internal and external cohorts, it should be used with caution when predicting 1-year risk because of the influence of operation-related deaths. Even so, although it did not include genomic characteristics, excluded patients with missing data and was based on a retrospective analysis, it was the first model to predict OS in patients with SCC.
The main influencing factors for the LASSO model were screened from the content of the database. Because of the limitations of the database, some important factors were not covered. In the future, we hope to obtain the relevant data and incorporate them into our research.
In conclusion, this study presents a prognosis nomogram that incorporates both the first-time and the second-time variables and can be conveniently used to facilitate the prediction of OS in patients with SCC.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
All authors reviewed and approved the manuscript.

Availability of data and materials
The datasets from which the eligible cases were selected are available in the SEER database. The data are also available from the corresponding author.

Conflict of interest
Mr. YuXin Xu has no conflict of interest or financial ties to declare.

Figure 1
Flowchart of patient selection for this study.

The decision curve showed that if the threshold probability of a patient or doctor is >5%, using the LASSO model in the current study to predict OS adds more benefit than the treat-all-patients scheme or the treat-none scheme. The net benefit was not comparable, with several overlaps, on the basis of the LASSO model and the COX model.

Figure 9
Developed LASSO model nomogram. The LASSO model nomogram was developed in the primary cohort, with the first-time age and sex, the second-time chemotherapy and the first- and second-time pT, pN, pM, and regional nodes examined incorporated. LASSO model nomograms to predict 3- and 5-year overall survival probability with SCC. For each predictor, read the points assigned on the 0-10 scale at the top and then add these points. Find the number on the "Total Points" scale and then read the corresponding predictions of 3- and 5-year risk.