Validation of a nomogram to predict survival in patients with ESCC based on examined lymph nodes

Background: The prognostic value of the number of examined lymph nodes (LNs) in esophageal squamous cell carcinoma (ESCC) remains controversial. We aimed to develop an alternative LN-classification-based nomogram to individualize ESCC prognosis. Methods: Using data from patients diagnosed with ESCC between 2004 and 2015 in the SEER database, we determined the cut-off value for the number of examined LNs with the K-adaptive partitioning (KAPS) algorithm. A nomogram predicting survival in ESCC was constructed, internally and externally validated, evaluated by calibration plot, C-index, and decision curve analysis (DCA), and compared with the 7th TNM stage. Results: In total, we included 3629 patients with detailed information. The optimal cut-off for the number of examined LNs was 8. The C-index of the nomogram was higher than that of the 7th TNM staging (internal: 0.708; 95% CI, 0.678-0.753 vs 0.601; 95% CI, 0.573-0.656, P<0.001; external: 0.687; 95% CI, 0.601-0.734 vs 0.605; 95% CI, 0.563-0.659, P<0.001). Additionally, the nomogram showed good agreement in both internal and external validation. DCA showed that, in both the internal and external cohorts, the nomogram provided a greater benefit across the follow-up period than the 7th TNM stage. Conclusion: Examining more than 8 LNs was associated with a better prognosis. Based on this finding, we constructed a nomogram that offers greater benefit than TNM staging for predicting survival of patients with ESCC.


Introduction
Esophageal cancer (EC) is the fifth leading cause of cancer death among men worldwide, and esophageal squamous cell carcinoma (ESCC) accounts for 80%-85% of EC cases in Asian countries [1,2]. Despite improvements in treatments such as surgical resection and adjuvant chemoradiation, the long-term prognosis of ESCC patients remains poor, with 5-year survival below 30% [3,4]. Poor outcomes in patients with ESCC are correlated with diagnosis at an advanced stage and a high propensity for metastasis caused by insufficient lymph node dissection (LND) [5,6]. Moreover, the number of nodes that should be dissected remains controversial. The National Comprehensive Cancer Network (NCCN) guidelines recommend removing at least 12-15 nodes [7,8], whereas the 7th edition of the American Joint Committee on Cancer (AJCC) staging system suggests resecting as many regional lymph nodes (LNs) as possible [9]. Additionally, numerous retrospective studies have considered extensive removal of 6-30 LNs to be favourable for survival [10][11][12][13]. Other independent factors such as age, grade, and tumor size can also significantly affect survival [13].
On this basis, we aimed to determine the optimal number of examined LNs in a large population-based study.
In our study, the best cut-off points were determined by the K-adaptive partitioning (KAPS) algorithm, which has been demonstrated to be a useful tool for obtaining subgroups that are heterogeneous in survival [14]. Using the Surveillance, Epidemiology, and End Results (SEER) database, we developed and validated a nomogram based on the results of multivariate analysis to predict survival of ESCC patients.

Patients
All patients with EC were retrieved from the SEER database using the National Cancer Institute's SEER*Stat software (version 8.3.6). Informed consent was not required because the SEER database is publicly available.
According to the International Classification of Diseases for Oncology (ICD-O-3), tumors with codes 8070, 8071, 8072, 8073, 8074, 8075, 8076, and 8078 were identified as squamous cell carcinoma [15,16]. In our study, patients with ESCC were included according to the following criteria: (1) patients aged more than 20 years who were diagnosed with EC by positive histology from 2004 through 2015; (2) patients with a histopathology of squamous carcinoma; (3) patients with detailed survival information; (4) patients with detailed information on race, grade, regional nodes examined, tumor size, T stage, N stage, and M stage.

Clinicopathological factors
The clinicopathological variables extracted from the SEER database in our study included age, race, sex, pathology grade, M stage, tumor size, N stage, regional nodes examined, 7th TNM stage, and marital status. The age of patients was divided into three groups: <50 years, 50-70 years, and >70 years. Race was classified into three types: white, black, and other. Sex was recorded as male or female. Pathology grade was categorized as well differentiated, moderately differentiated, poorly differentiated, and undifferentiated. Lymph node metastasis (LNM) was recorded as N1 (present) or N0 (absent), and distant metastasis as M1 (present) or M0 (absent). Tumor size was categorized into four groups: ≤2 cm, >2 to ≤3 cm, >3 to ≤5 cm, and >5 cm. With respect to regional nodes examined, the cut-off value given by the K-adaptive partitioning (KAPS) algorithm [17] was 8. Regional nodes examined were therefore divided into two groups: ≤8 and >8. Marital status was recorded as married, unmarried, or divorced/widowed. The main outcome in our study was cancer-specific survival (CSS), for which the event was defined as death attributable to this cancer.
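The grouping rules above can be written down compactly. The following is a minimal Python sketch rather than the code used in the study; the function name and the output labels are hypothetical, and it assumes that ages of exactly 50 and 70 fall in the middle age group and that the tumor size groups are consecutive intervals.

```python
def categorize_patient(age_years, tumor_size_cm, nodes_examined):
    """Map raw values to the categories described in the text.

    Boundary handling (50 and 70 in the middle age group, consecutive
    size intervals) is an assumption made for this illustration.
    """
    if age_years < 50:
        age_group = "<50"
    elif age_years <= 70:
        age_group = "50-70"
    else:
        age_group = ">70"

    if tumor_size_cm <= 2:
        size_group = "<=2 cm"
    elif tumor_size_cm <= 3:
        size_group = ">2-3 cm"
    elif tumor_size_cm <= 5:
        size_group = ">3-5 cm"
    else:
        size_group = ">5 cm"

    # Cut-off of 8 examined lymph nodes, as reported for the KAPS analysis.
    ln_group = "<=8" if nodes_examined <= 8 else ">8"

    return {
        "age_group": age_group,
        "tumor_size": size_group,
        "examined_lns": ln_group,
    }


if __name__ == "__main__":
    print(categorize_patient(age_years=64, tumor_size_cm=4.2, nodes_examined=11))
```

A patient with exactly 8 examined nodes falls in the ≤8 group under this rule, which matches the two groups defined above.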

Statistical analysis
For the baseline statistics, patients were divided into two groups according to year of diagnosis, and Pearson's chi-square test was used to investigate associations among the categorical variables. For CSS in patients with ESCC, we plotted survival curves with the survminer package in R. Furthermore, to identify risk factors for survival, we performed Lasso regression analysis and multivariate Cox regression, and then constructed a nomogram. The internal cohort included patients diagnosed with EC between 2004 and 2009, while the external validation cohort contained patients diagnosed between 2010 and 2015. Predictive performance was assessed by the C-index, time-dependent receiver operating characteristic (tdROC) curves, and decision curve analysis (DCA) [14,18,19]. The nomogram was first internally validated and then externally validated in the independent cohort. Statistical analyses were performed with R software (version 3.6.1; R Foundation for Statistical Computing, Vienna, Austria). The main packages used were ggplot2, survival, rms, kaps, and survminer. The chi-square test was carried out in SPSS (version 24.0). A P value of less than 0.05 was considered statistically significant.
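To make the C-index concrete, the sketch below computes Harrell's concordance index for right-censored data in plain Python. It is an illustration under stated assumptions and not the implementation in the R packages named above: the function name and the toy data are hypothetical, and pairs with tied survival times are simply skipped.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times       : follow-up time for each patient
    events      : 1 if the event (death) was observed, 0 if censored
    risk_scores : model output where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before the follow-up time of patient j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # the earlier death had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical toy data: months of follow-up, event flags, predicted risks.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.8, 0.1]
    print(harrell_c_index(times, events, risks))  # 4 of 5 comparable pairs agree: 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the nomogram and TNM staging are compared.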

Patient characteristics
In total, we enrolled 3629 patients diagnosed with ESCC from 2004 through 2015 from the SEER database. As the flow chart shows (Supplementary Figure 1), we first identified ESCC cases according to the pathological diagnosis, and then excluded patients without T stage information (n=20354), N stage information (n=108), M stage information (n=108), survival information (n=8192), and other information (n=4164). Baseline characteristics of the patients are presented in Table 1. A total of 1732 patients were diagnosed from 2004 to 2009 and 1897 from 2010 to 2015. In the analysis of baseline characteristics, ESCC was more frequent in patients older than 50 years and in male patients. In addition, the lymph node metastasis rate was 54.42%, while patients with distant metastasis accounted for 24.28%. The median survival was 9 months (range, 4 to 23 months).

Grouping of lymph nodes in EC patients
Using the KAPS algorithm, we found that the optimal cut-off dividing the number of examined lymph nodes (LNs) into two groups was 8, and we then performed Kaplan-Meier survival analysis. As shown in Figure 1A, the overall survival (OS) rate differed significantly between the two groups (P<0.001). Likewise, in the analysis of CSS, patients with 8 or fewer examined LNs had a worse prognosis than those with more than 8 examined LNs (Figure 1B). Additionally, we performed Kaplan-Meier survival analysis within each stage and found that patients with >8 examined LNs still had better survival (Figure 2). The difference in survival between the two groups within each TNM stage was statistically significant (P<0.0001).
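The Kaplan-Meier curves compared here follow the product-limit estimator, which can be sketched in a few lines of Python. This is a minimal illustration and not the survminer workflow used in the study; the function name and the two small groups of patients are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    Returns a list of (time, survival_probability) pairs, one for each
    distinct time at which at least one event was observed.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every patient whose follow-up ends at this time.
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((current_time, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Hypothetical follow-up in months; 1 = death observed, 0 = censored.
    few_nodes = kaplan_meier([3, 5, 5, 8, 12], [1, 1, 0, 1, 0])    # <=8 examined LNs
    many_nodes = kaplan_meier([6, 9, 14, 20, 24], [1, 0, 1, 0, 0])  # >8 examined LNs
    print(few_nodes)
    print(many_nodes)
```

Comparing two such curves for significance requires a separate test such as the log-rank test, which this sketch does not implement.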

EC survival prediction model
Because a sufficient number of variables had been extracted, we performed Lasso regression analysis to select those most suitable for predicting prognosis and found that age, tumor size, T stage, N stage, M stage, and examined LNs were highly associated with survival (Figure 3). Moreover, using a multivariate Cox model, we identified age, tumor size, T stage, N stage, M stage, and examined LNs as independent prognostic factors (Figure 4). Patients aged ≥70 years, with tumor size >5 cm, or with advanced stage had a poorer prognosis, while examined LNs >8 were associated with a better prognosis. A nomogram predicting survival was then constructed based on the multivariate Cox analysis (Figure 5).
The nomogram showed that T stage contributed the most to prognosis, followed by examined LNs, M stage, tumor size, and age, whereas lymph node metastasis had the least effect on prognosis. To use the nomogram, read the points assigned to each predictor on the 0-10 scale at the top and add them. Then find the sum on the "Total Points" scale and draw a straight line down to read the corresponding predictions at 1, 3, and 5 years.
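The point-reading procedure can be mimicked in code. The Python sketch below shows, in simplified form, how a Cox-based nomogram converts coefficients into points and total points into a survival probability. All coefficients and baseline survival values are hypothetical placeholders and are not the fitted values from this study; only three predictors are included, and every reference level is assumed to have a coefficient of zero.

```python
import math

# Hypothetical Cox log-hazard coefficients relative to each reference level.
# These are NOT the fitted values from the study.
HYPOTHETICAL_COEFS = {
    "t_stage": {"T1": 0.0, "T2": 0.3, "T3": 0.6, "T4": 1.0},
    "examined_lns": {">8": 0.0, "<=8": 0.7},
    "m_stage": {"M0": 0.0, "M1": 0.5},
}
# Hypothetical baseline survival S0(t) at 1, 3, and 5 years.
HYPOTHETICAL_BASELINE_SURVIVAL = {1: 0.85, 3: 0.60, 5: 0.50}
MAX_POINTS = 10.0  # top of the 0-10 points scale


def points_per_log_hazard():
    """Scale factor so that the largest single effect gets MAX_POINTS."""
    largest_effect = max(
        max(levels.values()) for levels in HYPOTHETICAL_COEFS.values()
    )
    return MAX_POINTS / largest_effect


def total_points(patient):
    """Sum the points of each predictor level, as read off the nomogram."""
    scale = points_per_log_hazard()
    return sum(
        HYPOTHETICAL_COEFS[name][level] * scale for name, level in patient.items()
    )


def predicted_survival(patient, year):
    """Convert total points back to the Cox survival S0(t) ** exp(lp)."""
    linear_predictor = total_points(patient) / points_per_log_hazard()
    return HYPOTHETICAL_BASELINE_SURVIVAL[year] ** math.exp(linear_predictor)


if __name__ == "__main__":
    patient = {"t_stage": "T3", "examined_lns": ">8", "m_stage": "M0"}
    print(total_points(patient))
    for year in (1, 3, 5):
        print(year, round(predicted_survival(patient, year), 3))
```

In this simplified scheme the predictor with the largest coefficient spans the full points scale, which is why the widest axis on a nomogram marks the strongest contributor.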

Discussion
Our present investigation shows that a larger number of examined LNs was associated with better survival of ESCC patients, so we determined a cut-off value for the number of examined LNs. The cut-off value was 8, and we divided patients into two groups accordingly: ≤8 examined LNs and >8 examined LNs. Lasso regression analysis can reduce the influence of confounding factors, which made our prediction model more accurate [20]. According to the results (Figure 3), six variables, namely tumor size, T stage, age, examined LNs, N stage, and M stage, were identified as the most suitable for constructing the model. In addition, we performed multivariate regression analysis and built the nomogram, finding that the nomogram was more accurate than the traditional 7th TNM staging in both the internal and external cohorts and showed better clinical usefulness for predicting survival as assessed by DCA.
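The clinical usefulness assessed by DCA rests on the net benefit at a chosen threshold probability. The Python sketch below computes net benefit for a binary outcome at a fixed time horizon; it is a simplification that ignores censoring, whereas a survival DCA would need to account for it, and the function name and toy data are hypothetical.

```python
def net_benefit(predicted_risks, outcomes, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    predicted_risks : predicted probability of the event for each patient
    outcomes        : 1 if the event occurred, 0 otherwise (no censoring)
    threshold       : threshold probability, strictly between 0 and 1
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(outcomes)
    true_pos = sum(
        1 for r, y in zip(predicted_risks, outcomes) if r >= threshold and y == 1
    )
    false_pos = sum(
        1 for r, y in zip(predicted_risks, outcomes) if r >= threshold and y == 0
    )
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


if __name__ == "__main__":
    outcomes = [1, 0, 1, 0]
    model_a = [0.9, 0.1, 0.8, 0.2]   # hypothetical well-separated predictions
    model_b = [0.9, 0.8, 0.3, 0.2]   # hypothetical weaker predictions
    treat_all = [1.0, 1.0, 1.0, 1.0]
    for name, risks in (("A", model_a), ("B", model_b), ("all", treat_all)):
        print(name, net_benefit(risks, outcomes, threshold=0.5))
```

A decision curve plots this quantity over a range of thresholds for each model together with the treat-all and treat-none strategies, the latter having a net benefit of zero.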
Examining LNs is an important prognostic factor in patients with gastrointestinal cancer, and the association between survival and the number of examined LNs has been demonstrated by previous studies [21,22]. A large population-based study using the National Cancer Database showed that patients who had more than 15 lymph nodes resected had better survival [23], as did other studies [24]. In our study, the best cut-off value for examined LNs was 8, which made a difference in survival evaluation. Moreover, our results were in line with other studies [11,25,26]. Published studies showed that the area under the ROC curve (AUC), which summarizes sensitivity and specificity, was about 0.61, although it was considered a powerful prognostic factor for evaluating survival [27,28]. Against this background, the nomogram performed well, with an AUC of more than 0.7, compared with the 7th TNM staging. Some studies have also regarded the traditional TNM staging as controversial and proposed different subdivisions [11,29,30]. With respect to clinical utility, we assessed the C-index, tdROC curves, and calibration plots, which demonstrated that the nomogram was an effective prediction model in both internal and external validation. In both validations, the C-index was about 0.7, suggesting that the nomogram was an effective model [31]. The tdROC results showed that the nomogram was clearly superior to the 7th TNM staging when the AUCs were compared [14]. Also, as the calibration plot suggested, there was good agreement between predicted values and actual observations [14]. As for the number of examined LNs in different TNM stages, we found that the cut-off separating the two groups made a great difference in prognosis. Previous studies reported cut-off values of more than 8 [34], which depended greatly on the operation and pathological diagnosis [10]. Therefore, this difference in conclusions may be due to the heterogeneity of clinical data.
Our study has some limitations that should be discussed. First, TNM staging followed the 7th edition rather than the 8th edition, which may influence the comparison between the nomogram and the TNM staging system. Hence, a further study is needed to compare the model constructed here with the 8th TNM staging. Second, we excluded many patients with missing data for the collected variables, which increases selection bias. Third, variables such as the number of examined LNs depended on individual doctors at different clinical centers. Therefore, although this nomogram performed well in the two cohorts, it should be applied with great caution when assessing 1-, 3-, and 5-year survival. In the future, we will collect our own relevant data to incorporate the factors above into further research.

Conclusions
In conclusion, the present study was designed to determine the optimal number of examined LNs. We found that examining LNs benefited the prognosis of patients and that examining more than 8 LNs was more favorable. Based on the Lasso and multivariate analyses, we developed and validated a nomogram with greater benefit than TNM staging for predicting survival of ESCC patients, as demonstrated by tdROC and DCA.

Conflicts of interest
The authors disclose no conflicts.