Development and Validation of a Gene-panel-based Nomogram for Prediction of Lymph Node Metastasis in Esophageal Squamous Cell Carcinoma

Background: Esophageal squamous cell carcinoma (ESCC) is one of the main histological subtypes of esophageal cancer. This study aimed to develop a gene-panel-based nomogram for identification of lymph node metastasis (LNM) in ESCC patients. Methods: RNA sequencing profiles of ESCC patients were obtained from the Gene Expression Omnibus (GEO) database and The Cancer Genome Atlas (TCGA) database. A bioinformatic approach was employed to identify differentially expressed genes (DEGs) between ESCC patients with LNM and those without. A four-DEG panel was eventually identified and integrated with clinical characteristics to construct a nomogram for predicting LNM. The predictive performance of the nomogram was further evaluated by calibration curves and the concordance index (C-index). Results: A total of 179 ESCC patients with 32059 genes from the GEO dataset were included. Among these genes, 3524 DEGs were correlated with lymph node involvement. Meanwhile, the TCGA dataset containing 93 ESCC patients was obtained, in which 82 DEGs were selected out of 18416 genes. Among the 11 communal DEGs, four genes (ALG3, CPOX, LMLN, PSMD2) were identified to be associated with LNM. A nomogram was established by integrating the four-gene panel and three clinical characteristics: T stage, G stage and tumor location. The nomogram exhibited good performance, with C-indices of 0.710 and 0.693 in the GEO and TCGA datasets, respectively. Conclusion: Our novel four-gene-based nomogram showed value in predicting LNM in ESCC patients and may be helpful in determining the treatment approach for early-stage ESCC patients.

In the present study, we performed a genome-wide discovery of genes associated with LNM in ESCC by analyzing datasets from the GEO and TCGA databases. Four LNM-associated genes were identified, and a gene-panel-based nomogram was established and validated, which displayed good performance in predicting LNM in ESCC patients.

Patient cohorts and transcriptomic data
Transcriptomic data and the corresponding clinical variables of ESCC samples were downloaded from the GEO database (GSE53625) and TCGA database. The exclusion criteria were as follows: (1)

Identification of communal differentially expressed genes (DEGs)
The GSE53625 dataset, which includes mRNA sequencing data of ESCC patients, was preprocessed and annotated. Differential expression analysis was performed between ESCC samples with LNM and those without using the R package limma (version 3.6.2) (20). DEGs were determined by the following significance criteria: false discovery rate (FDR) < 0.05, |logFC| > 1 and p < 0.05. Similar criteria (|logFC| > 1, p < 0.05) were applied to the TCGA dataset to identify DEGs associated with the presence of LNM. Because only a limited number of DEGs overlapped between the two datasets, we subsequently assessed the association of these communal DEGs with LNM. The communal DEGs were defined as the intersection of the DEGs from the two datasets and were visualized with the VennDiagram package (version 1.6.20).
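The threshold filtering and intersection steps described above can be sketched as follows. This is a minimal Python illustration, not the authors' R workflow: the `GeneStat` record, the `select_degs` helper and all gene names and statistics are hypothetical, and the per-gene log fold changes, p-values and FDR values are assumed to have been computed beforehand (for example by limma).

```python
from typing import Dict, NamedTuple, Optional, Set


class GeneStat(NamedTuple):
    """Per-gene summary statistics (hypothetical container, not a limma object)."""
    log_fc: float                 # log fold change, LNM vs. no LNM
    p_value: float                # raw p-value
    fdr: Optional[float] = None   # adjusted p-value, if available


def select_degs(stats: Dict[str, GeneStat],
                lfc_cutoff: float = 1.0,
                p_cutoff: float = 0.05,
                fdr_cutoff: Optional[float] = None) -> Set[str]:
    """Keep genes with |logFC| > lfc_cutoff and p < p_cutoff.

    If fdr_cutoff is given, the gene must also have FDR < fdr_cutoff.
    """
    selected = set()
    for gene, s in stats.items():
        if abs(s.log_fc) <= lfc_cutoff or s.p_value >= p_cutoff:
            continue
        if fdr_cutoff is not None and (s.fdr is None or s.fdr >= fdr_cutoff):
            continue
        selected.add(gene)
    return selected


# Toy, made-up statistics (NOT values from the study).
geo_stats = {
    "GENE_A": GeneStat(1.8, 0.001, 0.01),
    "GENE_B": GeneStat(-1.4, 0.004, 0.03),
    "GENE_C": GeneStat(0.6, 0.002, 0.02),   # fails |logFC| > 1
    "GENE_D": GeneStat(-2.1, 0.030, 0.20),  # fails FDR < 0.05
}
tcga_stats = {
    "GENE_A": GeneStat(1.2, 0.020),
    "GENE_B": GeneStat(-0.9, 0.010),        # fails |logFC| > 1
    "GENE_D": GeneStat(-1.5, 0.040),
}

geo_degs = select_degs(geo_stats, fdr_cutoff=0.05)  # FDR, |logFC| and p criteria
tcga_degs = select_degs(tcga_stats)                 # |logFC| and p criteria only
communal_degs = geo_degs & tcga_degs                # set intersection
print(sorted(communal_degs))
```

Representing each DEG list as a set makes the communal DEGs a plain set intersection, which is also what a two-set Venn diagram displays.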

Multivariate logistic regression analysis and calculation of Pearson correlation coefficients
A logistic regression model was applied to identify independent predictors of the presence of LNM among the 11 communal DEGs. Four of the 11 communal DEGs were eventually confirmed to be associated with LNM, and the correlation coefficients of the four DEGs were calculated using Pearson correlation.
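As an illustration of the correlation step only, the sketch below computes pairwise Pearson coefficients in plain Python. It assumes the coefficients were calculated between the expression levels of each pair of genes, which the text does not state explicitly. The `pearson_r` helper and the expression values are made up for the example.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(ss_x * ss_y)


# Toy expression values for five samples (NOT data from the study).
expression = {
    "ALG3":  [5.1, 6.3, 5.8, 7.0, 6.1],
    "CPOX":  [4.0, 4.9, 4.4, 5.6, 4.7],
    "LMLN":  [3.2, 2.9, 3.5, 2.7, 3.1],
    "PSMD2": [8.0, 8.4, 7.9, 8.8, 8.1],
}

# One coefficient per unordered pair of genes.
pairwise_r = {
    (g1, g2): pearson_r(expression[g1], expression[g2])
    for g1, g2 in combinations(expression, 2)
}
for pair, r in pairwise_r.items():
    print(pair, round(r, 3))
```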

Construction and validation of the gene-panel-based nomogram
The predictive power of the four communal DEGs was initially assessed. To improve predictive performance, we integrated clinical variables and established a nomogram for predicting LNM, using the GEO dataset as the primary cohort. The TCGA dataset was used as the validation cohort. In addition, we constructed an early-stage cohort comprising patients with T1-2 disease from the two cohorts. Calibration curves were plotted to assess the calibration of the gene-panel-based nomogram. To quantify the discrimination performance of the nomogram, Harrell's C-index was measured. The predicted probability of LNM was also evaluated by receiver operating characteristic (ROC) curves, with the area under the curve (AUC) calculated. Finally, the predictive power of our nomogram was verified in the early-stage cohort.
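For a binary outcome such as LNM, Harrell's C-index is the proportion of (LNM, no-LNM) patient pairs in which the patient with LNM received the higher predicted probability, and it equals the ROC AUC. The sketch below shows that pairwise definition in Python. The function name `c_index_binary` and the predicted probabilities are hypothetical, and ties are assumed to count as half-concordant.

```python
from typing import Sequence


def c_index_binary(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Concordance index for a binary outcome (equivalent to the ROC AUC).

    probs  : predicted probabilities of the event
    labels : 1 if the event (here, LNM) occurred, 0 otherwise
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    concordant = 0.0
    pairs = 0
    for p_pos, y_pos in zip(probs, labels):
        if y_pos != 1:
            continue
        for p_neg, y_neg in zip(probs, labels):
            if y_neg != 0:
                continue
            pairs += 1
            if p_pos > p_neg:
                concordant += 1.0      # correctly ordered pair
            elif p_pos == p_neg:
                concordant += 0.5      # tie counts as half
    if pairs == 0:
        raise ValueError("need at least one positive and one negative case")
    return concordant / pairs


# Toy predictions (NOT results from the study).
toy_probs = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_c = c_index_binary(toy_probs, toy_labels)
print(round(toy_c, 3))
```

The double loop is quadratic in the number of patients, which is acceptable for cohorts of a few hundred samples. A rank-based formula would be preferable for much larger datasets.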

Statistical Analysis
The Pearson χ² test was used for categorical variables, and the two-sample t test was used for continuous variables. A two-sided P-value of < 0.05 was considered statistically significant. All statistical analyses were performed using R (version 3.6.2) and SPSS statistical software version 23 (IBM Corp, Armonk, NY).
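The two tests named above can be illustrated with a short standard-library Python sketch. It is not the authors' R or SPSS code. It assumes a 2×2 contingency table without continuity correction and a pooled-variance (equal variance) t statistic, neither of which is specified in the text, and all counts and values are invented.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must have a non-zero total")
    stat = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def two_sample_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, int]:
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each group needs at least two observations")
    mean_x, mean_y = sum(x) / nx, sum(y) / ny
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    df = nx + ny - 2
    pooled_var = (ss_x + ss_y) / df
    if pooled_var == 0:
        raise ValueError("t statistic is undefined when both groups are constant")
    t_stat = (mean_x - mean_y) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return t_stat, df


# Toy counts: rows = LNM yes/no, columns = a binary clinical factor (made up).
chi2_stat, chi2_p = chi_square_2x2([[30, 20], [15, 35]])
t_stat, t_df = two_sample_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(chi2_stat, 3), chi2_p < 0.05, round(t_stat, 3), t_df)
```

Converting the t statistic into a p-value requires the cumulative distribution of Student's t, which the Python standard library does not provide, so that step is left to a statistics package.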

Baseline information and study design
The baseline characteristics of the two datasets are summarized in Table 1 and Table 2. The design of our study is illustrated as a flow chart in Fig. 1.

Identification of communal DEGs
The gene expression profiles of patients with LNM and those without were compared to screen for DEGs. Ultimately, a total of 3524 DEGs were identified from 32059 transcripts in the GEO cohort (FDR < 0.05, |logFC| > 1, p < 0.05), of which 1082 genes were up-regulated and 2442 genes were down-regulated. Meanwhile, 82 DEGs were found in the TCGA cohort (|logFC| > 1, p < 0.05), with 5 genes up-regulated and 77 genes down-regulated. Heat maps and volcano plots were used to show the expression levels and the up-regulation or down-regulation of DEGs in the GEO cohort (Suppl Fig. 1A-B) and the TCGA cohort (Suppl Fig. 1C-D). A Venn diagram was employed to visualize the intersection of DEGs between the two datasets (Suppl Fig. 2). As shown in Fig. 2A

Identification of gene signatures associated with LNM and development of a four-gene panel
As shown in Fig. 3A). A four-gene panel was thereafter built to predict LNM. Surprisingly, the ROC curve for predicting the probability of LNM indicated unsatisfactory performance, with an AUC of merely 0.547 (Suppl Fig. 3B). (14, 21-25). Some studies have also aimed to construct LNM prediction models based on the aforementioned clinical variables (26,27). Ultimately, a nomogram consisting of 7 variables (the 4 DEGs, T stage, G stage and tumor location) was established to predict the probability of LNM (Fig. 3). Calibration curves of the model in the two datasets are shown in Fig. 4. With C-indices of 0.710 and 0.693 in the primary and validation cohorts, respectively, the nomogram displayed good discrimination in both cohorts and outperformed the four-gene panel alone (p < 0.01). Similarly, the nomogram also exhibited good predictive potential in patients with T1-2 disease, with a C-index of 0.755 in the early-stage cohort.
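A nomogram of this kind is usually a graphical rendering of a logistic regression model: each predictor's contribution is rescaled to a points axis, and the total points map to a predicted probability. The Python sketch below shows that general idea under stated assumptions. The intercept, coefficients, predictor names, value ranges and patient values are all hypothetical placeholders, not the fitted model from this study, and the convention that the predictor with the largest effect spans 0 to 100 points is assumed.

```python
import math
from typing import Dict, Tuple


def predicted_probability(intercept: float,
                          coefs: Dict[str, float],
                          values: Dict[str, float]) -> float:
    """Logistic model: P = 1 / (1 + exp(-(intercept + sum(beta * x))))."""
    lp = intercept + sum(beta * values[name] for name, beta in coefs.items())
    return 1.0 / (1.0 + math.exp(-lp))


def nomogram_points(coefs: Dict[str, float],
                    ranges: Dict[str, Tuple[float, float]],
                    values: Dict[str, float]) -> Dict[str, float]:
    """Rescale each predictor's contribution so the widest effect spans 0-100 points."""
    spans = {name: abs(beta) * (ranges[name][1] - ranges[name][0])
             for name, beta in coefs.items()}
    max_span = max(spans.values())
    if max_span <= 0:
        raise ValueError("at least one predictor must have a non-zero effect")
    points = {}
    for name, beta in coefs.items():
        low, high = ranges[name]
        # The value with the lowest contribution scores 0 points:
        # the lower bound for positive coefficients, the upper bound otherwise.
        reference = low if beta >= 0 else high
        points[name] = 100.0 * beta * (values[name] - reference) / max_span
    return points


# HYPOTHETICAL two-predictor model (NOT the seven-variable nomogram of the study).
intercept = -3.0
coefs = {"gene_panel_score": 0.5, "t_stage": 1.0}
ranges = {"gene_panel_score": (0.0, 4.0), "t_stage": (1.0, 4.0)}
patient = {"gene_panel_score": 2.0, "t_stage": 3.0}

points = nomogram_points(coefs, ranges, patient)
total_points = sum(points.values())
prob = predicted_probability(intercept, coefs, patient)
print(points, round(total_points, 1), round(prob, 3))
```

The points scale is only a linear transformation of the model's linear predictor, so reading a probability from the nomogram and evaluating the logistic formula directly give the same answer.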

Discussion
In this study, we performed a comprehensive RNA sequencing-based gene expression profiling analysis of ESCC patients with and without LNM to establish a gene expression panel for identification of LNM. After extracting the communal LNM-associated gene signatures, we evaluated the robustness of our gene-panel-based nomogram in the two available public datasets. We demonstrated that the nomogram was valuable for identifying patients with LNM in both the primary and validation cohorts.
From a clinical standpoint, LN status is vitally important for therapeutic decision-making, especially in ESCC patients with relatively early T stage. In the absence of LN metastasis, Tis and T1 ESCC patients can be successfully treated with endoscopic resection (ER) (28). However, if the tumor is likely to involve LNs, esophagectomy with LN dissection is often required to achieve better curability. Even for cases that are apparently not suitable for ER, more precise information on LN status can contribute significantly to decision-making regarding neoadjuvant treatment or definitive chemoradiation, as recommended by the National Comprehensive Cancer Network Guidelines. Therefore, the availability of robust biomarkers that facilitate categorization of patients with ESCC based on LN status will permit personalized treatment.
Although research seeking robust biomarkers for predicting LNM has been ongoing over the past decades, few studies have reported gene panels associated with LN status (3,29). No single gene is reliable enough to identify LNM, since its expression level can be affected by many confounding factors. Notably, recent advances in whole-genome and transcriptome sequencing have allowed us to characterize the associations between LNM and gene profiles from a new perspective. In recent years, attention has been focused on LNM prediction models for gastrointestinal tumors (30). However, a comprehensive prediction model of LNM for ESCC is still lacking. Our nomogram, derived from transcriptomic data analyses, suggests that integrating a gene panel with clinical variables might be a promising approach to predicting LN status.
Admittedly, there are several limitations in our study. First, not all patients underwent neoadjuvant therapy followed by esophagectomy, which may have influenced the effectiveness of the gene signature. Second, our nomogram could not provide relevant information on the location of the metastatic lymph nodes. Notably, a previous study reported that the prognosis of patients with LN involvement limited to the peritumoral areas was significantly better than that of patients with more distant LN involvement (17). In addition, there are differences in the expression data between the GEO and TCGA datasets. More specifically, the GEO dataset employs gene chip technology to detect gene expression in patient tissues, whereas TCGA uses RNA sequencing technology, which may introduce potential errors.

Conclusion
In conclusion, our gene-panel-based nomogram can help predict LNM status in ESCC patients and may help determine the suitability of ER or esophagectomy for these patients.