Baseline information and study design
The baseline characteristics of the two datasets are summarized in Table 1. A total of 179 ESCC samples from the GEO database and 93 samples from the TCGA database were included in our study. The GEO cohort comprised 96 patients with LNM and 83 patients without LNM, while the TCGA cohort comprised 38 patients with LNM and 55 without LNM. The early-stage cohort comprised 34 patients with LNM and 41 patients without LNM (Supplementary Table 2). The design of our study is illustrated as a flow chart in Fig. 1.
Table 1
Baseline Characteristics of Patients with ESCC in each database

| | GEO database (n = 179) | | | TCGA database (n = 93), validation | | |
| --- | --- | --- | --- | --- | --- | --- |
| | LN(+) | LN(-) | P value | LN(+) | LN(-) | P value |
| Age, years | | | 0.503 | | | 0.024 |
| ≤ 65 | 72 | 58 | | 34 | 38 | |
| > 65 | 24 | 25 | | 4 | 17 | |
| Sex | | | 0.565 | | | 0.264 |
| Female | 16 | 17 | | 4 | 11 | |
| Male | 80 | 66 | | 34 | 44 | |
| Race | | | NA | | | 0.856 |
| White | 0 | 0 | | 18 | 25 | |
| Non-white | 96 | 83 | | 20 | 30 | |
| Tumor location | | | 0.025 | | | 0.156 |
| Upper thoracic | 13 | 7 | | 4 | 2 | |
| Middle thoracic | 43 | 54 | | 14 | 30 | |
| Lower thoracic | 40 | 22 | | 20 | 23 | |
| T stage | | | 0.720 | | | 0.202 |
| T1-T2 | 22 | 17 | | 13 | 27 | |
| T3-T4 | 74 | 66 | | 25 | 28 | |
| Differentiation | | | 0.455 | | | 0.080 |
| Well differentiated; G1 | 16 | 16 | | 2 | 14 | |
| Moderately differentiated; G2 | 50 | 48 | | 24 | 25 | |
| Poorly differentiated; G3 | 30 | 19 | | 8 | 11 | |
| Unknown; Gx | 0 | 0 | | 4 | 5 | |

Abbreviations: LN, lymph node; NA, not applicable.
Identification of communal DEGs
The gene expression profiles of patients with and without LNM were compared to screen for DEGs. In the GEO cohort, 3524 DEGs were identified from 32059 transcripts (FDR < 0.05, |logFC| > 1, p < 0.05), of which 1082 were up-regulated and 2442 were down-regulated. In the TCGA cohort, 82 DEGs were found (|logFC| > 1, p < 0.05), of which 5 were up-regulated and 77 were down-regulated. Heat maps and volcano plots show the expression levels and the direction of regulation of the DEGs in the GEO cohort (Suppl Fig. 1A-B) and the TCGA cohort (Suppl Fig. 1C-D). A Venn diagram shows the intersection of DEGs between the two datasets (Suppl Fig. 2). The expression levels of the 11 communal DEGs (ALG3, AP2M1, CPOX, CRHR2, LMLN, MAP6D1, MRPL47, PARL, PSMD2, SLC15A2, SMYD3) in the two datasets are shown as heat maps in Fig. 2A-B.
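The screening and intersection steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `GeneStat` record, the function names and the toy values are hypothetical, and the sketch assumes that the intersection is taken on gene identifiers regardless of the direction of regulation, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class GeneStat:
    """Per-gene differential expression result (hypothetical record)."""
    gene: str
    log_fc: float   # log fold change, LNM vs. non-LNM
    p_value: float  # p value (or adjusted p value) for the comparison


def screen_degs(
    stats: Iterable[GeneStat],
    lfc_cutoff: float = 1.0,
    p_cutoff: float = 0.05,
) -> Tuple[Set[str], Set[str]]:
    """Return (up-regulated, down-regulated) genes passing both cutoffs."""
    up: Set[str] = set()
    down: Set[str] = set()
    for s in stats:
        if s.p_value >= p_cutoff:
            continue  # not significant
        if s.log_fc > lfc_cutoff:
            up.add(s.gene)
        elif s.log_fc < -lfc_cutoff:
            down.add(s.gene)
    return up, down


def communal_degs(cohort_a: Iterable[GeneStat], cohort_b: Iterable[GeneStat]) -> Set[str]:
    """Genes called as DEGs in both cohorts (direction ignored: an assumption)."""
    up_a, down_a = screen_degs(cohort_a)
    up_b, down_b = screen_degs(cohort_b)
    return (up_a | down_a) & (up_b | down_b)


# Toy input with made-up values, for illustration only.
geo = [
    GeneStat("GENE_A", 1.8, 0.001),
    GeneStat("GENE_B", -1.4, 0.010),
    GeneStat("GENE_C", 0.3, 0.020),   # fails the |logFC| cutoff
    GeneStat("GENE_D", -2.1, 0.200),  # fails the p cutoff
]
tcga = [
    GeneStat("GENE_A", 1.2, 0.030),
    GeneStat("GENE_B", -0.5, 0.040),  # fails the |logFC| cutoff
    GeneStat("GENE_D", -1.6, 0.010),
]
print(sorted(communal_degs(geo, tcga)))  # ['GENE_A']
```

Only genes that pass the thresholds in both cohorts survive, which is why a large DEG list in one cohort and a short one in the other can leave a small intersection.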
Identification of gene signatures associated with LNM and development of a four-gene panel
As shown in Table 2, among the 11 communal genes, ALG3 (odds ratio (OR), 0.772; 95% CI: 0.66–0.904; p = 0.0015), CPOX (OR, 1.155; 95% CI: 1.009–1.322; p = 0.038), LMLN (OR, 0.861; 95% CI: 0.754–0.983; p = 0.028) and PSMD2 (OR, 1.64; 95% CI: 1.186–2.267; p = 0.003) were associated with the presence of LNM in multivariate logistic regression analysis. The expression sites, biological functions and KEGG pathways of these four genes, together with the AUC of each gene for predicting LNM, are shown in Supplementary Table 1. The absolute pairwise Pearson correlation coefficients among the four genes were calculated to assess their independence (Suppl Fig. 3A). A four-gene panel was then built to predict LNM. Surprisingly, the ROC curve indicated unsatisfactory performance, with an AUC of only 0.547 (Suppl Fig. 3B).
Table 2
Logistic regression analysis for intersection genes in ESCC tissues with and without LNM

| Gene | Multivariate | |
| --- | --- | --- |
| | OR (95% CI) | P-value |
| ALG3 | 0.772 (0.66–0.904) | 0.00155 |
| AP2M1 | 0.812 (0.621–1.06) | 0.12746 |
| CPOX | 1.155 (1.009–1.322) | 0.03859 |
| CRHR2 | 1.001 (0.897–1.117) | 0.98875 |
| LMLN | 0.861 (0.754–0.983) | 0.02840 |
| MAP6D1 | 0.941 (0.827–1.071) | 0.35660 |
| MRPL47 | 1.191 (0.993–1.52) | 0.16198 |
| PARL | 0.991 (0.781–1.257) | 0.93986 |
| PSMD2 | 1.64 (1.186–2.267) | 0.00317 |
| SLC15A2 | 0.992 (0.934–1.054) | 0.80537 |
| SMYD3 | 0.933 (0.883–1.045) | 0.23428 |

Abbreviations: OR, odds ratio; CI, confidence interval.
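The odds ratios, confidence intervals and p values in Table 2 are linked through the fitted logistic coefficients. As a rough consistency check, and assuming Wald-type 95% intervals (the text does not state how the intervals were computed), the coefficient, its standard error and an approximate two-sided p value can be recovered from a reported OR and CI. The function name below is hypothetical.

```python
import math

# Standard normal quantile for a two-sided 95% interval.
Z_95 = 1.959963984540054


def wald_from_or(odds_ratio: float, ci_low: float, ci_high: float):
    """Recover (beta, standard error, two-sided Wald p) from an OR and its 95% CI.

    Assumes the interval was built as exp(beta +/- Z_95 * se).
    """
    beta = math.log(odds_ratio)  # logistic coefficient is the log odds ratio
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return beta, se, p


# ALG3 row of Table 2: OR 0.772, 95% CI 0.66 to 0.904.
beta, se, p = wald_from_or(0.772, 0.66, 0.904)
print(f"beta={beta:.3f}, se={se:.3f}, p={p:.4f}")
```

Because the reported values are rounded, the recovered p value is only expected to be of the same order as the tabulated one, not identical. An OR whose interval lies entirely on one side of 1 corresponds to p < 0.05 under this approximation.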
Construction and validation of a gene-panel-based nomogram for predicting LNM
To improve the predictive ability of the four-gene panel, we developed a nomogram integrating clinical variables with the four communal DEGs. Previous studies have identified clinicopathologic factors, including T stage, G stage and tumor location, as predictors of the risk of LNM in patients with ESCC (14, 21–25), and some studies have constructed LNM prediction models based on these clinical variables (26, 27). Ultimately, a nomogram consisting of 7 variables (the 4 DEGs, T stage, G stage and tumor location) was established to predict the probability of LNM (Fig. 3). Calibration curves for the model in the two datasets are shown in Fig. 4. With C-indices of 0.710 and 0.693 in the respective datasets, the nomogram displayed good discrimination in both the primary and validation cohorts and outperformed the four-gene panel alone (p < 0.01). The nomogram also exhibited good predictive potential in patients with T1-2 disease, with a C-index of 0.755 in the early-stage cohort.
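For a binary outcome such as LNM status, the C-index equals the area under the ROC curve: the probability that a randomly chosen patient with LNM receives a higher predicted probability than a randomly chosen patient without LNM, with ties counted as one half. A minimal sketch of this definition follows; the function name and the toy predictions are hypothetical and are not data from the study.

```python
from typing import Sequence


def c_index_binary(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Concordance index for a binary outcome (1 = LNM, 0 = no LNM)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0   # positive case ranked above negative case
            elif p == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(pos) * len(neg))


# Toy predictions, for illustration only.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
print(round(c_index_binary(labels, scores), 3))  # 5 of 6 pairs concordant: 0.833
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported AUC and C-index values are read.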