Identification of DEGs and robust DEGs.
In the present study, we conducted a systematic analysis of the biological characteristics of DEGs from nine GEO datasets (Table 1). The overall study design was illustrated in Fig. 1. A total of 665 tissue samples including 343 ESCC and 322 normal tissues were analyzed. According to the cutoff criteria of | log2 FC| > 2 and adjusted P < 0.05, 226 DEGs in GSE17351, 219 DEGs in GSE20347, 389 DEGs in GSE29001, 108 DEGs in GSE38129, 692 DEGs in GSE45670, 686 DEGs in GSE53625, 387 DEGs in GSE70409, 223 DEGs in GSE75241 and 147 DEGs in GSE161533 were identified. Among the DEGs, 110, 56, 168, 38, 249, 204, 115, 124 and 57 genes were upregulated while 116, 163, 221, 70, 443, 482, 272, 99 and 90 genes were downregulated in GSE17351, GSE20347, GSE29001, GSE38129, GSE45670, GSE53625, GSE70409, GSE75241 and GSE161533, respectively. For visualizing distributions of DEGs, the volcano plots (Fig. 2A-I) and heat maps (Supplementary Fig. S1) were drawn. Based on the RRA algorithm results, a total of 152 robust DEGs were determined, including 54 upregulated and 98 downregulated genes (Supplementary Table S1). According to the adjusted P of robust DEGs, the top 20 upregulated and downregulated robust DEGs were shown in a heat map (Fig. 2J).
GO and KEGG enrichment analysis of robust DEGs
To explore the biological classification of the 152 robust DEGs in ESCC, we used the “clusterprofiler” R package for GO and KEGG pathway enrichment analysis. GO enrichment analysis in the category biological process suggested that the robust DEGs were mainly accumulated in “extracellular matrix organization”, “extracellular structure organization” and “skin development” (Fig. 3A). In the cellular component category, the robust DEGs were mainly enriched in “collagen-containing extracellular matrix”, “apical part of cell” and “endoplasmic reticulum lumen” (Fig. 3B). In the molecular function category, the robust DEGs were mainly involved in “endopeptidase activity”, “receptor ligand activity” and “signaling receptor activator activity” (Fig. 3C). KEGG pathway analysis indicated that the robust DEGs were mainly related to “IL-17 signaling pathway”, “Cytokine-cytokine receptor interaction” and “Viral protein interaction with cytokine and cytokine receptor” (Fig. 3D). The above results suggested that robust DEGs were positively associated with cancer cell development.
Immune Cell Infiltration Characteristics
Infiltrating cells of the immune system in the tumor microenvironment (TME) are accepted to be generic constituents of tumors . The CIBERSORT algorithm was used to analyze the immune cell infiltration of 665 samples, and the immune infiltration results were filtered with P <0.05 as the standard, then the proportions of 22 immune cells in 149 ESCC samples and 54 normal tissue samples were obtained (Fig. 4A). The heat map (Fig. 4B) and violin diagram (Fig. 4C) further provided visualization of the differences in immune cell distribution between the ESCC and normal samples. The results showed that 7 immune cells (naïve CD4+T cells, activated memory CD4+T cells, follicular helper T cells, resting NK cells, M0 macrophages, M1 macrophages and activated dendritic cells) were in a higher proportion in the ESCC tissues than those in the normal tissues, whereas 6 immune cells (naïve B cells, resting memory CD4+T cells, gamma delta T cells, M2 macrophages, resting dendritic cells and resting mast cells) were in a higher proportion in the normal tissues. As demonstrated in the principal component analysis (PCA) results (Fig. 4D), ESCC and normal samples can be roughly distinguished using the 22 immune cells.
PPI network construction and hub genes identification
To further study the interaction of the 152 robust DEGs, we construct a PPI network using STRING database and Cytoscape software with a combined score >0.4 as the cutoff criterion. As shown in Fig. 5A, the PPI network covered 91 nodes and 304 edges, including 45 upregulated genes and 46 downregulated genes. Subsequently, the cytoHubba plugin was used to calculate the scores of topological algorithms in each node. The top 50 scoring genes calculated by each of the 12 algorithms (MCC, DMNC, MNC, Degree, EPC, BottleNeck, EcCentricity, Closeness, Radiality, Betweenness, Stress and ClusteringCoefficient) were intersected to obtain hub genes (Fig. 5B). The nine hub genes were cytidine deaminase (CDA), chemokine ligand 1(CXCL1), insulin-like growth factor binding protein 3(IGFBP3), matrix metallopeptidase 3(MMP3), matrix metallopeptidase 11(MMP11), plasminogen activator, urokinase (PLAU), Serpin Family E Member 1(SERPINE1), Secreted Phosphoprotein 1(SPP1) and Versican (VCAN). The mRNA expression of 9 hub genes were validated using the Gene Expression Profiling Interactive Analysis (GEPIA) database. As demonstrated in Fig. 6A-I, the mRNA expression levels of hub genes were markedly upregulated in Esophageal carcinoma (ESCA) tissues compared to those in normal tissues (P < 0.01). Moreover, ROC curves were generated to verify the diagnostic performance of nine genes based on the GSE53625 database. The area under the curve (AUC) of CDA, CXCL1, IGFBP3, MMP3, MMP11, PLAU, SERPINE1, SPP1 and VCAN were 0.8816, 0.8303, 0.9627, 0.9462, 0.9975, 0.9822, 0.9344, 0.9890 and 0.9454, respectively (Fig. 7A-I). Kaplan-Meier survival curves showed that high expression of PLAU (P < 0.001), SPP1 (P = 0.024) and VCAN (P = 0.031) were significantly correlated with poor prognosis (Fig. 7J-L).
Survival model construction and analysis
To investigate the prognostic significance of 152 robust DEGs, a total of 17 survival-related genes (P < 0.05) were identified by the Univariate Cox analysis (Table 2). After selecting the most suitable combination of candidate genes by multiple stepwise Cox regression, seven genes including Interleukin 18 (IL18), Plasminogen Activator, Urokinase (PLAU), Anoctamin 1 (ANO1), Solute Carrier Organic Anion Transporter Family Member 1B3 (SLCO1B3), Cystatin SN (CST1), Neural EGFL Like 2 (NELL2) and MAGE Family Member A11 (MAGEA11) were used to build a prognostic model (Table 3). The risk score of each patient was calculated according to the following formula: (－0.2232 × ExpIL18) + ( 0.4659 × ExpPLAU) + (0.1876 × ExpANO1) + (－0.0921 × ExpSLCO1B3) + ( 0.1844 × ExpCST1) + (－0.1203 × ExpNELL2) + (－0.1501 × ExpMAGEA11). 179 ESCC patients in GSE53625 were divided into a low-risk group and a high-risk group according to the median risk score. Kaplan-Meier curve demonstrated that the prognosis of low-risk group was significantly better than that of the high-risk group (P = 1.979e-08) (Fig. 8A). The time-dependent ROC curve was plotted, and the fact that the AUC of the risk score was 0.777 for a 5-year survival prediction indicates the exactitude of this model (Fig. 8B). As shown in Fig. 8C, the expression heat map of 7 prognostic genes was profiled. Cox regression analysis was carried out to demonstrate whether the model can be an available independent prognostic indicator. Univariate Cox regression analysis showed that N staging (P < 0.001), tumor stage (P < 0.001) and risk score (P < 0.001) were significantly correlated with prognosis (Fig. 8D). Multivariate Cox regression analysis showed that only the risk score (P< 0.001) was significantly correlated with prognosis (Fig. 8E). To better predict the prognosis of patients with ESCC at 1, 3, and 5 years after esophagectomy, we integrated the seven genes signature to establish a nomogram (Fig. 8F). A higher total point indicates a lower overall survival.