Development of a Novel Prognostic Model for Esophageal Squamous Cell Carcinoma: Insights into Immune Cell Interactions and Drug Sensitivity

Abstract Esophageal squamous cell carcinoma (ESCC) presents a five-year survival rate below 20%, underscoring the need for improved prognostic markers. Our study analyzed ESCC-specific datasets to identify consistently differentially expressed genes. A Venn analysis followed by gene network interactions revealed 23 key genes, from which we built a prognostic model using the COX algorithm (p = 0.000245, 3-year AUC = 0.967). This model stratifies patients into risk groups, with high-risk individuals showing worse outcomes and lower chemotherapy sensitivity. Moreover, a link between risk scores and M2 macrophage infiltration, as well as significant correlations with immune checkpoint genes (e.g., SIGLEC15, PDCD1LG2, and HVCR2), was discovered. High-risk patients had lower Tumor Immune Dysfunction and Exclusion (TIDE) values, suggesting potential responsiveness to immune checkpoint blockade (ICB) therapy. Our efficient 23-gene prognostic model for ESCC indicates a dual utility in assessing prognosis and guiding therapeutic decisions, particularly in the context of ICB therapy for high-risk patients.


Introduction
Esophageal cancer (EC) is ranked as the eighth most common cancer and the sixth leading cause of cancer-associated mortality, with a low fiveyear overall survival rate (1).Base on pathological characteristics, EC is typically categorized into Esophageal Squamous Cell Carcinoma (ESCC) and Esophageal Adenocarcinoma (EAC).The characteristics and causes of EC may vary by regions or races.In China, ESCC accounts for 90-95% of all EC incidences (2).Smoking, alcohol consumption, consumption of overheated water, and other environmental factors are recognized as key risk elements for ESCC (3).Recent advancements in surgical techniques and the use of neoadjuvant chemotherapy have significantly improved the prognosis for patients with ESCC.However, as most patients with EC are diagnosed in the late stages, the five-year survival rate for patients with ESCC still remains less than 20% (4).Therefore, the development of molecular markers that can guide the prognosis and treatment plan for ESCC patients is of great clinical significance.
The rapid progression of bioinformatics in recent times has significantly propelled the construction of multifaceted gene signatures, drawing from intricate patterns of gene expression.These signatures hold promise for revolutionizing molecular subtyping and forging novel therapeutic avenues for a myriad of tumors (5).Amidst these scientific strides, EC research lags behind, with the paucity and inadequacy of comprehensive datasets being a considerable obstacle.Scholars in existing investigations have delved into a plethora of datasets, scrutinizing genes intertwined with pivotal pathways, such as ferroptosis (6,7), epithelial-mesenchymal transition (8), and autophagy-related genes (9).Despite these efforts, challenges including data scarcity, suboptimal dataset quality, and incomplete methodological frameworks have dampened the potential of these gene signatures.As a result, their predictive prowess, often encapsulated by areas under the curve (AUC) floating around the 0.7 mark, fails to offer the much-needed discriminative power in forecasting outcomes for ESCC patients.In essence, the gene signatures conceived to date have yet to reach a level of clinical utility that justifies their adoption in mainstream practice.
In our study, we utilized multiple ESCC datasets encompassing paired adjacent non-cancerous and cancerous tissues.We analyzed their common differential genes, and based on this, we established a multi-gene marker composed of core genes, which we subsequently optimized.Finally, through a wide variety of analyses, we unraveled the inherent molecular mechanisms through which these gene markers can efficiently predict the prognosis of ESCC patients.

Data acquisition
The available datasets related to ESCC are mainly obtained from the Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo/) and The Cancer Genome Atlas (TCGA, www.cancer.gov/)databases.The GSE22954, GSE77861, and GSE20347 datasets are obtained from the GEO database.The ESCC dataset with prognosis information is sourced from the TCGA database.The inclusion criteria for data sets are: (a) Having a sufficient number of ESCC samples; (b) Having paired adjacent cancer tissue and healthy tissue samples; (c) Data should preferably come from the same sequencing platform to reduce systematic differences; (d) Samples should preferably come from multiple centers and be representative; (e) Samples should contain detailed clinical data.

Acquisition of differentially expressed genes
We obtained RNA-sequencing expression (Level 3) profiles and associated clinical data for ESCC from the GEO and TCGA datasets.The limma package in R software was used to analyze the differentially expressed mRNA.The threshold for differential expression of mRNAs was defined as "Adjusted P < 0.05 and Log2 (Fold Change) >1 or Log2(Fold Change)< −1."

Analysis of shared differentially expressed genes
To the identification of shared differentially expressed genes across multiple gene sets, we employed VEN analysis.The construction of Venn diagrams was performed using the JVENN tool (10).

Pathway and process enrichment analysis
Pathway and process enrichment analysis were performed at Metascape (metascape.org)(11).
The enrichment analysis for each pathway and process was executed using given gene lists and the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, CORUM, WikiPathways, and PANTHER Pathway.The enrichment background used all genes in the genome.Terms gathered included those with a p value of below 0.01, a minimum of 3 counts, and an enrichment factor above 1.5, which is the ratio of the observed counts to the expected counts predicted by chance.These terms were then organized into clusters based on their mutual traits.p Values were derived from the cumulative hypergeometric distribution, whilst q values were determined using the Benjamini-Hochberg procedure to account for any multiple testings (12).The similarity metric for the hierarchical clustering of the enriched terms applied Kappa scores, considering sub-trees with a similarity above 0.3 as a cluster.
The most statistically significant term within a cluster was selected to represent that cluster.

Pathways network analysis
To better understand the links between the terms, we have chosen a subset of enriched terms and displayed them in a network plot.Here, terms with a similarity of over 0.3 are linked by edges.
From each of the 20 clusters, we have specifically chosen the terms with the top p values.However, we have ensured that each cluster does not exceed 15 terms and the total term count does not surpass 250.The network visualization was achieved using Cytoscape.

Genes relative transcription factor analysis
The transcription factor analysis related to differentially expressed genes was performed using TRRUST online analysis (13).

Protein-protein interaction enrichment analysis
For each given genes, the Molecular Complex Detection (MCODE) algorithm10 were applied to identify densely connected network components (14).

Correlation analysis of genes
The genes correlation map is realized by the R software package ggstatsplot, and the multi-gene correlation pheatmap is displayed by the R software package (15).Spearman's correlation analysis to describe the correlation between quantitative variables without a normal distribution.p Values less than 0.05 were considered statistically significant ( � p < 0.05).

mRNA expression of core genes
RNA-sequencing expression (Level 3) profiles and corresponding clinical information for ESCC were downloaded from the TCGA dataset (https://portal.gdc.com).All the analysis methods and R package were implemented by R version 4.0.3(16).

Establishment of a prognostic model
RNA-sequencing expression (Level 3) profiles and corresponding clinical information for xx were downloaded from the TCGA dataset (https://portal.gdc.com).Converting counts data to TPM and normalizing the data log2 (TPM þ 1), keeping samples with clinical information at the same time.Indicted samples for subsequent analysis.Log-rank test was used to compare differences in survival between these groups.The time-ROC version 0.4 analysis was used to compare the predictive accuracy of genes and risk score.Multivariate cox regression analysis was used to construct a prognostic model, and the R package survival used for the analysis (17).

Analysis of the correlation between gene expression and the amount of immune cells in the tumor
TRNA-sequencing expression (Level 3) profiles and corresponding clinical information for ESCC were downloaded from the TCGA dataset (https://portal.gdc.com).The R software GGSTATSPLOT package was used to draw the correlations between gene expression and immune score (18), the R software heatmap package was used to draw multi-gene correlation.Used Spearman's correlation analysis to describe the correlation between quantitative variables without a normal distribution.p Values less than 0.05 were considered statistically significant ( � p < 0.05).

Network analysis of the correlation between the prognostic model and immune cells
To assess the reliable results of immune score evaluation, we used immuneeconv.It's an R software package that integrates six latest algorithms, including TIMER, xCell, MCP-counter, CIBERSORT, EPIC, and quanTIseq.These algorithms had been benchmarked, each had a unique advantage.Analysis and visualization are performed by the R software ggClusterNet package (https://doi.org/10.1002/imt2.32)(19).

Prediction of drug sensitivity
RNA-sequencing expression (Level 3) profiles and corresponding clinical information for ESCC were downloaded from the TCGA dataset (https://portal.gdc.com).Predicted the chemotherapeutic response for each sample based on the largest publicly available pharmacogenomics database the Genomics of Drug Sensitivity in Cancer (GDSC), https://www.cancerrxgene.org/.The prediction process was implemented by R package "pRRophetic".The samples' half-maximal inhibitory concentration (IC50) was estimated by ridge regression.All parameters were set as the default values using the the batch effect of combat and tissue type of all tissues, and the duplicate gene expression was summarized as mean value (20).

Prediction of patients' sensitivity to immune checkpoint blockade (ICB) therapy
Potential ICB response was predicted with TIDE algorithm (21).

Statistics
In this article, comparisons between two groups of data were performed using the T-test, while comparisons among multiple groups were conducted using one-way ANOVA.� p < 0.05, �� p < 0.01.The statistical analysis in this study was performed using GraphPad Prism version 9 (GraphPad Software, Inc, La Jolla, CA).

Identification of signature differentially expressed genes in esophageal squamous cell carcinoma
EC has a very poor prognosis, but there is currently no effective prognostic evaluation model in clinical practice.We attempted to identify genes with common differential expression in ESCC through data analysis.Four datasets with paired cancer adjacent tissues were retrieved from the GEO and TCGA databases, namely GSE22954, GSE77861, GSE20347 (GEO), and the ESCC dataset in the TCGA database (Figure 1(A)).
Through VEN analysis, we identified 119 differentially expressed genes with high homogeneity (Figure 1(B)).

Cluster analysis of differentially expressed genes
Furthermore, we performed cluster analysis on the 119 genes that showed significant differential expression in different ESCC databases using the Metascape website.In the pathway cluster analysis, the top five pathways were extracellular matrix organization, burn wound healing, PID integrin1 pathway, regulation of cell cycle process, and network map of SARS-CoV-2 signaling pathway (Figure 2(A)).The network analysis of pathway clustering revealed that extracellular matrix organization, burn wound healing, and regulation of cell cycle process pathways predominated among the pathways showing significant clustering (Figure 2(B)).Additionally, using the TRRUST module, we analyzed the transcription factors behind these differentially expressed genes.The results showed that the classical transcription factors RELA, NFKB1, and SP1 may play important roles in the expression of these genes (Figure 2(C)).The relationship between the transcription factors and the corresponding overlapped genes is listed in Table 1.

Identification of core genes
Using the MCODE module, we constructed a network diagram illustrating the interactions among the differentially expressed genes (Figure 3).This network diagram provides a comprehensive depiction of the pathways associated with these genes.Referring to the STING database and conducting a joint analysis, we identified the core genes with a connectivity score of 10 or higher (The core genes identified with a cutoff of 10 can achieve the best prognosis prediction effect).The core genes include FN1, ACVR1, COL1A2, COL3A1, CDC45, CDK1, CDC6, CDKN3, CENPF, COL4A1, COL4A2, CXCL1, FEN1, FOXM1, CDH11, CKS1B, HMMR, MMP1, MMP2, KIF14, COL5A2, IGFBP3, and LUM.

Assessment of expression and correlation of core genes
To depict the correlation of gene expression, we conducted an analysis using the TCGA database's ESCC dataset.First, we generated a correlation heatmap (Figure 4(A)).The heatmap analysis revealed that these genes could be divided into two subgroups, exhibiting high correlation within each subgroup.One subgroup is associated with cell proliferation regulation, while the other subgroup is related to extracellular matrix-associated proteins.The mRNA expression of these genes was significantly upregulated in cancer tissues compared to adjacent non-cancerous tissues (Figure 4(B)).

Development of a prognostic model based on core genes
To investigate the significance of core genes in ESCC, we utilized the TCGA dataset, which provides more clinical information.Using the Cox algorithm (When constructing a prognostic model based on gene sets, the coefficient of each gene was determined by maximum partial likelihood estimation, which is an optimization problem aimed at finding the coefficient values that maximize the model's fit to the observed data), we developed a weighted risk assessment algorithm based on the core genes.The specific formula is as follows: Risk score

Pan-cancer study to investigate the prognostic value of gene signature
In order to investigate whether the core genes identified in ESCC are applicable to multiple cancers, we analyzed the prognostic value of the ESCC core genes in various tumors using the Cox algorithm (Table 2).Among all the cancers analyzed, ESCC and ESCA had the lowest AIC values.For ESCC, the 1-year AUC was 0.78, and the 3-year AUC was 0.967.For ESCA, the 1-year, 3-year, and 5-year AUC values were 0.908, 0.881, and 1, respectively.Furthermore, in both ESCC and ESCA, there were significant differences in median survival between the high-risk and lowrisk groups when stratified based on the core genes.However, in the remaining tumors, there were no significant differences observed, unlike ESCC and ESCA.Highlighting the tumor specificity and high prognostic value of the core genes.
The heatmap depicting the relationship between cancer types and the weighted coefficients of genes is shown in Figure 6.

Gene signature and signaling pathways
To investigate the impact of core gene algorithms on cell signaling pathways or processes, we analyzed the mRNA expression differences between high-risk and low-risk patient groups after evaluating the core genes.In comparison to the lowrisk group, the high-risk group showed 663 genes with higher expression and 205 genes with lower expression.The volcano plot and heatmap of gene expression are shown in Figure 7(A,B), respectively.Furthermore, through GO clustering, we found that the highly expressed genes in the high-risk group are mainly involved in focal adhesion, ECM-receptor interaction, and the PI3K-AKT signaling pathway.

The correlation between gene signature and immune cells within ESCC
In order to further elucidate the mechanisms underlying the poor prognosis in high-risk group patients, we analyzed the correlation between each core gene and immune cells within tumor (Figure 8(A)).The analysis results indicated a significant positive correlation between the core genes and M1 macrophages, M2 macrophages and regulatory T cells.Notably, 56% of the core genes showed a significant positive correlation with M2 macrophages.Furthermore, we constructed a network diagram illustrating the correlation between the risk coefficient and immune cells as well as the correlation between immune cells themselves (Figure 8(B)).The analysis results suggested that our prognostic model exhibited the most significant correlation with the quantity of M2 macrophages in the tumor microenvironment.The network analysis also revealed a significant negative correlation between the quantity of follicular helper T cells and the quantity of M2 macrophages.

Analysis of the relationship between gene signature and chemotherapy resistance
After analyzing the association between core genes and the prognosis of ESCC, as well as the relevant molecular mechanisms, we were interested in the potential benefits of frontline chemotherapy in patients of the high-risk group.The analysis results implied that tumors in high-risk group have higher IC50 values for cisplatin, suggesting that the benefit of standard cisplatin treatment may be compromised in

Analysis of the relationship between gene signature and immunotherapy
Previous analyses indicated that high-risk patients might not derive significant benefits from standard chemotherapy.Consequently, we assessed the correlation between the expression of core genes and common immune checkpoint genes (Figure 10(A)).A notable positive correlation was revealed between core genes and checkpoint relative genes SIGLEC15, PDCD1LG2, and HVCR2.Further analysis also shows that tumors in high-risk groups possess lower TIDE values, suggesting that immunotherapy could potentially offer improved outcomes for patients in the high-risk group (Figure 10(B)).

Discussion
In our investigation, we conducted an extensive examination of multiple datasets to identify a set of genes that were differentially expressed consistently.Our pathway network analysis revealed strong enrichment of genes related to extracellular matrix organization and cell cycle control, both of which are critical to tumor growth, dissemination, and overall patient prognosis    (Figure 2(B)).These genes represent vital oncogenic processes common to many cancers (22).Delving into transcription factors, we uncovered putative relationships between gene expression patterns and a host of transcription factors, including RELA, NFKB1, SP1, HDCA1, E2F1, and HIF1.Notably, RELA and NFKB1, NFKB family members, are central to inflammatory response and pathogenesis, particularly in EC, which is often accompanied by chronic inflammation (23) -a finding supported by studies in murine models (24) and clinical samples (25).
Various factors, such as diets with excessively hot foods, alcohol-induced mucosal injury (26), and esophageal bacterial colonization due to poor oral health (27,28), may contribute to chronic inflammation and are linked with genes regulated by RELA and NFKB1, such as SERPINE1, MMP1, MMP3, and CXCL1, known to facilitate tumor metastasis (29).While SP1 appears to influence similar target genes, its prognostic value and role in EC require further study.Through network analysis, we identified central genes with a resemblance to those implicated in transcription factor analysis.This agreement underscores the crucial role of transcription factors, such as SP1 in ESCC, warranting deeper investigation from this network.The analysis of gene mRNA expression reveals that all selected genes are significantly up-regulated in tumor tissues compared to healthy esophageal tissue.Subsequently, we employed the Cox model, obtaining a prognostic model from the TCGA-ESCC dataset that can effectively predict patient outcomes.Although some studies have reported on gene-set-based prognostic models, their AUC values are not high, possibly due to their method of selectively using certain gene classes to build their prognostic models.Our model, while not incorporating some renowned genes associated with cancer prognosis like ARUKA and NEK2 (30), nevertheless achieved the best result reported until now, with a 3-year AUC value of 0.98.
Our investigation into the tumor microenvironment highlighted the role of macrophages, particularly the M2 subtype, which increases during EC progression and is involved in modulating the tumor-supporting environment (31).As reported in many established gene set model studies related to tumor prognosis, increased infiltration of M2 macrophages is commonly associated with higher risk factors and poor patient outcomes (32).Our network analysis corroborated this link, revealing that nearly half of the core genes exhibited a significant positive correlation with M2 macrophage presence.In fact, the number of M2 macrophages gradually increases in the early stages of EC, and it can be clearly observed even at the hyperplasia stage (33,34).The long-term chronic inflammation leads to an imbalance in the immune microenvironment of EC (35), which is consistent with what we observed in our network analysis (Figure 8(B)).Recent studies also showed that M2 polarization of macrophages can enhance the expression of PD-L2 in tumor-associated macrophages, leading to immune evasion and tumor promotion through the PD-1 pathway.The blockade of the CCL2-CCR2 axis can significantly hinder the recruitment of tumor-associated macrophages, resulting in a positive response to treatment (36).In our model, we observed a strong correlation between the core genes and PD1L2 expression (Figure 10(A)).In addition, we noticed another checkpoint gene, SIGLEC-15 (37).This suggests that immunotherapy may yield better benefits in the treatment of high-risk patients.The prediction of ICB response based on TIDE supports our hypothesis (Figure 10(B)).
In contrast, predictions related to conventional chemotherapy responsiveness in high-risk patients indicate a lack of sensitivity.Yet, metformin, known for its diverse mechanisms of action, proved highly sensitive in the high-risk group, potentially due to its effects on tumor metabolism, microenvironment, and immune modulation -including the reversion of M2 macrophage polarization (38), and reduction of myeloidderived suppressor cells and M2 macrophages in tumors (39).The potential combination of metformin with anti-PD-1/PD-L1 antibodies holds promise for more effective patient treatment strategies (40).
Our research has made progress in understanding gene expression in EC and its relationship with the tumor microenvironment; however, there are limitations including dependence on limited data sources, lack of experimental validation, insufficient clinical verification, and a need for a deeper understanding of the complexity of the microenvironment.Future work will focus on gathering a broader range of data, strengthening laboratory validation of key findings, and testing our predictive models and therapeutic strategies through clinical trials.This will help to further decipher tumor biology and provide more precise treatment options for patients with EC.
In conclusion, our study provides a robust genetic and molecular framework for understanding the complexities of EC and opens a new avenue for targeted therapy and personalized medicine.Our prognostic model, based on gene expression and immune checkpoint correlations, shows promise for improving the management of cancer patients, particularly with regard to the selection of those who would benefit from specific treatments such as metformin and immunotherapy.

Declaration of interest
No potential conflict of interest was reported by the author(s).
2296) � LUM.Using this algorithm, we evaluated the risk score for each patient and ranked them from low to high (Figure 5(A)).The survival status of the patients is shown in Figure 5(B).The expression of core genes in each patient is illustrated in Figure 5(C).After stratifying patients based on the risk score, we observed significantly worse prognosis in the high-risk group compared to the low-risk group (Figure 5(D), p ¼ 0.000245).The analysis of the true positive fraction demonstrated that our prognostic risk model exhibited excellent predictive performance (Figure 5(E), 3-year AUC ¼ 0.967).

Figure 1 .
Figure 1.The dataset used for bioinformatics analysis and Venn diagram analysis.(A) The datasets used and the heatmap of differentially expressed genes.(B) Venn diagram analysis of differentially expressed genes in different datasets.

Figure 2 .
Figure 2. Bioinformatics analysis of commonly differentially expressed genes.(A) Pathway and process analysis.(B) Network analysis of pathways and processes.(C) Transcription factor analysis using TRRUST database.
high-risk group (Figure9(A)).The drug sensitivity to paclitaxel (Figure9(B)), doxorubicin (Figure 9(C)), and BI.2536 (Figure 9(D)) also showcased similar characteristics across high and low-risk groups.Conversely, no significant difference in drug sensitivity was observed between the two groups with respect to the frontline chemotherapy drug 5-fluorouracil (Figure 9(E)).However, what surprises us is that compared to the low-risk group, patients in the high-risk group show strong sensitivity to metformin (Figure 9(F)).

Figure 3 .
Figure 3. Protein-protein interaction network analysis of commonly differentially expressed genes.

Figure 4 .
Figure 4. Expression and interrelationships of core genes in tumors.(A) Heatmap of the correlation between the expression of core genes, the expression correlation of genes was analyzed with Spearman; (B) Expression of core genes in adjacent non-cancerous tissues and ESCC tissues.

Figure 5 .
Figure 5. Establishment of a prognostic model based on core genes.(A) Risk score and grouping of patients; (B) Survival status of patients; (C) Heatmap of the scores of each core gene for each sample; (D) Analysis of high and low-risk patients and their prognosis; (E) Kaplan-Meier curves of the prognostic model at 1 and 3 years.

Figure 6 .
Figure 6.Heatmap of the scores of each core gene in different tumors in our prognostic model.

Figure 7 .
Figure 7. Signaling pathways associated with the prognostic model.(A) Volcano plot of differentially expressed genes; (B) Heatmap of differentially expressed genes; (C) KEGG clustering analysis of downregulated genes in the high-risk group compared to the low-risk group; (D) KEGG clustering analysis of upregulated genes in the high-risk group compared to the low-risk group. 3

Figure 8 .
Figure 8. Correlation between prognostic model and immune cells in tumors.(A) Heatmap of the correlation between multiple genes or models and immune score.The abscissa and ordinate represent genes, different colors represent different correlation coefficients, blue represents positive correlation whereas red represents negative correlation, the darker the color, the stronger the relation; (B) The heat map in the schematic represents the correlation analysis of the immune score itself, red represents positive correlation, blue represents negative correlation, the more red or blue color means the greater correlation, also the larger circle means the stronger correlation; the red line in the schematic represents the negative correlation between the model score or gene expression and the immune score, green means the positive correlation.�� p < 0.01; � p < 0.05.

Figure 10 .
Figure 10.Prediction of prognostic model and immune therapy response.(A) Correlation between core gene expression and immune checkpoint gene expression.(B) Prediction of therapeutic effect of immune checkpoint blockade (ICB) therapy in high-risk group patients and low-risk group patients.�� p < 0.01.

Table 1 .
Transcription factors associated with overlapping genes.

Table 2 .
Pan-cancer validation of gene signature.