Screening and identification of significant genes in esophageal squamous cell carcinoma by bioinformatical analysis

Background: Esophageal squamous cell carcinoma (ESCC) is one of the most common cancers with notably high incidence and mortality rates. However, the molecular mechanism underlying ESCC pathogenesis and prognosis is very complicated. The main objective of our investigation has been to obtain some knowledge of significant genes with poor outcome and their underlying mechanisms. Methods: Gene expression profiles of GSE26886, GSE23400, GSE20347 and GSE17351 were available from GEO database. The differentially expressed genes (DEGs) were identified, and function enrichment analyses were performed. The protein-protein interaction network (PPI) was constructed and the module analysis was performed using STRING and Cytoscape software. Results: A total of 105 DEGs were identified between normal esophagus and ESCC bioinformatical analysis samples. Functional annotations of the common DEGs indicate that extracellular matrix (ECM) remodeling plays a key role in tumor formation and progression.18 hub genes were identified and disease free survival analysis showed that CDKN3, RAD51AP1, KIF4A may be involved in poor prognosis in ESCC patients. Conclusions: DEGs and hub genes identified in the present study help us understand the molecular mechanisms underlying the carcinogenesis and progression of ESCC, and provide candidate targets for diagnosis and treatment of ESCC.


Background
Esophageal cancer (EC) is one of the common malignant tumors, with the sixth highest mortality and the eighth highest incidence rate worldwide [1]. There are regional differences in the incidence of esophageal cancer. In the high-risk area, stretching from Northern Iran through the central Asian republics to North-Central China, 90 percent of cases are squalors cell carcinomas (SCC) [2,3]. The typical clinical symptom of esophageal squamous cell carcinoma(ESCC) is progressive dysphagia.
Early esophageal cancers are not specifically symptomatic. Dysphagia usually occurs once the esophageal lumen diameter is less than 13mm, which indicates advanced disease and leads to a high mortality rate as well as a heavy burden on patients and their families [4]. Therefore, it is urgent to find out the core genes for clinical diagnosis and poor prognosis of esophageal squamous cell 3 carcinoma. Gene chip used for more than ten years can quickly detect differentially expressed genes and had proven to be a reliable technology [5]. It can generate many slice data and can be stored in a shared database. Therefore, a lot of valuable clues can be explored from these data for further research. Integrated bioinformatical analysis can help us further research and better explore the underlying mechanisms [6]. In this study, to identify potential biomarkers for ESCC, we chose GSE26886, GSE23400, GSE20347, and GSE17351 from Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) database. The data were downloaded and analyzed to obtain the differentially expressed genes (DEGs) between esophageal cancer tissue and non-cancerous tissue.
The Database for Annotation, Visualization and Integrated Discovery (DAVID) was utilized to analyze these DEGs including molecular function (MF), cellular component (CC), biological process (BP) and Kyoto Encyclopedia of Gene and Genome (KEGG) pathways. Moreover, the protein-protein interaction (PPI) network of DEGs was used to identify some core genes.

Data Collection
Gene expression profiles of GSE26886, GSE23400, GSE20347, and GSE17351 in esophageal squamous cell carcinoma and normal tissues were downloaded from the NCBI-GEO database.
GEO2R is an interactive web tool that allows users to compare two or more datasets in the GEO series in to identify DEGs across experimental conditions [7] . The DEGs with log FC<0 was taken to be down-regulated genes, while the DEGs with log FC>0 was considered as an up-regulated gene.

KEGG and GO enrichment analyses of DEGs
KEGG is a bioinformatics resource that deals with genomes, diseases, enzymatic pathways, drugs and chemicals [8]. GO is a major bioinformatics initiative for interrogating gene sets in multiple annotation categories, and analyzing their biological processes [9]. DAVID is an online analysis tool designed to identify enriched gene ontological functions [10]. We used DAVID to construct the enrichment plots in order to visualize the common DEGs in the probable biological processes (BP), cellular components (CC), and molecular functions (MF) and KEGG pathways. Adjusted P value <0.05 was regarded as statistically significant.

Protein Interaction Network (PPI), Modular Analysis and Hub Genes
This research used an online database search tool (STRING, http://string-db.org) for the retrieval of interacting genes to predict the PPI network. Protein performs functions by interacting with each other. The PPI network gives a chance to understand protein functions at a system level. In this study, a PPI network of the differentially expressed genes was constructed with the STRING database. We kept the network with a confidence score > 0.4 in STRING, and then entered it into Cytoscapesoftware for visualization [11]. Cytoscape's inserted Molecular Complex Detection (MCODE) is an application for clustering a given network based on topology to find densely connected regions [12]. We used Cytoscape to draw a PPI network and use MCODE to identify the most important modules in the PPI network. The selection criteria are as follows: MCODE score> 5, degree cutoff = 2, node score cutoff = 0.2, maximum depth = 100, and k score = 2. KEGG pathway enrichment analysis is then performed on important modules of functional interpretation. The hub genes were selected with degrees ≥10. The biological processes of the hub genes were analyzed and visualized using the Biological Networks Oncology Tool (BiNGO) plugin of Cytoscape [13] .

Expression and Survival Analysis of Hub Genes
Hierarchical clustering of hub genes was constructed using UCSC Cancer Genomics Browser (http://genome-cancer.ucsc.edu) [14]. GEPIA (http://gepia.cancer-pku.cn/) [15] is a newly web server for gene expression profiling of 9,736 tumors and 8,587 normal samples from TCGA and GTEx projects. The cBioPortal (http://cBioPortal.org) [16] is a public online tool based on TCGA. To validate these hub genes, we applied the GEPIA and cBioPortal website to analyze the data of RNA sequencing expression on the basis of thousands of samples. GEPIA was used to analyze the expression levels of central genes in cancer tissues and healthy control tissues. Kaplan-Meier curve in cBioPortal was used to perform the overall survival and disease-free survival analyses of hub genes. The log rank P value and/or hazard ratio (HR) with 95% confidence intervals were/was computed and showed on the plot.
In addition, we used the Biological Network Gene Oncology Tool (BiNGO) plugin in Cytoscape to analyze and visualize biological l processes of the hub genes.

Identifcation of DEGs in esophageal squamous cell carcinoma
Via GEO2R online tools, we extracted 8267, 673,1733 and 744 DEGs from GSE26886, GSE23400, GSE20347, and GSE17351, respectively. The detailed information about GEO datasets is summarized in Table 1. Then, we used Venn diagram software to identify the commonly DEGs. The overlap areas shown in Venn diagram contained 105 genes among the 4 datasets, consisting of 42 downregulated genes and 63 upregulated genes between esophageal squamous cell carcinoma tissues and healthy control tissues. The venn diagram was shown in Fig. 1.

KEGG and GO enrichment analyses of DEGs
All 105 DEGs were analyzed by DAVID software and the results of GO analysis indicated that 1) for biological processes (BP), up-regulated DEGs were particularly enriched in extracellular matrix organization, collagen catabolic process and extracellular matrix disassembly, and down-regulated KEGG pathway enrichment analysis showed that the DEGs mainly participated in ECM-receptor interaction, Small cell lung cancer and Protein digestion and absorption pathways, while down regulated DEGs in no significant signaling pathways (P < 0.05) (Fig. 2).

6
A total of 105 DEGs were imported into the DEGs PPI network complex which included 102nodes and 234 edges, including 42 down-regulated and 63 up-regulated genes (Fig.3a ). Three significant models were obtained using the MCODE application in Cytoscape. The scores were 17.294, 5.000,3.000, respectively ( Fig.3c/d/e). The most important module was selected for further analysis.
Path enrichment analysis indicates that the module is mainly related to DNA replication and cell cycle Table 2 . A total of 18 genes were identified as hub genes with degrees ≥10. The names, abbreviations and functions for these hub genes are shown in Table3. The biological process analysis of the hub genes is shown in Fig.3f.

Expression and Survival Analysis of Hub Genes
Hierarchical clustering showed that the hub genes could basically differentiate the esophageal squamous cell carcinoma samples from the healthy control tissues (Fig.4). According to data from the GEPIA database, the expression levels of all hub genes in cancer patients were significantly higher than those in healthy controls, except for GMNN (Fig.5). In overall survival Analysis, those hub Genes were not statistically significant. Disease Free survival (RFS) analysis of the hub genes indicates that high expression of CDKN3, RAD51AP1, KIF4A was associated with poor prognosis in ESCC patients (Fig.6).

Discussion
Esophageal cancer is one of the most aggressive cancers in the world and the sixth leading cause of cancer-related death [17]. Histologically, the subtypes of esophageal squamous cell carcinoma in Asia has the highest incidence. Escc's pathogenesis is related to multiple factors including race, genetics and dietary intake [18]. The main treatments for ESCC include surgery, chemotherapy and radiochemotherapy. However, the 5-year survival rate of patients with esophageal cancer remain low (<15%) [19]. It is of utmost importance to understand the pathogenic mechanisms of ESCC and develop effective strategies to treat ESCC. We need to improve our understanding of the pathogenesis of esophageal squamous cell carcinoma through molecular biology research. Microarray technology has allowed us to explore the genetic alterations of ESCC, and has been proved to be a useful method for identifying new biomarkers in other diseases. In this study, four microarray data 7 sets were analyzed to obtain DEGs between esophageal squamous cell carcinoma tissue samples and healthy control tissue samples. A total of 105 DEGs were identified in these four data sets, including 63 up-regulated genes and 42 down-regulated genes. GO and KEGG enrichment analyses were performed to explore the interaction between the DEGs. The up-regulated genes were mainly enriched in extracellular matrix organization the breakdown of collagen in the extracellular matrix extracellular matrix disassembly hemidesmosome assembly the extracellular matrix (ECM) -receptor interaction, protein digestion and absorption, and cell-matrix adhesions, while the down-regulated genes were mainly enriched in cellular response to xenobiotic stimulus hypodermal cell differentiation, negative regulation of cell migration cell differentiation, and extrinsic component of membrane cell cortex. Increasing evidence shows that a comprehensive understanding of the molecular composition of esophageal cancer requires attention not only to tumor cells but also to the tumor microenvironment [20]. The cells are surrounded by the extracellular matrix (ECM), the complex network consisting of molecules, proteins, and polysaccharides. Related research shows that cancerrelated fibroblasts secrete growth factors and change the ECM to create a tumor niche and enhancing tumor cell migration and metastasis [20]. Extracellular matrix (ECM) remodeling plays a key role in tumor formation and progression, especially invasion [21]. Stromal cells such as fibroblasts secrete ECM remodeling enzymes, such as lysyl oxidase (LOX) and matrix metalloproteinase (MMP), which can help with the formation of primary tumors or metastatic micro-ecologies and the down-regulation of cell adhesion to achieve invasion, migration and intravascular infiltration [22,23]. ECM remodeling is associated with esophageal cancer, especially ESCC. For example, LOX-L2 is overexpressed in more than 90% of ESCC [24]. In addition, several matrix metalloproteinases (MIMPs) in ESCC, including MMP-2, MMP-7, and MMP-9, are all up-regulated and related to tumor stage [25]. These theories are consistent with our results. 18 hub genes were identified from the PPI network, and 3 of these genes, namely CDKN3, RAD51AP1, and KIF4A, were related to DFS in ESCC patients. The high expression levels of these genes were related to a shorter DFS and the poor prognosis of patients with esophageal cancer. Cyclin-dependent kinase inhibitor 3 (CDKN3) performs crucial roles in the modulation of tumor development. Many previous studies reported that CDNK3 plays an inhibitory 8 role in tumor development [26]. New research finds that CDKN3 plays different roles in different types of cancer. CDKN3 is reported to be involved in the occurrence and development of some types of tumors [27,28]. CDKN3 facilitated the promotion of ESCC cell proliferation, invasion and migration by activating the AKT signaling pathway [29]. CDKN3 knockdown significantly inhibited ESCC cell proliferation, migration, and invasion, and suppressed the G1 / S transition of tumor cells, providing a potential target for ESCC treatment [29]. RAD51-associated protein 1 (RAD51AP1) is an emerging protein that is essential for RAD51-mediated Homologous recombination (HR). HR serves to repair DNA double-strand breaks and damaged replication forks, which is essential for maintaining genome stability and suppressing tumors. However, there are no reports about the clinical value of RAD51AP1 in ESCC. Related studies have found that the expression of RAD51AP1 is up-regulated in primary hepatocellular carcinoma and intrahepatic bile duct cancer [30,31]. High expression of RAD51AP1 is associated with shortened survival in patients with breast and ovarian cancer [32,33]. Although the exact molecular role of RAD51AP1 in cancer development and progression has not been fully understood, research has uncovered a possible link between RAD51AP1 and tumor metastasis.
Studies have reported that the high expression of RAD51AP1 is associated with poor prognosis in patients with breast and ovarian cancer [34]. In some tumor cells and tissues, the increased RAD51AP1 may be related to the advantages of selective growth, enabling the replication of early tumor cells and metastatic cells [35,36]. However, we still don't know the precise molecular mechanisms by which RAD51AP1 carries out the HR reaction in cells and stabilizes DNA replication forks, and how posttranslational modifications affect the activities and biology of RAD51AP1. Therefore, we need to further study and explore the impact of RAD51AP1 on the pathogenesis and prognosis of ESCC. Filling the knowledge gap in the above fields will also provide prospects for targeted treatment of high RAD51AP1 cancer. Human kinesin family member 4A (KIF4A) is a 140 kD protein that plays a key role in a variety of cellular processes, including chromosome condensation and segregation, cytok inesis during mitotic division, and middle-spindle formation [37,38] Recently, it has become apparent that KIF4A plays an essential role in cancer development and progression.
Recent studies have discovered that KIF4A might serve as a biomarker in breast cancer based on bioinformatics analysis [39]. KIF4A overexpression is observed in colorectal cancer and pancreatic ductal adenocarcinoma as well as in lung cancer, for which it is an independent prognostic risk factor [40][41][42]. However, the clinical value of KIF4A in esophageal squamous cell carcinoma is rarely mentioned. In the future, in-depth studies on the roles of KIF4A in ESCC might provide new clues for inhibiting ESCC.

Conclusions
In conclusion, with integrated bioinformatics analysis, we identified 105 DEGs and 18 hub genes in   .00) . f The biological process analysis of hub genes was constructed using BiNGO.
19 Figure 4 Hierarchical clustering of hub genes was constructed using UCSC.
20 Figure 5 Expression of hub genes in ESCC and normal tissues using GEPIA 21 Figure 6 Prognostic value of hub genes in ESCC obtained from GEPIA. High expression levels of CDKN3 (a),, RAD51AP1(b), KIF4A(c)were associated with poor disease-free survival.