2.1 Data acquisition and preprocessing
The expression profile of GSE20347, GSE29001, and GSE111044 was achieved based on gene expression omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), which is a public database containing comprehensive data of gene profiling and sequencing. GSE20347 included 17 pair-wise ESCA tissues and normal adjacent tissues. GSE29001 included matched normal basal epithelial cells, normal differentiated squamous epithelium, and ESCA from 12 patients. GSE111044 included 3 ESCA tissues and corresponding 3 normal tissues from 3 patients. The experiment GSE20347 and GSE29001 were conducted on platform GPL571 (Affymetrix Human Genome U133A 2.0 Array), and GSE111044 was on platform GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array). After eliminating redundant data (e.g., null value, time), all the gene symbols were matched with probes and were subsequently screened with the ‘limma’ package of R software 3.4.1 to perform background correction, quartile normalization, and quantiles summarization.
2.2 Identification of DEGs between ESCA and normal tissues
In this study, we utilized the ‘limma’ package in bioconductor (http://www.bioconductor.org/) to uncover DEGs between normal and ESCA tissues. Adjusted P < 0.05 and |log2 fold change| > 1 was set as the criteria for selecting significant DEGs according to the normalized gene expression levels.
2.3 Establishment of WGCNA on ESCA
WGCNA is regard to a methodology to reconstruct a free-scale gene co-expression network and concurrently identify modules consisted of highly correlated genes to appraise connectivity between external clinical traits and the module, in which eigengene is used for summarizing relationship among internal gene membership. In the study, we applied the one-step network construction and module detection function of WGCNA package in R to handle the analysis of microarray dataset GSE70409, which contains 17 cancerous tissues and 17 normal tissues. First, a weighted adjacency matrix was calculated to represent the connecting strength over each pair of gene with outliers removed and ensure a co-expression network. The soft thresholding power was set as 7 to obtain a scale‑free topology network. Then, a hierarchical clustering dendrogram was established constituted of abundant branches, and each branch was assigned with a colour to reveal a module. Finally, the modules were used to the association with clinical traits by using module-trait associations according to module membership (MM) and gene significance (GS). Moreover, topological overlap matrix method was utilized for verifying correlation character of eigengenes in different modules.
2.4 Module preservation evaluation and principal component analysis (PCA) analysis
Zsummary is usually used to evaluate the preservation of modules. It takes into consideration several statistics, such as the density and connectivity patterns of module nodes, as well as the overlap among module membership. However, a huge difference in module size is easy to induce deviation in Zsummary value. In our study, the green module had far more genes than the greenyellow module, and consequently, we adopted medianRank, since it eliminates the impact of module size. The module with a lower medianRank has more preservation value than that with a higher medianRank. The finally preserved greenyellow module was processed with PCA to examined the ability of gene in the module to discriminate tumor tissues from normal tissues. PCA was conducted through ‘gmodels’ and ‘scatterplot3d’ in R software.
2.5 Pathway enrichment analyses of genes in the hub module
Gene ontology (GO) is a common method for gene analysis [20]. In the study, we used GO analysis to classify the genes in greenyellow module into three categories based on their bio-function, including biological process (BP), cellular component (CC), and molecular function (MF). Meanwhile, Kyoto encyclopedia of genes and genomes (KEGG) pathway enrichment analysis was also performed for exploration of their biological characteristics. Both the GO and KEGG pathway analyses were conducted using the ‘clusterProfiler’ package in R, with adjusted P-value of the analysis calculated by the Benjamini and Hochberg false discovery rate algorithm. Furthermore, a circular chordal graph of the data was generated utilizing gene network analysis via the default euclidean distance and average linkage.
2.6 Extract of hub genes from DEGs and the hub module in WGCNA
39 candidate genes were screened out from the intersection of venn diagram between 3 set of DEGs and the greenyellow module in WGCNA (Figure 5). CCNB1 was chosen for the next research for the most significant P value. For further study, the protein–protein interaction (PPI) network was constructed among 39 candidate genes (string, https://string-db.org/), and Cytoscape software was applied to visualize the PPI network.
2.7 Verification of gene aberrant expression in UALCAN, GEPIA2, and HCMDB
UALCAN (http://ualcan.path.uab.edu/index.html) is a comprehensive web resource for analyzing cancer OMICS data with the use of TCGA and MET500 databases. In our study, CCNB1 expression data was obtained through ‘Expression Analysis’ module for the contrast of promoter methylation and gene overall expression level. Further stratified analysis based on gender, age, cancer stages, and histology was also conducted. Additionally, GEPIA2 (http://gepia2.cancer-pku.cn/) and Human Cancer Metastasis Database (HCMDB, http://hcmdb.i-sanger.com/index) analysis was performed to explore the differential expression of CCNB1 between normal tissues and ESCA tissues. GEPIA2 is based on CGA and the GTEx databases with a total of 84 cancer subtypes analysis. HCMDB is designed to examine gene expression of primary and metastasis tumour from 124 previously published transcriptome datasets from TCGA and GEO databases.
2.8 Survival analysis with Kaplan–Meier (KM) plotter database
KM plotter (http://kmplot.com/) was used to plot the overall survival status and estimate prognostic valve of CCNB1, which is an interactive website containing 54k genes impact data on survival of 21 cancer types. According to median gene expression level, all patients’ survival data was divided into two group: high expression group and low expression group. The P value was set < 0.05 to ensure statistically significance.
2.9 Tumor infiltration analysis of TIMER database
TIMER database provides a systematical analysis of infiltrating abundances in 6 types of immune cells (B cells, CD4+ T cells, CD8+ T cells, neutrophils, macrophages, and dendritic cells) and the infiltration relevant clinical outcome. Hence, we adopted this method to research the tumor infiltration as well as the survival data on CCNB1.