3.1 Identification of DEGs in CC
The CC expression microarray datasets (GSE6791, GSE9750, GSE39001 and GSE63514) were firstly standardized (Figure S1). With limma package, 256 DEGs were filtered from GSE6791 (60 downregulated and 196 upregulated), 236 DEGs from GSE9750 (136 downregulated and 100 upregulated), 98 DEGs from GSE39001 (38 upregulated and 60 downregulated), 489 DEGs from GSE63514 (177 upregulated and 312 downregulated). DEGs from the 4 microarray datasets were exhibited in volcano maps (Figure S2A-D) and heatmaps (Figure S3A-D). We analyzed the four microarray datasets via the limma package and then with RRA method according to their log-folding variation values ((|log2fold change (FC)| > 1 and adj. p <0.05). The RRA method was based on the theory that genes in each experiment were randomly ordered. For the genes ranking higher in the experiment, the possibility of differential expression is inversely proportional to the value of P. Through analytic hierarchy analysis, we sorted out 74 up-regulated genes, and 73 down-regulated genes (Table 2). Finally, the R-heatmap software was performed to plot the top 40 up- and down-regulated genes (Figure 2).
3.2 Functional analysis of DEGs
The biological annotations of DEGs in CC were obtained with an online analysis tool named DAVID, which had GO analysis of up- and down-regulated genes (P<0.05). The GO analysis of DEGs covered three aspects: molecular function, biological processes and cellular composition (Figure S4A). The upregulated genes were significantly enriched in microtubule binding, tubulin binding and ATPase activity (Figure 3A), and the down-regulated genes in serine-type peptidase activity, serine hydrolase activity and serine-type endopeptidase (Figure 3B). These results indicated that most DEGs were prominently enriched in structural molecule activity, midbody, kinesin complex and microtubule motor activity. (Figure S4 B-C and Figure S5A). Clusterprofile was performed to analyze the DEGs. The result showed that the upregulated genes were mostly enriched in DNA replication, Oocyte meiosis and Cell cycle. (Figure 3C), and the down-regulated genes in Arachidonic acid metabolism, prostate cancer and signaling pathways regulating pluripotency of stem cells (Figure 3D). The pathway-gene network (Figure S5B) suggested that the cell cycle was the most important term in the biological processes of CC.
3.3 PPI network construction and modules selection
The PPI network of DEGs was constructed, including 147 nodes (74 up-regulated genes and 73 down-regulated genes) and 562 edges (Figure 4A). Degrees ≥30 was set as the cutoff. A total of 16 genes, such as CDK1, TOP2A, NCAPG, and KIF11, were showed the most significant difference in expression (Figure 4B). A significant module was selected with plug-in MCODE, including 27 nodes and 343 edges (Figure S6A). GO and KEGG pathway enrichment analysis indicated that the genes in this module were mainly related to microtubule binding, tubulin binding, cell cycle and oocyte mitosis (Figure S6B and 6C).
3.4 Small molecule drugs screening
CMap network was used to analyze 147 differentially expressed genes into two groups (74 in up-regulated group and 73 in down-regulated group). After the signature query, the three compounds with the highest negative enrichment score (apigenin, thioguanine, and trichostatin A) were identified as potential therapeutic agents for CC (Table 3). The three-dimensional chemical structure of these three compounds is shown in Figure 5.
3.5 Validation of hub genes
We validated DEGs at GEPIA website, including survival analysis and tissue sample expression analysis (Figure S7 and Figure S8). Eight genes (APOD, CXCL8, MMP1, MMP3, PLOD2, PTGDS, SNX10 and SPP1) had the same trend in both the above analysis. We literature-reviewed these eight genes, finding that only PTGDS and SNX10 had not been reported to be associated with CC. Therefore, we used GSE7803 and GSE29576 to validate PTGDS and SNX10 (Figure S9). The results showed that PTGDS had high expression levels in normal tissues and low expression levels in tumor tissues, while SNX10 showed an opposite profile. We further validated the two genes using immunohistochemistry (Figure 6A-B) and ONCOMINE, obtaining the results consistent with those from the GEO database (Figure 6C-D). The area under the curve of PTGDS was 0.919 and that of PTGDS was 0.905, suggesting that both can distinguish CC and normal tissue and have a good diagnostic efficiency (Figure 7A). GSEA was performed to search KEGG pathways enriched in the TCGA samples. The top ten most enriched pathways included “hematopoietic cell lineage”, “adhesion molecules cams”, “vascular smooth muscle contraction”, “systemic lupus erythematosus”, “chemokine signaling pathway”, “t cell receptor signaling pathway”, “cytokine cytokine receptor interaction”, “calcium signaling pathway”, “neuroactive ligand receptor interaction” and “leukocyte transendothelial migration” (Figure 7B) (adj.p < 0.05). In addition, the univariate and multivariate Cox proportional hazards regression analyses showed that SNX10 and PTGDS were independent prognostic indicators for OS among CESC patients (P=0.007 and 0.003)(Table 4).
To find out the mechanism of abnormal gene expression, we analyzed the gene expression level and methylation level from the Illumina Human Methylation 450 platform based on TCGA data. The association between the methylated expression and the gene expression of the two key driving genes were shown in Figure7C-D. The survival analysis showed that the patients with low-expressed and hyper-methylated PTGDS had a worse prognosis than those with high-expressed and hypo-methylated PTGDS(P=0.037) (Figure 7E). However, SNX10 methylation has no statistical significance in survival analysis.
3.6 Establishment of Cox regression model
Univariate cox regression analysis screened out seven genes, including APOD, CXCL8, MMP1, MMP3, PLOD2, PTGDS and SPP1 (Figure S10). Multivariate Cox regression analysis screened five genes, including SPP1, CXCL8, PTGDS, PLOD2 and MMP3 (Figure S11). The score for predicting overall survival risk was calculated as followed: Risk score= 0.143* SPP1+0.136* CXCL8-0.093* PTGDS+0.206* PLOD2+0.067* MMP3. Based on the risk score, CC patients were divided into low- and high-risk groups. Kaplan-Meier survival analysis suggested that low-risk patients had better overall survival than high-risk patients in the TCGA cohort (Figure 8A). ROC curve analysis was also completed according to the 5-year survival of the area under the receiver operating characteristic curve (AUC) value. The specificity and sensitivity were both highest when the risk score was 0.738 (Figure 8B). The distribution of risk score, survival status, and the expression levels of five genes was also analyzed (Figure 8C-F). The expression levels of five genes in low- and high-risk groups were shown in Figure S12A. The univariate and multivariate Cox proportional hazards regression analyses showed that only the risk score based on five genes was independent prognostic indictor of CC (Figure S12B-C).
The Heatmap showed the expression levels of the five genes in high- and low-risk patients in the TCGA dataset. We observed significant difference in survival state (P < 0.001) and stage (P < 0.05) (Figure S 12D).