1 Standardized processing of endometrial cancer datasets
The whole study flowgram is presented in Figure 1. We first performed standardization on three endometrial cancer (EC) datasets: the TCGA-UCEC dataset, the GSE6008 dataset, and the GSE17025 dataset using the R package "limma" (Figure 2A-F). The TCGA-UCEC dataset consists of 552 EC samples (grouped as EC) (Figure 2A-B), the GSE6008 dataset includes 99 EC samples (grouped as EC) (Figure 2C-D), and the GSE17025 dataset contains 91 EC samples (grouped as EC) (Figure 2E-F). From Figure 2A-F, it can be observed that the expression profiles of the three EC datasets (TCGA-UCEC, GSE6008, and GSE17025) tend to exhibit consistent data expression patterns after undergoing standardization among the samples.
2 Construction of prognostic models of GRGs and clinical correlation analysis
To determine the prognostic value of 113 GRGs (Table 2) in the TCGA-UCEC dataset, we employed LASSO regression analysis to construct a prognostic model for GRGs (Figure 3A). Additionally, we visualized the results of the LASSO regression analysis to obtain the LASSO variable trajectory plot (Figure 3B). From the plot, it can be observed that our constructed prognostic model for GRGs includes 13 genes: AKR1A1, ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2, PPP2R1B. Furthermore, we visualized the sample grouping in the constructed prognostic model using a risk factor plot (Figure 3C). The risk factor plot consists of three parts: 1. Risk group: The samples in the dataset were grouped based on the RiskScore (risk score) predicted by the GRGs prognostic model, with the median score as the cutoff; 2. Survival outcome: The survival time and survival outcome of clinical samples in the TCGA-UCEC dataset are displayed using a dot plot; 3. Heatmap: The expression patterns of the prognostic GRGs in the GRGs prognostic model samples are visualized through a heatmap.
To analyze the differences in gene expression between high and low RiskScore groups in the TCGA-UCEC dataset based on the GRGs prognostic model, we performed differential analysis using the limma package. The analysis identified differentially expressed genes (DEGs) between the high and low RiskScore groups in the TCGA-UCEC dataset. The results are as follows: a total of 59,427 DEGs were obtained in the TCGA-UCEC dataset, among which 1,499 genes met the threshold criteria of |logFC| > 1 and P.adj < 0.05. Under this threshold, there were 688 genes with higher expression in the high RiskScore group (lower expression in the low RiskScore group, positive logFC), and 811 genes with lower expression in the high RiskScore group (higher expression in the low RiskScore group, negative logFC). We generated a volcano plot to visualize the differential analysis results of the high and low RiskScore groups in the TCGA-UCEC dataset (Figure 3D). Furthermore, we analyzed the differential expression of the 13 GRGs (AKR1A1, ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2, PPP2R1B) between the high and low RiskScore groups in the TCGA-UCEC dataset and created a heatmap using the R package pheatmap to illustrate the specific differential expression patterns (Figure 3E). From Figure 3E, it can be observed that the expression patterns of GRGs AKR1A1, GCK, NUP155, NUP188, PFKM, and PPP2R1B exhibit significant differences between the high and low RiskScore groups in the TCGA-UCEC dataset.
3 Analysis of the expression difference of GRGs prognostic model in patients with endometrial cancer
To analyze the expression differences of 13 GRGs (AKR1A1, ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2, PPP2R1B) between the high-risk group (group: High) and low-risk group (group: Low) of endometrial cancer (EC) patients in three endometrial cancer datasets (TCGA-UCEC dataset, GSE6008 dataset, and GSE17025 dataset), we multiplied the coefficients of the LASSO model with the gene expression levels of each sample in the GSE6008 dataset and GSE17025 dataset, and then summed them up. This process yielded risk scores (RiskScores) for each sample in the GSE6008 dataset and GSE17025 dataset. Based on the risk scores, we divided the GSE6008 dataset and GSE17025 dataset into high-risk group (group: High) and low-risk group (group: Low) using the median grouping method.
We used the Mann-Whitney U test (Wilcoxon rank sum test) to analyze the expression differences of 13 GRGs in different groups of samples from three endometrial cancer (EC) datasets. The results were presented through group comparison plots (Figure 4A, Figure 4D, Figure 4G). In the TCGA-UCEC dataset in FPKM format, the expression level of GP1 among the 13 GRGs did not show statistically significant differences among different groups (P > 0.05). However, AKR1A1 (P < 0.001), NUP155 (P < 0.001), PC (P < 0.001), PDHA1 (P < 0.001), PDHA2 (P < 0.05), and PFKM (P < 0.001) showed significantly upregulated expression in the high-risk group (group: High) compared to the low-risk group (group: Low). On the other hand, AKR1A1 (P < 0.001), ALDOB (P < 0.001), HK1 (P < 0.01), NUP188 (P < 0.001), PGK2 (P < 0.05), and PPP2R1B (P < 0.001) showed significantly downregulated expression in the high-risk group (group: High) compared to the low-risk group (group: Low) (Figure 4A).
In the GSE6008 dataset, the expression levels of AKR1A1, ALDOB, GCK, GPI, PC, and PGK2 did not show statistically significant differences among different groups (P > 0.05). However, NUP155 (P < 0.001), PDHA1 (P < 0.01), PDHA2 (P < 0.05), and PFKM (P < 0.001) showed significantly upregulated expression in the high-risk group (group: High) compared to the low-risk group (group: Low). Additionally, KK1 (P < 0.05), NUP188 (P < 0.05), and PPP2R1B (P < 0.05) showed significantly downregulated expression in the high-risk group (group: High) compared to the low-risk group (group: Low) (Figure 4D).
In the GSE17025 dataset, AKR1A1, AKR1A1, HK1, NUP188, and PC showed no statistically significant difference in expression among different groups (P > 0.05), while GPI (P < 0.05), NUP155 (P < 0.001), PDHA2 (P < 0.001), and PFKM (P < 0.01) showed significantly upregulated expression in the high-risk group (group: High) compared to the low-risk group (group: Low). ALDOB (P < 0.05), PDHA1 (P < 0.01), PGK2 (P < 0.05), and PPP2R1B (P < 0.01) showed significantly downregulated expression in the high-risk group (group: High) compared to the low-risk group (group: Low) (Figure 4G).
Then, we further plotted the ROC curves of GRGs with significant statistical significance (P<0.05) in the high-risk group (group: High) and low-risk group (group: Low) within three endometrial cancer (EC) datasets (TCGA-UCEC dataset, GSE6008 dataset, and GSE17025 dataset) to present the results (Figure 4B-C, Figure 4E-F, Figure 4H-I). The selection criteria were set as having a certain diagnostic accuracy (AUC>0.7) in at least 2 datasets. According to Figure 4, it can be observed that in the TCGA-UCEC dataset, the expression of NUP155 (AUC = 0.796, Figure 4B) has a certain diagnostic accuracy for the high-risk group (group: High) and low-risk group (group: Low) in the TCGA-UCEC dataset, and the expression of PFKM (AUC = 0.823, Figure 4C) has a certain diagnostic accuracy for the high-risk group and low-risk group in the TCGA-UCEC dataset. In the GSE6008 dataset, the expression of NUP155 (AUC = 0.793, Figure 4E) has a certain diagnostic accuracy for the high-risk group and low-risk group in the GSE6008 dataset, and the expression of PFKM (AUC = 0.733, Figure 4F) has a higher diagnostic accuracy for the high-risk group and low-risk group in the GSE6008 dataset. In the GSE17025 dataset, the expression of NUP155 (AUC = 0.742, Figure 4H) and PFKM (AUC = 0.681, Figure 4I) has lower diagnostic accuracy for the high-risk group and low-risk group in the GSE17025 dataset.
To further explore the correlation of 13 GRGs (AKR1A1, ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2, PPP2R1B) in the TCGA-UCEC dataset, we used the Spearman algorithm to analyze the expression levels of these 13 GRGs and presented the results using a correlation heatmap (Figure 4J). The results showed that the majority of correlations between these 13 GRGs in the TCGA-UCEC dataset were statistically significant (P<0.05). We then selected the gene pairs with the highest and lowest correlations and presented the results through correlation scatter plots (Figure 4K-L). Among them, we showed the correlation analysis results of the gene pair with the highest positive correlation, PDHA1 and GPI, in the TCGA-UCEC dataset (Figure 4K, r = 0.370, P < 0.001), and the gene pair with the highest negative correlation, AKR1A1 and NUP155, in the TCGA-UCEC dataset (Figure 4L, r = -0.319, P < 0.001), using the correlation scatter plot.
4 GRGs GO and KEGG analysis
To analyze the biological processes, molecular functions, cellular components, and biological pathways related to endometrial cancer (EC) for the 13 GRGs (AKR1A1, ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2, PPP2R1B), we first performed Gene Ontology (GO) gene function enrichment analysis on these GRGs (Table 4). The selection criteria for enriched terms were P.adj < 0.05 and FDR value (q-value) < 0.05, indicating statistical significance.
The results showed that the 13 GRGs were mainly enriched in biological processes (BP) such as the pyruvate metabolic process, monosaccharide metabolic process, hexose metabolic process, glucose metabolic process, and glycolytic process in endometrial cancer (EC). They were also enriched in cellular components (CC) such as nuclear pore, mitochondrial matrix, sperm flagellum, 9+2 motile cilium, as well as molecular functions (MF) such as monosaccharide binding, carbohydrate kinase activity, carbohydrate binding, glucose binding, structural constituent of nuclear pore. We visualized the results of the GO function enrichment analysis using a bubble plot (Figure 5A).
Additionally, we presented the BP pathways, CC pathways, and MF pathways from the results of the GO gene function enrichment analysis in the form of network diagrams (Figure 5B-D).
Next, we performed KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis on 13 GRGs (Table 5). The results showed that the 13 GRGs were significantly enriched in KEGG pathways such as Glycolysis/Gluconeogenesis, Carbon metabolism, HIF-1 signaling pathway, Central carbon metabolism in cancer, and Pyruvate metabolism. We presented the gene expression profiles of the top 5 KEGG pathways in the enrichment results using bar graphs (Figure 5E) and network diagrams (Figure 5F).
5 GSEA enrichment analysis of differentially expressed genes between high and low-risk groups
To determine the impact of gene expression levels on the development of endometrial cancer (EC) in high-risk and low-risk groups based on the GRG prognostic model in the TCGA-UCEC dataset, we conducted Gene Set Enrichment Analysis (GSEA) to analyze the association between the expression of all genes, biological processes involved, cellular components affected, and molecular functions exerted. We used a significant enrichment screening criterion of P.adj.<0.05 and FDR value (q-value) < 0.25. The results showed that genes in the high-risk and low-risk groups of the GRG prognostic model in the TCGA-UCEC dataset were significantly enriched in pathways such as il6 deprivation (Figure 6B), senescence tp53 targets (Figure 6C), apoptosis by cdknia via tp53 (Figure 6D), hypoxia (Figure 6E), cell cycle targets of tp53 and tp73 (Figure 6F) (Figure 6B-F, Table 6).
6 GSVA enrichment analysis of differentially expressed genes between high and low-risk groups
To explore the differences of hallmark gene sets between high-risk and low-risk groups in the TCGA-UCEC dataset, we performed Gene Set Variation Analysis (GSVA) on the differentially expressed genes (DEGs) between the high-risk and low-risk groups in the TCGA-UCEC dataset (Figure 7, Table 7). The GSVA analysis results of the TCGA-UCEC dataset between the high-risk and low-risk groups showed significant differences (P.adj.<0.05) in 29 hallmark gene sets, including adipogenesis, allograft rejection, apoptosis, bile acid metabolism, coagulation, and others. Among them, 12 pathways such as DNA repair, E2F targets, and G2M checkpoint obtained significantly higher enrichment scores in the high-risk group compared to the low-risk group (P.adj.<0.05), while the other 17 pathways such as adipogenesis, allograft rejection, and apoptosis showed significantly higher enrichment scores in the low-risk group compared to the high-risk group (P.adj.<0.05). Based on the GSVA results, we analyzed the differential expression of the 29 HALLMARK pathways between different groups in the TCGA-UCEC dataset and used the R package pheatmap to generate a heatmap to visualize the specific differential analysis results (Figure 7A). We used the Mann-Whitney U test to analyze the degree of group differences in the HALLMARK pathways between different groups in the TCGA-LUAD dataset and presented the results through comparative graphics (Figure 7B). The results showed that the differential expression of the 29 HALLMARK pathways between different groups in the TCGA-UCEC dataset had significant statistical differences (P<0.05).
7 Protein-protein interaction networks (PPI networks), mRNA-miRNA, mRNA-RBP, mRNA-TF, and mRNA-drugs interaction networks of GRGs
We used the STRING database to perform protein-protein interaction (PPI) analysis on 13 gene-related genes (GRGs) (AKR1A1, ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2, PPP2R1B) in the high-risk and low-risk groups of the TCGA-UCEC dataset of the GRGs prognosis model. We only kept the GRGs that had connections with other nodes and constructed a PPI network consisting of 11 GRGs (ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2). The PPI network, which includes all the nodes, serves as key genes (hub genes). We visualized the PPI network using Cytoscape software (Figure 8A).
We used the mRNA-miRNA data from the miRDB database to predict the interaction between 11 key genes (hub genes: ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2) and miRNAs. We then visualized the mRNA-miRNA interaction network by drawing it using Cytoscape software (Figure 8B). In the mRNA-miRNA interaction network, the light blue ellipses represent mRNA, and the dark green circles represent miRNA. From the mRNA-miRNA interaction network, we identified that our mRNA-miRNA network consists of 7 hub genes (GPI, HK1, NUP155, NUP188, PC, PDHA1, PFKM), and 61 miRNA molecules, resulting in a total of 67 interactions between mRNA and miRNA. The specific mRNA-miRNA interaction relationships can be found in Table S1.
We also used the ENCORI database to predict the interaction between 11 key genes (hub genes: ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2) and RNA binding proteins (RBPs). We then filtered out the mRNA-RBP interaction pairs with clusterNum > 4 and clipExpNum > 4 and visualized the mRNA-RBP interaction network using Cytoscape software (Figure 8C). In the mRNA-RBP interaction network, the light blue circles represent mRNA, and the purple circles represent RBPs. From the mRNA-RBP interaction network (Figure 8C), we identified that our mRNA-RBP network consists of 7 hub genes (GPI, HK1, NUP155, NUP188, PC, PDHA1, PFKM), and 24 RBP molecules, resulting in a total of 123 interactions between mRNA and RBP. Specifically, the hub gene GPI interacts with 23 RBP molecules. The specific mRNA-RBP interaction relationships can be found in Table S2.
We searched for transcription factors (TFs) that bind to the key genes (hub genes) using the CHIPBase database (version 3.0) and the hTFtarget database. After downloading the interaction data from both databases, we intersected it with the 11 key genes, resulting in a final set of 8 hub genes (ALDOB, GCK, HK1, NUP155, NUP188, PC, PDHA1, PFKM) and 139 transcription factors (TFs). We visualized this interaction data using Cytoscape software, where the light blue circles represent mRNA and the light green circles represent transcription factors (TFs) (Figure 8D). In the mRNA-TF interaction network, the DEGs BAK1 has the highest number of interactions with TFs, while the PC gene has a total of 108 pairs of mRNA-TF interactions. The specific mRNA-TF interaction relationships can be found in Table S3.
The CTD database was used to identify potential drugs or molecular compounds for the 11 key genes (hub genes). We searched the CTD database for 21 potential drugs or molecular compounds that are associated with 9 key genes (ALDOB, GCK, GPI, HK1, NUP155, PC, PDHA1, PFKM, PGK2), as shown in the mRNA-drug interaction network. The light blue oval shapes represent mRNA, while the pink circular shapes represent drugs (Figure 8D). Among them, we found 9 drugs or molecular compounds targeting the GCK gene. The specific mRNA-drug interaction relationships can be found in Table S4.
8 Construction of a multivariate Cox prognostic model for the TCGA-UCEC dataset
To determine the prognostic value of the 11 GRGs (ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2) in the TCGA-UCEC dataset, we first performed univariate Cox regression analysis on the expression levels of the 11 GRGs in the TCGA-UCEC dataset. Variables with a univariate analysis P<0.1 were included in the multivariate Cox regression analysis to construct a prognostic model for the high-risk and low-risk groups of GRGs (Table 8). Then, we presented the results of the univariate Cox regression analysis in the form of a forest plot (Figure 9A) after organizing them.
Next, we conducted nomogram analysis to assess the prognostic ability of the model and plotted a column line chart (Figure 9B). The column line chart, also known as a nomogram chart, is based on a multivariate regression analysis. It represents the variables within the multivariate regression model through a set scale to score and predict the probability of events occurring. From Figure 9B, it can be observed that PGK2 among the 11 genes in the TCGA-UCEC dataset's GRGs prognostic model has significantly higher utility compared to other variables. Additionally, we performed calibration analysis for the multivariate Cox regression model using the column line chart (nomogram chart) for 1-year (Figure 9C), 3-year (Figure 9D), and 5-year (Figure 9E) prognosis predictions. We also plotted calibration curves (Figure 9C-E), where the x-axis represents the predicted survival probabilities of the model, and the y-axis represents the actual survival probabilities. The lines and points of various colors represent the model's predictions at different time points. The closer the lines of different colors are to the gray ideal line, the better the predictive performance at that time point.
We then used Decision Curve Analysis (DCA) to evaluate the clinical utility of the constructed multivariate Cox regression model at 1-year (Figure 9F), 3-year (Figure 9G), and 5-year (Figure 9H) time points and presented the results (Figure 9F-H). In the DCA chart, the x-axis represents the probability threshold or threshold probability, while the y-axis represents the net benefit. By observing the stability and higher values of the model's lines compared to the "All positive" and "All negative" lines, we can determine the range of x-values that indicates better performance. The larger the range of x-values, the better the model's effectiveness. The results show that the clinical predictive performance of our constructed multivariate Cox regression model is ranked as follows: 5-year > 3-year > 1-year.
9 Prognostic analysis of GRGs
We performed a prognostic analysis on the 11 GRGs (ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2) of the multivariate Cox regression model constructed from the TCGA-UCEC dataset (Table 9). We considered molecular factors with a significance level of P<0.05 to have statistical significance.
We plotted survival KM curves for each of the 11 GRGs (ALDOB, GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2) using a screening threshold of P<0.05. We obtained five immune-cell death-related prognostic genes that met the criteria (Figure 7A-E). These five genes are: GCK (P<0.001, Figure 10A), NUP155 (P=0.041, Figure 10B), NUP188 (P=0.022, Figure 10C), PC (P=0.026, Figure 10D), and PDHA1 (P=0.049, Figure 10E).
In addition, we also performed a time-dependent ROC curve analysis to evaluate the correlation between the 11 GRGs (GCK, GPI, HK1, NUP155, NUP188, PC, PDHA1, PDHA2, PFKM, PGK2) and the survival information (OS event) of endometrial cancer (EC) patients in the TCGA-UCEC dataset. The genes with an AUC>0.6 were selected for further analysis. The results showed that GCK (AUC1=0.665, AUC3=0.559, AUC5=0.598, Figure 10F), NUP155 (AUC1=0.617, AUC3=0.624, AUC5=0.648, Figure 10G), PC (AUC1=0.577, AUC3=0.593, AUC5=0.614, Figure 10H), and PFKM (AUC1=0.588, AUC3=0.635, AUC5=0.674, Figure 10I) had relatively low accuracy in predicting the survival outcome (OS) of EC patients in the TCGA-UCEC dataset.