- Identification of molecular subtypes related to energy metabolism and cancer prognosis.
Among the 584 EMRGs, a total of 86 genes were significantly associated with the prognosis of STAD patients according to the results of univariate Cox regression analysis with P < 0.05 (Supplementary Table S2). The NMF analysis based on the expression of the 86 genes eventually identified two distinct subtypes (Cluster1[n=123], Cluster2[n=216]) among the 336 STAD patients in the TCGA dataset, which might have close association with cancer energy metabolism processes and prognosis in STAD (Figure 1A). Kaplan-Meier survival analysis revealed that the PFS of STAD patients of the two clusters was significantly different (P = 0.025, HR = 0.66, 95%CI 0.46-0.95; Figure 1B). As shown in the heatmap of gene expression across the two clusters (Figure 1C), the great majority of EMRGs presented higher expression levels in Cluster 2 compared with Cluster 1.
Further comparison with WHO classification suggested that Cluster 1 and Cluster 2 were inclined to the poorly cohesive and tubular subtypes, respectively (Figure 1D). Supported by the TCGA project, Adam et al. [25] once divided STAD patients into four TCGA subtypes: Epstein–Barr virus positive (C1), microsatellite unstable (C2), genomically stable (C3), and chromosomally unstable tumors (C4). By comparing the present results of molecular classification with the classical TCGA four subtypes-classification, we discovered that the identified Cluster1 was inclined to the Barr virus positive (C1) subtype which had a poorer prognosis, while Cluster2 showed more relevance with the genomically stable (C3) subtype whose prognosis was much better (Figure 1E). Comparison analysis with other well-established clustering methods demonstrated the reliability of the classification results.
The distribution of the two clusters in STAD patients with different clinical characteristics was further analyzed. It was observed that most of patients with T1 or T2 stage and TNM Stage I were divided into Cluster 2 which had better survival. The Cluster 1 with poor outcomes inversely showed a trend of younger ages (Supplementary Figure 1). The proportions of tumor-infiltration immune cells and the fractions of immune and stromal cell components in tumor microenvironment (TME) were further compared between the two subgroups to explore the association between energy metabolism phenotype and immune status in STAD (Figure 2). The proportions of CD4+ T cells, CD8+ T cells, macrophage, neutrophils, and dendritic cell were all significantly higher in Cluster 1 than in Cluster2 (Figure 2A-F). The calculated ImmuneScore, StromalScore, and ESTIMATEScore were also remarkably higher in Cluster 1, which represented more immune and stromal cell components in TME and lower tumor priority for the samples in Cluster 1 (Figure 2G-I). The results further suggested the close association between cancer cell energy metabolism, immune regulation, and clinical outcomes in STAD.
- Selection of hub genes by WGCNA and PPI analyses.
One outlier sample was identified by the hierarchical clustering analysis and removed from WGCNA co-expression analysis (Supplementary Figure S2A-C). Based on the expression of EMRGs in the TCGA dataset, a total of 29 co-expression modules were obtained after module fusion (Figure 3A, grey modules represent gene sets couldn’t be merged). The relationship between the identified modules and clinical characteristics as well as molecular classifications was shown in Figure 3B. It was concluded that Cluster 1 and Cluster 2 were significantly correlated with the yellow and brown module, respectively (r > 0.4, P < 0.05). The correlation between clinical phenotypes and the obtained modules as well as the genes of the modules was listed in Supplementary Table S3. As shown in Figure 3C, members in the yellow module were largely correlated with the Cluster 1 subtype, while members in the brown module were remarkably associated with the Cluster 2 phenotype. Therefore, the two modules having close relationship with energy metabolism-based subtypes of STAD were considered as the key modules, and the genes involved in these key modules were regarded as candidate genes for hub genes identification.
Functional enrichment analysis demonstrated that 23 KEGG pathways (i.e., MAPK signaling pathway, ECM−receptor interaction) were significantly involved in the yellow module (FDR < 0.01; Figure 3D, Supplementary Table S4) and 35 pathways (i.e., cGMP−PKG signaling pathway, Rap1 signaling pathway) significantly involved in the brown module (FDR < 0.01; Figure 3E, Supplementary Table S5). Most of these pathways were classical cancer-related biological processes. Moreover, the crosstalk of pathways was quite limited (Figure 3F), which further demonstrated the functional heterogeneity of the two key modules.
Subsequently, the expression of the candidate genes in key modules was mapped to STRING database to construct PPI network. A total of 3585 PPI pairs with a score higher than 0.9 were matched among the 1713 co-expression genes (Figure 4A, Supplementary Table S5). The top hub genes identified by the Degree (Figure 4B), Closeness (Figure 4C), and Betweenness (Figure 4D) methods were largely consistent (Supplementary Table S6). The topological properties of the PPI network were also evaluated and the distributions of degree, closeness, and betweenness were shown in Figure 4E-G. A total of 220 genes that satisfied high degree, closeness, and betweenness scores were selected out as hub genes for further analysis (Supplementary Table S7). These hub genes were assumed to be strongly correlated with the development of STAD, and were enrolled for subsequently identification of prognostic gene.
- Identification of energy metabolism-related prognostic model.
The clinical information of STAD patients in the training (n=170), testing (n=169), and external validation (n=300) sets used for model construction and evaluation was listed in Table 1. In the training set, after the selection of univariate Cox regression and Lasso regression analysis (Supplementary Figure S2D-E), six genes (DYNC1I1, GPER1, MFAP2, ARRB1, C3, and GLI1) out of the 220 hub genes were included in the prognostic model (Table 2). And a gene-based prognostic model was established to evaluate the survival risk of each patient as follows: Risk score=0.38585×expDYNC1I1+0.10411×expGPER1+0.04476×expMFAP2−0.70386×expARRB1+0.09187×expC3+0.21797×expGLI1.
According to the cut-off value of normalized risk score (Z-score = 0), STAD patients were divided into high- and low-risk groups. The distribution of risk scores in the training set was shown in Figure 5A, which showed that expression levels of DYNC1I1, GPER1, MFAP2, C3, and GLI1 were positively correlated with risk scores, whilst ARRB1 levels was negatively correlated with risk scores. It was concluded that higher ARRB1 expression was associated with a worse prognosis and was a favorable prognostic, while the other 5 genes were identified as unfavorable prognostic factors for STAD patients. The AUCs of 1-, 3-, and 5-year ROC curves for the 6-gene signature to predict STAD survival were 0.70, 0.71, and 0.73, respectively (Figure 5B). Kaplan-Meier survival analysis confirmed that the high-risk group had significantly worse PFS than the low-risk group (Figure 5C).
The risk scores of STAD patients in the testing and internal validation sets were further calculated using the same coefficients. Patients were sub-grouped using the same cutoff value as the training set. The corresponding ROC curve and Kaplan-Meier survival curves for the TCGA testing set and the entire TCGA dataset showed that the AUCs of the signature remained high and the high-risk groups had consistently shorter PFS periods than the low-risk groups (Figure 6).
A total of 300 STAD samples in GSE62254 were analyzed for the external validation of the signature. In this dataset, ARRB1 was a consistent protective factor while the other five genes were still risk factors for STAD survival (Figure 7A). The robustness of the signature was further verified (Figure 7B-C).
- Association between the identified signature and clinical characteristics.
The predictive performance of the prognostic model was evaluated among the 339 STAD patients with varied clinical features in the TCGA dataset. The results of subgroup survival analyses revealed that the 6-gene signature could effectively discriminate high-risk and low-risk patients among the elder, both sexes, all stage, intestinal, and microsatellite instability-high (MSI-H) subgroups, which expanded its potential application (Supplementary Figure S3A). Univariate and multivariate Cox regression analyses were further performed to evaluate the clinical independence of the identified signature. It was proved that the calculated risk score could independently predict the PFS of STAD patients without the interference of other clinical parameters in the TCGA dataset (Supplementary Figure S3B).
- Comparison with previous prognostic models.
Previous studies had identified several prognostic models for survival prediction of STAD patients. The predictive performance of the present 6-gene signature was further compared with three previous models (a 5-gene signature proposed by Wang et al. [26], a 6-gene signature proposed by Cho et al. [27], and a 10 immune-related gene signature proposed by Yang et al. [28]. For normalization, gene expression levels in each model was uniformly extracted from the original matrix of the TCGA-STAD dataset. The risk score of each STAD patient was calculated accordingly based on the corresponding coefficients provided by each model. Patients were divided into high-and low-risk groups separately according to the median value of normalized risk score for each signature. The comparative plots of Kaplan-Meier survival curve and ROC curves were shown in Figure 8A-C. Restricted mean survival time (RMST) was applied to calculated and compared the C-index of all signatures. The AUCs of the present 6-EMRG model were relatively higher and more stable than the other signatures, and the C-index was the highest among the four models (Figure 8D). DCA curves further demonstrated that the 6-gene signature had better clinical utility than the other models in the survival prediction of STAD patients (Figure 8E).
- GSEA analysis of enriched pathway based on risk score
ssGSEA was performed to determine the potential related pathways according to patients’ prognostic risk in the entire TCGA dataset, and pathways with FDR < 0.05 were derived. By divided samples into high-risk group and low-risk group based on whether the Riskscore is greater than 0, and analyzed the enriched pathway in both groups by using GSEA, we found that 10 pathways were significantly enriched in the high-risk group, such as ECM receptor interaction、hedgehog signaling pathway, and etc.; whilst only citrate cycle TCA cycle was significantly enriched in the low-risk group (P < 0.05; Supplementary Figure S4). Thus, the 6-gene signature may involve in the development and progression of STAD by participating these pathways.