Analysis of immune cell subsets of EC using ESTIMATE and CIBERSORT
The analysis of cellular characteristics showed that tumor-related macrophages were the most abundant TME-infiltrating cells, followed by CD8+ T-cells. We noted that the numbers of M0 and M1 macrophages were low in normal tissues, but high in cancer tissues (Fig. 1a, 1c). We also observed that the levels of activated macrophages M0 and CD8+ T-cells decreased with EC stage, whereas M1 and M2 macrophages increased with EC stage (Supplementary Fig. 1). The correlation heatmap revealed that CD4+ T-cells and M0 macrophages were negatively correlated with resting memory CD8+ T-cells, activated mast cells and M0 macrophages were negatively correlated with resting mast cells, and CD8+ T-cells were positively correlated with CD4+ T-cells (Fig. 1b). Similarly, the obtained boxplot showed an increase in follicular helper T-cells, T-regulatory cells (Tregs), and M0 and M1 macrophages in cancer tissues compared with those in normal tissues. In contrast, we found that naive B-cells, CD4 memory resting T-cells, gamma delta T-cells, activated NK cells, and mast cells showed high abundance in normal tissues, but low abundance in tumor tissues (Fig. 1c). The tSNE plot of 22 immune signatures showed an increased number of M0 macrophages in tumor tissues compared with that in normal tissues (Fig. 1d).
Construction of weighted co-expression network and identification of gene modules
We chose a threshold value of 7 for the WGCNA, which had the lowest power of the scale-free topological fit index of 0.85 (Fig. 2a). After merging similar clusters, we identified 21 modules that contained groups of genes with similar patterns of connection strengths with other genes (Fig. 2b, 2c). Finally, we determined the correlation between these modules and traits (Fig. 2d). A significant association was found between the salmon, red, royal blue, and purple modules and stromal cell, immune cell, and ESTIMATE scores. Evaluating the correlation between GS and MM is key in measuring the quality of the construction of gene modules. After correlating the modules with the ImmuneScore, the correlation between GS and MM in the four modules was observed to reach 0.39, 0.99, 0.9, and 0.22 (Fig. 2e). Thus, we set more stringent screening conditions, namely GS correlation > 0.2 and MM correlation > 0.8, and finally selected 107 key genes for the downstream analyses.
GO and KEGG functional enrichment analyses
Using the “clusterProfiler” R package, 434 GO terms and 37 KEGG pathway were indicated in 107 key genes. The KEGG pathway analysis showed that the top 15 significantly enriched pathways were related to hematopoietic cell lineage, phagosome pathways, and the human T-cell leukemia virus 1 infection pathway (Supplementary Fig. 2a). The top five GO terms of each subclass were found to be mainly related to immune activation, such as T-cell activation and proliferation (Supplementary Fig. 2b).
PPI network of key genes
To better understand the interplay among the key genes, we constructed a PPI network using STRING, which revealed that all key genes were densely interconnected. The PPI network of the top 20 key genes sorted by degree using Cytoscape software revealed several gene interactions (Supplementary Fig. 3a). The biological process analysis of the key genes is shown in Supplementary Fig. 3b.
Establishment of a prognostic risk model of the four immune-related genes
The entire cohort (N = 529) was divided into the training (n = 395) and validation (n = 134) groups based on stratified sampling according to AJCC staging. The univariate Cox regression analysis of the training group showed that 58 immune-related genes were found to be associated with the OS of patients (P < 0.05; Fig. 3a). To verify the reliability of the model, we selected four immune-related genes using the random forest algorithm (Fig. 3b), with its coefficient in multivariate Cox regression as follows: four immune-related genes with risk scores − 0.08485*Exp(LPXN), -0.18955*Exp(TRBC2), -0.14632*Exp(TRAC), and 0.13717*Exp(ARHGAP30). We noted significant differences between the high- and low-risk groups. Additionally, up on optimization of the expression median for analysis of the four immune-related genes based on the selected cut-off, we found that the low expression of TRBC2, LPXN, TRAC, and ARHGAP30 was associated a with poor prognosis (Fig. 3c). We then calculated the risk score for each patient, with 1 as the cut-off point, and divided the patients into the high-risk (n = 255) and low-risk (n = 274) groups (Fig. 3d); the expression heatmap of these four genes in the two groups is shown in Fig. 3e.
Prognostic analysis of the four immune-related genes classifier and construction of the nomogram prognostic model
The Kaplan–Meier analysis of the OS showed that the high-risk group in the training cohort was associated with poor OS (P = 0.0026), whereas the OS rate of high-risk patients in the validation and entire cohorts was significantly lower than that of low-risk patients (P = 0.038, P < 0.0001) (Fig. 4a). As shown in Fig. 4b, time-dependent receiver operating characteristic (ROC) analysis revealed that the area under the curve (AUC) for 1-, 3-, and 5-y OS in the training cohort was 0.687, 0.699, and 0.76, respectively. The predictive value of the four immune-related gene classifiers was confirmed in the validation cohort, which showed that the AUCs for 1-, 3-, and 5-y OS in the validation cohort were 0.445, 0.606, and 0.679, respectively, whereas those of the entire cohort were 0.637, 0.671, and 0.74, respectively. To establish a more reliable predictive method for clinical practice, we combined the two cohorts (n = 453) and constructed a compound nomogram. We performed univariate and multivariate Cox proportional hazards regression analyses on the relationship between clinical characteristic variables and OS (Supplementary Table 2). Meanwhile, by drawing the ROC curve, we found that risk score has a better predictive ability than other clinical factors for 5-y OS. The AUC of ROC increased after combining the risk score with other clinical factors (Fig. 4c), which suggested that the risk score may be an independent risk factor for patients. As shown in Fig. 4d, we employed these variables to analyze the survival probability of patients at 1, 3, and 5 y. To verify the predictive value of the nomogram, we used C-statistics to analyze the generated nomogram model. The C-index of the nomogram was 0.818 (95% CI, 0.865–0.77). In addition, the calibration plot generated for patients with a 1-, 3-, and 5-y OS prediction demonstrated that the predicted outcome of the nomogram showed a good agreement with the actual outcome (Fig. 4e). Finally, we used the online website GEPIA2 to analyze the overall prognosis of four immune-related genes in endemic cancer (Fig. 5).
Tumor microenvironment and immune status
To understand the immunological evaluation value of the model, we performed immunological analysis between the high- and low- risk groups. The ssGSEA analysis of the immune status of patients showed that type I IFN response was not significantly different between the groups. However, the scores of other immune cells, immune function, and immune pathway gene set were significantly higher in the low-risk group than in the high-risk group, indicating that the immune status of the low-risk group was higher than that of the high-risk group (Fig. 6a, Supplementary Fig. 4a). In addition, we compare the immune infiltration results of CIBERSORT analysis between the groups (Fig. 6b). The results show that the infiltration of T cells CD8, T cells follicular helper, T cells regulatory (Tregs), T cells gamma delta, and macrophages M1 in the low-risk group was significantly higher than that in the high-risk group (Supplementary Fig. 4b). The results of the ESTIMATE analysis showed that the immune score, stromal score, and ESTIMATE score of the low-risk group were significantly higher than those of the high-risk group. Furthermore, the tumor purity was significantly lower than that of the high-risk group, indicating that the degree of immune infiltration in the low-risk group was higher than that in the high-risk group (Fig. 6c). The checkpoint score based on the ssGSEA showed that the low-risk group was more suitable for immunotherapy than the high-risk group. We analyzed the clinical common immune checkpoint expression, tumor mutation burden (TMB), and mutant-allele tumor heterogeneity (MATH) of patients in the two groups, respectively, as references for the evaluation of clinical immunotherapy. The expression of immune checkpoints except CD276 (no significant) and TMB were higher in the low-risk group than in the high-risk group, and MATH was significantly lower in the low-risk group (Fig. 6d, 6e, and 6f). These results indicated that the gene model has a good ability for immune evaluation, and the low-risk group has higher immune infiltration and immune status, which may result in better efficacy of immunotherapy.
Enrichment analysis, stemness analysis, and mutation profile
We further stratified the transcript levels of patients according to risk score into the high- and low-risk subgroups. We performed the GSEA to identify potential biological processes in the four immune-related prognostic gene model. Our results revealed that the pathways of T-cell activation, T-cell proliferation, activation of immune response, and alpha beta T-cell activation were enriched in low-risk patients. The details are shown in Fig. 7a. We also generated a heatmap of the transcriptional expression profiles of the 100 DEGs (Supplementary Fig. 5). In addition, Supplementary Table 3 shows the enrichment score and statistical significance of each pathway. The hallmark pathway enrichment results showed that the top six hallmark pathways in the low-risk group were all immune-related (Fig. 7b).
Stemness is one of the factors that determine the malignancy of a tumor. In the analysis of stemness, we found that the low-risk group had lower stemness than the high-risk group (Fig. 7c). This might be one of the reasons for the better prognosis in the low-risk group. For further research of potential mechanism of EC in the different risk groups, we analyzed the somatic mutation data of the two groups of patients and showed the mutation profile (Fig. 7d). There were some differences in the top 20 genes between the groups. The top 10 genes of mutation frequency in the high-risk group were PTEN (46%), PIK3CA (45%), TP53 (45%), ARID1A (35%), PIK3R1 (27%), KMT2D (23%), CTNNB1 (21%), ZFHX3 (19%), CHD4 (18%) and FBXW7 (18%). And the top 10 genes of mutation frequency in the low-risk group were PTEN (67%), PIK3CA (51%), ARID1A (50%), PIK3R1 (32%), CTCF (31%), KMT2D (30%), ZFHX3 (28%), FAT4 (28%), FAT1 (27%), TP53 (27%), CTNNB1 (26%) and LRP1B (25%). These results indicate that our gene model has good performance and could be considered as a candidate factor for clinical indicators.