3.1 Identification of glioma progression related genes
A total of 24 modules were identified by WGCNA in TCGA gene expression profile data (Fig. 1A-C), of which darkolivegreen module showed the most significant negative correlation with tumor grade (r = -0.6, p = 3e-69). The pink and sienna3 modules were also inversely correlated with darkolivegreen and tumor grade. (r = -0.48, P = 5e-42; r = -0.46, P = 2e-38), while darkgreen module showed the most positive correlation with tumor grade (r = 0.46, P = 5e-38). The midnightblue and orange modules had positive correlation with darkgreen and tumor grade (r = 0.41, P = 3e-30; r = 0.34, P = 3e-20). We finally chose these 6 modules, which contain 705 genes in the TCGA cohort, for further study. Also, 12 modules in the meta_GSE1 dataset were identified by WGCNA (Figure.S1A-C). Of these modules, brown module showed the most significantly negative correlation with tumor grade (r = -0.6, P = 6e-50), and magenta module was highly correlated with brown module and negatively correlated with tumor grade (r = -0.41, P = 2e-22) as brown module. Red module had the most significantly positive correlation with tumor grade (r = 0.58, P = 1e-46). Blue and tan modules were highly correlated with red module and had positive correlation with tumor grade (r = 0.5, P = 5e-34;r = 0.42, P = 6e-23). We then identified five modules associated with tumor grade. 254 genes in these 5 modules in the meta_GSE1 cohort were identified. Finally, we select 67 intersected genes in these two cohorts as candidate glioma progression associated genes (Tab. S2).
3.2 Construction and validation of the prognostic gene signature.
3.2.1 Construction
All 67 genes were statistically significant in the TCGA cohort (P<0.05) after the univariate Cox regression. We also performed the univariate Cox analysis of 67 genes in each pathological grade and 37 genes, which possessed statistical significance among all grades, were identified, suggesting the prognostic role of these genes that were independent of tumor pathological grades (Tab. S3). Next, we performed the LASSO regression analysis to build a robust model with these representative genes. We first used 37 genes as an input for LASSO regression analysis. After 105 times iterations, two stable models were obtained: one with the best predictive performance when minimum lambda was obtained and it contained five genes (mod_37_5). The other was a model that contained the minimal number of variables with ideal predictive efficiency and was called mod_37_4 that contained four genes. A similar procedure was performed for 67 genes mentioned above and two models (mod_67_6 and mod_67_5) were obtained, including 6 and 5 genes, respectively. We did not observe a significant difference in the predictive power of four models through the three methods: C_index, CPE and IDI (Tab. S4). We finally chose the simplest model with good predictive power, mod_37_4. A risk score was derived from these four genes. ROC curves of survival prediction power from 1 to 10 years showed that the risk score had a high predictive power across different time points (AUC=0.785~0.903) (Fig. 1D). Multivariate Cox multivariate analysis showed that the risk score was an independent prognostic factor (HR=1.633, P = 9.695E-06) (Tab.1). We divided patients of all grades into high or low subgroups according to the risk score, indicating that the independent prognostic role of this score in all grades (Fig. 2).
3.2.2 Validation
To further verify the model prediction power, we divided samples into the low- and high-risk groups according to the optimal cut point in five independent datasets. And we observed a significant difference in survival amid all these datasets (Fig.S2-S6). The dataset GSE74187 contained only GBM samples and was considered as an independent validation set for GBM. The results showed that the risk score remained its distinction ability in survival differences (Fig. S6). Multivariate Cox model revealed that the risk score was an independent prognostic factor in all validation sets (Tab.S5). Besides, we found that all four genes were highly expressed in patients with a high-risk score.
3.3 Pathway analysis based on prognosis model
High-risk scores were closely related to ESTIMATE Score, Immune Score, and Stromal Score, but negatively correlated with tumor purity (Fig. 3). We drew the GOplot for GO-annotation for genes that were significantly associated with risk scores (r >= 0.7). Finally, 83 GO BP related terms enriched in the high-risk group were obtained (qvalue < 0.05), among which these genes mainly involved in cellular processes such as extracellular structure organization, collagen fibril organization (Fig. 4C). To further discover the underlying functional mechanisms that lead to different prognosis in patients with high- and low-risk groups, we used GSEA for gene expression analysis based on ImsignSignature and KEGG. The results indicated nine immune signals, such as Macrophages, Macrophages M1, Macrophages M2, Dendritic cells, T cells gamma delta, were activated in the high-risk group (Fig.4A). KEGG analysis enriched 52 pathways significantly correlated with risk scores, including KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION, KEGG_FOCAL_ADHESION, and KEGG_JAK_STAT_SIGNALING_PATHWAY, etc. were highly enriched in the high-risk patients (Fig. 4B).
3.4 Clinic features and genomic background based on prognosis model
After comparing the high-risk group with the low-risk group, we found no difference in gender between the two groups. The average age of patients in the low-risk group is 41.06±12.61y (n=455), which is lower than that in the high-risk group (58.63±12.97y, n=222) with statistical significance (t=-16.636, p-value<2.2e-16). As for the pathological grades, the high-risk group mainly consists of high-grade gliomas, glioblastomas (GBM), while the low-risk group contains mostly LGG (p=1.17e-80) composed of oligodendrogliomas (O), oligoastrocytomas (OA) and astrocytomas (A) histologically (Fig.3B). The risk score was associated with distinct genomic alterations. According to the glioma classification reported by Verhaak et al (13), we discovered proneural subtype (PN) and neural subtype (NE) gliomas are predominantly involved in the low-risk group, and classical subtype (CL) and mesenchymal subtype (ME) gliomas accumulated predominantly in the high-risk group contrarily. Besides, G-CIMP- and G-CIMP+ gliomas enriched in the high-risk and the low-risk group separately. Loss of heterozygosity (LOH) at 1p/19q, IDH mutation, MGMT methylation largely enriched in the low-risk group. On the other hand, the gain of chr 7/loss of chr 10 occurred in the high-risk group chiefly (Tab.2). We acquired consistent results after conducting the same analysis in the validation dataset (Tab.S6).
Risk level reflected the multi-omics alteration: IDH1 mutation is the most common mutation in glioma which was involved mainly in the low-risk group. In our exploration, we found out that by combining IDH1 mutation with ATRX mutation, gliomas in the low-risk group could be further divided into three subtypes: LowRisk_IDH1wt, LowRisk_IDH1mut/ATRXmut and LowRisk_IDH1mut/ATRXwt with obvious survival difference (Fig.4E). LowRisk_IDH1wt group contained A, O, OA, and GBM without 1p/19q codeletion. LowRisk_IDH1mut/ATRXmut group contained more A with a high mutation frequency of TP53 and few 1p/19q codeletions. LowRisk_IDH1mut/ATRXwt group, consisted of O largely, enriched almost all gliomas with 1p/19q codeletion, CIC and FUBP1 mutation. However, IDH1wt and mutation of EGFR, PTEN, NF1 and RB1 were mostly enriched in the high-risk group (Fig. 4D, S7C). According to the overview of the mutation landscape, we could find that TP53 mutation existed in all subtypes. The mutation of EGFR and IDH would not coexist in the same sample and the same relationship appeared in ATRX mutation and TERT mutation. The else results we discovered from the landscape was that TERT mutation mostly exists in LowRisk_IDH1mut/ATRXwt group and the high-risk group and CIC mutation accompany with 1p/19q codeletion.
Then the differential methylation analysis of high-grade and low-grade gliomas was performed, and 62,813 regions were obtained, of which 47,271 regions were significantly correlated with the risk score (r>0.5, p<1.0E-15) and simultaneously these methylation regions were widely distributed on all chromosomes. 46,116 (97.6%) regions were negatively correlated with the risk score and 1155 (2.4%) regions had positive correlation. These negative-correlated methylations were mainly associated with IDH1 mutations, enriched in LowRisk_IDH1mut/ATRXmut and LowRisk_IDH1mut/ATRXwt subtypes. Meanwhile, positive-correlated methyl regions were mainly associated with IDH wild type (Fig. S7A). We further explored the degree of methylation of 154 probes that could further distinguish patients of the high-risk group into two subgroups with significant differences in overall survival (Tab. S7).
The CNV-changing genes that were significantly associated with risk scores were mainly enriched on chr1, 7, 10, and 19. The gene deletion on chr1 was mainly enriched in LowRisk_IDH1mut/ATRXwt; while chr7 was mainly amplified in all grades of gliomas, especially in the high-risk group. In IDH1 wild type, chr10 deletion coexisted with chr7. Chr19 was mainly characterized by the deletion in LowRisk_IDH1mut/ATRXmut and LowRisk_IDH1mut/ATRXwt, especially in the latter subtype, and the amplification predominantly in the high-risk group. The frequency of changes in chr19 in LowRisk_IDH1mut/ATRXmut was significantly lower than that of chr1, indicating that the co-deletion of chr1 and chr19 occurred only in LowRisk_IDH1mut/ATRXwt (Fig. S7B). We further found that the 314 genes, as risk factors, significantly affected the overall survival rate of patients with LowRisk_IDH1mut/ATRXmut subtype gliomas. These genes were located on chromosomes 9p22 and 10q24. And the amplification of 212 genes, locating at chromosomes 8p21 and 7q21, significantly affected the overall survival rate of patients with LowRisk_IDH1mut/ATRXwt subtype gliomas (Tab. S8).