DOI: https://doi.org/10.21203/rs.2.23057/v1
Background: Diffuse gliomas are a group of diseases that contain different degrees of malignancy and complex heterogeneity. Previous studies proposed biomarkers for certain grades of gliomas, but few of them have conducted a systematic analysis of different grades to search for molecular markers.
Methods: WGCNA was used to find significant genes associated with malignant progression of diffuse glioma in TCGA glioma sequencing expression data and the GEO expression profile-merge meta dataset. Lasso regression was used for potential model building and the best model was selected by CPE, IDI and C_index. Risk score model was used to evaluate the gene signature prognostic power. Multi-omics data, including CNV, methylation, clinical traits and mutation, were used for model evaluation.
Results: We find out 67 genes significantly associated with malignant progression of diffuse glioma by WGCNA. Next, we establish a new 4 gene molecular marker (KDELR2, EMP3, TIMP1, and TAGLN2). Multivariate cox analysis identified the risk score of the 4 genes as an independent predictor of prognosis in patients with diffuse gliomas, and its predictive power was independent of the histopathological grades of glioma. Further, we had confirmed in five independent test datasets and the risk score remained good predictive power. The combination of the prognosis model with specific molecular characteristics possessed better predictive power than risk score solely and we divided the low-risk group into three subtypes: LowRisk_IDH1wt, LowRisk_IDH1mut/ATRXmut, and LowRisk_IDH1mut/ATRXwt by combining IDH1 mutation with ATRX mutation, which possessed obvious survival difference. In further analysis, we found that the 4 gene prognosis model possessed multi-omics features.
Conclusion: we established a malignant-related 4-gene molecular marker by glioma expression profile data from multiple microarrays and sequencing data. The four markers had good predictive power on the overall survival of glioma patients and were associated with gliomas’ clinical and genetic backgrounds, including clinical features, gene mutation, methylation, CNV, signal pathways.
Gliomas are the most common and lethal type of primary CNS tumors in adults. Among them, diffuse gliomas (DGs) constitute the majority, including lower-grade gliomas (LGG, grade II-III), and glioblastomas (GBM, grade IV) according to WHO classification (1). Due to intra- and intertumoral heterogeneity, gliomas' molecular characteristics, and clinical phenotypes can be different, even if identified as a similar histological grade. For example, grade II glioma with IDH mutation and 1p19q codeletion possesses a far better prognosis, whose median survival is 96 months, than the one with IDH wild type whose median survival is merely 20.4 months long (2). Although regarded as benign natured tumors, more than 70% of diffuse LGGs would transform into anaplastic gliomas or even secondary glioblastomas in 10 years (3). Thus, the classification system based on histology is not sufficient for glioma precise diagnosis and WHO introduced molecular characteristics into glioma classification system in 2016 for improving diagnostic accuracy. Meanwhile, increasing the development of genome sequencing technologies and bioinformatics tools shed a light for us on understanding the pathology of gliomas. By analyzing these sequencing data, researchers have found plenty of specific molecular markers like IDH mutation, MGMT methylation, and TERT mutation, etc. These molecular markers of glioma gave us a novel understanding of the mechanism, diagnosis and treatment of glioma (4, 5). However, the molecular mechanisms of different survival outcomes and tumor heterogeneity are still unclear and remain further exploration. For improving the diagnosis accuracy and precision of prognosis prediction, it is necessary for us to search for more effective molecular markers and targets.
Previous studies have confirmed that tumor classification based on expression profiles is an objective and reliable method to help diagnosis and treatment, and lots of practices were performed on glioma by now (6). But most of them were inclined to a specific glioma subtype, there are few studies build an adequate prediction model for all DGs. In our study, we united multiple complete large-scale gene expression profiles, high-throughput sequencing datasets and multidimensional data, including gene mutations, copy number variations and methylation datasets, of DGs to figure out the most responsible molecular markers for glioma prognosis.
2.1 Data collection
We obtained TCGA glioma RNA-seq raw count data, copy number variation (CNV) data, DNA methylation data, mutation data and corresponding clinical information from GDC by using R package TCGAbiolinks. RNA-seq raw count data were preprocessed and normalized with R package DESeq2 and R package preprocessCore. CGGA RNA-seq data or microarray data (level 3) is obtained from http://cgga.org.cn. Besides, we obtained 9 datasets from the GEO database (https://www.ncbi.nlm.nih.gov/geo) (Tab.S1). All files obtained from the GEO datasets were in the CEL format and were preprocessed by R package affy and gcrma. Due to limited patient size or incomplete pathological information, we merged 6 microarray sets (GSE43378, GSE19728, GSE4290, GSE61374, GSE43289, GSE7696) in the Hg-u133 Plus 2.0 platform into one large dataset (named meta_GSE1) and corrected batch effect by using R package sva. The batch effect in the CGGA RNA-seq data was also removed. Another 3 GEO datasets (GSE16011, GSE68848, GSE74187) were downloaded for model validation after batch effect removal. All non-DGs samples from all datasets were excluded from this study. We merged the gene symbol with multi probes and the average value was used. The study was approved by the Jishou university ethics review committee.
2.2 Identification of tumor progression-associated genes with weighted correlation network analysis (WGCNA)
In order to screen progression-associate genes, we used the expression profile data from TCGA and meta_GSE1 and constructed a co-expression network using R package WGCNA (7). Common genes in the most grade-relevant module identified from these two datasets respectively were selected as the candidate for tumor progression-associated genes. Briefly, we first filtered out genes with missing value or low variable genes (var < 0.05). Outliers in samples were detected and removed. Next, we calculated sij and the adjacency matrix aij as follows: sij = |cor(xi, xj)|, aij = Sijβ. Where Xi and Xj were vectors of expression value for gene i and j, sij represented the Pearson's correlation coefficient of gene i and gene j, aij encoded the network connection strength between gene i and gene j (8). The power of β = 9 (scale free R2 = 0.95) was selected as the soft-thresholding parameter to ensure a scale-free network. Then, average linkage hierarchical clustering was conducted based on a topological overlap matrix (TOM)-based dissimilarity measure. We considered 30 as the minimum number of genes in each module. The gene modules were identified using the DynamicTreeCut algorithm and merged with a cutheight of 0.2. The module eigengenes (MEs) were generated as the first principal component after the principal component analysis was performed with the expression data for co-expressed modules. Module membership assignment (kME) was determined as Pearson’s correlation coefficient between gene expression values and MEs. The candidate tumor progression-associated genes were screened from modules highly correlated with glioma grade (Absolute value of gene significance (GS) ≥0.2 and MM absolute value ≥0.8).
2.3 Prognostic model construction
First, we screened genes that are associated with prognosis by univariable Cox analysis using coxph function of R package survival from candidate tumor progression-associated genes. To identify common prognosis-associated genes regardless of glioma grades, we selected intersected genes from all survival-associated genes in patients with grade II, III, and IV DGs separately (p < 0.05). To construct a robust model, we excluded samples with a survival time of less than one month as these patients may die due to reasons other than tumors. 630 samples were obtained for further analysis. To reduce overfitting, we used the Least Absolute Shrinkage and Selection Operator (LASSO) regression analysis to filter these genes by R package glmnet (9, 10). The best lambda value is selected by the cv.glmnet function. The predictive ability of the model was determined by C_index, Concordance Probability Estimate (CPE) and integrated discrimination improvement (IDI) method (using coxph function, R package clinfun, and R package survIDINRI, respectively). A risk score model was constructed based on the expression levels of prognosis-associated genes and the contribution coefficient (β) of the univariate Cox proportional hazards regression model.
N, Expi, and Coei represented the number of signature genes, gene expression level, and coefficient value, respectively. The risk score was then divided into high and low-risk groups with the optimal cut-off value calculated by surv_cutpoint function. Multivariate Cox analysis was conducted from the analyse_multivariate function in R package survivalanalysis. Prognostic receiver operating characteristic (ROC) curves of risk scores were then created at 1–10 years.
2.4 Integrated analysis combined clinical and multi-omics data of risk scoring model
Further, we used TCGA glioma expression profiles data, gene mutations data, CNV, methylation data, and immune activity to investigate the underlying mechanisms in each risk subgroup. Immune gene signatures were obtained from the R package imsig (https://github.com/ajitjohnson/imsig). ImSig is a set of gene signatures that can be used to estimate the relative abundance of immune cells in tissue transcriptomics data, especially in cancer datasets. The KEGG gene signatures were obtained from the Molecular Signatures Database v6.2 version (http://software.broadinstitute.org/gsea/downloads.jsp). GSEA analysis was performed by R package clusterProfiler (11). Tumor ESTIMATE Score, Immune Score, Stromal Score and tumor Purity were analyzed with R package estimate. Gene mutations data, DNA methylation data and CNV data were preprocessed and analyzed using R package TCGAbiolinks. 101 glioma-related driver genes that we selected were derived from the mutational cancer driver database (https://www.intogen.org/search) (12). Gene mutation analysis was performed by R package maftools. The R package ComplexHeatmap was used for heatmap drawing. The GOplot package was used for Gene Ontology (GO) annotation. The R package ggstatsplot was used for correlation analysis and plotting.
2.5 Statistical analysis
All statistical analyses were performed with R (version 3.5.2, http://www.r-project.org). The optimal cutoff value for patients' risk stratification was performed by surv_cutpoint function in R package survminer. Survival differences between the low-risk and high-risk groups were assessed by the Kaplan–Meier estimate and compared using the log-rank test. All statistical tests were two-sided and P values < 0.05 were considered statistically significant.
3.1 Identification of glioma progression related genes
A total of 24 modules were identified by WGCNA in TCGA gene expression profile data (Fig. 1A-C), of which darkolivegreen module showed the most significant negative correlation with tumor grade (r = -0.6, p = 3e-69). The pink and sienna3 modules were also inversely correlated with darkolivegreen and tumor grade. (r = -0.48, P = 5e-42; r = -0.46, P = 2e-38), while darkgreen module showed the most positive correlation with tumor grade (r = 0.46, P = 5e-38). The midnightblue and orange modules had positive correlation with darkgreen and tumor grade (r = 0.41, P = 3e-30; r = 0.34, P = 3e-20). We finally chose these 6 modules, which contain 705 genes in the TCGA cohort, for further study. Also, 12 modules in the meta_GSE1 dataset were identified by WGCNA (Figure.S1A-C). Of these modules, brown module showed the most significantly negative correlation with tumor grade (r = -0.6, P = 6e-50), and magenta module was highly correlated with brown module and negatively correlated with tumor grade (r = -0.41, P = 2e-22) as brown module. Red module had the most significantly positive correlation with tumor grade (r = 0.58, P = 1e-46). Blue and tan modules were highly correlated with red module and had positive correlation with tumor grade (r = 0.5, P = 5e-34;r = 0.42, P = 6e-23). We then identified five modules associated with tumor grade. 254 genes in these 5 modules in the meta_GSE1 cohort were identified. Finally, we select 67 intersected genes in these two cohorts as candidate glioma progression associated genes (Tab. S2).
3.2 Construction and validation of the prognostic gene signature.
3.2.1 Construction
All 67 genes were statistically significant in the TCGA cohort (P<0.05) after the univariate Cox regression. We also performed the univariate Cox analysis of 67 genes in each pathological grade and 37 genes, which possessed statistical significance among all grades, were identified, suggesting the prognostic role of these genes that were independent of tumor pathological grades (Tab. S3). Next, we performed the LASSO regression analysis to build a robust model with these representative genes. We first used 37 genes as an input for LASSO regression analysis. After 105 times iterations, two stable models were obtained: one with the best predictive performance when minimum lambda was obtained and it contained five genes (mod_37_5). The other was a model that contained the minimal number of variables with ideal predictive efficiency and was called mod_37_4 that contained four genes. A similar procedure was performed for 67 genes mentioned above and two models (mod_67_6 and mod_67_5) were obtained, including 6 and 5 genes, respectively. We did not observe a significant difference in the predictive power of four models through the three methods: C_index, CPE and IDI (Tab. S4). We finally chose the simplest model with good predictive power, mod_37_4. A risk score was derived from these four genes. ROC curves of survival prediction power from 1 to 10 years showed that the risk score had a high predictive power across different time points (AUC=0.785~0.903) (Fig. 1D). Multivariate Cox multivariate analysis showed that the risk score was an independent prognostic factor (HR=1.633, P = 9.695E-06) (Tab.1). We divided patients of all grades into high or low subgroups according to the risk score, indicating that the independent prognostic role of this score in all grades (Fig. 2).
3.2.2 Validation
To further verify the model prediction power, we divided samples into the low- and high-risk groups according to the optimal cut point in five independent datasets. And we observed a significant difference in survival amid all these datasets (Fig.S2-S6). The dataset GSE74187 contained only GBM samples and was considered as an independent validation set for GBM. The results showed that the risk score remained its distinction ability in survival differences (Fig. S6). Multivariate Cox model revealed that the risk score was an independent prognostic factor in all validation sets (Tab.S5). Besides, we found that all four genes were highly expressed in patients with a high-risk score.
3.3 Pathway analysis based on prognosis model
High-risk scores were closely related to ESTIMATE Score, Immune Score, and Stromal Score, but negatively correlated with tumor purity (Fig. 3). We drew the GOplot for GO-annotation for genes that were significantly associated with risk scores (r >= 0.7). Finally, 83 GO BP related terms enriched in the high-risk group were obtained (qvalue < 0.05), among which these genes mainly involved in cellular processes such as extracellular structure organization, collagen fibril organization (Fig. 4C). To further discover the underlying functional mechanisms that lead to different prognosis in patients with high- and low-risk groups, we used GSEA for gene expression analysis based on ImsignSignature and KEGG. The results indicated nine immune signals, such as Macrophages, Macrophages M1, Macrophages M2, Dendritic cells, T cells gamma delta, were activated in the high-risk group (Fig.4A). KEGG analysis enriched 52 pathways significantly correlated with risk scores, including KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION, KEGG_FOCAL_ADHESION, and KEGG_JAK_STAT_SIGNALING_PATHWAY, etc. were highly enriched in the high-risk patients (Fig. 4B).
3.4 Clinic features and genomic background based on prognosis model
After comparing the high-risk group with the low-risk group, we found no difference in gender between the two groups. The average age of patients in the low-risk group is 41.06±12.61y (n=455), which is lower than that in the high-risk group (58.63±12.97y, n=222) with statistical significance (t=-16.636, p-value<2.2e-16). As for the pathological grades, the high-risk group mainly consists of high-grade gliomas, glioblastomas (GBM), while the low-risk group contains mostly LGG (p=1.17e-80) composed of oligodendrogliomas (O), oligoastrocytomas (OA) and astrocytomas (A) histologically (Fig.3B). The risk score was associated with distinct genomic alterations. According to the glioma classification reported by Verhaak et al (13), we discovered proneural subtype (PN) and neural subtype (NE) gliomas are predominantly involved in the low-risk group, and classical subtype (CL) and mesenchymal subtype (ME) gliomas accumulated predominantly in the high-risk group contrarily. Besides, G-CIMP- and G-CIMP+ gliomas enriched in the high-risk and the low-risk group separately. Loss of heterozygosity (LOH) at 1p/19q, IDH mutation, MGMT methylation largely enriched in the low-risk group. On the other hand, the gain of chr 7/loss of chr 10 occurred in the high-risk group chiefly (Tab.2). We acquired consistent results after conducting the same analysis in the validation dataset (Tab.S6).
Risk level reflected the multi-omics alteration: IDH1 mutation is the most common mutation in glioma which was involved mainly in the low-risk group. In our exploration, we found out that by combining IDH1 mutation with ATRX mutation, gliomas in the low-risk group could be further divided into three subtypes: LowRisk_IDH1wt, LowRisk_IDH1mut/ATRXmut and LowRisk_IDH1mut/ATRXwt with obvious survival difference (Fig.4E). LowRisk_IDH1wt group contained A, O, OA, and GBM without 1p/19q codeletion. LowRisk_IDH1mut/ATRXmut group contained more A with a high mutation frequency of TP53 and few 1p/19q codeletions. LowRisk_IDH1mut/ATRXwt group, consisted of O largely, enriched almost all gliomas with 1p/19q codeletion, CIC and FUBP1 mutation. However, IDH1wt and mutation of EGFR, PTEN, NF1 and RB1 were mostly enriched in the high-risk group (Fig. 4D, S7C). According to the overview of the mutation landscape, we could find that TP53 mutation existed in all subtypes. The mutation of EGFR and IDH would not coexist in the same sample and the same relationship appeared in ATRX mutation and TERT mutation. The else results we discovered from the landscape was that TERT mutation mostly exists in LowRisk_IDH1mut/ATRXwt group and the high-risk group and CIC mutation accompany with 1p/19q codeletion.
Then the differential methylation analysis of high-grade and low-grade gliomas was performed, and 62,813 regions were obtained, of which 47,271 regions were significantly correlated with the risk score (r>0.5, p<1.0E-15) and simultaneously these methylation regions were widely distributed on all chromosomes. 46,116 (97.6%) regions were negatively correlated with the risk score and 1155 (2.4%) regions had positive correlation. These negative-correlated methylations were mainly associated with IDH1 mutations, enriched in LowRisk_IDH1mut/ATRXmut and LowRisk_IDH1mut/ATRXwt subtypes. Meanwhile, positive-correlated methyl regions were mainly associated with IDH wild type (Fig. S7A). We further explored the degree of methylation of 154 probes that could further distinguish patients of the high-risk group into two subgroups with significant differences in overall survival (Tab. S7).
The CNV-changing genes that were significantly associated with risk scores were mainly enriched on chr1, 7, 10, and 19. The gene deletion on chr1 was mainly enriched in LowRisk_IDH1mut/ATRXwt; while chr7 was mainly amplified in all grades of gliomas, especially in the high-risk group. In IDH1 wild type, chr10 deletion coexisted with chr7. Chr19 was mainly characterized by the deletion in LowRisk_IDH1mut/ATRXmut and LowRisk_IDH1mut/ATRXwt, especially in the latter subtype, and the amplification predominantly in the high-risk group. The frequency of changes in chr19 in LowRisk_IDH1mut/ATRXmut was significantly lower than that of chr1, indicating that the co-deletion of chr1 and chr19 occurred only in LowRisk_IDH1mut/ATRXwt (Fig. S7B). We further found that the 314 genes, as risk factors, significantly affected the overall survival rate of patients with LowRisk_IDH1mut/ATRXmut subtype gliomas. These genes were located on chromosomes 9p22 and 10q24. And the amplification of 212 genes, locating at chromosomes 8p21 and 7q21, significantly affected the overall survival rate of patients with LowRisk_IDH1mut/ATRXwt subtype gliomas (Tab. S8).
All four genes were positively correlated with tumor pathological grades. Epithelial Membrane Protein 3 (EMP3), a 4-transmembrane glycoprotein, identified firstly as a putative tumor suppressor in gliomas (14). Jun F et al. (15) found that EMP3 is highly expressed in CD44-high primary GBMs, and promoted tumor progression by TGF-β/Smad2/3 signaling activation. Recent studies (16) have shown that EMP3 expression is significantly higher in high-grade gliomas than in low-grade gliomas and normal brain tissues, and its high-level expression is a prognostic indicator of poor survival. KDELR2, a transmembrane protein of the endoplasmic reticulum, can recognize proteins with KDEL (lam-aspartate-glut-leucine) tetrapeptide sequence, mediating these proteins’ recycling from the Golgi back to the endoplasmic reticulum. It has been found that KDELR2 stimulates ECM degradation by inducing Src activation at the invadopodia and leads to phosphorylation of the Src substrates, cortactin, thereby promoting tumor metastasis and invasive growth (17). The expression level of KDELR2 in GBM was significantly higher than that of LGG and can be used as a prognostic marker for overall survival (6). TAGLN2, an actin-binding protein, can regulate cell morphology, movement and transformation by binding to actin. TAGLN2 is abnormally expressed in a variety of cancers (18), and its deregulation is considered to be correlated with tumorigenesis and tumor development. In gliomas, TAGLN2 is a potential oncogenic factor, which is regulated by TGFβ2 to promote glioma invasion and growth (19). Its expression, particularly enriched in mesenchymal molecular glioma subtype, is associated with pathological grades and poor prognosis. TIMP1 belongs to the matrix metalloproteinase (MMP) family and inhibits most MMPs, especially selectively inhibits MMP9, participating in the homeostatic regulation of the extracellular matrix. TIMP1 plays an important role in the gliomas with IDH wild type, which may promote the survival of cancer cells by negatively regulating the adaptive immune response (20). It has also been reported that low expression of TIMP-1 in glioblastoma patients predicts longer survival (21). Aaberg-Jessen et al. (22) found that Co-expression of TIMP-1 and CD63 might have effects in glioblastoma stemness and may predict the poor prognosis of patients through influencing tumor aggressiveness and resistance of therapy. The above evidence confirms the prognostic value of the four genes for glioma patients.
GSEA of the 4-gene signature showed that Macrophages, Dendritic cells and T cells gamma delta activities may contribute to patients' high risk. Also, KEGG CYTOKINE CYTOKINE RECEPTOR INTERACTION, KEGG JAK-STAT SIGNALING PATHWAY, KEGG FOCAL ADHESION were significantly enriched. Macrophages/microglial cells constituted the largest immune cell populations in GBMs (23). These cells promote tumor proliferation, invasion, and survival (24). The cytokines are secreted by immune cells or stromal cells (such as vascular endothelial cells, epidermal cells and fibroblasts), and participate in the immune response and inflammatory process, regulating cell proliferation, differentiation, and apoptosis. JAK/STAT is a common pathway for many cytokine-signaling (25). Cell adhesion is essential for interaction with the extracellular matrix (ECM) when tumor cells invade (26). The results of the greater presence of inflammatory immune responsive microenvironment and ECM components were associated with tumor purity. The progress of gliomas is accompanied by an increase in angiogenesis and invasiveness, which inevitably depends on the production of a large number of adhesion molecules and extracellular matrices (27). The low levels of ECM and adhesion molecules in tumors mean high tumor purity and patients with low tumor purity usually occur in low-grade gliomas, IDHmut, proneural GBM and predict better clinical outcomes (28). In contrast, low-purity tumors have higher cell adhesion and extracellular matrix content, mainly found in IDHwt and mesenchymal subtype GBM and both are labels for poor prognosis (29). Meanwhile, they are enriched with immunological processes and inflammation (30). Thus, we applied the ESTIMATE algorithm (31) to predict tumor purity using gene expression profiles. It showed a significant positive relationship of ESTIMATE Score, Immune Score, Stromal Score, low tumor purity (an indicator of poor prognosis in glioma (28)) and high-risk score.
The clinical features of glioma are also important factors influencing the prognosis of patients. We found that the risk score of older patients was higher than the youth, which means a worse prognosis in older patients, and the phenomenon was consistent with the existing research conclusions (32). Compared with the low-risk group, the high-risk group was mainly composed of GBM and lower-grade glioma (G2/G3) with poor prognosis. This group mainly contains the currently known molecular and genetic characteristics that contribute to poor prognosis, such as CL and ME subtypes (13), non-LOH at 1p/19q, IDH1wt, G-CIMP − and MGMT unmethylation (33), chr7 gain and chr10 loss (29). We also found that a small number of GBM patients with a relatively good prognosis can be classified into the low-risk group by this scoring system. To further understand the underlying molecular mechanisms of risk score, we used TCGA data to analyze the association of gene mutations, methylation, CNV and risk score. We found that patients in high and low-risk groups had significant differences in gene mutation characteristics. Even within the same risk group, the patients in the low-risk group could be divided into three subtypes with different overall survival prognosis based on IDH1 and ATRX mutation. ATRX mutation is mutually exclusive with 1p/19q codeletion (34) and 1p/19q codeletion often represents a better prognosis. Methylation analysis revealed that methylation probes that were negative-correlated with risk scores were enriched in the low-risk group and correlated with IDH1 mutations. On the contrary, positive-correlated methylation probes were primarily associated with IDH wild type. We found that CNV-changing genes significantly associated with risk scores were mainly enriched on chr1, 7, 10, and 19, and each risk group had specific CNV characteristics. In our study, the risk score was correlated with the current major glioma molecular characteristics, and through combining methylation and CNV features this model can be used to further subdivide gliomas into different subtypes. We also found that the deletion of genes on 9p22/10q24 and 8p21/7q21 affected the prognosis of low-risk group patients with IDH mutations, which suggests that amplification or deletion of these chromosomal fragments may play a potential role in specific LGG subtypes, meanwhile, partial methylation probes possess similar ability to stratify the patients in high-risk groups. All these deserve further study.
In summary, we established a malignant-related 4-gene molecular marker by glioma expression profile data from multiple microarrays and sequencing data. The four markers had good predictive power on the overall survival of glioma patients and were independent of pathological grades. To our knowledge, the 4-gene signature related integrated multi-omics prognostic model has not been reported previously and could be a useful prognostic and diagnostic classification tool of glioma. However, several limitations of our study should be taken into consideration. The incompleteness of follow-up data, no progress-free survival analysis was performed, and no more datasets were further validated for the re-stratification of the four subtypes (CL, PN, NE, ME). Second, all data are based on microarray and sequencing expression data, so their predictive effects are only applicable to mRNA expression and not to protein expression levels. Third, the prognostic model only considered genes that are highly expressed in lower-grade glioma relative to adjacent cancer but did not consider the prognostic role of non-differentiated genes. Some molecules may not be highly expressed in cancer, but still affect the prognosis of patients via other means. Furthermore, the molecular mechanisms of how the 4-gene signature affected the prognosis of glioma patients should be further elucidated by a series of experiments.
Additional file 1: Supplementary Tables S1-S8. Table S1: Datasets list. Table S2: Candidate genes selected by WGCNA. Table S3: PAGs unique cox regression. Table S4: Comparison of prediction capabilities of 4 models. Table S5: Unique and multiple Cox regression for risk score model. Table S6: clinical and molecular trait validation. Table S7: Specific methylation probes. Table S8: Specific CNV.
Additional file 2: Supplementary Figures S1-7. Figure S1: Module identification and correlation in the meta_GSE1 dataset. Figure S2-S6: Risk score distribution and survival difference in CGGA microarray, CGGA RNA-seq, GSE16011, GSE68848, GSE74187 datasets. Figure S7: Multi-omics feature of subgroups.
DGs: diffuse gliomas, LGG: lower-grade gliomas, GBM: glioblastomas, IDH: isocitrate dehydrogenase, MGMT: O6methylguanine DNA methyltransferase, TERT: telomerase reverse transcriptase, CNV: copy number variation, TCGA: the cancer genome atlas, CGGA: Chinese glioma genome atlas, TOM: topological overlap matrix, MEs: module eigengenes, CPE: Concordance Probability Estimate, IDI: integrated discrimination improvement, ROC: receiver operating characteristic curve, KEGG: Kyoto Encyclopedia of Genes and Genomes, GO: Gene Ontology, GSEA: gene set enrichment analysis, O: oligodendrogliomas, OA: oligoastrocytomas, A: astrocytomas, PN: proneural subtype, NE: neural subtype, CL: classical subtype, ME: mesenchymal subtype, G-CIMP: Glioma CpG island methylator phenotype, LOH: Loss of heterozygosity, ATRX: X - linked alpha thalassemia mental retardation syndrome gene, MMP: the matrix metalloproteinase, ECM: extracellular matrix.
Fundings
This work was supported by the National Natural Science Foundation of China (Grant Nos. 81770781, 81472594) for Xuejun Li and the National Natural Science Foundation of China (Grant No. 81560414) for Chunhai Huang.
Acknowledgements
We gratefully acknowledge the TCGA and CGGA project organizers as well as all study participants for making data and results available.
Availability of data and materials
The RNA sequencing data and microarray data of human glioma samples were obtained from the TCGA data portal (https://portal.gdc.cancer.gov) and CGGA data portal (http://cgga.org.cn). GSE43378, GSE19728, GSE4290, GSE61374, GSE43289, GSE7696, GSE16011, GSE68848 and GSE74187 were from GEO database (https://www.ncbi.nlm.nih.gov/geo). Copy number variation data, DNA methylation data, mutation data and corresponding clinical information of glioma samples were from the TCGA data portal.
Author Contributions
Chunhai Huang and Zujian Xiong conducted the analysis. Zujian Xiong wrote this manuscript and Xuejun Li came up with this idea and supported this paper.
Ethics approval and consent to participate
Ethical approval was waived since we used only publicly available data and materials in this study.
Consent for publication
Not applicable.
Conflict of Interest
The authors state that the research was conducted without any commercial or financial relationship that could be interpreted as a potential conflict of interest.
Table 1 Multivariate Cox multivariate analysis
Factor |
HR |
Lower_CI |
Upper_CI |
p value |
age |
1.0214 |
1.0062 |
1.0369 |
0.005816 |
gender |
0.9167 |
0.6707 |
1.2530 |
0.585471 |
hist_GBM |
0.4834 |
0.2059 |
1.1351 |
0.095105 |
hist_LGG_O |
0.6788 |
0.3695 |
1.2469 |
0.211736 |
grade |
2.6161 |
1.5405 |
4.4426 |
0.000372 |
hist_LGG_A |
1.0190 |
0.5849 |
1.7753 |
0.946989 |
hist_LGG_OA |
1.0000 |
NA |
NA |
NA |
KPS |
0.9834 |
0.9712 |
0.9957 |
0.008511 |
RiskScore |
1.6327 |
1.3139 |
2.0287 |
9.69E-06 |
Table 2 Distribution overview of risk groups
Characteristics |
N |
L_risk |
H_risk |
X-squared |
p-value |
|
|
Gender |
|
|
|
1.055 |
0.304 |
|
|
Male |
386 |
252 |
134 |
|
|
|
|
Female |
284 |
197 |
87 |
|
|
|
|
Grade |
|
|
|
368.1 |
1.17E-80 |
|
|
Grade II |
248 |
238 |
10 |
|
|
|
|
Grade III |
263 |
201 |
62 |
|
|
|
|
Grade IV |
161 |
11 |
150 |
|
|
|
|
Histology |
|
|
|
367.084 |
2.98E-79 |
|
|
A |
192 |
141 |
51 |
|
|
|
|
GBM |
161 |
11 |
150 |
|
|
|
|
O |
190 |
179 |
11 |
|
|
|
|
OA |
128 |
118 |
10 |
|
|
|
|
IDH_STATUS |
|
|
|
527.552 |
9.64E-117 |
|
|
WT |
238 |
27 |
211 |
|
|
|
|
MUT |
425 |
420 |
5 |
|
|
|
|
IDH_1P19Q_SUBTYPE |
|
|
|
106.61 |
5.42E-25 |
|
|
noncodel |
498 |
281 |
217 |
|
|
|
|
codel |
168 |
168 |
0 |
|
|
|
|
MGMT_PROMOTER_STATUS |
|
|
|
169.203 |
1.10E-38 |
|
|
nonMethy |
164 |
49 |
115 |
|
|
|
|
Methy |
474 |
399 |
75 |
|
|
|
|
CHR7_GAIN_CHR10_LOSS |
|
|
|
354.11 |
5.40E-79 |
|
|
WT |
507 |
438 |
69 |
|
|
|
|
MUT |
156 |
8 |
148 |
|
|
|
|
TERT_PROMOTER_STATUS |
|
|
|
49.614 |
1.87E-12 |
|
|
WT |
163 |
155 |
8 |
|
|
|
|
MUT |
157 |
98 |
59 |
|
|
|
|
ATRX_STATUS |
|
|
|
101.316 |
7.84E-24 |
|
|
WT |
463 |
259 |
204 |
|
|
|
|
MUT |
195 |
188 |
7 |
|
|
|
|
DNAMethyl_PANCAN |
|
|
|
|
1.11E-07* |
|
|
|
|
||||||
GBM non-CIMP |
109 |
2 |
107 |
|
|
|
|
GBM CIMP |
9 |
7 |
2 |
|
|
|
|
Cluster 5 |
3 |
0 |
3 |
|
|
|
|
Low purity |
6 |
0 |
6 |
|
|
|
|
TRANSCRIPTOME_SUBTYPE |
|
|
|
354.038 |
1.99E-76 |
|
|
PN |
241 |
224 |
17 |
|
|
|
|
NE |
114 |
99 |
15 |
|
|
|
|
CL |
88 |
4 |
84 |
|
|
|
|
ME |
102 |
14 |
88 |
|
|
|
|
*Fisher's Exact Test |
|
|
|
|
|
|