Identification of Potential Key Genes Associated with the Pathogenesis and Prognosis of Glioblastoma Based on Integrated Bioinformatics Analysis

Background : Despite striking advances in multimodality management, the low survival rate of Glioblastoma (GBM) patients has not been significantly improved and identifying novel diagnostic and prognostic biomarkers is urgently demanded. The present study aimed to identify potential key genes associated with the pathogenesis and prognosis of GBM. Methods : Differentially expressed genes between GBM and normal brain tissue samples were screened by an integrated analysis of multiple gene expression profile datasets. Key genes related to the pathogenesis and prognosis of GBM were identified by employing protein–protein interaction network and Cox proportional hazards model analyses. Results : We identified nine hub genes (TP53, FN1, EGFR, MYC, RRM2, EZH2, FOXM1, CD44 and MMP2) which might be closely associated with the pathogenesis of GBM. A prognostic gene signature consisted of RAB33B, KIAA1199, TEK, EVC, SOD2, CXCR4, hCG_40738, CHD9, GCSH, SUHW1, RPS6KA5, PDCD4, ZG16, KCNG1, DECR1, PPCS, SERPINF1, TMSB10, NAT10, HIC2, PIR and OR2W1 was constructed with a good performance in predicting overall survivals (OS). Conclusions : The findings of present study would provide certain reference for further predicting the diagnostic and prognostic biomarkers to facilitate the molecular targeting therapy of GBM.

3 as targeted and immune-based therapies, not all patients react to existing molecularly targeted agents developed for specified acknowledged biomarkers and no improvement is observed in the overall survival rate of GBM. [8] Existing studies still lack of a comprehensive understanding of the underlying pathophysiological mechanisms leading to invasiveness nature of GBM hinders the development of therapies for GBM. Therefore, comprehensive understanding the molecular mechanism of GBMs is of great necessity for locating a completely unique and effective therapeutic strategy. [9][10][11] Recently, microarray and high-throughput genomic sequencing technology has significantly facilitated the detection of critical genetic or epigenetic alternations in carcinogenesis and the identification of novel promising diagnostic, prognostic biomarkers and therapeutic targets. [12][13][14][15] Over the past years, expanding data and accumulating results from computational analyses furthering the understanding of underlying pathophysiological mechanisms and prognosis of GBM. [16,17] Moreover, in order to overcome the restricted or inconsistent results due to the application of either different technological platforms or a small sample size, integrated bioinformatics strategies have been adopted in cancer research and a large amount of valuable biological information has been uncovered. [18] Herein, we initially performed an integrated analysis and identified differentially expressed genes (DEGs) by using microarray and RNA sequencing data in human GBM and normal GBM tissue samples from multiple studies. Additionally, functional enrichment analysis was further conducted to analyze the main biological functions modulated by the DEGs. Ultimately, key genes affecting the pathogenesis and prognosis of GBM patients were identified by utilizing protein-protein interaction (PPI) network and survival analyses.

Gene Expression Profile Data Collection and Pre-processing
Microarray data on gene expression (GSE16011, GSE111260, GSE50161, GSE61335, GSE66354, GSE15824, GSE94349, GSE86574 and GSE68015) were downloaded from Gene Expression Omnibus (GEO) [http://www.ncbi.nlm.nih.gov/geo/]. All enclosed datasets meet the subsequent criteria: (1) 4 Human brain tissue samples were employed. (2) Case-control groups included. (3) No less than ten samples contained. Small sample size is one of the main challenges of microarray analysis, while a large sample size may reliably reveal the DEGs or non-coding RNAs, and integrated bioinformatics studies also prefer to use datasets with a relatively large sample size. [19,20] Therefore, the GEO datasets which contained no less than ten samples were chosen for further study. Raw RNA sequencing data containing 528 GBM samples and 10 matched non-cancerous samples were obtained from The Cancer Genome Atlas (TCGA) [https://cancergenome.nih.gov/].

Integrated Analysis of Microarray Datasets
Limma package in R software was applied to perform the normalization for the matrix data of each GEO dataset, and the DEGs were also screened between tumor and normal tissues.
[21] The DEGs identified from the 9 datasets were reliably integrated by an R package "RobustRankAggreg" [22] based on a robust rank aggregation (RRA) method. This RRA method screens genes ranked consistently better than expected based on the comparison of actual data with a null model that assumes random order of input lists. [22] Meanwhile, bonferroni correction was performed for avoiding false positive results and leave-one-out cross-validation was applied to assess the stability of acquired p-values. Therefore, we did not integrate the gene expression values of samples varies from different datasets. And as with many published papers based on the RobustRankAggreg package [23, 24], we have not performed batch effect correction as well. |log 2 FC| >1, P-value < 0.05 and adjust P-value < 0.05 were considered statistically significant for the DEGs.

DEGs Validation by TCGA
The integrated analysis results of GEO datasets were further validated with the RNA sequencing data in the TCGA GBM dataset. The data were normalized and analyzed by the Limma package.
[21] Genes were considered to be significantly differentially expressed at a significance level of log fold change (|log2FC|) >1, P-value < 0.05 and adjust P-value < 0.05. Overlapping DEGs between the integrated microarray data and RNA sequencing data analyses were retained for further study.

Functional and Pathway Enrichment Analysis of DEGs
In order to figure out the potential biological functions correlated with the overlapping DEGs, Gene 5 Ontology(GO) Term Enrichment analysis was first conducted based the 512 DEGs utilizing the clusterProfiler package [25], which including enrichment for GO 'Biological Process', 'Molecular Function' and 'Cellular Component' terms. Moreover, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis based these genes were also implemented by clusterProfiler [25] to expound promising signaling pathways associated with the overlapping DEGs. At a significance level of P-value < 0.05 and adjust P-value < 0.05 were defined as the cut-off criteria.

PPI Network and Module Analysis for Co-expression Network of Predicted Target Genes.
The STRING database [26] has been widely applied to explore potential protein-protein interactions

Survival Analysis
Both preprocessed RNA-sequencing data and clinical data of GBM were downloaded from TCGA.
Among the clinical data, 525 sample files that contained status and survival time of patients were chosen for subsequent survival analysis. Univariate Cox proportional hazards regression analysis was utilized to predict the candidate genes that were strongly associated with survival. [31,32] Subsequently, the candidate genes with P-value < 0.05 were selected for further identification of the prognostic gene markers with multivariate Cox proportional hazards regression analysis.[33] These GBM patients were divided into either low-or high-risk groups according to the median prognostic risk score. Moreover, the screened prognostic gene markers were then enrolled into prognostic risk score model employing a least absolute shrinkage and selection operator (LASSO) regression to further narrow the correlated key genes for predicting prognosis of GBM.
[34] In addition, we performed time-dependent receiver operating characteristic (ROC) curve analysis by using an R package "survivalROC" to evaluate the predictive accuracy of the prognostic signature for time-6 dependent cancer death. [35] The area under the curve (AUC) was calculated to measure the predictive ability of the gene signature for clinical outcomes.

Statistical Analysis
The univariate and multivariate Cox proportional hazards regression analyses were performed using an R package "survival". Meanwhile hazard ratio (HR) and 95% confidence interval (CI) were calculated to identify protective (HR < 1) or risky genes (HR > 1). Besides, a survival curve made by Kaplan-Meier method was implemented to estimate the differences in survival time between the highand low-risk patients. All the statistical analyses were conducted with R (version 3.4.3) (https://www.rproject.org/).

Results
Identification of robust and reliable DEGs could provide immense help for understanding molecular pathologies and mechanisms underlying complex disease. To facilitate the identification of DEGs between GBMs and control samples, p-values were estimated to identify genes that were differentially expressed. Thus we integrated the DEGs of the nine datasets downloaded from GEO database and these DEGs were further validated by TCGA database. The information of the nine included GEO datasets in this study was shown in Table 1 and the detailed information of each sample was displayed in Supplementary Table 1. A total of 670 DEGs comprising 293 down-regulated and 377 up-regulated genes were screened after the integrated analysis of nine GEO datasets (Supplementary Table 2

PPI Network and Module Analysis
The PPI network of 512 overlapping DEGs consisted of 197 nodes and 1727 interactions ( Figure 3A and Supplementary Table 7). The candidate hub nodes were identified according to the degree  Table 8). Additionally, to detect significant clustering modules in present PPI network we conducted module analysis and obtained top three modules with high scores ( Figures   3B-D). The eight candidate hub nodes except TP53 and MYC were contained in the three modules, which implied that the three modules might remarkably represent the key biological characteristics of this PPI network, and thereby the eight nodes were defined as major hub nodes in the PPI network ( Figure 4). At the aspect of GO enrichment analysis, module 1 was closely correlated with nuclear division, organelle fission, chromosome segregation, mitotic, nuclear division, nuclear chromosome segregation, mitotic sister chromatid segregation, sister chromatid segregation, microtubule cytoskeleton organization and microtubule cytoskeleton organization involved in mitosis; module 2 was highly connected to extracellular matrix organization, extracellular structure organization, post-  Module 3 (MCODE score = 4). The gradual node color ranging from blue to red represents the changing process from low to high degrees of each gene. Expression values of these genes are log2-transformed.

Survival Analysis
A total of 2018 genes significantly correlated with survival time (P < 0.05) were preliminarily identified by the univariate Cox proportional hazards regression model (Supplementary Table 11). A total of 261 patients with the risk scores larger than the median risk score (1.060) were divided into the high-risk group, whereas the other 264 patients were divided into the low-risk group. The risk score result of the TCGA GBM dataset was presented in Figure 6A. Moreover, a prognostic gene signature composed of 526 genes was further developed after using the multivariate Cox proportional hazards regression model. In addition, we performed LASSO analysis to eliminate possible interactions between genes which might rise from the factor that small sample size with large number of genes. HR < 1 were identified as protective prognostic genes, whereas CXCR4, TMSB10, TEK, PDCD4, ZG16, PIR and SERPINF1 with HR > 1 were identified as risky prognostic genes. As shown in Figure 6B, a highly significant difference in OS was detected between the high-and low-risk groups (P < 0.0001).
The prognostic gene signature shown a good performance in survival prediction, as the AUC was 0.756, 0.722, and 0.797 for 1-, 3-, and 5-year OSs (Figure 6C), respectively. The expression level distribution of the 22 genes in low-and high-risk groups was shown in Figure 7. The ROC curves for predicting OS in GBM patients by the risk score. which is a free web-accessible resource that hosts a subset of the glioblastoma TCGA data and enables an intuitive query and interactive display of the resultant data. Moreover, this resource provides visualization tools for the exploration of gene, miRNA, and protein expression, differential 13 expression within the subtypes of GBM, and potential associations with clinical outcome, which are useful for virtual biological validation.
[39] Whereas, most of these studies mainly focused on a single database with small sample size or a single bioinformatics approach which results may not that convincing. To overcome these limitations, the current study not only integrated microarray data with relative large sample size from multiple GEO datasets and RNA sequencing data from TCGA, but also constructed PPI networks and a Cox proportional hazards model to identify potential diagnostic and prognostic biomarkers in GBM.
In the present study, nine microarray datasets were integrated with RNA sequencing data downloaded reported that FN1 is closely related to cell adhesion and invasion in glioblastoma. [47] EGFR is known to contribute to the malignant properties of glioblastoma multiforme (GBM), such as uncontrolled cell proliferation and evasion of apoptosis. [48] The EGFR signaling pathway is thought to play a crucial role in GBM pathogenesis as well, initiating the early stages of tumor development, sustaining tumor growth, promoting infiltration, and mediating resistance to therapy. [49] c-Myc, a crucial regulator of the Warburg effect [50], was regulated by an unexpected Akt-independent role for mTORC2 in GBM metabolic reprogramming. Importantly, overexpression of c-Myc is associated with significantly shorted survival in GBM patients. [51] It has been demonstrated that Myc-mediated regulation of glioblastoma multiforme cell differentiation [52], which might attribute to either promote or reinforce an undifferentiated phenotype required for glioma cells to respond to the oncogenic effects of elevated Ras and Akt activity.  [67], and promotes clonogenic growth and radiation resistance of GBM via FoxM1-Sox2 signaling axis. [68,69] CD44 is a major cell surface hyaluronan receptor and cancer stem cell marker that participates in the progression of GBM. CD44 depletion blocks GBM growth and sensitizes GBM cells to cytotoxic drugs in vivo. [70] Lower levels of CD44 expression particularly correlated with lower survival of GBM patients which is a promising candidate for further development as a prognostic and therapeutic tool. [71] Specifically, CD44s-targeted treatment with specific mAb may prevent the progression of highly invasive GBMs. [72] It has been demonstrated that the complex balance between TIMP-2 and MMP-2 is a critical determinant of glioblastoma invasion. [73] Additionally, MMP2 is essential for cancer neovascularization and cancer invasion in promoting extracellular matrix degradation. [74] The current study identified 22 pivotal genes associated with GBM prognosis and constructed a prognostic gene signature comprised of these genes. Among these 22 genes, RAB33B, KIAA1199, EVC, SOD2, hCG_40738, CHD9, GCSH, SUHW1, RPS6KA5, KCNG1, DECR1, PPCS, NAT10, OR2W1 and HIC2 were identified as protective prognostic genes. The prognostic value of RPS6KA5, DECR1, NAT10 and HIC2 in GBM has been evaluated before. RPS6KA5 has been identified as a protective prognostic gene in GBM based on bioinformatics approach and is significantly associated with overall survival (OS) of GBM patients. [75] DECR1, an enzyme that required for the mitochondrial β oxidation of lipids [76,77], is highly up-regulated in GBM. [78] Moreover, a latest study based on co-expression network analysis observed that overexpression of DECR1 is relevant to the overall survival time of patients with GBM. [79] NAT10 belongs to a family of N-terminal acetyltransferases (NATs) which are arranged in complexes and believed to target ~80-90 % of all soluble human proteins. Increased expression of NAT10 correlated negatively with patient survival in the group of "all gliomas" patients.
[80] HIC2 gene encodes a hypermethylated in cancer 2 protein in human which functions involve in DNA binding, protein C-terminus binding, metal ion binding and nucleic acid binding and so on.
Recent study reported that frequent deletion in 9p11.2 (FOXD4L2 and AQP7P3) and 22q11.21 (HIC2) are significantly associated with short-term survivor of GBM patients based on a deep genomic comparative analysis of a cohort of patients receiving standard therapy. [81] However, the prognostic value of RAB33B, KIAA1199, EVC, SOD2, CHD9, KCNG1 and PPCS in GBM has not been validated in previous studies. RAB33B is concordantly overexpressed in GBM patients [82], whereas, the prognostic value of RAB33B has not been validated in previous studies. KIAA1199 is a newly found gene with significance in senescence, progression, distant metastasis and poor prognosis of cancer patients. Importantly, the KIAA1199 gene is located on chromosome 15q25, where shows a strong linkage to familial gliomas. [83] Motaln et al showed that hMSCs were responsible for the impairment of GBM cell proliferation, invasion and growth by affecting KIAA1199 genes. [84] However, the prognostic value of KIAA1199 in GBM remain buried. EVC transcript were identified in GBM and LGG but not in normal brain tissue, and the expression data suggests that EVC is expressed more in GBM and LGG tumors when compared with normal samples, although these data did not pass their stringent 2-fold difference in expression. [85] SOD2, encoding an intramitochondrial free radical scavenging enzyme located in the mitochondria, was identified as a novel primary GBM-specific marker. [86] Although limited to GBMs, there was a significant relationship between iNOS and SOD1 expression. The induction of both iNOS and SOD1 would be functionally and precisely controlled within the iNOS-inducing tumors. [87] CHD9 genes is highly upregulated in glioblastoma in a silicon analysis using the TCGA [88] and correlated with GBM outcomes by contributing to stress response. [89] Besides, KCNG1 and PPCS remains largely uncharacterized genes. KCNG1 encodes a member of the potassium channel, voltage-gated and subfamily G, thus regulating the potassium channel activity, voltage-gated potassium channel activity, ion channel activity and voltage-gated ion channel activity. It is reported that KCNG1 encodes extracellular antigenic epitopes with the highest difference between ovarian tumors and normal tissues. [90] PPCS is an enzyme that catalyzes the chemical reaction which constitutes the second of five steps involved in the conversion of pantothenate to Coenzyme A. The molecular function of PPCS consists of regulating phosphopantothenate--cysteine ligase activity and ligase activity. Furthermore, the prognostic value of hCG_40738, GCSH, SUHW1 and OR2W1 has not been absolutely clarified and further studies are still demanded to validate our findings, the importance of these four genes should not be underestimated.
With regard to the seven risky prognostic genes (CXCR4, TMSB10, TEK, PDCD4, ZG16, PIR and SERPINF1), the correlation of TMSB10 with GBM has been investigated before. Existing studies presented that TMSB10 is identified as a negative prognostic gene of GBM via Robust Likelihood-Based Survival Modeling Analysis. [91] CXCR4 is a cell surface chemokine receptor that have been identified as a robust mediator of glioma cell proliferation and invasion. [92] CXCR4 is proven to play a key part in glioma cell migration through HIF-1α in the pseudopalisading tumor cells. [93] Additionally, CXCL12 has been known to modulate the CXCR4 and CXCR7 activity in human glioblastoma stem-like cells and regulate the tumor microenvironment. [94] It has been reported that inhibition of CXCL12/CXCR4 autocrine/paracrine loop reduces viability of human glioblastoma stem-like cells affecting self-renewal activity. Moreover, recent evidence suggested that miR-663 negatively regulated CXCR4 to inhibit its oncogenic effect. TEK, also known as EC-specific receptor tyrosine kinase Tie2 (EC, endothelial cell), whose activation is positively and negatively modulated by angiopoietin-1 (Ang1) and angiopoietin-2 (Ang2), respectively. The modulation of Tie2/Tek activation by Ang1 and Ang2 contributes to GBM vascularization and overall growth. Recent studies have found an up-regulation in expression of Ang1, Ang2 and Tie2/Tek with increasing malignancy grade of astrocytomas. [95] It has been reported that a robust increase in Tie2/Tek expression, restricted to the EC of GBMs compared to low-grade astrocytomas and normal brain. [95] Similarly, it is found that Tie2 signal transduction plays a significant regulatory role in the pathological vascular growth of GBMs. [96] Additionally, purified ExTek treatment in intracranial models of GBMs increases the overall survival by approximately 35%. PDCD4, a well-known tumor-suppressor gene, has been identified as a functional target of mir-21. [97] It is demonstrated that downregulation of PDCD4 by mir-21 facilitates glioblastoma proliferation. [98,99] And the loss of PDCD4 expression was closely associated with the progression and prognosis of various tumors, including glioblastomas. [100] Besides, the loss of PDCD4 contributes to enhanced chemoresistance in GBM via de-repression of Bcl-xL translation. [101] As for PIR, it has been identified as a protective prognostic gene and been reported that high The limitations of our study were as follows: (1) we performed our study based on integrated bioinformatics analysis, our results need further biological experiments validation. (2) we obtained the data used in present study from public databases and we cannot evaluate the quality of these data; (3) our study mainly focused on the genes commonly identified as significantly altered ones in multiple datasets, some biological information (characteristic details such as gender, age, race, tumor grade and stage, etc.) may be overlooked in our study.      Expression of the nine hub DEGs in GBM and normal brain tissues (TCGA dataset).
Expression values of these genes are log2-transformed.