Identication of novel genes associated with diagnosis and prognosis in glioblastoma

Background: Glioblastoma (GBM) is the most lethal primary brain cancer and its survival rate is very low. Comprehensive genomic characteristics and high degree of heterogeneity of GBM are main causes that contribute to the absence of effective therapeutic targets and prognostic markers and eventual treatment failures. Here, Gene set enrichment analysis (GSEA) was used to explore comprehensive genomic characteristics and high degree of heterogeneity of GBM. Our study will help explain potential tumorigenesis mechanism and contribute to development of targeted therapeutics and prediction of patient prognosis. Methods: Gene expression prole of GSE50161 was downloaded from Gene Expression Omnibus (GEO) database and GSEA was performed to evaluate the microarray data at level of gene sets of GO, KEGG and hallmark of Molecular Signatures Database (MSigDB). Core enrichments of selected hallmark gene sets were applied for clinical prognosis verication in Gene Expression Proling Interactive Analysis (GEPIA). Genes associated with signicant overall survival (OS) and unreported before were recognized as novel; the expression levels of novel genes in GBM samples and every novel gene between normal brain and GBM samples were compared. Receiver operating characteristic (ROC) and Pearson correlation analysis were also performed to evaluate the diagnostic biomarkers of GBM and correlations among novel genes respectively. Results: We obtained 511 and 494 GO gene sets, 12 and 10 KEGG gene sets, and 2 and 8 hallmark gene sets in normal and GBM samples respectively. Five novel genes SERPINA5, TENM2, ARAP3, THBD, TCF19 associated with GBM prognosis were selected and four novel genes SERPINA5, TENM2, ARAP3, TCF19 performed well for GBM diagnosis prediction were obtained. In addition, we also found that four novel genes SERPINA5, ARAP3, THBD, TCF19 that high expressed in GBM samples were all positively correlated with each other and correlation between ARAP3 and TCF19 was the strongest. Conclusions: Our study will help deeper understand the molecular heterogeneity and develop of targeted therapeutics of GBM; ve novel related to prognosis genes (of which four are also related to diagnosis) may be applied to more accurate clinical decision-making process.


Introduction
GBM is the most lethal primary brain cancer making up 57% of all gliomas and malignant central nervous system tumors [1]. Despite multidisciplinary treatments of resection, radiotherapy and adjuvant chemotherapy, the median survival for patients with GBM is only 15 months and 5-year survival rates remain < 10% [2,3]. Comprehensive genomic characteristics and high degree of heterogeneity of GBM are main causes that contribute to the absence of effective therapeutic targets and prognostic markers and eventual treatment failures. Therefore, understanding the molecular alterations of GBM will help explain potential tumorigenesis mechanism and contribute to development of targeted therapeutics and prediction of patient prognosis [4]. Now, high-throughput genomic technology (such as microarrays) combined with bioinformatics analysis make it possible to reveal the biologic heterogeneity and molecular alterations of GBM.
GSEA, a Java based software, is a robust computational method that assesses whether an a-priori de ned set of genes shows statistically signi cant, concordant differences between two groups. Initial application of GSEA is to microarray experiments, and now the applications distribute many elds, including RNA-seq gene expression experiments, genome-wide associations studies, proteomics and metabolomics studies [5][6][7][8]. Here, gene expression pro le data GSE50161 was downloaded from GEO database and GSEA was performed to evaluate the microarray data at level of gene sets of Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and hallmark of Molecular Signatures Database (MSigDB) [9]. Based on the set cut-off value, hub gene sets of GO, KEGG and hallmark were screened in GSEA software and genes of core enrichment of hallmark gene sets were selected for clinical prognosis veri cation in a web server of Gene Expression Pro ling Interactive Analysis (GEPIA) [10]. In short, our study will help deeper understand the molecular heterogeneity and develop of targeted therapeutics of GBM; ve novel related to prognosis genes (of which four are also related to diagnosis) may be applied to more accurate clinical decision-making process.

Microarray data preparation
Series Matrix le of GSE50161 including 13 normal brain samples, 34 GBM samples and 83 other brain tumors was downloaded from GEO database. Perl language was used to map the probe data to gene symbols based on GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array) annotation information [11].
Subsequently, K-nearest neighbor (KNN) method of impute R package was used to ll the probe missing value; background correction, quartile normalization and probe summarization were done using the limma R package [12,13]. Finally, normal brain and GBM samples were selected for study.

Identi cation of gene sets of GO, KEGG and hallmark of MSigDB
MSigDB is one of the most widely used and comprehensive databases of gene sets and can provide the bene ts of better representation and coverage of biological processes [14]. GO, KEGG and hallmark gene sets were downloaded from MSigDB database (version:7.1) and the obtained annotation les of c5.all.v7.1.symbols.gmt, c2.cp.kegg.v7.1.symbols.gmt, h.all.v7.1.symbols.gmt were input into GSEA software to detect the gene sets in normal brain and GBM samples. "1,000" times of permutations, "False" of collapse dataset to gene symbols and "phenotype" of permutation type were set. Besides, normalized enrichment scores (|NES|) > 1.0, false discovery rate (FDR) q < 0.25 and nominal p < 0.05 were set as cut-off criteria to select hub gene sets. novel genes selection, expression level and survival analysis Based on the Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) data, GEPIA is a web-based tool to deliver fast and customizable functionalities [10]. Therefore, in GEPIA server, GBM patients were split into two groups by the median expression of genes (high vs. low expression) and the overall survival (OS) was assessed by Kaplan-Meier survival analysis(log-rank p value < 0.05); subsequently, all the core enrichments of hub hallmark gene sets were input into GEPIA to determine whether a gene have different prognosis. Genes associated with signi cant OS and unreported before were recognized as novel and the expression levels of novel genes were compared in GBM samples, in addition, expression level comparations of every novel gene were also performed between normal brain and GBM samples.

Evaluation of diagnostic biomarkers and correlations
Novel genes were applied for further analysis. Based on the microarray expression values, receiver operating characteristic (ROC) analyses were performed in GraphPad Prism 8 software to evaluate the speci city and sensitivity of novel genes for GBM prediction, a gene associated with an area under the curve (AUC) value greater than 0.8 was considered good. Besides, in GEPIA server, four novel genes that were high expressed in GBM samples were also correlated by Pearson correlation analysis.

Result
Differential expression levels of gene symbols A total of 19171 gene symbols that had complete expression values in normal brain and GBM samples were obtained. Some genes showed signi cantly differential expression in normal brain and GBM samples ( Fig. 1).

Hub gene sets of GO, KEGG and hallmark
There were 511 and 494 selected GO gene sets, 12 and 10 selected KEGG gene sets, and 2 and 8 selected hallmark gene sets in normal and GBM samples, respectively. According to the |NES| of each gene sets, GO_SODIUM_ION_TRANSMEMBRANE_TRANSPORT, KEGG_CARDIAC_MUSCLE_CONTRACTION, and HALLMARK_PANCREAS_BETA_CELLS were the highest of GO, KEGG, and hallmark gene sets in normal brain samples; GO_ANAPHASE_PROMOTING_COMPLEX_DEPENDENT_CATABOLIC_PROCESS, KEGG_CYTOSOLIC_DNA_SENSING_PATHWAY, and HALLMARK_MITOTIC_SPINDLE were the highest of GO, KEGG, and hallmark gene sets in GBM samples, respectively, here, we showed top ve gene sets of GO and KEGG and all the gene sets of hallmark (Table 1,  survival analyses and expression levels of novel genes A total of 347 core enrichments of hub hallmark gene sets were obtained [see Additional le 3] and 5 unreported genes SERPINA5, TENM2, ARAP3, THBD, TCF19 that had survival differences were selected (Fig. 4a-e). Comparation of expression levels of novel genes in GBM samples showed that ARAP3 had the highest expression value, TENM2 had the lowest expression value (Fig. 5a). In addition, we also showed every expression level of novel genes in normal and GBM samples and found that all genes except TENM2 were highly expressed in tumor samples, TENM2 was highly expressed in normal samples (Fig. 5b-f).

diagnostic biomarkers and correlations analyses
Four novel genes SERPINA5, TENM2, ARAP3, TCF19 performed well for diagnosis prediction (p < 0.001) and ROC analyses displayed an AUC of 0.8303, 0.8937, 0.9638, and 0.9819 respectively; however, THBD was not suitable for diagnosis prediction for an AUC of 0.6719 (p = 0.0707) (Fig. 4f-g). Since TENM2 was low expressed in GBM samples, we correlated the other four novel genes that were high expressed in GBM samples by Pearson correlation analysis. We found that the four novel genes were positively correlated with each other and correlation between ARAP3 and TCF19 was the strongest (Fig. 6).
Discussion when no signi cant expression level differences, simply looking for differentially expressed genes that exist between normal and tumor tissues to associate biological phenotypes with their underlying molecular mechanisms may lose some vital information. Unlike this, GESA, based on a given list of genes, focuses on evaluating whether an a-priori de ned set of genes shows statistically signi cant, concordant differences between two groups [15]. In our study, we performed GSEA to evaluate the microarray data at level of gene sets of GO, KEGG and hallmark of MSigDB, the aims were to select some important and meaningful gene sets distributed in normal brain and GBM tissues and hub genes that would help deeper understand the molecular heterogeneity of GBM and were used to GBM diagnosis and treatment. In total, 511 GO gene sets, 12 KEGG gene sets and 2 hallmark gene sets were selected in normal brain tissues. 494 GO gene sets, 10 KEGG gene sets and 8 hallmark gene sets were selected in GBM tissues. In normal brain samples, GO_SODIUM_ION_TRANSMEMBRANE_TRANSPORT, KEGG_CARDIAC_MUSCLE_CONTRACTION, and HALLMARK_PANCREAS_BETA_CELLS had the highest |NES| in GO, KEGG, and hallmark gene sets, respectively; In GBM samples, GO_ANAPHASE_PROMOTING_COMPLEX_DEPENDENT_CATABOLIC_PROCESS, KEGG_CYTOSOLIC_DNA_SENSING_PATHWAY, and HALLMARK_MITOTIC_SPINDLE had the highest |NES| in GO, KEGG, and hallmark gene sets, respectively.
In addition, ve novel and related to GBM patients' prognosis genes (SERPINA5, TENM2, ARAP3, THBD, TCF19) were also selected from the 347 core enrichments in hallmark gene sets. SERPINA5 (serpin peptidase inhibitor clade A member 5), a member of the alpha-1-antitrypsin clade (clade A) of serpins, was originally identi ed as an inhibitor of the anticoagulant protease activated protein C (aPC) and was therefore named protein C inhibitor (PCI). However, SERPINA5 was not only an inhibitor of activated protein C in plasma, but a serpin with broad protease reactivity and wide tissue distribution, as a result, the biological role of SERPINA5 had not yet been de ne [16]. As for the role in tumors, although Ying Jing and Bijsmans respectively found that SERPINA5 inhibited tumor cell migration in hepatocellular carcinoma and loss of SERPINA5 protein expression was associated with advanced-stage serous ovarian tumors [17,18], Kunihiro Asanuma also found that Protein C inhibitor could inhibit breast cancer cell growth, metastasis and angiogenesis [19], to the best of our knowledge, studies on SERPINA5 in intracranial tumors had not been reported. Our analyses showed that SERPINA5 was high expressed in GBM tissues and may be a risk factor for poor prognosis. Teneurins with four paralogs (teneurin-1, teneurin-2, teneurin-3,teneurin-4) were type II transmembrane glycoproteins and mainly involved in morphogenesis and development of the central nervous system [17,18]. As one of the four paralogs, Annemarie Ziegler found that TENM2(teneurin-2)was primarily expressed in breast and cervix cancer and in neuroblastoma cells, but almost no expression in gastric cancer cells. Besides, the authors also found that the expression of TENM2 in ovarian cancer had a negative relationship with tumor differentiation and survival time [19]. In our analyses, we found that the expression level of TENM2 in GBM tissues was lower than that in normal brain tissues, which was also a special case we encountered in ve novel genes. However, it was worth noting that different expression level of TENM2 still had prognostic signi cance. ARAP3 (ArfGAP with RhoGAP domain, ankyrin repeat and PH domain 3) encoded a phosphoinositide binding protein containing ARF-GAP, RHO-GAP, RAS-associating, and pleckstrin homology domains, in which ARF-GAP and RHO-GAP domains cooperated in mediating rearrangements in the cell cytoskeleton and cell shape. De ciency of ARAP3 in cells may cause defects of cellautonomous and lamellipodia producing [23][24][25][26]. Yagi reported that ARAP3 inhibited peritoneal dissemination of scirrhous gastric carcinoma cells by regulating cell adhesion and invasion and Qing-Xuan Wang found that downregulation of ARAP3 led to ability inhibition of cell proliferation, migration and invasion of papillary thyroid carcinoma [27,28]. Our study found that the expression level of ARAP3 in GBM was the highest, and its expression level was clearly related to the prognosis of patients. THBD (thrombomodulin) was a membrane receptor that bound thrombin resulting in activation of protein C, degradation clotting factors Va and VIIIa and reduction of thrombin [29]. Studies had found that THBD was expressed in bladder cancer, lung cancer and other tumors, and its expression level was related to tumorigenesis, invasiveness and metastasis [30][31][32], speci cally, the expression of THBD was inversely correlated with tumor progression. There were no reports of the expression level of THBD in brain tumors, our study showed that THBD was highly expressed in GBM patients, and patients with different expression levels had inequable prognosis. Finally, several studies had found that in non-small cell lung cancer, hepatocellular carcinoma, and colorectal cancer TCF19 (transcription factor 19) could promote cell proliferation and increase tumor invasiveness through some pathways, but there were no reports about its role in GBM [33][34][35][36]. Our study found that TCF19 was highly expressed in GBM patients, and patients with different expression levels also had inequable prognosis.
Besides, novel genes were performed ROC and Pearson correlation analyses in GraphPad Prism 8 software and GEPIA server respectively. We found that four novel genes SERPINA5, TENM2, ARAP3, TCF19 performed well for diagnosis prediction for an AUC of 0.8303, 0.8937, 0.9638, and 0.9819 respectively (p < 0.001); however, THBD was not suitable for diagnosis prediction for an AUC of 0.6719 (p = 0.0707). Four novel genes SERPINA5, ARAP3, THBD, TCF19 that high expressed in GBM samples were correlated by Pearson correlation analysis and the results showed that they were positively correlated with each other.
In conclusion, GSEA analyses were conducted to evaluate gene sets between normal brain tissues and GBM tissues. Hub gene sets of GO, KEGG, and hallmark of normal brain and GBM tissues as well as ve novel genes (SERPINA5, TENM2, ARAP3, THBD, TCF19) related to GBM patients' prognosis and diagnosis were selected. Our study would help deeper understand the molecular heterogeneity and develop of targeted therapeutics of GBM; ve novel related to prognosis genes (of which four are also related to diagnosis) may be applied to more accurate clinical decision-making process.

Conclusions
Our study will help deeper understand the molecular heterogeneity and develop of targeted therapeutics of GBM; ve novel related to prognosis genes (of which four are also related to diagnosis) may be applied to more accurate clinical decision-making process.    Figure 1 Some genes had a clear role in the difference of normal brain and GBM samples Top ve gene sets of GO and KEGG were showed. According to the |NES|, all ve GO gene sets were enriched in GBM samples (a-e); two and three KEGG gene sets were enriched in GBM samples (f,g) and normal brain samples (h-j) respectively.

List Of Abbreviations
Page 15/18

Figure 3
A total of ten hallmark gene sets were selected, eight gene sets were enriched in GBM samples(a-h), two gene sets were enriched in normal brain samples (i-j).

Figure 5
Comparation of expression levels of novel genes in GBM samples showed that ARAP3 had the highest expression value, TENM2 had the lowest expression value (a). Comparations of expression level of every novel gene in normal and GBM samples showed that all genes except TENM2 were highly expressed in GBM samples, TENM2 was highly expressed in normal samples (b-f).