Integrative Analysis of Dna Methylation and Gene Expression Proles to Explore Potential Biomarkers of Glioblastoma

Glioblastoma multiform (GBM) is the most common, most invasive, and malignant type of primary brain tumor with poorly prognosis and poorly survival rate. Using GSE22891 the expression and methylation status of same GBM patients was evaluated to explore key epigenetic genes in GBM. Using |log2FC| > 1 and FDR < 0.05 as the threshold, DEGs including 4910 downregulated and 2478 upregulated were screened and by |log2FC| > 0.2 and p.value < 0.05, 3223 DMCs were detected. By merging the results of DEGs and DMCs, 643 genes were selected for network analysis by WGCNA, and based on expression values three modules and by methylation values, one module was selected. Using STRING and Cytoscape, PPI network of genes of all modules were constructed separately. According to the PPI network, core genes were picked out. The expression status of core genes was evaluated using GSE77043, GSE42656, GSE30563, GSE22891, GSE15824, and GSE122498, and 50 genes were validated. The methylation status of 50 genes was explored using GSE50923, GSE22891, and GSE36245, and nally, 12 hub genes including ARHGEF7, RAB11FIP4, PPP1R16B, OLFM1, CLDN10, BCAT1, C1QB, C1QC, IFI16, NUP37, PARP9, and PCALF were selected. Using GEPIA database, the expression and survival plot, and using cBioportal the scatterplot of methylation versus expression of 12 hub genes were extracted based on TCGA. To determine the diagnostic values of the hub genes, the ROC curve and the area under the curve (AUC) were extracted based on GSE22891. GSE15824, and GSE122498. Methylation patterns were checked by application of microarray data from GEO series GSE50923, GSE22891, and GSE36245 arrays and also two datasets of TCGA database (13) and Glioblastoma Multiforme (TCGA, Firehose Legacy). The survival data and box plot expression of the detected genes were evaluated by cBioPrtal and GEPIA, respectively (14) (15). We also used receiver operating characteristic (ROC) curve analysis to evaluate the prognostic power of the genes based on the data from GEO seriesGSE22891.


Introduction
According to the world health organization (WHO) classi cation, gliomas are staged into four grades based on histopathologic features. Glioblastoma multiformes (GBM) is in grade IV of this classi cation and accounts for about seventy percent of glioma malignancies. Glioblastoma is the most common, most invasive and malignant type of primary brain tumors which poor prognosis is predicted for the patients despite performing therapeutical strategies (surgery, radiotherapy, and chemotherapy) for whom a 10% veyear survival and median survival of 12 to 15 months is reported (1). The glioblastoma is divided into two primary and secondary subtypes according to the origin of the malignancy. Over ninety percent of glioblastomas are of the primary type, which rapidly develops and affects older people. A secondary form of low-grade astrocytoma progresses gradually to glioblastoma and affects younger people (2)(3)(4)(5).
The main causative factors resulting in pathogenesis of human various cancers including GBM are epigenetic phenomena. Since development of gene microarray and RNA-seq technologies, the aberrant expression and methylation pro les of GBM both in transcriptome and genome levels have been reported.
By employment of the registered expression results in the Gene Expression Omnibus (GEO) web database, Bo et al. (1) identi ed 431 differentially expressed genes (DEGs) in GBM tissues compared to healthy samples. According to in situ studies among these DEGs, 69 genes were known to be associated with considerable prognosis in the patients with GBM. In another experiment using the gene expression pro le of GEO series (GSE) GSE50161, of a total 486 DEGs, upregulated CDK1, CCNB1 and CDC20 were showed to be associated with poorly survival in GBM patients (6) Despite many studies and identi cation of several sets of DEGs, our understanding of the biological mechanisms in pathogenesis of the disease requires further studies.
Epigenetic mechanisms, and in particular DNA hypermethylation, have been shown to play a crucial role in the cancer development and invasion studied in many types of cancers. Hypermethylation of promoter in the CpG island region of DNA repair and tumor suppressor genes result in the silencing of these genes which eventually affects various biological pathways. This is important in many types of cancers including gliomas. The term CpG island methylator phenotype (CIMP) refers to extensive methylation of the CpG islands that were primarily identi ed in colorectal cancer (CRC) and then in other types of cancer. Many studies have reported hypermethylation of promoter-associated CpG islands in human glioblastoma (2,3,(7)(8)(9)(10). However, the valuable prognostic rate of the mentioned aberrant methylated markers in various subtypes of GBM and the complexity of DNA methylation roles of different gene regions are still controversial issue and it requires more veri cation in larger independent cohorts of patients with GBM.
Molecular mechanism studies using in situ computations has been recognized as a brilliant method in cancer research. Bioinformatics experiments not only can help the molecular pathogenesis of tumors to be explored, but also provides identi cation of biomarkers for early diagnosis, treatment, and prediction of prognosis in cancers. In the present study weighted gene co-expression network analysis (WGCNA) was used to analyze the relationship between gene sets and phenotypic traits (11,12). WGCNA is a systems biology procedure that can describe the correlation between genes in high throughput data mining approaches like microarray and next generation sequencing (NGS). WGCNA can constitute clusters from highly correlated genes (modules), nd the correlation between modules and also between modules and external clinical data.
In this study, we evaluated the effect of methylation on gene expression by employment of an expression array, and then both upregulated-hypomethylated and downregulated-hypermethylated genes were selected for network analysis using WGCNA. For further validation of the detected genes, their expressions were evaluated using six expression arrays including GEO series GSE77043, GSE42656, GSE30563, GSE22891, GSE15824, and GSE122498. Methylation patterns were checked by application of microarray data from GEO series GSE50923, GSE22891, and GSE36245 arrays and also two datasets of TCGA database (13) and Glioblastoma Multiforme (TCGA, Firehose Legacy). The survival data and box plot expression of the detected genes were evaluated by cBioPrtal and GEPIA, respectively (14) (15). We also used receiver operating characteristic (ROC) curve analysis to evaluate the prognostic power of the genes based on the data from GEO seriesGSE22891.

Methods
Selecting dataset and explore of differentially expressed genes (DEGs) GEO series GSE22891 expression array data was recaptured from the GEO database (https://www.ncbi.nlm.nih.gov/geo/) that has evaluated gene expression in 40 GBM samples and 6 healthy controls. This dataset also has assessed the DNA methylation patterns in 56 GBM and 4 healthy samples. Expression and methylation data of GSE22891 array were normalized via the quantile normalization function in the limma package (16). Using the aggregate function in the S4 vectors package, which gives an average measure for the probes of each gene, only a single measure for each gene in the expression array was considered. Limma package was used for detecting the DEGs by the cutoff value > 1 and P-value < 0.05.

Preprocessing and detecting the DMCs
To explore the CpG islands with the most difference between GBM and control groups, the min (17) and limma packages were utilized. First, the methylation β-value of the CpG site was downloaded. Then, it was more converted to the methylation M-value (log ratio of beta values) of the CpG. The differentially methylated CpGs (DMCs) between healthy and GBM samples were screened by utilizing the moderated ttests provided in the limma R -package, under settings of cutoff > 0.2 and P-value < 0.05 to detect the CpG sites.
WGCNA and key modules identi cation By merging the DEGs and DMCs of the GSE22891, a list of upregulated-hypomethylated and downregulated-hypermethylated genes with their expression and methylation values was considered for construction of WGCNA. Two WGCNAs were created. Constructing a WGCNA is necessarily accompanied by selecting the soft thresholding power β, which through creating co-expression similarity leads to adjacency calculation. At this point, we intended to demonstrate that utilizing the pickSoftThreshold function has the capacity to both analyze the topology network and assist in selecting an appropriate soft-thresholding power for providing a scale-free topology t index that reaches values above 0.9. After the calculation of adjacencies, to minimize the effect of noise and spurious associations, adjacency results transformed into Topological Overlap Matrix (TOM). The results of TOM considered as input to produce dendrogram of genes, cutreeDynamic function used for branch cutting. Minimum module size of 30 and the module detection sensitivity deep Split 2 in block wise Consensus Modules function were used for network construction.

Identi cation of clinically signi cant modules and functional annotation
The correlation between, and association of individual genes and the clinical data were measured through de ning Gene Signi cance (GS) as the absolute value. For each module, a quantitative criterion of module membership (MM) was considered as the correlation between the eigengene module and gene expression pro le through which similarity of all genes on the array was measured and determined to every module. Utilizing the GS and MM, it is possible to explore the interesting module(s) encompassing genes with the great signi cance of both clinical data and module membership.

Protein-Protein interaction (PPI) Network Construction and enrichment
We constructed the protein-protein interaction (PPI) network of the selected module using STRING (stringdb.org) and cytoscape (18). ClueGO v2.5.3 (19) was utilized to perform pathway enrichment analysis of the genes, and the most important signaling pathways based upon KEGG database were detected (20).

Validation of detected genes by other datasets
After construction of PPI network, hub genes were detected and their expression and methylation values were evaluated in other independent datasets. GEO series GSE77043, GSE42656, GSE30563, GSE22891, GSE15824, and GSE122498 were utilized to explore aberrant expression between GBM and healthy controls and also GEO series GSE50923, GSE22891, and GSE36245 were used to consider the methylation status of CpGs related to the hub genes.

Hub gene validation by GEPIA and cBioportal
Overall survival (OS) analysis of the hub genes was performed using the Kaplan-Meier plotter in cBioportal (14). Hazard ratio (HR), 95% con dence intervals (CI) and log-rank P-values were calculated. DNA Methylation versus gene expression of the hub genes were explored using cBioportal. The online tool GEPIA (15) accompanied by data sourced from the Cancer Genome Atlas (TCGA) was used to validate the expression of these hub genes in the GBM.

Comparison of diagnostic values of the detected hub genes using ROC curve
To determine the diagnostic values of the hub genes, we plotted ROC curve and the area under the curve (AUC). We considered P-values of < 0.05 for statistical difference based on GEO seies GSE22891.

Results
Differentially expressed genes (DEGs) screening Normalization boxplot can be useful for determining if the selected samples are suitable for differential expression analysis. We normalized expression dataset by quantile normalization function in the limma package (Fig. 1A). Uniform manifold approximation and projection (UMAP) was used for visualizing how samples are related to each other (Fig. 1B). Vooma plot was utilized to evaluate if there is a lot of variation in the data (Fig. 1C). Using |log2FC| > 1 and FDR < 0.05 as the threshold, DEGs including 4910 downregulated and 2478 upregulated were screened (Fig. 1D).
Weighted co-expression network construction and key modules identi cation Expression and methylation values of nal genes considered as input for the WGCNA package. According to pick Soft Threshold function power of β = 9 and 8 were selected for expression and methylation respectively, for providing scale-free topology t index that reaches values above 0.9, for both datasets. By hierarchical clustering for two separate network analysis, 15 modules were identi ed for expression values (Fig. 4A) and 18 modules were identi ed for methylation values (Fig. 5A). These modules exist different sizes in term of the number of genes and labeled by the different colors which show in (Fig. 4B and 5B). Green (correlation= 0.6, P.Value= 3.2e-22) (Fig. 4C), blue (correlation= 0.51, P.Value= 2e-46) (Fig. 4D), and turquoise (correlation= 0.96, P.Value= 1e-200) (Fig. 4E) modules were selected for expression network and yellow module (correlation= 0.41, P.Value= 6.1e-11) (Fig. 5C) for methylation network.

Protein-Protein interaction (PPI) Network Construction
Genes of detected modules were used for PPI network construction in STRING database and were visualized in Cytoscape. PPI network of green (Fig. 6A), blue (Fig. 7A), and turquoise (Fig. 8A) modules identi ed some genes as core genes related to expression data. This gure was used for genes of yellow module related to methylation data (Fig. 9A). We enriched genes of each module separately, and analyzed them through the Cytoscape plug-in ClueGo based upon KEGG database, genes of green module have role in oocyte meiosis, cell cycle, cellular senescence, p53 signaling pathway, human t-cell leukemia virus 1 infection, mismatch repair, DNA replication, and complement and coagulation cascades (Fig. 6B). For blue module human t-cell leukemia virus 1 infection, central carbon metabolism of cancer, cellular senescence, leukocyte transendothelial migration, age-rage signaling pathway of diabetic complications, and ribosome biogenesis in eukaryotes were the most important signaling pathways (Fig. 7B). Synaptic vesicle cycle, insulin secretion, endocrine and other factor-regulated calcium reabsorption, cAMP signaling pathway, and GnRH secretion were the most dysregulated signaling pathways for genes of turquoise module (Fig. 8B). We also performed gene enrichment for yellow module and NF-κB signaling, chemokine signaling pathway, pancreatic secretion, basal cell carcinoma, and B cell receptor signaling pathway were detected as most important signaling pathways (Fig. 9B).

Validation of detected genes by other datasets
The expression and methylation status of all genes of PPI networks were evaluated by other separate datasets. The expression status of these genes were evaluated using GSE77043, GSE42656, GSE30563, GSE22891, GSE15824, and GSE122498 and was shown as heatmap (Fig. 10), and the methylation values of all genes were evaluated by GSE50923, GSE22891, and GSE36245 and were visible as heatmap (Fig. 11). The expression evaluation validated 50 gens including 32 upregulated and 18 downregulated genes that seems play a signi cant role in GBM. The methylation status of mentioned 50 genes approved 12 of them including ARHGEF7, RAB11FIP4, PPP1R16B, OLFM1, CLDN10, BCAT1, C1QB, C1QC, IFI16, NUP37, PARP9, and PCALF as nal gene list.
Comparison of diagnostic values of detected hub genes using ROC curve A ROC curve was obtained to verify the potent diagnostic performance of 12 hub genes based on the GSE22891 dataset. The AUC showed that ARHGEF7, RAB11FIP4, PPP1R16B, OLFM1, CLDN10, BCAT1, C1QB, C1QC, IFI16, NUP37, PARP9, and PCALF indicated excellent diagnostic e ciency for tumor and normal tissues (Fig. 15).

Discussion
GBM is the most common central nervous system tumor that begins in brain tissue and originates from astrocyte cells (21). The subtypes have been identi ed for this disease based on gene expression changes including the TP53 (p53), the epidermal growth factor receptor (EGFR), and IDH1 (Isocitrate dehydrogenase 1). Also other genetic alterations such as RB and the PI3K/AKT pathways (22)(23)(24) play an important role in GBM. In addition to genetic abnormalities, gliomas also exhibit various types of epigenetic alterations, which are expressed as inherited changes in gene expression, which can be further investigated, such as post-translational changes in histones, and methylation in the CpG islands of DNA sequences. Methylation changes have been observed in the DNA sequences of this disease and many other tumors, including hypermethylation in the CpG islands of DNA, or hypomethylation in speci c genes (25,26). DNMTs are a family of enzymes including DNMT1, DNMT3A and -3B in mammals that play a signi cant role in the maintenance of DNA methylation patterns. Somatic mutations and altered expression of DNMT genes have been identi ed in various types of cancers (25)(26)(27)(28). However, their signi cant effect on gene methylation and gene expression pattern remains unclear (29).
In our study, an integrative analysis of methylation and expression data on the same patients using GSE22891 was performed, and genes that were hypomethylated-upregulate and hypermethylateddownregulate were selected for network analysis. By WGCNA, signi cant modules including green, blue, and turquoise using expression values and yellow module using methylation value were detected. Using STRING and Cytoscape, PPI networks of mentioned modules were constructed and core genes were selected. For validation, expression values of core genes were evaluated using separate datasets including GSE77043, GSE42656, GSE30563, GSE22891, GSE15824, and GSE122498. Due to explore the effect of methylation on the expression of selected genes, their methylation was evaluated by GSE50923, GSE22891, and GSE36245, and 12 hub genes were selected for further consideration. Using GEPIA database, box plot expression and survival data of 12 hub genes based on TCGA database were obtained. By cBioportal, Scatterplot of methylation versus expression status of all 12 hub genes was obtained. ROC curve was used to explore the diagnostic performance of all mentioned genes.
The main genes with expression changes in our signature containing a set of 12 genes are summarized according to the literature in Table 1. We found discordant changes in the expression of a few genes including ARHGEF7, OLFM1, PPP1R16B, and IFI16, while other were shown to be altered in GBM samples compared to normal controls in concordant to the past ndings in other studies ( Table 1).All of the genes in the signature except for ARHGEF7 with less changes but also signi cant showed highly signi cant alterations in their expression levels compared to control samples. Among them, ARHGEF7 (also called beta Pix or Cool1) encodes a guanine-nucleotide exchange factor (GEF) has been found to be upregulated in glioblastoma cell lines playing an essential role in the tumorigenesis of invasive cells and its knockdown was revealed to reverse the promoting impacts on tumor cell proliferation and invasion (30). Methylation status of ARHGEF7 has been studied in several cell types of GBM and in combination with others is suggested as a signature in classi cation, prognosis and treatment of GBM disease (31). BCAT1 encoded protein plays a role in the metabolism of branched-chain amino acids (BCAAs), known to be up-regulated in GBM samples in the past studies (32) and also as a result enhances tumor progression in GBM cells (33).
Other genes like C1QB and C1QC which are involved in complement cascade, KIAA0101 (or PCALF) in cell proliferation, and PARP9 performing in DNA damage repair have been known in literature to be up-regulated in GBM and promote tumorigenesis in consistent with our results (34)(35)(36)(37) while no witness was found for NUP37 as an encoding gene for Nucleoporin 37; a member of nuclear pore complex (NPC) which also was shown to be upregulated in the signature. On the other side, CLDN10 encoding claudin 10 in Tight junction interactions and RAB11FIP4 located on chromosome 17q11.2 and encodes for Rab11 family-interacting protein 4 which plays role in the regulation of endocytic tra c, both were shown to be down-regulated in the signature in consistent with the past ndings (38)(39)(40), although a controversial result also has been reported for the latter (41).
In conclusion, by integrated analysis of methylation and expression pro les, we identi ed 12 genes that can be regarded as prognostic biomarkers of GBM. Further experimental analysis is needed to prove the biological importance and functions of the reported genes.