Recurrent Glioblastoma is Associated with a Closed DNA Conformation

Introduction: The pathological changes in epigenetics and gene regulation that accompany the progression of low-grade to high-grade gliomas are under-studied. The authors use a large set of paired atac-seq and RNA-seq data from surgically resected glioma specimens to infer gene regulatory relationships in glioma. Methods: Thirty-eight glioma patient samples underwent atac-seq sequencing and sixteen samples underwent additional RNA-seq analysis. Using an atac-seq/RNA-seq correlation matrix, atac-seq peaks were paired with genes based on high correlation values (|r 2 |>0.6). Results: Samples clustered by IDH1 status but not by grade. Surprisingly there was a trend for IDH1 mutant samples to have more peaks. The majority of peaks are positively correlated with survival and positively correlated with gene expression. Constructing a model of the top six atac-seq peaks created a highly accurate survival prediction model (r 2 =0.68). Four of these peaks were still signicant after controlling for age, grade, pathology, IDH1 status and gender. Grade II, III, and IV (primary) samples have similar transcription factors and gene modules. However, grade IV (recurrent) samples have strikingly few peaks. Patient-derived glioma cultures showed decreased peak counts following radiation indicating that this may be radiation-induced. Conclusions: This study supports the notion that IDH1 mutant and IDH1 wildtype gliomas have different epigenetic landscapes and that accessible chromatin sites mapped by atac-seq peaks tend to be positively correlated with expression. The data in this study leads to a new model of treatment response wherein glioma cells respond to radiation therapy by closing open regions of DNA.


Introduction
Gliomas are the most common primary brain tumor and result in over 15,000 deaths a year in the United States alone 1 . These tumors are often initially discovered as slow-growing low-grade gliomas (Grade II gliomas) 2,3 . However, inevitably these tumors undergo a transformation into faster-growing more aggressive higher-grade tumors (Grade III and Grade IV gliomas). Following diagnosis, standard clinical management includes maximal safe resection and adjuvant treatment (chemotherapy and radiation) 1,[3][4][5] . This often leads to a period of clinical stability where there is no visible tumor on serial MRI scans. However, the tumors inevitably recur. While primary grade IV gliomas (i.e. glioblastomas) can have periods of stability following resection and adjuvant therapy, once they recur, recurrent glioblastomas are typically treatment resistant and clinical stability is rare 6,7 . Whether this is due to a natural progression of the tumor or as a response to typically used DNA-damaging therapies (radiation and chemotherapy) is being investigated 8 .
The transformation from a normal cell to a low-grade glioma and eventually a high-grade glioma involves the dysregulation of many gene pathways and cellular processes [9][10][11] . Much attention has been paid to deletions (e.g. PTEN 12,13 , p53 14 , 1p/19q 15,16 ) and ampli cations (EGFR 17 , MYC 18 ) of speci c gene regions that are frequently encountered in these tumors. However, there is a similar and equally important pathology in the epigenetic dysregulation of these gene pathways that has received less attention. A thorough exploration of the epigenetic landscape of these tumors may lead to the discovery of additional important biomarkers.
The discovery of the IDH1 mutation (frequently found in low-grade gliomas and secondary high-grade gliomas) has brought additional focus and attention to the problem of epigenetic dysregulation in gliomas. The presumed mechanism of the IDH1 mutation is to methylate cytosine nucleotides in DNA and induce a more closed inaccessible DNA chromatin with subsequent gene down-regulation however this has not been demonstrated.
Atac-seq is a relatively new sequencing technology that identi es open accessible DNA regions (e.g. "peaks") with very little cellular input enabling the analysis of surgically resected samples. Its low-cost and relative ease has led to wide-spread use across multiple cell types. However, when atac-seq data is used in isolation without additional lines of supporting evidence, it has been di cult to determine how best to interpret the biological signi cance of these open regions (measured as "peaks" from aligned sequencing reads). Data sets involving multiple modalities (atacseq and RNA-seq) on the same samples would be useful in supporting or refuting these assumptions. To address these short-comings this study combines a large set of atac-seq and RNA-seq data from surgically removed glioma specimens to create an atac-seq/RNA-seq matrix to identify highly correlated peaks and genes. By correlating these data sets with demographic and survival data we have identi ed open DNA regions that have prognostic and thereby likely biological signi cance.

Generation of clinical database
The University of Cincinnati maintains an IRB-approved bio-repository of surgically resected specimens with associated clinical and demographic information including medical record number, diagnosis, name, gender, age, date of collection and survival. A collection of 38 specimens were chosen from the bio-repository with a goal to obtain a heterogenous mixture of low-grade and high-grade specimens as well as those with long and short survival.
Specimens were between two and nine years old prior to selection to allow for adequate follow-up for survival analysis. All specimens denoted as "Grade II," "Grade III," or "Grade IV primary" are taken from the patient's initial surgery and have not been exposed to any chemotherapy or radiation treatment. The presence of the IDH1 mutation was determined via either immunostaining or by direct sequencing.
Protocol for generating Atac-seq libraries Nuclei were isolated from patient tumor tissue as described by Habib et al. 19 , with slight modi cations. Brie y, frozen tissue samples (<0.5 cm) were incubated on ice in 1mL of Nuclei EZ Lysis Buffer (Sigma, #Nuc-101). The tissue was then homogenized using a glass tissue grinder. 1mL of additional EZ Lysis Buffer was added, the tissue was incubated on ice for 5 minutes, and then ltered through a 70mm cell strainer. Nuclei were collected by centrifugation and resuspended in 1mL ATAC-resuspension buffer (10mM Tris-HCl (pH 7.4), 10mM NaCl, 3mM MgCl 2 , and 0.1% Tween-20) and ltered through a 40mm cell strainer. 1mL of ATAC-resuspension buffer was then added and the nuclei were ltered through a 5mm cell strainer. ATAC-seq library preparation was performed as previously described 20 .
Processing of ATAC-seq data ATAC-seq reads in FASTQ format were rst subjected to quality control to assess the need for trimming of adapter sequences or bad quality segments. The programs used in these steps were FastQC v0.11.7, Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program HISAT2 v2.0.5. Aligned reads were stripped of duplicate reads with the program sambamba v0.6.8. Peaks were called using the program MACS v2.1.2 using the broad peaks mode.
To obtain the consensus set of unique peaks, called peaks from all samples are merged at 50% overlap using BEDtools v2.27.0. The consensus peaks, originally in BED format were converted to a Gene Transfer Format (GTF) to enable fast counting of reads under the peaks with the program featureCounts v1.6.2. Each feature in the GTF le has the value "peak" on the second column. Peaks located on chromosomes X, Y and mitochondrial DNA are excluded from further analysis. Raw read counts are normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0.

Random forest regression model
Log-transformed data were ltered to remove low variance peaks using the VarianceThreshold function from scikitlearn v0.24.1 using a threshold of 0.45. The resulting 8439 features were used to build a random forest regression model using RandomForestRegressor from the ensemble module of scikit-learn. The options used was n_estimators: 1000, max_samples=0.9, oob_score=True. Permutation importance was used to rank the peaks. The out-of-bag R 2 value for the regression was 0.31.

Elastic-net regularized generalized linear models
We selected the top 20 peaks with highest mean feature importance value from 8439 peaks. To assess the survival predictive power of clinical metadata and top 20 ATAC-seq peaks, we trained elastic-net regularized generalized linear models using 'glmnet' package in R . We tested the performance of three different models that are built using following features -1) top 20 ATAC-seq peaks 2) clinical metadata only 3) clinical metadata + top 20 ATAC-seq peaks. Leave one out cross validation imputation was applied and alpha parameter was explored between 0 to 1 with incrementing step size of 0.1. Out of 36 samples, IDH mutation information of 4 samples was not available. These 4 samples were excluded from models which include clinical metadata as features. Prediction accuracy of models was assessed by computing R 2 value.

Processing of RNA-seq data
The quality control check on RNA-seq reads was performed with FastQC v0.11.7. Adapter sequences and bad quality segments were trimmed using Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program STAR v2.6.1e. Duplicate aligned reads were removed using the program sambamba v0.6.8. Gene-level expression was assessed by counting features for each gene, as de ned in the NCBI's RefSeq database. Read counting was done with the program featureCounts v1.6.2 from the Rsubread package. Raw counts were normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0.

ATAC-seq and RNA-seq correlation analysis
Sixteen samples have atac-seq chromatin accessibility as well as RNA-seq gene expression data which is used for correlation analysis. To compare the chromatin accessibility and gene expression, we identi ed the peak-gene pairs which are located within same topologically associating domain (TAD). Peaks having at least 90% overlap with TAD are selected for further analysis. Out of 8439 peaks, 8419 peaks show > 90% overlap with TAD region and 8416 peaks have at least 1 gene within the overlapping TAD. Pearson correlation coe cient was calculated for each peakgene pair.

Analysis of differential chromatin accessibility
The signi cant changes of chromatin accessibility in each subgroup (Grade II, Grade III, Grade IV Primary and Grade IV Recurrent) were assessed with the R package DESeq2 v1.26.0. The 8439 most important peaks from the random forest regression model are selected for the differential analysis. Samples from a speci c subgroup (e.g. Grade II) are compared against remaining samples from the dataset. Differential peaks with log 2 FoldChange > 0.3 and FDR < 0.05 are selected for motif enrichment analysis using HOMER v4.10. Remaining peaks from initial set of 8439 peaks are used as background data set in HOMER analysis. Only known human motifs from CIS-BP 2.0 are considered.

Radiation Treatment
Three previously characterized patient-derived cultures (HK296 21 , HK357 21 , TS603 22 ) underwent a radiation dose of 9 Gy in three fractions in a XenX (X-Strahl, Suwanee, GA) pre-clinical cabinet irradiator, with output calibrated using NIST-traceable instruments. The instrument parameters were: 220 kVp, 0.67 mm Cu HVL, dose rate ~6cGy/sec, delivered using a 100mm x 100mm collimator with 2 cm backscatter below the cell plates.

Results
Atac-seq/RNA-seq correlation matrix creates chromatin maps with candidate target genes This study includes thirty-eight glioma specimens spanning all grades (Grade II-7 specimens, Grade III-9 specimens, Grade IV primary-9 specimens, Grade IV recurrent-13 specimens). Seventeen specimens were determined to have an IDH1 mutation, seventeen specimens were determined to be IDH1 wildtype. In four specimens, the IDH1 mutant status could not be determined. The collection includes three paired specimens in which a single patient underwent two surgeries (MG 18/MG19, MG20/MG21, MG27/28). Sixteen samples were of su cient quality to allow RNAsequencing analysis. The demographic and clinical information is shown in Table 1. Restricting analysis to those peaks with a univariate correlation of |r 2 |>0.6 revealed that the majority of peaks were positively correlated with survival ( Figure 1A) and positively correlated with gene expression ( Figure 1B). Gene ontology enrichment analysis of the identi ed correlated genes (peak-survival correlation |r 2 |>0.6 and peak-gene correlation |r 2 |>0.6) identi ed "Nervous System Development" as the most enriched gene module ( Figure 1C). Clustering using Uniform Manifold Approximation and Projection (UMAP) identi ed two clusters that segregated by IDH1 mutant status ( Figure 1D-E). While it is widely presumed that the IDH1 mutation induces DNA methylation and a closed conformation, there was a non-signi cant trend (p=0.1) towards IDH1 mutant samples having more peaks ( Figure 1F).
Prognostic model predicts identi es atac-seq peaks associated with grade/survival.
Restricting the focus to only those genes with the most prognostic value, a random forest model identi ed 8439 peaks that were predictive of survival. The majority of these peaks mapped to intergenic regions. Using the atacseq/RNA-seq correlation matrix, 29,679 genes were correlated with the identi ed 8439 peaks (gene-peak correlation |r 2 |>0.6). Contrary to the hypothesis that peaks target the "nearest gene," only 1863 (2.5%) of these correlations were between a peak and its "nearest gene." Linear regression models identi ed six peaks that predicted survival with a high degree of accuracy (R 2 =0.6845). The six most closely correlated target gene regions are shown in Figure 2A.
Four of the annotated genes have reported roles in cancer. GUCY1B3 23,24 and IMPAD1 25,26 are reported oncogenes while GRIA2 27 and LZTS1 28 are reported tumor suppressors. This peak linear regression model was compared against a model using established predictive clinical variables (age, gender, grade, pathology and IDH1 status) which yielded a higher degree of accuracy (R 2 =0.7271 Figure 2B). Combining the atac-seq peaks and clinical variables created the most accurate model (R 2 =0.8164 Figure 2C) indicating that four of the atac-seq peaks have predictive value independent of traditionally used clinical variables.

Recurrent grade IV gliomas are associated with fewer open chromatin sites
Each grade was compared to all others to identity any differentially represented atac-seq peaks (Figure 3). These grade-speci c peaks then underwent HOMER motif analysis to determine which likely transcription factors drove gene expression in each grade. Finally, peak-associated genes were identi ed and underwent gene ontology analysis to determine candidate target gene pathways for each grade. Grade II (2326 peaks, 3204 genes), grade III (1708 peaks, 2508 genes), and grade IV (primary) (3058 peaks, 3819 genes) samples had similar and heavily overlapping sets of transcription factor motifs and enriched gene modules (Figure 3). In contrast, grade IV recurrent (14 peaks, 37 genes) had relatively few peaks and no signi cantly enriched gene modules.
To determine if this nding was speci c to the identi ed 8439 peaks or part of a larger trend, the total peak count was calculated for each of the grades. Similar to the previous nding, grade IV (recurrent) samples had signi cantly fewer total peaks than grade II samples (ANOVA) (Figure 4A-B). This collection contains three "paired" samples each of which were grade IV that underwent resection followed by standard adjuvant treatment (temozolomide and radiation) and later a second resection. In each case, the later sample ("recurrent" sample) had fewer peaks ( Figure  4C). This invites two possibilities. The rst possibility is the decrease in peak count is a selection process where only cells with fewer peaks grow back. The second possibility is that the decrease in peak count is a consequence of the clinical treatment (e.g. surgery, temozolomide and radiation). In support of the latter induction hypothesis, MG 30/31 was a patient who had two distinct tumors thar were present on initial diagnosis. The rst (MG 30) was resected and the second tumor (MG31) underwent chemotherapy and radiation followed by resection. To provide further evidence for the latter hypothesis, three previously characterized patient-derived glioma cultures (HK296 21 , HK357 21 , TS603 22 ) underwent atac-seq analysis before and after 9 Gray of radiation in three daily fractions. All three cell lines showed a decrease in peak count following radiation ( Figure 4D).

Discussion
Glioblastoma is a currently incurable disease and despite decades of research and hundreds of clinical trials the prognosis remains grim. One frustration encountered by several trials 6 including those investigating the roles of radiation 4 is that improvements in progression-free survival do not always translate into improvements in overall survival. One possible explanation for this phenomenon is that while certain therapies may initially impede tumor growth those cells that survive may take on additional attributes that make them more destructive and refractory to treatment. The data from this study provides an intriguing new model for this transition. The majority of atac-seq peaks identi ed in this study were correlated with better survival and tended to positively correlate with genes of nervous system differentiation. There was a non-signi cant trend for total peak count to decrease as the grade increased with the most signi cant decrease seen in the recurrent grade IV samples. It may be the case, that the closing of DNA regions responsible for driving neural differentiation induces more malignant behavior and that this process can be accelerated by radiation. If this were true, it may be bene cial to pair radiation therapy with other compounds that could potentially minimize this unwanted side effect.
With advancements in the availability and cost of sequencing, traditional histological classi cations have been replaced with more sophisticated genetic markers. Despite the wealth of glioma expression data, attempts to identify patterns of gene expression that predict survival have been underwhelming 29,30 . In contrast, epigenetic data such as DNA methylation (e.g. G-CIMP) seems to have a high degree of prognostic value 293132 . Supporting this observation, this study created an accurate prognostic survival model with only six atac-seq peaks, four of which remained signi cant when controlling for grade, age and IDH1 status.
The discovery of the IDH1 33 and H3K27M 34,35 mutations has revealed the importance of epigenetic dysfunction in the early stages of cancer development. The presumed mechanism of the IDH1 mutation is to favor DNA methylation by inhibiting DNA demethylase enzymes (e.g. Tet family) 32,36 . This is hypothesized to induce decreased DNA accessibility with subsequent gene down-regulation 37 . In the current study, UMAP clustering divided samples by IDH1 status although several IDH1 mutant samples clustered with the IDH1 wildtype group. Running counter to the hypothesis that the IDH1 mutation leads to hyper-methylation and decreased DNA accessibility, there was a trend for IDH1 mutant samples to have more total peaks. This implies that the relationship between DNA methylation and DNA accessibility may be more complicated than previously assumed.
The relative ease and low input requirement of atac-seq technology has led to wide spread use on a variety of tissue types [38][39][40] . However, while the data is easy to obtain it is challenging to interpret without additional sources of supporting evidence. In the absence of this supporting evidence several assumptions have become common in the literature namely that a given peak positively regulates the nearest gene. In the current study, the authors assembled a su ciently large dataset to create a correlation matrix to make some rational assumptions about the relationships of peaks and target genes. Using relatively strict cut-offs, the data showed that most peaks were correlated with one or at most a few genes. The majority of peaks correlated positively with target gene expression. However, there was no correlation between peaks and the nearest genes. On the contrary, peaks and correlated genes were frequently separated by long distances and including many intervening genes. It should be noted that while this study provides some evidence of speci c intergenic regions regulating speci c target genes, the data is correlational and indirect and needs to be strengthened by alternative methods such as genetic manipulation (CRISPR) or chromosome conformation capture-based methods.
In conclusion, this study uses an innovative peak:gene correlation matrix to make create an epigenetic gene regulatory map of gliomas and then uses this matrix to both identify speci c gene regulatory regions of interest as well as introduce a new model of epigenetic malignant progression.

Declarations
Funding: This study was supported by research funds supplied by the Dean's O ce of the University of Cincinnati School of Medicine as part of a recruitment package for the corresponding author MCG.
Con icts of interest: None of the authors have any con icts of interest to report.
Availability of data and material: All sequencing data will be made publicly available. Sequencing Database: The human database is publicly available from "basespace" login.illumina.com. All experimental protocols were approved by the University of Cincinnati IRB committee. This study was used deidenti ed patient specimens and was designated "Not human research" 19-02-25-02 (5/3/2019). Consent is not applicable.
Human Database: The creation of the human database followed all methods in accordance with relevant guidelines and regulations. This study was used de-identi ed patient specimens and was designated "Not human research" 19-02-25-02 (5/3/2019). Consent is not applicable.
Consent to Participate: This study does not involve human subjects  Each of the 8349 peaks was assigned to one of the four grades (Grade II, Grade III, Grade IV primary, Grade IV recurrent) based on representation in that group. These peaks underwent HOMER analysis to determine enriched transcription factor motifs and gene ontology based on the associated genes. Grade IV recurrent samples had very few peaks and no signi cantly enriched gene modules.

Figure 4
A. The total peaks (transcripts per million>10) were calculated for each sample and tabulated by grade. Grade II samples had signi cantly more peaks than Grade IV recurrent samples (ANOVA). B. The total peak counts for three paired samples are shown. C. Samples were divided by IDH1 status and by primary vs. recurrent status. D. Three patient derived glioblastoma cell lines were subjected to atac-seq analysis before and after 9 grey radiation in three fractions.