Generation of clinical database
The University of Cincinnati maintains an IRB-approved bio-repository of surgically resected specimens with associated clinical and demographic information including medical record number, diagnosis, name, gender, age, date of collection and survival. A collection of 38 specimens were chosen from the bio-repository with a goal to obtain a heterogenous mixture of low-grade and high-grade specimens as well as those with long and short survival. Specimens were between two and nine years old prior to selection to allow for adequate follow-up for survival analysis. All specimens denoted as “Grade II,” “Grade III,” or “Grade IV primary” are taken from the patient’s initial surgery and have not been exposed to any chemotherapy or radiation treatment. The presence of the IDH1 mutation was determined via either immunostaining or by direct sequencing.
Protocol for generating Atac-seq libraries
Nuclei were isolated from patient tumor tissue as described by Habib et al., with slight modifications. Briefly, frozen tissue samples (<0.5 cm) were incubated on ice in 1mL of Nuclei EZ Lysis Buffer (Sigma, #Nuc-101). The tissue was then homogenized using a glass tissue grinder. 1mL of additional EZ Lysis Buffer was added, the tissue was incubated on ice for 5 minutes, and then filtered through a 70mm cell strainer. Nuclei were collected by centrifugation and resuspended in 1mL ATAC-resuspension buffer (10mM Tris-HCl (pH 7.4), 10mM NaCl, 3mM MgCl2, and 0.1% Tween-20) and filtered through a 40mm cell strainer. 1mL of ATAC-resuspension buffer was then added and the nuclei were filtered through a 5mm cell strainer. ATAC-seq library preparation was performed as previously described.
RNA-seq library creation
Frozen patient tumor tissue (<0.5 cm) was placed into pre-chilled RNA-ice later (Invitrogen, #AM7030) for at least 24 hours at -20°C. The tissue was then homogenized with Lysis/Binding Buffer (Invitrogen, #00671513) in lysing matrix D tubes (MP, #116913050-CF) for 40 seconds at 6m/s, using the MP FastPrep-24 5G Instrument (MP, #116005500). RNA (>200 nucleotides) was then purified using the Quick-RNA MiniPrep Plus Kit (Zymo Research, #R1057).
Processing of ATAC-seq data
ATAC-seq reads in FASTQ format were first subjected to quality control to assess the need for trimming of adapter sequences or bad quality segments. The programs used in these steps were FastQC v0.11.7, Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program HISAT2 v2.0.5. Aligned reads were stripped of duplicate reads with the program sambamba v0.6.8. Peaks were called using the program MACS v2.1.2 using the broad peaks mode.
To obtain the consensus set of unique peaks, called peaks from all samples are merged at 50% overlap using BEDtools v2.27.0. The consensus peaks, originally in BED format were converted to a Gene Transfer Format (GTF) to enable fast counting of reads under the peaks with the program featureCounts v1.6.2. Each feature in the GTF file has the value "peak" on the second column. Peaks located on chromosomes X, Y and mitochondrial DNA are excluded from further analysis. Raw read counts are normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0.
Random forest regression model
Log-transformed data were filtered to remove low variance peaks using the VarianceThreshold function from scikit-learn v0.24.1 using a threshold of 0.45. The resulting 8439 features were used to build a random forest regression model using RandomForestRegressor from the ensemble module of scikit-learn. The options used was n_estimators: 1000, max_samples=0.9, oob_score=True. Permutation importance was used to rank the peaks. The out-of-bag R2 value for the regression was 0.31.
Elastic-net regularized generalized linear models
We selected the top 20 peaks with highest mean feature importance value from 8439 peaks. To assess the survival predictive power of clinical metadata and top 20 ATAC-seq peaks, we trained elastic-net regularized generalized linear models using ‘glmnet’ package in R . We tested the performance of three different models that are built using following features –1) top 20 ATAC-seq peaks 2) clinical metadata only 3) clinical metadata + top 20 ATAC-seq peaks. Leave one out cross validation imputation was applied and alpha parameter was explored between 0 to 1 with incrementing step size of 0.1. Out of 36 samples, IDH mutation information of 4 samples was not available. These 4 samples were excluded from models which include clinical metadata as features. Prediction accuracy of models was assessed by computing R2 value.
Processing of RNA-seq data
The quality control check on RNA-seq reads was performed with FastQC v0.11.7. Adapter sequences and bad quality segments were trimmed using Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program STAR v2.6.1e. Duplicate aligned reads were removed using the program sambamba v0.6.8. Gene-level expression was assessed by counting features for each gene, as defined in the NCBI's RefSeq database. Read counting was done with the program featureCounts v1.6.2 from the Rsubread package. Raw counts were normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0.
ATAC-seq and RNA-seq correlation analysis
Sixteen samples have atac-seq chromatin accessibility as well as RNA-seq gene expression data which is used for correlation analysis. To compare the chromatin accessibility and gene expression, we identified the peak-gene pairs which are located within same topologically associating domain (TAD). Peaks having at least 90% overlap with TAD are selected for further analysis. Out of 8439 peaks, 8419 peaks show > 90% overlap with TAD region and 8416 peaks have at least 1 gene within the overlapping TAD. Pearson correlation coefficient was calculated for each peak-gene pair.
Analysis of differential chromatin accessibility
The significant changes of chromatin accessibility in each subgroup (Grade II, Grade III, Grade IV Primary and Grade IV Recurrent) were assessed with the R package DESeq2 v1.26.0. The 8439 most important peaks from the random forest regression model are selected for the differential analysis. Samples from a specific subgroup (e.g. Grade II) are compared against remaining samples from the dataset. Differential peaks with log2FoldChange > 0.3 and FDR < 0.05 are selected for motif enrichment analysis using HOMER v4.10. Remaining peaks from initial set of 8439 peaks are used as background data set in HOMER analysis. Only known human motifs from CIS-BP 2.0 are considered.
Three previously characterized patient-derived cultures (HK296, HK357, TS603) underwent a radiation dose of 9 Gy in three fractions in a XenX (X-Strahl, Suwanee, GA) pre-clinical cabinet irradiator, with output calibrated using NIST-traceable instruments. The instrument parameters were: 220 kVp, 0.67 mm Cu HVL, dose rate ~6cGy/sec, delivered using a 100mm x 100mm collimator with 2 cm backscatter below the cell plates.