Patients and Clinical Information
Patients initially diagnosed with lung cancer at Peking University Shenzhen Hospital from July 2018 to May 2019 were included in the study. Patients’ information, including pathology, tumor type and smoking history was collected for all 6 patients (Table S1). This study was carried out in accordance with Declaration of Helsinki and approved by the Ethics Committee of Peking University Shenzhen Hospital (Approval number: 2018033) and all patients provided written informed consent.
ATAC-seq Library Preparation and Sequencing
Isolation of nuclei from frozen tissues and transposition reaction for ATAC-seq were performed as previously described (Corces et al. 2017). Briefly, frozen tissue sections of each sample were prepared on dry ice to avoiding thawing. Then the tissue was transferred into a pre-chilled 2-ml Dounce homogenizer containing 2 ml of 1× cold homogenization buffer. Tissues were then homogenized for several minutes with different sizes of the pestles and filtered by nylon mesh filter. The nuclei obtained by sucrose gradient centrifugation were transferred to a tube containing resuspension buffer and then pelleted by centrifugation at 500 rcf for 10 min. The supernatant was discarded and nuclei were resuspended in the Omni-ATAC-seq mix (Corces et al. 2017) containing Tn5 transposase. The two technical replicates were prepared with the two separated transposase reactions from the same set of nuclei of one donor. The reaction was incubated at 37 °C for 30 min in a thermomixer with shaking at 500 rpm and the tagmentated DNA was isolated with MinElute PCR Purification kit (Qiagen). PCR was performed using NEBNext High-Fidelity 2X PCR master mix (72°C 5min, 98°C 30sec,13cycles of: [98°C 10sec, 63°C 30sec, 72°C 1min], 72°C 5min). Sizes of PCR products were selected using Ampure XP by 0.5X volume followed by a 1.0X final volume. The libraries were then prepared based on a BGISEQ-500 sequencing platform(Huang et al. 2017) and sequenced with paired-end 50-bp read lengths.
ATAC-seq Data Processing and Peak Calling
ATAC-seq data was processed using the ENCODE ATAC-seq pipeline. Briefly, the fastq files were processed for adaptor trimming by cutadapt software (version 1.9.1)(Martin 2011) with “cutadapt -a TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG -A GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -m 5 -O 5 -e 0.1” options, coming with the clean fastq files. Clean fastq files were aligned to the human hg38(GRCh38) genome using bowtie2(version 2.2.6)(Langmead, Salzberg 2012) with “bowtie2 -k 5 -X 2000 --mm --local --threads 4 ” options. Then, Samtools (Li et al. 2009) was used to generate the bam files as well as sort and retain properly paired reads with “samtools view -F 1804 -f 2 ” and “samtools sort” options. After this, Picard was used to remove PCR duplicates with default arguments. Mitochondrial reads were also filtered. Finally, final aligned, de-duplicated bam files were generated for the subsequent downstream analysis.
According to data quality control standards of ENCODE , only samples with transcription start site (TSS) enrichment value >5 were classified as qualified and used for downstream analysis. Peak calling was carried out to generate high quality fixed-width peaks from the final bam files of QC-passed samples, each coupled with technical replicates to improve sensitivity and the accuracy. Peak calling was performed using MACS2 with “macs2 callpeak -t bamfile --cap-num-peak 300000 --shift -75 --extsize 150 --nomodel --call-summits --nolambda --keep-dup all -p 0.01” options in each replicate. Only reproducible peaks between two technical replicates identified by Irreproducible Discovery Rate (IDR) analysis (threshold 0.05) were retained for downstream analysis. Fixed-width peak set was then generated by extending the peak summits by 250 bp on either side to a final width of 501 bp. We filtered the ENCODE hg38(GRCh38) blacklist manually to avoid the interference of the high signal regions such as centromeres, telomeres, and satellite repeats. Peaks that beyond the ends of chromosomes after extending were also filtered (Corces et al. 2018). Coverage of each defined peak was evaluated using BEDTools (v2.26.0) (Quinlan, Hall 2010) with “multicov” argument, while raw count matrix was normalized using “DESeq2” package in R by Reads Per Million aligned reads (RPM). The ATAC-seq data from each replicate of our primary tumor samples generated 48~141M (median = 101M) reads, in which two samples did not reach TSS value of 5, leaving 4 samples from LUAD patients for further analysis.
ATAC-seq Data Analysis
We used previously described analysis method on peaks selecting and exploring(Liang et al. 2020)4. Briefly, we selected the available data from TCGA according to peak’s quality. A peak would be considered low-quality if it has a same value in more than 5% patients from single type of cancer, as the repeat values were likely produced by nonsense 0s before normalization. Eventually, we obtained 64316 peaks across 386 samples from TCGA dataset. To further reduce the scale of data, we then applied the previously developed algorithm on the correlation network construction with retained peaks from TCGA, in which two peaks would be connected if their direct or indirect correlation is significant(Liang et al. 2020) (Fig. 1). We considered the direct correlation between two peaks is significant, if the absolute value of correlation coefficient is not less than 0.4, considering the noisy level is around 0.2. Furthermore, we considered the indirect correlation between the two peaks is significant, if their direct correlations with the third peak are both significant. Here we allowed the indirect correlation to amplify peaks' aggregation effect in the network. To assess the correlation between peaks more reliably, outlier values were removed before calculating correlation (Hearst 1999). In practice, we used the function “cor” from R package “stats” V3.6.2 with default arguments. We selected the 10% most frequently-connected peaks for the further analysis, as those peaks were more likely to be the hub peaks in the network with high connectivity.
The unsupervised classification method principal component analysis (PCA) was used to analyze the selected ATAC-seq peaks from 22 LUAD patients in TCGA dataset. We used the function "PCA" from R package “FactoMineR” V1.34 with default arguments which produced 5 most important components(Lê et al. 2008). The association between components and smoking status were checked by the distribution of samples classified by components. To statistically assess the difference between distribution distance of samples from LUAD patients with different smoking histories (based on pack-years of smoking, status of smoking exposure was classified as light smoker (<20 pack-years) and heavy smoker (≥20 pack-years)), two-tailed unpaired t-tests was performed. P < 0.05 were considered significant. We used the function “survdiff” from R package “survival”(Harrington, Fleming 1982), in which Chi-Squared Test was used to assess the PFS and OS analysis. Then the function “pchisq” from R package “stats” was used to calculate P value with arguments df=1, lower tail=F.
For the pathway analysis, all linc-RNA genes (such as RP11- genes) have been removed as they are excluded in KEGG database. We searched the potential pathways using DAVID 6.8. with the non-linc-RNA genes related to top 100 highest contributing peaks of targeted component. Pathways were found with p value less than 0.1, which is the recommended value of DAVID (Huang da et al. 2009).
 https://www.encodeproject.org/atac-seq/. Retrieved 24/12, 2019