Peaksat pipeline facilitated profiling of Histone H4 acetylation patterns in a breast cancer progression cell model.
We applied Peaksat to study the histone H4 Lys5 and Lys8 acetylation (H4K5ac and H4K8ac) ChIP-seq patterns in the MCF10A human breast cancer progression model consisting of four distinct cell lines: the normal-like epithelial MCF10A cells [14], premalignant MCF10AT1 cells [15]that express constitutively activated HRAS [15], and the subsequent xenograft derivatives, MCF10DCIS (ductal carcinoma in situ) cells [17] and MCF10CA1a metastatic cells [16]. We generated replicate ChIP-seq datasets for H4K5ac and H4K8ac in each cell lines conducted a pilot round sequencing at a relatively shallow sequencing depth and obtained an average of 11 million reads with no more than 5k peaks per individual dataset (Figure S1A&B, Figure S2A&B). The FRIP (Fraction of Reads in Peak) values indicated these libraries have low background signal (Figure S1C and S2C), and replicate analysis showed overall concordance for replicates for each target in different cells (Figure S1D-E and S2D-E). Initial quality control analysis showed this was insufficient depth for a robust peak call but indicated that the ChIP libraries had been successful for both marks (Supplemental figures S1 and S2). In addition, we found each H4Kac mark shared highly similar patterns of peak enrichment across different cell lines, despite low frequency of peak overlaps between biological replicates (S1D-G and S2D-G).
To determine how much more sequencing was required to have enough read depth to support a rigorous analysis, we combined all replicates per histone modification into meta-pools aiming to approximate the saturated peak count for each target. The results estimated 51.5 million reads for H4K5ac and 49.7 million reads for H4K8ac were sufficient to reach saturation, hitting maximum peak counts of 28,400 and 29,200, respectively (Fig. 2A). These maximum number of peaks for each meta-pool are representative of the peak saturation point beyond which additional sequencing is not necessary. Consistent with the previous observations for other targets, we found the majority of each peak saturation curve appeared to be linear where increases in peak count were directly proportional to increases in read count. Additionally, we investigated the suitability of a linear regression model to describe the linear portion of the peak saturation curve. The resulting linear models fit the saturation profiles with an adjusted R2 value around 90% (Table S1). In contrast to the meta-pools, none of the individual libraries appeared to approach peak saturation, with peak counts ranging from 420 to 13,690 (Fig. 2B&2C).
To determine how many additional reads were needed to reach peak saturation, we used the corresponding meta-pool saturation peak count and extrapolated a linear regression to obtain a read depth estimate per library (Fig. 2B&2C). Based on these estimates, we performed a second round of sequencing aimed at obtaining enough read depth for each sample. As expected, the combined datasets exhibited marked improvement in the number of total peaks and an increased concordance of peaks between replicates and across cell lines (Figure S3 and Figure S4). Interestingly, the peak saturation profiles were inconsistent across cell lines for a single mark, and this variability was more pronounced in the H4K5ac peak enrichment profiles (Fig. 2B). The MCF10A H4K5ac replicates demonstrated sufficient read depth and reached peak saturation earlier than that estimate of the meta-pool, however, they did not show high enrichment compared to the meta-pool peak enrichment profile. In contrast, the MCF10AT1 and MCF10CA1a replicates for H4K5ac were expected to reach saturation around the meta-pool peak saturation point, but neither replicate was saturated, suggesting a need for additional sequencing to improve the dataset and increase the number of meaningful peaks. The MCF10DCIS cell line also showed a sharp increase in H4K5ac peak number with additional read depth. For H4K8ac, the saturation curves showed similar cell line specific patterns as that of H4K5ac. The discrepancy in read requirements for the same targets across different cell lines suggest that peak saturation and read requirements could be due to underlying heterogeneity across these cell lines.
Additional factors influencing peak saturation.
We used Peaksat to investigate the influence of fold-enrichment and input control read depth on peak saturation. Increasing the minimum fold-enrichment values for peak analysis (see methods) showed an expected decrease in the saturation peak number (Fig. 3A). Notably, when the minimum FE threshold is set at a value greater than 5, the peak enrichment never reaches saturation. MACS2 calls peaks against input utilizing dynamic Poisson distribution to effectively capture the local enriched peaks against those in input [22]. This peak calling algorithm and the read depth of input control in ChIP-Seq (as suggested by ENCODE) are both important factors involved in peak saturation [10, 22]. Through comparison of the peak saturation patterns under different down-sampling of input reads, we investigated whether increasing the number of control input reads enable saturation. For both targets, H4K5ac and H4K8ac, peak numbers saturate at 50–60 million input reads (Fig. 3B). Given sufficient input reads, the incremental target read depth drives peak enrichment to saturate and before plateauing (Fig. 3C). Taken together, both fold-enrichment threshold and sufficient read depth of input controls will help in estimating saturated read depth for targets and achieving high quality data for downstream analysis.
Overall, H4K5ac and H4K8ac showed similar enrichment profiles across the breast cancer progression cell lines (Fig. 4A&B). To compare binding patterns for these targets, we performed pairwise differential enrichment analysis of each mark across cell lines and identified significantly differentially enriched peaks that could be grouped into two clusters based on enrichment signal. Cluster 1 were H4K5ac and H4K8ac peaks more enriched in MCF10A cells, whereas Cluster 2 were peaks more enriched in the tumorigenic derivative cell lines (MCF10A-T1, -DCIS and -CA1a) compared to MCF10A. For example, the enrichment of H4K5ac and H4K8ac near the SYBU transcription start site (TSS) show a similar pattern of enrichment profiles that are stronger in normal MCF10A cell than the tumorigenic derivative cell lines (Fig. 4C). Conversely, H4K5ac and H4K8ac was more strongly enriched near the HRAS TSS in tumorigenic cells MCF10AT1, MCF10DCIS, MCF10CA1a compared to MCF10A cells. These results indicate overall similarity between H4K5ac and H4K8ac profiles that may change with increased proliferative capacity of the cells. Thus, Peaksat provided an appropriate sequencing depth estimate required to identify differential histone H4 acetylation modifications across different cell lines.
Peaksat pipeline is generally applicable for peak enrichment analysis with other assays.
We further applied Peaksat to evaluate the peak saturation curves of both Cut&Run and ATAC-seq datasets in order to evaluate its general applicability. We evaluated a previously published data set that compared different cell preparation methods for Cut&Run analysis of H3K4me3 and Ikaros, a zinc finger transcription factor (TF) [11]. Regardless of sample preparation (frozen or fresh), the PTM H3K4me3 required only 10M reads to reach saturation, and a linear regression relationship exists between read depth and peak count before saturation (Figure S5A). In contrast, the Ikaros TF saturation profiles were variable across the different preparations (Figure S5A). ATAC-seq peaks were evaluated from different datasets generated from different leukemia subtypes [16] showing that 100M reads fit well with linear model based (Figure S5B). These results together suggest that the Peaksat linear regression model is useful to estimate the saturation and target read depth estimates for other sequence- and peak-based assays.