Strandedness and duplication of reads influence the peak calling of ChIP-exo data
ChIPexoQual [12] is extremely useful to determine sample quality before proceeding with the analysis. It gives the user an idea about how well the experiment was performed, based on library complexity, enrichment, and strandedness. IMR90 and K562 datasets are highly duplicated with approximately 86% and 93% redundancy rates respectively, as compared to the U2OS dataset which has a low 24% redundancy rate. Although reads are expected to accumulate over a few genomic locations in ChIP-exo samples, it is hard to distinguish whether this accumulation of reads is due to signal or PCR artifacts. The suitability of reads to be analyzed by ChIP-exo peak-callers can be estimated by FSR (Forward Strand Ratio) plots reported by ChIPexoQual, which gives a fair idea of strandedness of reads, which in turn is an important measure in many algorithms specific to ChIP-exo data (GEM, MACE, CexoR, Genetrack, ExoProfiler).
In the ChIP-exo datasets from the IMR90 and U2OS cell types, Unique Read Coefficient (URC) is very high and it decreases with increase in Average Read Coefficient (ARC), implying high ChIP enrichment and library complexity [12] (Fig. 1a). The same trend is observed in K562 cell line indicating high ChIP enrichment; however, the URC value is low in comparison to other cell types which suggests that the K562 library is less complex than the other two samples (Fig. 1a).
Ideally performed ChIP-exo experiments show exponential decay in the proportion of single-stranded regions in Region composite plots. More than half of the reads in IMR90 and K562 datasets fall only on a single strand while in the U2OS dataset, maximum reads fall on both the strands. The U2OS sample demonstrates a decrease in the proportion of reads on a single strand whereas, in IMR90 and K562 cell samples, the proportion of reads on a single strand are more spread around the median value (Fig. 1b, left panel).
In a well-performed ChIP-exo experiment, the enriched regions are expected to have an equal number of reads on forward and reverse strand. FSR plots (Fig. 1b) depict the rate at which the FSR value reaches 0.5 (indicates that there is an approximately equal number of reads on both strands). The FSR value quickly reaches 0.5 in both the quantiles for the U2OS sample as compared to both IMR90 and K562 cells. The constant value of FSR for a low minimum number of reads in IMR90 and K562 datasets indicates that these samples have very few unique positions to which reads are aligned. In other words, both these samples are highly duplicated. K562 is more duplicated than IMR90. On the other hand, the FSR plot (Fig. 1b) for U2OS cells has an approximately equal number of reads on forward and reverse strands, which implies an equal distribution of reads, as desired in an ideal ChIP-exo experiment.
The final step of the ChIPexoQual pipeline includes fitting the data to a linear model and estimate β1 and β2 which are parameters to estimate the library complexity. Samples with β1 10 and β2 0 are considered as deeply-sequenced high-quality samples. The median values of β1 in both IMR90 and K562 datasets are higher than 10 whereas, in U2OS dataset, it is lower than 10. The median values of β2 are higher than 0 for all three datasets. High values of β1 and β2 in IMR90 and K562 datasets (Fig. 1c) imply low library complexity and poor quality of ChIP-exo samples which are not deeply sequenced [12].
Overall, the U2OS sample has high-quality ChIP-exo data, as compared to the other two datasets, since it has low redundancy, high ChIP enrichment, and library complexity and an approximately equal number of reads on opposite strands. Due to the huge amount of strand imbalance in IMR90 and K562 cells, it becomes difficult to identify precise border pairs of the protein-DNA crosslink pattern. In such cases where the strand imbalance is high and the libraries are less complex, it becomes difficult to separate binding signal from noise.
Deduplication of reads affects the performance of ChIP-exo peak callers
The Genetrack peak-caller (represented by the number of peak-pairs and not peaks) reports the highest number of peaks in IMR90 and K562 cell types (which are highly duplicated at ~86% and ~93% respectively) followed by GEM, MACE, MACS, and Peakzilla (Fig. 2a). On the contrary, GEM and MACS report maximum peaks for U2OS samples followed by Genetrack, MACE, and Peakzilla (Fig. 2a). Also, GEM reports the highest number of binding events in U2OS dataset (Suppl. Table S1).
Genetrack identifies peaks based on a local maxima in accumulated reads; no other peak is reported within a fixed distance of the highest peak. It has no threshold of peak height beyond which a peak should be considered as a true peak [23] due to which it is capable of reporting the maximum number of peaks (Suppl. Table S1). It should also be noted that the total number of reads in the U2OS sample, as reported in the original study [10], was far less in comparison to the other cell types, due to which the number of peaks reported by each tool is less in comparison to the other two cell types.
Carroll et al 2014 [24] reported that the removal of duplicated reads (i.e., removal of PCR duplicates) in ChIP-exo may lead to a loss of signal. To check whether removing duplicates makes any difference to the predictions about the K562 and IMR90 datasets, the reads with identical coordinates on 5’ and 3’ ends were filtered using Picard [25] and peak calling was repeated for the deduplicated reads. (Fig. 2b). GEM, MACE, and MACS report an approximately similar number of peaks after deduplication of the datasets. But contrary to an expected decrease, the number of peaks called by Genetrack and Peakzilla increased by 2-fold and 1.5-fold respectively for U2OS cell type (Suppl.Table S2). However, the number of binding events discovered by Genetrack and Peakzilla drop drastically in case of IMR90 and K562 cell types (Fig. 2b), indicating that the high number of peaks reported was because of duplicate reads.
ChExMix has an inbuilt read filter option to remove the PCR artefacts. It forces a per-base limit on read counts to reduce the number of duplicated reads[11]. ChExMix discovered 58672, 39454 and 18228 binding events for IMR90, K562 and U2OS cell types respectively which is extremely high in number in comparison to the binding events reported by ExoProfiler (4496, 313, 6236 GR binding events in IMR90, K562 and U2OS respectively)[10]. The binding events reported by ChExMix are a total of direct binding (where the protein is bound to a canonical sequence) and indirect binding (where the binding location is degenerate).
Peak-pairing tools are more prone to identify false peaks
To assess the number of unique regions found by each peak caller when compared to the rest, the data reported by all the peak callers were merged into a single set, except the one whose uniqueness was to be measured. The peaks of the tool, whose uniqueness was to be found, were then intersected with the set, to report the number of unique regions discovered by each software. Genetrack reports maximum number of unique regions in IMR90 and K562 datasets whereas GEM finds highest number of unique genomic locations in U2OS dataset. Peakzilla does not report any unique peak (Fig. 3a).
The unique peaks were scanned for the reported direct binding motif MA0113.2 (JASPAR [26]) for GR using FIMO[27]. GEM, and MACS reported the maximum motif occupancy in unique peaks for all three datasets followed by MACE. Genetrack performed poorly in comparison to other tools and motif occupancy could not be found using Peakzilla since it did not report any unique regions (Fig. 3b). It is to be noted that although Genetrack reported the maximum number of unique regions (Fig. 3a), the motif occupancy in these regions is the lowest. The maximum occupancy of GEM peaks can be explained by the fact that GEM links peak finding to motif discovery and reciprocally improves the binding event prediction around the motifs with high tag accumulation. MACE pairs the peaks on opposite strand without any constraint of a fixed sequence between the border peak-pairs. Motif occupancy in MACS peaks is also higher than MACE in IMR90 and U2OS cells but not in K562 cell type (Fig. 3b).
Thus, the binding event predictions using tools where peak-pairs are formed by nearest peaks either by the software (MACE) or manually (Genetrack) do not perform at par with tools like MACS, Peakzilla, and GEM where most of the parameters are estimated from the data itself.
Total motif occupancy in peaks
GR is known to bind multiple DNA sequences via direct as well as tethered binding to DNA by binding to already recruited proteins. The peaks reported by all direct binding peak callers were scanned for the GBS motif (JASPAR MA0113.2) using FIMO[27]. FIMO reported the highest number of motifs, with a p-value of less than 1e-4, in the peaks reported by the GEM (18105 hits in IMR90, 2803 hits in K562 and 44148 hits in U2OS datasets (Table S3)) followed by MACS, Peakzilla, MACE while the least number of motifs were obtained in the Genetrack predicted binding sites (261 hits in IMR90, 158 hits in K562 and 318 in U2OS datasets (Table S3)) (Fig. 5a). The K562 dataset, which had the highest level of duplicated reads, reported the least number of motifs for all the peak callers (Fig. 5a).
Peaks discovered before and after deduplication were scanned for the presence of GBS (JASPAR MA0113.2), FoxA1 (JASPAR MA0148.3), GATA (JASPAR MA0140.2) and STAT3 (JASPAR MA0144.2) motifs, which have been previously reported to bind GR in multiple studies [10]. When the peaks were scanned for GBS (JASPAR MA0113.2) in IMR90 and U2OS datasets and GATA (JASPAR MA0140.2) in K562 datasets (GATA sequences are known to be highly enriched in K562 cells [10]), GEM and MACS found an approximately equal number of motifs in peaks, irrespective of deduplication of reads (Fig.5b). Peakzilla identified a higher number of motifs after deduplication in the datasets thereby, implying that Peakzilla performance improves after removing PCR duplicates. A similar trend was observed when peaks were scanned for secondary motifs (FOXA1 and STAT3) in all three datasets (Suppl. Fig S1). MACE and Genetrack output had the least number of motif hits in all the datasets for all the scanned motifs including FOXA1 (JASPAR MA0148.3) and STAT3 (JASPAR MA0144.2) (Suppl. Fig S1).
Therefore, the motif occupancy reported in tools where peak-pairs are formed by nearest peaks either by the software (MACE) or manually (Genetrack) does not perform at par with tools like MACS, Peakzilla, and GEM, where most of the parameters are estimated from the data itself. GEM outperforms all the tools in the number of motifs identified in discovered peaks while minimum motif occupancy was reported in Genetrack binding events. ExoProfiler peaks were not analyzed for motif occupancy. This is because a motif file is needed as a prerequisite to run this tool and therefore all the reported peaks would have a motif instance. Thus, Exoprofiler is more useful to study the binding footprint of a known motif.
Validation of peak caller output based on the significance score of peaks
Peaks are ranked by a significance score in the output of all peak callers. The parameters deciding the significance of a peak are different for each peak caller. The peaks with highest significance score have highest motif occupancy; this decreases with the significance of the peaks. To investigate which peak caller ranked the peaks with the highest accuracy, we sorted the peaks in descending order of their rank and top n peaks (in multiples of 50) were plotted against the fraction of annotated motifs (present within 50 bp of a peak) reported by FIMO. GEM, MACS, and Peakzilla perform consistently better in all the datasets as compared to MACE and Genetrack (Fig.4).
De novo identification of TF binding site
To assess the peak-callers’ performance in terms of finding a motif that is similar to the previously reported JASPAR motif [26], the binding output from all the peak callers was submitted to MEME [20] for motif discovery. Since ChExMix and GEM have inbuilt options for motif discovery, MEME was not used separately for motif identification for these tools.
When used with default parameters, GEM reported only the half-site of GBS for both IMR90 and U2OS cell types and successfully identified GATA motif in the K562 dataset. ChExMix reported the full GBS motif, which was an exact match to the JASPAR motif, for IMR90 and U2OS datasets and the GATA sequence in the K562 data (Table 2).
For MACE peaks of IMR90 and K562 datasets, MEME reported very long motifs, none of which matched the JASPAR motif when inspected visually and also when the motifs were submitted to TOMTOM [28] to search for similar binding sequences. Instead, FOSL1, which is known to be repressed by GR binding [29], was reported by TOMTOM [28] to be the best match of the motif found in MACE binding events in IMR90 dataset. In the K562 dataset, the best match for MEME motif was SP1-like sequence. GR is known to bind indirectly to SP1 sequences via SP1 TF [30]. However, MEME was able to identify the GBS motif in the U2OS dataset (Table 2).
MACS identified a 44bp long repetitive motif, which resembles HSF1 binding sequence, in the IMR90 cell type. HSF proteins are known to regulate the function of GR [31]. It reported GATA and GBS motifs for K562 and U2OS datasets respectively (Table 2).
In Peakzilla peaks, MEME identified the GBS motif in the IMR90 and U2OS datasets and the GATA motif. In case of Genetrack, where the peaks were paired manually, MEME failed to generate a significant output for the highly duplicated IMR90 and K562 datasets but it identified the GBS motif in the U2OS data with least number of binding sites reported amongst all the tools (Table 2).
For the high-quality U2OS dataset, ChExMix and GEM successfully reported the previously annotated GR binding motifs. This may be because both these tools use ChIP-exo read distributions to map the DNA-protein interacting locations and thus they will only assign motifs on identified peaks. On the other hand, MEME, which has no peak information, reports ungapped sequences as output motif based on pattern search and recurrence of this pattern in a fasta input of multiple DNA sequences. This explains why the MEME output for MACS and MACE peaks in IMR90 and K562 datasets is different from the annotated motif results (Table 2).
Peak Resolution
GEM appears to be the best peak finding program amongst the direct binding tools followed by Peakzilla and MACS. Both GEM and Peakzilla can deconvolute closely spaced peaks and give better resolution whereas MACS generates peaks with large widths, which often spans up to 200 bp (Table S4). This reduces the resolution of peaks by MACS because the resulting peaks might be a merge of multiple small peaks, which in turn makes MACS suitable for large proteins like histones. This could be the reason why MACS reported HSF1 instead of GBS motif in the IMR90 dataset. It should not be ignored that MACS, in spite of being a ChIP-seq peak calling tool, outperforms MACE and Genetrack. Peakzilla, on the other hand, identifies lesser peaks than GEM and MACS but performs reasonably well for de novo identification of binding motifs. It successfully reported the previously annotated motifs for all three datasets using MEME.
ChExMix identifies motif with higher accuracy and mode of binding using read distribution
It is well-known that protein-DNA interactions do not depend strictly on the availability of a canonical binding motif. Proteins can interact with DNA indirectly, i.e., via protein-protein interactions or they may have a broad spectrum of recognition sequences to which they bind with a different affinity [10,11]. Starick et al. [10] used ChIP-seq of GR for de-novo motif discovery and then used these motifs along with ChIP-exo read distribution in ExoProfiler to find the binding footprint of GR and its interaction with other proteins. ChExMix encompasses both the steps in a single tool, without the requirement of a known motif.
ChExMix reported one direct binding subtype (12021 binding events with a canonical GBS site) and four indirect binding subtypes in the IMR90 dataset. Subtype 0 binds directly to GBS whereas no motif sequence is reported for the rest of the subtypes (Fig.6a). The subtype-specific read distribution profiles of subtypes 1 and 3 and subtypes 2 and 4 are very similar but there is a difference in the number of binding events (Suppl. Table S2). The tag density profiles of subtypes 1 and 3 suggest dimeric binding while that of subtypes 2 and 4 might represent composite binding. GR is known to bind a ‘combi motif’ [10]. When the patterns of tag distribution profile were examined, it was found that ChExMix found the read distributions corresponding to all modes of binding reported previously for GR including monomeric, dimeric and composite binding (Fig.6a).
For highly duplicated datasets like K562, ChExMix filters out duplicates based on a global per-base limit from a Poisson distribution by using a function of the number of reads and their mappability. It then estimates a permissible number of reads using the probability based on the Poisson distribution. ChExMix discovered 2 direct binding and 3 indirect binding events for the K562 dataset. Besides GATA motif, it reports SP1 binding motif (STAMP[32] E-value 6.2839e-14), a known cofactor of GR activity[33]. The tag distribution profile of subtypes 2 and 3 appears to be a composite TF binding profile and that of subtype 4 appears to be tethered binding. The number of binding events of subtype 4 outnumbers all other subtypes.
ChExMix reported 7 binding subtypes, with slight differences in tag distribution profiles, for the U2OS dataset, with a total of 18228 binding events. The read distribution profile of all the subtypes is similar and each subtype has the same core GBS motif with few extra bases to the left and right-hand side of the sequence. When the intermediate output file of ChExMix was examined, it was observed that all the motifs reported by the tool were a match to the known interaction partners of GR.