ChIP-Seq data processing, normalization and visualization

This article explains the step-by-step processing of NGS data of ChIPSeq samples for H3K4me3, H3K9me3 and H3K36me3 immunoprecipitations performed in oocytes/embryos. This includes sections on visualization of data, and a comment on normalization. the associated raw data have been deposited at GSE129735 on Gene Expression Omnibus.


Introduction
This procedure has been applied to process ChIPSeq data from the associated publication.Some aspect of this procedure can be used as inspiration for processing low input ChIPSeq datasets Equipment 1. Windows PC with atleast 12 GB RAM powerful enough to run EaSeq sessions listed in this article.3. ChIP-seq reads were aligned as paired ends to the mm10 genome assembly using bowtie2 2 and the settings "--local -X 2000", sorted and exported as bam-files using samtools 3 , and imported into EaSeq 4 (http://easeq.net)for subsequent analyses and visualization.4. In EaSeq, default settings for ChIP-seq data are used including filtering of reads with identical coordinates.Where nothing else is mentioned, values were normalized to fragments per million reads per kbp (FPKM).

General ChIP-seq data analysis and visualization
1. Unless stated, then all analysis and visualization was performed using EaSeq v.1.05-1.111(http://easeq.net) and its integrated tools using the default settings.
2. All these analyses, visualizations and used data are available as a session file at http://easeq.net/Sessions/Sankar.eas.Please note that the session requires 12 GB of available RAM.
3. Plots were exported for layout in Adobe Illustrator CS6 as pdfs using the beta tool 'Export snapshot as pdf'.
4. Gene bodies were extracted from the imported RefSeq 'Geneset' using the 'Extract' tool and 'Feature' set to 'Gene', 'Start position' set to 'Start', and 'End position' set to 'End'. 5. To obtain non-redundant genes without multiple identical termini, gene symbols were exported and duplicates were removed.The list of symbols was imported as a 'Regionset' using the 'Geneset' to lookup coordinates.For gene symbols with multiple annotated transcripts, this resulted in the most extreme coordinates being used for subsequent analysis.Possible gene symbols not matched by coordinates were removed using the gate tool.
6. Broad H3K4me3 domains were acquired as described above and imported into EaSeq as 'Regionsets'.The set of broad H3K4me3 regions were subselected for those at or above 20 kbp of size for further analyses.7.For broad domains and gene bodies, the H3K9me3 levels in replicate 1 were quantified using the 'Quantify' tool within windows corresponding to the entire length of the regions, and the order of the regions was changed according to these values using the 'Sort' tool.8. Heatmaps were generated using the 'HeatMap' plot type and ratiometric heatmaps showing pseudocolored differences between conditions were generated using the ratiomap plot type.'Y-axis Maximum' was adjusted as shown in the figures and multiple plots were overlaid using the 'Overlay' tool.
12. Genome browser tracks were generated using the 'FillTrack' tool with 'Y-axis Maximum' adjusted as shown in the figures and for some tracks loci coinciding with broad H3K4me3 domains were coloured using the 'Highlight' option in the plot settings and the broad H3K4me3 domains as the feature to highlight.
13. Tracks showing log2 fold differences were adjusted in the plot settings by setting 'Ratio' to the 'Dataset' in the denominator and for the Y-axis checking 'Log scale', setting the 'Base' to 2, and and for the previous study we therefore used a normalization strategy based on H3K4me3 at the top-5000 ranked transcription start sites 6 .
5. In this study, the widespread gain of H3K9me3 signal at the H3K4me3 enriched regions where it is normally depleted occurs in more than one fifth of the mouse genome 6 .This would not only elevate the signal in the normally depleted regions, but also lead to a considerably reduced signal in the areas of canonical enrichment (Extended Figure 4c).
6. Therefore, we would be highly cautious about interpreting the loss of gene body H3K9me3 in absolute terms, and prefer conservative interpretations.Accordingly, we find it likely that the apparent loss of H3K9me3 is due to the susceptibility of ChIP-seq and FPKM normalization to pronounced increases in the fraction of the genome, which is covered by H3K9me3.
7. As assessments of the extent of this in a quantitatively manner would require a spike-in setup, which have not yet been developed for the very small-scale ChIP-seq used here, we chose the commonly used FPKM-normalization for plots and analyses, and to make conservative statements that are unaffected by the choice of normalization.8.This conservative approach could lead to underestimations of the extent of H3K9me3 gain in broad domains, and to illustrate the consequences, we have done pairwise comparisons of FPKM normalized plots and similar plots normalized in a manner parallel to what was previously used 6 (Extended Figure 4 d,e).

Anticipated Results
Trimmed, mapped data uploading to genome browser for visualization or direct use for further analysis in EaSeq

9.
For plots of gene bodies and broad H3K4me3 domains, settings were adjusted to a relative window size of 200% of the size of the visualized feature by checking 'rel.' and changing 'Window size' to 200 in the plot settings.10.For all heatmaps 'Maximum density' was adjusted as indicated by the FPKM values in the figures.11.Plots showing the average values at a set of regions were generated using the 'Average' tool, where gene bodies and broad H3K4me3 domains adjusted to a relative window size of 200% of the size of the visualized feature by checking 'rel.' and changing 'Window size' to 200 in the plot settings.