Background: Strand cross-correlation profiles are used both for pre-analysis before peak calling and for quality control in chromatin immunoprecipitation followed by sequencing (ChIP-seq) analysis. Despite their potential for robust and accurate assessment of the signal-to-noise ratio (S/N) of ChIP-seq samples, it remains unclear which aspects of quality such strand cross-correlation profiles actually measure.
Results: We introduced a simple model to simulate the mapped read density of ChIP-seq and then derived the theoretical maximum and minimum of the cross-correlation coefficients between strands. The results suggest that the maximum coefficient of a typical ChIP-seq sample is directly proportional to the total number of mapped reads and to the square of the ratio of signal reads, and inversely proportional to the number of peaks and to the length of read-enriched regions. We also developed PyMaSC to generate strand cross-correlation profiles efficiently. Simulation analysis supported our results, and evaluation using 790 ChIP-seq datasets obtained from a public database demonstrated high consistency between the calculated cross-correlation coefficients and the coefficients estimated from the theoretical relations and peak calling results. In addition, we found that mappability-bias correction improved sensitivity, enabling maximum coefficients to be distinguished from the noise level.
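To make the quantity discussed above concrete, the following is a minimal sketch of a naive strand cross-correlation profile: the Pearson correlation between forward-strand and reverse-strand read-start counts as a function of the shift between them. It is an illustration only and is not PyMaSC's implementation; the function name `strand_cross_correlation`, the toy genome size, peak count, fragment length, and noise rate are all assumptions chosen for the example, and no mappability correction is applied.

```python
import numpy as np


def strand_cross_correlation(forward, reverse, max_shift):
    """Naive strand cross-correlation profile (hypothetical helper).

    forward, reverse: per-base counts of read 5' ends on each strand.
    Returns an array whose element d is the Pearson correlation between
    forward[x] and reverse[x + d], for d = 0..max_shift.
    """
    forward = np.asarray(forward, dtype=float)
    reverse = np.asarray(reverse, dtype=float)
    if forward.shape != reverse.shape:
        raise ValueError("forward and reverse must have the same length")
    n = forward.size
    if not 0 <= max_shift < n - 1:
        raise ValueError("max_shift must be in [0, n - 2]")

    profile = np.empty(max_shift + 1)
    for d in range(max_shift + 1):
        # Pair each forward position x with reverse position x + d.
        profile[d] = np.corrcoef(forward[: n - d], reverse[d:])[0, 1]
    return profile


# Toy data (all values are assumptions for illustration).
rng = np.random.default_rng(0)
genome_length = 10_000
fragment_length = 150
n_peaks = 50
reads_per_peak = 20

# Background noise: sparse random read starts on both strands.
forward = rng.poisson(0.1, genome_length).astype(float)
reverse = rng.poisson(0.1, genome_length).astype(float)

# Signal: each peak adds forward reads at a start position and reverse
# reads one fragment length downstream.
starts = rng.choice(genome_length - fragment_length, size=n_peaks, replace=False)
forward[starts] += reads_per_peak
reverse[starts + fragment_length] += reads_per_peak

profile = strand_cross_correlation(forward, reverse, max_shift=300)
estimated_shift = int(np.argmax(profile))
print("shift with maximum coefficient:", estimated_shift)
```

In this toy setting the profile peaks at the simulated fragment length, and the height of that peak relative to the rest of the profile is the kind of quantity the theoretical relations above describe. The per-shift loop is written for clarity rather than speed.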
Conclusions: We present the first theoretical insights into strand cross-correlation, and the results reveal both the potential and the limitations of strand cross-correlation analysis. Our work will help establish better QC metrics based on strand cross-correlation.