BAMscale: quantification of next-generation sequencing peaks and generation of scaled coverage tracks

doi:10.21203/rs.2.20856/v1


Methodology


This work is licensed under a CC BY 4.0 License

Journal Publication

published 22 Apr, 2020

Read the published version in Epigenetics & Chromatin →

You are reading this older preprint version


Background: Next-generation sequencing allows genome-wide analysis of changes in chromatin states and gene expression. Data analysis of these increasingly used methods either requires multiple analysis steps, or extensive computational time. We sought to develop a tool for rapid quantification of sequencing peaks, and an efficient method to produce coverage tracks for accurate visualization that can be intuitively interpreted by experimentalists with minimal bioinformatics background. We demonstrate its strength by integrating data from several types of sequencing approaches.

Results: We have developed BAMscale, a one-step tool that processes a wide set of sequencing datasets. To demonstrate the usefulness of BAMscale, we analyzed multiple sequencing datasets from chromatin immunoprecipitation sequencing (ChIP-seq) data, chromatin state change data (Assay for Transposase-Accessible Chromatin using sequencing: ATAC-seq, DNA double strand break mapping sequencing: END-seq), DNA replication data (Okazaki fragments sequencing: OK-seq, Nascent-strand sequencing: NS-seq, single-cell replication timing sequencing: scRepli-seq) and RNA-seq data. The outputs consist of raw and normalized peak scores (multiple normalizations) in text format and scaled bigWig coverage tracks that are directly accessible to data visualization programs. Our tool can effectively analyze large sequencing datasets (~100 GB in size) in minutes, outperforming currently available tools.

Conclusions: BAMscale is a tool that can be used to accurately quantify and normalize identified peaks directly from BAM files, as well as create coverage tracks for visualization in genome browsers. BAMscale can calculate coverage tracks for a wide set of methods such as ChIP-seq / ATAC-seq, as well as splice-aware RNA-seq, END-seq and OK-seq, for which no dedicated software is available. BAMscale is freely available on GitHub (https://github.com/ncbi/BAMscale).

Epigenetics & Genomics

Histone modifications

expression

ATAC-seq

ChIP-seq

NS-seq

replication timing

replication origins

RNA-seq

Improved technologies and decreasing sequencing costs enable in-depth analysis of chromatin and gene expression changes for genome-wide comparisons. These integrative multi-omics studies help to elucidate the functions of coding and non-coding parts of the genome, their influence on the development of complex diseases such as cancer [1–4], and their translational implications [5–7].

Currently, many studies focus on identifying protein-DNA interactions through sequencing (ChIP-seq) [8, 9]. By mapping protein-bound DNA, we can determine transcription factor binding sites or histone modification distributions across the genome. Other analyses focus on identifying open-chromatin and DNA-accessible regions [10–13], which are useful for classifying enhancer regions and transcription factor footprints [14–16]. By integrating these analyses with gene expression data such as RNA-seq [17–19], it is possible to gain a better understanding of the architecture and regulation of the genome.

Recently a new method has been introduced for genome-wide mapping of double-strand DNA breaks (END-seq) [20]. By enabling detection of DNA breaks that occur in a small fraction of a cell population, END-seq can be used to understand how breaks occur and are repaired.

Next-generation sequencing methods are increasingly used to understand DNA replication patterns across the genome. Some are based on sequencing newly synthesized, RNA-primed DNA: either Okazaki fragment sequencing (OK-seq) [21] for the lagging strand, or nascent-strand sequencing (NS-seq) [22] for the leading strand. These approaches are useful to pinpoint where DNA replication is initiated in the genome. The order of genome replication can also be measured with replication-timing sequencing, which involves identifying copy-number state differences between diploid G1-phase and replicating S-phase (or asynchronous - AS) cells [23–26].

Although sequencing methods are routinely used, data analysis needs constant improvement to reduce the number of error-prone steps. In many cases, results are difficult to reproduce accurately because they are obtained with “in-house” scripts. One such example is the quantification of ChIP-seq / ATAC-seq peaks followed by normalization. Another example is generating sequencing coverage tracks [27–29], which requires more computation time for scaling and/or multiple steps to get accurate results. Additionally, many sequencing types do not have dedicated solutions for creating coverage tracks for accurate visualization. One example is OK-seq, where replication fork directionality (RFD) is used to identify replication origins in the genome. RFD is calculated from the ratio of reads aligning to the forward and reverse strands, which usually requires a multi-step calculation. Another example is RNA-seq, where coverage tracks can be calculated using multiple tools, but in many cases the exon-intron boundaries are disregarded, yielding inaccurate representations of splicing events.

Here we introduce BAMscale (summarized in Fig. 1), a new genomic software tool for generating normalized peak coverages and scaled sequencing coverage tracks in bigWig format. BAMscale is a one-step tool that processes DNA sequencing datasets to create scaled and normalized quantifications and coverage tracks including chromatin binding (ChIP-seq), chromatin accessibility (ATAC-seq), stranded and unstranded RNA-seq, DNA replication data (OK-seq, NS-seq and replication timing) and DNA-strand breaks sequencing (END-seq). Our tool is developed in the C programming language using the samtools library [30] and libBigWig [27] achieving superior performance compared to existing tools. BAMscale can process 100 GB of aligned data (in BAM format) in under 20 minutes using a regular computer with 4 processing threads. The tool, with installation and extensive usage examples, is available at https://github.com/ncbi/BAMscale.

Peak quantification and scaling coverage track from ATAC-seq data

To further demonstrate the potential of BAMscale on DNA-capture based methods, we compared chromatin accessibility from ATAC-seq data recently published by our group [31]. While the performance of BAMscale for peak quantification is comparable to the most commonly used program, BEDTools [28], with a single processing thread (Fig.S1A), BAMscale reduces execution time by ~50% when using four threads (Fig.2A). In addition, BEDTools only calculates raw read counts, whereas BAMscale also normalizes them, outputting FPKM, TPM and library size normalized peak scores. This enables a direct comparison of peaks between different conditions. The correlation of raw read counts from the two methods is above 0.99 (Fig.2B and Fig.S1B). Out of the 32,819 quantified peaks, only one ATAC-seq peak had a lower read count from BAMscale than from BEDTools. This peak was covered predominantly by reads whose mates mapped to a different chromosome (Fig.S1C); BAMscale removes such reads by default. On average, BAMscale created sequencing coverage tracks 4.6-fold faster than deepTools bamCoverage and 1.8-fold faster than IGVtools, which does not scale for library size (Fig.2C).

We next compared the effect of the topoisomerase I (TOP1) inhibitor camptothecin (CPT) on ATAC-seq patterns in human leukemia CCRF-CEM (SLFN11-positive) cells and their isogenic SLFN11-knockout (SLFN11-KO) counterparts [31]. After CPT treatment, chromatin accessibility remained unchanged in the SLFN11-KO cells, while accessibility of pre-existing sites strongly increased in the SLFN11-positive cells (Fig.2D). Using the GIGGLE tool [32] on the Cistrome [33] website, we found that ATAC-seq peaks strongly overlapped with H3K27ac, H3K4me3 and H3K9ac sites, which are histone marks associated with active genes (Fig.2E). Colocalization analysis with Coloweb [34] showed that sites with a >3-fold increase during CPT treatment in SLFN11-positive cells had a ~20% increase in overlap with H3K4me3 and H3K9ac sites (Fig.S2, Table S2). DNA accessibility sites were strongly enriched in gene promoter regions, such as the TOP1 and CTCF gene promoters (Fig.2F).

Quantifying ChIP-seq peaks

BAMscale is designed to quantify ChIP-seq/ATAC-seq peaks from BAM and BED files, producing raw read counts, as well as TPM, FPKM and library size normalized peak scores (Fig.1A). By providing accurate peak quantification alongside scaled coverage tracks, BAMscale simplifies the comparison and visualization of genome-wide and local changes. To illustrate this point, we reanalyzed published histone ChIP-seq data from the MV4-11 cell line and its isogenic counterpart (MV4-11R), which is resistant to PKC412, a multi-target protein kinase inhibitor [35]. Using the BAMscale “cov” and “scale” functions, we accurately quantified peak strengths and created scaled coverage tracks ready for visualization. In agreement with published results, we observed a global increase of H3K27ac, a decrease in H3K27me3 and a largely unchanged H3K4me3 signal in the drug-resistant cells (Fig.1A, Fig.S3A-C). Drug-resistant cells displayed elevated protein expression of HOXB7 [35], which has increased histone H3K27ac signal, a known marker for active genes.

RNA-seq data coverage track generation

RNA-seq involves sequencing of mature RNA, from which introns have been spliced out. For this reason, genome alignment of RNA-seq is performed with splice-aware aligners such as STAR [36] or HISAT2 [37]. These tools are able to split sequencing reads between two (or more) exons with or without prior gene annotations. Currently, most tools that generate coverage tracks (in bigWig or tdf format) are capable of identifying splicing events in the alignments, but their binning process is not splice-aware. This causes strong coverage drops in bins that overlap exon-intron boundaries. For this reason, we implemented an RNA-seq compatible function for BAMscale (Fig.1B). Compared to the standard run, in RNA-seq mode BAMscale searches for sudden changes in coverage between adjacent bases in a bin (usually >=5 reads), where the resolution changes from the bin to a single base pair (illustrated in Fig.3A). A major advantage of this method is that no gene/transcript annotation is needed for accurate representation of recurrent splicing events. Output coverages can be set to be unstranded (“rna” operation) or stranded (operation set to “strandrna” or “strandrnaR”), where two separate bigWig files are created for the two strands (Fig.S4).

To test the potential of the BAMscale RNA-seq mode, we reprocessed previously published RNA-seq data from Top1mt wild-type and knock-out mice [38]. BAMscale is capable of producing more accurate, single-base resolution tracks at exon-intron boundaries compared to IGVTools or deepTools bamCoverage (Fig.3B). Additionally, the RNA-seq compatible BAMscale (using one processing thread) is 2.5-fold faster than IGVTools, and 7.2-fold and 3.9-fold faster than deepTools bamCoverage running on one or four threads, respectively (Fig.3C). After differential expression analysis, we identified several genes that are upregulated in the KO samples (Table S3), such as Insig2, which was the most statistically significant but showed only a moderate increase of around 2-fold (Fig.3D). This subtle change in expression is somewhat visible in the unscaled tracks, but the variation in the signal is very strong across replicates (Fig.3E upper tracks). This can be overcome either by extracting scaling factors for the samples from a differential expression analysis program such as DESeq2 (Fig.3E lower tracks), or by using genome-size scaling, which scales to the number of sequenced bases. These methods give more comparable results for visualization by reducing the variation due to differences in sequencing library size.

Comparison of DNA-breaks and replication timing

Next, we processed replication timing, OK-seq and END-seq data derived from activated mouse B cells [39], where the coverage tracks for the three datasets were created with BAMscale.

Replication timing sequencing calculates the order of genome replication. This usually involves the comparison of sequencing depths between G1-phase and S-phase cells. Replication timing log2 coverages of two BAM files can be calculated with BAMscale by setting the “reptime” flag as the operation. In this process, BAMscale first calculates the bin level coverage of the genome for both BAM files, followed by separate signal smoothening. By default, the bin size is set to 100bp, while the smoothening is set to 500 bins. After smoothening the coverage of the two input files, the log2 coverage is calculated and exported to a bigWig file for visualization.

In case of OK-seq, the replication fork directionality (RFD) can be calculated with BAMscale by setting the operation to “rfd”. In this case, BAMscale calculates the bin-level coverage of the genome separately for reads aligning to the forward strand and to the reverse strand, followed by RFD calculation for each bin [21]. In case of mapping DNA breaks with END-seq, stranded bin-level coverages can be calculated by setting the operation flag to “endseq” (both strands have positive values) or “endseqr” (negative strand coverage will be negative), which allows the two strand coverages to be overlaid in one figure.

Visual comparison showed high similarity with the deposited tracks (Fig.S5). As previously reported, END-seq DNA-break signals were predominantly observed in early replicating regions of the genome (Fig.4A). Genome regions with stronger END-seq signal displayed a higher replication timing average calculated from the log2 tracks compared to randomly selected regions (Fig.4B). Comparison of negative to positive strand switching in the replication initiation zones identified by OK-seq showed strong overlaps among regions with increased END-seq signal (Fig.4C).

Analyses of DNA-replication derived data

Finally, we compared replication timing data to OK-seq and NS-seq (nascent-strand sequencing) data from the human leukemia K562 cell line. Replication timing results (Fig.5A i) and the generated segments (Fig.5A ii) showed that early-replicating regions strongly correlate with active chromatin regions (Fig.5A iii) identified with ChromHMM [40, 41]. Furthermore, BAMscale showed a strong overlap of OK-seq [42] RFD strand switches (associated with synchronized replication initiation zones) with active (eu)chromatin (Fig.5A iv,v). Fewer than 0.5% of the identified OK-seq strand switches were located in heterochromatin, where no overlap with active chromatin regions was found. Similarly, we observed higher NS-seq signal (and replication origin peaks) in euchromatin (Fig.5A vi). Early-replicating regions tend to be associated with more replication initiation sites, which gradually decrease in later phases of replication timing (Fig.5B). These results correlate strongly with the NS-seq results showing that early-replicating regions have higher peak densities compared to later-replicating regions [43] (Fig.5C). We also tested BAMscale on 80 single-cell replication timing sequencing (scRepli-seq) samples [44]. We were able to accurately reproduce the single-cell log2 replication timing profiles from G1-phase and mid-S-phase cells (Fig.S6), requiring on average 11 seconds of processing time for each sample pair using 4 processing threads. Furthermore, we compared the performance of BAMscale and deepTools bamCompare on the replication timing data derived from the human leukemia K562 cell line using eight processing threads. The dataset consists of >103 Gb of sequencing data in BAM format, which we re-analyzed six times. The mean run time of BAMscale was 23.2 minutes, a 5.3-fold decrease in analysis time compared to the 123.1 minutes required by deepTools.

Widespread usage of DNA and RNA capture-based methods helps us understand and categorize changes in chromatin state and their regulatory effects on DNA replication and gene expression. Visualization of genome-wide data is a crucial step that helps researchers find complex genomic patterns and relationships. Because of the increased usage of next-generation sequencing in both research and clinical settings, it is important to analyze data reproducibly by removing as many error-prone analysis steps as possible. Fewer steps also support accurate visualization of large-scale data, which is useful for addressing complex problems.

Since next-generation sequencing usually operates at a genome-wide level, the signal distributions from these techniques are generally visualized with genome viewers [29, 45]. Although tools for sequencing track generation are available [27–29], they require multiple steps and/or more computation time to produce results ready for visualization. Additionally, in case of ChIP-seq and ATAC-seq, peak strengths are quantified and normalized by performing multiple analysis steps. This is typically carried out with “in-house” scripts, i.e., time-consuming case-by-case programming.

Although there are multiple tools to analyze genome level coverage of sequencing data, such as IGVTools [29], deepTools [27], and MACS2 [46] coverage mode, these are mainly used to directly calculate coverages. Many sequencing techniques require specific analysis methods for accurate representation. A simple example is RNA-seq, where the binning process has to be splice-aware for accurate representation of exon-intron boundaries. Another example is OK-seq, which can be used to identify replication origins based on the calculation of replication fork directionality (RFD) from reads aligning to the forward and reverse strand of the genome.

We developed BAMscale to analyze data in a quick, simple and reproducible manner. The program is written in C, resulting in very fast execution times compared to previous methods. To facilitate data analysis, we implemented multiple pre-defined settings for a wide set of sequencing types, accompanied by extensive tutorials (https://github.com/ncbi/BAMscale wiki page), which many tools lack.

We demonstrated the utility of our tool by analyzing new and previously published datasets, such as ChIP-seq / ATAC-seq, RNA-seq, replication data (OK-seq, replication timing and single-cell Repli-seq) and mapping of DNA breaks with END-seq. Using BAMscale as a peak quantification method and a scaled coverage-track generation tool, users can identify single focal changes on the genome as well as understand how certain conditions alter global chromatin profiles. We have also implemented an RNA-seq compatible version enabling accurate visualization of exon-intron boundaries from both stranded and unstranded data. Together these analyses enable the stratification and identification of genomic regions of interest displaying alterations in one or multiple sequencing data types.

BAMscale is a tool that can be used to accurately quantify and normalize identified peaks directly from BAM files, as well as create coverage tracks for visualization in genome browsers. Due to its multithreaded implementation, our tool outperforms currently used methods. We implemented sequencing-specific coverage track calculation modes including: 1) replication timing, 2) replication fork directionality analysis from OK-seq data, 3) strand-specific coverage of DNA breaks from END-seq and 4) splice-aware RNA-seq coverage, many of which lack any dedicated software. BAMscale is freely available on GitHub (https://github.com/ncbi/BAMscale).

BAMscale algorithm

Peak quantification

Peaks can be quantified with BAMscale’s cov function, which takes as input a BED file with peak coordinates, and one or multiple BAM files, outputting raw read counts, FPKM, TPM and library size normalized peak scores. Paired-end reads can be quantified in two main ways: 1) using each read as a single entity, or 2) counting read pairs as one fragment. Additionally, it is possible to count reads that follow either the strand direction of each peak in the BED file, or simply calculate forward or reverse reads only.

During peak quantification, BAMscale by default first reads the entire BAM file(s) and counts the aligned reads that pass the selected alignment filters to obtain the effective library size. This approach gives more accurate alignment statistics than the BAM index file, which records only the total number of aligned reads, including duplicate and low-quality reads. After calculating the effective library size, BAMscale counts the reads overlapping each peak coordinate in the BED file, followed by FPKM [47], TPM [48] and library size (scaled to the smallest library) normalization.
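The three normalizations can be illustrated with a minimal Python sketch. This is not the BAMscale implementation (which is written in C); it uses the standard FPKM and TPM definitions, and it assumes that "library size normalization" means multiplying each count by the ratio of the smallest library to the sample's own library. The function and argument names are hypothetical.

```python
def normalize_peaks(counts, lengths, library_size, smallest_library):
    """Normalize raw peak read counts for one sample.

    counts           -- raw read (or fragment) count per peak
    lengths          -- peak length in base pairs, same order as counts
    library_size     -- effective library size of this sample
    smallest_library -- effective library size of the smallest sample in the set
    """
    # FPKM: count per kilobase of peak per million reads in the library.
    fpkm = [c * 1e9 / (l * library_size) for c, l in zip(counts, lengths)]

    # TPM: length-normalized rates rescaled so that they sum to one million.
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    tpm = [r / total * 1e6 if total > 0 else 0.0 for r in rates]

    # Library size scaling (assumed form): shrink counts to the smallest library.
    lib_scaled = [c * smallest_library / library_size for c in counts]
    return fpkm, tpm, lib_scaled
```

For example, two peaks with the same read density receive the same FPKM and TPM, while the library-size-scaled counts keep the raw proportions.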

Creating coverage tracks from sequencing data

To generate normalized coverage tracks, the BAMscale “scale” function first imports the coverage of every bin of the genome (the bin size is adjustable), followed by either genome size scaling (based on the length of the genome) or read count scaling. During genome size scaling, the scaling factor is calculated by dividing the total number of aligned bases by the genome size, which is obtained from the header of the BAM file. When the number of sequenced bases exceeds the genome size, scaling reduces the per-bin coverage; when it is smaller than the genome size, scaling increases the coverage. The advantage of this approach is that each sample can be scaled separately. Alternatively, it is possible to scale multiple samples based on the library size. In these cases, the number of aligned reads is calculated for each sample, and each sample is scaled to the smallest in the set. A drawback of this approach is that all samples have to be processed in parallel, which increases memory requirements (~500 MB for each sample when the bin size is set to 5 bp). Additionally, it is possible to supply a scaling factor for each sample, which will be used to adjust the coverages.
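The two scaling strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BAMscale code: the text does not say how the genome size factor is applied, so the sketch divides each bin by it, which matches the described behavior (coverage is reduced when the aligned bases exceed the genome size). All names are hypothetical.

```python
def scale_by_genome_size(bin_coverage, aligned_bases, genome_size):
    """Scale one sample on its own, using aligned bases / genome size."""
    factor = aligned_bases / genome_size  # average sequencing depth
    # Assumption: the per-bin coverage is divided by the factor.
    return [c / factor for c in bin_coverage]


def scale_to_smallest_library(coverages, library_sizes):
    """Scale several samples together so each matches the smallest library."""
    smallest = min(library_sizes)
    return [
        [c * smallest / size for c in coverage]
        for coverage, size in zip(coverages, library_sizes)
    ]
```

The first function needs only one sample at a time, whereas the second needs the library sizes of all samples before any track can be written, which is the source of the higher memory requirement noted above.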

We implemented an RNA-seq compatible option for creating coverage tracks that uses a different binning strategy. In RNA-seq mode, in cases where two adjacent bases in one bin have a coverage difference above 4 reads, the resolution is automatically changed to single-base resolution. This enables the accurate representation of exon-intron boundaries.
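A minimal Python sketch of this binning rule follows. It is an illustration only: it assumes that a regular bin is reported as the mean of its per-base coverage, which the text does not specify, and the function name and output layout (half-open start, end, value intervals) are hypothetical.

```python
def splice_aware_bins(per_base_coverage, bin_size, max_jump=4):
    """Bin per-base coverage, falling back to single bases at sharp changes.

    Returns a list of (start, end, value) intervals with half-open coordinates.
    """
    intervals = []
    for start in range(0, len(per_base_coverage), bin_size):
        chunk = per_base_coverage[start:start + bin_size]
        # A jump of more than max_jump reads between adjacent bases in the
        # bin is treated as a likely exon-intron boundary.
        sharp = any(abs(a - b) > max_jump for a, b in zip(chunk, chunk[1:]))
        if sharp:
            # Emit every base separately so the boundary is not blurred.
            for offset, value in enumerate(chunk):
                intervals.append((start + offset, start + offset + 1, float(value)))
        else:
            # Assumption: a regular bin is summarized by its mean coverage.
            intervals.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    return intervals
```

With a plain average, a bin that is half exon (coverage 10) and half intron (coverage 0) would be drawn at 5 across its whole width; switching to single bases keeps the step at the boundary.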

Additionally, we implemented a signal smoothing option for coverage tracks. When smoothening a signal, the number of adjacent bins can be specified, which will be used to calculate the final signal of each bin.
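One possible reading of this option is a centered moving average, sketched below in Python. The text does not state which statistic BAMscale uses or how chromosome ends are handled, so the mean, the symmetric window and the truncation at the edges are all assumptions, and the names are hypothetical.

```python
def smooth_bins(values, flank):
    """Replace each bin by the mean of itself and `flank` bins on each side."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - flank)                # truncate the window at the start
        hi = min(len(values), i + flank + 1)  # and at the end of the track
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```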

In cases where two files are specified, different operations can be performed, such as calculating the log2 ratio of bins, or subtracting the values of bins.
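The two-file operations can be sketched per bin as follows. This is an illustration, not the BAMscale code: the pseudocount that guards against empty bins is an assumption, and the names are hypothetical.

```python
import math


def log2_ratio(bins_a, bins_b, pseudocount=1.0):
    """Per-bin log2(a / b); the pseudocount (an assumption) avoids log of zero."""
    return [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(bins_a, bins_b)
    ]


def subtract(bins_a, bins_b):
    """Per-bin difference a - b."""
    return [a - b for a, b in zip(bins_a, bins_b)]
```

Both inputs are expected to be scaled coverages computed with the same bin size, so that bins at the same index cover the same genomic interval.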

For ease of use, we implemented predetermined settings to analyze replication timing data and END-seq data. In case of the “reptime” operation, the log2 coverage ratio of two BAM files is calculated for 100 bp bins, with signal smoothening set to 500 bins. In case of END-seq, we can set the operation to “endseq”, which creates stranded coverage tracks, or “endseqr”, which negates the coverage track of the negative strand for ease of visualization.

OK-seq data and replication fork directionality

In case of OK-seq data, the Watson (forward) and Crick (reverse) strand coverages are first calculated consecutively. After importing the strand-specific coverages, the replication fork directionality (RFD) is calculated as:

See Formula 1 in the Supplemental Files.

where $C_i$ denotes the Crick read counts and $W_i$ denotes the Watson read counts for the i-th bin (based on [21]).
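Formula 1 itself is not reproduced in this text. As an illustration only, the Python sketch below uses the definition of RFD commonly used for OK-seq data, (C - W) / (C + W) per bin; that this matches Formula 1 exactly is an assumption, as is the choice to report 0 for bins without reads. The function name is hypothetical.

```python
def replication_fork_directionality(crick, watson):
    """Per-bin RFD from Crick and Watson read counts.

    Assumed definition: RFD = (C - W) / (C + W), ranging from -1 to +1.
    """
    rfd = []
    for c, w in zip(crick, watson):
        total = c + w
        # Assumption: bins with no reads on either strand are reported as 0.
        rfd.append((c - w) / total if total > 0 else 0.0)
    return rfd
```

Under this definition, a switch of the RFD sign between neighboring bins corresponds to the strand switches discussed in the Results.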

Sequenced data

NS-seq and replication timing sequencing of K562 cell lines

To produce high-coverage sequencing of replication origins for the K562 human bone marrow derived cell line, we performed nascent-strand sequencing and replication timing sequencing. The data are available at GEO (GSE131417).

K562 NS-seq sample preparation

Replication origins were mapped using the nascent-strand sequencing and abundance assay [22]. Briefly, DNA fractionation was performed using a 5-30% sucrose gradient to collect DNA fractions ranging from 0.5 to 2 kb. The 5′ ends of single-stranded DNA were phosphorylated by T4 polynucleotide kinase (T4 PK) (NEB, M0201S). After phenol/chloroform treatment to remove T4 PK, DNA was precipitated, resuspended and then treated with lambda-exonuclease (NEB, M0262S) to remove genomic DNA fragments that lacked the phosphorylated RNA primer. After RNase treatment and DNA purification (Qiagen PCR purification kit, 28004), single-stranded nascent strands were random-primed using the Klenow and DNA Prime Labeling System (Invitrogen, 18187013). Double-stranded nascent DNA (1 μg) was sequenced using the Genome Analyzer II (Illumina).

K562 replication timing sample preparation

To obtain pure G1-phase cells, 1 × 10⁸ K562 cells were washed twice with cold PBS and fractionated by elutriation at each of the following flow speeds: 15 × 2, 16 × 2, 17 × 2, 18 × 2, 19 × 2, 20 × 2 ml/min. Each fraction was stained with DAPI in PBS and confirmed by FACS. Genomic DNA from G1-phase and asynchronous K562 cells was extracted simultaneously according to the manufacturer’s instructions (Qiagen, 69506).

Data analysis

We demonstrate the capabilities of BAMscale on a wide set of sequencing datasets, such as ATAC-seq [31], ChIP-seq [35], replication timing data and END-seq [39], OK-seq [39, 42], single-cell Repli-seq and BrdU-IP [44] and stranded RNA-seq [38]. The complete list of analyzed samples, genome version and tissues of origins used in this study are shown in Table S1.

Alignment of sequencing data

The GEO/ENA and SRA identifiers of the next-generation sequencing data, along with the sample type, aligner (and version) and genome build for the 27 processed sequencing experiments, can be found in Table S1. ATAC-seq, ChIP-seq, OK-seq, END-seq and NS-seq data were aligned with the bwa mem aligner [49] (version 0.7.17), while the K562 replication timing data were aligned with the DRAGEN pipeline [50]. RNA-seq data were aligned using the STAR aligner [36] in two-pass mode. Alignment settings were based on “best recall at base and read level” as shown in supplementary table 37 of [51] to obtain the best alignments. Aligned (unsorted) reads were sorted using samtools [30] (version 1.8), followed by duplicate marking using picard-tools (version 2.9.2).

Peak calling and coverage track generation for ATAC-seq, ChIP-seq and NS-seq

Peaks were identified using the MACS [46] peak caller (version 2.1.1.20160309) with the --nomodel setting and the FDR set to 0.01 to filter low-quality peaks. ATAC-seq peaks were called using the narrow setting, while histone peaks were called using the broad peak setting of MACS. In case of NS-seq data, both broad and narrow peaks were called, and the top 10% of narrow peaks were intersected with the broad peaks to retrieve the highest-scoring regions. Called peaks for each cell line and condition were sorted and merged using BEDTools [28] (version 2.27.1). Peak quantification was performed with the “cov” function of BAMscale separately for each sequencing type.
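The NS-seq peak selection step can be sketched in Python as follows. This is an illustration, not the scripts used in the study: it assumes that peaks are ranked by their score column and that "intersected" means keeping the top narrow peaks that overlap at least one broad peak. Peaks are (chrom, start, end, score) tuples with half-open BED coordinates, and all names are hypothetical.

```python
def top_fraction(peaks, fraction=0.10):
    """Return the highest-scoring fraction of peaks (at least one peak)."""
    keep = max(1, int(len(peaks) * fraction))
    return sorted(peaks, key=lambda p: p[3], reverse=True)[:keep]


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def narrow_in_broad(narrow_peaks, broad_peaks):
    """Keep narrow peaks that overlap at least one broad peak."""
    return [n for n in narrow_peaks if any(overlaps(n, b) for b in broad_peaks)]
```

The quadratic overlap check keeps the sketch short; for genome-scale peak sets a sorted sweep or an interval tree, as used by tools such as BEDTools, is the practical choice.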

Gene expression quantification and differential expression analysis from RNA-seq data

Raw read counts for each gene were calculated using the TPMcalculator program [52]. Differential expression analysis between wild-type and KO samples was performed using DESeq2 [53], where, as suggested in the manual, genes with fewer than ten reads on average were removed from the analysis. Scaling factors for each sample were obtained using the DESeq2 sizeFactors() function. Scaled coverages were created with the BAMscale “scale” function with the operation set to “strandrnaR”, the bin size set to 15 bases, scaling set to “-k custom” and the scaling factor set to the reciprocal of the factor estimated by DESeq2 for each sample.
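The conversion from DESeq2 size factors to custom scaling factors is a simple reciprocal, sketched below in Python. The sample names and values are hypothetical and serve only to show the direction of the correction; they are not taken from this study.

```python
def reciprocal_scaling_factors(size_factors):
    """Map each sample to 1 / size factor, for use as a custom scaling factor."""
    return {sample: 1.0 / factor for sample, factor in size_factors.items()}


# Hypothetical size factors: a deeply sequenced sample has a factor above 1
# and is scaled down, while a shallow one has a factor below 1 and is scaled up.
example = reciprocal_scaling_factors({"WT_rep1": 2.0, "KO_rep1": 0.5})
```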

OK-seq data

BigWig signal tracks of aligned OK-seq reads were created with the BAMscale “scale” function, with the operation set to “rfd” (replication fork directionality); all other parameters were set to default.

END-seq data

The strand-specific coverage tracks for END-seq data were created with the BAMscale “scale” function, with the operation set to “endseqr”. Two coverage tracks are created: one for forward strand reads, and a separate track for negative strand reads, where the score is negated.

Replication timing data

Replication timing log2 ratio coverage tracks were created with the BAMscale “scale” function, with the operation set to “reptime”. The first specified BAM file was the G1-phase specific sequencing data, and the second BAM file was the asynchronous cell-cycle sequencing data. Replication-timing segments were identified with the “Replication_timing_segmenter.R” script, which was developed in R and deposited on GitHub along with the BAMscale code.

Single-cell (sc) Repli-seq

Replication timing log2 ratio coverage tracks for single-cell replication timing data [44] were created with the BAMscale “scale” function. The operation parameter was set to log2, and to reproduce the original analysis results, we set the bin size to 50 kb and signal smoothening to 4 (resulting in 400 kb smoothening). For the standard CBA/MsM samples, the “CBMS1_ESC_single_G1_01” (GSM2904978) sample was used as the G1-phase reference, and sample “CBMS1_Day7Diff_ESC_single_G1_01” (GSM2905031) was used as the G1-phase reference for the 7-day differentiated CBA/MsM samples.

BrdU-IP replication timing data

The log2 coverage of early and late S-phase BrdU-IP sequencing was calculated using the BAMscale “scale” function. As for the scRepli-seq data, the operation parameter was set to log2, and to reproduce the original analysis results, we set the bin size to 50 kb and signal smoothening to 4 (resulting in 400 kb smoothening).

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and materials

BAMscale is freely available on github (https://github.com/ncbi/BAMscale). All re-analyzed samples (including the GEO and SRA identifiers) are summarized in Table S1. The K562 cell line NS-seq and replication timing data are available at GEO (GSE131417).

Conflict of Interest

The authors declare that they have no competing interests

Funding

This research was supported by the Intramural Research Programs at the National Cancer Institute [Z01-BC 006150 and 1ZIABC010411] and the National Library of Medicine at the NIH.

Authors' contributions

LSP, JMG, RVA developed the core program. JM, SJ, HZ and SY contributed with the ATAC-seq and ChIP-seq data processing method development. CR, HF, BT and AB contributed to the development of NS-seq, replication timing and OK-seq data analysis methods. CR, HF, SY performed the K562 NS-seq and replication timing sample preparation. LSP, JMG, RVA, LM, DL, MIA and YP contributed to writing and editing of the manuscript. All authors contributed to the data interpretation and read and approved the final manuscript.

Acknowledgements

We thank the NCI Sequencing Facility headed by Bao Tran and Jyotti Shetty for expert technical assistance. The study utilized the high-performance computer capabilities of the Biowulf HPC cluster at the NIH.

ATAC-seq: Assay for Transposase-Accessible Chromatin using sequencing

BAM: Binary Alignment Map format

BED: Browser Extensible Data

ChIP-seq: Chromatin immunoprecipitation followed by sequencing

CPT: Camptothecin

END-seq: DNA double strand break mapping sequencing

FPKM: Fragments Per Kilobase of transcript per Million mapped reads

NS-seq: Nascent strand sequencing

OK-seq: Okazaki fragments sequencing

RFD: Replication fork directionality

TPM: Transcripts Per Million



Editorial decision: Major revision (03 Mar, 2020)
Review #1 received at journal (25 Jan, 2020)
Reviewer #1 agreed at journal (23 Jan, 2020)
Reviewers invited by journal (21 Jan, 2020)
Editor assigned by journal (16 Jan, 2020)
Editor invited by journal (15 Jan, 2020)
Submission checks completed at journal (12 Jan, 2020)
First submitted to journal (10 Jan, 2020)
