DNA methylation at CpG dinucleotides is one of the most commonly studied parameters within the epigenome because it is directly assessable and is often reflective of the overall structure of chromatin, which, in turn, contributes to regulation of gene expression at the transcriptional level . While there is a myriad of techniques for analysis of DNA methylation, a number of those used in the past (e.g. reduced-representation bisulfite sequencing , methylated DNA immunoprecipitation sequencing ) have employed enrichment of regions with higher frequencies of CpG dinucleotides to limit the portion of the genome to be sequenced as a means to limit the cost and computational resources required to process and analyze the resulting data. However, these techniques provide only a partial view of the epigenome, typically focused primarily on the impact of DNA methylation on chromatin structure in promoters and exons where CpG dinucleotides are often most abundant . This limits the potential of these techniques to profile DNA methylation in other regions of the genome which also contribute to regulation of gene expression, such as enhancers or regions associated with the boundaries of topologically associated domains . Whole-genome approaches, such as whole-genome bisulfite sequencing (WGBS), yield informative results for the entire genome, and, as such, have become the gold standard for global analysis of DNA methylation with single-CpG resolution . Thus, as sequencing costs have decreased , an increasing number of investigators are opting to utilize this more comprehensive, genome-wide assessment of DNA methylation which yields large, robust datasets [8–10]. However, this more comprehensive assessment of the epigenome mandates a corresponding increase in the extent of computational analysis needed to interpret the resulting larger datasets.
Recently, a novel snakemake  workflow termed wg-blimp was described as an “end-to-end” pipeline for processing WGBS data by integrating established algorithms for alignment, quality control (QC), methylation calling, detection of differentially methylated regions (DMRs), and methylation segmentation for profiling of DNA methylation states at regulatory elements . The wg-blimp pipeline is simple to install on either a personal computer or in a research high computing cluster, often requiring only an input reference, gene annotation, and FASTQ read files to fully process WGBS data. However, due to the nature and large file sizes of WGBS sequencing data, implementing the wg-blimp pipeline in its current form often requires extended computing time emanating from the conversion of unmethylated cytosines to uracils in the original DNA strand following bisulfite treatment. During PCR amplification these uracils are replaced with thymine, ultimately resulting in the conversion of C-G base pairs into T-A base pairs. Because most cytosines in the genome exist in non-CpG contexts and are thus normally unmethylated, the bisulfite treatment causes a substantial increase in the proportion of T-A base pairs and a concomitant decrease in the proportion of G-C base pairs in the amplified copies of the initially treated DNA strands. This renders mapping of bisulfite-converted reads using a conventional read mapper inadequate, because a large percentage of the converted bases will be called as mismatches relative to the untreated reference sequence. To overcome this limitation, improved ‘3-letter’ aligners such as bwa-meth  and gemBS , designed specifically for mapping bisulfite-converted reads, perform a two-stage mapping process. Cytosines on read 1 are fully converted to thymines while guanines on read 2 are fully converted to adenines. The reads are then aligned to either of two reference genomes where either all of the cytosines have been converted to thymines or all guanines have been converted to adenines. After mapping to the converted reference genomes, the read sequences are then restored to the original sequence, revealing methylated Cs which can be identified in further downstream processing. Due to this extensive processing step required for conversion and alignment of all reads to multiple indexed genomes, followed by conversion back to the starting read sequence, the alignment step imposes a very time-consuming computational step.
While both bwa-meth and gemBS follow the same “3-letter” alignment mapping concept, there are significant differences in their implementation which translate to large differences in their overall speed due to differences in the underlying alignment software packages from which these specialized methylation aligners were generated. The bwa-meth methylated DNA aligner has a foundation built on the improved BWA-MEM alignment software which follows the seed-and-extend paradigm to find initial seed alignment with super-maximal exact matches (SMEMs) using an improvement of the Burrows-Wheeler transform algorithm [13, 15]. BWA-MEM additionally re-seeds SMEMs greater than the default of 28 bp to find the longest exact match in the middle of the seed that occurs at least once in the bisulfite-converted reference genome, to reduce potential miss-mapping due to missing seed alignments. BWA-MEM also filters out unneeded seeds by grouping closely located seeds which it terms “chains,” thereby filtering out, by default, notably shorter chains contained within longer chains (which are at least 50% and 38 bp shorter than the longer chain) . The seeds remaining in these longer chains are then ranked by the length of the chain to which the seed belongs, and then by the length of the seed itself. Seeds that are already contained in a previously identified alignment are dropped while seeds that potentially lead to a new alignment are extended with a banded affine-gap-penalty dynamic program . While these strategies have increased the potential size of the read that can be aligned using the BWA-MEM software up to 100 bp, the aligner that the gemBS software is built on, GEM3, allows for mapping lengths of up to 1 kb which can scale more quickly to large sequencing analyses while maintaining equal if not superior read mapping accuracy when compared to BWA-MEM . This superiority largely comes from gemBS performing the conversion of read steps before and after mapping “on the fly”  for each read pair, thereby avoiding the generation of intermediate files and greatly increasing the efficiency of the mapping process. In addition, GEM3 filters and sorts mapped seeds into groups referred to as “strata” which facilitate complete searches of indexed references to find all possible matches to the reference genome, improving both speed and accuracy over BWA-MEM and other heuristic mapping algorithms . Searching through such a large index file does expose one limitation of gemBS, which is that it requires 48 GB of RAM compared to only 8–16 GB required by bwa-meth. However, this limitation is normally insignificant given that most midrange or higher computers are equipped with more than sufficient RAM to meet this need . We sought to leverage these differences to improve the speed of read alignment in the wg-blimp pipeline. We were able to modify the wg-blimp pipeline by replacing the bwa-meth alignment software with gemBS. This single modification allowed us to increase the overall speed of the wg-blimp pipeline by > 7x and open up the pipeline to the alignment of longer reads, all without sacrificing alignment accuracy.