A. The Genome Analysis ToolKit (GATK)
GATK (Genome Analysis Toolkit) is a collection of tools developed by the Broad Institute to identify SNPs and indels in genomic data [16]. SNPs are the most common sort of genetic variation among human beings [17]. SNPs and indels play an important role in the study of human cancer genes [18,19,20]. The GATK best practices pipeline is one of the most widely used for variant calling, due to its high accuracy [21]. This pipeline is composed of three essential parts. The first one is data preprocessing, the second is variant discovery, and the third is variant evaluation. GATK is an opensource software available on GitHub [22].
B. HaplotypeCaller algorithm
The second part of the GATK pipeline, variant discovery, is implemented by the HaplotypeCaller algorithm. This algorithm gets as inputs a processed BAM file (a set of aligned DNA reads) and a reference sequence file, and outputs a VCF file (variant calling file). The HaplotypeCaller \(1\varvec{\epsilon }\)algorithm can take up to 35% of the complete genome analysis pipeline time as reported in [23]. The HaplotypeCaller algorithm has four main steps:
Active region determination based on the presence of a sufficient number of variations that exceed a predetermined threshold [16]. Active regions, i.e., with sufficient variations, are the target of the next three steps.
• DeBruijn like graph assembly is used to assemble the reads of the region and determine a list of haplotypes [24]. A haplotype is a sequence that represents a certain combination of alleles (alternative gene forms) present in a region. Then the SmithWaterman algorithm is used to align each haplotype to the reference haplotype to identify potential genetic variations.
• PairHMM forward algorithm is used to perform a pairwise alignment of each read against each haplotype [25]. The algorithm calculates the likelihood of each haplotype given each read by summing the likelihoods of all possible alignments of the read to the haplotype. This likelihood reflects the probability that the read and the haplotype are related, based on their sequence similarity.
• Genotype sampling estimates the genotypes (combination of alleles) at each variant site in the region using Bayes’ theorem in combination with the previously obtained likelihoods.
C. Algorithm analysis
The GATK HaplotypeCaller software includes a feature that tracks the time required to complete the SmithWaterman and the PairHMM steps of the workflow. Additionally, we used profiling tools [55,56] and present the time repartition of the program in Fig. 1 (running on an Intel Core I75820K x86_64, 6 cores (12 threads), 3.30GHz, 64GB). The figure represents the average relative execution time for chromosomes 1–3 of sample NA12878 (commonly used sample from a female individual of European descent). As observed, the PairHMM component of the algorithm accounts for a significant portion of 49% of the total time, prompting our decision to accelerate this section.
Moreover, we check the memory boundedness of the HaplotypeCaller algorithm by using hardware performance counters [52]. As reported in [52], the memory boundedness of the program, which is a component of the nonexecution of the CPU (cycle_activity.cycles_no_execute), can be computed by the following formula:
$$memory bound= \text{m}\text{a}\text{x}(RESOURCE\_STALLS:SB , CYCLE\_ACTIVITY:STALLS\_L1D\_PENDING) \left(1\right)$$
We present in Table 1 the results of this analysis for a single chromosome (chromosome 1 randomly chosen) and for a WGS running. As can be observed, the memory boundedness of the HaplotypeCaller algorithm is low, which means that memory access is unlikely to become a bottleneck of a HaplotypeCaller. The intuitive conclusion is that PiM would not be the most efficient solution for HaplotypeCaller. Nevertheless, there are three main advantages to PiM architecture even this nonmemorybound program can benefit from:
 PiM architectures have a large number of execution units that provide large amounts of parallelism.
 PiM compute throughput scales with memory capacity.
 PiM systems are not limited in terms of memory capacity (or not as much) as other accelerators (e.g., GPUs or FPGAs with HBM memory).
Table 1
Memory boundedness of HaplotypeCaller algorithm

Chromosome 1

WGS

resource_stalls.sb

146,149,728,421

7,680,287,475,816

cycle_activity.stalls_l1d_pending

791,177,991,761

65,333,865,516,882

cycle_activity.cycles_no_execute

7,590,369,180,122

510,501,141,773,479

Memory boundedness

10%

12.8%

D. PairHMM forward algorithm
In this work we focus on the PairHMM step which is the most resourceintensive part of the HaplotypeCaller algorithm. In this part we compute the likelihood of each read against each haplotype.
The likelihood calculation is based on a pairwise alignment using a Hidden Markov Model (represented by a state machine) and the qualities of the read DNA bases (qualities produced by a sequencer). Given a DNA read R of length n and a haplotype H of length m we define the transitions of the state machine as the probabilities of the possible changes in state that occur during the process of aligning the read to the haplotype. In other words, the state machine represents the different states that the alignment can go through, such as a match (state M), an insertion (state I) or a deletion (state D) in the read. The transitions between the states are defined by the following probabilities:
$${\delta } ={10}^{\frac{{\text{Q}}_{\text{d}\text{e}\text{l}}}{10}}, {\iota } ={10}^{\frac{{\text{Q}}_{\text{i}\text{n}\text{s}}}{10}}, {\epsilon } ={10}^{\frac{{\text{Q}}_{\text{g}\text{c}\text{p}}}{10}}$$
$$prio{r}_{i,j}= \left\{\begin{array}{c}1{10}^{\frac{{Q}_{b}}{10}} if rea{d}_{i}=haplotyp{e}_{j}\\ \frac{{10}^{\frac{{Q}_{b}}{10}}}{3} if rea{d}_{i}\ne haplotyp{e}_{j}\end{array}\right.$$
2
Currently, \({\text{Q}}_{\text{d}\text{e}\text{l}}\), \({\text{Q}}_{\text{i}\text{n}\text{s}}\) and\({\text{Q}}_{\text{g}\text{c}\text{p}}\) (respectively the deletion, insertion, and gap continuation qualities) are included in the PHMM model but are not produced by sequencers and thus default values determined by GATK [22] are used for each base. However, \({\text{Q}}_{\text{b}}\) is referred to as the quality of each base in the read and is provided by the sequencer machine.
The finite state machine of Fig. 2 models the transitions between the three different states (M, the match state, I, the insertion state and D, the deletion state).
Using this state machine with dynamic programming, the algorithm fills up three matrices, M, I and D as shown in Equations (3;1) and (3;2). Each element \(i,j\) of the matrix represents the total likelihood of all paths from the beginning of the read to position \(i\) such that the path ends with the corresponding state (match state for matrix M, deletion state for matrix D, and insertion state for matrix I).
Initialization:\(\forall 0\le i<n, 0\le j<m,\)
$$\left\{\begin{array}{c}{M}_{i,0}=0 , {I}_{i,0}=0, {D}_{i,0}=\frac{1}{m}\\ {M}_{0,j}=0 , {I}_{0,j}=0, {D}_{0,j}=0\\ {M}_{\text{0,0}}=0 , {I}_{\text{0,0}}=0, {D}_{\text{0,0}}=0\end{array}\right. (3;1)$$
Recursion step:\(\forall 1\le i\le n, 1\le j\le m,\)
\(\left\{\begin{array}{c}{M}_{i,j}=prio{r}_{i,j}*[\left(1\left({\iota }+{\delta }\right)\right){ M}_{i1,j1}+\\ +\left(1{\epsilon }\right) {(\text{I}}_{\text{i}1,\text{j}1} + {\text{D}}_{\text{i}1,\text{j}1}\left)\right]\\ {I}_{i,j}= {{\epsilon }\text{*}\text{I}}_{\text{i}1,\text{j}} + {{\iota }\text{*}\text{M}}_{\text{i}1,\text{j}}\\ {D}_{i,j}={{\epsilon }\text{*}\text{D}}_{\text{i},\text{j}1} + {{\delta }\text{*}\text{M}}_{\text{i},\text{j}1}\end{array}\right.(3;2\) )
Termination step:
$$L=\sum _{j=1}^{m}{M}_{n,j}+{I}_{n,j} (3;3)$$
The terminations step computes the final likelihood for the current readhaplotype.