This study provides a comprehensive evaluation of the performance of the GPU-accelerated BaseNumber pipeline for germline variant calling. BaseNumber achieved an average F1 value of 99.59% when applied to the gold standards from the GSCG and GIAB projects. Notably, this level of accuracy is comparable to that of the top-performing variant callers in the PrecisionFDA Consistency Challenge (https://precision.fda.gov/challenges/consistency/results). Furthermore, we compared the results obtained from BaseNumber and GATK on GSRD WGS samples with an average coverage of 48X, and the consistency between the two was remarkable, with an average F1 value of 99.68% when using the GATK results as the reference. Importantly, the analysis time for a 48X WGS sample using the GPU server averaged 23 minutes, representing a speed improvement of more than 200 times compared to that of the BWA + GATK pipeline. These findings demonstrate that the BaseNumber pipeline offers an appealing alternative to the commonly used BWA + GATK pipeline.
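To illustrate how consistency between two call sets can be scored when one of them serves as the reference, the following minimal sketch computes precision, recall, and F1 from two lists of variants. It assumes that variants are compared as exact (chromosome, position, reference allele, alternate allele) tuples; the function name and the toy call sets are hypothetical and are not taken from the evaluation. Benchmarking tools used in practice typically perform haplotype-aware matching rather than exact tuple comparison.

```python
def concordance_f1(reference_calls, query_calls):
    """Return (precision, recall, F1) of query_calls scored against reference_calls.

    Both arguments are iterables of hashable variant keys, here assumed to be
    (chrom, pos, ref, alt) tuples that must match exactly.
    """
    reference = set(reference_calls)
    query = set(query_calls)

    true_pos = len(reference & query)   # called by both pipelines
    false_pos = len(query - reference)  # called only by the query pipeline
    false_neg = len(reference - query)  # called only by the reference pipeline

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy call sets, for illustration only.
gatk_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"),
    ("chr2", 75, "T", "C"),
]
basenumber_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"),
    ("chr3", 10, "A", "T"),
]

precision, recall, f1 = concordance_f1(gatk_calls, basenumber_calls)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a high value requires that the query pipeline both recovers most reference calls and adds few calls that the reference lacks.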
With the rise of million-sample WGS projects, the need for ultrafast and highly accurate variant calling has become crucial for in-depth genome analysis. One notable example is the Genome Aggregation Database (gnomAD) (v3.1), which encompasses 76,156 WGS samples sourced globally, providing invaluable population genetic information for various research and clinical applications 31. According to the results of this study, assuming an average depth of 30X for gnomAD samples, the BaseNumber pipeline can complete variant calling from raw sequencing data within 90 days when deployed on ten GPU servers with the necessary supporting infrastructure. Accelerated variant calling pipelines offer substantial savings in server hardware, physical space, support infrastructure, and workforce. Additionally, the reduced analysis time lowers energy consumption and computing costs. Based on the costs of GPU servers and per-sample power consumption, our estimates indicate that processing 100,000 30X WGS samples over a span of four years could result in an analysis cost of less than $0.40 per sample. Furthermore, the implementation of GPU-based software is agile and straightforward, ensuring operational efficiency and allowing timely adjustments to the pipeline. As a result, the genome reference and the algorithms for read mapping, variant calling, and quality control can be adjusted with relatively low sunk costs as large cohort studies progress. An additional advantage of GPU acceleration, similar to CPU acceleration, is that these servers can be employed for other high-performance parallel computing tasks, such as training deep learning models, image recognition, and natural language processing, thereby maximizing the utilization of server resources.
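The 90-day figure can be sanity-checked with a back-of-envelope calculation. The sketch below takes the 23-minute average for a 48X sample reported in this study and assumes, purely for illustration, that runtime scales linearly with sequencing depth, that each server processes one sample at a time, and that there is no I/O or scheduling overhead. These assumptions and the function name are ours for the purpose of the example and may not reflect how the published estimate was derived.

```python
def estimate_wall_clock_days(
    n_samples,
    target_depth,
    n_servers,
    minutes_per_sample_at_ref=23.0,  # average reported for a 48X WGS sample
    ref_depth=48.0,
):
    """Estimate wall-clock days to process a cohort on identical GPU servers.

    Assumes runtime is proportional to depth (an illustrative assumption)
    and that servers run independently with no idle time.
    """
    minutes_per_sample = minutes_per_sample_at_ref * target_depth / ref_depth
    total_minutes = n_samples * minutes_per_sample
    minutes_per_day = 60 * 24
    return total_minutes / n_servers / minutes_per_day


# gnomAD v3.1 cohort size, assumed 30X average depth, ten GPU servers.
days = estimate_wall_clock_days(n_samples=76_156, target_depth=30.0, n_servers=10)
print(f"Estimated wall-clock time: {days:.1f} days")
```

Under these idealized assumptions the estimate comes to roughly 76 days, which leaves some margin within the 90-day figure for data transfer, queuing, and reruns.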
The germline variant calling algorithm used in BaseNumber resembles that used in GATK HaplotypeCaller, as both methods involve reassembling aligned reads within the active regions to construct plausible haplotypes for accurate detection of SNVs and indels. However, the alignment and variant calling processes of the conventional CPU-based pipeline are relatively coarse-grained. Merely transplanting HaplotypeCaller to the GPU proved ineffective, resulting in a modest 2.3-fold increase in speed 16,32. To address this limitation, the alignment and variant calling processes require fine-grained parallelization and algorithmic adjustments to enable concurrent operation across thousands of GPU cores. In addition, given the exceptional bandwidth of graphics memory, a dedicated management system was necessary to regulate computational flow while optimizing I/O operations to alleviate system bottlenecks. With these improvements, BaseNumber achieves high-throughput parallel acceleration and analysis efficiency. Similar performance claims have been made for another GPU-based acceleration tool, NVIDIA Clara Parabricks, in which the GPU caller purportedly completes the germline variant analysis pipeline for 30X human WGS in 28 minutes using eight NVIDIA A100 GPUs in a p4d.24xlarge instance on AWS 33.
This study focused on germline variant calling using WGS data, although BaseNumber is also suitable for analyzing ultrahigh-coverage sequencing data. Leveraging its robust computational capabilities, BaseNumber demonstrated significantly shorter running times than the GATK pipeline, eliminating the need for the downsampling that GATK employs to handle regions with excessive coverage. Consequently, BaseNumber consistently delivered reproducible results, ensuring identical outputs for the same dataset. Furthermore, by harnessing the high processing speed of GPUs, BaseNumber allows for the application of other computationally intensive methods, such as graph alignment 34 and genotype imputation 35, to increase the accuracy of read mapping and variant calling. To further enhance performance, BaseNumber could benefit from integrating advanced storage systems, such as parallel file systems and flash solid-state drive (SSD) network-attached storage (NAS), to optimize I/O capabilities, as this evaluation revealed I/O to be a relative bottleneck in the workflow.
The results of this study should be interpreted in light of several limitations. First, we were unable to directly compare GATK and BaseNumber on the same GPU server due to the extensive size of the samples utilized for the evaluation and the slower processing speed of the GATK pipeline. Notably, the GPU_BaseNumber server had superior I/O components, including increased RAM capacity and a larger SSD, compared to the CPU_Generic servers. Although there may be slight performance improvements for GATK on the GPU_BaseNumber server, these improvements are unlikely to alter the principal findings. Second, due to the nature of short-read sequencing 25,36,37, BaseNumber, similar to other variant callers that rely on NGS data, faces challenges associated with regions with low mappability. These regions encompass segmental duplications 38, tandem repeats 39, variable, diversity, and joining (VDJ) recombination regions 40, and other genomic regions enriched with repetitive sequences. The accuracy of variant identification and interpretation in these regions depends on advancements in long-read sequencing techniques.