Benchmarking germline variant calling performance of a GPU-accelerated tool on whole-genome sequencing datasets

doi:10.21203/rs.3.rs-4318731/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background

Rapid advances in next-generation sequencing (NGS) have enabled ultralarge population and cohort studies to identify DNA variants that may impact gene function. Efficient bioinformatics tools for tasks such as read alignment and variant calling are essential for processing massive amounts of sequencing data. To increase the analysis speed, multiple software and hardware acceleration strategies have been developed. This study comprehensively evaluated germline variant calling via the GPU-based acceleration tool BaseNumber using whole-genome sequencing (WGS) datasets from various sources. These included standard WGS data from the Genome in a Bottle (GIAB) and the Golden Standard of China Genome (GSCG) projects, resequenced GSCG samples, and 100 in-house samples from the Genome Sequencing of Rare Diseases (GSRD) project. The variant calling outputs were compared to the reference and to the results generated by the Burrows-Wheeler Aligner (BWA) and Genome Analysis Toolkit (GATK) pipeline.

Results

BaseNumber demonstrated high precision (99.32%) and recall (99.86%) rates in variant calls compared to the standard reference. The output comparison between the BaseNumber and GATK pipelines yielded nearly identical results, with a mean F1 score of 99.69%. Additionally, BaseNumber took 23 minutes on average to analyze a 48X WGS sample, which was 215.33 times faster than the GATK workflow.

Conclusions

The GPU-based BaseNumber provides a highly accurate and ultrafast variant calling capability, significantly improving WGS analysis efficiency and facilitating time-sensitive tests, such as clinical WGS genetic diagnosis. This study also sheds light on the GPU-based acceleration of other omics data analyses.

bioinformatics tool

WGS

variant calling

acceleration tool

GPU-based acceleration

GATK pipeline

Genomic DNA variation constitutes a fundamental genetic reservoir that modulates gene expression and function, thereby influencing phenotypic traits, contributing to disease susceptibility, and offering markers associated with diverse population characteristics. Next-generation sequencing (NGS) is widely employed for high-throughput identification of DNA variants. Compared to enrichment-based NGS approaches, whole-genome sequencing (WGS) encompasses nearly 98% of the human genome, enabling comprehensive and unbiased variant detection in individuals ¹. Over the past few decades, the development of NGS has exceeded the pace predicted by Moore's Law, resulting in a rapid decrease in sequencing costs, which has significantly facilitated the extensive adoption of WGS in clinical and research genetic testing ^2,3. Noteworthy ultralarge WGS initiatives, such as Genomics England ⁴, All of Us ⁵, and the China Metabolic Analytics Project ⁶, have been launched to acquire a profound understanding of the underlying molecular mechanisms of human diseases and traits, where precise identification and interpretation of DNA variants serve as the cornerstone. However, the substantial volume of sequencing data available poses a significant bottleneck in variant calling efficiency, with persistent technical complexities.

Conventional variant calling pipelines often rely on CPU servers and open-source software, such as the Genome Analysis Toolkit (GATK) ⁷ and VarScan ⁸. The GATK HaplotypeCaller employs local de novo assembly of haplotypes in a region exhibiting potential variation to call germline variants. The pipeline based on GATK Best Practice reportedly completed variant calling on a 30X WGS sample in approximately 24 hours ⁹. Significant improvements are necessary to meet the demands of large-scale WGS studies. One widely adopted solution is high-performance cluster (HPC)-based speedup, which reduces the total computation time by utilizing multiple CPU servers to simultaneously analyze a dataset. However, the utilization of large clusters entails increased costs for initial equipment procurement, maintenance, and energy consumption. Therefore, maximizing computing resource utilization on a single server is essential for achieving satisfactory efficiency. Various algorithms and tools, such as Sentieon DNAseq, which is a reimplementation of the GATK Best Practice workflow ¹⁰, have been developed to accelerate variant callers ^11–13. Through optimization and recompilation of the variant calling algorithms, DNAseq achieves a tenfold increase in processing speed while providing nearly identical results to the GATK pipeline ^10,14. Nevertheless, CPU computing-based solutions have two primary drawbacks: limited parallelism in a CPU environment and the growing disparity between CPU computing power and sequencing data throughput. To address these challenges, heterogeneous computing-based approaches have emerged that utilize different computing unit architectures, such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs); these architectures offer enhanced parallelism compared to CPUs and exhibit greater power under specific conditions. 
For instance, Illumina Dragen integrates an FPGA card to accelerate the variant caller and incorporates its own bioinformatics algorithms to leverage the onboard integrated circuits (ICs). For read mapping, Dragen employs a hash-based algorithm instead of the Burrows-Wheeler transform (BWT) algorithm used by the Burrows-Wheeler Aligner (BWA). Additionally, its variant calling algorithm relies on a hidden Markov model, capitalizing on the model's inherently parallel nature to significantly improve processing speed ¹⁵. However, dedicated FPGA servers are designed and optimized for particular tasks, which limits their versatility and makes them difficult to adapt to other analytical purposes. Conversely, GPU server-based variant calling solutions are receiving increasing attention because of their robust parallel capabilities and versatile applicability ^16–18. NVIDIA Clara Parabricks, a GPU-accelerated computational genomics application framework, can reduce the running time of 30X WGS germline analysis on a server with eight A100 GPUs by 60-fold compared to that of the GATK pipeline ¹⁹. Moreover, with numerous processing units and high memory bandwidth, GPUs can also significantly enhance the training and inference speed of deep learning-based variant callers ^20,21.

The present study aimed to conduct a comprehensive evaluation of BaseNumber, a GPU-based variant caller. Through comparisons with the GATK-based pipeline, using gold standard samples and an in-house WGS dataset, we assessed the efficiency, accuracy, reproducibility, scalability, and energy consumption of BaseNumber in germline variant calling on human genome data. This evaluation provides an overall assessment of BaseNumber as a high-throughput variant identification tool.

WGS Data Preparation

The WGS data used in this study were obtained from four different sources: 1) seven standards (HG001 to HG007) sourced from the Genome in a Bottle project (GIAB, v3.3.2), which is hosted by the National Institute of Standards and Technology ²²; 2) a Chinese quartet (D5, D6, F7, and M8) from The Quartet Project for Quality Control and Data Integration of Multi-omics Profiling, also known as the Golden Standard of China Genome (GSCG, v1.0) ²³; 3) 24 additional WGS samples resequenced from the GSCG cell lines at different DNA sequencing facilities using distinct sequencing platforms (referred to as the Retested_GSCG samples); and 4) 100 in-house samples collected from the Genome Sequencing of Rare Diseases (GSRD) project. For the GIAB and GSCG standards, FASTQ files, high-confidence VCF files, and high-confidence BED files were acquired from the project FTP servers. The data URLs are summarized in the Availability of data and materials section. With respect to the in-house data, 100 samples were randomly selected from the GSRD project, ensuring that the experimental conditions and outputs represented real-world scenarios. As shown in Table 1 and Table S1, the four GSCG standards had coverage ranging from 144.2X to 146.5X, and the seven GIAB standards had even higher coverage, with an average of 247.1X. The GSRD and Retested_GSCG samples had an average coverage of 48.2X, ranging from 31.0X to 118.0X.

Table 1

Summary of WGS samples for evaluation
Category	Sample size	Coverage	Evaluation content
GSRD	100	48.89X	Efficiency, consistency, reproducibility, energy consumption
GIAB	7	247.14X	Accuracy, scalability
GSCG	4	145.05X	Accuracy, scalability
Retested_GSCG	24	45.56X	Accuracy

Variant Calling Pipelines and Testing Environment

The GATK pipeline followed the widely adopted best practice protocol, incorporating Fastp (v0.20.1) for FASTQ preprocessing ²⁴, BWA (v0.7.17) for read alignment ²⁵, SAMtools (v1.9) for BAM sorting ²⁶, Picard (v2.23.5) for duplicate removal ²⁷, and GATK (v4.1.7.0) for base quality score recalibration (BQSR) and variant calling ²⁸. The BaseNumber (SaileGene, Inc., Beijing) pipeline used the Saile-Aligner (SLA, v1.0.3) for FASTQ and BAM processing and the Saile-Caller (SLC, v1.0.3) for BQSR and variant calling. Docker (v19.03) was used to package and run the pipelines on each server. The Genome Reference Consortium Human Build 38 (GRCh38) served as the reference genome, and the relevant reference files were obtained from the GATK Resource Bundle (ftp.broadinstitute.org/bundle/).
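As a rough illustration of the CPU pipeline order described above, the following minimal sketch assembles one command per step. The file names, thread count, read-group string, and `known_sites` resource are hypothetical placeholders, and the exact options used in the study are not stated in the text, so this should be read as an assumption-laden outline rather than the authors' configuration.

```python
# Hypothetical sketch: file names, thread count, and options are illustrative
# assumptions, not the settings used in the study.
def gatk_pipeline_commands(sample, ref="GRCh38.fa",
                           known_sites="known_sites.vcf.gz", threads=16):
    """Return the CPU pipeline as an ordered list of (step, argv) pairs."""
    raw1, raw2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"
    clean1, clean2 = f"{sample}_R1.clean.fastq.gz", f"{sample}_R2.clean.fastq.gz"
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"  # literal "\t" as bwa expects
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    return [
        ("fastq_preprocessing",
         ["fastp", "-i", raw1, "-I", raw2, "-o", clean1, "-O", clean2]),
        # bwa mem writes SAM to standard output; redirect it to `sam` when running.
        ("read_alignment",
         ["bwa", "mem", "-t", str(threads), "-R", read_group, ref, clean1, clean2]),
        ("bam_sorting",
         ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, sam]),
        # Picard 2.x legacy KEY=VALUE syntax; this call flags duplicates.
        ("duplicate_marking",
         ["java", "-jar", "picard.jar", "MarkDuplicates",
          f"I={sorted_bam}", f"O={dedup_bam}", f"M={sample}.dup_metrics.txt"]),
        ("bqsr_table",
         ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
          "--known-sites", known_sites, "-O", recal_table]),
        ("bqsr_apply",
         ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
          "--bqsr-recal-file", recal_table, "-O", recal_bam]),
        ("variant_calling",
         ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
          "-O", f"{sample}.vcf.gz"]),
    ]


for step, argv in gatk_pipeline_commands("SAMPLE01"):
    print(f"{step}: {' '.join(argv)}")
```

Keeping each step as an argument list makes it straightforward to time the steps individually, which is how the per-step wall clock comparison in this study is organized.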

Two GPU servers and ten high-performance CPU servers were employed for this study. The detailed configurations of these servers are shown in Table 2. Specifically, the GPU_Generic server was equipped with eight NVIDIA Tesla V100 cards and featured additional CPU cores, making it suitable for various GPU-related computations, such as deep learning model training.

The GPU_BaseNumber server was optimized for the BaseNumber pipeline, with a larger RAM capacity and an enhanced cooling system. The high-performance CPU (CPU_Generic) servers were designed for generic CPU-related analytic tasks; these servers feature a cost-effective, well-balanced hardware configuration and lack GPU acceleration. UNI-T UT230A-II power sockets were used to monitor both instantaneous and total power consumption. Additionally, all the testing data were stored on a network-attached storage system connected to the GPU and CPU servers via an InfiniBand network.

Table 2

Hardware configuration of the testing servers
Server name	CPU	Memory	GPU	Hard disk
GPU_BaseNumber	2X Intel 6226R processor (16 cores, 32 threads)	DDR4 1T	4X NVIDIA GeForce RTX 2080 Ti 11GB	2X 3.2TB NVMe; 2X 8T 7200 RPM HDDs
CPU_Generic	2X Intel 6248 processor (20 cores, 40 threads)	DDR4 125G	N/A	250T NAS
GPU_Generic	2X Intel 6240R processor (48 cores, 96 threads)	DDR4 256G	8X NVIDIA Tesla V100 32GB	2X 3.2TB NVMe 3T; 2X 8T 7200 RPM HDDs

Assessment of WGS Variant Calling

The evaluation encompassed seven assessments, as depicted in Fig. 1. To evaluate accuracy and scalability, gold standard data from GIAB and GSCG were utilized, while locally generated GSRD data were employed to evaluate efficiency, consistency, reproducibility, and energy consumption. The accuracy of BaseNumber was evaluated by comparing its variant calling output with the reference VCF files using hap.py from Haplotype Comparison Tools, which allowed us to calculate precision, recall, and F1 scores ²⁹. The evaluation was performed in high-confidence genomic regions (as retrieved from the GIAB and GSCG high-confidence BED files) to ensure result reliability. Seqtk was used to randomly select reads from the original FASTQ files to create new FASTQ files with a targeted depth, enabling the evaluation of the correlation between accuracy and sequencing depth ³⁰. The downsampled WGS data of the GSCG and GIAB standards were also used to examine the correlation between analysis time and sequencing depth. An efficiency comparison between BaseNumber and GATK was conducted by analyzing WGS data from 100 GSRD samples on the GPU_BaseNumber and CPU_Generic servers. The wall clock time for each sample was recorded to determine the acceleration ratio. To assess reproducibility, the BAM and VCF outputs of repeated BaseNumber runs were compared. The consistency between BaseNumber and GATK was evaluated using hap.py. Hardware impact was examined by comparing the analysis times of raw and downsampled HG001 on the GPU_BaseNumber and GPU_Generic servers. Additionally, the configuration of the GPU_Generic server was varied to investigate the influence of the number of GPU cards on the analysis time.
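For readers less familiar with the three accuracy metrics, the sketch below shows the standard definitions computed from true positive (TP), false positive (FP), and false negative (FN) counts. The function names are illustrative, and the sketch simplifies what hap.py reports (hap.py distinguishes truth-side and query-side true positives), so it is only an approximation of how the published values were derived.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp)  # fraction of called variants that are correct
    recall = tp / (tp + fn)     # fraction of reference variants that were called
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy counts (hypothetical, not from the study).
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1_score(p, r):.3f}")

# Mean precision and recall reported in this study for the 31 standards.
print(f"F1 from reported means: {f1_score(0.9932, 0.9986):.4f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, so a caller cannot obtain a high F1 by trading recall for precision or the reverse.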

Accurate Variant Identification by BaseNumber on the GSCG and GIAB Standards

The accuracy of germline variant detection using BaseNumber was evaluated on 24 Retested_GSCG samples and seven GIAB standards (Fig. 2A and Table S2). Because their raw coverage was excessive, the GIAB standards were downsampled to a 30X depth. The BaseNumber variant calling results were compared to the reference VCF files of the GSCG and GIAB standards. Among the 31 tested samples, the mean precision for all variants was 99.32% (± 0.21% standard deviation), the mean recall was 99.86% (± 0.08%), and the mean F1 score was 99.59% (± 0.10%). In terms of specific variant types, single nucleotide variant (SNV) calling achieved a slightly higher mean F1 score of 99.63% (± 0.09%) than indel calling, which had a mean F1 score of 99.05% (± 0.09%). Additionally, we compared the Retested_GSCG samples generated by DNBSEQ-T7 to those generated by NovaSeq 6000; the F1 values were 99.65% (± 0.03%) and 99.56% (± 0.13%), respectively, indicating comparable performance. The downsampled GIAB samples also exhibited satisfactory precision (99.26% ± 0.13%) and recall (99.80% ± 0.09%).

The correlation between the WGS depth and the accuracy of variant calling by BaseNumber was assessed using GSCG standards and GIAB HG001. Reads were randomly extracted from the FASTQ files, creating a series of simulated WGS samples with gradually downsampled depths ranging from 10X to 300X (Table S3). The mean F1 score for the 10X simulated GSCG samples was 97.10%, while for the 10X HG001 sample, it was 94.71%. Notably, as the depth increased, the performance exhibited significant improvement (Fig. 2B). Once the coverage exceeded 30X, the analysis accuracy stabilized, even in scenarios with exceptionally high coverage of 300X.
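The downsampling step can be sketched as follows: the fraction of reads to keep is the target depth divided by the original depth, and that fraction is passed to `seqtk sample`. The helper names, file name, and seed value are hypothetical, and the study does not state which seed or exact invocation was used. Using the same seed for both mate files keeps read pairs together.

```python
def downsample_fraction(original_depth, target_depth):
    """Fraction of reads to keep so that coverage drops to target_depth."""
    if not 0 < target_depth <= original_depth:
        raise ValueError("target depth must be positive and not exceed the original depth")
    return target_depth / original_depth


def seqtk_sample_command(fastq, fraction, seed=100):
    """argv for `seqtk sample`; output goes to stdout and should be redirected."""
    return ["seqtk", "sample", f"-s{seed}", fastq, f"{fraction:.4f}"]


# Example: bring a standard at the mean GSCG coverage (145.05X, Table 1) down to 30X.
fraction = downsample_fraction(145.05, 30)
for mate in ("sample_R1.fastq.gz", "sample_R2.fastq.gz"):  # hypothetical file names
    print(" ".join(seqtk_sample_command(mate, fraction)))
```

Random subsampling by fraction yields the target depth only on average, so the realized coverage of each simulated sample should still be checked after alignment.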

Assessment of variant calling efficiency

The efficiency of variant calling was evaluated by recording the wall clock time for each step of the process for the 100 GSRD samples. This assessment compared the BaseNumber pipeline on the GPU_BaseNumber server to the GATK pipeline on the CPU_Generic servers. Figure 3A shows that the mean total analysis time per GSRD sample with BaseNumber was 23.35 ± 4.75 minutes, with a range of 19.92 to 50.4 minutes. Specifically, the read alignment and PCR duplicate removal steps performed by SLA took 10.64 ± 2.68 minutes, the generation of recalibrated BAM files by SLC required 11.59 ± 2.12 minutes, and the variant calling step that produced gVCF and VCF files, which was also performed by SLC, took only 1.11 ± 0.28 minutes. In contrast, the GATK pipeline required a mean time of 5018.35 ± 1330.97 minutes (ranging from 3598 to 10179 minutes) to generate final variant calling results from the raw sequencing reads of the same sample set. BaseNumber thus improved the computational efficiency of germline variant calling by a factor of 215.33 ± 37.45. Furthermore, the analysis time was linearly correlated with the coverage of the WGS data, which ranged from 31X to 118X, with R² values of 0.863 for BaseNumber and 0.719 for GATK.
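The two summary statistics in this paragraph, the acceleration ratio and the R² of time against coverage, can be computed as in the minimal sketch below. The timing and depth values in the example are invented for illustration, and the function names are not from the study. Note that the mean of per-sample ratios generally differs slightly from the ratio of the mean times, which is why 5018.35 / 23.35 ≈ 214.9 does not exactly equal the reported 215.33.

```python
from statistics import mean, stdev


def speedup_summary(baseline_minutes, accelerated_minutes):
    """Mean and sample standard deviation of per-sample acceleration ratios."""
    ratios = [b / a for b, a in zip(baseline_minutes, accelerated_minutes)]
    return mean(ratios), stdev(ratios)


def r_squared(x, y):
    """Coefficient of determination of a simple least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# Invented example values (minutes and coverage), not data from the study.
gatk_minutes = [4000.0, 6000.0, 5000.0]
gpu_minutes = [20.0, 25.0, 24.0]
depth = [35.0, 60.0, 50.0]

avg, sd = speedup_summary(gatk_minutes, gpu_minutes)
print(f"acceleration: {avg:.1f} +/- {sd:.1f}")
print(f"R^2 of GPU time vs depth: {r_squared(depth, gpu_minutes):.3f}")

# Ratio of the reported mean times, for comparison with the mean of ratios.
print(f"ratio of means: {5018.35 / 23.35:.2f}")
```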

To further investigate the influence of data volume on analysis time, we conducted experiments using the GSCG and GIAB standards. The FASTQ files of the four GSCG standards were randomly downsampled across a range of depths from 10X to 120X, while the GIAB HG001 FASTQ was sampled from 10X to 300X. Figure 3B shows that the depth of coverage linearly correlated with the analysis time, even at extremely high coverage levels (300X). Interestingly, the R² for SLA exceeded that for SLC, suggesting a stronger linear correlation between the amount of input data and the SLA analysis time.

Reproducibility and Comparison of BaseNumber Outputs

To assess the reproducibility of the BaseNumber outputs, we conducted repeated analyses of the GSRD samples. The MD5 values of the output files corresponding to the same samples remained identical across all three analysis rounds (Table S4). This consistency demonstrates the complete reproducibility and robustness of the outcomes achieved with BaseNumber. Furthermore, we compared the variant calling results obtained from the BaseNumber and GATK pipelines for the GSRD samples. Precision, recall, and F1 scores were calculated using the corresponding GATK results as the reference for each sample (Fig. 3C). The overall F1 score for all the variants was 99.69 ± 0.19% on average, ranging from 98.62% to 99.80%. Specifically, the mean F1 scores were 99.71 ± 0.19% for SNVs and 99.55 ± 0.19% for indels, indicating highly similar variant calling outcomes between BaseNumber and GATK on the same sequencing data. Additionally, we compared the F1 scores across different sequencing facilities and platforms, and no significant differences were observed.
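A byte-level reproducibility check of this kind can be sketched with Python's standard `hashlib` module. The function names and the example file paths are hypothetical; the study reports only that MD5 values matched across the three rounds, not the tooling used to compute them.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks so large BAM/VCF files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(paths):
    """True if every file in `paths` has the same MD5 digest."""
    return len({md5sum(path) for path in paths}) == 1


# Usage with hypothetical paths for one sample across three rounds:
# outputs_identical(["round1/S1.vcf.gz", "round2/S1.vcf.gz", "round3/S1.vcf.gz"])
```

Identical digests are a strict criterion: any embedded timestamp or run-specific header line would change the hash even if the variant records were the same.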

Impact of the GPU Configuration on the BaseNumber Performance

The performance of the BaseNumber algorithms is directly influenced by the configuration of the GPU cards. We evaluated the runtime of the BaseNumber variant caller using varying numbers of GPU cards. The analysis was conducted on the GPU_Generic server, which is equipped with eight NVIDIA Tesla V100 cards, using downsampled HG001 data ranging from 10X to 180X. As expected, the total analysis time increased with increasing data volume or a decreasing number of GPU cards (Fig. 4, Table S5). We further examined the individual durations of the SLA and SLC processes. The analysis time for SLA, which is responsible for alignment, consistently decreased as the number of GPU cards increased. However, the analysis time for SLC, which includes the BQSR process and variant calling, decreased only marginally from four to eight GPU cards and even increased for HG001_10X. This observation could be attributed to I/O bottlenecks encountered by SLC when writing the recalibrated BAM files.

Energy consumption

We conducted a comparative analysis of the energy consumption of the BaseNumber and GATK pipelines. The electricity usage during the variant calling of 100 GSRD samples was recorded, and three repetitions were performed. Across the three repetitions, the analysis of these samples using BaseNumber on the GPU_BaseNumber server consumed 224 kilowatt-hours in total. Consequently, the estimated energy consumption for each GSRD sample analyzed with BaseNumber was 0.746 kilowatt-hours. In contrast, the GATK pipeline on the CPU_Generic servers had an average energy consumption of 33.5 kilowatt-hours per GSRD sample. Remarkably, the "GPU server + BaseNumber" approach used 44.9 times less energy than the "CPU server + GATK" approach.
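The per-sample figure and the ratio follow from simple arithmetic on the reported totals, as the short sketch below shows. It assumes that the 224 kilowatt-hour total covers all three repetitions of the 100 samples, which is the reading that reproduces the reported per-sample value; the variable names are illustrative.

```python
# Reported values from the energy measurements.
total_kwh_basenumber = 224.0   # assumed to cover all repetitions
samples = 100
repetitions = 3
gatk_kwh_per_sample = 33.5

# Energy per sample-run with BaseNumber (about 0.75 kWh, reported as 0.746).
basenumber_kwh_per_sample = total_kwh_basenumber / (samples * repetitions)

# How many times more energy the CPU + GATK approach used per sample.
energy_ratio = gatk_kwh_per_sample / basenumber_kwh_per_sample

print(f"BaseNumber: {basenumber_kwh_per_sample:.3f} kWh per sample")
print(f"GATK / BaseNumber energy ratio: {energy_ratio:.1f}")
```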

This study provides a comprehensive evaluation of the performance of the GPU-accelerated BaseNumber pipeline for germline variant calling. BaseNumber achieved an average F1 value of 99.59% when applied to the gold standards from the GSCG and GIAB projects. Notably, this level of accuracy is comparable to that of the top-performing variant callers in the PrecisionFDA Consistency Challenge (https://precision.fda.gov/challenges/consistency/results). Furthermore, we compared the results obtained from BaseNumber and GATK on GSRD WGS samples with an average coverage of 48X, and the consistency between the two was remarkable, with an average F1 value of 99.69% when using the GATK results as the reference. Importantly, the analysis time for a 48X WGS sample on the GPU server averaged 23 minutes, more than 200 times faster than the BWA + GATK pipeline. These findings demonstrate that the BaseNumber pipeline offers an appealing alternative to the commonly used BWA + GATK pipeline.

With the rise of million-sample WGS projects, the need for ultrafast and highly accurate variant calling has become crucial for in-depth genome analysis. One notable example is the Genome Aggregation Database (gnomAD) (v3.1), which encompasses 76,156 WGS samples sourced globally, providing invaluable population genetic information for various research and clinical applications ³¹. According to the results of this study, assuming an average depth of 30X for gnomAD samples, the BaseNumber pipeline can complete variant calling from raw sequencing data within a mere 90 days when deployed on ten GPU servers, alongside the necessary infrastructures. Leveraging the capabilities of accelerated variant calling pipelines offers substantial savings in terms of investments in server hardware, physical space, support infrastructure, and the workforce. Additionally, the reduced analysis time significantly mitigates energy consumption and computing costs. Based on the costs of GPU servers and per-sample power consumption, our estimates indicate that processing 100,000 30X WGS samples over a span of four years could result in an analysis cost of less than $0.40 per sample. Furthermore, the implementation of GPU-based software is agile and straightforward, ensuring operational efficiency and allowing timely adjustments to the pipeline. As a result, the adjustment of the genome reference and algorithms for read mapping, variant calling, and quality control can be accomplished with relatively low sunk costs as large cohort studies progress. An additional advantage of GPU acceleration, similar to CPU acceleration, is that these servers can be employed for other high-performance parallel computing tasks, such as training deep learning models, image recognition, and natural language processing, thereby maximizing the utilization of server resources.
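A back-of-the-envelope check of the 90-day estimate can be written as below. The per-sample time at 30X is not reported in the text, so the sketch assumes that runtime scales proportionally with depth from the measured 23 minutes at 48X, and that the ten servers each process one sample at a time without idle periods; both are simplifying assumptions made here for illustration, not the authors' calculation.

```python
MINUTES_PER_DAY = 24 * 60


def cohort_days(n_samples, minutes_per_sample, n_servers):
    """Days needed if each server processes one sample at a time, back to back."""
    return n_samples * minutes_per_sample / n_servers / MINUTES_PER_DAY


# Assumption: runtime is proportional to depth, scaled from 23 min at 48X.
minutes_at_30x = 23.0 * 30 / 48

days = cohort_days(n_samples=76_156, minutes_per_sample=minutes_at_30x, n_servers=10)
print(f"assumed time per 30X sample: {minutes_at_30x:.2f} min")
print(f"estimated duration on ten servers: {days:.1f} days")
```

Under these assumptions the estimate comes out at roughly 76 days, which is consistent with the "within 90 days" figure once data transfer and scheduling overheads are allowed for.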

The germline variant calling algorithm used in BaseNumber resembles that used in GATK HaplotypeCaller, as both methods involve reassembling aligned reads within the active regions to construct plausible haplotypes for accurate detection of SNVs and indels. However, the alignment and variant calling processes were relatively coarse-grained. Merely transplanting HaplotypeCaller to the GPU proved ineffective, resulting in a modest 2.3-fold increase in speed ^16,32. To address this limitation, the alignment and variant calling processes require fine-grained parallelization and algorithmic adjustments to enable concurrent operation across thousands of GPU cores. In addition, given the exceptional bandwidth of graphics memory, a dedicated management system was necessary to regulate computational flow while optimizing I/O operations to alleviate system bottlenecks. With these improvements, BaseNumber achieves high-throughput parallel acceleration and analysis efficiency. Similar performance claims were made for another GPU-based acceleration tool, NVIDIA Clara Parabricks, in which the GPU caller purportedly completes the germline variant analysis pipeline for 30X human WGS in 28 minutes using eight NVIDIA A100 GPUs in a p4d.24xlarge instance on AWS ³³.

This study focused on the germline variant calling scenario using WGS data, while BaseNumber is also suitable for analyzing ultrahigh-coverage sequencing data. Leveraging its robust computational capabilities, BaseNumber demonstrated significantly shorter running times than the GATK pipeline, eliminating the need for downsampling processes employed by GATK to handle regions with excessive coverage. Consequently, BaseNumber consistently delivered reproducible results, ensuring identical outputs for the same dataset. Furthermore, by harnessing the high processing speed of GPUs, BaseNumber allows for the application of other computationally intensive methods, such as graph alignment ³⁴ and genotype imputation ³⁵, to increase the accuracy of read mapping and variant calling. To further enhance performance, BaseNumber could benefit from integrating advanced storage systems, such as parallel file systems and flash solid-state drive (SSD) network-attached storage (NAS), to optimize I/O capabilities, as this evaluation revealed I/O as a relative bottleneck in the workflow.

The results of this study should be interpreted in light of several limitations. First, we were unable to directly compare GATK and BaseNumber on the same GPU server due to the large number of samples used for the evaluation and the slower processing speed of the GATK pipeline. Notably, the GPU_BaseNumber server had superior I/O components, including increased RAM capacity and a larger SSD, compared to the CPU_Generic servers. Although GATK might run slightly faster on the GPU_BaseNumber server, such improvements are unlikely to alter the principal findings. Second, due to the nature of short-read sequencing ^25,36,37, BaseNumber, similar to other variant callers that rely on NGS data, faces challenges in regions with low mappability. These regions encompass segmental duplications ³⁸, tandem repeats ³⁹, variable, diversity, and joining (VDJ) recombination regions ⁴⁰, and other genomic regions enriched with repetitive sequences. The accuracy of variant identification and interpretation in these regions depends on advancements in long-read sequencing techniques.

Our evaluation highlights the accuracy and exceptional speed achieved by BaseNumber, a GPU-based variant calling tool. This approach shows the potential to optimize sequencing analysis in large-scale population studies and effectively meet the time-sensitive demands of clinical WGS genetic diagnosis. Additionally, the successful implementation of BaseNumber underscores the broader application of GPU acceleration in other bioinformatic pipelines that share similar conceptual and design principles. This approach holds significant promise in addressing the growing need for multiomics data analysis, as it offers robust support for future endeavors in individualized precision medicine.

Ethics approval and consent to participate

This study was approved by the Institutional Review Board of West China Hospital, Southwest Hospital and PLA General Hospital. All participants or their guardians provided written informed consent to participate in the study.

Consent for publication

We have received and archived written consent for participation/publication from every individual whose data is included in our manuscript.

Availability of data and materials

The Demo of BaseNumber can be accessed on the AWS US East (N. Virginia) at https://us-east-1.console.aws.amazon.com/. Instructions for usage are provided at https://github.com/WCH-IRD/BaseNumber.

The GIAB and GSCG reference data utilized in this study are available through the following URLs:

GIAB standard fastq files: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/

GIAB standard VCF and BED files (v3.3.2): https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/NISTv3.3.2/GRCh38/

GSCG standard Fastq files: http://chinese-quartet.org/#/data/download/quartet-genomics

GSCG standard VCF and BED files (v1.0): http://chinese-quartet.org/#/data/download/quartet-genomics

The VCF files of the GIAB and GSCG samples used in this study: https://vcf-for-benchmark-paper.tos-cn-beijing.volces.com/BaseNumber_VCFs.zip

Competing interests

The authors declare that they have no competing interests.

Funding

This research was financially supported by the National Key Research and Development Program of China (82171836) and the 1·3·5 project for disciplines of excellence, West China Hospital, Sichuan University (ZYJC20002).

Authors’ contributions

QZ, FB and HY conceived the study. QZ, FA and FB designed the study. QZ performed the evaluation. HL, FA and FB wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We are grateful to Prof. Leming Shi for providing the GSCG standards, and to the Information Center for providing hardware and network facilities.

Authors' information

¹Department of Neurosurgery, West China Hospital, Sichuan University, Chengdu 610000, China

²Institute of Rare Diseases, West China Hospital, Sichuan University, Chengdu 610000, China

HL and QZ contributed equally to this work.

Correspondence to FB and HY.

Waterston, R. H., Lander, E. S. & Sulston, J. E. On the sequencing of the human genome. Proceedings of the National Academy of Sciences 99, 3712–3716 (2002).
Check Hayden, E. Technology: the $1,000 genome. Nature News 507, 294 (2014).
Lappalainen, T., Scott, A. J., Brandt, M. & Hall, I. M. Genomic Analysis in the Age of Human Genome Sequencing. Cell 177, 70–84 (2019).
Siva, N. UK gears up to decode 100,000 genomes from NHS patients. Lancet 385, 103–104 (2015).
All of Us Research Program Investigators et al. The ‘All of Us’ Research Program. N Engl J Med 381, 668–676 (2019).
Cao, Y. et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell research 30, 717–731 (2020).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491–498 (2011).
Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research 22, 568–576 (2012).
Goyal, A. et al. Ultra-fast next generation human genome sequencing data processing using DRAGEN™ bio-IT processor for precision medicine. Open Journal of Genetics 7, 9–19 (2017).
Freed, D., Aldana, R., Weber, J. A. & Edwards, J. S. The Sentieon Genomics Tools-A fast and accurate solution to variant calling from next-generation sequence data. BioRxiv 115717 (2017).
Herzeel, C., Costanza, P., Decap, D., Fostier, J. & Verachtert, W. elPrep 4: A multithreaded framework for sequence analysis. PLoS One 14, e0209523 (2019).
Kelly, B. J. et al. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics. Genome biology 16, 1–14 (2015).
Kathiresan, N. et al. Accelerating next generation sequencing data analysis with system level optimizations. Scientific reports 7, 1–11 (2017).
Kendig, K. I. et al. Sentieon DNASeq Variant Calling Workflow Demonstrates Strong Computational Performance and Accuracy. Front Genet 10, 736 (2019).
Illumina DRAGEN Bio-IT Platform 3.7 User Guide. (Illumina, 2020).
Ren, S., Ahmed, N., Bertels, K. & Al-Ars, Z. GPU accelerated sequence alignment with traceback for GATK HaplotypeCaller. BMC genomics 20, 103–116 (2019).
Ren, S., Bertels, K. & Al-Ars, Z. Efficient acceleration of the pair-hmms forward algorithm for gatk haplotypecaller on graphics processing units. Evolutionary Bioinformatics 14, 1176934318760543 (2018).
Wang, J., Xie, X. & Cong, J. Communication optimization on GPU: A case study of sequence alignment algorithms. in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 72–81 (IEEE, 2017).
Braunstein, V. & Burnett, G. GPU-Accelerated Tools Added to NVIDIA Clara Parabricks v3.6 for Cancer and Germline Analyses. https://developer.nvidia.com/blog/gpu-accelerated-tools-added-to-nvidia-clara-parabricks-v3-6-for-cancer-and-germline-analyses/ (2021).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature biotechnology 36, 983–987 (2018).
Luo, R., Sedlazeck, F. J., Lam, T.-W. & Schatz, M. C. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nature communications 10, 1–11 (2019).
Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nature biotechnology 37, 561–566 (2019).
Pan, B. et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biol 23, 2 (2022).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Picard toolkit. Broad Institute, GitHub repository. https://broadinstitute.github.io/picard/ (2019).
Van der Auwera, G. A. et al. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Current protocols in bioinformatics 43, 11–10 (2013).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nature biotechnology 37, 555–560 (2019).
Li, H. et al. Seqtk: a fast and lightweight tool for processing FASTA or FASTQ sequences. (2013).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Franke, K. R. & Crowgey, E. L. Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms. Genomics & informatics 18, (2020).
AWS Editorial Team, Deshpande, A., Choudhury, O. & Srinivasan, S. Benchmarking the NVIDIA Clara Parabricks germline pipeline on AWS. https://aws.amazon.com/blogs/hpc/benchmarking-the-nvidia-clara-parabricks-germline-pipeline-on-aws/ (2021).
Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology 36, 875–879 (2018).
Browning, B. L., Zhou, Y. & Browning, S. R. A one-penny imputed genome from next-generation reference panels. The American Journal of Human Genetics 103, 338–348 (2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature methods 9, 357–359 (2012).
Marco-Sola, S., Sammeth, M., Guigó, R. & Ribeca, P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nature methods 9, 1185–1188 (2012).
Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
Gemayel, R., Vinces, M. D., Legendre, M. & Verstrepen, K. J. Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annual review of genetics 44, 445–477 (2010).
Schatz, D. G. & Swanson, P. C. V(D)J recombination: mechanisms of initiation. Annual review of genetics 45, 167–202 (2011).

No competing interests reported.

TableS1.xlsx
Table S1. Detailed information of the testing samples
TableS2.xlsx
Table S2. Recall, precision, and F1 scores of BaseNumber for the Retested_GSCG and GIAB samples
TableS3.xlsx
Table S3. Correlation between the accuracy of the BaseNumber and sequencing depth
TableS4.xlsx
Table S4. MD5 values of BaseNumber output files from three rounds of analysis
TableS5.xlsx
Table S5. Effects of GPU card count and sequencing depth on the analysis time

Editorial decision: Revision requested
20 May, 2024
Submission checks completed at journal
26 Apr, 2024
Editor assigned by journal
26 Apr, 2024
First submitted to journal
24 Apr, 2024