Preprint: Please note that this article has not completed peer review.
Research article

Benchmarking Variant Identification Tools for Plant Diversity Discovery

Xing Wu, Christopher Heffelfinger, Hongyu Zhao, Stephen L. Dellaporta
DOI: 10.21203/rs.2.9666/v2

Abstract

Background The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets.

Results A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the level of diversity, sequence coverage, and genome complexity. A cross-reference experiment using the S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, yielding more true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation.

Conclusions Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study provides important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.

Keywords
read alignment, variant calling, machine learning, variant filtering, imputation
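
The abstract compares variant callers by precision and recall. As a minimal sketch of how those two quantities can be computed against a known truth set (for example, the variants planted in a simulated dataset), the Python below treats each variant as a (chromosome, position, ref, alt) tuple and uses exact matching. The names `Variant`, `CallMetrics`, and `evaluate_calls`, the exact-match rule, and the toy data are illustrative assumptions, not the authors' evaluation code or results.

```python
from typing import NamedTuple, Set, Tuple

# Assumption: a variant is identified by (chromosome, 1-based position, ref, alt).
Variant = Tuple[str, int, str, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def evaluate_calls(called: Set[Variant], truth: Set[Variant]) -> CallMetrics:
    """Compare a call set with a truth set using exact tuple matching."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    # Guard against empty sets so the ratios are always defined.
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return CallMetrics(tp, fp, fn, precision, recall)


# Toy data, invented purely for illustration.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 55, "G", "C"),
}

metrics = evaluate_calls(called, truth)
print(f"precision={metrics.precision:.3f} recall={metrics.recall:.3f}")
```

In this toy case two of the three calls are correct and two of the four true variants are recovered. A stricter filter would typically trade recall for precision, which is the trade-off the filtering comparison in the abstract is concerned with.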

STATUS: Accepted

Peer Review Timeline

Version 2

Posted 14 Aug, 2019

  • Editorial decision: Accept

    On 22 Aug, 2019

  • Review #2 received

    Received 21 Aug, 2019

  • Reviewer #2 agreed

    On 12 Aug, 2019

  • Reviewer #1 agreed

    On 09 Aug, 2019

  • Review #1 received

    Received 09 Aug, 2019

  • 3 reviewer(s) invited

    Invitations sent on 06 Aug, 2019

  • Editor assigned

    On 02 Aug, 2019

  • Submission checks complete

    On 01 Aug, 2019

  • Editor invited

    On 01 Aug, 2019

Version 1

Posted 18 May, 2019

More from BMC Genomics
