StratoMod: Predicting sequencing and variant calling errors with interpretable machine learning

doi:10.21203/rs.3.rs-2613736/v1

This work is licensed under a CC BY 4.0 License

Version 1

Repetitive structures in the genome often make it difficult to characterize variation accurately, across a variety of sequencing technologies and variant detection methods. To address this, the Genome in a Bottle consortium maintains “stratification BED files” for error analysis in “difficult” regions such as homopolymers and segmental duplications. However, this strategy represents genomic context in discrete bins, which sacrifices precision when quantifying difficulty; a data-driven model could improve on it. To this end, we developed StratoMod, which uses an interpretable machine learning classifier to predict variant calling errors from features derived from genomic data. StratoMod identified distinct associations with errors for A/T vs. G/C homopolymer lengths, and quantified sources of error for a new sequencing technology. We also demonstrated that the model could predict clinically relevant variants that may be missed by certain methods, using DeepVariant calls from Illumina as an example. From these predictions we also produced a resource of difficult-to-map genes with challenging variants and large challenging INDELs. In each use case, the interpretability of StratoMod lets one understand how each feature contributed to a prediction. We anticipate this will be useful for method developers and clinicians who want a quantitative understanding of the sources of variant calling errors.
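To make the point about discrete bins concrete, the following minimal sketch compares a two-bin stratification of homopolymer length with a per-length view of the same quantity. Everything in it is an assumption made for illustration: the logistic curve in `toy_error_rate`, the threshold of 7, and all function names are hypothetical and are not taken from StratoMod or from the Genome in a Bottle stratifications, and no real data are used.

```python
import math


def toy_error_rate(length):
    """Hypothetical error probability that rises smoothly with homopolymer length.

    This curve is invented for illustration; it is not a measured error profile.
    """
    return 1.0 / (1.0 + math.exp(-(length - 12) / 2.0))


def stratified_rates(lengths, threshold):
    """Mean toy error rate in each of two bins split at `threshold`."""
    below = [toy_error_rate(n) for n in lengths if n < threshold]
    at_or_above = [toy_error_rate(n) for n in lengths if n >= threshold]
    return {
        "below": sum(below) / len(below),
        "at_or_above": sum(at_or_above) / len(at_or_above),
    }


def binning_error(lengths, threshold):
    """Largest gap between a length's own toy rate and the rate of its bin."""
    rates = stratified_rates(lengths, threshold)
    gaps = []
    for n in lengths:
        bin_rate = rates["below"] if n < threshold else rates["at_or_above"]
        gaps.append(abs(toy_error_rate(n) - bin_rate))
    return max(gaps)


# Hypothetical range of homopolymer lengths and an arbitrary bin boundary.
lengths = list(range(4, 21))
rates = stratified_rates(lengths, threshold=7)
worst_gap = binning_error(lengths, threshold=7)

print("bin-level rates:", rates)
print("largest per-length deviation from its bin:", round(worst_gap, 3))
```

Under these assumptions, every homopolymer at or above the threshold is assigned the same bin-level rate, even though the toy curve differs widely between the shortest and longest members of that bin. A model that treats length as a continuous feature can keep that gradient.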

Biological sciences/Genetics/Sequencing

Biological sciences/Biotechnology/Genomics

Biological sciences/Computational biology and bioinformatics/Computational models

There is a potential competing interest: FJS has received support from Oxford Nanopore Technologies, Pacific Biosciences, Illumina, and Genentech.

suppfigs.pdf
heatmapindel.gz
supplemental_file_2
