SNP+ to predict dropout rates in SNP arrays

doi:10.21203/rs.3.rs-2272496/v1

Method Article

This work is licensed under a CC BY 4.0 License

Journal publication: published 08 Jul, 2023, in Conservation Genetics Resources. This text is the latest preprint version.

Genotyping individuals using forensic or non-invasive samples such as hair or feces increases the risk of allelic amplification failure (dropout) due to the low quality and quantity of DNA. One way to decrease genotyping errors is to increase the number of replicates per sample. Here, we present the software SNP+, which estimates the dropout probability and the number of replicates required to obtain a reliable genotype with 95% probability. Moreover, the software estimates the minor allele frequency and uses a Bayes factor to compare two competing models that assume equal or allele-specific dropout probabilities. The software handles data ranging from a single SNP to high-density arrays (e.g., 100,000 SNPs).

Keywords: SNP, software, dropout, forensic samples

Single nucleotide polymorphisms (SNPs) are biallelic markers that are highly abundant in the genome and have a low mutation rate (10⁻⁹ per generation) (Brumfield et al. 2003; Morin et al. 2004). SNPs can be associated with diseases, susceptibility to environmental factors or quantitative trait loci (Erichsen and Chanock 2004; Amos et al. 2008; Casellas et al. 2008; Nickels et al. 2013). In forensic medicine, SNPs can be useful to identify individuals from non-invasive samples by using short amplicons (Sobrino et al. 2005). However, individual identification using degraded samples with fewer than 100 copies of gDNA may lead to genotyping errors (Giardina et al. 2009; von Thaden et al. 2020). Dropout, or allelic amplification failure, is the most common error and is caused by stochastic effects of the PCR reaction (Taberlet and Luikart 1999). To reduce the dropout rate, a multiplex pre-amplification step and several replicates per sample can be performed (Bellemain and Taberlet 2004; Sastre et al. 2009). However, both solutions increase the time and cost of genotyping. Our objective was to develop software (SNP+) to predict the dropout probability of each SNP from a sample of replicated genotypes. Moreover, two alternative parameterizations were compared by a Bayes factor to test a homogeneous within-SNP dropout probability against different dropout probabilities for each allele.

SNP+ analyzes each SNP individually, starting from a vector y of n genotypes ordered by individual and by replicate within individual (y' = [y'_1 y'_2 ... y'_m]), where m is the number of individuals, n_1 is the number of replicates for the first individual, and n = n_1 + n_2 + ... + n_m. Assuming two alleles, A and B, the Bayesian joint posterior distribution is proportional to

p(f_A,ε_A,ε_B|y) ∝ p(y|f_A,ε_A,ε_B) p(f_A) p(ε_A) p(ε_B),

where f_A is the frequency of allele A, and ε_A and ε_B are the dropout probabilities of alleles A and B, respectively. For a particular genotype y_i with possible outcomes AA, AB, BB and missing genotype (miss.), the likelihood is computed as

p(y_i = AA|f_A,ε_A,ε_B) = p(AA|AA)p(AA) + p(AA|AB)p(AB)

p(y_i = AB|f_A,ε_A,ε_B) = p(AB|AB)p(AB)

p(y_i = BB|f_A,ε_A,ε_B) = p(BB|AB)p(AB) + p(BB|BB)p(BB)

p(y_i = miss.|f_A,ε_A,ε_B) = p(miss.|AA)p(AA) + p(miss.|AB)p(AB) + p(miss.|BB)p(BB),

where

p(AA|AA) = (1 – ε_A)²

p(AA|AB) = ε_B

p(AB|AB) = (1 – ε_A)(1 – ε_B)

p(BB|AB) = ε_A

p(BB|BB) = (1 – ε_B)²

p(miss.|AA) = ε_A²

p(miss.|AB) = ε_Aε_B

p(miss.|BB) = ε_B²

and p(AA) = f_A², p(AB) = 2f_A(1 – f_A), and p(BB) = f_B², with f_B = 1 – f_A. Note that the model assumes that a BB individual cannot be genotyped as AA (the probability of false alleles is zero). Prior distributions for f_A, ε_A and ε_B were flat between 0 and 1.
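The likelihood terms above can be sketched in a few lines of Python. This is a minimal illustration, not the SNP+ source code: the function names `genotype_probs` and `log_likelihood` and the outcome labels are hypothetical, the conditional probabilities are transcribed exactly as written above without renormalization, and the observed genotypes are treated as independent draws, which is a simplifying assumption.

```python
import math


def genotype_probs(f_a, eps_a, eps_b):
    """Probability of each observed outcome, using the expressions in the text."""
    f_b = 1.0 - f_a
    # Genotype frequencies as given in the text: f_A^2, 2 f_A (1 - f_A), f_B^2.
    p_aa, p_ab, p_bb = f_a ** 2, 2.0 * f_a * f_b, f_b ** 2
    return {
        "AA": (1 - eps_a) ** 2 * p_aa + eps_b * p_ab,
        "AB": (1 - eps_a) * (1 - eps_b) * p_ab,
        "BB": eps_a * p_ab + (1 - eps_b) ** 2 * p_bb,
        "miss": eps_a ** 2 * p_aa + eps_a * eps_b * p_ab + eps_b ** 2 * p_bb,
    }


def log_likelihood(genotypes, f_a, eps_a, eps_b):
    """Sum of log-probabilities over a list of observed outcomes.

    Assumption: observations are independent given the parameters.
    """
    probs = genotype_probs(f_a, eps_a, eps_b)
    total = 0.0
    for g in genotypes:
        if probs[g] <= 0.0:
            return float("-inf")  # outcome impossible under these parameters
        total += math.log(probs[g])
    return total
```

For instance, with f_A = 0.5 and no dropout, the outcome probabilities reduce to the genotype frequencies 0.25, 0.5 and 0.25, and a missing genotype has probability zero.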

For each SNP, the model was solved by Metropolis-Hastings sampling (Metropolis et al. 1953) with 500,000 iterations after a burn-in period of 10,000 iterations. Two alternative parameterizations (ε_A = ε_B vs. ε_A ≠ ε_B) were compared by a Bayes factor (Kass and Raftery 1995). The minimum number of within-individual replicates required to obtain a reliable genotype with 95% probability was calculated as log(0.05)/log(ε_A), that is, the number of replicates n at which the probability that the allele drops out in every replicate, ε_A^n, falls to 0.05. All these procedures are implemented in the SNP+ software, available at http://www.casellas.info/software.html. The program generates the following delimited text output files:

1) SNP-by-SNP report of dropout probabilities with their confidence intervals, the minimum number of replicates, and the Bayes factor comparing a single dropout probability against two independent dropout probabilities.

2) Summary table of the probability of error, confidence interval, number of replicates, and Bayes factor of all the SNPs (Figure 1).

3) Predicted genotypes for each individual and SNP, and probability of error, if any.

4) Pairwise comparison among all individuals of the probability of identical genotype.

5) SNP-by-SNP report of the MAF and probability of identity (PI).
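The Metropolis-Hastings step described before the list of output files can be sketched as follows for a single SNP. This is an illustrative sketch under stated assumptions, not the SNP+ implementation: the text does not describe the proposal distribution, so a symmetric uniform random walk is assumed; the names `log_posterior`, `metropolis_hastings` and `posterior_means`, the starting values, the step size, and the use of outcome counts (which treats observations as independent) are all hypothetical; and the default chain length is far shorter than the 500,000 iterations and 10,000 burn-in iterations reported above.

```python
import math
import random

OUTCOMES = ("AA", "AB", "BB", "miss")


def log_posterior(counts, f_a, eps_a, eps_b):
    """Log posterior (up to a constant) for one SNP, given outcome counts."""
    # Flat priors on (0, 1): zero density outside, constant inside.
    if not all(0.0 < v < 1.0 for v in (f_a, eps_a, eps_b)):
        return float("-inf")
    f_b = 1.0 - f_a
    p_aa, p_ab, p_bb = f_a ** 2, 2.0 * f_a * f_b, f_b ** 2
    probs = {
        "AA": (1 - eps_a) ** 2 * p_aa + eps_b * p_ab,
        "AB": (1 - eps_a) * (1 - eps_b) * p_ab,
        "BB": eps_a * p_ab + (1 - eps_b) ** 2 * p_bb,
        "miss": eps_a ** 2 * p_aa + eps_a * eps_b * p_ab + eps_b ** 2 * p_bb,
    }
    return sum(counts.get(k, 0) * math.log(probs[k]) for k in OUTCOMES)


def metropolis_hastings(counts, n_iter=5000, burn_in=1000, step=0.05,
                        equal_dropout=False, seed=1):
    """Random-walk sampler returning (f_A, eps_A, eps_B) draws after burn-in."""
    rng = random.Random(seed)
    theta = [0.5, 0.1, 0.1]  # assumed starting values
    current = log_posterior(counts, *theta)
    samples = []
    for it in range(burn_in + n_iter):
        # Symmetric uniform proposal around the current state (assumption).
        proposal = [v + rng.uniform(-step, step) for v in theta]
        if equal_dropout:
            proposal[2] = proposal[1]  # constrained model: eps_A = eps_B
        candidate = log_posterior(counts, *proposal)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if it >= burn_in:
            samples.append(tuple(theta))
    return samples


def posterior_means(samples):
    """Posterior means of f_A, eps_A and eps_B from the retained draws."""
    n = len(samples)
    return tuple(sum(s[i] for s in samples) / n for i in range(3))
```

A call such as `posterior_means(metropolis_hastings({"AA": 30, "AB": 40, "BB": 25, "miss": 5}))` returns point estimates for one SNP; setting `equal_dropout=True` samples the constrained parameterization that the Bayes factor compares against the allele-specific one.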

As an example, we used SNP+ to evaluate two panels using Open Array® technology (Thermo Fisher Scientific Inc). We first analyzed 22 fecal samples and 114 hair samples from Iberian brown bears (Ursus arctos) using a 120-SNP panel. We then selected the best 60 SNPs and ran SNP+ again using 164 fecal samples and 173 hair samples. All samples were replicated four times, and low-quality DNA samples (call rate <25%) were excluded from both analyses.

Figure 2 shows the relative frequencies of the dropout probability for the four studies. For both types of non-invasive samples, the dropout probability was clearly lower after selecting the panel of 60 SNPs (on average, 0.05 versus 0.2). In terms of variability and distribution mode, study C, conducted after SNP selection, obtained the lowest dropout probabilities. Study B had the highest dropout probability, probably because the hair samples were collected with hair traps and therefore not all samples contained roots or enough hair to yield high-quality DNA.

To summarize, SNP+ calculates the dropout probability, the Bayes factor, PI and MAF. It can be used to select the best SNPs for panels ranging from low-density to high-density arrays, avoiding those SNPs that require many replicates because they are error-prone, or, if the panel has already been designed, to determine the number of replicates per sample needed to reach 95% genotyping reliability per SNP.
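The replicate calculation log(0.05)/log(ε) given in the methods can be sketched as below. The function names and the rounding up to a whole number of replicates are assumptions made for illustration; the text only states the ratio itself.

```python
import math


def replicates_needed(eps, error=0.05):
    """Real-valued n solving eps**n = error, i.e. log(error) / log(eps)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("dropout probability must lie strictly between 0 and 1")
    return math.log(error) / math.log(eps)


def whole_replicates(eps, error=0.05):
    """Smallest whole number of replicates with eps**n <= error (assumed rounding)."""
    return math.ceil(replicates_needed(eps, error))
```

With the average dropout probabilities reported in the example, ε = 0.2 gives about 1.86 (two whole replicates), whereas ε = 0.05 gives exactly one replicate.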

Acknowledgments

This work was funded by the LoupO project (EFA354/19) of the European Interreg Program V-A Spain-France-Andorra (POCTEFA 2014-2020).

Ethics Statement

For the present study no animal was captured or euthanized.

Conflict of interest

The authors declare that they have no conflict of interest.

References

Amos CI, Wu X, Broderick P et al (2008) Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nature Genetics 40(5):616–622. https://doi.org/10.1038/ng.109.
Bellemain E, Taberlet P (2004) Improved noninvasive genotyping method: application to brown bear (Ursus arctos) faeces. Molecular Ecology Notes 4(3):519–522. https://doi.org/10.1111/j.1471-8286.2004.00711.x.
Brumfield R, Beerli PA, Nickerson D et al (2003) The utility of single nucleotide polymorphisms in inferences of population history. Trends in Ecology & Evolution 18:249–256. https://doi.org/10.1016/S0169-5347(03)00018-1.
Casellas J, Varona L, Muñoz G et al (2008) Empirical Bayes factor analyses of quantitative trait loci for gestation length in Iberian × Meishan F2 sows. Animal: An International Journal of Animal Bioscience 2(2):177–183. https://doi.org/10.1017/S1751731107001085.
Erichsen HC, Chanock SJ (2004) SNPs in cancer research and treatment. British Journal of Cancer 90(4):747–751. https://doi.org/10.1038/sj.bjc.6601574.
Giardina E, Pietrangeli I, Martone C et al (2009) Whole genome amplification and real-time PCR in forensic casework. BMC Genomics 10:159. https://doi.org/10.1186/1471-2164-10-159.
Kass RE, Raftery AE (1995) Bayes factors. Journal of the American Statistical Association 90(430):773–795. https://doi.org/10.1080/01621459.1995.10476572.
Metropolis N, Rosenbluth AW, Rosenbluth MN et al (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21(6):1087–1092. https://doi.org/10.1063/1.1699114.
Morin P, Luikart G, Wayne RK (2004) SNPs in ecology, evolution and conservation. Trends in Ecology & Evolution 19:208–216. https://doi.org/10.1016/j.tree.2004.01.009.
Nickels S, Truong T, Hein R et al (2013) Evidence of gene–environment interactions between common breast cancer susceptibility loci and established environmental risk factors. PLOS Genetics 9(3). https://doi.org/10.1371/journal.pgen.1003284.
Sastre N, Francino O, Lampreave G et al (2009) Sex identification of wolf (Canis lupus) using non-invasive samples. Conservation Genetics 10(3):555–558. https://doi.org/10.1007/s10592-008-9565-6.
Sobrino B, Brión M, Carracedo A (2005) SNPs in forensic genetics: a review on SNP typing methodologies. Forensic Science International 154(2–3):181–194. https://doi.org/10.1016/j.forsciint.2004.10.020.
Taberlet P, Luikart G (1999) Non-invasive genetic sampling and individual identification. Biological Journal of the Linnean Society 68(1–2):41–55. https://doi.org/10.1111/j.1095-8312.1999.tb01157.x.
von Thaden A, Nowak C, Tiesmeyer A et al (2020) Applying genomic data in wildlife monitoring: development guidelines for genotyping degraded samples with reduced single nucleotide polymorphism (SNP) panels. Molecular Ecology Resources 20(3). https://doi.org/10.1111/1755-0998.13136.

No competing interests reported.

Editorial decision: Major revision (23 Mar, 2023)
Reviews received at journal (20 Mar, 2023)
Reviews received at journal (24 Feb, 2023)
Reviewers agreed at journal (02 Feb, 2023)
Reviewers invited by journal (23 Dec, 2022)
Editor assigned by journal (15 Nov, 2022)
Submission checks completed at journal (15 Nov, 2022)
First submitted to journal (14 Nov, 2022)
