and focuses on estimating the allele frequency ($f_A$), as well as the dropout probability for allele A ($\varepsilon_A$) or B ($\varepsilon_B$). For a particular genotype $y_i$ with possible outcomes AA, AB, BB, or missing genotype (miss.), the Bayesian likelihood is computed as
p(yi = AA|fA,εA,εB) = p(AA|AA)p(AA) + p(AA|AB)p(AB)
p(yi = AB|fA,εA,εB) = p(AB|AB)p(AB)
p(yi = BB|fA,εA,εB) = p(BB|AB)p(AB) + p(BB|BB)p(BB)
p(yi = miss.|fA,εA,εB) = p(miss.|AA)p(AA) + p(miss.|AB)p(AB) + p(miss.|BB)p(BB),
where
p(AA|AA) = (1 – εA)2
p(AA|AB) = εB
p(AB|AB) = (1 – εA)(1 – εB)
p(BB|AB) = εA
p(BB|BB) = (1 – εB)2
p(miss.|AA) = εA2
p(miss.|AB) = εAεB
p(miss.|BB) = εB2
and $p(\mathrm{AA}) = f_A^2$, $p(\mathrm{AB}) = 2 f_A (1 - f_A)$, and $p(\mathrm{BB}) = f_B^2$, with $f_B = 1 - f_A$. Note that the model assumes that a BB individual cannot be genotyped as AA (the probability of false alleles is zero). Flat prior distributions between 0 and 1 were assumed for $f_A$, $\varepsilon_A$ and $\varepsilon_B$.
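The likelihood terms above can be written out directly. The following is a minimal sketch rather than the SNP+ implementation: the function names `genotype_probs` and `log_likelihood` are hypothetical, and it assumes that the allele B frequency is 1 - fA and that every observed call contributes independently to the likelihood.

```python
import math

def genotype_probs(f_a, eps_a, eps_b):
    """Probability of each observed call, using the terms exactly as listed in the text."""
    # Genotype frequencies of the true genotypes (assumption: f_B = 1 - f_A).
    p_aa = f_a ** 2
    p_ab = 2 * f_a * (1 - f_a)
    p_bb = (1 - f_a) ** 2
    return {
        "AA": (1 - eps_a) ** 2 * p_aa + eps_b * p_ab,
        "AB": (1 - eps_a) * (1 - eps_b) * p_ab,
        "BB": eps_a * p_ab + (1 - eps_b) ** 2 * p_bb,
        "miss": eps_a ** 2 * p_aa + eps_a * eps_b * p_ab + eps_b ** 2 * p_bb,
    }

def log_likelihood(calls, f_a, eps_a, eps_b):
    """Sum of log-probabilities over a list of calls ('AA', 'AB', 'BB' or 'miss')."""
    probs = genotype_probs(f_a, eps_a, eps_b)
    total = 0.0
    for call in calls:
        if probs[call] <= 0.0:
            return -math.inf  # the call is impossible under these parameter values
        total += math.log(probs[call])
    return total
```

With no dropout (both dropout probabilities set to 0) the function returns the plain genotype frequencies. The four terms are used as written in the text and are not renormalized in this sketch.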
For each SNP, the model was solved by a Metropolis-Hastings sampling process (Metropolis et al. 1953) with 500,000 iterations after a burn-in period of 10,000 iterations. Two alternative parameterizations ($\varepsilon_A = \varepsilon_B$ vs. $\varepsilon_A \neq \varepsilon_B$) were compared by a Bayes factor (Kass and Raftery 1995). The minimum number of within-individual replicates required to predict a reliable genotype with 95% probability was calculated as $\log(0.05)/\log(\varepsilon_A)$. All these procedures have been implemented in the SNP+ software, available at http://www.casellas.info/software.html. The program generates the following delimited text output files:
1) SNP-by-SNP report of dropout probabilities with their confidence intervals, the minimum number of replicates, and the Bayes factor comparing a single dropout probability against two independent dropout probabilities.
2) Summary table of the probability of error, confidence interval, replications, and Bayes factor of all the SNPs (Figure 1).
3) Predicted genotypes for each individual and SNP, and probability of error, if any.
4) Pairwise comparison among all individuals of the probability of identical genotype.
5) SNP-by-SNP report of the MAF and probability of identity (PI).
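The sampling step described before the list of output files can be illustrated with a short random-walk sampler for a single SNP. This is a sketch under stated assumptions, not the SNP+ code: the Gaussian proposal, the step size, the starting values, and the names `log_posterior` and `metropolis_hastings` are all hypothetical, and only the parameterization with two separate dropout probabilities is shown. Because the proposal is symmetric, the acceptance rule reduces to the ratio of posterior densities.

```python
import math
import random

def log_posterior(params, counts):
    """Log posterior (up to a constant) for one SNP with flat priors on (0, 1)."""
    f_a, eps_a, eps_b = params
    if not all(0.0 < v < 1.0 for v in params):
        return -math.inf  # outside the support of the flat priors
    p_aa, p_ab, p_bb = f_a ** 2, 2 * f_a * (1 - f_a), (1 - f_a) ** 2
    probs = {
        "AA": (1 - eps_a) ** 2 * p_aa + eps_b * p_ab,
        "AB": (1 - eps_a) * (1 - eps_b) * p_ab,
        "BB": eps_a * p_ab + (1 - eps_b) ** 2 * p_bb,
        "miss": eps_a ** 2 * p_aa + eps_a * eps_b * p_ab + eps_b ** 2 * p_bb,
    }
    if min(probs.values()) <= 0.0:
        return -math.inf  # guard against numerical underflow
    # counts maps each call type to the number of times it was observed
    return sum(n * math.log(probs[g]) for g, n in counts.items())

def metropolis_hastings(counts, n_iter=500_000, burn_in=10_000, step=0.05, seed=1):
    """Random-walk sampler; returns the (f_A, eps_A, eps_B) draws kept after burn-in."""
    rng = random.Random(seed)
    current = [0.5, 0.1, 0.1]  # arbitrary starting values (assumption)
    current_lp = log_posterior(current, counts)
    samples = []
    for it in range(burn_in + n_iter):
        proposal = [v + rng.gauss(0.0, step) for v in current]
        proposal_lp = log_posterior(proposal, counts)
        diff = proposal_lp - current_lp
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if diff >= 0.0 or rng.random() < math.exp(diff):
            current, current_lp = proposal, proposal_lp
        if it >= burn_in:
            samples.append(tuple(current))
    return samples
```

Given hypothetical call counts such as `{"AA": 40, "AB": 30, "BB": 20, "miss": 10}`, the posterior mean of a dropout probability is the average of the corresponding component over the returned draws. The default iteration counts follow the values quoted in the text; smaller values are enough for a quick check.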
As an example, we used SNP+ to evaluate two panels using Open Array® technology (Thermo Fisher Scientific Inc). We analyzed 22 fecal samples and 114 hair samples from Iberian brown bears (Ursus arctos), first using a 120 SNP panel. We then selected the best 60 SNPs and ran SNP+ again using 164 fecal samples and 173 hair samples. All samples were replicated four times, and low-quality DNA samples (call rate <25%) were excluded from both analyses.
Figure 2 shows the relative frequencies of the dropout probability for the four studies. The dropout probability was clearly lower after selecting the panel of 60 SNPs for both types of non-invasive samples (on average 0.05 versus 0.2 with the 120 SNP panel). In terms of variability and distribution mode, the study with the lowest dropout probabilities is study C, after SNP selection. The study with the highest dropout probability is study B, probably because the hair samples were collected by hair trapping and therefore not all samples contained roots or enough hair to yield high-quality DNA.
To summarize, SNP+ calculates the dropout probability, the Bayes factor, PI and MAF. It can be used to select the best SNPs for arrays ranging from low to high density, avoiding those SNPs that require many replicates because they are error-prone, or, if the panel has already been designed, to determine the number of replicates per sample needed to reach 95% genotyping reliability per SNP.
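The replicate calculation can be checked with a few lines of code. This sketch assumes that the ratio log(0.05)/log(ε) is rounded up to the next whole replicate, which the text does not state explicitly; the function name `min_replicates` is hypothetical.

```python
import math

def min_replicates(eps, reliability=0.95):
    """Smallest whole number n with eps**n <= 1 - reliability."""
    if not 0.0 < eps < 1.0:
        raise ValueError("dropout probability must lie strictly between 0 and 1")
    # log(0.05) / log(eps) for the default reliability, rounded up (assumption).
    return math.ceil(math.log(1.0 - reliability) / math.log(eps))

print(min_replicates(0.2))  # 2, since 0.2**2 = 0.04 is below 0.05
print(min_replicates(0.5))  # 5, since 0.5**5 = 0.03125 is below 0.05
```

A dropout probability of 0.2, the average reported for the 120 SNP panel, would therefore call for two replicates per sample under this rounding rule.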