The gradual decline in genotyping costs has increased the number of samples being genotyped, irrespective of species. In the livestock industry, large-scale genotyping is a major tool in genomic selection. Studies have shown that it is cheaper to genotype samples with a lower-density chip and impute them to higher densities using large reference panels. A popular, fast and versatile imputation software used in animal production is Fimpute (Sargolzaei et al. 2014). Fimpute has been shown to be robust when imputing many individuals genotyped with different panels. Beyond its fast algorithm, the speed of Fimpute can be attributed to its lack of internal genotype recoding and of allelic correlation and concordance estimation. This is in contrast to software such as Beagle (Browning et al. 2018), Minimac4 (Das et al. 2016), and Impute2 (Howie et al. 2009), which can recode genotypes internally and estimate allelic R-square values and concordance. We have developed SnpRecode to bridge this gap by recoding genotypes from Variant Call Format (VCF) and/or Plink (Purcell et al. 2007) format to Fimpute format, and by recoding them back to VCF or Plink format after imputation. Though this may seem trivial, format conversion is a major bottleneck in most next-generation sequencing analyses (Wang et al. 2014), and SnpRecode helps to address it. Written in Python v3.6, the software is highly customizable and the source code is freely available for modification as needed; however, we recommend using the compiled Linux executable.
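To make the recoding step concrete, the following is a minimal Python sketch and not SnpRecode's actual implementation. It assumes biallelic sites and an additive coding of the kind commonly used for Fimpute input (0, 1 or 2 copies of the alternate allele, with 5 for a missing call). The function names `recode_gt` and `vcf_lines_to_sample_strings` are hypothetical and chosen only for illustration.

```python
def recode_gt(gt, missing="5"):
    """Map a biallelic VCF GT value (e.g. '0/1', '1|1', './.') to one digit.

    Assumed coding: count of alternate alleles (0, 1, 2); anything that is
    not a clean biallelic diploid call is returned as the missing code.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return missing
    return str(int(alleles[0]) + int(alleles[1]))


def vcf_lines_to_sample_strings(lines):
    """Turn variant-major VCF records into one genotype string per sample.

    VCF stores one variant per row, whereas a sample-major layout needs one
    row per individual, so the calls are collected column by column.
    """
    samples = []
    calls = []
    for line in lines:
        if line.startswith("##"):          # meta-information lines
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):      # header line carries sample IDs
            samples = fields[9:]
            calls = [[] for _ in samples]
            continue
        for i, entry in enumerate(fields[9:]):
            gt = entry.split(":")[0]       # GT is the first FORMAT subfield
            calls[i].append(recode_gt(gt))
    return {s: "".join(c) for s, c in zip(samples, calls)}


example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tanimal1\tanimal2\n",
    "1\t100\tsnp1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\n",
    "1\t200\tsnp2\tC\tT\t.\tPASS\t.\tGT\t1|1\t./.\n",
]
recoded = vcf_lines_to_sample_strings(example_vcf)
```

In this sketch the whole genotype matrix is held in memory while it is transposed; a production tool would likely need a chunked or on-disk strategy for large panels.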
SnpRecode can process data from up to 10 chips, depending on the power of the machine. Although the software includes various checks to detect data preparation oversights, due diligence on quality control and general data preparation should be performed before using SnpRecode. Because Python processes the data one row at a time, the functions run in linear (O(n)) to quasilinear (O(n log n)) time. Plink-formatted files therefore take less time to process than VCF files, which require inversion. Sample size thus remains the main factor limiting run time. Furthermore, to allow the estimation of imputation accuracy, we have implemented a function that allows the user to correlate two VCF files. Most available software does not offer a direct allele-to-allele or genotype-to-genotype correlation; rather, it estimates variant overlap by comparing genomic coordinates (e.g. bedtools, Quinlan et al. 2010), duplicate sites or intersections (e.g. vcf-compare, bcftools isec), or estimates concordance (e.g. snpsift, Cingolani et al. 2012). These tools were mainly created to check, for instance, genomic differences arising from variants being called by different tools. In contrast, SnpRecode estimates the genotype correlation directly and provides the R-square value between the two sets of genotypes. Furthermore, it bins these results by minor allele frequency and produces an informative plot (Figure 1) from which the accuracy of the imputation run can be discerned. A sample script demonstrating the SnpRecode functions is provided as supplementary file 1, and a script showing how to implement them in a pipeline is provided as supplementary file 2.
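The genotype correlation described above can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than SnpRecode's own code: genotypes are taken as alternate-allele counts (0, 1, 2) in the same sample order in both data sets, R-square is the squared Pearson correlation per variant, and the minor allele frequency (MAF) bin edges are arbitrary example values. The names `pearson_r2`, `minor_allele_freq` and `r2_by_maf_bin` are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation; None if either vector has no variance."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def minor_allele_freq(dosages):
    """MAF from alternate-allele counts of diploid individuals."""
    p = sum(dosages) / (2 * len(dosages))
    return min(p, 1 - p)


def r2_by_maf_bin(true_geno, imputed_geno,
                  edges=(0.0, 0.05, 0.1, 0.2, 0.3, 0.5)):
    """Mean per-variant R-square within (lower, upper] MAF bins.

    true_geno / imputed_geno: dict of variant ID -> list of 0/1/2 dosages.
    MAF is computed from the true genotypes; variants absent from the
    imputed set, or monomorphic in either set, are skipped.
    """
    collected = {}
    for vid, truth in true_geno.items():
        imputed = imputed_geno.get(vid)
        if imputed is None or len(imputed) != len(truth):
            continue
        r2 = pearson_r2(truth, imputed)
        if r2 is None:
            continue
        maf = minor_allele_freq(truth)
        for lower, upper in zip(edges, edges[1:]):
            if lower < maf <= upper:
                collected.setdefault((lower, upper), []).append(r2)
                break
    return {b: sum(v) / len(v) for b, v in collected.items()}


true_geno = {"snp1": [0, 1, 2, 1], "snp2": [0, 0, 0, 1]}
imputed_geno = {"snp1": [0, 1, 2, 1], "snp2": [0, 0, 1, 1]}
binned = r2_by_maf_bin(true_geno, imputed_geno)
```

The resulting dictionary of bin means is the kind of summary that could then be passed to a plotting library to draw accuracy against MAF.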