SnpRecode: A Versatile and Fast Genotype Recoding and Correlation Function

Genotype imputation is an essential tool in genomic selection for plants and animals. A popular imputation tool in animal genomics is FImpute. FImpute, however, accepts a specific genotype format and produces dosages whose conversion to VCF or Plink format requires a pipeline of several software packages and a large amount of processing time. We have developed SnpRecode as a helper tool that bridges the gap between regular genotype files and the FImpute imputation software by allowing fast and seamless conversion of genotypes to and from FImpute format. SnpRecode also implements a fast genotype correlation function to estimate and plot imputation accuracy. We ran tests on up to 6,000 samples in steps of 1,000 to determine the performance of SnpRecode at various sample sizes, using runtime and memory usage as performance measures. The performance of SnpRecode was modest at 10 sec per 1,000 samples. Written in the Python programming language, SnpRecode gives users great flexibility in combining it with other software packages in a pipeline.

Main Text
The gradual decline in genotyping costs has led to an increase in the number of genotyped samples across species. In the livestock industry, large-scale genotyping is an essential tool in genomic selection. Studies have shown that it is cheaper to genotype samples with a lower-density chip and impute them to higher densities using large reference panels. A popular, fast, and versatile imputation software used in animal production is FImpute (Sargolzaei et al. 2014). In comparison to imputation software that imputes between a reference and a validation file (Browning et al. 2018; Das et al. 2016; Howie et al. 2009), FImpute is robust because it can impute many individuals genotyped with different SNP chip panels (Figure 1A). Compared to most genotype imputation software, FImpute's algorithm is faster by more than an order of magnitude. However, one can hypothesize that this imputation speed is due in part to FImpute's lack of internal genotype recoding, allelic correlation, and concordance estimation.
The absence of an internal genotype conversion method is of significant concern (Yi Wang 2014) and a bottleneck for most imputation pipelines involving FImpute. To address this, we have developed SnpRecode, which recodes genotypes from Variant Call Format (VCF) and Plink format to FImpute format and recodes them back to VCF or Plink format after the imputation procedure (GDT 2020; Purcell et al. 2007). Written in the Python programming language (v3.6), SnpRecode is highly customizable because the source code is freely available for modification as needed; however, we recommend using the compiled Linux executable.
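The recoding step described above can be illustrated with a minimal sketch. It is not SnpRecode's source code: the function names are hypothetical, and the target coding (0, 1, 2 for the count of the alternate allele and 5 for a missing call) is an assumption about the FImpute genotype format that should be checked against the FImpute manual. The sketch also assumes biallelic sites only.

```python
# Illustrative sketch only; names are hypothetical and not part of SnpRecode.
# Assumption: FImpute-style coding is 0/1/2 (alternate allele count), 5 = missing.
MISSING_CODE = "5"


def gt_to_code(gt_field: str) -> str:
    """Map one biallelic VCF sample field (e.g. '0/1', '1|1:35', './.') to a digit."""
    gt = gt_field.split(":")[0]            # keep only the GT subfield
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return MISSING_CODE                # missing, haploid, or multiallelic call
    return str(int(alleles[0]) + int(alleles[1]))


def transpose(variant_rows):
    """VCF holds one variant per row; return one row per sample instead."""
    return [list(column) for column in zip(*variant_rows)]


def recode_samples(variant_rows):
    """Return one digit string per sample, ordered as the VCF sample columns."""
    return ["".join(gt_to_code(g) for g in sample) for sample in transpose(variant_rows)]


if __name__ == "__main__":
    # Two variants (rows) by three samples (columns), as in a VCF body.
    rows = [["0/0", "0/1", "./."],
            ["1|1", "0/1:20", "0/0"]]
    print(recode_samples(rows))
```

Recoding in the opposite direction, from digits back to VCF or Plink genotypes, would invert the same mapping once the reference and alternate alleles of each site are known.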
SnpRecode can process data from up to 10 chips, depending on the power of the machine. Various checks within the software detect data preparation oversights; however, it is better to perform quality control and general data preparation before using SnpRecode. Because Python processes the data one row at a time, SnpRecode runs in linear (O(n)) to quasilinear (O(n log n)) time. This implies that Plink-formatted files take less time to process than VCF files, which require inversion, and that sample size therefore remains the main time-limiting factor. Furthermore, to allow the estimation of imputation accuracy, SnpRecode lets the user correlate two VCF files for a direct genotype correlation estimate and reports the R-squared value between the two sets of genotypes. It then bins this result by minor allele frequency and produces an informative plot from which one can judge the accuracy of the imputation run (Figure 1B). In contrast, most genotype comparison software estimates variant overlap by comparing genomic coordinates (e.g., bedtools), checks duplicate sites or intersections (e.g., vcf-compare, bcftools isec), or estimates concordance (e.g., snpsift) (Cingolani et al. 2012; Li 2011; Quinlan and Hall 2010). Supplementary file 1 illustrates the functions of SnpRecode, and supplementary file 2 illustrates how to incorporate SnpRecode in an FImpute pipeline.
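As a rough illustration of the accuracy estimate described above, the following sketch computes a per-variant squared Pearson correlation between true and imputed dosages and averages it within minor allele frequency bins. It is a sketch under stated assumptions, not SnpRecode's implementation: the function names, the 0/1/2 dosage input, the choice to compute the frequency from the true genotypes, and the default bin width are all hypothetical.

```python
# Illustrative sketch only; names and defaults are hypothetical, not SnpRecode's.
from collections import defaultdict


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length dosage lists, or None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None                        # monomorphic variant: correlation undefined
    return (sxy * sxy) / (sxx * syy)


def minor_allele_frequency(dosages):
    """MAF from 0/1/2 dosages of diploid samples."""
    p = sum(dosages) / (2 * len(dosages))
    return min(p, 1 - p)


def mean_r2_by_maf_bin(true_rows, imputed_rows, bin_width=0.05):
    """Average per-variant R-squared within MAF bins; keys are bin lower bounds."""
    last_bin = round(0.5 / bin_width) - 1  # MAF of exactly 0.5 goes in the top bin
    bins = defaultdict(list)
    for true, imputed in zip(true_rows, imputed_rows):
        r2 = r_squared(true, imputed)
        if r2 is None:
            continue                       # skip variants without variation
        index = min(int(minor_allele_frequency(true) / bin_width), last_bin)
        bins[index].append(r2)
    return {round(k * bin_width, 10): sum(v) / len(v) for k, v in sorted(bins.items())}


if __name__ == "__main__":
    true_rows = [[0, 1, 2, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
    imputed_rows = [[0, 1, 2, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
    print(mean_r2_by_maf_bin(true_rows, imputed_rows, bin_width=0.1))
```

The resulting dictionary of bin lower bounds and mean R-squared values is the kind of summary that could be passed to a plotting library to draw an accuracy curve across the frequency spectrum.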

Conclusions
We have implemented a highly optimizable tool that integrates seamlessly into an FImpute pipeline. The tool allows one to convert genotypes to and from FImpute format and to detect data consistency errors, thus streamlining such a pipeline. Additionally, the tool allows one to quickly establish the goodness of fit of an imputation run by estimating genotype correlations between real and masked genotypes.

Declarations
Author Disclosure Statement
No competing financial interests exist.