Accuracy of Imputation of Microsatellite Markers from an SNP50K Chip in Spanish Assaf Sheep

Background: Traditional and new genotyping technologies must be combined by applying bridge methodologies that avoid double genotyping costs. This study aims to identify and evaluate a reliable approach to precisely impute microsatellite markers from SNP-chip panels to perform parental verications in sheep. Moreover, we will assess the optimum number of SNPs necessary to accurately impute microsatellite markers to develop a low-density SNP chip for parentage verication in the Assaf sheep breed. Results: A total of 4,423 animals belonging to the Spanish Assaf sheep breed were genotyped for 19 microsatellites and an ovine custom 49,897 SNP array. The accuracy of microsatellite marker imputation, performed with BEAGLE v5.1 software, was assessed with three metrics, namely, genotype concordance (C), genotype dosage (length r 2 ), and allelic dosage (allelic r 2 ), for all imputation scenarios tested (0.5-10 Mb microsatellite anking SNP windows). The accuracy of our imputation results for the three metrics analyzed for all haplotype lengths tested was higher than 0.90 (C), 0.80 (length r 2 ), and 0.75 (allelic r 2 ). Considering that the objective of the study was to assess a SNP window length that provides the best accuracy for the microsatellite imputation procedure to design an affordable low-density SNP chip for parentage testing, we considered 2 Mb to be the best SNP haplotype length for further analyses (SNPs/window =74.05, C= 0.970; length r 2 = 0.952, allelic r 2 =0.899). We additionally evaluated imputation performance under two null models, naive and random, which showed weak genotype concordance averages in comparison with imputed microsatellites (0.41 and 0.15, respectively). Conclusions: We presented for the rst time a precise methodology in dairy sheep to impute multiallelic microsatellite genotypes from biallelic SNP markers. The use of a 2 Mb SNP anking window for each microsatellite has been shown to achieve high accuracy in The genetic proles of 4,423 animals included in the breeding program of Spanish Assaf dairy sheep were obtained from the Association of Spanish Assaf Sheep Breeders (ASSAFE). These animals were genotyped for the 19 microsatellites recommended by ISAG for paternity control (14), and by an SNP chip with 49,897 markers used in the genomic selection program implemented in the Spanish Assaf sheep breed. Prior to the imputation process, quality control was applied to both sets of markers. We ltered out microsatellites with call rates below 80% and expected heterozygosity under Hardy-Weinberg equilibrium below 0.095. This value corresponds to a minor allele frequency (MAF) of 5% for biallelic markers. After quality control, microsatellite alleles were recoded to t the variant call format (VCF) following the VCFtools software specications (18). The most common allele for each microsatellite was considered the reference allele in the population and recoded as “0”. For the rest of the alleles, a consecutive number was assigned (1,2,3…n) based on the microsatellite allele length. SNP-chip quality control was performed using PLINK (19), and SNPs with call rates under 95% were excluded from the dataset. To maintain haplotype diversity in the population, MAF ltering is not included in the SNP quality control. This study presents an effective methodology to the problem presented in the transition from multiallelic (microsatellites) to biallelic markers (SNPs) for pedigree verication analyses in sheep. The use of a anking 2 Mb SNP window for each microsatellite has been shown to achieve high accuracy in the imputation procedure while providing a cost-effective, low-density SNP chip for breeders. The microsatellite imputed information could be used for individual identication and parentage verication in sheep, postulating a useful approach in the sheep industry to avoid double genotyping.


Background
Parentage misassignments directly affect genetic gain in traditional breeding programs by biasing heritability estimates of productive traits, genetic parameters, breeding values, and the identi cation of superior animals for selection (1)(2)(3). Therefore, accurate pedigree records are essential for the success of genetic improvement in livestock.
The use of molecular markers, speci cally genetic markers, facilitates parentage veri cation and individual identi cation by indicating, through different approaches (simple exclusion, genotype reconstruction, or categorical and fractional allocation), the putative relatedness between individuals (4,5). In this sense, microsatellite variants have become one of the principal molecular markers used in livestock in recent decades for parentage testing. Microsatellites, also known as short tandem repeats (STRs) or simple sequence repeats (SSRs), consist of motifs of 1-6 bp repeated in tandem and represent choice markers for parentage testing in livestock due to their high polymorphic information content with codominant inherited alleles and easy but not fully automatized allele scoring (6).
At present, microsatellite information for parentage veri cation tests is being gradually replaced by single nucleotide polymorphisms (SNPs). Although SNPs are less informative due to their biallelic nature, which determines the range of markers required for parentage testing (200-700 SNPs compared to 14-20 microsatellites) (7), there is increasing interest in using SNP panels in livestock. The advantages of SNP panels include the more straightforward automation of technology, the lack of need for interlaboratory calibration, lower error rates, the uniform distribution of SNP markers across the genome, and the recently reduced costs in genotyping technology (8)(9)(10). Moreover, SNP panels are increasingly used in livestock due to the implementation of genomic selection in breeding schemes (11)(12)(13).
In the case of sheep, there are two strategies for parentage testing proposed by the International Society of Animal Sciences (ISAG): a panel of 19 microsatellites (14) and a panel of 163 SNPs with veri ed qualities to use in diverse sheep breeds (7). Notably, in the Spanish Assaf sheep, most of the animals in the selection scheme are genotyped with microsatellite markers. Therefore, the need for a consistent and reliable pedigree database across generations has made the use of microsatellite information an essential issue. However, since the implementation of genomic selection, with the rst genomic evaluation results obtained in 2020, there has been an annual increase in the number of animals genotyped with a 50K SNP panel. Some of these new animals are genotyped with both platforms: SNPs and microsatellites. Parentage veri cation should be performed using the same technology applied in previous generations, implying additional costs for farmers and breeders' associations. One possible strategy in the migration process from microsatellites to SNP panels to avoid double costs in genotyping is the imputation of microsatellite alleles from SNP haplotypes (15). Therefore, this study aims to identify and evaluate a reliable approach to be used during the transition period to accurately impute microsatellite markers from SNP-chip panels to perform parental veri cations in sheep. Moreover, we will evaluate the optimum number of SNPs necessary to accurately impute microsatellite markers to develop a low-density SNP panel for parentage veri cation.

Methods
Animal genotypes and quality control The genetic pro les of 4,423 animals included in the breeding program of Spanish Assaf dairy sheep were obtained from the Association of Spanish Assaf Sheep Breeders (ASSAFE). These animals were genotyped for the 19 microsatellites recommended by ISAG for paternity control (14), and by an SNP chip with 49,897 markers used in the genomic selection program implemented in the Spanish Assaf sheep breed.
Prior to the imputation process, quality control was applied to both sets of markers. We ltered out microsatellites with call rates below 80% and expected heterozygosity under Hardy-Weinberg equilibrium below 0.095. This value corresponds to a minor allele frequency (MAF) of 5% for biallelic markers. After quality control, microsatellite alleles were recoded to t the variant call format (VCF) following the VCFtools software speci cations (18). The most common allele for each microsatellite was considered the reference allele in the population and recoded as "0". For the rest of the alleles, a consecutive number was assigned (1,2,3…n) based on the microsatellite allele length. SNP-chip quality control was performed using PLINK (19), and SNPs with call rates under 95% were excluded from the dataset. To maintain haplotype diversity in the population, MAF ltering is not included in the SNP quality control.

Imputation procedure
The positions of the microsatellite markers in the ovine genome (Oar_v3.1) were obtained from the sheep database from Ensembl v.95 (https://www.ensembl.org/Ovis_aries/ Info/Index), and they were veri ed through the alignment of the primer sequences to the sheep reference genome using BLAST (16). The genotypes were phased and imputed using the phasing method implemented in the BEAGLE 5.1 software (20) [50 rounds of burn-in and 100 iterations] and the genotype imputation method (21), of the same program. To establish the minimum number of anking SNPs per microsatellite and the optimal window length to achieve accurate imputation, several SNP window distances on each side of the marker were considered [0. 5,1,2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,45, 50 Mega bases (Mb)]. In the genotype imputation process, pedigree information and the effective population size (Ne) were also considered.
Given that the animals were genotyped for microsatellite and SNP markers, the imputation performance was estimated by a 10-fold crossvalidation approach. For this purpose, we divided the total population into two groups: the training group, which comprises 90% of the total population, and the validation group, which comprises the remaining 10%. The microsatellite information was masked in the validation population, and the genotypes for these markers were imputed by Beagle software using all information (microsatellite and SNP genotypes and pedigree relationships from the reference dataset and the genotypes of SNPs in the validation dataset). The process was repeated for ten rounds, using different animals in the validation dataset in each round following a nonparametric bootstrap of 10% of the total samples using a custom Fortran source code.

Imputation performance metrics
To assess the accuracy of the microsatellite imputation, we used the metrics genotype concordance, genotype dosage, and allelic dosage, which were previously de ned by Saini et al. (17). The genotype concordance (c i ) was de ned as 0 if neither of the imputed alleles matched a true allele, 0.5 if one of them matched, and 1 if both alleles matched the true alleles. Thus, the genotype concordance for a microsatellite (C) was calculated as the average over all the samples of c i for each microsatellite The microsatellite genotype dosage [aka length r 2 ] was de ned as the Pearson correlation between the sum of the lengths of both alleles at a speci c locus in imputed genotypes and the true genotypes. Last, the microsatellite allelic dosage [aka allelic r 2 ] was calculated as the Pearson correlation for each microsatellite allele length α, de ned as In addition, Pearson's correlations between the frequencies of the reference microsatellite alleles and the imputed microsatellite alleles were calculated. Furthermore, for each microsatellite, the imputation performance was evaluated by computing the expected value for each metric under two null models: (1) a naive model in which the imputed genotype was selected as the most common allele per microsatellite and (2) a random model in which the imputed genotype was randomly selected from the available genotypes at each marker, depending indirectly on the allelic frequencies (17).
Population structure, effective population size, and parental relationships Finally, we used SNP chip information to evaluate different factors affecting imputation accuracies, such as population structure, effective population size, and parental-progeny pedigree con icts. To assess the population structure, we estimated the genomic relationship matrix (GRM), and the genomic relationships between the individuals were plotted following the Pedigromics pipeline (22), calculating centrality metrics such as betweenness and closeness coe cients. The effective population size and the parental-progeny con icts in the pedigree were computed using the BLUPF90 family programs (23).

Genotype quality control
All microsatellite markers passed the quality control settings xed in the analysis. Regarding SNP markers, a total of 3,537 markers showed a call rate lower than 95% and were ltered out. Therefore, a total of 19 microsatellites and 42,665 SNPs were considered for the imputation procedure. The microsatellites were located along the sheep autosomes, and on average, each microsatellite included in this study had 12.73 alleles ranging from 81 to 297 base pairs. The information of the microsatellites considered in this work is summarized in Table 1, and their allele frequencies are presented in Table S1. Imputation procedure results To impute the whole population considered by this study, we performed a 10-fold cross-validation approach, as explained above. Therefore, ten imputation procedures were necessary to estimate the microsatellite information in the whole population. Moreover, we assessed the accuracy (concordance, genotype dosage, and allelic dosage) of the imputed microsatellite markers in the proposed imputation scenarios (window lengths from 0.5 Mb to 50 Mb) to determine the best haplotype length for the imputation procedure (Fig. 1).
There was a noticeable increase in the imputation accuracy of the microsatellite markers for the average accuracy metrics when 0.5 Mb (SNPs/window = 19.11, C = 0.922; length r 2 = 0.890, allelic r 2 = 0.788), 1 Mb (SNPs/window = 38.05, C = 0.962; length r 2 = 0.941, allelic r 2 = 0.878) and 2 Mb (SNPs/window = 74.05, C = 0.970; length r 2 = 0.952, allelic r 2 = 0.899) window lengths were compared ( Fig. 1; Table S2). Considering a 3 Mb window length, the addition of new information provided by the SNPs localized in the surrounding windows (> 100 SNPs) slightly improved the imputation accuracy metrics (C = 0.972; length r 2 = 0.957, allelic r 2 = 0.901), while a stabilization in the imputation accuracy was observed for wider windows (Fig. 1). The number of anking SNPs used in the imputation process for the tested window distances is summarized in Table S3. Considering that the objective of our work was to assess a SNP window length that provides optimum accuracy for the microsatellite imputation procedure to design an affordable low-density SNP chip to be used for parentage testing by breeders, we considered 2 Mb to be the best haplotype for further analyses. The imputation metrics (concordance, genotype dosage, and allelic dosage) for the 2 Mb scenario are summarized in Table 2. The distribution of the allelic r 2 values is represented in Figure S1. Using a 2 Mb SNP haplotype, the Pearson correlation between the real microsatellite allele frequency in the population and the frequency of the imputed alleles in these markers was 1.00. These frequencies are represented in Figure S2.
To validate our imputation results, we analyzed the imputation performance under two null imputation models: naive imputation, in which imputed genotypes showed an average concordance of 0.41 (ranging from 0.26 to 0.67) with observed genotypes, and random imputation, which had an average concordance of 0.15 (ranging from 0.06 to 0.60). Both validation procedures revealed considerably fewer concordance values using the two null imputation models than the imputation method proposed in this study, which validates our approach. Population structure and effective population size Figure 2 presents the population structure of the 4,423 animals included in the study using the GRM created with the 42,665 SNPs remaining after quality control ltering. Individuals are represented as nodes in the network, and two animals are connected by an edge when a prede ned genomic kinship exists, e.g., parent-offspring. Those animals not related to the main population were ltered in the representation. Genomic relationship higher than 0.2 and 0.5 were represented in Fig. 2 and Figure S3, respectively. The Pedigromics approach of the Assaf breed showed low average values of the betweenness centrality coe cient (0.003) and closeness coe cient (0.237), both ranging between 0 and 1. Centrality coe cients re ect the in uence of each vertex over the graph structure. In this case, closeness centrality is based on the average length of the shortest paths from a given node to other reachable nodes in the network (24), given how genomic information is spread in the population (25). The betweenness centrality coe cient re ects the amount of control that a node exerts over the interactions with other nodes in the network. Animals with high betweenness centrality in a pedigree graph could have a role in connecting disconnected groups (25). The low average values of the betweenness centrality and closeness coe cient suggest a low relationship among the samples included in the population studied. However, 21% of the animals had a closeness coe cient higher than the third quartile of the value distribution (0.24), which is represented on a green to a red color scale in Fig. 2. These samples are distributed in eight related family groups (as shown in Fig. 2). The low degree of relationship between these groups and the rest of the animals suggests that the population is neither highly related nor structured. Moreover, we estimated the effective population size of the studied Assaf population, which was 214 animals.

Parental relationships
The pedigree records available for the Assaf population under study integrated 1,450 parental-progeny relationships that could be con rmed with the SNP information to detect parental-progeny con icts. A total of 24 misassignments were found in the pedigree, representing a total of 1.66% of all the parental relationships analyzed in the pedigree.

Discussion
This research presented for the rst time a precise methodology in sheep to impute multiallelic genotypes from biallelic information. Traditional and new genotyping technologies must be joined by applying bridge methodologies, which allow breeders to avoid additional costs of regenotyping historical data. Our study combines microsatellite and SNP markers in an e cient approach to impute microsatellite markers through SNP haplotypes, achieving high concordance rates. Therefore, the imputation procedure developed represents a useful and inexpensive approach to perform parentage veri cation when different genotyping platforms have been used across generations. The results from this study will undoubtedly have a great impact on Assaf sheep breeders, allowing them to perform a transition from microsatellite maker kinship veri cation to the use of SNP panels (26). In addition to constituting a clear advantage for sheep producers, the imputation methodologies developed can provide advantages in genomic studies by combining both types of data, such as in genome-wide association analyses (GWAS). In this approach, microsatellite information could improve the detection of new associations, provide complementary information, and explain part of the missing heritability for the trait under study (17).
In general, as shown in Fig. 1, the accuracy of our imputation results for the three metrics analyzed (C, length r 2 , and allelic r 2 ) in the different scenarios tested (SNP windows ranged between 0.5 and 10 Mb) was higher than 0.90 (C), 0.80 (length r 2 ), and 0.75 (allelic r 2 ) for all haplotype lengths. The accuracy results presented in this study were higher than those found in a previous study performed in cattle by The number of haplotypes per microsatellite and the frequency of these haplotypes did not signi cantly impact the allele dosage, with correlations of 0.33 and 0.18, respectively. Therefore, as the number of alleles and their frequency increases, the concordance tends to rise. However, the naive and random models' concordance rate decreased as the number of alleles increased because they depended on the number of haplotypes of each microsatellite (correlations were − 0.45 and − 0.70).
The imputation accuracies obtained might be overestimated due to (i) a highly structured and related population (27) or due to (ii) a low effective population size (28). On the one hand, the population included in the present work, represented using the Pedigromics pipeline (Fig. 2), achieved low rates of centrality coe cients (betweenness coe cient = 0.003 and closeness coe cient = 0.237), which suggests that the population is nonstructured or highly related. In addition, the selection of the reference and test populations during cross-validation by a nonparametric bootstrap approach avoids the overestimation of the imputation metrics by avoiding the selection of immediate relative samples in the different groups. On the other hand, the effective population size was 214, higher than in highly selected cattle breeds (26), but in the wide range of effective population sizes described in sheep breeds from values of 78 in Romney, 100 in Wiltshire breed, 128 in Churra breed, to 1,317 in Qezel (29)(30)(31). Lower values can lead to an overestimation of imputation accuracy metrics; however, if we compared our concordance rates with the concordance rates obtained in the microsatellite imputation in cattle carried out by Sharma et al. (26), we achieved more than double the concordance (0.90 vs. 0.40). Small population sizes reduce the genetic diversity in the population (32) and would in uence the naive and random models' concordance rates, increasing their accuracy parameters. Nevertheless, the average of the naive and random concordance rates for these two models (0.41 and 0.15, respectively) was far lower than those obtained in humans by Saini et al. (17), [0.72 and 0.61, respectively]. This difference between the imputation accuracy and the accuracy of the null models could be because the effective population size and the genetic diversity of the Assaf population analyzed are large enough to perform an accurate imputation of the microsatellite information. In particular, high genetic diversity in the reference population would help achieve high squared correlations in the imputation process (10,27,33) and reduce the probability of accurate imputations in the naive and random models. Therefore, the large number of samples included in this study, and as a consequence, the large number of individuals genotyped in the reference population, could in uence the high accuracy rates achieved because it is necessary to impute the odd haplotypes (28) accurately and could also reduce the concordance rates obtained in the null models. Therefore, this nding explained the higher concordance rates obtained than those in previous studies on microsatellite imputation from SNP data conducted with lower sample sizes in humans (17 (Table S4) would also help to reduce the number of kinship errors in the pedigree due to its lower error rates compared with microsatellite markers and the lack of need for interlaboratory calibration and easier automation (8)(9)(10).

Conclusion
This study presents an effective methodology to overcome the problem presented in the transition from multiallelic (microsatellites) to biallelic markers (SNPs) for pedigree veri cation analyses in sheep. The use of a anking 2 Mb SNP window for each microsatellite has been shown to achieve high accuracy in the imputation procedure while providing a cost-effective, low-density SNP chip for breeders. The

Declarations
Ethics approval and consent to participate As the data were obtained from the Spanish Assaf breeders association (ASSAFE) database, no direct experimentation on animals was performed in this work. According to the Research Ethics Committee of the University of León, formal ethical approval was not necessary for this case.

Consent for publication
Not applicable.

Availability of data and materials
Supplementary material related to this article can be found in the online version. The genotype datasets that support the results of this study belong to the Spanish Assaf breeders association (ASSAFE) and should be requested from that association.  Population structure using the Pedigromics approach. Genomic relationships (>0.2) among the individuals are displayed. Each node represents one animal from the population. The color and the size of the nodes are based on the closeness coe cient, on a green to a red color scale, with the higher values represented by a large size and red color.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.