This research presents, for the first time in sheep, a precise methodology to impute multiallelic genotypes from biallelic information. Traditional and new genotyping technologies must be joined by bridge methodologies that allow breeders to avoid the additional cost of regenotyping historical data. Our study combines microsatellite and SNP markers in an efficient approach that imputes microsatellite markers from SNP haplotypes and achieves high concordance rates. The imputation procedure developed therefore represents a useful and inexpensive way to perform parentage verification when different genotyping platforms have been used across generations. The results from this study will undoubtedly have a great impact on Assaf sheep breeders, allowing them to move from microsatellite marker kinship verification to the use of SNP panels (26). In addition to constituting a clear advantage for sheep producers, the imputation methodologies developed can benefit genomic studies that combine both types of data, such as genome-wide association analyses (GWAS). In this setting, microsatellite information could improve the detection of new associations, provide complementary information, and explain part of the missing heritability for the trait under study (17).
In general, as shown in Fig. 1, the accuracy of our imputation results for the three metrics analyzed (C, length r2, and allelic r2) in the different scenarios tested (SNP windows ranging from 0.5 to 10 Mb) was higher than 0.90 (C), 0.80 (length r2), and 0.75 (allelic r2) for all haplotype lengths. These accuracies were higher than those found in a previous study performed in cattle by Sharma et al. (26), which reached a concordance of 0.40 and a correlation between the real and imputed microsatellites of 0.31. In addition, we explored not only the viability of microsatellite imputation but also the optimum number of SNPs necessary to impute microsatellite information accurately. According to Strucken et al. (7), 700 SNP markers are required to reduce false-positive results in parentage testing, which in our approach corresponds to an SNP haplotype length of 1 Mb, covering 38.05 SNPs per microsatellite with adequate imputation accuracy rates (C = 0.962, length r2 = 0.941, allelic r2 = 0.878). However, the imputation performance reached high accuracy values at an SNP haplotype length of 2 Mb: 0.97 (C), 0.95 (length r2), and 0.90 (allelic r2), so that all accuracy metrics were at least 0.90. The SNPs located in the 2 Mb window used in the imputation procedure are summarized in Table S4. These results matched or slightly exceeded those obtained by Saini et al. (17), who achieved a genotype concordance of 0.97, a genotypic dosage of 0.91, and an allelic dosage of 0.86. In our study, the accuracy metrics were obtained using a 50 k SNP chip in sheep, whereas Saini et al. (17) used SNP data from whole-genome sequencing (27,185,239 SNPs) with an SNP window of 100 Kb in humans. Moreover, the concordance rates of the null models obtained by Saini et al. (17) [naive (0.72) and random (0.61)] were higher than those obtained in the present study [naive (0.41) and random (0.15)].
This highlights the genetic diversity of the microsatellite markers in sheep and the high efficiency of the imputation procedure presented in this work.
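The comparison between imputation accuracy and the null models can be illustrated with a minimal sketch. The exact metric definitions are not restated in this section, so the following are assumptions: concordance (C) is taken as the fraction of samples whose imputed genotype matches the observed genotype, length r2 as the squared Pearson correlation between the observed and imputed sums of allele lengths, and the naive model as assigning the most common reference genotype to every test sample. The function names and the genotypes are hypothetical and serve only as an example.

```python
from collections import Counter
from statistics import mean


def concordance(true_gts, imputed_gts):
    """Fraction of samples whose imputed genotype equals the observed one.

    Genotypes are pairs of allele lengths; allele order is ignored.
    """
    matches = sum(sorted(t) == sorted(i) for t, i in zip(true_gts, imputed_gts))
    return matches / len(true_gts)


def squared_pearson(x, y):
    """Squared Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def length_r2(true_gts, imputed_gts):
    """Assumed definition: r2 between observed and imputed allele-length sums."""
    return squared_pearson([sum(t) for t in true_gts],
                           [sum(i) for i in imputed_gts])


def naive_impute(reference_gts, n_test):
    """Assumed naive null model: most common reference genotype for every sample."""
    counts = Counter(tuple(sorted(g)) for g in reference_gts)
    most_common = counts.most_common(1)[0][0]
    return [most_common] * n_test


# Hypothetical genotypes (allele lengths in bp), for illustration only.
observed = [(120, 124), (120, 120), (124, 128), (120, 124)]
imputed = [(124, 120), (120, 120), (124, 124), (120, 124)]

print(concordance(observed, imputed))
print(length_r2(observed, imputed))
print(concordance(observed, naive_impute(observed, len(observed))))
```

Under these assumptions, a marker with many alleles at intermediate frequencies leaves the naive model with few correct calls, which is why a wide gap between the imputed and null concordance rates points to an informative imputation.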
The number of haplotypes per microsatellite and the frequency of these haplotypes did not significantly impact the allele dosage, with correlations of 0.33 and 0.18, respectively. Both correlations were nevertheless positive, so the concordance tends to rise as the number of alleles and their frequency increase. In contrast, the concordance rates of the naive and random models decreased as the number of alleles increased, because these models depend on the number of haplotypes of each microsatellite (correlations of −0.45 and −0.70, respectively).
The imputation accuracies obtained might be overestimated due to (i) a highly structured and related population (27) or (ii) a low effective population size (28). On the one hand, the population included in the present work, represented using the Pedigromics pipeline (Fig. 2), showed low centrality coefficients (betweenness coefficient = 0.003 and closeness coefficient = 0.237), which suggests that the population is neither highly structured nor highly related. In addition, selecting the reference and test populations during cross-validation by a nonparametric bootstrap approach limits the overestimation of the imputation metrics because it avoids placing immediate relatives in the different groups. On the other hand, the effective population size was 214, higher than in highly selected cattle breeds (26) but within the wide range of effective population sizes described in sheep breeds, from 78 in Romney, 100 in Wiltshire, and 128 in Churra to 1,317 in Qezel (29–31). Lower values can lead to an overestimation of imputation accuracy metrics; however, compared with the concordance rate obtained for microsatellite imputation in cattle by Sharma et al. (26), ours was more than double (0.90 vs. 0.40). Small population sizes reduce the genetic diversity in the population (32) and would increase the concordance rates of the naive and random models. Nevertheless, the average concordance rates of the naive and random models (0.41 and 0.15, respectively) were far lower than those obtained in humans by Saini et al. (17) (0.72 and 0.61, respectively). This gap between the imputation accuracy and the accuracy of the null models could arise because the effective population size and the genetic diversity of the Assaf population analyzed are large enough to allow accurate imputation of the microsatellite information.
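To make the interpretation of the closeness coefficient more concrete, the following minimal sketch computes closeness centrality on a small undirected pedigree graph in which edges link parents and offspring. It is not the Pedigromics implementation, whose exact definitions are not given here; the scaling by the fraction of reachable animals is an assumption chosen so that disconnected pedigrees are handled, and the animal identifiers are hypothetical.

```python
from collections import deque


def closeness_centrality(nodes, edges):
    """Closeness centrality for each node of an undirected graph.

    Assumed definition: (r / (n - 1)) * (r / sum of shortest-path distances),
    where r is the number of nodes reachable from the node and n is the
    total number of nodes. Isolated nodes get 0.0.
    """
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    n = len(adjacency)
    result = {}
    for source in adjacency:
        # Breadth-first search gives shortest-path distances from the source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
        reachable = len(distance) - 1
        total = sum(distance.values())
        if reachable == 0 or n <= 1:
            result[source] = 0.0
        else:
            result[source] = (reachable / (n - 1)) * (reachable / total)
    return result


# Hypothetical pedigree: one sire and one dam with two offspring,
# plus one animal with no recorded relatives.
animals = ["sire", "dam", "lamb1", "lamb2", "unrelated"]
parent_offspring = [("sire", "lamb1"), ("dam", "lamb1"),
                    ("sire", "lamb2"), ("dam", "lamb2")]

scores = closeness_centrality(animals, parent_offspring)
print(scores)
print(sum(scores.values()) / len(scores))
```

In this toy example, the animals of the single family cluster obtain identical, relatively high scores and the unrelated animal scores zero; a low population-wide average would then correspond to animals that are, on average, far apart in the pedigree network.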
In particular, high genetic diversity in the reference population would help achieve high squared correlations in the imputation process (10, 27, 33) and reduce the probability of accurate imputations in the naive and random models. The large number of samples included in this study and, as a consequence, the large number of individuals genotyped in the reference population could therefore have contributed to the high accuracy rates achieved, because a large reference population is necessary to impute rare haplotypes accurately (28), and could also have reduced the concordance rates obtained in the null models. This would explain why the concordance rates obtained here were higher than those in previous studies on microsatellite imputation from SNP data conducted with smaller sample sizes in humans (17) [1,916 samples] and cattle (26) [1,482 samples].
Last, the development of a low-density SNP panel with the 1,407 SNPs (2 Mb SNP window) proposed in this approach (Table S4) would also help to reduce the number of kinship errors in the pedigree, because SNP markers have lower error rates than microsatellite markers, do not require interlaboratory calibration, and are easier to automate (8–10).