Systematic comparison of genotype imputation strategies in aquaculture: a case study in Nile tilapia (Oreochromis niloticus) populations

Genotype imputation is an attractive approach for obtaining whole genome sequencing (WGS) data at low cost. However, the usefulness of imputed WGS data depends mainly on imputation accuracy. Understanding how to balance the factors that influence imputation accuracy is therefore necessary, especially in aquaculture. In the present study, we downloaded whole genome re-sequencing data of 361 Nile tilapia to construct different reference panels for genotype imputation and systematically determined the impact of several key factors on imputation accuracy, including the reference panel type, the haplotype phasing and imputation software, the reference panel size, the key individual selection strategy, and the composition of the combined reference panel. Imputation accuracy did not differ significantly (P = 0.3) among pre-phasing data obtained from Beagle5, Eagle2, and Shapeit4, but Beagle5 had the highest computational efficiency. Among imputation software, Beagle5 and Impute5 were more suitable for combined and external reference panels of large size, whereas Minimac4 was suitable for internal reference panels, especially small ones. Furthermore, increasing the reference panel size generally improved imputation accuracy, but a larger reference panel did not necessarily result in higher accuracy. When the number of external individuals increased from 5 to 250, the average imputation accuracy of the combined reference panel decreased from 0.942 to 0.899 for Minimac4 but always remained higher than that of the internal reference panel (0.866). Compared with minimizing the average distance to the closest leaf (ADCL) and randomly selecting individuals (RAN), maximizing the expected genetic relationship (REL) always gave slightly higher accuracy when selecting key individuals to construct an internal reference panel. However, selection strategies yielded zero or negative gains in imputation accuracy when used to select internal or external individuals for a combined reference panel. In conclusion, using a combined reference panel provided greater imputation accuracy, but the optimal genotype imputation strategy needs to be balanced carefully and comprehensively against the actual situation. This work sheds light on how to design and execute genotype imputation in aquaculture.


Introduction
Genomic selection (GS) and genome-wide association studies (GWAS) have been widely implemented to accelerate the genetic improvement of economically important traits in aquaculture using single nucleotide polymorphism (SNP) array or restriction-site-associated DNA sequencing (RAD-seq) data (Yáñez et al., 2022). Both GS and GWAS rest on the assumption that each quantitative trait locus (QTL) is in strong linkage disequilibrium (LD) with at least one SNP marker. However, most SNP genotyping arrays and RAD-seq capture only a small portion of whole-genome variation, and the estimated LD between SNPs decays rapidly as marker distance increases (Khatkar et al., 2008). Therefore, it is essential to increase marker density, even to the whole genome sequencing (WGS) level, to improve the accuracy of genomic breeding value estimation in GS and the statistical power of GWAS.
With the rapid development of high-throughput sequencing technology, the cost of WGS has fallen dramatically, but it is still too costly to sequence a large number of individuals in aquaculture breeding programs. Genotype imputation is a cost-effective solution that infers missing or ungenotyped markers from the haplotype information in a reference panel (Li et al., 2009). Genotype imputation has become an indispensable part of GWAS, facilitating the identification of causal variants and genes associated with complex traits and diseases, especially in human (Marchini and Howie, 2010), cattle (Daetwyler et al., 2014; Xiang et al., 2021), and pig (Ding et al., 2023). This is because large and highly diverse haplotype reference panels are available for these species, such as the 1000 Genomes Project (Genomes Project et al., 2015), the International HapMap Project (Altshuler et al., 2010), the UK10K Project (Walter et al., 2015), the 1000 Bull Genomes consortium (Hayes et al., 2012), and the swine imputation (SWIM) haplotype reference panel. Although no high-quality public imputation reference panels exist for most aquaculture species, genotype imputation has been successfully implemented with self-constructed reference panels in several of them, such as Atlantic salmon (Tsairidou et al., 2020; Yoshida et al., 2018), large yellow croaker (Zhang et al., 2021), rainbow trout (Sanchez-Roncancio et al., 2022; Yoshida and Yanez, 2022), European sea bass (Delpuech et al., 2023), and Nile tilapia (Yoshida and Yáñez, 2021). For example, using imputed WGS data, a major QTL for resistance to viral nervous necrosis, which includes two important candidate genes (ZDHHC14 and IFI6/IFI27-like), was identified in European sea bass (Delpuech et al., 2023). Whether imputed WGS data can be successfully applied in GWAS to identify causal variants and genes depends mainly on the imputation accuracy.
Previous studies have shown that many factors affect the accuracy of genotype imputation, including the pre-phasing and imputation algorithms (De Marino et al., 2022), the reference panel size (Garcia et al., 2022), the genetic relationship between the reference and validation populations (Garcia et al., 2022), and the minor allele frequency (MAF) (Ye et al., 2019). Many strategies have been proposed to improve imputation accuracy in livestock and human, including choosing the optimal combination of pre-phasing and imputation software (De Marino et al., 2022; Ding et al., 2023), selecting key individuals to construct the reference panel (Ye et al., 2018), and adding publicly available data to expand the reference panel (Ye et al., 2019). The effects of most of these factors on imputation accuracy have also been demonstrated in aquaculture, but how to balance them to improve imputation accuracy is still rarely reported.
In this study, we downloaded whole genome re-sequencing data of 361 Nile tilapia from five BioProjects in ENA or NCBI to construct different reference panels for genotype imputation. We systematically investigated the effects of the reference panel type, the haplotype phasing and imputation software, the reference panel size, the key individual selection strategy, and the composition of the combined reference panel on the accuracy of genotype imputation from SNP chip data to WGS data. Our results provide insight into the design and execution of genotype imputation in aquaculture.

Material and methods
Whole genome re-sequencing data collection
Whole genome re-sequencing data of 361 Nile tilapia were downloaded from five BioProjects in ENA or NCBI, including 280 samples from PRJNA634901, 31 samples from PRJDB1657, 22 samples from PRJEB48570, 8 samples from PRJNA609616, and 20 samples from PRJNA802819. The 280 samples from PRJNA634901 come from three different Nile tilapia breeding populations (popA, popB, and popC), with 56, 100, and 124 samples, respectively (Cádiz et al., 2020). The 31 samples from PRJDB1657 comprise 23 genetically improved farmed tilapia (GIFT) samples and 8 samples from the Nile-c population, a line of Nile tilapia selected in China since its introduction from Egypt in the 1980s (Xia et al., 2015). The 22 samples from PRJEB48570 were collected from pure populations of Nile tilapia in Tanzania (Ciezarek et al., 2022). The 8 samples from PRJNA609616 were collected from a selected line of Nile tilapia at the National Institute for Basic Biology of Japan (Tao et al., 2021). The 20 samples from PRJNA802819 were collected from Lake Hora and belong to the subspecies O. niloticus cancellatus (Triay et al., 2022). Detailed information on these samples, including accession numbers, origins, and read counts, is shown in Supplementary Table S1.

Mapping, variant calling, and filtering
After the 361 WGS datasets were downloaded, a data processing pipeline consisting of quality control, mapping, variant calling, and filtering was run to obtain high-quality genotype data for further analysis. In brief, the raw data of each sample were trimmed with Trimmomatic (Bolger et al., 2014) using the following parameters: ILLUMINACLIP:TruSeq3-SE:2:30:10, LEADING:5, TRAILING:5, SLIDINGWINDOW:5:20, and MINLEN:50. After trimming, the clean reads of each sample were mapped to the Nile tilapia reference genome (GCF_001858045.2) using BWA v.0.7.17-r1188 (Li and Durbin, 2009). The SAM files generated by BWA were sorted and converted to BAM files with the SortSam module of GATK version 4.2.0 (McKenna et al., 2010). Potential PCR duplicates were then marked with the MarkDuplicates module of GATK. Genotypes were called per sample with the HaplotypeCaller utility of GATK. All gVCF files were combined into a single VCF file with the GenomicsDBImport and GenotypeGVCFs utilities of GATK, following the suggested pipeline. Finally, SNP variants were selected with the SelectVariants utility of GATK, and false positive variants were filtered out by retaining only variants with variant confidence score (QUAL) ≥ 30, QualByDepth (QD) ≥ 2.0, RMS mapping quality (MQ) ≥ 40, Fisher-Strand (FS) < 60, strand odds ratio (SOR) ≤ 3, mapping quality rank sum test (MQRankSum) ≥ -12.5, and ReadPosRankSum ≥ -8. After filtering, 361 samples with 40,272,309 chromosomal SNPs remained for further analysis.
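The hard-filtering step can be illustrated with a minimal Python sketch that decides whether a single variant is retained. It is not the GATK implementation, only a restatement of the thresholds as a predicate. The function name, the dictionary layout, and the treatment of missing annotations are hypothetical; the negative signs on the two rank-sum cutoffs and the "≤" on SOR are assumptions that follow the commonly used GATK hard-filter recommendations.

```python
import operator

# (annotation, comparison that must hold for the variant to be kept, cutoff)
# Assumption: rank-sum cutoffs are negative and SOR uses "<=".
RULES = [
    ("QUAL", operator.ge, 30.0),
    ("QD", operator.ge, 2.0),
    ("MQ", operator.ge, 40.0),
    ("FS", operator.lt, 60.0),
    ("SOR", operator.le, 3.0),
    ("MQRankSum", operator.ge, -12.5),
    ("ReadPosRankSum", operator.ge, -8.0),
]


def passes_hard_filters(annotations):
    """Return True if every annotation that is present satisfies its rule.

    Annotations that are absent or None are skipped rather than failed;
    this is a design choice of the sketch, not a statement about the pipeline.
    """
    for key, compare, cutoff in RULES:
        value = annotations.get(key)
        if value is None:
            continue
        if not compare(value, cutoff):
            return False
    return True


if __name__ == "__main__":
    example = {"QUAL": 812.4, "QD": 14.2, "MQ": 60.0, "FS": 1.3, "SOR": 0.7}
    print(passes_hard_filters(example))
```

Skipping missing annotations keeps homozygous sites, for which rank-sum tests are undefined, from being rejected automatically; a stricter variant of the sketch could fail them instead.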

Population genetic characteristics analysis
To understand the genetic characteristics of the different populations, SNP characteristics, linkage disequilibrium (LD), principal component analysis (PCA), and phylogenetic tree analyses were performed using the WGS data. SNP characteristics were analyzed to identify common and population-specific SNP markers among the different Nile tilapia populations. LD was computed with PopLDdecay (Zhang et al., 2019) for the different populations using WGS data after removing variants with a low genotyping call rate (< 90%) or low frequency (MAF < 0.01). For PCA, SNPs with more than two alleles, SNPs with a minor allele frequency (MAF) below 1%, and SNPs with a genotyping call rate below 90% were first removed with PLINK 2.0 [26]. Furthermore, to reduce the influence of LD on population structure inference, SNPs in LD (r2 > 0.3) were pruned from the WGS data using PLINK 2.0 with the parameter --indep-pairwise 50 10 0.3. Finally, PCA was performed in PLINK 2.0 with the remaining markers for all individuals and visualized using the R package scatterplot3d. For the genetic distance analysis, the pairwise genetic distance matrix between all individuals was estimated as one minus the identity-by-state (IBS) value, which was computed with PLINK 2.0 from the filtered WGS data. Finally, phylogenetic trees were constructed using the neighbor-joining method in MEGA version 11.0.11 (Kumar et al., 2018) and visualized using the ggtree package (Xu et al., 2022).
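The "one minus IBS" distance used for the phylogenetic trees can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the PLINK computation: genotypes are coded as alternate-allele dosages (0, 1, 2), missing calls are None, and each jointly genotyped site contributes 1 - |g1 - g2| / 2 to the IBS value. The function names are hypothetical.

```python
def ibs_similarity(g1, g2):
    """Mean identity-by-state of two individuals over jointly called sites."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        raise ValueError("no jointly genotyped sites")
    # 0 shared alleles -> 0.0, 1 shared allele -> 0.5, 2 shared alleles -> 1.0
    return sum(1 - abs(a - b) / 2 for a, b in shared) / len(shared)


def ibs_distance_matrix(genotypes):
    """Symmetric matrix of 1 - IBS for a list of per-individual dosage lists."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - ibs_similarity(genotypes[i], genotypes[j])
            dist[i][j] = dist[j][i] = d
    return dist


if __name__ == "__main__":
    toy = [[0, 1, 2, 2], [0, 1, 2, 2], [2, 1, 0, None]]
    for row in ibs_distance_matrix(toy):
        print(row)
```

A matrix of this form is what a neighbor-joining program takes as input; identical individuals have distance 0 and individuals sharing no alleles at any site have distance 1.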

Reference and target panels
To explore genotype imputation strategies from SNP chip data to WGS data, the 100 samples of popB and the remaining 261 samples were used as the internal and external populations, respectively. To obtain the target panel of popB for genotype imputation, the corresponding genotype data were extracted with PLINK from the WGS data of the 100 popB samples based on the map information of the 65 K SNP chip (Penaloza et al., 2020). The reference panels were divided into internal, external, and combined reference panels, which were made up of WGS data from different subsets of popB samples, WGS data from the external population, and their combination, respectively. To obtain high-quality genotype data for genotype imputation, SNPs with a minor allele frequency (MAF) below 1% or a genotyping call rate below 90% were removed from the reference and target panels using PLINK 2.0.
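The two quality-control criteria applied to the panels (MAF of at least 1% and call rate of at least 90%) can be written out explicitly. The following Python sketch only illustrates the arithmetic that such a filter performs for one SNP; it is not the PLINK code. Genotypes are assumed to be alternate-allele dosages (0, 1, 2) with None for missing calls, and the function names are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype at this SNP."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid call
    return min(alt_freq, 1 - alt_freq)


def keep_snp(genotypes, min_maf=0.01, min_call_rate=0.90):
    """True if the SNP passes both the call-rate and the MAF threshold."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


if __name__ == "__main__":
    print(keep_snp([0, 1, 2, 0, 1, 0, 0, 0, 0, 0]))   # polymorphic, fully called
    print(keep_snp([0] * 10))                          # monomorphic
```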

Key individual selection strategies
To investigate the influence of key individual selection strategies for the reference population on the accuracy of genotype imputation from SNP chip data to WGS data, three methods were considered: minimizing the average distance to the closest leaf (ADCL) (Kang et al., 2015), maximizing the expected genetic relationship between the target population and the reference population (REL) (Druet et al., 2014), and randomly selecting individuals (RAN). In the ADCL method, the rppr binary in pplacer was executed to select key individuals for the reference panel from a phylogenetic tree using the adapted partitioning-around-medoids algorithm. Genotype data were used to calculate the pairwise genetic distance matrix and to construct the phylogenetic tree using the neighbor-joining method in MEGA version 11.0.11 (Kumar et al., 2018). In the REL method, the IBS matrix was used to maximize the expected genetic relationship between the reference population and the target population while accounting for relationships among the selected individuals, as described in detail by Druet et al. (2014). This IBS matrix was constructed with PLINK 2.0 from the standardized genotypes of the 65 K chip data (Zhou and Stephens, 2012). In the RAN method, individuals were selected randomly for the reference panel.
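To make the contrast between random and relationship-based selection concrete, the following Python sketch implements RAN and a deliberately simplified greedy heuristic in the spirit of REL. The greedy rule (mean similarity to the target individuals minus mean similarity to individuals already chosen) is an assumption made for illustration; it is not the algorithm of Druet et al. (2014), and the function names are hypothetical. The similarity matrix is assumed to be a symmetric IBS matrix indexed by individual.

```python
import random


def select_random(candidates, k, seed):
    """RAN: draw k candidates uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, k))


def _score(sim, candidate, targets, selected):
    """Closeness to the targets, penalised by redundancy with chosen individuals."""
    to_targets = sum(sim[candidate][t] for t in targets) / len(targets)
    if not selected:
        return to_targets
    redundancy = sum(sim[candidate][s] for s in selected) / len(selected)
    return to_targets - redundancy


def select_greedy_relationship(sim, candidates, targets, k):
    """Simplified REL-like heuristic: add the best-scoring candidate k times."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: _score(sim, c, targets, selected))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    sim = [
        [1.00, 0.95, 0.50, 0.90],
        [0.95, 1.00, 0.50, 0.85],
        [0.50, 0.50, 1.00, 0.60],
        [0.90, 0.85, 0.60, 1.00],
    ]
    print(select_greedy_relationship(sim, [0, 1, 2], targets=[3], k=2))
    # Five seeded replicates of random selection, to be averaged downstream.
    print([select_random([0, 1, 2], 2, seed) for seed in range(5)])
```

In the toy matrix, individual 1 is nearly a duplicate of individual 0, so the heuristic prefers the less redundant individual 2 as its second pick even though individual 1 is closer to the target.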

Genotype imputation
To explore the impact of the reference panel type, the haplotype phasing and imputation software, the reference panel size, the key individual selection strategy, and the composition of the combined reference panel on imputation accuracy, three scenarios were investigated in this study. To improve computational efficiency, four chromosomes (NC_031965.2, NC_031967.2, NC_031972.2, and NC_031983.2) were selected for genotype imputation in all scenarios.
Scenario one was designed to study the accuracy of imputation from 65 K chip data to WGS data with different combinations of pre-phasing software, imputation software, and reference panels. A total of 27 combinations were designed using three phasing programs (Beagle5 version 5.

Assessment of imputation quality
The concordance rate and the non-reference concordance rate were used to evaluate imputation accuracy. The concordance rate is defined as the proportion of imputed genotypes that agree with the observed genotypes. In this study, the genotypes called from the whole genome re-sequencing data of popB were defined as the observed genotypes. The non-reference concordance rate is similar to the concordance rate but is restricted to individuals that are not homozygous for the reference allele. After genotype imputation, both the SNP sites shared between the target panel and the imputed WGS data and the SNPs with MAF below 0.5% were first removed using PLINK 2.0. The SNPs shared between the remaining imputed genotypes and the observed genotypes were then used to measure the concordance rate and the non-reference concordance rate per individual or per SNP. In addition, when individuals were selected randomly to construct the reference panel, the process was replicated five times and the average imputation accuracy was reported to reduce random sampling error.
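The two accuracy measures can be sketched directly from their definitions. In the Python example below, genotypes are assumed to be alternate-allele dosages (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate) with None for missing calls. Restricting the non-reference concordance rate to pairs whose observed genotype is not homozygous reference is one reading of the definition and is an assumption of this sketch; the function names are hypothetical.

```python
def concordance_rate(imputed, observed):
    """Proportion of imputed genotypes identical to the observed genotypes."""
    pairs = [(i, o) for i, o in zip(imputed, observed)
             if i is not None and o is not None]
    if not pairs:
        raise ValueError("no comparable genotypes")
    return sum(i == o for i, o in pairs) / len(pairs)


def non_reference_concordance_rate(imputed, observed):
    """Concordance restricted to observed genotypes that carry the alternate allele.

    Assumption: a pair is excluded when the observed genotype is homozygous
    for the reference allele (dosage 0).
    """
    pairs = [(i, o) for i, o in zip(imputed, observed)
             if i is not None and o is not None and o != 0]
    if not pairs:
        raise ValueError("no non-reference observed genotypes")
    return sum(i == o for i, o in pairs) / len(pairs)


if __name__ == "__main__":
    imputed = [0, 1, 2, 0, 1]
    observed = [0, 1, 1, 0, 2]
    print(concordance_rate(imputed, observed))
    print(non_reference_concordance_rate(imputed, observed))
```

The same functions apply per individual (genotypes of one animal across SNPs) or per SNP (genotypes of all animals at one site), depending on how the input lists are sliced. For rare variants most observed genotypes are homozygous reference, which is why the plain concordance rate can look high even when carriers of the minor allele are imputed poorly.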

Population genetic characteristics among different Nile tilapia populations
Because the genetic relationship between the reference and validation populations is an important factor influencing genotype imputation accuracy, understanding the genetic characteristics of the different populations is of great practical importance for genotype imputation. In this study, we identified a total of 11,553,540, 14,915,736, 8,890,633, and 19,046,171 SNPs in popA, popB, popC, and the other populations, respectively. The total number of SNPs across these four Nile tilapia populations was 29,235,156, of which 4,280,748 SNPs were shared by all of them, accounting for 14.64% of the total. The number of private SNPs was 2,404,922, 4,750,220, 1,027,047, and 8,613,416 for popA, popB, popC, and the other populations, accounting for 8.23%, 16.25%, 3.51%, and 29.46% of the total SNPs, respectively (Fig. 1A). The extent of LD declined with the physical distance between SNPs, but the rate of decline varied among the Nile tilapia populations (Fig. 1B). As the distance between variants increased, LD declined most rapidly in popC, followed by popB, the other populations, and popA (Fig. 1B). Furthermore, the clustering results of the PCA and phylogenetic tree analyses showed that the 361 Nile tilapia were clearly separated into four distinct clusters based on the SNP data: popA, popB, popC together with a fraction of the individuals from the other populations, and the remaining individuals from the other populations (Fig. 1C-D). The contributions of the first three principal components (PCs) were 25.97%, 15.66%, and 12.84%, respectively, for a cumulative contribution of 54.47%. The analyses also clearly showed that a small number of individuals were admixed between popA and popC (Fig. 1C-D). In general, the diverse and rich genetic resources of Nile tilapia should provide a high-quality haplotype reference panel for genotype imputation.
Comparison of different pre-phasing and imputation software combinations for genotype imputation
To investigate the impact of pre-phasing and imputation software on imputation accuracy, a total of 9 combinations of commonly used phasing and imputation software were evaluated with the internal, external, and combined reference panels: Beagle5/Beagle5, Beagle5/Impute5, Beagle5/Minimac4, Eagle2/Beagle5, Eagle2/Impute5, Eagle2/Minimac4, Shapeit4/Beagle5, Shapeit4/Impute5, and Shapeit4/Minimac4. Across all combinations, the highest imputation accuracy for the internal, external, and combined reference panels was 0.794 (Beagle5/Minimac4 on NC_031972.2), 0.784 (Beagle5/Impute5 on NC_031965.2), and 0.806 (Beagle5/Impute5 on NC_031965.2), respectively (Fig. 2A-C). Among the pre-phasing software, Beagle5 was slightly better than Eagle2 and Shapeit4 in most cases, but the difference was not significant (P = 0.3) (Fig. 2A-I). Among the imputation software, Impute5 had the highest imputation accuracy, followed by Beagle5 and Minimac4, when the combined or external reference panel was used. With the internal reference panel, Minimac4 had the highest imputation accuracy, followed by Impute5 and Beagle5. Furthermore, with Beagle5 or Impute5, imputation accuracy was highest with the combined reference panel, followed by the external and then the internal reference panel, and was thus positively correlated with reference panel size, no matter which pre-phasing software was used (Fig. 2A, 2B, 2D, 2E, 2G, 2H). For Minimac4, however, the imputation accuracy of the internal reference panel was slightly higher than that of the combined reference panel in most cases and clearly higher than that of the external reference panel, no matter which pre-phasing software was used (Fig. 2C, 2F, 2I). In general, the optimal pre-phasing and imputation software combinations for the internal, external, and combined reference panels were Beagle5/Minimac4, Beagle5/Impute5, and Beagle5/Impute5, respectively.
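The design evaluated here is a full factorial: three phasing programs crossed with three imputation programs give the nine software pairs, and crossing those with the three reference panel types gives 27 runs per chromosome. A short Python sketch of that enumeration follows; the dictionary keys and the function name are hypothetical and serve only to illustrate how such a run table could be generated.

```python
from itertools import product

PHASERS = ["Beagle5", "Eagle2", "Shapeit4"]
IMPUTERS = ["Beagle5", "Impute5", "Minimac4"]
PANELS = ["internal", "external", "combined"]


def scenario_one_design():
    """Enumerate every panel x phasing x imputation combination once."""
    return [
        {"panel": panel, "phasing": phaser, "imputation": imputer}
        for panel, phaser, imputer in product(PANELS, PHASERS, IMPUTERS)
    ]


if __name__ == "__main__":
    runs = scenario_one_design()
    print(len(runs))
    for run in runs[:3]:
        print(f'{run["panel"]}: {run["phasing"]}/{run["imputation"]}')
```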

Impact of reference panel size on imputation accuracy
To investigate the impact of reference panel size on imputation accuracy, internal, external, and combined reference panels of different sizes were used for genotype imputation with Beagle5, Impute5, and Minimac4; the results are shown in Fig. 3. With the internal reference panel, the average imputation accuracy increased with the size of the reference panel, regardless of the imputation software (Fig. 3A). Beyond a reference panel size of 25, the increase in accuracy slowed for both Impute5 and Minimac4 (Fig. 3A). For example, when the reference panel size increased from 25 to 50, the imputation accuracy increased by only 6.85% and 7.64% for Impute5 and Minimac4, far below the gains obtained when the size increased from 5 to 25 (27.58% for Impute5 and 29.69% for Minimac4). For Beagle5.2, by contrast, the imputation accuracy increased almost linearly with the size of the reference panel (Fig. 3A). With the external reference panel, the average imputation accuracy also increased with the size of the reference panel for Beagle5.2 and Impute5 (Fig. 3B). However, the growth in imputation accuracy declined quickly once the reference panel size exceeded 50, with zero or even negative growth for Minimac4 (Fig. 3B). With the combined reference panel containing a fixed set of external individuals (N = 261), the average imputation accuracy increased steadily as more internal individuals were added, especially for Minimac4, whose growth rate was clearly higher than that of Beagle5.2 and Impute5 (Fig. 3C). With the combined reference panel containing a fixed set of internal individuals (N = 50), the average imputation accuracy of Minimac4 decreased markedly (from 0.942 to 0.899) as more external individuals were included, whereas that of Beagle5.2 and Impute5 remained almost unchanged or increased slowly (Fig. 3D). Nevertheless, compared with using only an internal or external reference panel, adding external or internal individuals to construct a combined reference panel always improved the imputation accuracy (Fig. 3). For example, the average imputation accuracy of Minimac4 with the internal reference panel (N = 50) was 0.866, lower than that of every combined reference panel with the same fifty internal individuals (0.899 to 0.942). In addition, a detailed comparison of the imputation accuracy of Minimac4 between the combined reference panel (N = 55) and the internal reference panel (N = 50) across MAF classes showed that adding five external individuals greatly improved the imputation accuracy, particularly for rare variants (Figure S1). In short, enlarging the reference panel by combining populations always improved the imputation accuracy relative to a single-population panel, but a larger combined reference panel did not necessarily result in higher imputation accuracy.

Impact of key individual selection strategies on imputation accuracy
Selecting individuals to construct a high-quality reference panel is a prerequisite for genotype imputation when no high-quality public reference panel is available. To determine how best to do so, we systematically investigated three methods of selecting individuals to construct internal, external, and combined reference panels for imputation; the results are shown in Fig. 4. For the external reference panel, there were no obvious differences among ADCL, RAN, and REL when the number of external individuals was higher than 25 (Fig. 4A-C). For the combined reference panel with fixed internal individuals, using the RAN method to select external individuals always gave higher imputation accuracy than ADCL and REL, particularly with Minimac4, and there was no obvious difference between ADCL and REL (Fig. 4A-C). For example, when 100 external individuals were selected to construct the combined reference panel, the average accuracy of Minimac4 was 0.920, 0.848, and 0.849 for RAN, ADCL, and REL, respectively (Fig. 4C). For the internal reference panel, when the number of key internal individuals was less than 30, using the REL method to select internal individuals always gave slightly higher imputation accuracy than ADCL and RAN, particularly with Impute5 (P = 0.04) (Fig. 4D-F). For the combined reference panel with fixed external individuals, there were likewise no obvious differences among ADCL, RAN, and REL. Taken together, key individual selection strategies affected the accuracy of genotype imputation, but the optimal strategy differed among imputation scenarios.

The influence of MAF on imputation accuracy
Accurate imputation of rare variants is still a challenge in genotype imputation. Using the internal, external, and combined reference panels, a total of 9 pre-phasing and imputation software combinations were used to assess the impact of MAF on imputation accuracy; the results are shown in Fig. 5. The concordance rate decreased substantially as the MAF of the variants increased, regardless of the circumstances (Fig. 5). For the external and combined reference panels, the non-reference concordance rate first rose and then fell as the MAF of the variants increased, no matter which pre-phasing and imputation software was used (Fig. 5). For the internal reference panel, however, as the MAF of the variants declined, the non-reference concordance rate first rose and then fell for Minimac4 (Fig. 5C, F, I); first rose, then fell, and then stabilized for Impute5 (Fig. 5B, E, H); and first stabilized, then fell, and then rose for Beagle5 (Fig. 5A, D, G). More importantly, the internal reference panel performed better than the external or combined reference panel when the MAF of the variants was less than 0.05.
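Accuracy-by-MAF curves of this kind are obtained by grouping SNPs into MAF classes and averaging a per-SNP accuracy measure within each class. The Python sketch below shows one way to do this. The bin edges are hypothetical (the classes used for Fig. 5 are not stated in the text), bins are treated as half-open intervals (low, high], and the function name is an assumption of the example.

```python
from bisect import bisect_left


def mean_accuracy_by_maf_bin(records, edges=(0.0, 0.01, 0.05, 0.1, 0.2, 0.5)):
    """Average per-SNP accuracy within MAF bins (low, high].

    records: iterable of (maf, accuracy) pairs, one per SNP.
    Returns a dict mapping (low, high) to the mean accuracy, or None if empty.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for maf, accuracy in records:
        if not edges[0] < maf <= edges[-1]:
            continue  # outside the binned range, e.g. monomorphic sites
        idx = bisect_left(edges, maf) - 1
        sums[idx] += accuracy
        counts[idx] += 1
    return {
        (edges[k], edges[k + 1]): (sums[k] / counts[k] if counts[k] else None)
        for k in range(n_bins)
    }


if __name__ == "__main__":
    toy = [(0.005, 0.99), (0.03, 0.90), (0.04, 0.80), (0.30, 0.70)]
    for maf_bin, value in mean_accuracy_by_maf_bin(toy).items():
        print(maf_bin, value)
```

Running the same binning once with the concordance rate and once with the non-reference concordance rate as the accuracy value makes the contrast between the two measures at low MAF directly visible.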

Discussion
Pre-phasing and imputation software
Pre-phasing and imputation software affected not only the imputation accuracy but also the computational efficiency. Most popular pre-phasing and imputation programs are based on the Li and Stephens hidden Markov model (HMM), including Beagle5.4, Eagle2, and Shapeit4 for phasing, and Beagle5.4, Impute5, and Minimac4 for imputation. In this study, we found no significant difference in imputation accuracy (P = 0.3) among pre-phasing with Beagle5, Eagle2, or Shapeit4 (Fig. 2). However, computational efficiency differed markedly among Beagle5.4, Eagle2, and Shapeit4. The total pre-phasing time was shortest for Beagle5.4, followed by Eagle2, and longest for Shapeit4 (Table S1). For example, the total pre-phasing time on NC_031965.2 was 816 s, 967 s, and 5940 s for Beagle5.4, Eagle2, and Shapeit4, respectively (Table S2). Similar results have been reported in human (De Marino et al., 2022). This is because Beagle 5.2 and later versions use marker windowing, composite reference haplotypes, and a progressive phasing algorithm to reduce memory usage and computation time (Browning et al., 2021). For imputation software, the advantage in imputation accuracy differed among reference panels (Fig. 1). With the combined or external reference panel, the imputation accuracy of Beagle5.4 and Impute5 was always higher than that of Minimac4 (Fig. 1 and Fig. 3). In contrast, with the internal reference panel, Minimac4 had the highest imputation accuracy, especially for small reference sizes (Fig. 1 and Fig. 3). This phenomenon was caused mainly by the reference panel size and composition (Chassier et al., 2018) as well as the algorithm of the imputation software (Das et al., 2016). Minimac4 mainly uses local similarities between sequenced haplotypes in small genomic segments for genotype imputation, so its imputation accuracy would decrease when less closely related subpopulations were included (Fig. 3D). In addition, the number of polymorphic SNPs obtained by genotype imputation differed markedly among imputation software (Figure S2). For example, with the internal reference panel, Beagle5.4/Minimac4 yielded the most polymorphic SNPs (1,999,885), 2.25 and 5.93 times as many as Beagle5.4/Impute5 (888,624) and Beagle5.4/Beagle5.4 (337,395), respectively (Figure S2). In general, Minimac4 may be the optimal choice for an internal reference panel or a combined reference panel with only a few external individuals, whereas Impute5 and Beagle5 are well suited to external and combined reference panels of large size.

Reference panel size and composition
Reference panel size and composition were also major factors affecting imputation accuracy. In most cases, increasing the reference panel size results in higher imputation accuracy, and similar results have been reported in many species, such as cattle (van Binsbergen et al., 2014), chicken (Ye et al., 2019), and Nile tilapia (Garcia et al., 2022). In this study, similar trends were observed when individuals were added to the internal and external reference panels and when internal individuals were added to the combined reference panel (Fig. 3A-C). However, when external individuals were added to enlarge the combined reference panel, the imputation accuracy did not increase linearly with the reference panel size and even decreased for Minimac4 (Fig. 3D). This may be because more distantly related animals do not contribute substantially to imputation accuracy and may even introduce haplotype noise that decreases it (Pook et al., 2019). Therefore, adding external individuals to enlarge the combined reference panel can improve the accuracy of genotype imputation, but obtaining the highest accuracy requires balancing the size and composition of the reference panel, particularly when imputing with Minimac4.

Key selection strategies
A high-quality reference panel is a prerequisite for genotype imputation. However, large and highly diverse haplotype reference panels are still lacking for most aquaculture species. How to create a self-constructed reference panel under budget constraints for WGS therefore plays an important role in genotype imputation. In this study, we compared three methods (ADCL, REL, and RAN) of selecting internal or external individuals to construct different types of reference panel for imputation (Fig. 4). However, the advantage of the REL and ADCL methods was seen only in the internal reference panel. When the reference panel was small (fewer than 30 individuals), using the REL method to select individuals for the internal reference panel gave slightly higher imputation accuracy (Fig. 4D-F), consistent with previous studies (Druet et al., 2014; Ye et al., 2019). This is because the REL method selects individuals based on their genetic similarity to the target population, so that they represent the main haplotypes in the population well (Druet et al., 2014). In contrast, there were no significant differences in imputation accuracy among the methods when they were used to select internal individuals for the combined reference panel (Fig. 4D-F). This may be because the many more distantly related animals (N = 261) included in the combined reference panel dilute the haplotypes of the small number of key internal individuals, which therefore do not contribute to higher imputation accuracy (Dekeyser et al., 2023). Furthermore, using the RAN method to select external individuals was better than the ADCL and REL methods for constructing the combined reference panel (Fig. 4A-C), which conflicts with previous studies (Ye et al., 2019). This may be because selecting individuals from a genetically distant population with the ADCL and REL methods introduces more unnecessary noise.

Minor allele frequency
Rare variants play an important role in complex traits and might contribute substantially to their missing heritability (Bomba et al., 2017; Manolio et al., 2009). However, accurate imputation of rare variants is still a challenge. In this study, using a combined or external reference panel, we found that the concordance rate decreased substantially as the MAF of the variants increased, regardless of the circumstances (Fig. 5). This is because the concordance rate does not account for the frequency of the imputed alleles and considerably overestimates the accuracy of imputation for rare variants (Fernandes Junior et al., 2021). To better assess the imputation of rare variants, we used the non-reference concordance rate as an additional measure of imputation accuracy. The internal reference panel performed better than the external or combined reference panel in every scenario when the MAF of the variants was less than 0.05 (Fig. 5), which appears to conflict with previous reports that a large combined reference panel provides greater imputation accuracy, especially for low-MAF variants (Jiang et al., 2022; Ye et al., 2019). In fact, our results also support those reports: adding five external individuals to the internal reference panel (N = 50) to construct a combined reference panel (N = 55) greatly improved the imputation accuracy, particularly for rare variants (Figure S1). In addition, with the internal reference panel, the imputation accuracy increased as the MAF decreased when the MAF of the variants was less than 0.05. This may be because most rare variants tend to be recent, associated with longer haplotypes, and population-specific (Altshuler and Lander, 2012), and potentially reflect population structure directly (Mathieson and McVean, 2012).

Conclusions
There was no significant difference in imputation accuracy when pre-phasing was performed with Beagle5, Eagle2, or Shapeit4, but Beagle5 had the highest computational efficiency. Both Beagle5 and Impute5 were more suitable for combined and external reference panels of large size, whereas Minimac4 was suitable for the internal reference panel, especially at small panel sizes. Furthermore, increasing the reference panel size generally improved imputation accuracy, but a larger reference panel did not necessarily result in higher imputation accuracy. Using the REL method to select key individuals for the internal reference panel improved imputation accuracy compared with the ADCL and RAN methods. In conclusion, a combined reference panel provided greater imputation accuracy, but the optimal genotype imputation strategy needs to be balanced carefully and comprehensively against the actual situation. This work sheds light on how to design and execute genotype imputation in aquaculture.

Declarations
Ethics approval and consent to participate. Not applicable.

2 (Browning et al., 2021), Eagle2 version 2.4.1 (Loh et al., 2016), and Shapeit4 version 4.2.2 (Delaneau et al., 2019)), three imputation programs (Beagle5 version 5.2, Impute5 version 1.1.5 (Rubinacci et al., 2020), and Minimac4 version 4.1.2 (Das et al., 2016; Fernandes Junior et al., 2021; Howie et al., 2012)), and three reference panels (internal, external, and combined). In this scenario, the internal reference panel consisted of 20 individuals with WGS data selected randomly from popB. The external reference panel consisted of all 261 individuals with WGS data in the external population. The combined reference panel consisted of the 20 WGS individuals from the internal population and all 261 WGS individuals from the external population. All phasing and imputation software was run with default parameters on the same computing resources (120 threads and 512 GB of memory). In addition, the optimal combination of phasing and imputation software identified in this scenario was used for genotype imputation in the other scenarios.

Scenario two was designed to study the effect of reference panel size on the accuracy of genotype imputation from 65 K chip data to WGS data. In this scenario, four reference panel types were considered: the internal reference panel, the external reference panel, the combined reference panel with a fixed number of internal population individuals (N = 50), and the combined reference panel with a fixed number of external population individuals (N = 261). The corresponding numbers of internal or external individuals were selected randomly to construct reference panels of different sizes for genotype imputation. Finally, different sizes of the internal reference panel (N = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50), the external reference panel (N = 5, 50, 100, 150, 200, 250), the combined reference panel with 50 fixed internal population individuals (N = 55, 100, 150, 200, 250, 300), and the combined reference panel with 261 fixed external population individuals (N = 266, 271, 276, 281, 286, 291, 296, 301, 306, 311) were used for genotype imputation.

Scenario three was designed to study the impact of the key individual selection strategy on the accuracy of genotype imputation from 65 K chip data to WGS data. According to the three key individual selection strategies (ADCL, REL, and RAN), the corresponding numbers of internal or external individuals were selected to construct different sizes of the internal reference panel (N = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50), the external reference panel (N = 5, 50, 100, 150, 200, 250), the combined reference panel with 50 fixed internal population individuals (N = 55, 100, 150, 200, 250, 300), and the combined reference panel with 261 fixed external population individuals (N = 266, 271, 276, 281, 286, 291, 296, 301, 306, 311) for genotype imputation. In the combined reference panel with 50 fixed internal population individuals, these internal individuals were selected from popB by the corresponding key individual selection strategy.
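The panel sizes listed for scenario two can be reproduced with a short Python sketch that draws individuals at random (the RAN strategy) for each of the four panel types. This is only an illustration under stated assumptions: the sample identifiers, the size of the internal candidate pool (100 here), the seed, and the function name `scenario_two_panels` are hypothetical and not taken from the study; only the panel sizes come from the text.

```python
import random

# Numbers of individuals that vary across panels, as listed in the text.
INTERNAL_SIZES = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
EXTERNAL_SIZES = [5, 50, 100, 150, 200, 250]
N_FIXED_INTERNAL = 50
N_FIXED_EXTERNAL = 261


def scenario_two_panels(internal_ids, external_ids, seed=42):
    """Return {(panel_type, panel_size): [sample ids]} using random selection."""
    rng = random.Random(seed)
    panels = {}
    for n in INTERNAL_SIZES:
        panels[("internal", n)] = rng.sample(internal_ids, n)
    for n in EXTERNAL_SIZES:
        panels[("external", n)] = rng.sample(external_ids, n)
    # Combined panel: the same 50 internal individuals plus n external ones.
    fixed_internal = rng.sample(internal_ids, N_FIXED_INTERNAL)
    for n in EXTERNAL_SIZES:
        panels[("combined_fixed_internal", N_FIXED_INTERNAL + n)] = (
            fixed_internal + rng.sample(external_ids, n))
    # Combined panel: all 261 external individuals plus n internal ones.
    fixed_external = list(external_ids[:N_FIXED_EXTERNAL])
    for n in INTERNAL_SIZES:
        panels[("combined_fixed_external", N_FIXED_EXTERNAL + n)] = (
            rng.sample(internal_ids, n) + fixed_external)
    return panels


# Hypothetical identifiers; the internal pool size is an assumption.
internal_ids = [f"popB_{i:03d}" for i in range(100)]
external_ids = [f"ext_{i:03d}" for i in range(261)]

panels = scenario_two_panels(internal_ids, external_ids)
for (panel_type, size), members in sorted(panels.items()):
    print(panel_type, size, len(members))
```

For scenario three, the calls to `rng.sample` would be replaced by the ADCL or REL selection procedure, which is not sketched here.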

Figures

Figure 1 Population

Figure 3 Effects

Figure 4 Effects

Figure 5 Effects