How Imputation Can Mitigate Ascertainment Bias

Background 17 Population genetic studies based on genotyped single nucleotide polymorphisms (SNPs) are 18 influenced by a non-random selection of the SNPs included in the used genotyping arrays. The 19 resulting bias relative to whole genome sequencing (WGS) data is known as SNP ascertain- 20 ment bias. Correction for this bias requires detailed knowledge of the array design process 21 which is often not available in practice. This study intends to investigate an alternative ap- 22 proach to mitigate ascertainment bias of a large set of genotyped individuals by using infor- 23 mation of a small set of sequenced individuals via imputation without the need for prior 24 knowledge on the array design. Results and to 1.44 when using the larger but unbalanced reference panel. This generally supported the results from simulation but was less favorable, advocating for a larger reference panel when imputing to WGS. The results highlight the potential of using imputation for mitigation of SNP ascertainment 43 bias but also underline the need for unbiased reference sets.

proach to mitigate ascertainment bias of a large set of genotyped individuals by using infor-23 mation of a small set of sequenced individuals via imputation without the need for prior 24 knowledge on the array design. 25

Results 26
The strategy was first tested by simulating additional ascertainment bias with a set of 1,566 27 chickens from 74 populations that were genotyped for the positions of the Affymetrix Axiom™ 28 580k Genome-Wide Chicken Array. Imputation accuracy was shown to be consistently higher 29 for populations used for SNP discovery during the simulated array design process. Reference 30 sets of at least one individual per population in the study set led to a strong correction of 31 ascertainment bias for estimates of expected and observed heterozygosity, Wrights Fixation 32 Index and Nei's Standard Genetic Distance. In contrast, unbalanced reference sets introduced 33 a new bias towards the reference populations. Finally, the array genotypes were imputed to 34 WGS by utilization of reference sets of 74 individuals (one per population) to 98 individuals 35 (additional commercial chickens) and compared with a mixture of individually and pooled se-36 quenced populations. The imputation reduced the slope between heterozygosity estimates of 37 Background 47 To realize cost-and computational efficiency, many of the population genetic studies of the 48 last 10 years for humans [1,2], as well as for model-[3, 4] and agricultural species [5][6][7][8] were 49 based on single nucleotide polymorphisms (SNP), which were genotyped by commercially 50 available SNP arrays. Those arrays are based on a non-random selection (ascertainment) of 51 SNPs, mostly performed in more intensively researched populations (e.g. commercially used 52 livestock breeds). This results in a shift of the allele frequency spectrum towards mean allele 53 frequencies and thereby an overestimation of heterozygosity compared to whole genome re-54 sequencing (WGS) data. As this overestimation is stronger for populations involved in the ar-55 ray design (discovery populations) than for those not involved [9], follow-up analyses can be 56 biased. This effect is widely known as SNP Ascertainment Bias [10][11][12]. 57 Different population genetic estimators are affected by SNP ascertainment bias to a different 58 extent [11] and implementing bias-reduced estimators requires strong assumptions on the 59

Analyses based on simulation of ascertainment bias within the genotype set 153
A first comparison was based solely on the 15,868 SNPs of chromosome 10 of the genotype 154 set which allowed for a high number of repetitions while still being based on a sufficiently 155 sized chromosome. To simulate an ascertainment bias of known strength, a stronger biased 156 array was designed in silico from the genotype set for each of the 74 populations (further 157 called discovery population) by using only SNPs with MAF > 0.05 within this population. Then, 158 reference samples for imputation were chosen in five different ways with ten different num-159 bers of reference samples and three repetitions per sampling: 160 1) allPop_74_740: Equally distributed across all populations by sampling one to ten chick-161 ens per population (74 -740 reference samples). 162 2) randSamp_5_50: 5, 10, …, 50 randomly sampled chickens (5-50 reference samples). 163 3) randPop_5_50: Five chickens from each of one to ten randomly sampled populations 164 Parameter settings in Beagle were further tweaked by increasing the window parameter to 203 200 cM to ensure enough overlap between reference and study SNPs. This was needed as we 204 observed low assembly quality and insufficient coverage of the array on the small chromo-205 somes. Analyses were then based on comparisons between the genotype set, the pruned set 206 or the imputed sets and the gold standard, the WGS data. 207

Assessment of imputation accuracy 208
Assessment of imputation accuracy was done by using Pearson correlation (r) between true 209 and imputed genotypes [42,56] for the in silico designed arrays. Pearson correlation puts a 210 higher relative weight on imputation errors in rare alleles than plain comparison of allele-or 211 genotype concordance rates [56]. In case of the imputation to sequence level, no 'true' set 212 was available for the evaluation of imputation accuracy. Instead, we used the internal Beagle 213 quality measure, the dosage r-squared (DR2) [57]. This, however, has the drawback that it only 214 shows the theoretical imputation accuracy, cannot capture biases due to biased reference sets 215 and also does not allow for a per population evaluation of imputation accuracy. 216  (4). 253

Comparison of population genetic estimators
The definition of a group describes for within-population estimators (e.g. H E ) whether a pop-255 ulation was used for SNP discovery (discovery population), samples from that population were 256 used as reference set (reference population) or none of both (application population). Note 257 that in scenarios where reference individuals were present for every population, we only di-258 vided them into discovery and application populations. For between population estimators 259 (F ST , D), a group describes the according combination of the two involved population groups. 260 Differences of the estimated slopes from one and the correlation between heterozygosity and 261 13 distance estimates from biased and true set within groups were used as indicators for the 262 magnitude of bias and random estimation error. 263 To get a measure for a fixed estimation error, we also calculated the mean overestimation 264 across populations (j = 1 ... J) as in equation (5). The effects were observed in a comparable manner for the other imputation strategies ( Figure  331 S 3). Due to smaller reference panels, the correction effect of the imputation was generally 332 worse than for strategy allPop_74_740. Interestingly, when limiting the reference samples to 333 a small number of populations (strategies randPop_5_50, minPop_5_50, maxPop_5_50), we 334 observed a newly introduced bias towards the reference populations (Figure S 3). This effect 335 was strongest for strategy maxPop_5_50, where we chose the populations due to a maximum 336 distance from the discovery population. However, increasing the number of reference sam-337 ples minimized the bias of reference and discovery populations with all strategies. 338

Population distances 339
The effects of ascertainment bias were also present but less pronounced for the distance 340 populations. The results were less beneficial for the real WGS data, but also showed a strong 404 19 decrease of the slope towards one. From the imputed in silico arrays, we could additionally 405 realize a fast closing of the gap of the stronger overestimation of heterozygosity within dis-406 covery populations and the less severe overestimation in non-discovery populations. This also 407 seemed to be the case when imputing to WGS level where we observed that the slope within 408 the commercial populations (closely related to discovery populations of the real array) de-409 creased more than the slope within all populations due to imputation. However, this observa-410 tion in the WGS data has to be regarded with caution, as we additionally identified a non-411 negligible bias due to pooled sequencing which interfered with the assessment of ascertain-412 ment bias and which was, in our study, confounded with the difference between commercial 413 populations (sequenced individually) and non-commercial populations (sequenced as pools). 414 The use of WGS information via imputation also consistently showed better results in regard 415 of reduction of ascertainment bias than using linkage pruned array SNPs which was reported 416 to be an effective filtering strategy for ascertainment bias mitigation by Malomane et al. [16]. 417 Generally, the effect of imputation on the investigated estimators was shown to be compara-418 ble across estimators, regardless of their initial reaction to ascertainment bias. An interesting 419 side observation was that F ST did not show any ascertainment bias on the real array data ( We also investigated the effect of differently sized and constructed reference sets for impu-427 tation. Generally, larger reference sets increased the accuracy of imputation and therefore 428 decreased the effects of ascertainment bias. The best results were achieved when the refer-429 ence set was as evenly distributed across the study set as possible. When reference popula-430 tions were closely related to the discovery population, reduction in imputation quality and 431 increase in ascertainment bias were less severe in case of unbalanced reference sets than if 432 distantly related reference populations were used. This suggests that variation within study-433 and reference set needs to show enough overlap to achieve sufficient imputation accuracy 434 and therefore reduction of ascertainment bias. 435 Results from literature suggest that multi-breed reference panels generally increase imputa-436 tion accuracy especially for rare variants and within admixed populations [39][40][41]. Addition-437 ally, Rowan et al. [41] argue that they do not seem to introduce variation at a relevant scale 438 for markers for which the breeds are actually fixed. However, some studies also showed that 439 strongly unbalanced reference sets can reduce imputation accuracy [42,43]. In this study, in-440 cluding additional reference samples in a biased way when going from reference set 74_1per-441 Line to 158_all increased the effects of ascertainment bias. However, theoretical imputation 442 accuracies rather increased than decreased (Figure S 6; Table S  Wright's fixation index (F ST ). The biased F ST was either estimated directly from the array gen-572 otypes (FST.arr, pooled bias + ascertainment bias) or from the array positions of the sequenc-573 ing data (FST.arr.seq, pure ascertainment bias), while the estimates from the complete se-574 quence were assumed to be the true estimates. The black solid line represents the line of 575 identity, solid colored regression lines and dense points represent estimates between individ-576 ually sequenced populations and dashed lines and triangles represent estimates between two 577 populations of which at least one was pooled sequenced. 578