Background

Population genetic studies based on genotyped single nucleotide polymorphisms (SNPs) are influenced by the non-random selection of the SNPs included on the genotyping arrays used. The resulting bias relative to whole genome sequencing (WGS) data is known as SNP ascertainment bias. Correcting for this bias requires detailed knowledge of the array design process, which is often not available in practice. This study investigates an alternative approach that mitigates ascertainment bias in a large set of genotyped individuals by imputing from a small set of sequenced individuals, without requiring prior knowledge of the array design.

Results

The strategy was first tested by simulating additional ascertainment bias in a set of 1,566 chickens from 74 populations that were genotyped at the positions of the Affymetrix Axiom™ 580k Genome-Wide Chicken Array. Imputation accuracy was consistently higher for populations used for SNP discovery during the simulated array design process. Reference sets containing at least one individual per population in the study set strongly corrected the ascertainment bias in estimates of expected and observed heterozygosity, Wright’s fixation index and Nei’s standard genetic distance. In contrast, unbalanced reference sets introduced a new bias towards the reference populations. Finally, the array genotypes were imputed to WGS using reference sets ranging from 74 individuals (one per population) to 98 individuals (additional commercial chickens) and compared with a mixture of individually sequenced and pool-sequenced populations. Imputation reduced the slope between heterozygosity estimates from array data and WGS data from 1.94 to 1.26 with the smaller, balanced reference panel and to 1.44 with the larger but unbalanced reference panel. This generally supported the simulation results but was less favorable, which argues for a larger reference panel when imputing to WGS.
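The heterozygosity estimates and the regression slope reported above can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 allele counts in an individuals-by-SNPs matrix, uses the plain estimator 2p(1 - p) without a small-sample correction, and regresses array-based estimates on WGS-based estimates; the function names and the regression direction are assumptions for illustration, not the study's actual pipeline.

```python
import numpy as np


def expected_heterozygosity(genotypes):
    """Mean expected heterozygosity, 2p(1 - p), averaged over SNPs.

    genotypes: individuals x SNPs array of 0/1/2 allele counts (np.nan = missing).
    Assumption: no small-sample correction is applied.
    """
    g = np.asarray(genotypes, dtype=float)
    p = np.nanmean(g, axis=0) / 2.0  # per-SNP frequency of the counted allele
    return float(np.mean(2.0 * p * (1.0 - p)))


def observed_heterozygosity(genotypes):
    """Share of non-missing genotype calls that are heterozygous (coded 1)."""
    g = np.asarray(genotypes, dtype=float)
    het = np.where(np.isnan(g), np.nan, (g == 1).astype(float))
    return float(np.nanmean(het))


def regression_slope(wgs_estimates, array_estimates):
    """Least-squares slope of array-based on WGS-based estimates.

    Assumption: one value per population; a slope of 1 with intercept 0
    would indicate no ascertainment bias.
    """
    slope, _intercept = np.polyfit(wgs_estimates, array_estimates, 1)
    return float(slope)
```

Under these assumptions, per-population estimates would be computed once from array (or imputed) genotypes and once from WGS genotypes, and the slope between the two sets summarizes how strongly the array estimates are inflated.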

Conclusions

The results highlight the potential of using imputation for mitigation of SNP ascertainment bias but also underline the need for unbiased reference sets.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

This is a list of supplementary files associated with this preprint.

- FigureS1.tiff
Figure S1: Recall rates for samples that were both genotyped and sequenced, per SNP (A, B) and per animal (C, D), before (A, C; red) and after (B, D; blue) correction of potential reference allele switches in the genotype data.

- FigureS2.tiff
Development of the per-animal imputation accuracy with an increasing number of reference animals per population. A – scenario randSamp_5_50; B – scenario randPop_5_50; C – scenario minPop_5_50; D – scenario maxPop_5_50. Individuals are grouped by whether they belong to a population that contains reference individuals, a population that was used for SNP discovery, or neither (application). The lines show the trend of the median.

- FigureS3.tiff
Development of correlations within population group (r), slope and mean overestimation of the regression lines for HE and HO estimates under different reference panel strategies. The intended value for unbiasedness and minimum variance is marked as a dense black horizontal line. Note that the case without imputation corresponds to zero reference samples.

- FigureS4.tiff
Development of correlation within population group (A), slope (B) and intercept (C) of the regression lines for D and FST when distributing the reference samples equally over all populations (allPop_74_740). The intended value for unbiasedness and minimum variance is marked as a dense black horizontal line. Note that the case without imputation corresponds to zero reference samples.

- FigureS5.tiff
Development of correlation within population group (r), slope and mean overestimation of the regression lines for Nei’s Distance (D) and FST estimates under different reference panel strategies. The intended value for unbiasedness and minimum variance is marked as a dense black horizontal line. Note that the case without imputation corresponds to zero reference samples.

- FigureS6.tiff
Distribution of DR2 values by chromosome and reference set. Note that outliers are not shown due to a large number of underlying values.

- FigureS7.tiff
Two-dimensional distributions of DR2 values vs. MAF by chromosome when imputed with the reference set 74_1perLine. The red line represents the median within 0.05 MAF bins.

- FigureS8.tiff
Effect of pooled sequencing and the correction factor of Futschik and Schlötterer [48] on expected heterozygosity (HE) and ascertainment bias. A – HE estimated from array positions of the sequencing data vs. HE directly estimated from array data. The color indicates the state before and after correcting the pooled sequence estimates, and the correspondingly colored solid lines show the group-specific regression lines, while the black solid line indicates the line of identity in all three plots. The plot therefore shows the magnitude of the bias introduced by pooled sequencing and the corresponding effect of the correction factor. B – HE estimated from the array data vs. HE estimated from the complete sequence data. The color again shows the values before and after correcting the pooled sequence estimates. The solid regression lines and dense circles indicate the individually sequenced samples, whereas the dashed regression lines and triangles indicate pooled sequenced samples. The plot therefore shows the combined effect of ascertainment bias and pooled sequencing bias. C – HE estimated from array positions of the sequencing data vs. HE estimated from all positions of the sequencing data. The plot therefore shows the pure ascertainment bias.

- FigureS9.tiff
Effect of pooled sequencing on the expression of the ascertainment bias in Nei’s standard genetic distance (D). The biased D was either estimated directly from the array genotypes (D.arr, pooled bias + ascertainment bias) or from the array positions of the sequencing data (D.arr.seq, pure ascertainment bias), while the estimates from the complete sequence were assumed to be the true estimates. The black solid line represents the line of identity, solid colored regression lines and dense points represent estimates between individually sequenced populations and dashed lines and triangles represent estimates between two populations of which at least one was pooled sequenced.

- FigureS10.tiff
Effect of pooled sequencing on the expression of the ascertainment bias in Wright’s fixation index (FST). The biased FST was either estimated directly from the array genotypes (FST.arr, pooled bias + ascertainment bias) or from the array positions of the sequencing data (FST.arr.seq, pure ascertainment bias), while the estimates from the complete sequence were assumed to be the true estimates. The black solid line represents the line of identity, solid colored regression lines and dense points represent estimates between individually sequenced populations and dashed lines and triangles represent estimates between two populations of which at least one was pooled sequenced.

- SupplementaryFile1.csv
Supplementary File 1: Accession Information of raw data per sample

- SupplementaryFile2.pdf
Supplementary File 2: Supplementary Methods

- TableS1.docx
Table S1: Quantiles of theoretical imputation accuracies (DR2) by reference set
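
Several of the captions above refer to Nei’s standard genetic distance (D) between pairs of populations. As a minimal sketch, the following computes D from per-SNP allele frequencies of two populations using the standard definition D = -ln(Jxy / sqrt(Jx · Jy)), restricted to biallelic loci. The function name and the frequency-vector input are assumptions for illustration; the sketch applies no sample-size correction and does not reproduce the study's handling of pooled sequence data.

```python
import numpy as np


def nei_standard_distance(p_x, p_y):
    """Nei's standard genetic distance for biallelic SNPs.

    p_x, p_y: per-SNP frequencies of the same allele in populations X and Y.
    Assumption: frequencies are used as given, without sample-size correction.
    """
    p_x = np.asarray(p_x, dtype=float)
    p_y = np.asarray(p_y, dtype=float)
    # Average homozygosity within each population over all loci.
    j_x = np.mean(p_x**2 + (1.0 - p_x) ** 2)
    j_y = np.mean(p_y**2 + (1.0 - p_y) ** 2)
    # Average probability of drawing identical alleles across populations.
    j_xy = np.mean(p_x * p_y + (1.0 - p_x) * (1.0 - p_y))
    return float(-np.log(j_xy / np.sqrt(j_x * j_y)))
```

Two populations with identical allele frequencies give D = 0, and D grows as the frequencies diverge. Computing it once from array positions and once from all sequence positions would give the kind of paired estimates compared in the captions.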

Posted 27 Jan, 2021
