Optimizing Genomic Selection in Dezhou Donkey Using Low Coverage Whole Genome Sequencing

23 Background: Low coverage whole genome sequencing is a low-cost genotyping 24 technology. Combining with genotype imputation approaches, it is likely to become a 25 critical component of cost-efficient genomic selection programs in agricultural 26 livestock. Here, we used the low-coverage sequence data of 617 Dezhou donkeys to 27 investigate the performance of genotype imputation for low coverage whole genome 28 sequence data and genomic selection based on the imputed genotype data. The specific 29 aims were: (i) to measure the accuracy of genotype imputation under different 30 sequencing depths, sample sizes, MAFs, and imputation pipelines; and (ii) to assess the 31 accuracy of genomic selection under different marker densities derived from the 32 imputed sequence data, different strategies for constructing the genomic relationship 33 matrixes, and single- vs multi-trait models. 34 Results: We found that a high imputation accuracy (> 0.95) can be achieved for 35 sequence data with sequencing depth as low as 1x and the number of sequenced 36 individuals equal to 400. For genomic selection, the best performance was obtained by 37 using a marker density of 410K and a G matrix constructed using marker dosage 38 information. Multi-trait GBLUP performed better than single-trait GBLUP. 39 Conclusions: Our study demonstrates that low coverage whole genome sequencing 40 would be a cost-effective method for genomic selection in Dezhou Donkey. 41

Results: We found that a high imputation accuracy (> 0.95) can be achieved for 35 sequence data with sequencing depth as low as 1x and the number of sequenced 36 individuals equal to 400. For genomic selection, the best performance was obtained by 37 using a marker density of 410K and a G matrix constructed using marker dosage 38 information. Multi-trait GBLUP performed better than single-trait GBLUP.

39
Background 45 Dezhou Donkey, originated from Dezhou area, Shandong Province, China, is one 46 of major donkey breeds in China. It is famous for its large body size (thus good meat 47 production ability) and excellent skin quality (for producing donkey-hide gelatin). It 48 has been introduced as breeding stock into many provinces and cities, and has also 49 brought considerable economic benefits to farmers. Therefore, Dezhou Donkey plays   This approach has been used in human and some animal species for genome-wide 74 association study and genomic selection/prediction and approved to be a feasible 75 alternative to normal sequencing [7-10]. Since the cost of lcWGS can even be lower 76 than that of SNP array, it is considered as a cost-effective genotyping approach for GS 77 and GS based this approach was referred as GS 2.0 by Hickey (2013) [11].

78
A critical issue of lcWGS-based GS is the accuracy of imputation of missing 79 genotypes, which is affected by several factors, such as sequencing depth, population 80 size, minor allele frequency (MAF), and imputation method. A number of imputation 81 methods for lcWGS data have been proposed [12][13][14]. However, most of these methods 82 require a high-density haplotype reference panel, which are not available for most 83 animal species. Davies et al. (2016)[12] proposed a method called STITCH for 84 imputation based only on sequencing read data, without requiring a haplotype reference 85 panel, which provides an opportunity of using lcWGS technology for species that lack 86 a haplotype reference panel.

87
In this study, we evaluated the imputation accuracy of lcWGS data with respect to different sequencing depths, population sizes, MAFs, and imputation pipelines using 89 617 Dezhou Donkey animals which were sequenced with an average depth of 3.5x. We    Technologies, Santa Clara, CA). The quality-controlled genomic library for each 113 sample was PE150 sequenced using the Illumina NovaSeq 6000 sequencing system.

114
The average sequencing coverage of the sample data was 3.5x. p-value > 1e-6.

131
Evaluation of imputation accuracy 133 We evaluated the imputation accuracy using the sequencing data of 18 Dezhou 134 Donkey animals, which were sequenced with an average sequencing coverage of 13.5x. with distribution of N (0, G 2 ), where 2 is the additive genetic variance and G is the 157 genomic relationship matrix, X and Z are the incidence matrices for b and a, 158 respectively, and e is the vector of random residuals with distribution of N (0, I 2 ).
159 Two-trait model: where the meanings of the vectors and matrices are the same as those in the single-trait The dosage-based G matrix was constructed as follows.
Here, is the centralized marker dosage matrix whose elements are zero-centered 180 genotype dosages. is the sum of variances for every column of . 181 We used both of the two strategies to construct G. Different marker densities were 182 also considered when constructing G to evaluate the effect of marker density on the 183 performance of genomic prediction. The effects of sample size and sequencing depth 214 We compared the genotype accuracy and genotype concordance for imputation 215 with different sample sizes (200, 400 and 600) and sequencing depths (1x, 1.5x and 216 3.5x) (Figure 2). In general, as expected, the genotype accuracy and genotype  is only slightly better than that using the genotype-based G matrix. Since the 305 improvement was rather small, we infer this may be due to the high accuracy of 306 imputation.
Noticeable increases in genomic prediction accuracy were observed when using a 308 two-trait model compared with using a single-trait model. It has been long widely In this study we demonstrated that the pipeline BaseVar + STITCH is a good choice for       Figure 1 Comparison of imputation accuracy using two imputation pipelines BaseVar + STITCH: Calling SNPs using BaseVar and imputing missing genotypes using STITCH Bcftools + Beagle: Calling SNPs using Bcftools and imputing missing genotypes using Beagle a, b and c: genotype accuracy for chromosomes 1, 19, and 30, respectively; d, e and f: genotype concordance for chromosomes 1, 19 and 30, respectively The effects of sample size and sequencing depth on imputation accuracy using the pipeline BaseVar + STITCH a, b and c: genotype accuracy for chromosomes 1, 19, and 30, respectively; d, e and f: genotype concordance for chromosomes 1, 19, and 30, respectively Figure 3 The effect of minor allele frequency on imputation accuracy using the pipeline BaseVar + STITCH a, b and c: genotype accuracy for chromosomes 1, 19, and 30, respectively; d, e and f: genotype concordance for chromosomes 1, 19, and 30, respectively