Estimation of Breeding Values Using Different Densities of Snp to Inform Kinship in Broiler Chickens


 Background: Traditionally, breeding values are estimated based on phenotypic and pedigree information using the numerator relationship (A) matrix. With the availability of genomic information, genome-wide markers can be included in the estimation of breeding values through genomic kinship. However, the density of genomic information used can impact the cost of implementation. The aim of this study was to compare the rank, accuracy, and bias of estimated breeding values (EBV) for organs [heart (HRT), liver (LIV), gizzard (GIZ), lungs (LUN)] and carcass [breast (BRST), drumstick (DRM) and thigh (THG)] weight traits in a broiler population using pedigree-based BLUP (PBLUP) and single-step genomic BLUP (ssGBLUP) methods using various densities of SNP and variants imputed from whole-genome sequence (WGS). Results: For both PBLUP and ssGBLUP, heritability estimates varied from low (LUN) to high (HRT, LIV, GIZ, BRST, DRM and THG). Regression coefficients values of EBV on genomic estimated breeding values (GEBV) were similar for both the high density (HD) and WGS sets of SNPs, ranging from 0.87 to 0.99 across scenarios. Conclusion: Results show no benefit of using WGS data compared to HD array data using an unweighted ssGBLUP. Our results suggest that 10% of the content of the HD array can yield unbiased and accurate EBV.


Background
Traditionally, breeding values are estimated based on phenotypic and pedigree information by pedigree-based BLUP (PBLUP) using the numerator relationship (A) matrix [1]. With the advent of genomic selection and the availability of dense SNP arrays, genomic information has been included in the estimation of breeding values. Currently, many genetic evaluation systems have implemented a single-step genomic BLUP (ssGBLUP) [2] approach that makes use of genomic, phenotypic, and pedigree data simultaneously. This approach combines the A matrix with the genomic relationship matrix (G) into a single kinship matrix (H) [3]. The bene t of this approach is in the ability to account for Mendelian inheritance information and thus a more accurate prediction of breeding values can be obtained as compared with PBLUP.
Despite the reduction in the cost of genotyping, it still represents a non-trivial cost. Consequently, the ability to optimize the cost of implementing genomic selection and the rate of genetic gain from having done so is of interest. One potential way to do this is to reduce the proportion of animals genotyped in a strategic manner [4]. Another option is to simply reduce the density of the marker panel used. Theoretically, denser SNP panels lead to an increased probability that any QTL (Quantitative Trait Loci) is in perfect linkage disequilibrium (LD) with a SNP [5]. However, the use of high density (HD) SNP panels in forming a genomic relationship matrix has not been shown to provide signi cant improvements in accuracy [2]. Despite numerous studies, it is unclear what the optimal density of a SNP panel would be to achieve increased estimated breeding value (EBV) accuracies with minimal genotyping costs.
Recently, efforts have been allocated to whole-genome sequencing (WGS) and using this information to estimate EBV both in real data [6] as well as in simulated data [7][8][9]. Thus, it is expected that data obtained by sequencing the whole genome include the causal mutations underlying the QTL, which would enable estimating the trait QTL effect regardless of LD between the SNPs and QTL [10]. Performing WGS at moderate to highdepths for every animal in a population would be cost prohibitive to many if not all livestock breeding programs. A less expensive solution would be to genotype individuals with less expensive SNP panels and impute sequence variants throughout the population by only sequencing targeted individuals. Simulated data has shown an increase in genomic prediction accuracy when the causal mutations were included in the analyses [7,8,11]. Interestingly, this has not always been the case in real data using cattle and chickens [9,10,12].
Although the expectation that genomic selection using HD panels and even WGS data increase prediction accuracy in chickens for traits that are di cult or costly to measure, it is unclear what marker density is su cient. Therefore, the aim of this study was to compare the rank, accuracy, and degree of bias of estimated breeding values (EBV), as well as, genomic estimated breeding values (GEBV) for organs (heart, liver, lungs and gizzard) and carcass (breast, thigh and drumstick) traits in a broiler population using PBLUP and ssGBLUP with various densities of SNP and variants imputed from whole-genome sequence.

Methods
All experimental protocols related to animals in this study were performed in agreement with the resolution number 010/2012 approved by the Embrapa Swine and Poultry Ethics Committee on Animal Utilization to ensure compliance with international guidelines for animal welfare.

Population And Phenotypes
The chicken population used in this study was derived from a TT broiler line belonging to the Poultry Breeding Program from the Embrapa Swine and Poultry National Research Center. Since 1992, multi-trait selection has been applied in this line, mainly focused on traits such as body weight, feed conversion, carcass weights and yield, fertility, hatchability, and to reduce abdominal fat and metabolic syndromes [13,14]. The TT reference population is a broiler population developed for genomic studies in 2007 from the mating between 92 females (one from each female family) with 20 males (one from each male family) in a hierarchical scheme (1 male: 5 females) producing approximately 1,500 chickens of both sexes from ve hatches. Matings between relatives were avoided to improve the genetic variability as described by Marchesi et al. [15].
A total of 1,453 animals (703 males and 750 females) were weighted at 42 days of age (BW42) after six hours of fasting and euthanized by cervical dislocation followed by bleeding, according to the approval of the Embrapa Swine and Poultry Ethical Committee of Animal Use (CEUA), under protocol number 010/2012. Blood samples from each animal were collected for DNA extraction and the eviscerated carcass was cooled. After four hours of cooling (4 °C), the carcass (breast, drumstick, and thigh) and organs (heart, liver, gizzard and lung) were weighed. More details about the rearing condition and phenotypes measurements are available in Fornari et al. [16].
Descriptive statistics for the carcass and organ traits involved in the study (Table 1) were obtained through the PROC MEANS procedure of SAS® (SAS 9.4, SAS Institute).

Genotypes
Blood samples of each animal (1,453) were used to extract DNA using PureLink® Genomic DNA (Invitrogen, Carlsbad, CA, USA) kit and quanti ed with Qubit® 2.0 Fluorometer (Thermo Fisher Scienti c, Waltham, MA, USA). After extraction, the diluted genomic DNA was prepared following Affymetrix protocol to perform the genotyping analysis using 600K Affymetrix Axiom Genotyping Array (HD) (Affymetrix, Inc. Santa Clara, CA, USA). This genotyping array was developed using segregating SNPs identi ed in chicken populations, including four commercial broiler lines, as described by Kranis et al. [17].
Axiom™ Analysis Suite (Affymetrix®) software was used to lter based on DishQC parameter, and then PLINK v.1.9 software [18] was used to perform quality control analysis and genotype calling. Samples that exhibited DishQC of ≥ 0.82 and call rate of ≥ 90% were kept. In order to select markers with high quality, a SNP quality control was applied for removing SNP with call rate lower than 98%, MAF lower than 2% and signi cant deviations from HWE (p-value < 10 − 7 ) leaving 370,608 SNP for further analysis [19]. Imputation Data from WGS were obtained using the Illumina HiSeq2500® System (Illumina, Inc., San Diego, EUA) with coverage of 10X for 84 animals from Brazilian broiler and layer lines; 14 of those were randomly selected from the 20 males used in the matings to obtain the TT reference population.
These data were aligned to Build 5 of the chicken reference genome (Gallus_gallus-5.0) with BWA (version GCA_000002315.3). The read alignment, as well as variant calling and quality control, were performed following the same pipeline adopted by [20] and [19].
After ltering, 12,577,770 SNPs remained in the set of 84 animals sequenced and were used as the reference dataset to impute the HD array to sequence data. Imputation from HD to WGS was performed using BEAGLE 4.1 software [21] with 20 iterations. Imputation accuracy was assessed using the validation subset approach. Sequenced individuals (n = 84) were randomly divided into 14 subsets with 6 animals per group and each group was used as validation set once. The imputation process was carried out again for each validation subset masking the SNPs from HD, and then the imputed values for the validation set were compared to their observed values from sequence. Imputation accuracy was de ned as the average of squared correlation between observed and predicted variants. The accuracy of imputation was 0.84.
After imputation, a quality control was applied to select the sequence variants with a MAF greater than 0.015 and imputation accuracy equal to or greater than 0.95 (e.g., r² ≥ 0.95) which left 1,421,371 SNPs for further analysis. By using this MAF it is expected that the chance of detecting segregating SNPs is greater, thus it would reduce the cost of genotyping of non-segregating selected SNPs. Furthermore, SNPs were classi ed into ve classes by Variant Effect Predictor (VEP) software (version vep-93.4; [22]) using galGal5 as reference genome. The sequence variants selected to be included in further analysis were UTR3', UTR5', downstream, upstream and intergenic regions of the genome. Genetic variants annotated in those regions were considered potentially functional and thus could have a role in the regulation of the phenotype or even be responsible for controlling gene expression [20]. To ensure that the WGS dataset had the same number of variants as the HD array set, the common genetic variants between those data sets were removed from WGS data, leaving only the non-common variants. After the selection of non-common variants, 1,095,053 SNPs remained to compose the WGS dataset, which consisted of 69% of intergenic regions of the genome, 16% of downstream, 14% upstream and 1% of UTR3'and UTR5', respectively. From the non-common variants, 370,608 SNPs were randomly selected (replicated 10 times) to compose the nal WGS dataset and ensure a random representation of the entire genome.

Prediction
Correlations between EBV using PBLUP and ssGBLUP with different SNP subsets were high and ranged from 0.88 to 0.94 for organ traits and from 0.92 to 0.95 for carcass traits. Regression coe cients were similar for both the HD and WGS sets of SNPs, ranging from 0.87 to 0.99 across subsets. For HRT, GEBV tended to be overin ated with regression coe cients ranging from 0.87 to 0.89 when less than 80% of SNPs were used from the HD dataset. The same pattern was observed using imputed sequence data in which the regression coe cients were also less than 1 the use of WGS variants did not result in regression coe cients closer to 1 compared to HD panel, except for LUN.
Regarding the predictive ability, traits with higher heritabilities (e.g. GIZ and DRM) showed higher predictive ability than traits with lower heritabilities (e.g. LUN) ( Table 5). Compared to PBLUP, the predictive ability of ssGBLUP was greater when at least 5% of SNPs were used (HD and WGS sets), excepted for THG and BW42, where the predictive ability of ssGBLUP was greater than PBLUP when at least 10% of SNPs were used in HD panel for the evaluated traits. However, for THG and BW42 the predictive ability of PBLUP was greater than ssGBLUP when WGS dataset was used.    where y 1 and y 2 are the vector of observation for each evaluated trait (y 1 ) and BW42 (y 2 ); X 1 and X 2 are the design matrices for xed effects; b 1 and b 2 are the vectors of xed effects (sex and hatch) for the rst and second trait, respectively; Z 1 and Z 2 are the design matrices for random effects; u 1 and u 2 are the vector of random additive genetic effects; e 1 and e 2 are the vector of random error effects. The vector of genetic effects, u = [u ' 1 , u ' 2 ], was assumed to be multivariate normal distributed with mean 0 and variance\varvecA − 1 ? F and \varvecH − 1 ? F for PBLUP and ssGBLUP, respectively, where ⊗ is the Kronecker product and Φ is the additive genetic (co)variance matrix of trait 1 and 2. The vector of residuals, e = [e ' 1 , e], was assumed to be multivariate normal distributed with mean 0 and variance \varvecI ⊗ \varvecR, where \varvecI was an identity matrix and \varvecR was the residual (co)variance matrix.

[ ] [ ][ ] [ ][ ] [ ]
The H matrix combines information from numerator relationship matrix (A) and genomic matrix (G). The inverse of H was calculated following the approach of Aguilar et al. [24] as: where A − 1 is the inverse of a numerator relationship matrix; G − 1 is the inverse of a blended genomic matrix; and A − 1 22 is the inverse of a pedigree-based relationship matrix for genotyped animals only. The G blended matrix was obtained as follows: \varvecG = 0.95\varvecG \varvecw + 0.05\varvecA 22 where: \varvecA 22 is the pedigree-based relationship matrix for genotyped animals only; G w is the genomic matrix obtained following [12,25]: where: M is the SNP matrix, coded as 0, 1 or 2, and p i is the allelic frequency for the i th SNP.

Assessment Of Accuracy And Bias
Spearman correlations between EBV from PBLUP and GEBV for ssGBLUP using genotypes from eight subsets (0.5%, 1%, 5%, 10%, 20%, 40%, 80% and 100% of SNPs) from both the HD and WGS imputed set were calculated to determine the impact of using reduced subsets of SNP on EBV rank.
Approximately one-third of the animals had their phenotypes masked and were chosen to be in the validation set. These animals were randomly selected, and three subsets were created to ensure that all the animals were in the validation set once. Predictive ability of EBV, from PBLUP and ssGBLUP, was de ned as the correlation (r) between EBV and phenotypes corrected for xed effects (y*) for animals in the validation set for each trait [26]: r = cor(EBV, y * ) Moreover, the regression coe cients of EBV on GEBV in each scenario were calculated to evaluate the degree of in ation/de ation of GEBV.

Descriptive statistics
The descriptive statistics of phenotypic data are presented in Table 1. Imputation accuracy estimated by Beagle and assessed using the validation subset approach was 0.84 and ranged from 0.79 to 0.88. After ltering, the distribution of MAF for the HD array was uniform while the MAF distribution for WGS variants retained for further analyses was not (Fig. 1). The estimates of variance components and heritability for organ and carcass traits from PBLUP and ssGBLUP using both HD and WGS subsets are given in Table 2. Small numerical differences in parameter estimates were observed between PBLUP and ssGBLUP. Heritability estimates and their standard errors were lower when genomic information was used.
[ ]  Table 2 Additive genetic variance (σ 2 a ), residual variance (σ 2 e ), phenotypic variance (σ 2 p ) and heritability estimates (h 2 ), with their respective standard errors (in brackets) for organ and carcass traits of broiler chickens using PBLUP and ssGBLUP from the high-density (HD) panel and whole genome sequence (WGS) Dataset. Regarding the predictive ability, traits with higher heritabilities (e.g. GIZ and DRM) showed higher predictive ability than traits with lower heritabilities (e.g. LUN) ( Table 5). Compared to PBLUP, the predictive ability of ssGBLUP was greater when at least 5% of SNPs were used (HD and WGS sets), excepted for THG and BW42, where the predictive ability of ssGBLUP was greater than PBLUP when at least 10% of SNPs were used in HD panel for the evaluated traits. However, for THG and BW42 the predictive ability of PBLUP was greater than ssGBLUP when WGS dataset was used.

Discussion
In the present study we chose SNPs at random to be part of reduced subsets to investigate the genomic prediction of carcass and organ traits in broiler chickens. Although the markers were not equally spaced during the selection process, they were present in at least one chromosome across all genome.

MAF distribution
The distribution of MAF for the HD array was uniform while the MAF distribution for WGS variants retained for further analyses was not (Figure 1). Unlike other studies [10,12], the variants used in the current study did not show a U-shaped MAF distribution for WGS data. In accordance with Ni et al. [27], which also found a non-U-shaped MAF distribution for sequence data in layer chickens, this distribution in WGS data may have occurred due to two possible reasons. First, some of the rare SNPs in the sequenced animals were removed during the imputation process as a result of poor imputation accuracy of SNPs with low MAF. Second, these same rare SNPs were not available in all animals in the population.
Heritability for pedigree-based and genomic models Estimates of variance components and heritability for carcass and organ traits obtained through PBLUP and ssGBLUP are provided in Table 2. The heritability estimates varied from low (LUN) to moderate (HRT, LIV and THG) and high (GIZ, BRST and DRM) and the standard errors associated with those estimates were low.
Pedigree-based heritability estimates have been reported in the literature for most of the traits used in this study. Using the same population (Embrapa TT), Venturini et al. [14] reported similar pedigree-based heritability estimates for LIV (0.33±0.07), GIZ (0.44±0.08) and BRST (0.37±0.07) to those reported herein. However, the heritability estimate found in this study for DRM (0.43±0.08) and THG (0.39±0.07) was than the estimate in Venturini et al. [14]  The heritability estimates for LUN in broiler chickens are not common in the literature. Using an F 2 experimental population, Ledur et al. [31] reported similar pedigree-based heritability estimates for LUN (0.10) to the result reported herein. Although LUN is not considered an economically important trait, it has been related to pulmonary hypertension (e.g. ascites). Heritability estimates for ascites have been reported by several authors [32][33][34][35][36]. The use of a multi-trait model may be responsible for higher estimates of heritability reported in this study compared to the literature since multi-trait models use additional genetic information from links with other traits [37].
Heritability was also estimated using the H matrix instead of the numerator relationship (A) matrix which resulted in relatively small differences between the estimates ( Table 2). In general, the genomic heritability is smaller than heritability estimated using only the pedigree and phenotypic information [38].

Correlation between EBV and GEBV
Across all traits, EBV estimated with at least 0.5% of SNP were highly correlated with EBV estimated from the complete HD (minimum correlation of 0.94) and the minimum average correlation between 0.5% of SNP and PBLUP was 0.89. Indeed, lower correlations were observed when a smaller number of SNP sets were used, but correlations between predicted breeding values were higher when the genomic matrix was incorporated in the analyses, regardless the SNP set selected. A similar pattern was observed with WGS subsets.
Comparing the correlations between EBV using different SNP subsets showed no differences when 10% or 100% of SNPs (mean correlation 0.99) were used in the analyses which suggests that the use of an evenly-spaced lower-density panel could provide a very similar ranking of EBV at a potentially lower cost, as proposed by Habier et al. [39]. When applied to Japanese black cattle, Ogawa et al. [40] suggested that using at least 4,000 equally spaced SNPs was su cient for genetic prediction for carcass weight and marbling score. Using a 50K chip, Rolf et al. [41] reported that between 2,500 and 10,000 SNPs distributed throughout the genome could be used to form a G matrix to accurately predict EBV for feed e ciency in Angus cattle. The current study supports the use of reduced subsets of SNP, demonstrated in other species, for genomic prediction in broiler chickens.

Regression coe cients
The regression coe cients of EBV on GEBV quanti es the bias in the variance of the estimated breeding value for each SNP subset (Tables 3 and   4). Regression coe cients were similar for both the HD and WGS sets of SNPs, ranging from 0.82 to 0.99 across subsets. The regression coe cients increased as the proportion of SNP increased reaching a plateau between 5 and 10% of SNPs. Overall, all the regression coe cient values were the same when HD or WGS was used. In practice, regression coe cient equal to one indicates no bias. Except for LUN, our results showed regression coe cients lower than one for both HD and WGS sets, which means the variance of breeding values were overestimated. However, according to Tsuruta et al. [42] deviations of ± 15% from unity are acceptable.
Using a pure layer line, Yan et al. [43] used bias to compare PBLUP and ssGBLUP prediction methods. These authors used the regression coe cients of phenotypes corrected for xed effects on predicted (G)EBV and reported that EBVs from ssGBLUP were less biased than those from PBLUP. Also, in layer lines, Heidaritabar et al. [12] reported high regression coe cients (greater than 1) when PBLUP or GBLUP methods were used. According to these authors, regardless the incorporation of genotypes from a 60K SNP panel and sequence data the regression coe cients remained greater than 1, indicating an underestimation of the breeding value variance.
Bias differences among the methods may be explained by directional selection [44]. In the present study, multi-trait selection has been applied in this line, mainly focused on traits such as body weight, feed conversion, carcass weights and yield, fertility, hatchability, and to reduce abdominal fat and metabolic syndromes [13,14].

Predictive ability
Predictive abilities across all traits are reported in Table 5. Compared to PBLUP, the predictive ability assessed by ssGBLUP was higher for most of the traits when at least 5% of SNPs were used. The incorporation of sequencing variants is generally thought to have the potential to improve predictive abilities, since it is expected that a high proportion of genetic variation may be explained when a high-density panel or even sequencing data are used. Although WGS increases the number of markers, most of them are in incomplete LD with causal mutations. Variants in incomplete LD with causal mutations limited the increase of prediction abilities, thus the use of variants in strong LD with causal mutations could be useful in improving the accuracy of genomic prediction [45].
The concept of reducing SNP density as a solution for genotyping costs has been widely reported [39-41, 46, 47]. While accuracy estimates using a genomic relationship matrix appear to be better than those from a pedigree relationship matrix, our results show no difference in genomic prediction accuracy when a reduced number of SNPs were used to t the genomic relationship matrix, indicating that at least 10% of SNP from the HD panel (~37,000 SNPs) can be used in genomic evaluation. A previous study with dairy cattle [48] has shown small genomic prediction gains with the increased marker density from medium density (~54,000K) to HD (~777,000K). Agreeing with our ndings, similar results were also obtained in pigs by Zhang et al. [37] which increasing marker density (80K, 650K and WGS) had a little or no advantage in genomic prediction for feed intake.
Simulated data has shown an increase in genomic prediction accuracy when the causal mutations were included in the analyses [7,8,11]. Contrary to those ndings, our study showed no signi cant increase in prediction accuracy when using WGS variants as opposed to SNP from the HD. Other authors have also observed lower or no signi cant bene ts in predictive ability gain using sequence data comparing with SNP arrays [6,9,10,12,45,49,50].
The in nitesimal model used herein (ssGBLUP), whereby all markers are assumed to have an effect and a common variance, showed no signi cant increase in prediction accuracy using WGS variants as compared to the HD markers. In a simulated study, Clark et al. [51] suggested that the increase of genomic prediction accuracy would be small when the trait is highly polygenic especially in a small reference population.
Furthermore, false positives, including sequencing, alignment and calling errors, which are not included in simulated analysis but are present in real data, can also be responsible for these results [49].
Another possible reason is related to the population structure. When a small effective population size undergoes selection for an extended period of time no signi cant gains in prediction accuracy are obtained regardless of using HD panel or WGS dataset [11]. Thus, in highly selected population, almost all of the genetic variance can be explained by the SNPs genetic variance as result of the relationship between individuals [52].
Although imputation accuracy was not the main objective of the present study, it can help to explain why sequence data did not show a superior predictive ability compared with the HD panel. In our study, the average of imputation accuracy assessed using the validation subset approach was 0.84. Although this value is suitable, possible errors in the genomic map might be responsible for reducing the imputation accuracy since those errors may decrease the accuracy of prediction and interfere in the detection of causal mutations [53]. welfare.

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analyzed during this study are public and included in this published article. The datasets used and/or analyzed during the current study (genotypes and phenotypes) are available from the corresponding author on reasonable request.

Competing interests
Not applicable.