Imputation quality
In our study, we obtained similar imputation results with both a within-breed reference panel and a multi-breed sequence reference panel, provided that the breeds included in the reference population are quite similar to the imputed breeds (Figure 2). We assume that using a wider reference panel covers best the genetic variability of the breed than the very little number of sequenced individuals of the breed in the VarGoats project.
Removing pedigree information before imputation strongly deteriorated the correlations (R), which decreased by 5.1% to 5.4% depending on the breed. CRs were less impacted, showing a decrease of 0.9% to 1.5% depending on the breed. In our conditions, a complete pedigree therefore seems useful to improve the accuracy of the imputed genotypes.
Even though the CRs obtained in this study were similar to those involving equivalent reference population sizes in other species, the correlations were significantly lower than in dairy cattle [29] or poultry [30]. Indeed, CRs ranged from 0.75 to 0.85 in Li et al. [29] and the genotype CR was around 0.8, depending on the chromosome, in Ye et al. [30]. However, the squared correlations for cattle breeds with similar population sizes ranged from 0.63 to 0.76 in Li et al. [29]. Binsbergen et al. [31] obtained results similar to ours with a reference population size of nearly 46 Holstein individuals. They performed direct imputation from 50k genotypes to sequence level and obtained a mean correlation of 0.37 between true and imputed sequences.
One reason explaining our low correlations could be a negative effect of the large number of variants with a low MAF in our dataset (Figure 4), for which the correlations decrease rapidly at lower MAF values (Figure 3). Also, a slight drop in R was observed for variants with a MAF of 0.50 in Saanen goats (Figure 3). This could be linked to the small number of variants with a high MAF (Figure 4), and thus implies that an imputation error would have a major impact on the final correlation. Nonetheless, the correlations that we obtained even for high MAF were low in comparison with other studies in livestock [3,30,31] or humans [32] where the correlation between true and imputed genotypes could reach 0.8. As our imputation study involved very few sequenced individuals (33 and 40), a single imputation error would drastically reduce this correlation.
Another possible explanation is the considerable genetic diversity of the Capra hircus species [1,33]. According to the French Varume project, the number of ancestors contributing to 50% of the gene pool is higher in French Alpine and Saanen (16 and 15 respectively) than in French dairy cow breeds: 7 in Montbéliarde and Holstein, 8 in Normande (Danchin-Burge, Institut de l’Elevage, personal communication). Besides, as shown by Carillier et al. [33], the LD is lower in French goats than in dairy cattle. A low LD might make phasing more difficult as the distance to a 50k marker and therefore the number of potential recombinations increase, leading to imputation errors.
Moreover, the reference populations used for imputation were small and some individuals had initial low-depth sequences (coverage < 10X). Parts of their sequence genotypes remain uncertain.
Nonetheless, most of the QTL regions previously identified with GoatSNP50 BeadChip data were detected on the sequence data with refined signals and increased significance, which suggests that the imputed sequence could be suitable for association analysis.
Nevertheless, the significance of the detected regions or the identification of new regions could be further increased by improving imputation quality. As only a limited number of animals in each breed were sequenced, some genotyping errors might be erased by increasing this number. According to Druet et al. [14], at a given sequencing effort, it is preferable to sequence more individuals at a lower coverage to detect rare variants and to call genotypes accurately than to sequence deeply few individuals.
Association analysis
The p-values for sequence data were slightly higher than for 50k data and clear distinct signals were identified when imputed sequence data were used for association analyses (Figure 7). Sanchez et al. [10], conducted similar studies on the Montbéliarde dairy cattle breed and reported a major increase in the significance of the detected signals when using sequence data, rather than HD or 50k genotypes. However, the imputation accuracy was greater than in our study. Two-step imputation has proven to be more efficient and in our case could dramatically increase the imputation quality. Binsbergen et al. [31] obtained a correlation similar to ours (0.37) when imputing directly from 50k to sequence level using 46 sequenced individuals. This correlation increased and reached 0.65 when an intermediate HD step was introduced. A HD chip is not yet available for caprine species, it would be a very powerful tool that would improve imputation to sequence level by overcoming imputation errors.
Our results varied depending on the imputation scenario used to impute the available 50k genotypes to sequence level. In the Alpine breed, using a multi-breed reference panel resulted in the detection of a new signal for milk yield on chromosome 2 (Figure 5). This signal does not appear when using the 50k genotypes (Figure 7). Significant variants in this region are annotated for CRYBA2, CDK5R2 and FEV genes which are not explicitly related to the mammary gland or any metabolic path linked to milk production. According to the RumimiR database, the region close to the signal is rich in miRNA: 12 miRNAs are located in a range of 1 Mb around the signal in the caprine species. However, among them, only 2 are expressed in mammary tissue: chr2_2187 (29.47 Mb) and chr2_2972 (29.80 Mb) and 4 are expressed in the ovaries: _Novel: bta-miR-153 (28.57 Mb), _Novel: bta-miR-26a (29.39 Mb), LO-m0073 (29.80 Mb) and FO-m0047 (29.80 Mb). Their exact role and impact on milk yield is still unknown. Further analysis is therefore required to confirm or disprove the signal detected on chromosome 2 for milk yield.
In the Saanen breed, the multi-breed reference panel led to the detection of a new signal for milk yield on chromosome 5 while confirming the involvement of a large area on chromosome 19. Significant variants on chromosome 5 are annotated for MDM1 gene, however the link between this gene and milk production is not clear. According to the RumimiR database, there are a few miRNAs in goats that are also located on chromosome 5 near our signal: novel_mir299 (44.53 Mb) discovered in blood samples, chr5_4536_mature (42.17 Mb) and chr5_4548_mature (45.82 Mb). The two latter are expressed at high levels in mammary tissue but are located further away from the signal and their exact involvement in milk yield is not known. As the signal only appeared when imputing using a French multi-breed reference panel, PCA was performed using PLINK for the region of chromosome 5 between 44.804 Mb and 44.816 Mb to try to understand where the imputed frequencies in the region came from. No significant breed group was found using PCA which implies that the QTL might actually be present in other French goat breeds.
In Saanen goats, a larger number of significant variants for milk production were detected on chromosome 19 and deeper investigation is required in this area that is also linked to udder health and conformation [6] and semen production. Multi-breed imputation gave the highest number of significant variants for milk yield (Table 4). The variants are annotated for 91 genes including 3 miRNAs (mir195, mir324 and mir497). Top 10 most significant variants (p-values between and ) are annotated for 2 genes (SCIMP and ZNF232). One of the 10 most significant variants (26,099,146) is situated in an intron of an unknown gene (GENE_id401516). Our study does not allow us to isolate functional candidate genes with certainty as it would require functional analysis. Nevertheless, the proximity of our signal to the ALOX genes cluster constitutes an interesting lead as the latter genes are implicated in lipid metabolism.
For semen production and more particularly semen volume, using a multi-breed reference panel considerably increased the noise observed on Manhattan plots (Figure 6) making it difficult to distinguish true signals from what could be false positives. DYDs for this trait are more precise as they are derived from multiple repeated data from on average 100 daughters per buck whereas semen traits are the bucks’ own limited number of repeated performances. The significance of the signal (Figure 6) and the number of significant variants (Table 4) decreased slightly when a multi-breed reference panel was used compared with within-breed imputation. When imputing within-breed, 209 variants reached the chromosome significance level on chromosome 19. These variants are annotated for 61 genes. Four of the identified genes, (PELP1, ELP5, NEURL4 and CNTROB) are broadly expressed in testes. One gene (CHD3) is ubiquitous in the prostate, and another, YBOX2 (Y-box Binding Protein 2) is restrictedly expressed in testes. YBX2 is a member of the Y-box gene family that encodes a transcription factor and is specifically expressed in germ cells. Knock-out mice for this gene are of normal appearance but are sterile [34]. Mutations in this gene in humans are associated with male fertility disorders such as azoospermia and oligospermia [35]. A significant 23-bp deletion at position 26,614,373 was found in the French Saanen breed, close to the mature miRNA chi-miR-497 (26,614,406 – 26,614,427). The same variant is also located near chi-miR-195 (26,614,085 – 26,614,104). Both miRNAs are ubiquitously expressed in testicular cells and might have an impact on semen production traits.
Figure 7 here
A pleiotropic region for milk, type traits and udder health was previously identified on chromosome 19 for the Saanen breed by Martin et al. [6]. Our study confirmed that a 3.5 Mb region was involved in milk production. For milk yield and semen volume, when sequences were imputed within-breed, top 10 variants had MAF comprised between 0.39 and 0.44 in the Saanen breed. The CLIP test rejected the pleiotropy assumption. The observed correlation was estimated at 0.013 and the threshold not to reject pleiotropy was above 0.15. The two traits might therefore be controlled by two different mutations situated close to each other. Moreover, none of the top 10 variants is shared between the two traits. According to the estimated effects, the allele with the highest frequency in the QTL region decreases both SV (-0.09 SD) and milk yield (-0.51 SD). Such an association therefore shows a favorable condition for improving both semen quantity and milk production.