New insights from genome-wide association analysis using imputed whole-genome sequence: the genetic mechanisms underlying residual feed intake in chickens

Background: Poultry feed occupies the largest cost of poultry production, which is estimated up to 70%. Moreover, it is pressure on the agricultural industry to reduce emissions and improve its environmental footprint, simultaneously increasing output to meet the growing demand for protein worldwide. Therefore, improving feed efficiency (FE) play an important role to improve profits and their environmental footprint in broiler production. In this study, using imputed whole genome sequencing (WGS) data, genome-wide association analysis (GWAS) was performed to identify SNPs and genes associated with residual feed intake (RFI) and its component traits. Furthermore, transcriptomic analysis between high- and low-RFI groups was performed to validate candidate genes from GWAS. Results: Results showed that heritability estimates of average daily gain (ADG), average daily feed intake (ADFI) , and RFI were 0.29 (0.004), 0.37 (0.005), and 0.38 (0.004), respectively. Using imputed sequence-based GWAS, we identified seven significant SNPs and five candidate genes ( MTSS I-BAR domain containing 1, MTSS1; folliculin, FLCN; COP9 signalosome subunit 3, COPS3; 5',3'-nucleotidase, mitochondrial, NT5M; and gametocyte specific factor 1, GTSF1) associated with RFI, twenty significant SNPs and one candidate gene ( inositol polyphosphate multikinase, IPMK ) associated with ADG, and one significant SNP and one candidate gene ( coatomer protein complex subunit alpha, COPA ) associated with ADFI. After performing transcriptomic analysis between high- and low-RFI groups, both 38 up-regulated and 26 down-regulated genes were identified in high-RFI group. Furthermore, integrating regional conditional GWAS and transcriptome analysis, ras related dexamethasone induced 1 (RASD1) was the only one overlapped gene associated with RFI, which also suggested that the region (GGA14: 4767015 -4882318) is a new quantitative trait locus (QTL) associated with RFI. Conclusions: In conclusions, using imputed sequence-based GWAS is an efficient to identify significant SNPs and candidate genes in chicken. Our results provide valuable insights into the genetic mechanisms of RFI and component traits, which would further improve the genetic gain of feed efficiency rapidly and cost-effectively in the context of marker-assisted breeding selection.

kb on GGA6. After stepwise conditional analysis, the level of all significant or suggestive loci around the lead SNP decreased below the genome-wide suggestive threshold in the conditional GWAS ( Figure S3). And all significant SNPs were located in intergenic region, the nearest gene was inositol polyphosphate multikinase (IPMK) (Table 3). Additionally, the genomic inflation factor of ADG was 1.024, which indicated that the results of GWAS of ADG was acceptable.

Imputed sequence-based GWAS for ADFI
Using the univariate model, sequence-based GWAS were performed for ADFI, and the Manhattan and Quantile-Quantile (QQ) plots of GWAS results of ADFI were shown in Figure   1B. Both one significant SNPs (Table 3) and 140 suggestive significant SNPs (Table S1) associated with ADFI were identified using the threshold of suggestive and significant Pvalues (4.98 × 10−6, and 2.49× 10−7). These SNPs were located on these chromosome (GGA1, GGA3, GGA4, GGA14, and GGA25). And the most significant SNPs located at 1,147,989bp position of GGA25. And highly linkage disequilibrium between SNPs of significant regions were found on GGA25 ( Figure S4). After stepwise conditional analysis, the level of all significant or suggestive loci around the lead SNP (rs15997392) decreased below the genome-wide suggestive threshold in the conditional GWAS ( Figure S5). A total of 7 genes were found in the region, which extended 50 kb flanking regions in both upstream and downstream of lead SNP (rs15997392) position ( Figure S5). And the independent significant SNP (rs15997392) was an intron variant in coatomer protein complex subunit alpha (COPA) (Table 3). Additionally, the genomic inflation factor of ADFI was 1.024, which indicated that the results of GWAS was acceptable.

Imputed sequence-based GWAS for RFI
Using the univariate model, sequence-based GWAS was performed for RFI, and the Manhattan and Quantile-Quantile (QQ) plots of GWAS results of RFI were shown in Figure  1C. Both 7 genome-wide significant SNPs (Table 3) and 100 suggestively significant SNPs (Table S1) associated with RFI were identified using the threshold of suggestive and significant P-values (4.98 × 10−6, and 2.49× 10−7). These significant SNPs were mainly located on these chromosome (GGA2, GGA14, and GGA27). And the most significant SNP was located at 4,779,635 bp on GGA14, which explained 5.46% of the phenotypic variance of RFI. Using SnpEff software, five candidate genes associated with RFI were identified and annotated (Table 3). There are MTSS I-BAR domain containing 1 (MTSS1), folliculin (FLCN), COP9 signalosome subunit 3 (COPS3), 5',3'-nucleotidase, mitochondrial (NT5M), and gametocyte specific factor 1 (GTSF1). After LD analysis, the highly linkage disequilibrium between significant SNPs was found on GGA14 and GGA27 ( Figure S6, S7). To find the independent SNPs in a chromosome, stepwise conditional analysis was performed by adding the lead SNP to the model as fixed effect. The level of all significant or suggestive loci around the lead SNP decreased below the genome-wide suggestive threshold in the conditional GWAS ( Figure 2A). Extended 50 kb flanking regions in both upstream and downstream of lead SNP position of GWAS results, five genes (ENSGALG00000004816, COP93, NT5M, MED9 (mediator complex subunit 9), ras related dexamethasone induced 1 (RASD1)) were indicated on GGA14 (Figure 2A), and six genes (ENSGALG00000027009, ENSGALG00000045610, ENSGALG00000027214, GTSF1, golgi SNAP receptor complex member 2 (GOSR2), and ENSGALG00000037637) on GGA27 ( Figure 2B). In addition, the significant SNP of GGA2 was an isolated signal, which suggested it maybe a false positive significant site. Therefore, only two independent SNPs (rs314690911 and rs317155749) were suggested significance association with RFI in this chicken population. Both substitutions variants of rs314690911 (A to G) and rs317155749 (G to A) led to a significant decrease of in RFI phenotypic value ( Figure 3). Additionally, the genomic inflation factor of RFI was 1.002, which closes to 1.00 reflects the results of GWAS was acceptable.

Validation of candidate genes from GWAS results based on differentially expressed genes (DEGs)
To validate the list of candidate genes from GWAS results, we performed transcriptomic analysis to identify DEGs between high-and low-RFI groups in chicken. There were 64 genes differentially expressed between high-and low-RFI with gene expression fold change ranging from -5.33 to 5.20. Compared with low-RFI group, both 38 up-regulated genes and 26 down-regulated were identified in high-RFI group. All of significant DEGs between high-and low-RFI groups was shown in Figure.4 and Table S2, and the top 10 differentially expressed genes were monoamine oxidase A (MAOA), heat shock protein 90 beta family member 1 (HSP90B1), cytochrome P450 family 2 subfamily C polypeptide 23a (CYP2C23a), liver enriched antimicrobial peptide 2 (LEAP2), RASD1, ABI family member 3 binding protein (ABI3BP), and ADAMTS like 5 (ADAMTSL5). After functional and pathway enrichment analysis, two molecular function (carbon-carbon lyase activity and unfolded protein binding) and one cellular component (endoplasmic reticulum lumen) categories were significantly enriched (Table S3) Figure S1).

Comparison between significant SNPs from GWAS and reported QTLs
SNPs less than the threshold of genome-wide significant P-values (4.98 × 10−6) were selected to compare with the reported QTLs. These QTLs were collected from the Animal QTL database based on their physical locations. For RFI, only two QTLs located on GGA4 and GGAZ were extracted, and the nearest distance of QTL between significant SNPs was 24,435,812 bp on GGAZ (Table S4). For ADG, a total of 4 QTLs located on autosomes (GGA 1, 3, 6 and 27) were extracted, and the nearest distance of QTL between significant SNPs was 2,588,827 bp on GGA 27 (Table S4). For ADFI, a total of 6 QTLs located on autosomes (GGA 1, 3 and 4) were extracted, and the nearest distance of QTL between significant SNPs was 1,504,353 bp on GGA 4 (Table S4). None significant SNPs located inside of reported QTL were found.

Discussion
Feed efficiency (FE) play an important role to improve profits and their environmental footprint in broiler production. The genetic dissection of RFI and its component traits would further improve the genetic gain of feed efficiency rapidly and cost-effectively. In this study, a chicken population with imputed WGS data was used to perform GWAS for exploring the genetic mechanisms of feed efficiency. To the best of our knowledge, this is the first time to perform GWAS for RFI in chicken using imputed sequence data, and performed transcriptomic analysis to identify DEGs between high-and low-RFI groups in chicken to validate these candidate genes.

GWAS with imputed whole genome sequence data
Recently, GWAS with imputed WGS data have been widely used in livestock species, especially in cattle [16,18,19]. Because of genotype imputation would improve the power of GWAS [20], and reduce the cost of genotyping. However, genotype imputation from SNP array to WGS data no only increased the marker density, but also brought the imputation error. And imputation errors will affect the probability of causal SNPs, which were determined by performing GWAS. Therefore, it is necessary to ensure the highly imputation accuracy before performing GWAS. For obtained higher imputation accuracy, genotype imputation with two-step approach was performed using a combined reference panel in this study. After quality control of the imputed WGS data, the average imputation accuracy (Beagle R2) of all SNPs was 0.871 0.177, and ranged from 0.357 to 0.926 for different chromosomes (Table S5). This high imputation accuracy was full enough to ensure the confidence of GWAS results. This obtained high imputation accuracy may due to performed imputation with two-step approach [21] and using a combined reference panel [22]. Compared with our previous study, which performed GWAS for RFI using 426 chickens with 600 K array data [15], new significant SNPs and genes were identified.
Because of more individuals (626 birds) and higher marker density (imputed WGS data) were utilized to perform GWAS. Although the power of GWAS was improved in this study, it is still difficult to pinpoint the causative variant. Because many significant SNPs existed a high degree of linkage disequilibrium, and had almost equally significantly associated variations, such as the significant SNPs of ADG on GGA6 (Table 3).

Candidate genes for residual feed intake (RFI) and its component traits
After performed sequence-based GWAS, a lot of candidate genes were annotated that associated with RFI and its component traits (Table 3 and Table S2). For RFI, two independent significance SNPs (rs314690911 and SNP rs317155749) were identified via conditional GWAS. One was a variant upstream of COPS3, and the other is an intron variant of GTSF1. The COPS3 gene encodes the third subunit of the eight-subunit COP9 signalosome complex, which was initially identified as a negative regulator of constitutive photomorphogenesis in Arabidopsis thaliana [23]. Also, COPS3 plays an important role in both the tumorigenesis and progression. Previous research shown that the amplification of COPS3 was strongly associated with large tumor size (P =0.0009) [24]. The other is GTSF1, which encodes a 167-amino acid protein as a member of the Uncharacterized Protein Family 0224 (UPF0224), and play an important role in liver cancer. Compared with GTSF1positive group, the sizes and weights of the tumors of liver cancer in the GTSF1-negative group were decreased significantly (P<0.05) [25]. Moreover, RASD1 was the only one overlapped gene between GWAS results and DEGs between high-and low-RFI groups. RASD1 is a highly conserved member of the Ras family of monomeric G proteins that was first identified as a dexamethasone (DEX) inducible gene in AtT-20 mouse pituitary tumor cells [26]. Vaidyanathan et al. indicate that RASD1 often promotes cell growth [27].
Moreover, the current literature demonstrates that Rasd1 expression is induced by a wide range of physiological stimuli and has many biological effects including the regulation of circadian timekeeping, anxiety related behavior, adipocyte differentiation, and hormone release [28]. For ADG, all significant SNPs were located in intergenic region, and IPMK was the nearest gene (Table 3). IPMK is a member of the IPK-superfamily of kinases, which plays an important role at the nexus of signaling, metabolic and regulatory pathways [29].
For example, IPMK involved in hypothalamic control of food intake via AMP-activated protein kinase (AMPK) signaling pathways [30]. For ADFI, the most significant SNPs located at 1147989bp position of GGA25, which was an intron variant in COPA. COPA encodes the α-COP subunit of the coat protein I (COPI) seven subunit complex that is involved with intracellular coated vesicle transport [31]. COPA mutations have recently been revealed to cause autoimmune interstitial lung, joint and kidney disease (COPA syndrome). Moreover, the COP9 signalosome is a subunit of a highly conserved complex of COPS3. Therefore, we guess the interaction of COPS3 and COP9 would impact the feed intake and further on RFI.

Combined GWAS and transcriptomic analysis to identify candidate genes
Nowadays, it is an efficient method that combined GWAS and transcriptomic analysis to identify the causal mutations of complex traits in livestock. In cattle, previous study integrated RNA-Seq data and sequence-based GWAS data to explore the genetic mechanisms of mastitis resistance and milk production [32]. In swine, combining the GWAS and gene expression profile data, two genes (UBA domain containing 1 gene and Epsin 1 gene) were identified significantly associated with Streptococcus Suis serotype 2

Ethics statement
Animal care and experiments were conducted according to the Regulations for the Administration of Affairs Concerning Experimental Animals (Ministry of Science and Technology, China) and approved by the Animal Care and Use Committee of the South China Agricultural University, Guangzhou, China (approval number: SCAU#2014-10).

Population and phenotyping
A chicken population derived from a yellow-feather dwarf broiler breed that maintained for 25 generations by Wens Nanfang Poultry Breeding Co. Ltd. (Xinxing, P.R. China) was used in this study. This population includes 1,600 birds (800 males, 800 females) and was the 3rd batch of the 25th generation of this chicken population. These birds came from a mixture of full sib and half sib families from mating 30 males and 360 females of the 24th generation. After hatching, all birds were maintained in a closed building under controlled environmental conditions and provided with a standard diet until they were 4 wk of age.
The chickens were then randomly assigned to six pens by gender (three pens for males and three pens for females) for growth performance testing from 5 to 13 wk of age. They received food and water ad libitum during all stages. Finally, remaining 1,338 birds were slaughtered at 91 day of age for carcass trait recording. Average daily gain (ADG) and average daily feed intake (ADFI) per individual were calculated for the period from 45 to 84 day. The and residual feed intake (RFI) were calculated by follow formula RFI = ADFI -(b0 + b1 × MMBW+ b2 × ADG), where b0 was the intercept, MMBW is mid-test body weight (MBW raised to the power of 0.75), and the MBW was the predicted body weight on day 21 of the test. b1 and b2 are the partial regression coefficients for MMBW and ADG, respectively. Descriptive statistics of analyzed traits could be found (Table 1)

Genotyping, genotype imputation and quality control
After traits recorded systematically, a total of 644 male birds were randomly selected for genotyping. These birds were 15 male parents and 629 male offspring. Of these 644 birds, 450 birds were genotyped with the 600 K Affymetrix® Axiom® HD genotyping array [35], and remaining 194 birds were genotyped with the Affy 55K array [36]. The 600 K SNP chip contained 580,961 SNPs probes across 28 autosomes, two linkage groups (LGE64 and LGE22C19W28_E50C23), and two sex chromosomes. The 55 K SNP chip contained 52,184 SNPs probes across 28 autosomes and a sex chromosome (chrZ). After converting genome coordinates to a chicken reference genome (galGal5), 28 autosomes and a sex chromosome (chrZ) were extracted for further analyses. In the process of quality control of genotype, SNPs that minor allele frequency (MAF) smaller than 0.5%, genotyping call rate smaller than 97%, and Hardy-Weinberg equilibrium test P-value smaller than 1×10 -6 were removed. Finally, 547,020 and 51,984 SNPs were remaining for 600 K and 55K chip data, respectively. In addition, a total of 23,213 SNPs was shared between 600 K and 55 K SNP chip.
Genotype imputation was performed with two-step approach from 55 to 600 K, and then imputed to WGS. Before Genotype imputation, pre-phasing was executed in Beagle 4.1 with default parameter [37]. Firstly, using 450 birds with 600 K chip data as a reference panel, these 194 birds were imputed from 55 K to 600 K chip data using Beagle 4.0 with pedigree. And then merge 600 K chip data of these 194 birds and 450 birds using VCFtools. Secondly, all of 644 birds with 600 K chip data were imputed to WGS data using a combined reference panel Beagle 4.1 with default parameter. The combined reference panels included 24 key individuals from the yellow-feather dwarf broiler population and 311 birds with WGS data from diverse chicken breeds. These 24 key individuals were selected by maximizing the expected genetic relationships [38]. These 311 birds were downloaded from 10 BioProjects in ENA or NCBI. The combined reference panels contained 36,840,795 SNPs probes across 28 autosomes and a sex chromosome (chrZ). More detail information could be found in our previous study [22].
In addition, individuals would be excluded who existed Mendelian errors. Finally, the remaining 626 individuals and 11,173,020 SNPs were used for further analysis.

Genetic parameter estimations
The genomic heritability was calculated using the average information restricted maximum likelihood (AI-REML) method implemented in the software DMU v6.0 (Madsen and Jensen, 2013). The statistical model was: where y is a vector of phenotypic values of all individuals; b is the vector of fixed effects including batch effect; was the vector of the animal additive genetic effect, and ; e is the residual term, and ; and X and Z are incidence matrices relating the fixed effects and the additive genetic values to the phenotypic records; is the genomic relationship matrix calculated using 600 K chip data as follows [39]: where is a matrix of centered genotypes, m is the number of markers, and is the minor allele frequency of SNP .

Genome-wide association analyses using imputed WGS data
Before performed GWAS, the population structure of this chicken population was calculated by PLINK. And found a slight population stratification ( Figure S1), so that we added top five principal components as covariates into the GWAS model to adjust population structure. The univariate tests of association were performed using a mixed model approach implemented in the GEMMA v0.98.1 software [68]. All sequence variants after quality control were tested for associations. The model was: where y is a vector of phenotypic values of all individuals; X and Z are incidence matrices relating the fixed effects and the additive genetic values to the phenotypic records; b is the vector of fixed effects including batch effect and top five PCs; g is a vector of the genomic breeding values of all individuals; a is the additive effect of the candidate variants to be tested for association; is a vector of the variants' genotype indicator variable coded as 0, 1, or 2; and e is the residual term, . Genomic breeding values were assumed to be distributed as , where is the standardized relatedness matrix calculated by GEMMA using 600 K chip data. The Wald test was applied to test the alternative hypotheses of each SNP in the univariate models. And the variance contribution to additive genomic variance by a SNP was calculated as follows: , where is the additive genomic variance explained by a SNP, p is the allele frequency, and β is the SNP effect estimated from GWAS results.
The Manhattan plot and quantile-quantile plot (QQ plot) were generated by the qqman package (Turner 2014) in R. The threshold of genome-wide significant P-values was adjusted based on the effective number of independent tests for Bonferroni method. The imputed WGS data was pruned to 1,082,126 independent SNPs using PLINK with the command (--indep-pairwise 25 5 0.2). The effective number of independent tests were set to 200,629, estimated by sampleM. Therefore, the genome-wide suggestive and significant P-values were 4.98 × 10 − 6 (1.00/200,629) and 2.49× 10 − 7 (0.05/200,629), respectively.
For evaluating the extent of false positive signals of GWAS results, a genomic inflation factor (λ) was calculated as the median of the resulting chi-squared test statistics divided by the expected median of the chi-squared distribution with one degree of freedom (i.e., 0.454). Haploview software (Barrett et al. 2005) was used to analyze the linkage disequilibrium around the significant SNPs. For identifying the independent signals precisely, the most significant SNP were added as covariates into the univariate models in step-wise conditional analyses. In addition, the gene information file of chicken was downloaded from Ensembl gene build 94, and candidate genes were annotated using the software SnpEff version 4.3t [40].

Transcriptomic analysis identifies differentially expressed genes (DEGs) associated with RFI
Raw reads of four samples (sample45561 and sample46307 with high RFI, and sample45012 and sample46732 with low RFI) were downloaded from our previous study [15]. And raw reads were processed to clean reads by filtering low quality reads and adaptor dimers. Clean reads were mapped to the chicken reference genome (galGal5) using HISAT2 v2.0.5 with the default parameter. Then, the alignments were assembled into full and partial transcripts using StringTie, and quantify the transcripts for each sample using the GAL5. Finally, the differential gene expression analysis was made with Ballgown in R environment [41]. In this study, differentially expressed transcripts or genes were identified based on an adjusted p-value less than 0.05 (false discovery rate, FDR of 5 %) and the absolute value of log2-transformed of fold change (FC) more than or equal to one. Function and pathway enrichment were performed by the R language package (clusterProfiler). Using Benjamini-Hochberg method, the P-values of KEGG pathway and GO terms were adjusted for multiple testing [42]. And an adjusted P-value less than 0.05 was set as significant P-values. In addition, the genome annotation information file was downloaded from Ensembl gene build 94.

Validation of candidate genes based on differentially expressed genes (DEGs)
The identification of candidate genes of lead SNPs from GWAS results was performed basing on their corresponding genomic positions. The candidate gene regions were defined as extended 50 kb flanking regions in both upstream and downstream of lead SNP position. If there are no genes in candidate gene regions, selected the nearest genes in both upstream and downstream of lead SNP position as the candidate genes. Compared the overlap genes or regions between the candidate genes from sequence-based GWAS and DEGs, to identify the causal genes or QTLs.

Significant SNPs compared with reported QTLs
To compare results from sequence-based GWAS with reported QTLs, significant and suggestively significant SNPs were selected to compare with the QTLs. These QTLs, all of which affect ADG, ADFI and RFI, were selected from the Animal QTLdb [43], respectively.
QTLs closest to the significant SNPs were extracted.  Manhattan plots and Q-Q plots of imputed sequence-based genome-wide association study for ADG, ADFI, and RFI. The Manhattan plots indicate −log10 (observed P-values) for markers (y-axis) against their corresponding position on each chromosome (x-axis), while the Q-Q plots show the expected -log10(Pvalues) vs. the observed -log10(P-values). The horizontal blue and red lines represent the genome-wide significant threshold (4.98 × 10−6) and genome-wide suggestive significant threshold (2.49 × 10−7), respectively. Lambda represents genomic inflation factor.

Figure 2
Regional association plot of the lead SNP associated with RFI at GGA14 and GGA27. The regional association plot indicates −log10 (observed P-values) for markers (y-axis) against their corresponding position on each chromosome (xaxis). The horizontal blue and red lines represent the genome-wide significant threshold (4.98 × 10−6) and genome-wide suggestive significant threshold (2.49 × 10−7), respectively. The lead SNPs are denoted by large black circle. SNPs are represented by colored circle according to the degree LD between the lead SNP.
These green lines represent these genes located on this chromosome.

Figure 4
Differentially expressed genes (DEGs) between high-and low-RFI groups in chicken. The volcano plot indicates −log10 (observed P-values) for genes (y-axis) against their corresponding log2(|fold change|) of echo gene (x-axis). The horizontal red dotted line represents the significant threshold (0.05). The red, blue, and gray points represent up-regulated, down-regulated, and non-regulated genes in high-RFI groups, respectively.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to download.