The profile of the parental genomes and gene expression in different tissues, sexes of progenies
The two inbred chicken strains, CG and WL, which exhibit major differences in growth rate, egg production, and behavior, were used to generate purebreed and reciprocal hybrid F1 progenies (Fig. 1). To identify breed-specific variants, we sequenced the genes of four parents of the two reciprocal crosses, recovering on average 100.73 million pair-end reads per sample after quality control. We identified on average 4.74 million single-nucleotide polymorphisms (SNPs) per parental genome, which were used to generate simulated parental genomes. We picked SNPs that were homozygous in each parental bird but different from each other in the same cross (heterozygous in the hybrid progenies), resulting in two heterozygous SNP lists with 1.4 million heterozygous SNPs on average for the two reciprocal crosses, individually, to identify the allele-specific RNA-Seq reads of the offspring in the following steps.
For each hybrid cross, we collected RNA-Seq data from the brain, liver, and muscle tissue of three male and three female F1 progenies 1 day post-hatching. On average, we recovered 29.17 million mappable reads per sample. To eliminate the effect of the sex chromosomes, we removed all Z and W genes from our analysis and focused entirely on autosomal loci. We observed significant differences in gene expression among different tissues, between sexes, and between parents-of-origin (Fig. 2). Tissue was the most significant factor influencing gene expression, sex played a leading role in the brain, strain influenced gene expression of liver the most, while in the muscle, the parent-of-origin seemed the most powerful because samples were divided into two parts based on mother origin. Consequently, we retained all three variables in our subsequent analyses, resulting in 12 treatment groups, comprised of three tissues, two sexes, and two reciprocal crosses in the present study.
An effective pipeline was applied for the allele-specific expression analysis
To identify the parental origin of the mRNA of the offspring, we explored a novel pipeline using the ‘asSeq’ package in R [31]. Briefly, a set of R scripts was available for genotype phasing based on the 1.4 million heterozygous SNPs identified in the preceding step. Approximately 2% of the SNPs mentioned above were located in the exon region. The high number of SNPs increased the chances that an RNA-seq read could overlap with a heterozygous genetic marker to enable its identification as an allele-specific read.
To validate the accuracy of our ASE pipeline, we generated two artificial hybrid F1 libraries. Specifically, we concatenated two male brain RNA-Seq fastq files from cross 1 and cross 4, which had roughly equal read depths. We also concatenated two female liver samples in the same manner. The two simulated hybrid libraries and four original purebred libraries were handled similar to the other hybrid libraries, using the heterozygous SNP lists of both cross 2 and cross 3. We compared the expression ratio of two simulated alleles (CG/WL) to the real expression ratio of two samples (CG/WL) for each gene. A strong correlation between the two measurements was observed (Fig. S1), indicating that our ASE analysis pipeline was robust. Since our pipeline only counted the local reads containing the heterozygous SNPs, we further assessed the expression fold change (CG/WL) correlation between the local reads method and the method of counting total reads using edgeR [32-34]. The correlation was also strong (Fig. S2). These results demonstrated the feasibility of our pipeline.
Genes were classified into different categories based on the type of regulatory divergence
A total of 24,881 genes from Ensembl v87 annotation were analyzed. Approximately a fifth of the genes contained heterozygous SNPs and were expressed in our progeny samples (Table S1). For the genes containing heterozygous SNPs, we observed significant expression differences (p-value <0.05, binomial test corrected for multiple comparisons by q-value method) between the purebred females (cross 1 vs. cross 4), in 14.71% in the brain, 36.45% in the liver, and 38.38% in muscle (consider the heterozygous SNP list of cross 2, for example). In males, 17.64% of the genes in the brain, 41.87% of the genes in the liver, and 37.84% of the genes in muscle were expressed significantly differentially (Table S1).
Expressed genes were classified into different categories based on the type of gene regulatory divergence [7, 35, 36] (Fig. 3A, B, Table 1, Fig. S3-S5). Most genes exhibited conserved or ambiguous expression, as expected, considering the relatively recent divergence time of the two breeds investigated. More than 70%, 40%, and approximately 50% of the genes in the brain, liver, and muscle, respectively, were classified as conserved. Nonetheless, we observed substantial cis- and trans-variation in the hybrid crosses. There was a higher proportion of trans-regulated gene expression variations than cis-regulated gene expression in most tissues and across both sexes, particularly in muscle (Fig. 3C).
Genes regulated by both cis- and trans-regulatory variations were divided into four categories, including “cis+trans (same)”, “cis+trans (opposite)”, “cis×trans”, and “compensatory”. Genes classified as “cis+trans (same)” show cis and trans-variations acting in a similar direction, while genes classified into the other three categories show cis and trans-variations acting in opposite directions, with different expression trends on the two alleles. We observed the latter pattern more frequently, and most genes were classified as “compensatory” (Fig. 3C).
The gene proportions in each regulatory category were similar among different tissues and between different sexes, except for some variation between the muscle and the other two tissues (Fisher’s exact test, Table S2). Unexpectedly, we observed only few loci with consistent cis- or trans-regulatory divergence across different groups (Fig. S6). The stable cis- or trans-regulatory divergence genes seem to play key roles in phenotypic divergence. For example, IGFBP2, TGFBI, PDGFRL, and IGF2R all showed significant expression bias between the two breeds investigated. The genes are associated with chicken growth, which could explain the difference in growth rate between the two breeds (Table S3).
Genes regulated by trans-acting variation exhibit greater sequence conservation
We counted the number of variants located 1 kb upstream of transcription start sites of each gene using the genome data of the four parents. The results showed greater variations upstream of cis-regulatory divergence genes than upstream of trans-acted genes in all samples (Fig. 4A).
The ratio of the number of non-synonymous SNPs to the number of synonymous SNPs (pN/pS) in the coding sequences of each gene was calculated in the present study. The pN/pS values in genes regulated by trans-variants were lower than the pN/pS values of genes regulated by cis-variants in all samples (Fig. 4B, Fig. S7–S8).