Mapping the open chromatin states of 20 pig tissues
To map genome-wide regulatory elements in pig, we performed ATAC-seq and RNA-seq on 40 samples representing 20 tissues, namely, brain, cerebellum, hypothalamus, medulla spinalis, lung, bronchia, parotid gland, pharynx, stomach, liver, SI, kidney, ovary, cervix, heart, thymus, longissimus muscle (LD), semimembranosus muscle (SM), backfat and leaf fat) from two female DLY pigs at 6 months (Fig. 1a), representing two biological replicates per tissue. After quality control, we obtained a total of 2.43G uniquely mapped reads from the 40 samples (Supplementary Table 1). The reads showed enrichment around transcription start sites (TSS) and a periodical fragment size distribution corresponding to the nucleosome-free regions (NFR) (< 100 bp) and mono-, di-, and tri-nucleosomes (~ 200, 400 and 600 bp, respectively) (Additional file 1: Fig. S1b, S1c). Furthermore, we computed quality related statistics FRIP (fraction of reads in peaks), TSS score, library complexity (NRF, PBC1 and PBC2) and cross correlation (NSC and RSC)) (Supplementary Table 1). These statistics suggested that the data generated hereby are of good quality. We identified an average of 77,096 peaks with average size of 448 bp per tissue (Additional file 1: Fig. S1e). Peaks from all samples were merged into 557,273 non-redundant peaks, among which 363,638 (65.25%) overlapped with peaks identified in Zhou et al [40] (Fig. 1d), while the rest of peaks were newly identified owing to more types of tissues were profiled in this study.
To obtain a set of peaks with higher confidence, we only kept peaks that were evidenced in both biological replicates for each sample, and finally retained 267,836 merged peaks for further analysis. Most of the 267,836 peaks located in intronic (54.23%) and intergenic (36.75%) regions, meanwhile, 4.23%, 2.59% and 2.20% peaks were evidenced in CDS, 5’UTR and 3’UTR regions (Fig. 1f). Similar to the results obtained in the previous study [25], we found that accessible chromatin region varies greatly across tissues (range: 9,729 to 94,946), covering 0.56% to 3.21% sequences of the reference genome (Fig. 1c), with a good overlap with those reported in independent studies by tissues (Supplementary Table 4).
a Schematic diagram showing tissues assayed in this study. The ring colors represent different organ systems. Starting with the location of the brain tissue, clockwise around the circle, corresponding to the nervous system (brain, cerebellum, hypothalamus, medulla spinalis), respiratory system (lung, bronchia), digestive system (parotid gland, pharynx, stomach, liver, small intestine (SI)), urinary system (kidney), reproductive system (ovary, cervix), circulatory system (heart), lymphatic system (thymus), locomotor system (longissimus muscle (LD), semimembranosus (SM)), endocrine system (backfat and leaf fat). b Hierarchical clustering of samples using ATAC-seq signal intensity. c Distribution of the peak number and genome coverage in each tissue. d Venn plot showing the overlap of peaks identified in this study with those reported in Zhou et al. 2021. e Pearson correlation heatmaps of 20 tissues by density of ATAC-seq distal peaks (left) and ATAC-seq proximal peaks (right). f Pie chart represents the proportion of peaks located in introns, intergenic, 5’UTR, 3’UTR and CDS regions.
Hierarchical clustering based on the peak intensity clearly separated different tissue types (Fig. 1b), indicating that the peak intensities largely reflect tissue identity. We grouped the 267,836 peaks into 20,488 proximal peaks (within 1Kb of TSS) and 247,248 distal peaks (away from TSS 1Kb). As expected, the distal peaks displayed stronger tissue specificity comparing to proximal peaks (Fig. 1e and Additional file 1: Fig. S1d), agreeing with the reported characteristics of promoters and enhancers [49].
To investigate the effect of regulatory elements on gene expression, RNA-seq was successfully performed on 38 out of the 40 ATAC-Seq samples. The summary statistics of each RNA library is listed in Supplementary Table 2. A total of 17,624 genes were identified. The reproducibility of the 38 RNA-seq data was verified by hierarchical clustering of gene expression (Additional file 1: Fig. S2a).
Tissue specificity of the open chromatin peaks
To explore the key regulatory elements that determine specialized function of tissues, we defined peaks whose average intensity was at least two folds in the target tissue relative to any other tissues as tissue-specific peaks, and identified 16,704, 7,887, 6,742, 5,960, 5,286, 4,278, 3,089, 884, 649, 407, 326, 277, 261, 177, 118, 84, 21, 16, 6 and 3 tissue-specific peaks in cerebellum, kidney, liver, heart, parotid gland, cervix, thymus, leaf fat, stomach, brain, SI, LD, pharynx, SM, ovary, hypothalamus, medulla spinalis, lung, backfat and bronchia, respectively (Fig. 2a, 2b). We exemplified a liver specific peak located close to ARG1 (Fig. 2c and Additional file 1: S2e), a protein coding gene expressed primarily in the liver and involved in the urea cycle [50]. The genes locate closest to the tissue-specific peaks were enriched in tissue-specific pathways, such as the heart and circulatory system development for heart specific peaks (Additional file 1: Fig. S2f), such as heart development and circulatory system development in heart. Given that the numbers of tissue-specific peaks were largely affected by the inclusion of physiologically similar tissues, we further determined peaks that specific to a certain tissue system using a similar rule applied on the individual tissues, and identified 11,114, 434, 310, 28, 5,053 and 50 specific peaks for locomotor system, reproductive system, endocrine system, respiratory system, nervous system and digestive system, respectively, located close to genes with function matching the biology of corresponding tissues (Fig. 2a). For example, genes expressed specifically in nervous system were significantly involved in synaptic membrane (q-value = 1.39E-32), while those expressed specifically in locomotor system were significantly associated with muscle structure development (q-value = 5.68E-28) (Additional file 1: Fig. S2d).
Next, we investigated enrichment of TF binding motifs in the tissue-specific peaks using HOMER [39]. We defined the TF motifs with enrichment strength (-log (p-value)) in one tissue were at least two folds relative to the other tissues as tissue-specific, and identified 205 tissue-specific TF motifs (Fig. 2d). These included motif of ATOH1 which controls primary cilia formation that facilitates SHH-Triggered granule neuron progenitor proliferation [51] was evidenced to be cerebellum specific, and motif of MEF2C which controls cardiac morphogenesis [52] was enriched in heart.
Ubiquitously accessible chromatin peaks
Meanwhile, we also identified 8,734 peaks that consistently active in at least 90% of investigated tissues (Fig. 2e). The genes locate nearest to the conserved peaks were enriched for pathways like cytoskeleton and mitotic cell cycle that have housekeeping functions (Additional file 1: Fig. S2d). The top 5 enriched TF motifs (SP1, NFY, SP5, ELK4 and KLF3) (Supplementary Table 5) participate in chromatin remodeling and transcriptional activation (Fig. 2f), which are indispensable transcription factors in growth and development [53].
Tissue-specific gene expressions
We applied similar approach to investigate tissue specificity of gene expressions, and identified 135, 528, 439, 711, 228, 181, 117, 467, 40, 75, 469, 486, 94, 581, 124, 117, 411, 34, 336 and 1,590 tissue-specific genes were found to be specific in backfat, brain, bronchia, cerebellum, cervix, heart, hypothalamus, kidney, LD, leaf fat, liver, lung, medulla spinalis, ovary, parotid gland, pharynx, SI, SM, stomach and thymus, respectively (Additional file 1: Fig. S2b). For example, MYH7, a gene described to be responsible for hypertrophic cardiomyopathy, was highly expressed in heart than in other tissues (Additional file 1: Fig. S2c). The GO analysis showed these tissue-specific genes were associated with tissue-specific biological processes (Additional file 1: Fig. S2b). For example, genes significantly expressed in brain were related to the function of synaptic signaling, and genes significantly expressed in thymus were related to the function of T cell activation (Additional file 1: Fig. S2b).
Joint profiling of chromatin accessibility and gene expression
To investigate the regulatory elements that link to gene transcription, we examined correlations of gene expressions with intensity of ATAC peaks within 500Kb from the TSS of corresponding genes, as majority of promoter-anchored chromatin interactions were within 500 kb [54, 55]. The analysis was performed on the integrated dataset of 527,718 peaks and 17,634 genes from 71 samples, including 33 samples from public datasets (Supplementary Table 3, Additional file 1: fig. S3). We identified a total of 827,942 significant peak-gene associations (q-value < 0.01) that involves 255,166 peaks and 17,493 genes (Additional file 1: Fig. S4). Majority of the significant correlations (406,896/827,942, 49.15%) are positive only, while we also identified 10.54% negative correlations only. For example, a peak (chr5:10,663,155-10,664,871) has high intensity in liver showed positive correlation with TMPRSS6 (Spearman r2 = 0.77, q-value = 6.43e-12) (Fig. 3f), which encodes a type II transmembrane serine protease exclusively produced by liver [56]. We also exemplified negative peak – gene correlation where a peak (chr1:128,083,325-128,085,034) show strong negative correlation with expression levels of PDIA3 (Spearman r2 = - 0.60, q-value = 1.42e-6) (Fig. 3g). In addition, there are 40.31% peak – gene correlations showed both positive and negative correlations, indicating that the identity of regulatory elements was mutable.
On average, there were 2.87 genes per peak and 35.86 peaks per gene in positive correlations while 2.13 genes per peak and 14.49 peaks per gene in negative correlations (Fig. 3a, 3c). Interestingly, we observed that the strength of positive and negative correlations showed distinct distribution against peak-gene distances (fig. 3b). The positive correlations were enriched around the TSS of genes, while the negative correlations appeared to be underrepresented in TSS regions (fig. 3d). We identified 1,786 tissue-specific genes were associated with 38,574 tissue-specific peaks within 500 kb from the TSS regions (Fig. 3e, 3h and Fig. S4), suggesting important roles of tissue-specific chromatin accessibility in driving the tissue-specific genes expression thereby encoded tissue-specific biological function.
Open chromatin states of transposable elements in different pig tissues
Over 40% of the sequences in the non-coding region of the mammalian genome are TEs whose function were largely unexplored. To understand the function of TEs in pig tissues, we examined the chromatin accessibility of genome wide TEs. We found 26.91% of the 267,836 open chromatin regions were overlapped with at least one TE.
The spatial correlation analysis between TEs and open chromatin calculated by GenometriCorr [44] suggested underrepresentation of ratio of TEs located in the open chromatin regions than random expectation, indicating that TEs are tend to be epigenetically silenced. Despite this, each tissue contains an average of 17.96% peaks that were overlapped with at least one TE, range from 9.49% to 22.33% (Fig. 4a). The SINE, LINE and LTR accounted for approximately 40%, 30% and 23% of the accessible TEs, respectively, across the 20 tissues (Fig. 4b). We found over 50% of these TEs existed in a single tissue were located close to genes involved in tissue specific functions (Fig. 4c, 4d). For example, we observed enrichment of T cell receptor signaling pathway enriched in thymus while muscle system process enriched in LD (Fig. 4d). These results suggested that accessible TEs existed in one tissue exhibit strong tissue-specific regulatory roles, which was consistent with the previous reports [57, 58]. Meanwhile we observed approximate 0.19% of accessible TEs were commonly accessible in all 20 tissues (Fig. 4c). The conserved accessible TEs were more associated with housekeeping functions (e.g., ribosome assembly) (Fig. 4d).
Mounting evidence suggests that specific TEs in pigs, especially PERV, could potentially lead to immunodeficiency and tumorigenesis, making it difficult for xenotransplantation [59, 60]. Upon further inspection, we found a total of 10 unique PERVs overlapped with 13 peaks (Supplementary Table 6). For example, the PERV (sPERV-JX2like-12-23.0M) was overlapped with the peak (chr12:22,990,219-22,991,950) which showed high intensity in liver (Fig. 4e), significantly correlated multiple genes (NEUROD2, ENSSSCG00000017507, CACNB1, PLXDC1, PIP4K2B). Among them, PIP4K2B, which participates in 1-phosphatidylinositol-4-phosphate 5-kinase activity and IP6 [61], the product of 1-phosphatidylinositol-4-phosphate 5-kinase, can facilitates assembly of the immature HIV-1 Gag lattice [62], shows highly correlation (Fig. 4f), suggesting potential role of the PERVs in facilitating virus infection.
Besides this, we also focused on the function of other genes regulated by PERV. For example, GTPBP8 [63], BCKDK [64], IDH3B [65] and IDH1 [66] were related to the function of mitochondrion; CD2BP2 [67]and FUS [68] involved in mRNA splicing; BOC [69] and FZD5 [70] participated in signal transduction pathway. Taken together, these results suggested that PERV which act as regulatory element involved in a variety of biological functions.
Endogenous retroviruses (ERVs) are LTR retrotransposons (class I of TEs), which originated from exogenous retroviruses that infected the germ line throughout evolution [71, 72], which could manifest their function across multiple organs in an animal, thereby could have fundamental functional impact. In this study, we identified a total of 162 TEs that are accessible in all investigated tissues. To trace the origin or evolutionary history of these ubiquitously accessible TEs, we calculated sequence similarity between 124 viral sequences (Supplementary Table 8) and 162 conserved accessible TEs sequences, and constructed phylogenetic trees to reconstruct evolutionary histories. Interestingly, we observed some conserved accessible TEs sequences were grouped with some viruses whose host was pig (Fig. 5), such as PBoV, TTSuVK2a, TTSuVK2b, PADV-A, PoBuV, PPV7 and Po-Circo-Like Virus 21 (Supplementary Table 9). We performed GO and KEGG enrichment analysis of all genes associated with conserved accessible TEs, and found that the biologic process of mitochondrial depolarization was highly enriched (Supplementary Table 10), which indicated that accessible TEs derived from viral sequences have a potential regulatory function in clearance of excess or impaired mitochondrial. Altogether, these evidences suggested that some special viral sequences have been integrated into the host genome and retained as regulatory elements that could play constitutive regulatory functions.
An annotation resource for regulatory mechanisms in complex traits
To demonstrate the utility of such resource in annotating candidate casual variants for complex traits, we examined overlap of 191 significant variants for fatty acid composition identified in a genome sequence based imputation GWAS [48] with the ATAC-Seq peak regions revealed in this study. A total of 22 variants were found to be coincident with the ATAC-Seq peaks (Supplementary Table 7). Among these, a lead SNP (chr2:9,736,686, P = 2.10 × 10-10) for C20:3n-6/C18:2n-6 was located in an ATAC-Seq peak (chr2:9,736,357-9,737,907) that showed high intensity in brain and moderate intensity in muscle (Fig. 6a). This peak showed significant correlation with three genes (FADS1, FADS2 and RAB3IL1) (Supplementary Table 7). Among them, FADS1 (q = 1.56 × 10-04) which exhibited strongest correlation (Fig. 6b), encodes delta-5 desaturase, one of the rate-limiting enzymes in the endogenous synthesis of polyunsaturated fatty acids (PUFAs) [73]. FADS2 which converts linoleate and alpha-linolenate into PUFAs is one of the key limiting enzymes in the lipid metabolic pathway [74] (Fig. 6c). In addition, several other genes related to fatty acid have been identified. Osteoglycin (OGN) is involved in matrix assembly, cellular growth, and migration. Previous study showed that OGN was associated with fat acids composition traits to some extent [75] and regulated lipid differentiation through the Wnt/β-catenin signaling pathway [76]. TEAD3 which participates in the fatty acid, triacylglycerol, and ketone body metabolism pathway from the Reactome Pathways dataset[77] is a member of the transcriptional enhancer factor (TEF) family of transcription factors. Further experiment is needed to validate effect of the variants on the accessibility of the peak, and in turn the expression of relevant genes and fatty acid composition phenotypes.