Data collection
The high-throughput sequencing datasets contained 5 tissues and 6 breeds and were all collected from public datasets (Table S1). The public datasets were retrieved from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA), and the corresponding SRA accession numbers can be found in Table S1. The genome annotation data, eQTL, sQTL, CTCF signals, top 1% predicted effect score of 13 tissues, and gene expression matrices were obtained from PigGTEx and FAANG[9, 22]. The chromatin states of the pig genome can be accessed through the UCSC Genome Browser (http://genome.ucsc.edu/s/zhypan/susScr11_15_state_14_tissues_new)[9]. Additionally, the high-throughput sequencing data generated for four tissues in this study are available in the Gene Expression Omnibus (GEO) under the accession number GSE158430.
Hi-C data processing and TAD calling
The Hi-C reads were processed using the Juicer software (version 1.5)[24]. Briefly, the high-quality Hi-C reads were mapped to the pig reference genome (susScrofa11.1, access date:2021-08-28) using BWA (v0.7.15) [25, 26] with default parameters. Unaligned read pairs and PCR duplicates were filtered out using Juicer, and alignments with low quality (MAPQ < 30) were also discarded.
Subsequently, intra-chromosomal contact matrices at 50 kb resolution were generated independently for each sample using valid read pairs. These matrices were then quantile normalized using the Knight-Ruiz algorithm[27]. Additionally, inter-chromosomal contact matrices at 1 Mb resolution were generated using the KR algorithm and normalized using log-counts per million (CPM), which represents the average abundance across all libraries.
Criteria for filtering samples
To ensure the quality and consistency of the Hi-C datasets used in this study, we implemented several filtering steps. Firstly, we excluded Hi-C samples with sequencing depths below the average sequencing depth of 48.3\(\times\) as higher sequencing depth is known to yield more reliable results for high-resolution Hi-C maps and 3D chromatin structure prediction. The interquartile range (IQR) method was employed to define the upper and lower limits of outliers. Datasets with sequencing depths outside the range of 30 (lower limit) and 208.8 (upper limit) were excluded, based on the criteria of data points above the third quartile (Q3) + 1.5 × IQR or below Q1 − 1.5 × IQR (Fig. S2).
To maintain consistency in comparisons, we removed any unique Hi-C datasets with read lengths different from the majority of the datasets (read length of 150), specifically those with a read length of 100. Trim Galore (v0.6.7) [28] was used to trim sequences with a Phred quality score below 20, setting the threshold at 20. Sequences shorter than 15 nucleotides were discarded, and the first 3 nucleotides at the 5' end of Read 2 were trimmed (--q 20 --paired --max_n 15 --clip_R2 3).
Furthermore, we calculated the percentage of Hi-C contact reads out of all reads. After applying the aforementioned filtering steps, the remaining Hi-C datasets exhibited a balanced number and coverage of TADs.
Down-sampling analysis
To ensure equal representation and comparability within each comparison group, down-sampling was performed on each sample. Initially, All Valid Reads of each sample were using for down-sampling, and then we utilized HiC-Pro (v2.11.4) [29] to convert the matrix files back to the Juicer format for subsequent analysis. Down-sampling was conducted on the Hi-C interaction matrix following the method described by Carty[30].
In summary, the counts of each element in the matrix, representing pairs of genomic loci, were converted into a list of paired-end reads, where the size of the list matched the counts. Through a random subsampling procedure without replacement, reads were selected from this list and reassigned to create a new downsized dataset. This down-sampling process ensured that each sample within the comparison group had an equal number of valid read pairs, facilitating fair and unbiased comparisons.
Functional enrichment analysis of genes
To conduct gene function enrichment analysis, we utilized the biomaRt (v2.52.0)[31] R package. This package facilitated the conversion of swine gene Ensemble IDs to swine Entrez IDs. The obtained Entrez IDs were then inputted into clusterProfiler (v4.2.2)[32], enabling us to retrieve GO[33] and KEGG[34] enrichment annotations.
In order to gain deeper insights into the functional roles of genes within the conserved TADs, we conducted GO and KEGG enrichment analyses. These analyses aimed to identify significant enriched GO terms and KEGG pathways, respectively, providing valuable information about the functional characteristics of genes within the identified TADs.
Identification of conserve and differential TADs.
Our study focused on six comparative groups: muscle tissue and adipose tissue, adipose tissue and ear tissue, liver tissue and adipose tissue, Large White pig and Rongchang pig, muscle tissue and ear tissue, and Bama Xiang pig and Large White pig (Table 4). Differential interactions regions for each group were detected using CHESS (v 0.3.7)[35]. Differential TADs were defined that TADs overlapped with the regions of differential interactions.
Table 3 Definitions and abbreviations of 15 chromatin states
Chromatin states
|
Abbr.
|
Group
|
Strongly active promoters/transcripts TssA comparison
|
TssA
|
promoter
|
Flanking active TSS without ATAC
|
TssAHet
|
promoter
|
Transcribed at gene
|
TxFlnk
|
TSS-proximal
|
Weak transcribed at gene
|
TxFlnkWk
|
TSS-proximal
|
Transcribed region without ATAC
|
TxFlnHet
|
TSS-proximal
|
Strong active enhancer
|
EnhA
|
enhancers
|
Medium enhancer with ATAC
|
EnhAMe
|
enhancers
|
Weak active enhancer
|
EnhAWk
|
enhancers
|
Active enhancer no ATAC (hetero)
|
EnhAHet
|
enhancers
|
Poised enhancer
|
EnhPois
|
enhancers
|
ATAC island
|
ATAC_Is
|
NA
|
Bivalent/poised TSS
|
TssBiv
|
promoter
|
Repressed polycomb
|
Repr
|
repressed
|
Weak repressed polycomb
|
ReprWk
|
repressed
|
Quiescent
|
Qui
|
quiescent
|
Table 4
the details of six comparison groups
|
Group1
|
Group2
|
comparison
|
Large White muscle
|
Large White adipose
|
Large White adipose
|
Large White ear
|
Bamaxiang liver
|
Bamaxiang Adipose
|
Large White muscle
|
Large White ear
|
Large White ear
|
Rongchang ear
|
Bamaxiang liver
|
Large White liver
|
When analyzing differential TADs, we took into account that the TAD regions identified by the Arrowhead algorithm[36] contained overlapping regions between TADs. To accurately calculate the frequency distribution of TADs throughout the genome, we assigned a frequency of 2 to the overlapping regions, while assigning a frequency of 1 to the non-overlapping regions. This approach ensured an accurate representation of TAD occurrences in the genome. TADs with a frequency of 24 (equal to the number of samples) were defined as conserved TADs, while all other TADs within each comparison group, excluding the differential TADs, were categorized as other TADs. We then calculated the proportions of both the differential TADs and conserved TADs between the groups.
Enrichment analysis of TADs and functional annotation data
To evaluate the enrichment fold and odds ratios of functional regulatory elements within TADs and their flanking regions, we conducted a rigorous analysis. Initially, we identified the number of variants located within TAD regions across different pig tissues and breeds using bedtools (v2.30.0)[37]. Subsequently, we employed Fisher's exact test to evaluate the enrichment probability of various functional elements, including eQTL, sQTL, 15 chromatin states, top 1% regulatory variants predicted by a deep learning method in 13 tissues, and LoFs within TAD regions. Here, we will provide a detailed description of the enrichment analysis for eQTLs as an example.
To calculate the enrichment, we implemented the following procedure:
We divided each TAD into 60 equally sized bins. Additionally, we extended each TAD upstream and downstream by the length of the TAD itself to define the TAD flanking region. We further divided the TAD flanking region into 120 equally sized bins. Consequently, both the TAD and its flanking region were uniformly segmented into a total of 180 bins.
Next, we computed the enrichment as the ratio of the number of eVariants (the genes and genes with alternative splicing events had at least one significant variant detected in e/sQTL study) within a specific bin to the number of SNPs within that bin. This value was divided by the ratio of the number of all eVariants in the entire genome to the number of all SNPs in the genome. The calculation was performed as follows: (number of eVariants in bin / number of SNPs in bin) / (number of all eVariants in the genome / number of all SNPs in the genome).
We employed a significance threshold of FDR < 0.01 to identify statistically significant enrichment between eQTL and TADs. This allowed us to determine the presence of meaningful relationships at a high confidence level.
By applying these robust methods, we were able to investigate the enrichment between eQTL and TADs at a fine-grained level within the genome. Similar procedures were followed for assessing the enrichment of other functional regulatory elements. This comprehensive analysis enabled us to uncover the intricate relationships and potential regulatory mechanisms between TADs and these elements.
Identification of differentially expressed mRNAs.
In this study, RNA-seq gene expression matrices were obtained from the PigGTEx[22]. To identify differentially expressed genes (DEGs) in each comparison group, we utilized the EdgeR package (v3.36.0)[38], which is a widely used statistical method based on a negative binomial distribution model. For each comparison group, we applied filtering criteria set as FDR ≤ 0.01 and |log2FC| ≥ 1 to identify differential expression genes, where FC represents the fold change.