Comparative Analysis of Super-enhancers Across Mammals Reveals Their High Function Conservation


Background: Super-enhancers (SEs) are key positive regulatory elements that define cell/tissue identity in mammals, yet the similarities and differences in their sequence and function across mammals have been poorly studied. To compare the sequence and function of mammalian SEs, we apply H3K27ac ChIP-seq data to six cell types/tissues across human, pig, and mouse, which represent different lineages of mammals in the evolutionary tree.
Results: Overall, a median of 848 human SEs, 888 pig SEs and 503 mouse SEs are identified across cells/tissues. These SEs are largely distributed in promoter regions for human (91.9% in median) and mouse (63.4% in median), but mostly in distal intergenic regions for pig (66.1% in median). An extremely high frequency of unique orthologous SEs (91.6%~92.1%) is detected for the same cell/tissue across species. Consistently, their overlapping rates for the same cell/tissue across species are very low (0.1%~0.5%). The SE-associated orthologous genes also show a high species-unique frequency (63.3%~83.9%) and low overlapping rates (0.8%~1.3%) in inter-species comparisons. However, functional comparison of orthologous SEs across species shows similar biological processes related to cell/tissue identity among the top 15 significantly enriched terms for the same cell/tissue. Meanwhile, common core transcription factors that determine cell/tissue identity are identified for the same cell/tissue across mammals.
Conclusions: This study highlights the differences in the genomic distribution of SEs across mammals. It reveals low overlap of orthologous sequences but high functional conservation of SEs across mammals. It should improve our understanding of the regulatory function of cis-regulatory elements in mammals.


Background
Identification of transcription factors is central to dissecting cell/tissue-specific transcriptional programs. Enhancers are regulatory elements that increase the expression of their target genes by binding specific transcription factors and cofactors [1][2][3]. Advances in DNA sequencing technology enable genome-wide identification of enhancers, for example by ChIP-seq [4][5][6]. ChIP-seq studies have frequently focused on factors or histone marks associated with enhancer activity, such as histone H3 lysine 27 acetylation (H3K27ac) [6]. Using ChIP-seq-based methods, researchers have identified groups of putative enhancers in close genomic proximity, termed "super-enhancers" [7,8]. Super-enhancers (SEs) differ from typical enhancers in their genomic size and in the signal of transcription factors or histone marks [7,8].
SEs are spatially closely associated with many protein-coding genes that play prominent roles in cell identity [7,8]. For example, SEs in mESCs are associated with pluripotency genes, including Nanog, Sox2 and Oct4 [7]; they are also enriched for motifs implicated in ESC biology, including Klf4, Esrrb, and Prdm14 [7]. Super-enhancers can be defined in any cell type. For example, H3K27ac data from 86 different human tissues have been used for SE identification [8], and databases such as SEA [9] and dbSUPER [10] are used for SE annotation in mouse and human cells or tissues. Thus, the SE concept is useful for identifying key transcription factors that control cell-specific expression programs in various cells across species. However, up to now, few studies have identified SEs in mammals other than human and mouse.
Although enhancer sequences evolve rapidly and are rarely conserved across mammals [11], they are reported to have overlapping functions in phylogenetically distant species [12][13][14]. Those studies have been invaluable for understanding enhancer evolution and for dissecting the regulatory function of mammalian regulatory elements. With complete genome sequences finished for various mammalian species, many projects, such as the Encyclopedia of DNA Elements (ENCODE), the Roadmap Epigenomics Project, and the Functional Annotation of Animal Genomes (FAANG) project, have been initiated for the functional annotation of human, mouse and domesticated animal genomes [15][16][17][18]. These annotations provide abundant resources for comparing enhancers or super-enhancers across mammals.
In this study, using H3K27ac ChIP-seq data, we identify enhancers in six cells/tissues, including induced pluripotent stem cells (ips)/embryonic stem cells (ESCs), liver, colon, stomach, small intestine and ileum, across three mammals: human, pig and mouse. These species represent different lineages of mammals in the evolutionary tree [11]. We then determine SEs in these cells/tissues and compare them at the inter- and intra-specific levels. Comparative analysis of SEs can identify common key transcription factors controlling cellular identity across mammals [7,8].

Identification of super-enhancers
To analyze the characteristics of mammalian SEs, we identify enhancers in six cells/tissues, including ESCs/ips, liver, stomach, colon, ileum and small intestine, across human, mouse and pig by using the H3K27ac histone marker (Figure 1). When pooling all cells/tissues in each species together, we identify a median of 848 SEs across five human cells/tissues (without ileum), a median of 888 SEs across five pig cells/tissues (without small intestine), and a median of 503 SEs across six mouse cells/tissues (Figure 2a). Generally, hundreds of SEs and thousands of typical enhancers (TEs) are identified for individual cells/tissues across mammals (Figure 2f, Figure S1-S4). For example, in liver, we identify 912 SEs and 17852 TEs for human, 1261 SEs and 17220 TEs for pig, and 503 SEs and 11472 TEs for mouse (Figure 2f).
In addition, the average H3K27ac signal of SEs is higher than that of TEs. When pooling all cells/tissues in each species together, we detect a median of 13013 rpm/bp for human SEs, 8596 rpm/bp for pig SEs and 5492 rpm/bp for mouse SEs, compared with a median of 1117 rpm/bp for human TEs, 1124 rpm/bp for pig TEs, and 570 rpm/bp for mouse TEs (Figure 2b). For individual cells/tissues across mammals, saturation curves of H3K27ac density also show higher signal at SEs than at TEs (Figure 2d, Figure S1-S4). In addition, the average H3K27ac signal density around SE centers is higher than that around TE centers (Figure 2f, Figure S1-S4).
Furthermore, SEs span larger genomic regions than TEs. When pooling all cells/tissues in each species together, we detect a median size of 33289 bp for human SEs, 27558 bp for pig SEs, and 19523 bp for mouse SEs, compared with a median of 983 bp for human TEs, 2332 bp for pig TEs, and 1117 bp for mouse TEs (Figure 2c). For individual cells/tissues across mammals, the median lengths of SEs are also greater than those of TEs. For example, in liver, we detect a median size of 19090 bp for human SEs, 15654 bp for pig SEs and 8532 bp for mouse SEs, compared with a median size of 1285 bp for human TEs, 1490 bp for pig TEs, and 953 bp for mouse TEs (Figure 2f). SE identification results for the remaining cells/tissues across species are shown in Figure S1-S4; they show SE and TE characteristics similar to those of liver.
Generally, in each cell/tissue of the three species, SEs differ from TEs in their larger genomic size, higher H3K27ac signal and lower number, which is consistent with the characterization of SEs in previous studies [7,8].

Genomic distribution of super-enhancers
To reveal the genomic features of SEs across cells/tissues of the three mammals, we plot the distribution of their genomic annotations. A majority of SEs are located in promoter regions for human and mouse, while a majority of pig SEs are located in distal intergenic regions. When pooling all cells/tissues within each species together, we detect a median of 91.9% of human SEs, 63.4% of mouse SEs, and 29.8% of pig SEs located in promoter regions (± 3 kb of the TSS), while a median of 1.3% of human SEs, 11.8% of mouse SEs, and 66.1% of pig SEs overlap distal intergenic regions (Figure 3a). For TEs, we detect a median of 33.6% for human, 25% for mouse, and 3.1% for pig located in promoter regions, while 20.1% for human, 31.2% for mouse, and 91.1% for pig overlap distal intergenic regions (Figure 3a). Thus, for each species, a higher percentage of SEs than TEs are located in promoters, while a higher percentage of TEs than SEs overlap distal intergenic regions (Figure 3a). Furthermore, we plot the density of SEs against their distance to TSSs for individual cells/tissues across species. Generally, a large proportion of pig SEs are located far from TSSs, at 20~30 kb, while most human and mouse SEs are located within 10 kb of TSSs (Figure 3b, S5-S8). In addition, the distribution of genomic annotations for individual cells/tissues shows that pig SEs are mostly distributed in distal intergenic regions, in contrast to human and mouse SEs. For example, in liver, we detect 91.9% of human SEs, 31.6% of mouse SEs, and 31.2% of pig SEs located in promoters, while 1.1% of human SEs, 17.5% of mouse SEs, and 64.7% of pig SEs are located in distal intergenic regions. SEs in the remaining cells/tissues show a similar pattern (Figure 3c and S4-S8), except for mouse colon SEs.
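The promoter criterion used above (± 3 kb of a TSS) can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions and not the annotation logic of the tool used in this study: regions and TSSs are assumed to lie on a single chromosome, and the function names `overlaps_promoter` and `promoter_fraction` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on a single chromosome (assumption)


def overlaps_promoter(region: Interval, tss_sites: List[int], flank: int = 3000) -> bool:
    """Return True if the region overlaps the window [tss - flank, tss + flank] of any TSS."""
    start, end = region
    return any(start <= tss + flank and end >= tss - flank for tss in tss_sites)


def promoter_fraction(regions: List[Interval], tss_sites: List[int], flank: int = 3000) -> float:
    """Fraction of regions that overlap at least one promoter window."""
    if not regions:
        return 0.0
    hits = sum(overlaps_promoter(region, tss_sites, flank) for region in regions)
    return hits / len(regions)


# Hypothetical toy data: four enhancer regions and two TSS positions.
regions = [(1000, 5000), (20000, 30000), (50000, 51000), (56000, 58000)]
tss_sites = [2000, 60000]
fraction = promoter_fraction(regions, tss_sites)
```

A real annotation would also handle strand, multiple chromosomes and the priority among feature classes (promoter, exon, intron, distal intergenic), which this sketch leaves out.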
Super-enhancers specificity at different stages of tissues within species
To reveal the specificity of SEs at different stages of tissues within species, we investigate their intersection in three human stomach stages, three human colon stages, three mouse stomach stages and three mouse small intestine stages. When pooling all these individuals within species together, we detect a median of 50.7% of SEs and 35.7% of TEs unique for human, and a median of 50.4% of SEs and 43.7% of TEs unique for mouse (Figure 4a, Figure S12). To better compare the overlap across different tissues and species, we also analyze the common overlapping rates of SEs and TEs for different stages within species. Generally, the overlap of common SEs is lower than that of TEs. For example, across three human stomach stages, 6.4% of SEs and 11.76% of TEs are shared (Figure 4b). This pattern is also found in three mouse stages (Figure 2c), three human colon stages and three mouse small intestine stages (Figure S9). To better understand the overlap of SEs in different stages of a given cell/tissue within species, we show several examples of common overlapping SEs in IGV (Figure 4d, Figure S9).
Similar to the overlap analysis of SEs, we also assess the intersection of SE-associated genes. When pooling all individuals of tissues within species together, we detect a median of 19.6% of SE-associated and 8.7% of TE-associated genes unique for human, and a median of 26.6% of SE-associated and 19.7% of TE-associated genes unique for mouse (Figure 4a, Figure 11). For different stages within individual cells/tissues, the overlapping rates of SE-associated genes are lower than those of TE-associated genes (Figure 4c, S9). For example, for three human stomach stages, 17% of SE-associated genes and 23.7% of TE-associated genes are shared (Figure 4c). Similar patterns of SE- and TE-associated genes are detected for three human colon stages, three mouse stomach stages and three mouse small intestine stages (Figure 4c and Figure S9).

Super-enhancers specificity across different cells/tissues within species
Next, to reveal the specificity of super-enhancers across different cells/tissues within species, we estimate the overlap of SEs for three selected tissues, liver, stomach and colon, in the three species. When pooling all tissues within species together, we detect a median of 73.5% of SEs and 64% of TEs unique for human, 79.5% of SEs and 45.3% of TEs unique for pig, and 88.4% of SEs and 30.8% of TEs unique for mouse (Figure 5a, Figure S13). In addition, we determine the common overlapping rates of SEs and TEs across different cells/tissues within species. Generally, the overlap of common SEs is lower than that of TEs. For example, in liver, 1.6% of SEs and 3.5% of TEs are shared for human, 1.8% of SEs and 33.3% of TEs for pig, and 0.2% of SEs and 1.4% of TEs for mouse (Figure 5b,c,d). Examples of overlap for different cells/tissues of human, pig and mouse are given in Figure 5e.
Similarly, we investigate the intersection of SE-associated genes across different cells/tissues within species. We detect a median of 43.7% of SE-associated genes and 21.4% of TE-associated genes unique for human, 32.8% of SE-associated genes and 3.7% of TE-associated genes unique for pig, and 62.1% of SE-associated genes and 30.8% of TE-associated genes unique for mouse (Figure 5a, Figure 13). For the overlapping rates, we detect 16.8% of SE-associated genes and 16.0% of TE-associated genes shared for human, 10.8% of SE-associated genes and 28.4% of TE-associated genes shared for pig, and 2.9% of SE-associated genes and 5.2% of TE-associated genes shared for mouse (Figure 5b,c,d). Individual examples are given in Figure 5e.

Super-enhancers specificity across species
Furthermore, to reveal the specificity of super-enhancers across species, we investigate the overlap of orthologous SEs for the same cells/tissues across species. The selected cells/tissues include liver, stomach, colon and ips across the three species. When pooling the cells/tissues across species, we detect a median of 91.6% of orthologous SEs and 85.9% of orthologous TEs unique for human, 93.2% of orthologous SEs and 87.2% of orthologous TEs unique for pig, and 92.1% of orthologous SEs and 81.3% of orthologous TEs unique for mouse (Figure 6b). The overlap of orthologous SEs across species is low. For example, 0.1% of orthologous SEs and 1.3% of orthologous TEs are shared for liver across the three species. A similar pattern is found in the other cells/tissues (Figure S10-S13). Examples of overlapping orthologous SEs are given in Figure 6d and S10-S13.
The overlap analysis is also performed for orthologous SE-associated genes across species. When pooling all cells/tissues together, a median of 79.1% of SE-associated and 64.8% of TE-associated orthologous genes are unique for human, 83.9% of SE-associated and 49.8% of TE-associated genes are unique for pig, and 63.3% of SE-associated and 34.4% of TE-associated genes are unique for mouse (Figure 6a). For liver, 1.2% of SE-associated and 3.5% of TE-associated orthologous genes are shared across species (Figure 6b). Similar patterns are found in the other cells/tissues and are given in Figure S10-S13.

Function comparison across species
To reveal the function of SEs, we perform GO analysis for six cells/tissues across the three species. First, for liver, we show five selected terms from the top 15 enriched terms across species (Figure 7). Among these, terms such as response to hormone/peptide hormone are shared among the three species, terms such as response to insulin and actin filament organization are shared between human and mouse, and terms such as response to oxidative stress and oxoacid metabolic process are shared between human and pig. The shared biological terms are mostly consistent with liver function. Previous studies have shown that the liver plays a central role in glucose and lipid metabolism, oxygen-rich blood supply and oxidative metabolism [19,20]. In addition, scRNA-seq of human liver cells has revealed remarkably conserved features of liver zonation between mouse and human, i.e. the spatial separation of the immense spectrum of metabolic pathways along the liver sinusoids [21,22]. Furthermore, we use CRC mapper to identify the core transcription factors in each species (Figure 8). For liver, six common core transcription factors are detected across the three species: SREBF1, FOXO3, BCL6, NR5A2, IRF1, and KLF15 (Figure 8). SREBF1 encodes a transcription factor that binds the sterol regulatory element-1 (SRE1), a decamer flanking the low density lipoprotein receptor gene and some genes involved in sterol biosynthesis. Krüppel-like factor 15 (KLF15) is a transcription factor involved in various biological processes, including cellular proliferation, differentiation and death.
Second, we show five selected terms from the top 15 enriched terms of ips/ESCs across species (Figure 7). With the same parameters as for mouse and human SEs, no significantly enriched terms are found for pig ips SEs. Thus, we compare SE function between human and mouse. Terms such as stem cell population maintenance, maintenance of cell number, and regulation of cell fate specification/cell fate commitment are shared between the two species (Figure 7). These biological processes are related to the identity of ips/ESCs, which are pluripotent and capable of self-renewal [7,8]. Furthermore, one common transcription factor, SOX2, is shared across the three species for ips (Figure 8). SOX2 is a key transcription factor that is essential for maintaining the pluripotent embryonic stem cell phenotype [7,8].
Third, we compare SE function in the tissues of the digestive system: stomach, colon, small intestine, and ileum. For stomach, terms such as muscle cell differentiation are shared between human and mouse, while the remaining terms are unique to each species. For example, terms such as cellular response to peptide hormone stimulus, cellular response to peptide, and cellular response to insulin stimulus are detected in human. The unique terms in mouse include striated muscle tissue development, muscle tissue development and epithelial cell proliferation. The unique terms in pig include response to oxygen-containing compound, positive regulation of immune system process, and cellular protein metabolic process. These different terms are all related to stomach function. A previous study has indicated that the stomach is a muscular sac that provides a conducive environment for breaking down food, chemically modifying it, and sending it to the next stage of digestion [23]. For stomach, eight common core transcription factors are detected: FOS, SREBF1, FOSL2, TGIF1, IRF1, NR2F2, JUN, and TEAD1. FOS proteins have been implicated as regulators of cell proliferation, differentiation, and transformation. In some cases, expression of the FOS gene has also been associated with apoptotic cell death. The protein encoded by TGIF1 is a member of the three-amino acid loop extension (TALE) superclass of atypical homeodomains.
The other digestive tissues, colon, small intestine and ileum, show a pattern similar to that of stomach. For example, for colon, terms such as response to steroid hormone and myeloid cell differentiation are shared between human and mouse. Terms such as actin filament organization and actin filament bundle organization are also shared between human and mouse. Unshared terms include response to oxygen-containing compound, muscle adaptation, and cell junction assembly. For small intestine, actin filament organization, actin filament bundle organization, actin filament bundle assembly, and cell junction assembly are shared between human and mouse. For mouse ileum, enriched terms include regulation of epithelial cell differentiation, cellular response to organic cyclic compound, and negative regulation of focal adhesion assembly. For pig ileum, enriched terms include response to oxygen-containing compound, response to oxidative stress, response to hormone, and positive regulation of immune system process. These tissues are complex mixtures of cell types (Figure S14). For the key transcription factors, we show colon as an example: nine transcription factors are identified across the three species, including FOS, SREBF1, FOSL2, TGIF1, IRF1, JUN, TEAD1, HES2, and NR2F2 (Figure 8). NR2F2 is a ligand-activated transcription factor that is activated by high concentrations of 9-cis-retinoic acid and all-trans-retinoic acid, but not by dexamethasone, cortisol or progesterone (in vitro).

Discussion
Genomic distribution of SEs differs across species
Our comparative study shows that the genomic distribution of SEs differs between pig and human/mouse. The majority of pig SEs and TEs are distributed in intergenic regions, while human and mouse SEs and TEs largely overlap promoter regions. The SE genomic distribution for human and mouse is consistent with previous studies showing that SEs mostly overlap their target genes [7,24]. By comparing zebrafish, human and mouse SEs, Perez-Rico et al. (2017) suggested that overlap of SEs with intergenic regions is a distinct feature of zebrafish. However, we find that pig SEs also largely overlap intergenic regions, indicating that not all mammalian SEs predominantly occupy promoter regions. We cannot fully exclude the possibility that these differences arise because the gene annotations of the human and mouse genomes are more complete than that of the pig genome. One piece of evidence is that the human and mouse genomes each contain more than 20000 genes, while the pig genome contains about 4000 genes [25].

Low orthologous SE overlapping and high function conservation
In contrast to most studies, which identify SEs across different stages or tissues within a species, we identify SEs in different species and compare them at both the intra- and inter-species levels. Consistent with findings in other human cells and tissues [8,26], these SEs are more cell/tissue-specific than TEs at the intra-specific level. Similarly, our study indicates that orthologous SEs are more species-specific than orthologous TEs at the inter-species level. Consistently, our comparison of orthologous SEs across species shows that their overlap is low for the same cells/tissues. This finding is consistent with reports that (1) SE conservation is low in mammals [24], and (2) only 1% of liver enhancers are functionally conserved (i.e. orthologous enhancer overlapping) across 20 mammals [11]. Furthermore, the low overlap of orthologous SEs is concordant with the rapid evolution of enhancers [11]. Despite the extremely low overlap of orthologous SEs and SE-associated orthologous genes, high functional conservation is found across species. Indeed, our study detects many shared biological terms and common core transcription factors for cells/tissues across species, which are closely related to cell/tissue identity. This result is supported by the high functional conservation found in a comparison of liver enhancers between pig, human, and cattle [27]. It is also consistent with reports of enhancer regions with overlapping functions in phylogenetically distant species [12][13][14]. Conversely, previous studies have indicated that sequence conservation alone does not necessarily predict functional conservation, as regions with high sequence conservation can drive different patterns of expression in reporter assays [28].

Conclusions
Our study highlights the differences in the genomic distribution of SEs among different lineages of mammals. In addition, the study reveals low overlap of orthologous SEs but high functional conservation. The inter-species comparison should help to clarify the role of cis-regulatory elements in transcriptional regulation.

Materials And Methods
Public data resources
ChIP-seq raw data are downloaded from ENCODE (https://www.encodeproject.org/) and EBI (https://www.ebi.ac.uk/). The corresponding IDs for the six cells/tissues are provided in Additional file Table S1. Note that human ileum and pig small intestine data are lacking, as we did not find such resources. For different stages, we have downloaded three human stomach stages (f51, female 51 years old; m34, male 34 years old; f53, female 53 years old), three human colon stages (m54, male 54 years old; m37, male 37 years old; f53, female 53 years old), three mouse stomach stages

Peak calling and super-enhancer identification
H3K27ac ChIP-seq data are used for peak calling by macs2 [31] with parameters: macs2 callpeak -t -c -gp 1e-5. Samtools is used to generate a sorted pileup format of the aligned reads [30]. For replicated samples, peaks that overlap in at least two samples are used for downstream analysis. Normalized bigwig files are produced by deepTools 2.5.1 with the parameter "bamCoverage --normalizeUsing RPKM" [32]. SEs are identified by the ROSE algorithm [7] based on the H3K27ac ChIP-seq signal with the parameters "-t 2500 -s 12500".
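The idea behind this kind of SE calling (stitching nearby peaks and then separating SEs from TEs at the point where the ranked signal curve rises sharply) can be sketched in Python. This is an illustrative approximation, not the ROSE implementation: it omits the TSS exclusion step corresponding to "-t 2500", it assumes each peak already carries a background-subtracted signal value, and the function names `stitch`, `signal_cutoff` and `call_super_enhancers` are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int, float]  # (chrom, start, end, signal)


def stitch(peaks: List[Region], max_gap: int = 12500) -> List[Region]:
    """Merge peaks on the same chromosome separated by at most max_gap bp, summing their signal."""
    stitched: List[Region] = []
    for chrom, start, end, signal in sorted(peaks):
        if stitched and stitched[-1][0] == chrom and start - stitched[-1][2] <= max_gap:
            prev_chrom, prev_start, prev_end, prev_signal = stitched[-1]
            stitched[-1] = (prev_chrom, prev_start, max(prev_end, end), prev_signal + signal)
        else:
            stitched.append((chrom, start, end, signal))
    return stitched


def signal_cutoff(signals: List[float]) -> float:
    """Signal at which the rank-vs-signal curve, with both axes scaled to [0, 1], has slope 1.

    On a convex curve y(x), the tangent point of a slope-1 line minimizes y - x.
    """
    ordered = sorted(signals)
    n = len(ordered)
    if n < 2 or ordered[-1] <= 0:
        return ordered[-1] if ordered else 0.0
    top = ordered[-1]
    best = min(range(n), key=lambda i: ordered[i] / top - i / (n - 1))
    return ordered[best]


def call_super_enhancers(peaks: List[Region], max_gap: int = 12500) -> List[Region]:
    """Stitch peaks, then keep stitched regions whose signal exceeds the cutoff."""
    regions = stitch(peaks, max_gap)
    cutoff = signal_cutoff([region[3] for region in regions])
    return [region for region in regions if region[3] > cutoff]


# Hypothetical toy input: ten well-separated peaks, one with a much higher signal.
toy_signals = [1, 1, 1, 1, 1, 1, 1, 1, 2, 50]
toy_peaks = [("chr1", i * 100000, i * 100000 + 1000, float(s)) for i, s in enumerate(toy_signals)]
toy_ses = call_super_enhancers(toy_peaks)
```

Regions at or below the cutoff would be treated as typical enhancers in this sketch.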

Downstream analysis of TEs and SEs
Annotation of peak locations and calculation of the distance to the nearest promoter are carried out using the R package ChIPseeker [33] with the appropriate genome. We use mergePeaks from homer [34] with the parameter "-d 2000" to calculate the overlap of SEs and TEs and of their associated genes. For the comparison of SEs and TEs across species, we convert mm10 or susScr11 coordinates to hg38 coordinates by using the UCSC LiftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver) with a minimum match of 0.2. For ileum, we convert SEs or TEs in susScr11 to mm10, as no human ileum data are available. At the inter-species level, the common overlapping rate is calculated as the number of orthologous SEs/TEs shared across species divided by the sum over all compared species. At the intra-species level, the overlapping rate is calculated as the number of SEs/TEs shared across stages/tissues within a species divided by the sum over all compared cells/tissues. In inter-species comparisons, the unique SE/TE frequency is calculated as the number of unique orthologous SEs/TEs divided by the total number of SEs/TEs in each species. In intra-species comparisons, the unique SE/TE frequency is calculated as the number of unique SEs/TEs divided by the total number of SEs/TEs in each cell/tissue. GO analysis is performed with the R package clusterProfiler [35]. Core transcription regulatory factors are identified by using CRCmapper [26].
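The two summary statistics defined above can be written down as a short Python sketch. It rests on assumptions that are not stated in the text: each species (or cell/tissue) is represented as a set of merged-region identifiers in which overlapping regions share one identifier, and "the sum of all compared species" is read as the sum of the per-group region counts. The function names `overlap_rate` and `unique_frequency` are hypothetical.

```python
from typing import Dict, Set


def overlap_rate(groups: Dict[str, Set[str]]) -> float:
    """Regions shared by all groups divided by the summed region counts of all groups."""
    if not groups:
        return 0.0
    shared = set.intersection(*groups.values())
    total = sum(len(ids) for ids in groups.values())
    return len(shared) / total if total else 0.0


def unique_frequency(groups: Dict[str, Set[str]]) -> Dict[str, float]:
    """For each group, the fraction of its regions found in no other group."""
    result: Dict[str, float] = {}
    for name, ids in groups.items():
        others = set().union(*(v for k, v in groups.items() if k != name))
        result[name] = len(ids - others) / len(ids) if ids else 0.0
    return result


# Hypothetical toy data: merged-region identifiers per species.
toy_groups = {
    "human": {"r1", "r2", "r3", "r4"},
    "pig": {"r1", "r5"},
    "mouse": {"r1", "r2", "r6", "r7"},
}
rate = overlap_rate(toy_groups)
unique = unique_frequency(toy_groups)
```

In the toy data only `r1` is present in all three species, so the shared count is 1 out of 10 regions in total, while each species keeps half of its regions as unique.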

Declarations
Competing interests
Figure 1 Workflow of this study. Small intestine is collected from human and mouse, while ileum is collected from mouse and pig; both tissues are shown together with colon because their locations are close.