Map of the human heart genome regulatory network
Our atlas contains 109 individual CAGE libraries prepared from 31 healthy and failing hearts (Figure 1A). We aligned 1025M reads to hg38 genome assembly total with an average mapping ratio of 97.7 % (Figure 1B). Most of the mapped CAGE tags were in promoter regions (Figure 1C). CAGE signal clustering resulted in 55,204 decomposition peak identifiers (DPIs) and 10,254 bidirectional enhancers (promoter and exon regions were masked) (Figure 1D-E). Heart CAGE DPI clusters associated with 14,519 genes corresponding to 27,557 Gencode or 20,411 Refseq transcripts. Bidirectional enhancers were connected to DPI clusters in the 500kbp range: 1,142 bidirectional enhancers have significant associations with 1,491 DPI related to 469 genes (correlation thresholds r ≥ 0.5 and p-value < 0.05).
Next, we performed classification of CAGE peaks by training TSSClassifier on control 2kb sequences such as eukaryotic promoter database (EPD), reference transcription start site (refTSS), promoter-like sequences (PLS), proximal enhancer-like sequences (pELS), distal enhancer like sequences (dELS) from Encode, and FANTOM5 enhancers. This step resulted in 30,036, 37,965, 31,008, 12,768, 15,062, and 5,616 matched heart CAGE clusters, respectively (Supplementary Figure 1A). Heart CAGE clusters were checked to have RNA-seq signals using Encode and other available sources10,11. This gives an option to filter out CAGE peaks without an RNA-seq signal to identify new markers. In total, 75.8% of DPI clusters were associated with transcription, but in the case of bidirectional enhancers, only 14% had overlap with RNA-seq signal. Other Encode libraries, including DNase-seq, assay for transposase-accessible chromatin sequencing (ATAC-seq), Rampage, and chromatin immunoprecipitation sequencing (ChIP-seq), were used for overlapping and classification (Supplementary Figure 1B-C, Supplementary Table 2-3). This step allowed to filter out CAGE peaks without known ATAC/Rampage/DNase/Chip-seq signals to identify new targets. For example, ChIP-seq based classification confirmed the total of 10,154 active promoters and 4,533 active enhancers combined between DPIs and bidirectional enhancers. There were 29,332 CAGE peaks overlapping with significant PolR2A signal. Most of the DPI CAGE peaks had an “active promoter” label as expected, while ChIP-seq based classifier for bidirectional enhancers showed just a slightly higher number of “active enhancers” compared to “active promoters.” This can be explained by the insufficient sample diversity of ChIP-seq libraries. Most of the TSS (65%) showed the canonical YR initiator motif, 2.4% showed alternative YC, and ~30% had other dinucleotide frequency with the uncertain functional role6 (Supplementary Figure 1D). This “other” motif had enriched G nucleotide which could be related to CAGE library preparation bias or other elements with the uncertain functional role. In addition, we validated our bidirectional enhancers by overlapping them with regions from other databases (Supplementary Figure 1E).
Based on GENCODE annotation, the DPI clusters were primarily found in promoter regions of genes, while bidirectional enhancers were located in intergenic and intronic regions (Supplementary Figure 1F). Usage of the EPD-like cluster classification assigned the location of 98% of DPIs to promoter regions. Initiator motifs for classified DPIs were well represented in promoter-like clusters and look similar to EPD, refTSS, and PLS with YR motif (Supplementary Figure 2). A similar motif was also present in enhancer-like clusters (pELS), but the motif is less evident in dELS-like regions.
In total, we defined 17,668 promoter and 14,920 enhancer regions active in the human heart, while 150 regions exhibited ambiguous features. Aggregation plots and heatmaps for Encode epigenetic libraries overlapping heart CAGE consensus regions are shown in Supplementary Figure 3.
Heart CAGE atlas revealed novel regulatory clusters specific for cardiac tissue
Compared to other studies, heart CAGE peaks are well supported by other omics data, including histone methylation, acetylation, ATAC-seq, DNase-seq, and Chip-seq for Pol2 (Figure 2). Defined CAGE peaks have relatively high GC content and conservation scores, while the total length of consensus regions is the shortest resulting in higher resolution for motif and SNP analysis.
From the overlap with the studies, where heart regulatory regions are available, we identified 3,204 peaks (Supplementary Figure 4). These unique peaks contained 351 novel promoters, 1,318 novel enhancers, 8 ambiguous, 500 not classified consensus regions, and 43 alternative peaks. Direct comparison to the FANTOM5 database, where the CAGE technique is used as the primary tool for promoter and enhancer calling, showed 12,670 new DPIs (putative TSS) and 7,381 bidirectional enhancers. Still, EPD-like cluster classification reduced the number of new predicted promoters to 3,238 (Supplementary Figure 5A). The largest collection of human enhancers and promoters developed up to date (FANTOM5 project) contains only two heart-derived samples in total: one left atrial and ventricular sample, and through higher number of samples and deeper sequencing in this study allowed us to identify new cardiac peaks. Furthermore, the overlap with FANTOM5 peaks drastically improves the specificity of the epigenetic signal and clearly defines the location of transcribed cis-elements (Supplementary Figure 5B).
Applying de novo motif enrichment, we confirmed the presence of the transcription initiation sequences such as TATA-box and initiator element (INR) in the DPI regions (Supplementary Figure 6A). The most representative peak in the bidirectional enhancer region was the transcription factor binding site for the myocyte-specific enhancer factor 2B (MEF2B) motif, a muscle-specific gene activator (Supplementary Figure 6A). Overall, there was an accumulation of TFBS close to TSS in heart promoters and in the center of the bidirectional enhancers, which is expected from promoter and enhancer regions, thus confirming the accurate classification of promoter and enhancer regions (Supplementary Figure 6B).
Besides looking at the CAGE peak overlap with different databases, we also tested how sequencing depth affects comprehensive cluster detection. By increasing the library size, we have observed the sequencing depth reaching a plateau with no additional active genes and transcripts being detected (Supplementary Figure 7A). Our samples were saturated at this sequencing depth level in terms of new genes and transcripts active in different parts of the heart. Interestingly, we observed that the right and left atria had ~2,000 more active genes as compared to the left and right ventricles. Compared to FANTOM5 atrial and ventricular samples, we detected additional ~2,200 genes expressed in the human heart chambers (Supplementary Figure 7A). These genes participate in important signaling and metabolic pathways such as Wnt, mTOR, and autophagy (Supplementary Figure 7B). Libraries appeared saturated at the level of 1-2 million reads. Despite the saturation, deeper sequencing may still uncover the activity of low expressed genes and pseudogenes, including snRNA related to RNA transport and mRNA splicing12.
Differential expression analysis reveals large differences in transcription by cis-regulatory elements between healthy and failing atria and ventricles
Dimensionality reduction showed a clear difference between atria and ventricles and between healthy and failing states (Figure 3A). However, we noticed a trend that atrial expression from failing hearts becomes similar to that of ventricular samples, with some exceptions. Information about these samples is available on the Zenbu reports platform with an interactive principal component analysis plot. Correlation analysis confirms those findings by showing a strong correlation within atria and ventricles, with most of the failing atrial samples showing ventricular pattern of expression (Supplementary Figure 8).
Differential expression analysis between healthy and failing samples resulted in 1,748 and 915 CAGE clusters significantly different within atria and ventricles, respectively (Figure 3B-C). Of those significant clusters, 645 promoters and 223 enhancers were differentially expressed in healthy versus failing atria, while 291 promoters and 105 enhancers were differentially expressed in healthy versus failing ventricles. Healthy atria showed upregulation in genes related to the upkeep of the immune system surveillance genes, such as several chemoattractants for neutrophils (CXCL1, CXCL3, and CXCL8), cytokines (CSF3), and immunoadhesion activators (SELE). Genes involved in glucose metabolism (PFKM) and lipid utilization (LPL) were differentially activated in failing atria. Healthy ventricles also had activation of the innate immune system (FCN3, LCN6), while failing ventricles had defense response activation (SPP1, ITLN1).
Functional analysis of differentially expressed clusters between healthy and failing samples also resulted in enrichment of the immune system-related GO terms, including cytokines, inflammation, chemokines in addition to muscle system processes, collagen, and contractile fiber tags (Figure 3D). Among KEGG pathways most enriched were TNF signaling, complement, and cytokine-related pathways (Figure 3E). Disease ontology enrichment analysis resulted, as expected, in mostly heart-related diseases such as cardiovascular system disease, cardiomyopathy, and other muscle and connective tissue-related diseases (Figure 3F). Connecting top differentially expressed clusters with heart diseases identified inactivation of cardiac structure-specific genes (MYH6, TNNT2) and upregulation of metabolism (LPL, APOA1) and immune response genes (CXCL10, CD36) in failing hearts (Figure 3G).
Functional annotation from the FANTOM5 database, such as cell type, anatomy, and disease ontology, was transferred on heart CAGE clusters (Supplementary Figure 9A). The cell type annotation revealed that TSS in human heart chambers had features of epithelial cells, myoblasts, fibroblasts, smooth muscle cells, electrically active cells, endothelial cells, and blood vessel cells. Such precise identification of cardiac cell types in clusters obtained from bulk tissue demonstrates the cell level specificity of regulatory elements. Organ (UBERON)13 ontology enrichment resulted in similarity to FANTOM5: heart, vessels, artery, and circulatory system-related samples, but also to embryo and lateral plate mesoderm samples. Disease ontology showed cardiovascular and heart-related diseases, but also cancer and disease of cellular proliferation. Similar patterns occurred in promoters and enhancers taken separately, confirming a functionally coherent behavior between the two types of regulatory elements. Then, to take a closer look into the functional differences between healthy and failing hearts, we annotated statistically significant differentially expressed clusters (Supplementary Figure 9B). The annotation showed a considerable involvement of the immune system and various embryogenesis-related pathways supporting previous observations of dedifferentiation of cells to the embryonic state during heart failure.
Differential expression analysis of ICM versus NICM identified disease-specific clusters, 323 for NICM and 255 for ICM (Figure 4A). Functional annotation with GO and KEGG highlighted several immune pathway activations in NICM (Figure 4B-C) and enrichment of lipid metabolism in ICM (Figure 4D-E). Then, we select several genes strongly associated with heart failure14 to investigate their TSS usage. Variations in TSS selection reflect the diversity of preinitiation complexes and can impact post-transcriptional RNA fates. We noticed no significant shift if ICM is directly compared to NICM. However, we observed significant differential remodeling of some clusters when ICM and NICM are compared separately to healthy samples. For example, the sarcomere gene, TTN, had an upregulated cluster in ICM, while NICM had an upregulated cluster in the junctional membrane gene, JPH2 (Supplementary Figure 1). Cytoskeleton-related gene, FLNC, had a differential usage of the TSS between NICM and ICM, the former having one cluster inactivated while the latter having two clusters significantly inactivated compared to a healthy state. These observations demonstrate that different disease etiology could cause activation of specific regulatory elements of the genome.
Differential expression analysis for sample groups, considering chamber, disease type (ICM or NICM), and sex allowed to determine specific groups of genes and their functional roles using GO and KEGG pathway enrichment (Supplementary Figure 11-12). Failing heart tags included channel inhibitor activity and metabolism-related pathways. Failing ventricles showed pathway enrichment related to structural and metabolic activity while failing atria pathways were associated with neuronal activity. Heart failure, in general, was accompanied by IRF and STAT transcription factors (Supplementary Figure 13). Detailed heatmaps are available on the Zenbu reports platform (https://fantom.gsc.riken.jp/zenbu/reports/#heart%20CAGE%20GO, https://fantom.gsc.riken.jp/zenbu/reports/#heart%20CAGE%20KEGG, https://fantom.gsc.riken.jp/zenbu/reports/#heart%20CAGE%20TFBS, https://fantom.gsc.riken.jp/zenbu/reports/#heart%20CAGE%20heatmaps).
SNPs in transcription factor binding sites can alter gene expression
While the most studied SNPs are located within the coding sequences, we also identified a 10% fraction of heart GWAS SNPs, which reside in regulatory regions (Figure 5A), most frequently in the promoter region (Figure 5B). The SNP density in heart CAGE clusters is higher than the genome-wide density, especially in classified consensus regions. The highest frequency of disease-associated SNPs corresponds to familial hypertrophic cardiomyopathy, congenital heart disease, atrial septal defect, cardiomyopathy, and familial atrial fibrillation (Figure 5C). Among National Human Genome Research Institute - European Bioinformatics Institute (NHGRI-EBI) GWAS SNPs, specific features included resting heart rate and glycated hemoglobin levels (Supplementary Figure 14A). Promoter regions showed accumulation of reticulocyte related SNPs while enhancers accumulated cardiac electrophysiology related SNPs (Supplementary Figure 14B-C).
Differentially expressed CAGE clusters between healthy and failing atria and ventricles accumulated GWAS SNPs related to blood pressure, autoimmune traits, blood protein levels in failing ventricles and electrocardiogram morphology, warfarin maintenance dose in failing atria (Figure 6A, Supplementary Table 6). To better understand the role of the SNPs in the regulatory regions, we identified the cases where these SNPs significantly change TFBS. In total, there were 2,679 GWAS SNPs localized in 479 unique TFBS. As an example, we looked at troponin, a known marker for the diagnosis of myocardial infarction and heart failure. The promoter of one of three troponin subunits, TNNI3, showed a statistically significant increase in its activity in failing hearts and was linked with SNP in its TFBS (Figure 6B-C). Other cases of SNP in TFBS are available on the Zenbu reports platform (https://fantom.gsc.riken.jp/zenbu/reports/#heart%20CAGE%20SNP).