Human heart CAGE library identified novel promoters and enhancers
We collected tissue samples from non-diseased human donor hearts (Supplementary Table 1) from four cardiac chambers: left and right, atria, and ventricles. We used a total of 41 samples from 14 hearts for CAGE library construction, sequencing, mapping to genome, and TSS identification. Based on this CAGE data, we identified 49,179 new heart CAGE robust decomposition peak identification (DPI) clusters, which had a 78.6% overlap with FANTOM5 annotation, and the remaining 10 k were new previously unknown peaks (Fig. 1a). These peaks were classified into TSS-like according to nucleotide distribution against the Eukaryotic Promoter Database (EPD) (29,598 TSS). They resulted in 27,295 TSS-like clusters in heart, where 2,750 had no overlap with FANTOM5. Bidirectionally expressed CAGE TSS (CTSS) were used for cardiac-active enhancer annotation (n = 6,147) and compared to FANTOM5 (Fig. 1a). These heart CAGE based regulatory elements had features of active promoters and enhancers (Fig. 1b-d; Supplementary Fig. 1) and were used in all remaining analyses. Notably, heart CAGE peaks accumulate DNase I hypersensitive sites, and histone modification marks signals at a higher level than FANTOM5 peaks, while heart ATAC-seq signal enrichment was expected and representative in case of enhancers (Supplementary Fig. 2). These observations, together with high conservation scores and CpG overlap proportion, suggested the housekeeping nature of defined heart CAGE clusters. Almost all peaks classified as TSS were associated with transcripts and related genes, while in the case of DPI clusters and enhancers, 82% and 20% could be linked respectively (Fig. 1e).
Since there was a significant overlap of heart CAGE peaks with FANTOM5, we assigned available ontologies (cell line, disease, organ) to robust DPIs. Interestingly, enrichment results showed embryo related features in defined heart CAGE peaks (Supplementary Fig. 3); cardiovascular system-related tags were enriched as well. Disease and Cell ontology enrichment results were consistently reporting “cancer” and “embryonic cell” as the top hits, also including heart-related diseases, myoblasts, fibroblasts, and other expected cell types. These results agree with the heart composition profile since we found mesodermal cells, muscle stem cells, smooth muscle cells, electrically active/responsive cells, fibroblasts, and other cell type-specific promoters in the heart (Supplementary Fig. 3).
The total number of genes active in the heart was 13,849 (Supplementary Table 2), resulting in 3.6 robust DPI to gene ratio and 2.47 TSS per gene. Only 30.0% (4,149) genes have a single CAGE peak (Fig. 2a), but a majority of genes (70%) had multiple DPI and TSS. Moreover, while in general heart CTSS distance to available transcripts models was within 98 bp, numerous outlier peaks could be found (n = 6,003, Fig. 2a). Genes with such “distal” CAGE peaks were associated with numerous heart-related diseases including cardiomyopathy, hypertrophic cardiomyopathy, cardiovascular system disease (Fig. 2b), and related to regulation activity, like transcription factor binding or co-regulator activity (Fig. 2c).
CAGE library identified cardiac-specific alternative promoters
CAGE, as a transcriptomic approach, allowed estimating expression precisely at the promoter level and resulted in a clear separation of ventricular and atrial samples in our study (Fig. 2d-e). In total, we obtained 2,111 atria and 1,589 ventricle specific CAGE peaks, where 830 and 491 are TSS-like, respectively (|logFC|>1, FDR < 0.05; Supplementary Table 2). Functional analysis of specific genes revealed neural system related ontology in atria while catabolic, and oxidation processes were enriched in the ventricles (Fig. 2f-g). FANTOM5 ontologies-based enrichment for specific groups resulted in embryo type tags for atria and heart tags (primary circulatory organ, cardiac valve, endocardium, etc.) in ventricles. In contrast, heart and vascular diseases were predictably overrepresented in both chambers (Supplementary Fig. 4). Fibroblasts, myoblasts, multi-potent muscle stem cells, electrically active/responsive, and muscle precursor cells were enriched in the atria, while epithelial and fat cells in ventricles (Supplementary Fig. 4). Genes annotated by heart CAGE peaks and specifically expressed in chambers had cardiovascular system disease and cardiomyopathy as top tags in atria and ventricles, respectively (Supplementary Fig. 5). These genes had clear functional separation at pathway level through KEGG annotation: atrial genes participate in PI3K − Akt, TGFβ, AGE − RAGE signaling; ventricular genes are involved mostly in carbon metabolic pathways, including PPAR and fatty acid degradation (Supplementary Fig. 5). Motif analysis for specific promoter regions resulted in androgen receptor (AR), NFIX, ZNF263 top transcription factors in the atria, and Esrrb, ESRRA, NR4A2, HNF4A, ESR2 in ventricles (Supplementary Fig. 5).
We also found that hundreds of newly annotated heart CAGE peaks could be classified as “distal” and specifically expressed in ventricles and atria: 320 and 332, respectively (Fig. 2e). There are 13 genes, which have more than one CAGE peak with the opposite specificity (Fig. 2h), and at the same time, some of them are characterized as TSS-like or “distal.” Some representative examples are illustrated in Supplementary Fig. 6–7. The utilization of alternative promoters allows alternative splicing resulting in multiple transcript variants encoding different protein isoforms.
One of the critical regulators of heart development, TBX5, had one CAGE peak classified as TSS-like, which was located close to the existing gene model and expressed specifically in the atria. Still, alternative downstream CAGE peak had ventricle-specific expression (Supplementary Fig. 6). TBX5 acts as a key transcription factor during embryonic development, and its mutations have been linked to several cardiac defects and diseases, including atrial fibrillation and cardiac septal defect 12,13. PITX2 gene encodes a regulator of TBX5 14 and has also been implicated in the pathogenesis of atrial fibrillation and cardiomyopathy 15. Interestingly, PITX2 has multiple CAGE peaks with left atrial specificity (Supplementary Fig. 6)
Another intriguing example, KALRN gene encodes several isoforms by using alternative transcription start and termination sites (Supplementary Fig. 6, 14, Table 6–7).16,17 CAGE sequencing analysis confirmed the heart activity of at least two of these alternative starts of transcription, 552,218 bp apart. The two start sites were located in positions 124,032,456 (Site A) and 124,584,674 (Site V); Based on CAGE, Site A is approximately equally active in both atria and ventricles (15.66:12.15), while Site V is almost 3-fold more active in the ventricles (8.94:25.13). Site V corresponded to the shorter isoform (V), encoding the 1288 aa long protein, while Site A corresponded to the longer isoform (A), encoding the 2988 aa long protein. The additional 2,000 amino acid long sequence contains five copies of the Spectrin repeat, Sec14p-like lipid-binding domain, and an extra set of RhoGEF/PH2_Kalirin_Trio/SH3, already included in the shorter isoform V. Spectrin repeats of the isoform A are used for interaction of KALRN with several proteins, including inducible nitric oxide synthase.18–20 Nitric oxide mediates multiple physiological and pathophysiological processes in the cardiovascular system, especially metabolism21, its crucial role in the development of cardiovascular diseases has been demonstrated by numerous human and animal studies.22–24 The isoform A contains an intronic SNP, rs9289231, associated with early-onset coronary artery disease (CAD).25 It has been proposed that the role of KALRN in CAD susceptibility is through its interaction with the inducible nitric oxide synthase.25
Calpastatin or CAST is a calpain (calcium-dependent cysteine protease) inhibitor. Calpain family has been shown to play an essential role in the pathogenesis of hypertrophy; for example, calpain inhibition has been shown to modulate the process of ventricular hypertrophy. CAGE data indicate alternative splicing domains between atria and ventricles, which might explain why inhibition by calpastatin is more effective in the ventricles than atria 26 (Supplementary Fig. 6).
Gene IRF6 has two transcription initiation regions: 2 and 5 DPI are atria and ventricle specific, respectively (Supplementary Fig. 6). In these two cases, the final protein product remains unchanged, but in the case of ABLIM1, AMOTL1, CAST, RGS3, the structure is likely to be changed (Supplementary Fig. 6). IRF6, ABLIM1, and AMOTL1 are involved in the developmental process of the heart. All these examples could be further investigated through the Zenbu page (https://fantom.gsc.riken.jp/zenbu/reports/#Human_Heart_CAGE_A).
SNP related to cardiac diseases accumulate in the TSS region
The intersection of heart CAGE clusters with heart-related GWAS SNPs resulted in the accumulation of variants in DPI regions. There are 1.7–85.4% (9.86% in median) of SNPs on primary chromosomes located in heart promoter regions defined in this study, 1.4–77.1% in “robust” DPI, as defined by FANTOM5 pipeline (Fig. 3a). After normalization by the total length of the heart, CAGE DPI regions, and Gencode v33 coding sequence (CDS) showed the highest proportion of heart GWAS SNPs. We also found SNPs distribution peak close to TSS defined by heart CAGE (Fig. 3b), as well as for several heart diseases (Fig. 3c-g). In total, 4,302 and 12 unique heart GWAS SNPs were identified in heart CAGE DPI and enhancers, respectively. In DPI regions (± 250 bp), the average MAF is 0.027, compared to 0.117 genome-wide except DPI. Distribution of MAF and SNPs density near the TSS-like areas is shown in Supplementary Fig. 8.
Quality of annotation allowed the identification of novel regulatory elements
The prediction quality of promoters and enhancers was assessed from the analysis of nucleotide consensus, DNA methylation, and characteristic motifs in the vicinity of the TSS (Fig. 4). To compare the two groups (promoters and enhancers), we selected two subsets of data. To represent the promoters, we have chosen those TSS that were annotated as “true,” then we have selected the start site with the highest CAGE expression as the “best” TSS per gene (10,206 TSS). The distribution of frequencies of nucleotides (Fig. 4a) is consistent with the well-defined start of transcription. There are peaks of A and T at TSS-30 and pronounced G-richness of the transcribed region; all of these observations were well-documented in prior studies and across multiple organisms 27–32. Enhancer positions were predicted all over the genome, with 83% of them located inside a transcribed sequence. We have excluded those enhancers that overlapped with exons, 5’ UTR, 3’ UTR, and promoters. Ten thousand eight hundred thirty-two enhancer-TSS sequences satisfy these conditions. Statistical patterns of nucleotides around the “enhancer-TSS” (Fig. 4b) differed from a typical TSS distribution showing narrow and pronounced peaks at the TSS and TSS-30. DNA sequence around the enhancer-TSS showed a GC-rich 500 nucleotide long “bubble.”
TATA-box is usually located at -30 upstream from the TSS and is particularly important. CAGE-annotated promoters showed a peak of TATA-like sequences (“TATA”, “TAAA”, and “ATAT”) at the window [TSS-40, TSS-20] (Fig. 4b). Enhancers had a dip in the frequency of TATA at TSS. Another remarkable feature of the TSS is the CA dinucleotide motif located at the start of transcription (Fig. 4c). CAGE-predicted TSS also had an elevated frequency of this motif, while enhancers lacked this trend. Compositional strand asymmetry is a characteristic feature of important genomic regions, such as starts of trаnscription, tRNA insertion sites, and transposable elements 33. Peaks in CG-skew and AT-skew were previously associated with TSS 34,35. At the human TSS, there was a narrow peak of AT-skew and similarly small dip of CG-skew (Fig. 4d). The trends in enhancers were different and much more subtle: they had small approximately symmetrical peaks of AT-skew and CG-skew ~ 300 − 200 nucleotides upstream/downstream of the enhancer-TSS (Fig. 4D)
Another remarkable feature of promoters - the peak of the density of transcription factor binding sites (TBFS) upstream of the start of transcription (Fig. 4e). This peak corresponded to the core promoter, typically found at [TSS-200, TSS]. TFBS distribution (Fig. 4e) had a sharp spike in the core promoter region [TSS-200, TSS], followed by a drop in the 5’ UTR, and then there was a second smaller peak at TSS + 300 nucleotides. As with all other features, the trends near the enhancer-TSS were much less pronounced: only a slight increase of TBFS density at enhancer-TSS. A natural question if the sequence motifs near TSS and enhancer-TSS are the same? What motifs are associated with elevated levels of transcription? We have conducted an analysis of sequence motifs using Match 36 for known TFBS and cisExpress27,30 for de-novo discovery. For both TSS and enhancer-TSS, the most common and essential for transcription motifs were made of cytosines and guanines: GGGGC, GGGGG, GGCGG (and their reverse complements) ATAAT (and variations).
Trends of the methylation profiles could also be used to validate the quality of TSS prediction 28,37. There was a drop in the DNA methylation that starts immediately upstream of the TSS. Since the gene bodies are generally more methylated,38 the resulting pattern is U-shaped (Fig. 4f)