Conserved motif loci captured actual TF binding sites
We first extracted conserved DNA-binding motifs from the JASPAR UCSC track hub[6] (see Materials and Methods for details). To visualize the number of motif loci in a given genomic region, we aligned the motif sites, from the raw data through to the conserved motif loci, in the NANOG promoter region of the human and mouse genomes (Fig. 1). Fig. 1A shows a narrow, 60-nucleotide segment of the NANOG promoter region. The original data from the JASPAR database contained a very large number of motif loci. This number decreased markedly after filtering for a motif matching score greater than 400. Because the filtered motif loci still included redundant motifs that localized to the same genomic locus, we next merged these motif loci into one locus according to TF family[16]. For example, nine motifs, including Dmbx1, GSC, Pitx1, PITX3, OTX1, OTX2 and RHOXF1, were located at the same genomic locus in the human NANOG promoter (chr12:7,941,908-7,941,924 in hg19) because these TFs share a similar DBD and belong to the same TF family[16]. Therefore, these nine motif loci were summarized into one region named “Paired-related HD factors”, following TFClass[16]. Finally, we aligned the human and mouse sequences and obtained the motif loci conserved between the two species. In the NANOG promoter, the conserved motif loci included POU domain factors, SOX-related factors, and paired-related HD factors (Fig. 1A). Fig. 1B displays a wider, 1,200-nucleotide window of the NANOG promoter. Within this window, the conserved motif loci specifically overlapped with actual POU5F1, SOX2, and OTX2 binding sites (Fig. 1B). These results indicated that the conserved motif loci can reflect actual TF binding sites.
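The filtering and merging steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the MotifLocus record, the filter_and_merge function, the family lookup table and the toy coordinates are hypothetical, and only the score cutoff of 400 and the TF family names come from the text. The cross-species conservation step is not shown.

```python
from collections import defaultdict
from typing import NamedTuple


class MotifLocus(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str    # motif name, e.g. "OTX2"
    score: int   # motif matching score


def filter_and_merge(loci, family_of, min_score=400):
    """Keep loci scoring above min_score, then merge overlapping loci
    whose motifs belong to the same TF family.

    family_of maps motif name -> family name (hypothetical lookup table);
    motifs missing from the table are treated as their own family.
    Returns a list of (chrom, start, end, family) tuples.
    """
    by_key = defaultdict(list)
    for locus in loci:
        if locus.score > min_score:
            family = family_of.get(locus.name, locus.name)
            by_key[(locus.chrom, family)].append((locus.start, locus.end))

    merged = []
    for (chrom, family), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:          # overlaps the interval being built
                cur_end = max(cur_end, end)
            else:                        # gap: close the interval, start a new one
                merged.append((chrom, cur_start, cur_end, family))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, family))
    return merged


# Toy input with made-up coordinates and scores (illustration only).
toy_loci = [
    MotifLocus("chr12", 100, 117, "OTX2", 520),
    MotifLocus("chr12", 101, 116, "PITX3", 480),
    MotifLocus("chr12", 100, 117, "GSC", 390),    # removed by the score filter
    MotifLocus("chr12", 200, 215, "SOX2", 450),
]
toy_family = {
    "OTX2": "Paired-related HD factors",
    "PITX3": "Paired-related HD factors",
    "GSC": "Paired-related HD factors",
    "SOX2": "SOX-related factors",
}
merged_loci = filter_and_merge(toy_loci, toy_family)
# merged_loci == [("chr12", 100, 117, "Paired-related HD factors"),
#                 ("chr12", 200, 215, "SOX-related factors")]
```

Grouping by chromosome and family before merging keeps loci from different families separate even when they overlap, which matches the idea of one merged locus per TF family.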
Importantly, the original data on genome-wide binding site predictions contain more than ten billion motif loci, and the file is very large (hg19: 65 GB with gzip compression), which makes the data difficult to handle and visualize. With these procedures, we ultimately extracted 9,281,940 conserved motif loci (0.09% of the original data) in the human genome. Because of their reduced file size (hg19: 305 MB with gzip compression), the conserved motif loci can be easily processed and visualized in the genome browser (Table 1).
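As a rough consistency check on the figures quoted above, the implied size of the original data set and the reduction in compressed file size can be worked out directly. The calculation assumes decimal units (1 GB = 1,000 MB) and treats the reported 0.09% as exact, although it is a rounded value.

```latex
\[
\frac{9{,}281{,}940}{0.0009} \approx 1.03 \times 10^{10}
\qquad\text{and}\qquad
\frac{305\ \text{MB}}{65{,}000\ \text{MB}} \approx 0.47\%
\]
```

The first value agrees with the statement that the original data contain more than ten billion motif loci. The second indicates that the compressed file shrinks to roughly 0.5% of its original size.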
Conserved motif loci were localized near TSSs
We next evaluated the genomic distribution of conserved and nonconserved motif loci. Although both conserved and nonconserved motif loci were frequently located near TSSs, conserved motif loci showed sharper localization at TSSs than nonconserved motif loci (Supplementary Fig. S1A). Moreover, the distribution across genomic features showed that conserved motif loci were highly enriched in promoter-TSS and exon regions (conserved: 3.7% and 7.0%; nonconserved: 1.2% and 0.7%, respectively) (Supplementary Fig. S1B). These results indicated that conserved motif loci were more strongly enriched at TSSs than nonconserved motif loci. In addition, higher motif matching scores were associated with greater localization to TSS and exon regions, which may reflect the evolutionarily conserved sequences at these loci (Supplementary Fig. S1B).
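A distribution around TSSs of the kind summarized above is typically built from the distance between each motif locus and its nearest TSS. The following Python sketch shows one way to compute that distance. The function name and toy coordinates are hypothetical, strand is ignored for simplicity, and the annotation tool actually used in this study may define distances differently.

```python
from bisect import bisect_left


def distance_to_nearest_tss(position, tss_positions):
    """Signed distance (position - TSS) to the nearest TSS.

    tss_positions must be a non-empty, sorted list of TSS coordinates on
    the same chromosome as position. Strand is ignored in this sketch.
    """
    i = bisect_left(tss_positions, position)
    # Only the TSSs immediately left and right of the position can be nearest.
    candidates = tss_positions[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda tss: abs(position - tss))
    return position - nearest


# Toy TSS coordinates (made up for illustration).
toy_tss = [1000, 5000, 9000]
toy_distances = [distance_to_nearest_tss(p, toy_tss) for p in (1200, 4900, 500, 9500)]
# toy_distances == [200, -100, -500, 500]
```

A histogram of these distances, computed separately for conserved and nonconserved loci, would give profiles comparable to those described above.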
Conserved motif loci were enriched in TF binding sites and histone marked regions at both promoter and intergenic regions
To evaluate whether conserved motif loci are more enriched in regulatory DNA elements than nonconserved motif loci, we calculated the percentage of motif sites that overlapped with TF binding sites and histone marks. We used publicly available ChIP-seq peak data for TFs, H3K27ac (active promoter and enhancer mark) and H3K4me1 (active enhancer mark), which were downloaded from ChIP-Atlas[17].
We found that the conserved motif loci showed significantly higher overlap with TF binding sites than nonconserved motif loci in every genomic feature (all regions; conserved: 10.5%, nonconserved: 3.8%) (Fig. 2A). Interestingly, conserved motif loci were significantly enriched compared with nonconserved motif loci not only in the promoter regions but also in the intergenic region (intergenic; conserved: 8.2%, nonconserved: 3.1%) (Fig. 2A). Likewise, the conserved motif loci were enriched at the active histone marks H3K27ac (all regions; conserved: 36.9%, nonconserved: 23.1%) (Fig. 2B) and H3K4me1 (all regions; conserved: 31.6%, nonconserved: 19.0%) (Fig. 2C), which indicated that conserved motif loci were preferentially activated.
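The overlap percentages reported here amount to counting how many motif loci intersect at least one peak. A minimal Python sketch of that calculation is given below. The function names and toy intervals are hypothetical, intervals are assumed to be half-open (start inclusive, end exclusive), and the tool used in this study for the actual analysis is not implied.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def percent_overlapping(loci, peaks):
    """Percentage of loci that overlap at least one peak.

    loci and peaks are lists of (chrom, start, end) with half-open intervals.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    # After merging, peaks are disjoint, so starts and ends are both sorted.
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])

    hits = 0
    for chrom, start, end in loci:
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        i = bisect_right(ends, start)        # first peak ending after the locus start
        if i < len(starts) and starts[i] < end:
            hits += 1
    return 100.0 * hits / len(loci) if loci else 0.0


# Toy intervals (made up for illustration).
toy_motif_loci = [("chr1", 10, 20), ("chr1", 30, 40), ("chr1", 50, 60), ("chr2", 5, 15)]
toy_peaks = [("chr1", 15, 35), ("chr1", 60, 70), ("chr3", 0, 100)]
toy_percent = percent_overlapping(toy_motif_loci, toy_peaks)
# toy_percent == 50.0
```

Running the same function on conserved and nonconserved loci, restricted to one genomic feature at a time, would yield pairs of percentages of the kind compared in the text.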
Since the cooperation of multiple TFs is important for triggering target gene transcription[18], we next hypothesized that genomic loci at which multiple conserved motif loci accumulate are associated with actual TF binding. The results showed a significant dependence of TF binding on the number of accumulated motifs at conserved motif loci (number of accumulations = 1: 46.1%; number of accumulations ≥ 6: 67.9%). In contrast, nonconserved motif loci showed no clear dependence on motif accumulation (number of accumulations = 1: 34.1%; number of accumulations ≥ 6: 37.0%) (Fig. 2D). Furthermore, the accumulation of conserved motif loci was significantly associated with H3K27ac and H3K4me1 marks, whereas nonconserved motif loci showed no such dependence (Fig. 2E and F).
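To make the notion of motif accumulation concrete, the sketch below groups nearby motif loci into clusters and counts the loci in each cluster. How accumulation was defined in this study is not restated here, so the clustering rule (loci that overlap or lie within max_gap bases of each other), the function names and the toy coordinates are all assumptions made for illustration.

```python
def accumulation_counts(loci, max_gap=0):
    """Cluster motif loci and count how many loci fall in each cluster.

    loci is a list of (chrom, start, end). Two loci join the same cluster
    when the next one starts no more than max_gap bases after the current
    cluster end (an assumed rule, not the study's definition).
    Returns a list of (chrom, start, end, n_loci) tuples.
    """
    clusters = []
    for chrom, start, end in sorted(loci):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start <= last[2] + max_gap:
            last[2] = max(last[2], end)   # extend the current cluster
            last[3] += 1
        else:
            clusters.append([chrom, start, end, 1])
    return [tuple(c) for c in clusters]


def accumulation_bin(n_loci, cap=6):
    """Label a count as "1", "2", ... or ">=cap", mirroring the bins in the text."""
    return f">={cap}" if n_loci >= cap else str(n_loci)


# Toy coordinates (made up for illustration).
toy_loci = [("chr1", 100, 110), ("chr1", 105, 120), ("chr1", 300, 310), ("chr2", 100, 110)]
toy_clusters = accumulation_counts(toy_loci)
# toy_clusters == [("chr1", 100, 120, 2), ("chr1", 300, 310, 1), ("chr2", 100, 110, 1)]
```

Each cluster could then be tested for overlap with TF binding sites or histone marks, and the results summarized per accumulation bin.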
Next, inspection of the SOX2 genomic locus revealed that pluripotent stem cell (PSC)-related TFs, such as NANOG, POU5F1 and SOX2, actually bound the corresponding conserved motif loci (NK-related, POU domain, and SOX-related motif sites, respectively) at the SOX2 promoter and distal regions in PSCs. Moreover, the PSC-related TF binding sites localized at accumulated motif loci (Fig. 2D). These results demonstrated that the conserved motif loci were enriched in functional DNA elements at both the promoter and distal regulatory regions.
Conserved motif loci were enriched in enhancer loci
Enhancers are one of the major classes of distal regulatory DNA regions and are essential for the precise control of gene transcription[19, 20]. Because the conserved motif loci showed higher enrichment at TF binding sites and active histone-marked regions in distal regulatory regions as well as promoter regions, we next focused on the association between conserved motif loci and enhancer regions. We used the FANTOM5 enhancer atlas, which describes likely active enhancer regions on the basis of bidirectional enhancer RNA expression[21]. The percentage of conserved motif loci overlapping with FANTOM5 enhancer sites showed that the conserved motif loci were highly enriched in enhancer regions compared to nonconserved motif loci (all regions; conserved: 1.8%, nonconserved: 0.7%) (Fig. 3A). Next, we examined a genomic locus containing an enhancer DNA element (VISTA Enhancer element ID: hs1862) that showed reproducible enhancer activity in heart tissue[22]. At this VISTA enhancer locus, we found that the conserved motif locus “GATA-type zinc fingers” overlapped with the FANTOM5 enhancer. Based on this finding, we analyzed publicly available ChIP-seq data for GATA4[23], a transcription factor that plays essential roles in heart development[24, 25]. Indeed, we observed GATA4 binding at the conserved motif locus and enhancer region in iPSC-derived cardiomyocytes (Fig. 3B). These results indicated that conserved motif loci can preferentially be utilized as enhancers.
Conserved motif loci were functionally exploited in glioblastoma
To assess whether TF binding at conserved motif loci can trigger the expression of critical genes in a specific cellular context, we focused on the transcription factor Achaete-scute like 1 (ASCL1), which plays a key role in neuronal differentiation in glioblastoma through inhibition of NOTCH signaling[26, 27]. We used ChIP-seq data for ASCL1 and found that the conserved ASCL1-related motif loci overlapped with actual ASCL1 binding sites more frequently than nonconserved motif loci (conserved: 1%, nonconserved: 0.3%) (Fig. 4A). Next, we analyzed RNA-seq data for WT and ASCL1 KO human primary glioblastoma cells[27] and integrated the ChIP-seq and motif data to predict possible ASCL1 target genes. We found that possible target genes with ASCL1 binding at conserved ASCL1-related motif loci, such as JAG2, DLL1 and DLL3, were significantly associated with the NOTCH signaling pathway (Fig. 4B). We confirmed that the ASCL1 binding sites corresponded to the conserved ASCL1-related motif loci near DLL3 and DLL1. In addition, these ASCL1 binding sites localized at accumulated motif loci (Fig. 4C). In summary, these results suggest that conserved motif loci can be functionally exploited to regulate the expression of key genes in glioblastoma cells.
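The integration of ChIP-seq, motif and RNA-seq data described above can be thought of as intersecting two gene sets: genes with TF binding at a conserved motif locus, and genes whose expression changes upon knockout. The Python sketch below illustrates that logic only. The function name, the thresholds, the placeholder gene names and the toy expression values are all hypothetical and do not reproduce the criteria or data of this study.

```python
def candidate_targets(bound_motif_genes, expression,
                      min_abs_log2fc=1.0, max_padj=0.05):
    """Genes with TF binding at a conserved motif locus that also change
    expression between KO and WT.

    bound_motif_genes: set of gene symbols assigned to a TF peak that
        overlaps a conserved motif locus (peak-to-gene assignment is
        assumed to have been done beforehand).
    expression: dict mapping gene symbol -> (log2 fold change, adjusted p).
    The default thresholds are illustrative assumptions.
    """
    return sorted(
        gene
        for gene in bound_motif_genes
        if gene in expression
        and abs(expression[gene][0]) >= min_abs_log2fc
        and expression[gene][1] <= max_padj
    )


# Placeholder genes and made-up statistics (illustration only).
toy_bound = {"GENE_A", "GENE_B", "GENE_C"}
toy_expression = {
    "GENE_A": (-2.1, 0.001),   # bound and differentially expressed
    "GENE_B": (0.2, 0.6),      # bound but unchanged
    "GENE_D": (-3.0, 0.0001),  # changed but not bound
}
toy_targets = candidate_targets(toy_bound, toy_expression)
# toy_targets == ["GENE_A"]
```

The resulting gene list would then be passed to a pathway enrichment analysis, which is the step that links the candidate targets to NOTCH signaling in the text.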