Conserved DNA-binding motif loci reflect functional transcription factor binding sites

Background Transcription factors (TF) regulate gene expression by binding to specific DNA elements called DNA-binding motifs, or motifs. The motifs at a certain genomic locus can offer clues for predicting TFs that control target genes. However, since motifs are short nucleotide sequences and many TFs share similar motifs, they are distributed across the whole genome, making it difficult to investigate candidate functional DNA elements. Because previous studies have reported that TF binding sites are highly conserved among mammals, we focused on the conserved motif loci between humans and mice to identify functional DNA elements. Results Our results showed that conserved motif loci overlapping considerably with TF binding sites at both promoter and intergenic regions compared to nonconserved motif loci. In addition, conserved motif loci were significantly enriched in enhancer regions and enabled us to predict GATA4 binding to the heart-specific enhancer. Moreover, the integration of ChIP-seq and RNA-seq glioblastoma data revealed that genes with ASCL1 binding at the conserved ASCL1-related motifs were associated with NOTCH signaling, which is a critical pathway in glioblastoma. Conclusions These results suggest that conserved motif loci can reflect gene regulatory elements and can be utilized to predict candidate TFs that regulate genes of interest. The conserved motif data are publicly available at


Background
Transcription factors (TFs) are a large family of DNA-binding proteins that play critical roles in regulating gene transcription. TFs typically contain a DNA-binding domain (DBD) and control the activity of regulatory DNA elements such as promoters and enhancers through their binding. One of the important functions of TFs is recognizing specific DNA elements, which are called DNA-binding motifs, or motifs. Because the structure of the DBD is divergent among TFs, each TF has a particular preference of motifs to a greater or lesser degree [1].
Recent technological developments have allowed us to characterize the specificity of motifs in each TF using high-throughput methods. The analysis by protein-binding microarrays (PBMs) [2], high-throughput SELEX (HT-SELEX) [3] and ChIP-seq [4] revealed high-resolution motifs in a large proportion of mammalian TFs [5]. These results have provided a high-quality motif catalog, and several well-maintained databases are available for obtaining motif information [3,[6][7][8][9]. In addition, by applying the motif information, the prediction of motif sites across the whole genome is accessible with databases [6] or can be performed by software [10].
Although these resources for motif sites throughout the whole genome are undoubtedly important for investigating DNA elements that can regulate transcription, it is difficult to infer functional motif loci from the predicted motif sites. Because the motifs are extremely short (typically ten nucleotides in length) [11] and several motifs allow flexibility to match DNA sequences, the predicted motif sites are almost ubiquitously distributed throughout the whole genome. Moreover, many TFs recognize similar DNA sequences because they share similar DBD structures, which makes motif sites too redundant and enormous to handle and visualize the data. Therefore, we aimed to extract candidates for functional motif loci. Previous studies have reported that the gene expression patterns of each tissue are highly conserved in several vertebrates [12,13] and that the primary TF binding sites are also stable among mammalian species [14,15]. These results suggest that a major proportion of regulatory DNA elements is evolutionarily conserved. Thus, we focused on the motifs localized at DNA sequences that are conserved between humans and mice. Here, we demonstrated that the conserved motif loci showed a higher overlap with actual TF binding sites and that they were enriched in enhancer loci compared to nonconserved motif loci. Furthermore, the genes near TF binding sites on the conserved motif loci reflected a critical pathway in human primary glioblastoma. These results suggested that the conserved motif loci can capture functional DNA elements.

Results And Discussion
Conserved motif loci captured actual TF binding sites We first extracted conserved DNA-binding motifs from the JASPAR UCSC track hub [6] (see details in Materials and Methods). To visualize the number of motif loci in a certain genomic region, we first aligned motif sites from raw data to conserved motif loci in the NANOG promoter region in the human and mouse genomes (Fig. 1). Fig. 1A shows a narrow segment that is 60 nucleotides in length in the NANOG promoter region. Original data from the JASPAR database contained a bewildering amount of motif loci. The number of motif loci significantly decreased after filtration using a motif matching score greater than 400. Because the filtered motif loci included redundant motifs that localized to the same genomic loci, we next merged these motif loci into one locus according to the TF family [16]. For example, the nine motifs named Dmbx1, GSC, Pitx1, PITX3, OTX1, OTX2 and RHOXF1 were located at the same genomic locus in the human NANOG promoter (chr12:7,941,908-7,941,924 in hg19) because these TFs share a similar DBD and belong to the same TF family [16]. Therefore, these nine motif loci were summarized into one region named "Paired-related HD factors" referring to the TFClass [16]. Finally, we aligned sequences from human and mouse and acquired conserved motif loci between the two species. The conserved motif loci included POU domain factors, SOX-related factors, and paired-related HD factors in the NANOG promoter (Fig. 1A). Furthermore, Fig. 1B displays a wider range of NANOG promoters of 1,200 nucleotides in length. The conserved motif loci specifically overlapped with actual POU5F1, SOX2, and OTX2 binding sites (Fig. 1B).
These results indicated that the conserved motif loci can reflect actual TF binding sites.
Importantly, the original data on genome-wide binding site predictions contain more than ten billion motif loci, and the file size is very large (hg19: 65 GB in gzip compression), which makes the data difficult to handle and visualize. With the procedures, we ultimately extracted 9,281,940 conserved motif loci (0.09% of the original data) in the human genome. The conserved motif loci enable easy data processing and visualization using the genome browser because of its reduced file size (hg19: 305MB in gzip compression) ( Table   1).

Conserved motif loci were localized in TSS sites
We next evaluated the genomic distribution of conserved and nonconserved motif loci.
Although both conserved and nonconserved motif loci were frequently located near TSS regions, conserved motif loci showed sharper localization at TSS sites compared with nonconserved motif loci ( Supplementary Fig. S1A). Moreover, the distribution of genomic features displayed conserved motif loci that were highly enriched in promoter-TSS and exon sites (conserved: 3.7% and 7.0%, nonconserved: 1.2% and 0.7%, respectively) ( Supplementary Fig. S1B). The results indicated that conserved motif loci showed higher enrichment in TSS sites compared to nonconserved motif loci. In addition, the higher motif matching score was correlated with the higher localization of TSS and exon sites, which can reflect the evolutionarily conserved sequence at these loci. (Supplementary Fig. S1B).

Conserved motif loci were enriched in TF binding sites and histone marked regions at both promoter and intergenic regions
To evaluate how conserved motif loci are enriched in regulatory DNA elements compared to nonconserved motif loci, we counted the percentage of motif sites that overlapped with TF binding sites and histone marks. We utilized publicly available ChIP-seq peak data for TFs, H3K27ac (active promoter and enhancer mark) and H3K4me1 (active enhancer mark), which were downloaded from ChIP-Atlas [17].
Since the cooperation of multiple TFs is important for triggering target gene transcription [18], we next hypothesized that genomic loci that accumulated with multiple conserved motif loci are related to actual TF binding. The results showed the significant dependency of the number of motif accumulations and TF bindings in conserved motif loci (number of accumulations = 1: 46.1%; number of accumulations ≥ 6: 67.9%). In contrast, nonconserved motif loci did not clearly show the dependency of motif accumulation (number of accumulations = 1: 34.1%; number of accumulation ≥ 6: 37.0%) (Fig. 2D).
Furthermore, the accumulation of conserved motif loci was significantly associated with H3K27ac and H3K4me1 marks. On the other hand, nonconserved motif loci did not show dependency ( Fig. 2E and F).
Next, viewing of the SOX2 genomic locus revealed that pluripotent stem cell (PSC)-related TFs, such as NANOG, POU5F1 and SOX2, actually bound the corresponding conserved motif loci (NK-related, POU domain, and SOX-related motif sites, respectively) at the SOX2 promoter and distal regions in PSCs. Moreover, the PSC-related TF binding sites localized at accumulated motif loci (Fig. 2D). The results demonstrated that the conserved motif loci were enriched in functional DNA elements at both the promoter and distal regulatory regions.
Conserved motif loci were enriched in enhancer loci Enhancers are one of the major distal regulatory DNA regions and are essential for the precise control of gene transcription [19,20]. Because the conserved motif loci showed higher enrichment at TF binding sites and active histone marked regions in distal regulatory regions as well as promoter regions, we next focused on the association between conserved motif loci and enhancer regions. We utilized the FANTOM5 enhancer atlas, which describes promising active enhancer regions according to bidirectional enhancer RNA expression [21]. The percentage of overlapping conserved motif loci with FANTOM5 enhancer sites demonstrated that the conserved motif loci were highly enriched in the enhancer region compared to nonconserved motif loci (all regions; conserved: 1.8%, nonconserved: 0.7%) (Fig. 3A). Next, we examined the genomic locus, including the enhancer DNA element (VISTA Enhancer element ID: hs1862), that showed reproducible enhancer activity in heart tissue [22]. At the VISTA enhancer locus, we found the conserved motif locus "GATA-type zinc fingers" overlapped with the FANTOM5 enhancer.
From this finding, we processed publicly available ChIP-seq data for GATA4 [23], a transcription factor that plays essential roles in heart development [24,25]. Indeed, we observed GATA4 binding at the conserved motif locus and enhancer region in iPSC-derived cardiomyocytes (Fig. 3B). These results indicated that conserved DNA loci can preferentially be utilized as enhancers.
Conserved motif loci were functionally exploited in glioblastoma To assess whether TF binding at conserved motif loci can trigger the expression of critical genes in a certain cellular system, we focused on the transcription factor Acheate-scute like 1 (ASCL1), which plays a key role in neuronal differentiation in glioblastoma by NOTCH signaling inhibition [26,27]. We utilized the ChIP-seq data for ASCL1 and found that the conserved ASCL1-related motif loci were highly overlapped in actual ASCL1 binding sites compared to nonconserved motif loci (conserved: 1%, nonconserved: 0.3%) (Fig. 4A).
Next, we analyzed the RNA-seq data for WT and ASCL1 KO human primary glioblastoma cells [27] and integrated the ChIP-seq and motif data to predict possible ASCL1 target genes. We found that possible target genes, such as JAG2, DLL1 and DLL3, that included ASCL1 binding on the conserved ASCL1-related motif loci were significantly associated with the NOTCH signaling pathway (Fig. 4B). We confirmed that the ASCL1 binding sites corresponded to the conserved ASCL1-related motif loci near DLL3 and DLL1. In addition, these ASCL1 binding sites localized at accumulated motif loci (Fig. 4C). In summary, these results suggest that conserved motif loci can be functionally exploited to regulate the expression of key genes in glioblastoma cells.

Conclusions
In this study, we extracted the conserved DNA-binding motif loci between humans and mice and demonstrated that they reflect a proportion of functional TF binding sites compared with nonconserved motif loci. Particularly, the conserved motif loci enabled us to predict GATA4 binding to heart-specific enhancer region and GATA4 ChIP-seq data validated the prediciton (Fig. 3B). Moreover, we demonstrated that the conserved motif loci can be preferencially utilized as functional TF binding sites (Fig. 4). These result indicated that our data of the conserved motif loci provides a useful clue for predicting functional TF binding sites.
Because the data of conserved motif loci contains only 0.1% of the motif sites from raw data, it allows us to easily visualize them in a genome browser and recognize the candidate TFs that can regulate a gene of interest (Table 1 and Fig. 1B). Although we recognized that there are still a considerable number of false positives, we believe that the conserved motif data can allow researchers to predict possible candidate TFs that control the transcription of genes of interest.

Methods
Extracting conserved DNA-binding motif loci We downloaded BED files of hg19 and mm10 genomic data from the JASPAR UCSC track hub (http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/; last modified on January 16, 2018) [6,28]. The BED files contain chromosomal loci of predicted motif sites, JASPAR motif names, and motif matching scores (transformed p-values, which indicate the significance of matches between genomic sequences and known motif profiles) [29]. We first extract motif sites with a motif matching score greater than 400 (p-value < 10-4). To exclude redundant motif loci, we merged motifs that included the same TF family. For example, the motifs of GATA1, GATA2, GATA3 and GATA4 can be located at the same genomic locus because their binding motifs are quite similar to each other. Thus, these motif loci were merged into one locus as "GATA-related motifs". To merge original motifs into the family, we referred to human TFClass (draft version, October 05, 2018) [16] that provides the hierarchical classification of eukaryotic TFs based on their DBDs. We used bedtools (version 2.28.0) [30] to merge genomic regions that included motif loci of the same TF family. The merged genomic loci were excluded when the sequence length was extremely short (less than four nucleotides) or long (more than thirty nucleotides). Then, we culled the conserved motif loci that existed in the corresponding locus between the hg19 and mm10 genomes. To do this, we utilized hg19-mm10 alignment sequence data that can be accessed from the UCSC genome browser (http://hgdownload.cse.ucsc.edu/goldenpath/mm10/vsHg19/axtNet/; last modified on March 08, 2012) [31]. The genomic annotation of motif locations in hg19 was conducted by HOMER annotatePeaks.pl (version 4.10.4) [10]. The motif sites located 1-2 kb and 2-3 kb upstream of the transcription start site (TSS) were annotated as "promoter-1-2 kb" and "promoter-2-3 kb", respectively.

ChIP-Atlas
The data on TF binding sites in pluripotent stem cells (PSCs) near the NANOG promoter region were obtained from the ChIP-Atlas Peak Browser (accessed on July 01, 2019) [17].
The BED files containing the TF, H3K27ac, and H3K4me1 ChIP-seq peaks were also downloaded from the ChIP-Atlas Peak Browser (accessed at July 01, 2019) [17]. We selected the peaks with a MACS2 score greater than 100.

Motif accumulation
Any motif loci that overlap or are within 10 bp of one another were counted and merged by bedtools merge (version 2.28.0) [30]. Because the merged motif loci with a high number of motif accumulations appear to be longer genomic loci, every motif locus was transformed to a uniform 200 bp length to remove the sequence length dependency when counting the overlapping motif sites and ChIP-seq bindings.
Overlapping motif loci and ChIP-seq data Any motif loci that overlapping with ChIP-seq data (TFs, H3K27ac, and H3K4me1 from the ChIP-Atlas21) were counted using bedtools intersect (version 2.28.0) [30]. In terms of TFs, the overlap of motif loci and TF binding sites that belong to the same TF family according to TFClass [16] was counted.
To visualize the expression by the Integrative Genomics Viewer [34], the BAM files were converted to BigWig files by deepTools (version 3.2.0) [35] with CPM normalization, and the bin size was set to 10. Peak calling was conducted by MACS2 (version 2.1.1) [36] with the q-value < 0.01.

RNA-seq
Data on human glioblastoma cells were obtained from NCBI SRA (accession number: GSE87615, GSE87617) [27]. The FASTQ files were mapped to the hg19 reference genome by STAR (version 2.7.1a) [37] with default parameters. The output SAM files were converted and sorted to BAM files using samtools (version 1.9) [33]. To visualize the expression with the Integrative Genomics Viewer [34], the BAM files were converted to BigWig files by deepTools (version 3.2.0) [35] with CPM normalization, and the bin size was set to 10. Differentially expressed genes were identified by DESeq2 (version 1.24.0) [38] with an adjusted p-value < 0.01 and log 2 fold change > 1. Summary of the number of motif loci, the percentage from raw data, and the gzip compressed file size during procedures. Figure 1 Example genomic viewing of DNA-binding motif loci from raw data to conserved motif loci (A) An example of a genomic locus illustrating the NANOG promoter region in the human (hg19) and mouse (mm10) genomes. All JASPAR motif sites (blue) were filtered by a score greater than 400 (green). Then, the motif sites that belonged to the same TF family were merged (yellow). The conserved motif loci (red) were extracted by aligning the sequence between hg19 and mm10.