Towards a comprehensive regulatory map of Mammalian Genomes

doi:10.21203/rs.3.rs-3294408/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Genome mapping studies have generated a nearly complete collection of genes for the human genome, but we still lack an equivalently vetted inventory of human regulatory sequences. Cis-regulatory modules (CRMs) play important roles in controlling when, where, and how much a gene is expressed. We developed a training data-free CRM-prediction algorithm, the Mammalian Regulatory MOdule Detector (MrMOD), for accurate CRM prediction in mammalian genomes. MrMOD provides genome position-fixed CRM models, similar to the fixed gene models for the mouse and human genomes, using only genomic sequences as inputs and one adjustable parameter, the significance p-value. Importantly, MrMOD predicts a comprehensive set of high-resolution CRMs in the mouse and human genomes, including all types of regulatory modules and not limited to any tissue, cell type, developmental stage, or condition. We computationally validated MrMOD predictions using a compendium of 21 orthogonal experimental data sets, including thousands of experimentally defined CRMs and millions of putative regulatory elements derived from hundreds of different tissues, cell types, and stimulus conditions obtained from multiple databases. An in ovo transgenic reporter assay demonstrates the power of our predictions in guiding experimental design. We analyzed CRMs located on chromosome 17 using unsupervised machine learning and identified groups of CRMs with multiple lines of evidence supporting their functionality, linking CRMs with upstream binding transcription factors and downstream target genes. Our work provides a comprehensive base pair-resolution annotation of the functional regulatory elements and non-functional regions in mammalian genomes.

transcription factor

position weight matrix

enhancers

Cis-regulatory modules

epigenomics

machine learning

Transcriptional regulation ensures that every gene is transcribed at the right time, in the right cells, and in the right amount¹. Cis-regulatory modules (CRMs) are defined as DNA sequences with transcription factor binding sites (TFBSs) clustered into modular structures to regulate spatiotemporal gene expression² and include enhancers, promoters, locus control regions, silencers, and other modulators. CRM disruption has been implicated as a disease-driving mechanism for many diseases such as cancer^3,4 and neurological disorders⁵. Despite their clear importance to both basic and disease biology, we still lack a complete understanding about the repertoire of human enhancers, including where they reside, how they work, and what genes they mediate their effects through^6–8.

CRM discovery methods can be broadly classified into two categories: empirical and computational approaches, although in reality the two are often intertwined¹. Classically, CRMs were defined through functional assays—primarily reporter gene assays—that demonstrate the ability of a given sequence fragment to affect transcription. Those experimentally defined CRMs (ExpCRMs) still serve as gold standards of functional CRMs. However, these approaches are low-throughput, expensive, and time-consuming^1,9. VISTA Enhancer Browser is the only database that has a collection of mammalian enhancers experimentally defined in transgenic mice^10,11 with 673 and 998 from the mouse and human genomes, respectively. Recently, high-throughput empirical approaches such as chromatin immunoprecipitation coupled to sequencing (ChIP-seq)^12,13, DNase I hypersensitive sites (DHS) sequencing (DNase-seq), transposase-accessible chromatin using sequencing (ATAC-seq)^14,15, STARR-seq¹⁶, and massively parallel reporter assay (MPRA)¹⁷, have greatly facilitated candidate enhancer identification on a genome-wide scale⁴. Despite the enormous amount of data generated by these methods however, empirical methods only identify subsets of potential active enhancers in the genome because it is currently not possible to sample every cell type, every developmental stage, or every environmental stimulus due to limited availability of reagents, starting materials, and technologies⁷. DNA fragments defined by these methods are “putative enhancers” and should be validated through gold-standard experimental approaches to provide additional evidence of function^8,13. Additionally, because no reference CRM models analogous to gene models exist, peak calling is required in epigenomic data analyses to first define the genomic regions with the epigenomic signal of interest. 
It is common that different peaks or peaks with different positions and lengths are called in biological replicates and the number of peaks can vary hugely depending on the program used for peak calling¹⁸. This makes signal comparison across biological replicates and conditions much more challenging than comparing gene expression using the RNA-seq method, which has a well-annotated set of gene models as references¹⁹. Computational methods for CRM discovery can be broadly classified into three main categories depending on the types of input data they require. 1) The comparative genomics approach depends on identifying conserved non-coding DNA sequence regions across species under the assumption that functionally important genomic sequences are under more evolutionary constraint than sequences with less-vital functions. However, not all CRMs are conserved and this type of method fails to discover newly evolved regulatory modules^1,7. 2) Motif-based methods (reviewed in^1,7) rely on the identification of clusters of TFBSs, requiring prior knowledge of the constituent motifs, to predict new CRMs with similar features. 3) “Motif-blind” approaches are based on properties of the input sequences such as known CRMs or epigenomic datasets to predict functional CRMs (reviewed in ²⁰). These categories are not mutually exclusive, and methods combining multiple approaches often perform strongly. However, most of them require a set of known CRMs, TF binding information, or epigenomic data as training data^1,7,20. Due to the limited knowledge of known TFs and functional CRMs, and the limitation of currently available epigenomic data, none of the current methods can detect all types of CRMs for all cell types, developmental stages, and conditions.

To overcome the limitations of the current computational CRM discovery methods, we developed the Mammalian Regulatory MOdule Detector (MrMOD) to accurately predict CRMs in mammalian genomes. MrMOD started with the mouse, rat, and human genomes and has several unique features compared to current methods. 1) It requires no training data; 2) It has only one parameter to adjust: p-value < 0.01, one-tailed; 3) It predicts a variety of currently known types of regulatory modules; 4) It provides a comprehensive set of high-resolution CRMs (average size < 220 bp) not limited to any tissue, cell type, developmental stage, or stimulus condition; 5) The predicted CRMs are final, providing a fixed position for every CRM (refCRM) in the genome, similar to the fixed positions for every gene in the genome. We used multiple sources of orthogonal data to evaluate predicted CRMs, including thousands of ExpCRMs collected from the literature and VISTA Enhancer Browser, 19 epigenomic data sets that represent putative regulatory elements from Enhancer Atlas²¹, ENCODE cCREs²², and Cistrome Data Browser²³, as well as single-cell ATAC-seq data (scATAC-seq) from the mouse brain atlas²⁴ and human genome²⁵ derived from hundreds of cell types, tissues, developmental stages, and conditions. Comparing a subset of the predicted functional CRMs with genomic regions that had the exact same length distribution and genomic coverage but were predicted to be non-functional yielded high sensitivity and odds ratios, demonstrating the accurate prediction of functional regulatory sequences. Experimental dissection of a large enhancer from the VISTA database guided by our predicted CRMs delineated smaller enhancers with more restricted spatiotemporal expression. To annotate the functionality of predicted CRMs, we scanned the mouse genome for TFBSs using the position weight matrices (PWMs, which model the binding specificity of TFs)²⁶ in the Catalog of Inferred Sequence Binding Preferences (CIS-BP) database²⁷. Unsupervised machine learning defined groups of CRMs with similar motif compositions, the TFs that interact with the CRMs, and the putative target genes regulated by the set of TF-CRMs. We provide one example CRM group from mouse chromosome 17 with multiple lines of evidence supporting its role in the regulation of gene expression in cancer, neurodevelopmental, and neurodegenerative disorders. We provide access to the predicted CRMs and the experimental supporting evidence across the human and mouse genomes as part of the WashU Epigenome Browser.

Genome-wide TFBS prediction and validation in the mammalian genomes

The workflow of whole-genome-wide CRM prediction includes two steps (Fig. 1A). First, we defined a set of non-redundant conserved TFBS motifs in the mammalian genomes by applying PhyloNet²⁸ to the genomic sequences of mouse, rat, and human using a published workflow²⁹. Second, we applied the algorithm implemented in CERMOD²⁹ to the genomic sequences of human and mouse, using the set of mammalian TFBS motifs obtained from the first step, to predict mammalian CRMs. PhyloNet systematically identifies phylogenetically conserved motifs that also occur multiple times throughout the genome to define a network of regulatory sites for a given organism²⁸. A final set of 5,143 unique PWMs (p-value < 10^-10) with an average length of 18 bp (ranging from 5 to 30 bases) was obtained after consolidation of redundant motifs as described²⁹. Each predicted PWM is associated with a set of genes that are potentially regulated by this motif.

Next, we performed multiple analyses to assess the functionality of the predicted PWMs. 1) We compared the predicted PWMs to known TF PWMs in the TRANSFAC³⁰ and the CIS-BP databases²⁷ using the average log likelihood ratio (ALLR) statistic³¹ and OLAP score as described²⁹ (Table 1). Out of the 5,143 predicted PWMs, 836 (16.3%) and 2,203 (42.8%) shared similarity with at least one PWM from the TRANSFAC and CIS-BP database, respectively. 2) We tested whether each PWM-associated gene set was enriched for TF targets determined in the ChIP-Atlas database³², which has a collection of 2,540 ChIP-seq datasets including 723 mouse TFs in 369 different cell types and tissues. A total of 637 (12.4%) PWM-associated gene sets were significantly enriched for known TF targets in at least one condition (hypergeometric test, false discovery rate (FDR) corrected p-value < 0.05). 3) We tested whether the set of PWM-associated genes was enriched for specific classes of genes using Gene Ontology (GO) and KEGG pathway enrichment analyses. We found that 2,678 (52%) PWMs had significant enrichment for at least one GO term and 614 (12%) PWMs had significant enrichment for at least one KEGG pathway. We observed multiple lines of evidence from different categories consistent with the known functions of corresponding TFs which provides strong support that the majority of the PhyloNet-discovered PWMs represent the DNA-binding specificity for the corresponding TFs (Fig. 1B, Table 1).
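The enrichment test in step 2 can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test followed by Benjamini-Hochberg correction; the text does not name the exact FDR procedure, so that choice, the function names, and the toy numbers are illustrative assumptions rather than the authors' implementation.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): probability of seeing at least k TF targets when a
    PWM-associated gene set of size N is drawn from M genes, n of which
    are known targets (one-sided hypergeometric test)."""
    upper = min(n, N)
    total = comb(M, N)
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order.
    (Assumed FDR procedure; the source only says 'FDR corrected'.)"""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy example: 20,000 genes, 500 known targets of a TF,
# and three hypothetical PWM gene sets of 100 genes each.
raw = [hypergeom_sf(k, 20000, 500, 100) for k in (10, 4, 2)]
adj = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adj]
```

A gene set would count as enriched for a TF's targets when its adjusted p-value falls below 0.05, matching the threshold quoted above.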

Whole-genome-wide CRM prediction

Next, we developed MrMOD to perform whole-genome-wide CRM predictions (referred to as “CRMfull” herein) for mouse and human genomes using the CERMOD²⁹ algorithm but with the set of 5,143 unique mammalian PWMs obtained from the first step as the input. MrMOD predicted a total of 5,549,442 CRMs (44.1% genome coverage) and 6,070,455 CRMs (42% genome coverage) with an average size of 216 and 213 bp for the mouse and human genomes, respectively (Fig. 1C). The length ranged from 63 to 19,743 bp, but the large majority of CRMs were smaller than 1 kb in both genomes (99.7% and 99.6%, Fig. 1D). The majority of the predicted CRMs are located within distal intergenic regions or intronic regions, with about 10% located within 3 kb of a promoter for both human and mouse genomes (Fig. 1F, 1G), similar to that observed for the human putative regulatory elements defined using DNase I hypersensitive sites (DHSs)¹⁴.

Obtaining control regions for the evaluation of CRM predictions

To systematically evaluate the accuracy of predicted CRMs we need to obtain a set of control regions that are not enriched for regulatory functions. However, there is no collection of experimentally defined non-functional regions. Therefore, we devised a scheme to obtain a set of control regions that have the exact same length distribution and genomic coverages as a subset of CRMs from each genome but are predicted to be non-functional for a fair comparison (Fig. 2A, details in the Materials and Methods section). First, we obtained regions not covered by any predicted CRMs, which correspond to regions predicted to be non-functional. Because it is difficult to predict the exact boundaries of CRMs, small DNA fragments representing regions located between two predicted CRMs were removed, as they could be part of the nearby functional CRMs. The predicted non-functional regions did not have sufficient space to generate the exact length distribution of the full-predicted CRM set. Therefore, we gradually eliminated the longest CRMs to obtain a predicted CRM subset (CRMsub) such that we can generate a set of control regions with the same number and length distribution located within the predicted non-functional regions (referred to as CTRLs) for performance evaluation. The CRMsub and the CTRLs had the exact same genomic coverage and distributions relative to genomic features for both genomes (Fig. 2B, 2C), with 21.8% and 20.8% of genomic coverage for mouse and human, respectively. Therefore, the CTRL regions provided good data sets for a fair performance evaluation.
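The control-region scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the Materials and Methods: intervals are half-open `(start, end)` pairs on a single chromosome, the `min_gap` threshold for discarding small inter-CRM fragments is a hypothetical value, and controls are placed first-fit rather than at random positions.

```python
def gaps_between(crms, chrom_len, min_gap):
    """Regions not covered by any predicted CRM, dropping fragments
    shorter than min_gap (hypothetical threshold)."""
    gaps, prev_end = [], 0
    for start, end in sorted(crms):
        if start - prev_end >= min_gap:
            gaps.append((prev_end, start))
        prev_end = max(prev_end, end)
    if chrom_len - prev_end >= min_gap:
        gaps.append((prev_end, chrom_len))
    return gaps

def place_controls(lengths, gaps):
    """Place one control region per requested length inside the gaps
    (first-fit, longest first). Returns None if any length cannot fit."""
    free = [list(g) for g in gaps]  # mutable [cursor, end] per gap
    placed = []
    for length in sorted(lengths, reverse=True):
        for g in free:
            if g[1] - g[0] >= length:
                placed.append((g[0], g[0] + length))
                g[0] += length
                break
        else:
            return None
    return placed

def matched_subset(crms, chrom_len, min_gap=50):
    """Drop the longest CRMs one at a time until a control set with the
    same number and length distribution fits in the non-CRM regions."""
    kept = sorted(crms, key=lambda r: r[1] - r[0])  # shortest first
    gaps = gaps_between(crms, chrom_len, min_gap)
    while kept:
        ctrls = place_controls([e - s for s, e in kept], gaps)
        if ctrls is not None:
            return kept, ctrls
        kept.pop()  # remove the longest remaining CRM
    return [], []

crms = [(100, 200), (300, 1300), (1400, 1450)]
crm_sub, ctrl = matched_subset(crms, chrom_len=2000)
```

In this toy case the 1,000 bp CRM has no matching non-CRM region, so it is dropped and the two shorter CRMs form the subset with two equal-length controls.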

Predicted CRMs include diverse classes of functional regulatory sequences

To evaluate the accuracy of predicted CRMs we performed an extensive literature search to identify any mouse or human genes whose genomic regions have been analyzed using in vitro or in vivo reporter assays to locate any regulatory sequences (referred to as experimentally defined CRMs, or ExpCRMs). Wherever possible, we used regions that were determined by an enhancer assay using the minimal sequence sufficient to drive reporter gene expression because it better defines the boundaries of regulatory regions that are sufficient for regulation²⁹. Only ExpCRMs smaller than 2.5 kb were included in the analyses because > 90% of the peaks in all datasets (except VISTA Enhancer Browser) are smaller than 2.5 kb. For mouse, we collected 97 ExpCRMs associated with 56 genes, including enhancers, silencers, insulators, locus control regions, and RNA post-transcriptional regulatory modules that regulate gene expression in diverse cell types, tissues, and developmental stages, covering a total of 54,624 bp with an average size of 563 bp (range 11 to 2,000 bp, Supplementary Table 1). Similarly, a total of 60 ExpCRMs associated with 43 human genes were collected, with 10 CRMs defined by mouse in vivo reporter assays and 50 by in vitro assays, covering 20,610 bp with an average size of 343 bp (range 25 to 1,515 bp, Supplementary Table 2). For our evaluation of accuracy, an ExpCRM was deemed correctly predicted if the predicted CRM and the ExpCRM overlapped by at least 50% of the length of the smaller fragment, or if they overlapped by at least 1 bp (referred to as the 50%- or 1bp-overlap cutoff herein).
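The two overlap cutoffs can be expressed as a short check. This is a minimal sketch that assumes half-open `(start, end)` coordinates on the same chromosome; the function names and the example coordinates are illustrative.

```python
def overlap_bp(a, b):
    """Number of base pairs shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def is_detected(exp_crm, predicted, cutoff="50%"):
    """True if the ExpCRM counts as correctly predicted under the chosen cutoff.

    "1bp": the two intervals share at least one base pair.
    "50%": the shared length is at least half of the shorter interval.
    """
    shared = overlap_bp(exp_crm, predicted)
    if cutoff == "1bp":
        return shared >= 1
    shorter = min(exp_crm[1] - exp_crm[0], predicted[1] - predicted[0])
    return shared >= 0.5 * shorter

# A 100 bp ExpCRM that shares 10 bp with a 210 bp predicted CRM
# passes the 1bp cutoff but not the 50% cutoff.
exp_crm, pred = (100, 200), (190, 400)
print(is_detected(exp_crm, pred, "1bp"), is_detected(exp_crm, pred, "50%"))
```

Because the 50% rule is measured against the shorter fragment, a very short ExpCRM (for example an 11 bp element) that lies entirely inside a predicted CRM satisfies both cutoffs.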

1) Accurate prediction of enhancers and silencers: We first evaluated MrMOD prediction accuracy using curated ExpCRMs that are enhancers or silencers located in the intergenic regions upstream of ATG start codons, the intronic sequence, or the 3’ UTRs.

1.1) Accurate prediction of upstream CRMs: We defined modules that were located upstream of the ATG start codon and downstream of the next nearest gene as upstream CRMs. We used ExpCRMs that were located within 50 kb upstream of the ATG in both genomes (69 mouse and 37 human ExpCRMs) and compared them to the predicted CRMs located within the same genomic ranges. Figure 3A shows an example comparing predicted CRMs with ExpCRMs located upstream of the Nanog homeobox gene (Nanog) and Spi1 proto-oncogene (Spi1) in the mouse genome. The Nanog enhancer drives gene expression in early mesoderm-specified progenitors³³. Spi1 has two CRMs previously reported in the literature, with one driving reporter gene expression in transgenic mice at embryonic sites of the endothelial-to-hematopoietic transition and the other driving non-hematopoietic gene expression³⁴. All three ExpCRMs were correctly predicted without overlapping with the CTRL regions. In total, we have 78 mouse ExpCRMs that are upstream of the ATG, not restricted to those within 50 kb of the ATG. A total of 71 out of 78 ExpCRMs (91%) were correctly predicted whereas only 19 (24.3%) overlapped with the CTRL regions at the 50%-overlap cutoff. Because CRMfull has higher genomic coverage than the CTRL regions, this comparison is not suitable for statistical testing. Therefore, we performed simulations by randomly shuffling the positions of predicted CRMs within each sequence to estimate the statistical significance of obtaining the given sensitivity. Using mouse ExpCRMs within 50 kb upstream of the ATG, the average sensitivity of simulated CRMs is 75.4% with a standard deviation of 4.4 for 10,000 simulations (Fig. 3C), demonstrating high sensitivity (88.4%, p-value < 0.01) in MrMOD predictions for mouse CRMs located within 50 kb of the ATG (upstream regions). Similarly, all three enhancers upstream of the human GATA binding protein 2 (GATA2)³⁵ and retinoschisin 1 (RS1)³⁶ were correctly predicted without overlapping with the CTRL regions (Fig. 3B). A total of 29 (78.3%) out of the 37 ExpCRMs overlapped with CRMfull whereas only 6 (16.2%) overlapped with the CTRL regions. Simulated CRMs had an average sensitivity of 60.1% with a standard deviation of 7.7 (Fig. 3D, p-value < 0.01 for a sensitivity of 78.3%). We cannot calculate the positive predictive value because the majority of the predicted modules are located within DNA sequences that have not been tested, or within DNA fragments that drive reporter gene expression in additional tissues for which location information was not reported³⁷.

1.2) Accurate prediction of intronic CRMs: Fig. 3E demonstrates that both the mouse c-fos³⁸ and the protein phosphatase 1 regulatory inhibitor subunit 1B (Ppp1r1b) intronic enhancers (11 and 23 bp, respectively)³⁹ overlapped with predicted CRMs but not the CTRL regions. In total, 6 out of 7 (85.7%) intronic mouse ExpCRMs were correctly predicted and only one overlapped with the CTRL regions. We collected two human intronic ExpCRMs and both overlapped with predicted CRMs but not the CTRL regions at the 50%-overlap cutoff (Fig. 3F).

1.3) Accurate prediction of CRMs located in the 3’ UTR regions: We collected two mouse enhancers located in the 3’ UTR of Fgf4 (64 bp)⁴⁰ and Col2a1 (543 bp)⁴¹. Both were correctly predicted without overlapping with the CTRL regions at either cutoff (Fig. 3G). The two curated human ExpCRMs located in the 3’ UTR of COL2A1⁴¹ (537 bp) and DDB2 (49 bp)⁴² were also correctly predicted (Fig. 3H).
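The shuffling simulation used for the upstream CRMs can be sketched as a permutation-style test. The sketch below makes several simplifying assumptions that are not stated in the text: simulated CRMs keep their lengths but may overlap one another, a hit is any overlap of at least 1 bp, and the empirical p-value uses the common add-one estimator. Coordinates are half-open and relative to a single sequence of length `seq_len`.

```python
import random

def overlaps(a, b):
    """True if two half-open intervals share at least 1 bp."""
    return min(a[1], b[1]) > max(a[0], b[0])

def sensitivity(exp_crms, predicted):
    """Fraction of ExpCRMs overlapped by at least one predicted CRM."""
    hits = sum(any(overlaps(e, p) for p in predicted) for e in exp_crms)
    return hits / len(exp_crms)

def shuffle_positions(predicted, seq_len, rng):
    """Move every predicted CRM to a uniformly random start, keeping its length.
    Simplification: shuffled CRMs are allowed to overlap each other."""
    shuffled = []
    for start, end in predicted:
        length = end - start
        new_start = rng.randint(0, seq_len - length)
        shuffled.append((new_start, new_start + length))
    return shuffled

def empirical_p(exp_crms, predicted, seq_len, n_sim=1000, seed=0):
    """Observed sensitivity and the share of simulations that match or beat it."""
    rng = random.Random(seed)
    observed = sensitivity(exp_crms, predicted)
    as_extreme = sum(
        sensitivity(exp_crms, shuffle_positions(predicted, seq_len, rng)) >= observed
        for _ in range(n_sim)
    )
    return observed, (as_extreme + 1) / (n_sim + 1)

# Toy 50 kb upstream region with two ExpCRMs and three predicted CRMs.
exp_crms = [(1000, 1100), (30000, 30400)]
predicted = [(950, 1200), (30100, 30300), (42000, 42200)]
obs, p = empirical_p(exp_crms, predicted, seq_len=50000)
```

The mean and standard deviation of the simulated sensitivities, as reported above for 10,000 simulations, could be collected from the same loop.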

2) Accurate prediction of putative post-transcriptional regulatory elements

Post-transcriptional gene regulation controls the amount of proteins produced from an mRNA by altering rates of decay and translation. A 37 bp miRNA recognition element (MRE) of miR-375 in mouse Pax6 3’ UTR was previously reported to be active in αTC1–6 cells⁴³ and a 43 bp miR-24/miR-30a binding site in the mouse Per2 3’ UTR was reported to be functional in NIH 3T3 cells⁴⁴. Both post-transcriptional regulatory elements overlapped with predicted CRMs at 1bp-overlap cutoff but not with CTRL regions (Fig. 4A). We collected 11 human elements validated functionally using a luciferase reporter assay⁴⁵. All (100%) overlapped at least 1 bp with our predicted CRMs and only two of them also overlapped with the adjacent CTRL regions (Fig. 4B). These results demonstrate that our method can identify post-transcriptional regulatory elements with high sensitivity and accuracy. Because miRNA sites are not included in the sequences used for PWM identification, our predicted CRMs may also represent novel transcriptional elements that happen to overlap with miRNA sites as comparative genomic studies suggest that many additional novel regulatory elements may exist in the mammalian 3’ UTRs⁴⁵.

3) Accurate prediction of silencers

Silencer elements are undercharacterized in mammalian genomes and can be hard to predict. We therefore evaluated the sensitivity of our method for detecting silencers. Three silencers (771–979 bp) that repress the activity of the Cebpa proximal promoter in mouse G1ME cells⁴⁶ were all correctly predicted at both cutoffs without overlapping with CTRL regions (Fig. 4C). Of five human silencers (301 bp) that were previously detected in K562 cells⁴⁷, MrMOD correctly predicted four (80%) at the 1bp-overlap cutoff, with three of them also overlapping with adjacent CTRL regions (Fig. 4D).

4) Accurate prediction of locus control regions (LCRs): Two mouse LCRs at the β-globin loci⁴⁸ were both correctly predicted (Fig. 4E) although one also overlapped with the nearby CTRL regions. We collected three human LCRs: two (both 72 bp) that demonstrated enhancer-blocking activity⁴⁸ and one upstream of OPN1LW (592 bp) essential for the expression of cone pigment genes⁴⁹. Two were correctly predicted at 1bp-overlap cutoff and none of them overlapped with CTRL regions (Fig. 4F).

In summary, our predicted CRMs detected diverse classes of functional regulatory elements currently known in the human and mouse genomes with high sensitivity. Many of the ExpCRMs that overlapped with predicted non-functional CTRL regions also overlapped with predicted CRMs. For example, of the 22 mouse ExpCRMs that overlapped with CTRL regions at 50%-overlap cutoff, 19 (86.3%) also overlapped with nearby predicted CRMs. This is largely due to the large size of some ExpCRMs, which may include both functional and non-functional sequences. Further analyses are needed to identify the minimal functional sequences, which will improve accuracy of performance evaluation.

Statistical evaluation of MrMOD predictions using multiple orthogonal datasets

To perform statistical evaluation of MrMOD predictions we used a compendium of experimentally defined enhancers and large-scale epigenomic datasets representing putative regulatory elements downloaded from multiple databases including VISTA Enhancer Browser¹⁰, ENCODE cCREs²², the Enhancer Atlas 2.0²¹, Cistrome Data Browser²³, ATAC-seq Atlas⁵⁰, and scATAC-seq data for mouse brain and 45 adult/fetal human tissue types^24,25 (Supplementary Tables 3 and 4). Peak overlap between each dataset and CRMsub or CTRL regions was determined separately and peak-detection sensitivity and odds ratios (OR) were calculated comparing CRMsub to CTRL. Below is a description of all the datasets (11 for mouse and 10 for human) used in the analyses.

A total of 157 literature-curated mouse (97) and human (60) ExpCRMs driving gene expression in diverse tissues, cell types, and developmental stages.

A total of 1,632 in vivo validated functional enhancer elements for mouse (634) and human (998) from VISTA Enhancer Browser.

Enhancer Atlas 2.0 contains a total of 530,094 and 201,032 putative enhancers for mouse and human, respectively, annotated in 534 different tissues/cell types²¹.

DHS peaks: DHS regions contain a variety of regulatory elements including enhancers, silencers, insulators, and locus control regions²². Mouse (1,242,162 peaks) and human (2,401,388 peaks) DNase-seq data were downloaded from Cistrome database.

Bulk ATAC-seq peaks: The Cistrome database contained 1,124,888 mouse ATAC-seq peaks derived from 1,937 samples and 46 different tissues and cell types. The human ATAC-seq data contained 1,217,002 peaks from 1,059 samples and 25 different tissues and cell types. Additionally, the Mouse ATAC-seq Atlas database has a total of 296,390 peaks derived from 66 ATAC-seq profiles from 20 primary tissues of adult mice.

ChIP-seq data of the histone marks H3K4me1 and H3K27ac represent putative active enhancers. Mouse H3K4me1 (1,095,988 peaks), H3K27ac (1,400,986), and human H3K4me1 (1,565,782) and H3K27ac (1,697,483) ChIP-seq data were downloaded from the Cistrome database.

ChIP-seq data for H3K4me3, which represents putative promoters, were downloaded from the Cistrome database with 1,120,149 peaks for mouse and 2,081,430 peaks for human.

scATAC-seq detects cell-type-specific open chromatin regions potentially representing cell-type-specific regulatory elements functioning in hundreds of cell types from different tissues and developmental stages. Mouse scATAC-seq brain atlas identified 491,818 elements from 45 brain regions and 160 cell types in adult mouse cerebrum²⁴. Human scATAC-seq data provided ~ 1.2 million elements from 45 adult/fetal tissues and 222 cell types²⁵.

ENCODE cCREs are candidate cis-Regulatory Elements (cCREs) obtained through the integration of ENCODE epigenomics data, providing 368,121 cCREs in 169 tissues/cell types and 1,063,878 cCREs in 1,518 tissues/cell types for mouse and human genomes, respectively²².

We first used all collected ExpCRMs to evaluate CRMsub and CTRL regions. For the 97 mouse ExpCRMs, CRMsub has a detection sensitivity of 79.3% (77/97) whereas the CTRL regions have a detection sensitivity of 26.8% (26/97), resulting in an OR of 10.51 (p-value < 2.5e-13) at the 1bp-overlap cutoff (Fig. 4G, Table 2). At the more stringent 50%-overlap cutoff, fewer ExpCRMs overlapped with CRMsub (71.1%) and CTRLs (22.6%), but the OR remained high (8.40, p-value < 1.5e-11, Table 2, Fig. 4H). For the 60 human ExpCRMs, CRMsub has a detection sensitivity of 56.6% (34/60) whereas the CTRL regions have a detection sensitivity of 20% (12/60), giving an OR of 5.23 (p-value < 3.9e-05, Fig. 4G, Table 3) at the 1bp-overlap cutoff. The more stringent 50%-overlap cutoff gave similar results (OR = 4.67, p-value < 2.2e-04, Fig. 4H). With the 1bp-overlap cutoff, the full set of predicted CRMs (CRMfull) has a detection sensitivity of 93.8% (91/97) and 90% (54/60) for the mouse and human datasets, respectively (Tables 2 and 3), demonstrating highly sensitive and accurate CRM predictions made by MrMOD.
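For reference, the sensitivities and odds ratios quoted for the ExpCRM comparison can be reproduced from the hit counts. The sketch assumes the OR is the ratio of detection odds for CRMsub versus CTRL over the same set of ExpCRMs; under that assumption the reported counts give the reported ORs. The function name is illustrative, and the significance test is not reproduced because the text does not name it here.

```python
def sensitivity_and_odds_ratio(hits_crm, hits_ctrl, total):
    """Detection sensitivity for CRMsub and CTRL, plus the odds ratio
    (odds of detection by CRMsub divided by odds of detection by CTRL)."""
    sens_crm = hits_crm / total
    sens_ctrl = hits_ctrl / total
    odds_crm = hits_crm / (total - hits_crm)
    odds_ctrl = hits_ctrl / (total - hits_ctrl)
    return sens_crm, sens_ctrl, odds_crm / odds_ctrl

# Mouse ExpCRMs at the 1bp-overlap cutoff: 77/97 for CRMsub, 26/97 for CTRL.
mouse = sensitivity_and_odds_ratio(77, 26, 97)
# Human ExpCRMs at the 1bp-overlap cutoff: 34/60 for CRMsub, 12/60 for CTRL.
human = sensitivity_and_odds_ratio(34, 12, 60)

print(f"mouse OR = {mouse[2]:.2f}")  # 10.51
print(f"human OR = {human[2]:.2f}")  # 5.23
```

The same function applies to the epigenomic peak sets, where `total` is the number of peaks in the dataset and the hit counts are the peaks overlapped by CRMsub or CTRL.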

Next, we used functional enhancers from the VISTA Enhancer Browser database to evaluate our predicted CRMs. Although VISTA enhancers are also experimentally defined modules, they are much bigger than our curated ExpCRMs (average size = 2,461 vs. 541 bp for mouse) (Supplementary Table 3). Many VISTA enhancers drive reporter gene expression in multiple tissues or brain regions without cell-type specificity information. Therefore, DNA fragments in the VISTA enhancer database likely contain multiple enhancers as well as non-functional sequences between those enhancers, which could be delineated through more detailed analysis. Given these limitations, a high percentage of VISTA enhancers overlapped with CTRL regions (68.5% and 61.2% at the 50%-overlap cutoff and 71.9% and 62.6% at the 1bp-overlap cutoff for mouse and human, respectively). However, a much higher percentage overlapped with CRMsub (92.8% and 92.2% at the 50%-overlap cutoff and 95% and 94.5% at the 1bp-overlap cutoff for mouse and human, respectively), resulting in high ORs for all comparisons (5.94–10.24, p-value < 3.7e-15, Fig. 4G-H, Tables 2 and 3). If we use 1bp-overlap as the cutoff for correct prediction, our full prediction (CRMfull) has a sensitivity of 100% and 99.5% for the mouse and human datasets, respectively. Again, these results demonstrate highly sensitive and accurate CRM prediction made by MrMOD.

Lastly, we used multiple epigenomic profiling datasets to evaluate our predicted CRMs. For the mouse genome, there were 9 datasets containing 296,390–1,400,986 total peaks per dataset with an average size ranging from 276 bp to 947 bp (Supplementary Table 3). When comparing the overlaps between the collected peaks and our mouse predicted CRMsub and CTRL regions, CRMsub always had a much higher sensitivity (range 39.9%-82.3%) than did the CTRL regions (range 31%-50.3%), with ORs ranging from 1.31 to 4.58 (p-value < 2.2e-16) at either cutoff (Fig. 4G-H, Table 2), highlighting the significant enrichment of functional elements in our predicted CRMsub compared to the CTRL regions. We obtained similar results using all epigenomic datasets to evaluate human predicted CRMs. There were 8 datasets containing 201,032–2,401,388 total peaks per dataset with an average size ranging from 273 bp to 952 bp (Supplementary Table 4). Our human predicted CRMsub had a much higher sensitivity (range 35.8%-74.2%) than did the CTRL regions (range 36.4%-49.8%) at either cutoff. All ORs were above 1.21 (Fig. 4G-H, Table 3) except for the ENCODE cCREs dataset, which has an OR of 0.97 at the 50%-overlap cutoff but a much higher OR (1.57) at the 1bp-overlap cutoff. This is likely due to the high resolution of both the ENCODE cCREs (average length of 276 bp and 273 bp for mouse and human, respectively) and our predicted CRMs (216 bp for mouse and 213 bp for human).

We also evaluated MrMOD prediction using an alternative scheme of obtaining control regions (Supplementary Fig. 1A-C). In this method, the control regions (predicted non-functional regions) were longer and had slightly higher genomic coverage than the predicted CRM subset to give them an advantage in comparisons. However, we obtained results similar to those described above, demonstrating high prediction accuracy and consistent performance (Supplementary Fig. 1D-E, Supplementary Tables 5 and 6). Notably, our prediction had the highest ORs (5.91–16.32) compared to control regions when evaluated using curated ExpCRMs and VISTA enhancers, the “gold standard” functional regulatory sequences, no matter what cutoff or control regions were used. In summary, the consistency in prediction sensitivity and OR comparing CRMsub with the CTRL regions for both mouse and human genomes across all datasets, representing diverse regulatory elements obtained from thousands of samples and derived from hundreds of different tissues and cell types at different developmental stages and conditions, suggests that our predicted CRMs represent all types of regulatory elements tested here, unlimited by tissue, cell type, developmental stage, or stimulus.

Unsupervised machine learning for CRM functional annotation

Next, we sought to understand the regulatory code encoded in the CRMs by determining the tissue, cell type, developmental stage, or stimulus response-specificity of each CRM. We first scanned the mouse genome for TFBSs using PWMs of known TFs from the CIS-BP database. We obtained TFBS abundance in each CRM for every TF in the CIS-BP database and performed unsupervised clustering using the single-cell genomics tool Seurat⁵¹. Because the size of the dataset and the limits of our computational infrastructure prevented us from creating the Seurat object for every chromosome, we narrowed our analysis to the smaller mouse chromosome 17 as a test case. We obtained a total of 43 CRM clusters (Fig. 5A). Each CRM cluster represents a set of CRMs with similar TFBS compositions. Cluster marker genes represent TFs whose binding sites are enriched in the CRM cluster. Genes associated with the CRMs in the cluster by proximity (distance to TSS) represent the putative target genes. Therefore, each CRM cluster linked putative TFs with their target genes through component TFBSs located within the CRMs. Most of the clusters included CRMs that were located in the distal intergenic and intronic regions, with the exception of cluster #17 (referred to as C17 herein), which was enriched for CRMs located in the promoter regions (Fig. 5B).
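The CRM-by-TF abundance table that feeds the clustering can be sketched as below. This is a simplified illustration, not the scanning procedure used in the study: it scores only the forward strand, uses a uniform background, and applies a single hypothetical score threshold to every PWM. The motif and sequences are toy data.

```python
from math import log2

def log_odds_pwm(prob_rows, background=0.25, pseudo=0.01):
    """Convert per-position base probabilities into log-odds scores
    against a uniform background, with a small pseudocount."""
    return [
        {b: log2(((row.get(b, 0.0) + pseudo) / (1 + 4 * pseudo)) / background)
         for b in "ACGT"}
        for row in prob_rows
    ]

def count_sites(seq, pwm, threshold):
    """Count windows whose summed log-odds score reaches the threshold.
    Assumes an uppercase A/C/G/T sequence; forward strand only."""
    width = len(pwm)
    count = 0
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            count += 1
    return count

def count_matrix(crm_seqs, pwms, threshold):
    """CRM x TF table of predicted binding-site counts."""
    return {
        crm: {tf: count_sites(seq, pwm, threshold) for tf, pwm in pwms.items()}
        for crm, seq in crm_seqs.items()
    }

# Toy motif that strongly prefers "TGA", and two toy CRM sequences.
pwms = {"toyTF": log_odds_pwm([{"T": 1.0}, {"G": 1.0}, {"A": 1.0}])}
crm_seqs = {"crm1": "TGACCTGA", "crm2": "AAAAAAAA"}
table = count_matrix(crm_seqs, pwms, threshold=5.0)
```

In a count table of this shape, CRMs play the role that cells play in a single-cell expression matrix and TFs play the role of genes, which is what allows a single-cell clustering tool to be applied to it.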

To investigate the functionalities of the clusters of CRMs, we performed pathway enrichment analyses on the CRM-associated genes of the 43 clusters. Most clusters were enriched for some cellular component pathways, with C17-associated genes enriched for multiple pathways involved in synapse and axon components (Fig. 5C, Supplementary Table 7). Many clusters were uniquely enriched for different Biological Process pathways (Supplementary Fig. 2A), suggesting that they could regulate genes involved in distinct biological processes. Multiple pathways, including regulation of nervous system development, regulation of neurogenesis, learning or memory, neuron projection extension, and neuron death, were uniquely enriched in C17-associated genes (Fig. 5D), suggesting nervous system related functions. Cellular response to DNA damage stimulus and regulation of cellular response to stress were also uniquely enriched in C17-associated genes (Fig. 5D). Six clusters had significant enrichment of KEGG pathways, including C17 with multiple pathways in cancer and focal adhesion, which is important for axonal branching and synapse formation⁵² (Supplementary Fig. 2B, Supplementary Table 8). Additionally, using DisGeNET, a database of genes and variants associated with human diseases⁵³, we found that CRM-associated genes from multiple clusters, including C17, were enriched for cancer and autism spectrum disorder (ASD) (Supplementary Fig. 2C, Supplementary Table 9). Notably, C17-associated genes were uniquely enriched for multiple cancer pathways, neurodevelopmental disorders, and severe intellectual disability using the DisGeNET database in ClusterProfiler (Fig. 5E).

Next, we investigated the functions of the CRM clusters by examining the cluster marker TFs (Fig. 6A, CRM-binding TFs). Each cluster had a set of TFs that were highly enriched in the cluster, although most of the time not uniquely (Fig. 6B-D). Because of the unique association of C17 with cancer and neurodevelopmental disorders, we investigated it in further detail and uncovered multiple lines of evidence (Fig. 6C-G) connecting the CRMs with their binding TFs (C17 marker genes, Supplementary Table 10) and target genes (C17-associated genes, Supplementary Table 11), consistent with their function in tumorigenesis and neurodevelopment. First, many of the binding TFs are known cancer drivers such as Ascl1⁵⁴, Bcl6⁵⁵, Atoh1⁵⁶, Arid5⁵⁷, and multiple E2F family TFs⁵⁸ (Fig. 6B). They were significantly enriched for various cancer pathways (Fig. 6E, Supplementary Table 12) as determined by three different pathway enrichment tools using the DisGeNET database: disgenet2r, ClusterProfiler, and EnrichR. Similarly, many of the binding TFs are known regulators of neurodevelopment and neurodegenerative diseases and are significantly enriched for neurodevelopmental disorders and neurodegenerative pathways (Fig. 6E, Supplementary Table 12). For example, Foxp2⁵⁹, Bcl11a⁶⁰, Hey1⁶¹, and Tcf4⁶² (Fig. 6B) have been linked to ASD, schizophrenia, and intellectual disability, whereas astrocytic Cebpd⁶³ contributes to the progression of Alzheimer’s Disease. Additionally, several top C17 CRM-binding TFs are known to cooperatively regulate target gene expression (Fig. 6F); for example, TCF4 interacts with ASCL1 and NEUROD1⁶⁴. Furthermore, many C17 CRM-binding TFs that play important roles in neurodevelopment have been linked to glioblastoma and brain tumors, such as ASCL1⁵⁴, BCL6⁵⁵, Atoh1⁵⁶, and Arid5⁵⁷, consistent with the known link between neurodevelopmental pathways and brain tumors⁵⁶.
CRM-binding TFs were also enriched in embryonic expression and pluripotent stem cells by STRING functional enrichment analysis (Supplementary Fig. 2D), consistent with the knowledge that most ASD risk genes regulate gene expression or are involved in neuronal communication during early brain development⁶⁵. Many of the enriched pathways were shared by the CRM-binding TFs and the C17 CRM-associated genes (Supplementary Fig. 2D-E), linking the putative TF regulators with their targets.

To test whether C17-associated genes are the biological targets of the C17 CRM-binding TFs, we tested the overlap between these genes and the known TF target genes in the ChIP-Atlas database³², which are defined by mouse ChIP-seq experiments. C17-associated genes were enriched in 149 different combinations of TFs, cell lines, and tissues for 21 of the C17 CRM-binding TFs such as Ascl1, CTCF, Tfap2c, and Neurod1 (Fig. 6G, Supplementary Table 13). Multiple sets of CTCF targets were enriched for the C17-associated genes, including targets determined in the brain and cerebellum. CTCF dosage deficiencies have been linked to developmental delay and intellectual disability⁶⁶. The enrichment of Tfap2c targets in mammary tumor and Maz targets in murine erythroleukemia cells suggests that the CRM-associated genes represent biologically relevant targets of the CRM-binding TFs. Taken together, these results provide strong evidence that CRMs in C17 link CRM-binding TFs with their targets in regulating cancer, neurodevelopmental and neurodegenerative diseases.

Using CRM to guide experimental design

Many VISTA enhancers drive reporter gene expression in multiple tissues or brain regions and therefore likely contain multiple enhancers as well as non-functional sequences between those enhancers. We used our predicted CRMs to guide experimental design to delineate enhancers that drive more restricted expression patterns through more detailed analysis. To validate the predicted CRM function compared to the original VISTA enhancer, we tested their ability to drive reporter gene expression in chick embryo neural crest cells. During embryonic development, every cell differentiates and becomes specialized to assemble an organ or a physiological system. Therefore, tracing the fates of specific cells provides important insight for monitoring organogenesis and physiological and pathological processes. Despite recent success in cataloging the gene expression profiles of distinct cell subpopulations, there is still limited ability to specifically access subpopulations to study their function. Neural crest cells serve as a useful model for cell differentiation because they are progenitor cells, located at the embryonic dorsal neural tube, that differentiate into many derivatives⁶⁷. These derivatives range from peripheral nervous system components such as sensory and autonomic neurons, satellite glia and Schwann cells, to endocrine cells and melanocytes⁶⁸. Mouse and human enhancer elements have been shown to be active in the chick neural tube⁶⁹,⁷⁰. The accessibility of the embryonic chick neural tube, together with the evolutionary conservation of many human and mouse enhancer elements, suggested the chick embryo as an ideal model system for deciphering neural crest cell differentiation.
VISTA Enhancer #52 (hs52, 1013bp), flanking the genes Fto-It1 and Irx3 on human chromosome 16, was identified through a transgenic mouse screen of highly conserved non-coding sequences in the human genome to drive β-galactosidase reporter gene expression at embryonic day (E) 11.5 in the dorsal root ganglion (DRG) and trigeminal ganglia. Three predicted CRMs overlapped the hs52 enhancer: hs52.1 (116bp), hs52.2 (127bp) and hs52.3 (147bp) (Fig. 7A). We chose to test elements hs52.2 and hs52.3 because they also overlapped with ENCODE cCREs (Fig. 7A). Elements hs52, hs52.2 and hs52.3 were PCR amplified from human genomic DNA and cloned upstream of a Cre recombinase. To verify the specificity of the expression, the Enhancers::Cre plasmids were electroporated into stage 16 to 17 chick hemi-tube along with a Cre-dependent mCherry plasmid [pCAGG-LoxP-STOP-LoxP-mCherry]. Green fluorescent protein (GFP), driven by the general synthetic promoter RPBSA, was used as a control for efficiency of the electroporation (Fig. 7B). While the hs52 element showed wide expression in many cells along the embryo neural tube, the elements hs52.2 and hs52.3 showed more specialized expression (Fig. 7B). Transverse sections along the embryo further demonstrated that hs52 drives expression in cells inside the neural tube, as well as cells migrating outside of the tube, whereas elements hs52.2 and hs52.3 drive mCherry expression in a more specialized population of cells migrating outside of the neural tube (Fig. 7C). These results suggest that our analysis defined sub-elements of the hs52 enhancer that contribute to common expression control in an additive way. To further decipher the identity of the specified cells driven by the hs52.2 element, co-staining with the glia marker FABP7 and the neuronal marker SCG10/STMN2 was performed (Fig. 7D).
Most cells did not show co-localization with the glia or neuronal marker, therefore their fate might be endocrine or pigment cells, or they still have not completed their differentiation. Nevertheless, this experimental validation is sufficient to confirm the hypothesis that our predicted CRMs are potentially functional and can drive more restricted gene expression.

CRMs are still poorly annotated⁷¹ and their discovery has been challenging¹. We developed MrMOD, an algorithm to predict CRMs in mammalian genomes. Extensive comparison of predicted functional CRMs and predicted non-functional regions in the genome with curated ExpCRMs and large collections of epigenomic datasets demonstrated high sensitivity and accuracy. Dissection of a VISTA enhancer to define smaller enhancers that drive more cell-restricted expression with the guide of our predicted CRMs demonstrates the value of the predicted CRMs. Thus our CRM predictions will have wide applications and broad impact on transcription regulation research in evolution and development.

How many regulatory elements are there in mammalian genomes? Currently there is a wide range of estimates of the number of regulatory elements in mammalian genomes, ranging from ~ 1.47 million CRMs covering about 55% of the human genome⁷² to more than 3 million, covering 10–80% of the non-coding human genome⁷³. The ReMap 2022 database reported 2.4 million and 3.4 million non-redundant CRMs in their mouse and human regulatory atlases, respectively⁷⁴. The Enhancer Atlas 2.0 database obtained more than 700,000 predicted enhancers across mouse and human species²¹. The accessible chromatin landscape of the human genome identified approximately 2.9 million DHSs through genome-wide profiling in 125 tissues/cell types¹⁴. The mouse scATAC-seq brain atlas identified 491,818 elements from 45 brain regions and 160 cell types in adult mouse cerebrum alone²⁴. Human scATAC-seq data provided ~ 1.2 million elements from 45 adult/fetal tissues and 222 cell types²⁵. Given the limitation of how many datasets are currently available, our prediction of ~ 5.5 million (44.1% genome coverage) and 6.1 million (42% genome coverage) CRMs for the mouse and human genomes, respectively, is a reasonable upper estimate of the regulatory universe of these genomes.

It has been pointed out that the majority of putative elements identified by empirical methods are merely predictive, and biological validation is essential before assigning a definitive regulatory function to a genomic region¹. Experimentally defined CRMs remain the gold standard for functional CRMs. However, because this validation approach is low-throughput, expensive, and time-consuming, very limited data are available. VISTA Enhancer Browser, the only database that has a collection of experimentally validated mammalian enhancers, covers only ~ 0.05% of the mouse and human genomes. Our 157 literature-curated, experimentally defined CRMs, which have minimal sequence sufficient for function and are not available on VISTA, expanded the current database by 9.6%. Our computational and experimental validation of predicted CRMs demonstrates the high sensitivity and accuracy of our prediction. Despite the large number of available experimental datasets collected, we emphasize that we still do not have all the possible experimental data for every tissue, cell type, developmental stage, or stimulation condition. Therefore, our work here fills a critical gap in knowledge by providing a comprehensive base-pair resolution annotation of the functional and non-functional elements in mammalian genomes.

Our study has a potential impact on the research of transcription regulation and could revolutionize the way in which epigenomic data are analyzed. First, a crucial step in analyzing epigenomic data such as ChIP-seq, ATAC-seq, and DHS-seq data is finding peaks that correspond to targeted DNA regions. The numbers and the boundaries of peaks called in the same sample depend heavily on the specific parameter settings and the algorithm used, as well as the sequencing depth and the quality of the experiment itself⁷⁵. Different peak boundaries or presence/absence of peaks in different replicates are common problems¹⁹,⁷⁵. Many tools have been developed to define a consensus region from peaks that overlap among a set of replicates but vary in exact position, yet different tools give very different consensus peaks, with differing lengths and positions¹⁹. These differences are sufficient to influence downstream biological interpretations and lead to disparate scientific conclusions about enhancer biology and disease mechanisms⁷⁶. Our predicted CRMs could instead serve as a reference (refCRM), with reads mapped directly to those reference CRMs, similar to mapping reads to gene models in RNA-seq analysis. Data from different replicates and different conditions would then have signal information within the exact same genomic locations, enabling direct comparison across samples and conditions as well as the identification of differential CRM activity, similar to differentially expressed gene identification. Second, our work provides a valuable resource for guiding experimental design. Because our computational prediction is independent of the epigenomic data used for evaluating our predicted CRMs, the combination of these two orthogonal types of information will provide strong support for the functionality of a genomic region to guide experiments to dissect gene regulation, as demonstrated in our current study.
We provide here to the research community the genomic locations of all predicted CRMs ranked by the number of lines of supporting evidence. Furthermore, the small size (average ~ 200 bp) of predicted CRMs is ideal for MPRA assays because most studies synthesize libraries of candidate enhancers on microarrays, generally at ~ 200 bp⁷⁷,⁷⁸. Third, genome-wide association studies (GWAS) have identified many non-coding variants that are in linkage disequilibrium with the causal variant but have generally not been able to pinpoint the causal variants. Our refCRMs could help to search for likely causal variants and to elucidate molecular mechanisms that support the genetic bases of diseases and complex traits in mammalian species. Fourth, the use of unsupervised machine learning allows the identification and annotation of clusters of CRMs with similar functionality, and the definition of TF markers, CRM targets, CRM-motif composition, and potential cooperation between TFs in the clusters. This information contributes to the understanding of gene regulatory networks. Based on the evidence collected, we successfully identified a cluster that is linked to neurodevelopmental disorders, cancer, and neurodegeneration. Lastly, our work will enable a systematic evaluation of CRM evolution and CRM activity regulation during development, evolution, environmental stimulation, and disease pathogenesis. Our PWM identification was based on the genomic sequences of mouse, rat, and human, and the detected PWMs are evolutionarily conserved among the species. Because MrMOD uses the set of mammalian-genome-specific PWMs to predict candidate CRMs, we expect MrMOD to be applicable to any species that evolutionarily resides between mouse and human within the mammalian phylogenetic tree, including non-human primates, for which experimental compendia have not been generated and would be costly to generate.
However, MrMOD does not require conservation information at the CRM prediction step and should be able to detect non-conserved CRMs.

Genome sequence and orthologous genes

The unmasked genomic DNA sequences and annotation of mouse (Mus musculus) assembly GRCm38 (mm10) and human (Homo sapiens) assembly GRCh38 (hg38) from the Genome Reference Consortium were downloaded from the Ensembl website (https://nov2020.archive.ensembl.org/Mus_musculus/Info/Index and https://useast.ensembl.org/Homo_sapiens/Info/Index). Mouse genome was used as the anchor genome. Human and rat orthologs of mouse genes were downloaded from BioMart (https://useast.ensembl.org/info/data/biomart/index.html). Intergenic region sequences of up to 5 kb in length upstream of the start codon ATG of each gene in the corresponding genome were retrieved. If the distance to the next upstream gene is less than 5 kb, only the intergenic region was obtained.
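The retrieval rule above (up to 5 kb upstream of the ATG, truncated at the neighboring gene) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name `upstream_window`, the half-open 0-based coordinate convention, and the example coordinates are all assumptions made for the sketch.

```python
def upstream_window(atg, strand, neighbor=None, max_len=5000):
    """Return a half-open [start, end) window upstream of a start codon.

    Hypothetical conventions (not from the paper):
      '+' strand: `atg` is the 0-based position of the A in ATG and
                  `neighbor` is the exclusive end of the nearest gene to the left.
      '-' strand: `atg` is the position just past the codon on the forward
                  axis and `neighbor` is the start of the nearest gene to the right.
    """
    if strand == "+":
        start = max(0, atg - max_len)
        if neighbor is not None:
            # Stop at the upstream gene if it is closer than max_len.
            start = min(max(start, neighbor), atg)
        return (start, atg)
    if strand == "-":
        end = atg + max_len
        if neighbor is not None:
            end = max(min(end, neighbor), atg)
        return (atg, end)
    raise ValueError("strand must be '+' or '-'")


# Illustrative coordinates only.
print(upstream_window(10_000, "+"))                   # full 5 kb window
print(upstream_window(10_000, "+", neighbor=8_000))   # truncated at the neighbor
print(upstream_window(10_000, "-", neighbor=12_000))  # minus-strand gene
```

Keeping the window half-open makes its length simply `end - start`, which is convenient when the window is later used to slice a sequence string.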

PWM prediction using PhyloNet and motif consolidation

Using mouse genome as the anchor, we obtained a set of 18,215 genes with at least one ortholog in the rat or human genome. We retrieved up to 5 kb sequences upstream of the ATG codon for all of the genes. Each mouse gene and its orthologs formed a data entry and was used as input by PhyloNet²⁸ to query the database, which includes all the orthologous gene sets to identify conserved cis-regulatory elements that were associated with multiple genes. PhyloNet was run with options “-q 2000 -iq 500 -id 500 -s 4 -c1 -o2 -pf 50”. Up to 50 of the most significant PWMs from each query gene set were saved for further analysis. Because these initial predicted motifs in the PhyloNet output files are highly redundant, we took two steps to consolidate predicted motifs using the ALLR statistics³¹ as described previously²⁹. Briefly, the first step compares matrices in each query output file to consolidate matrices obtained from the same query sequences that significantly overlap. The unique PWMs obtained from each query gene at the first step were pooled together and further consolidated to generate the final set of 5,143 distinct motifs (p-value < 10⁻¹⁰) with lengths between 5 and 30 bases (average 18 bp).

Comparison with transcription factor PWMs in the TRANSFAC and CIS-BP database

Predicted PWMs were compared with PWMs of CIS-BP (Version 2.00) and TRANSFAC (version 10.2)²⁷,⁷⁹ using MatAlign-v4a (Wang and Stormo; http://stormo.wustl.edu/MatAlign/). Matrices that had an ALLR score > 6.57 and the percentage of overlap between two matrices (OLAP score) > 68.1% were considered redundant²⁹. For each round of comparison, the best PWM was picked first (the one with the highest total ALLR score in the PhyloNet output). It was compared with the rest of the matrices using ALLR statistics, and any matrix that appeared redundant to the chosen matrix was removed. Then, the second best one was picked, and the process was repeated until all the matrices had been analyzed.
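The greedy pick-and-remove procedure described above can be sketched as below. This is only an illustration of the selection loop, not the MatAlign implementation: `is_redundant` is a hypothetical stand-in for the ALLR/OLAP comparison, and the matrix names, scores, and redundant pairs are invented toy values.

```python
from typing import Callable, Dict, List


def greedy_consolidate(scores: Dict[str, float],
                       is_redundant: Callable[[str, str], bool]) -> List[str]:
    """Keep the best-scoring matrix, drop all matrices redundant to it, repeat."""
    remaining = sorted(scores, key=scores.get, reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest remaining total score
        kept.append(best)
        remaining = [m for m in remaining if not is_redundant(best, m)]
    return kept


# Toy data: A~B and C~D are declared redundant (hypothetical).
toy_scores = {"A": 30.0, "B": 25.0, "C": 12.0, "D": 8.0}
toy_pairs = {frozenset(("A", "B")), frozenset(("C", "D"))}


def toy_redundant(a: str, b: str) -> bool:
    # In practice this would test ALLR score > 6.57 and OLAP > 68.1%.
    return frozenset((a, b)) in toy_pairs


print(greedy_consolidate(toy_scores, toy_redundant))  # ['A', 'C']
```

Because each kept matrix is compared only against lower-ranked matrices, the result depends on the ranking, which is why the text specifies that the matrix with the highest total ALLR score is picked first.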

ChIP-Atlas database mouse transcription factor target enrichment analysis

ChIP-Atlas³² has a collection of analysis results from 2,540 ChIP-seq data sets including TFs and their potential target genes, totaling 723 TFs from 369 different cell types/tissues for mouse (mm10) (ftp://ftp.biosciencedbc.jp/archive/chip-atlas/data/mm10/target). The peaks used in this study from ChIP-Atlas were +/- 5kb from the transcription start site (TSS) of RefSeq. The threshold of significance used to determine whether a target gene is enriched for a TF was > 500 binding scores of MACS2. Significant overlap between predicted PWM-associated genes and TF targets from ChIP-Atlas were evaluated using a hypergeometric test. FDR-corrected p-value < 0.05 was considered significant.
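The enrichment test above can be illustrated with a small standard-library sketch. It assumes a one-sided hypergeometric test followed by Benjamini-Hochberg correction as the FDR procedure; the paper does not name the correction method, so that choice, the function names, and all numbers in the example are assumptions.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k): overlap of at least k when N genes are drawn from a
    universe of M genes, n of which are targets of the TF."""
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented example: 20,000-gene universe, 300 TF targets,
# a 150-gene set of which 12 are targets.
p = hypergeom_sf(12, 20_000, 300, 150)
print(p, benjamini_hochberg([p, 0.04, 0.03]))
```

A TF and gene set pair would be called significant when its adjusted value falls below 0.05, matching the threshold stated above.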

Functional enrichment analysis of PWM-associated genes

Each PWM-associated gene set was compared with GO and KEGG annotation using the ClusterProfiler R package⁸⁰. Terms and pathways with FDR corrected q-value < 0.001 were considered significant.

Whole-genome-wide CRM identification

The 5,143 mammalian PWMs identified by PhyloNet were used as input for the algorithm implemented in CerMOD²⁹ to identify CRMs in the mouse and human genomes, respectively. First, Patser⁸¹ was used to identify all predicted binding sites for the 5,143 PWMs using default cutoff scores. Next, the algorithm calculates the average number of binding sites per position in each chromosome and Z score for each position. Peak positions that have a Z score > 2.33 (corresponding to p-value < 0.01, one-tailed) were identified. For each peak position, we extended it in both 5’ and 3’ directions if the next Z score > 0 position is fewer than 30 bp away (the longest motif length). Peak positions used in a previous extension step were not extended.

MrMOD identifies DNA regions that have TF binding sites significantly more than average (motif abundance z score > 2.33, p < 0.01, one-tailed) and automatically determines the boundaries by z score = 0 positions at each end of the DNA region. These regions correspond to putative regulatory CRMs.
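The peak-and-extend logic above can be sketched on a toy per-position count track. This is one reading of the description, not the CerMOD or MrMOD code: it assumes a population standard deviation, treats "extension" as chaining neighboring Z > 0 positions that lie fewer than 30 bp apart, and reports half-open intervals; the function name and the toy counts are hypothetical.

```python
from statistics import mean, pstdev


def call_regions(counts, z_peak=2.33, max_gap=30):
    """Return half-open (start, end) regions that contain at least one
    position with Z > z_peak, extended through Z > 0 positions whose
    neighbors are fewer than max_gap bp apart."""
    mu, sd = mean(counts), pstdev(counts)
    if sd == 0:
        return []  # flat track: no position stands out
    z = [(c - mu) / sd for c in counts]
    positive = [i for i, v in enumerate(z) if v > 0]

    regions = []
    start = prev = None
    has_peak = False
    for i in positive:
        if start is not None and i - prev < max_gap:
            prev = i  # keep extending the current region
            has_peak = has_peak or z[i] > z_peak
            continue
        if start is not None and has_peak:
            regions.append((start, prev + 1))
        start, prev, has_peak = i, i, z[i] > z_peak
    if start is not None and has_peak:
        regions.append((start, prev + 1))
    return regions


# Toy track: a strong site cluster at positions 100-110, then a weak
# cluster far away that never exceeds the peak threshold.
track = [0] * 100 + [1] * 5 + [10] + [1] * 5 + [0] * 100 + [1] * 3 + [0] * 50
print(call_regions(track))
```

In this reading the region boundaries fall where the Z score drops to zero or below, so the only tunable quantity is the peak threshold, consistent with the single adjustable p-value parameter.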

Control regions development and annotation

To develop a set of control regions from each genome with the exact same genomic coverage and distribution, we first obtained regions not covered by CRMfull using BEDTools⁸² complement function excluding regions with more than 50% “Ns”. Because it is impossible to predict the exact boundaries of CRMs, regions smaller than 400 bp that were located between two predicted CRMs, were filtered out to remove small regions that could be part of the nearby CRMs. Next, we defined a subset of CRMs (CRMsub) that are smaller than 250bp as input. Then we trimmed the control regions by 40 bp each end, padding 5bp between fragments, to generate a distribution with exactly the same number/length as the CRMsub. The R package ChIPseeker⁸³ was used to annotate CRMfull and control regions as well as calculate feature genomic distribution.

First, regions not covered by CRMfull were obtained. Then, we subset the CRMs by eliminating the longest CRMs, resulting in predicted CRM subsets (CRMsub) that are smaller than 250 bp. Because space for sampling control regions is limited, using small fragments leaves sufficient space to sample the non-functional regions and obtain controls with exactly the same number and length as the subset. In addition, it is impossible to predict the exact boundaries of CRMs. For this reason, small regions that were located between two predicted CRMs were filtered out, as they could be part of the nearby functional CRMs. The remaining regions were predicted to be non-functional, covering 21.8% and 20.8% of the mouse and human genomes, respectively, and were used as the control regions (referred to as CTRLs) to evaluate the accuracy of predicted CRMs. CRMsub had the exact same genomic coverage and distribution relative to annotated genes and transcripts as the CTRLs for both genomes (Fig. 2B, 2C).
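The first filtering steps for control regions can be sketched in plain Python as an analogue of a BEDTools complement followed by length and N-content filters. This is a simplified illustration under stated assumptions: it applies the 400 bp minimum to every gap (not only gaps flanked by two CRMs), omits the trimming and length-matching step, and uses hypothetical function names and toy coordinates.

```python
def complement(crms, chrom_len):
    """Half-open gaps on one chromosome not covered by any CRM interval."""
    gaps, cursor = [], 0
    for start, end in sorted(crms):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_len:
        gaps.append((cursor, chrom_len))
    return gaps


def candidate_controls(crms, chrom_len, seq=None, min_len=400, max_n_frac=0.5):
    """Keep gaps of at least min_len bp whose N content is at most max_n_frac."""
    kept = []
    for start, end in complement(crms, chrom_len):
        if end - start < min_len:
            continue  # short gaps may belong to a neighboring CRM
        if seq is not None:
            n_frac = seq[start:end].upper().count("N") / (end - start)
            if n_frac > max_n_frac:
                continue  # mostly unsequenced region
        kept.append((start, end))
    return kept


# Toy chromosome of 2,000 bp with three predicted CRMs.
toy_crms = [(100, 200), (300, 400), (1000, 1100)]
print(candidate_controls(toy_crms, 2000))
```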

Comparison with experimentally defined CRMs, chromatin accessibility, and epigenomics data

Experimentally defined modules came from two sources. 1) We collected 97 experimentally defined CRMs for mouse and 60 modules for human through an extensive literature search (Supplementary Tables 1 and 2). 2) We downloaded 634 and 998 experimentally defined enhancers that exhibited enhancer activity¹⁰ from VISTA Enhancer Browser for mouse and human, respectively. Human and mouse predicted enhancers were obtained from Enhancer Atlas 2.0 (http://www.enhanceratlas.org/data/download/species_enh_csv.tar.gz). From the Cistrome database, chromatin accessibility (ATAC-seq and DNAse-seq) peaks and ChIP-seq histone mark (H3K27ac, H3K4me1, and H3K4me3) peaks were downloaded (http://cistrome.org/db/batchdata) for both mouse and human genomes. ATAC-seq Atlas is only available for the mouse genome⁵⁰. ENCODE mouse and human cCREs²² are available at the web-based server Search Candidate cis-Regulatory Elements by ENCODE V3 (SCREEN; http://screen.encodeproject.org). Mouse²⁴ and human²⁵ scATAC-seq can be accessed at their portals (http://catlas.org/mousebrain/#!/ and http://catlas.org/humanenhancer/#!/, respectively). Each dataset included peaks from hundreds of different tissues and cell types. Files from the same type of data were merged using bedtools merge to eliminate overlapping elements. Modules bigger than 2.5 kb after merging were eliminated before comparison. We considered an experimentally defined module correctly predicted at two different cutoffs: 1) the experimentally defined module overlaps a predicted CRM by 50% of the length of the shorter one; 2) they overlap by 1 bp. The odds ratio (OR), confidence interval, and p-value (chi-square test of independence) were calculated comparing CRMsub and CTRL for each dataset of mouse and human. Peak-detection sensitivity and OR were calculated using the R package fmsb. The difference was considered significant with an OR > 1 and p-value < 0.05.
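The two overlap cutoffs and the odds ratio comparison can be illustrated as follows. This is a sketch under stated assumptions, not the fmsb computation: the confidence interval uses the Wald (Woolf) log-odds approximation, which may differ from what fmsb reports, the function names are hypothetical, and the 2×2 counts are invented.

```python
from math import exp, log, sqrt


def overlap_bp(a, b):
    """Number of shared bases between two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_hit(module, crm, min_frac=None):
    """Cutoff 2 (default): at least 1 bp overlap.
    Cutoff 1 (min_frac=0.5): overlap of at least 50% of the shorter interval."""
    ov = overlap_bp(module, crm)
    if min_frac is None:
        return ov >= 1
    shorter = min(module[1] - module[0], crm[1] - crm[0])
    return ov >= min_frac * shorter


def odds_ratio(a, b, c, d, z=1.96):
    """a/b: modules hit/missed by CRMsub; c/d: modules hit/missed by CTRL.
    Returns (OR, lower, upper) with a Wald interval on the log scale."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)


print(is_hit((0, 100), (60, 250)))                # 1 bp cutoff
print(is_hit((0, 100), (60, 250), min_frac=0.5))  # 50% cutoff
print(odds_ratio(80, 20, 40, 60))                 # invented counts
```

An OR above 1 whose interval excludes 1 indicates that the experimentally defined modules overlap CRMsub more often than the matched CTRL regions.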

Unsupervised machine learning to annotate CRM functions

We scanned the mouse genome for TFBS using mouse and human PWMs from the CIS-BP database. We obtained TFBS abundance in each CRM for every transcription factor in the CIS-BP database. Unsupervised clustering of CRMs based on TFBS abundance was performed using the single-cell genomics tool Seurat on chromosome 17, one of the smallest chromosomes of the mouse genome. Each CRM cluster linked putative TFs (TF marker genes whose binding sites are enriched in the CRM cluster) with the putative target genes (genes associated with the CRMs in the cluster). The CRM-associated genes of each cluster were annotated using the ChIPseeker R package.

Functional enrichment analysis of all CRM associated-genes

The annotated chr17 CRM-associated genes of each cluster were used to perform functional enrichment analysis: GO, KEGG, and DisGeNET enrichments using the ClusterProfiler R package. For DisGeNET enrichment analysis, the mouse ENTREZ IDs were converted to human ENTREZ IDs. DisGeNET enrichment was also tested with two other methods (EnrichR and disgenet2r) to compare and confirm the results for C17. Parameters used in the ClusterProfiler functional enrichment analysis were FDR < 0.05, minGSSize = 10, maxGSSize = 5000, minimum number of counts in a pathway = 5. Genes of chromosome 17 (chr17) were used as the universe background for these analyses.

Functional enrichment and network analysis of TF markers of C17

The TF markers of C17 were also tested for enrichment using the same parameters described above. The only difference is that we did not use chr17 genes as the universe, because TFs are not restricted to binding only CRMs of chr17. TFs that were enriched in four neurodevelopmental pathways (intellectual disability, neurodevelopmental disorder, autistic disorder, and schizophrenia) were combined (total of 57 TFs) and used to build a STRING network to visualize the connections between them.

CRM associated-genes of C17 are enriched for mouse ChIP-Atlas TF targets

TF target enrichment analysis was performed using the same parameters used for the motif analysis: peaks within +/- 5kb of the TSS of RefSeq; a MACS2 binding score > 500 as the threshold of significance to determine whether a target gene is enriched for a TF; and a hypergeometric test to evaluate significant overlap between predicted CRM-associated genes of C17 and mouse TF targets from ChIP-Atlas. FDR-corrected p-value < 0.05 was considered significant.

Animals and procedures

Fertilized White Leghorn chicken eggs (Hy-line North America, Mansfield, GA) were incubated at 38.5–39°C and 50-80% humidity. A DNA solution of 2-5 mg/ml was injected into the lumen of the neural tube at HH stage 17–18 (E2.75-E3). Electroporation was performed using 3 × 50 ms pulses at 25 V, applied across the embryo using a 0.5-mm tungsten wire and a BTX electroporator (ECM 830). 100 unit/ml penicillin in Hank's Balanced Salt Solution was added on top of the embryos and embryos were incubated for 24 hours prior to analysis.

Immunohistochemistry

Embryos were fixed overnight at 4°C in 4% paraformaldehyde/0.1 M phosphate buffer, incubated in 30% sucrose/PBS for 24 h, and embedded in Optimal Cutting Temperature (O.C.T.) compound. Cryostat sections (12 μm) were collected on Superfrost Plus slides and kept at −20°C. The following primary antibodies were used: rabbit anti FABP7 (Thermo Fisher Scientific Cat #PA5-24949) and rabbit anti SCG10/STMN2 (Novus catalog #NBP1-49461). The secondary antibody used was 647 goat (Thermo Fisher #A-21245). Images were taken under a microscope (SMZ-745T Zoom Stereo Photo Microscope, Nikon and Lionheart LX automated microscope, BioTek) and were analyzed with Nikon NIS-Elements and Gen5 software.

DNA

The hs52 element was amplified by PCR from human genomic DNA using the primers [GCCAATTGCAATTTGGAATAACTTTCCCTACCC] and [GCGCTAGCTAAAAAGTGACCTGGGAAAACTCAG]. The hs52.2 element was amplified using the primers [GCCAATTGAACGCACCCTCTGTTCTTCAGT] and [GCGCTAGCGCACTAAGTACTATTATGTAGCACA]. The hs52.3 element was amplified using the primers [GCCAATTGCAGGCTTGGAAATGGGGCCAGG] and [GCGCTAGCACGTGGCAGTGAAAACGAGTGG]. The enhancers were cloned into the 5'MfeI/3'NheI sites of the Cre plasmid. The RPBSA: GFP plasmid was used as a positive control for electroporation (Addgene #60511).

hs52

CAATTTGGAATAACTTTCCCTACCCagtaaattgagcattactctaggattctgagacagagagaaagcacaattttaaaagctttgcagagttcctttgtaattagtcgcagctttccttgaatattaattttccctgcatccctttcaagtggttgagagactgtctctacaactacagagatgcaccctcagaacaacgacagcaagcggcatggtctcaaacacctatgattaagtagtctacagaacgcaccctctgttcttcagtgcagtgtgtagcttatcagtgcaaacagtttaatatttatgctaagaggattgtcaaaagcagcttctgttgctttaattcttgttttaaataaataatgagaacatttaaacacattactcttcttggggccccggggtcagctaatcttattatttatgaagtgatgtgctacataatagtacttagtgcatgttaacagacgctattatcagggccggatgcagagagctgaagatatattagaatgttatgtgtaatgtacgacggattgagtgcataggatgccggtgtagcaattaaccacactcgaaaataggtgtaaagttgaagtatgttttccccggggggatcccctcaccattaataattccccagagaagaagatgtctttcagttaggaaccctctctaccatcaggcttggaaatggggccaggatattccattctttgatctcttcatagtcagtcctacacagtcagaagacaaatagtgagcatgaccactttttaattgatttagacaaaaatggagagaaggcgggggtggagggaggcacatgtgcaatgctcccaagtgtcctcatagtgtttggttttgatccactcgttttcactgccacgtactccaggagagtcgagaaattgttcattccttaatgcaatctgtttccttctctctgatcctcattttgcagataaataaagctaaagccaCTGAGTTTTCCAGGTCACTTTTTA

hs52.1

TACCCAGTAAATTGAGCATTACTCTAGGATTCTGAGACAGAGAGAAAGCACAATTTTAAAAGCTTTGCAGAGTTCCTTTGTAATTAGTCGCAGCTTTCCTTGAATATTAATTTTCCCTGCATCCCTTTCAAGTGGTTGAGAGACTGTCTCTACAACTACAGAGATGCACCCTCAGAACAACGACAGCAAGCGGCATGGTCTCAAACACCTATGATTA

hs52.2

CAGAACGCACCCTCTGTTCTTCAGTGCAGTGTGTAGCTTATCAGTGCAAACAGTTTAATATTTATGCTAAGAGGATTGTCAAAAGCAGCTTCTGTTGCTTTAATTCTTGTTTTAAATAAATAATGAGAACATTTAAACACATTACTCTTCTTGGGGCCCCGGGGTCAGCTAATCTTATTATTTATGAAGTGATGTGCTACATAATAGTACTTAGTGCATGTTAACAGA

hs52.3

CCTCTCTACCATCAGGCTTGGAAATGGGGCCAGGATATTCCATTCTTTGATCTCTTCATAGTCAGTCCTACACAGTCAGAAGACAAATAGTGAGCATGACCACTTTTTAATTGATTTAGACAAAAATGGAGAGAAGGCGGGGGTGGAGGGAGGCACATGTGCAATGCTCCCAAGTGTCCTCATAGTGTTTGGTTTTGATCCACTCGTTTTCACTGCCACGTACTCCAGGAGAGTCGAGAAATTGTTCA

Ethics declarations

Fertilized chicken eggs were obtained from commercial sources (Hy-Line Hatchery, Georgia). Chick embryos were used only prior to hatching; the gestation stage was embryonic day 4 (96 hours post gestation). Ethical approval is not required according to NIH guidelines for the use of chick embryos prior to hatching (21 days gestation), which are not considered "live vertebrate animals". All experimental protocols were approved by The University of Georgia. All methods used in this study are reported in accordance with ARRIVE guidelines.

Data and software availability

The code and all relevant data have been submitted to the WashU Epigenome Browser and can be visualized at the following URLs:

Human: https://epigenomegateway.wustl.edu/browser/?sessionFile=https://wangftp.wustl.edu/~dli/gzhao/CRMs-202207/hg38-s.json

Mouse:

https://epigenomegateway.wustl.edu/browser/?sessionFile=https://wangftp.wustl.edu/~dli/gzhao/CRMs-202207/mm10-s.json

Acknowledgments

We thank Dr. Gary D. Stormo and Kerry Grens for their comments. We thank Brian Koebbe and Eric Martin from the High-Throughput Computing Facility at WUSM for providing high-throughput computational resources and support.

Funding

This work was supported in part by NIH grants R03AG070474, R21AG077643, R01NS123571, and 1U19NS130607, in support of TMG and GZ. NIH grant R01 NS041021 supported in part the work of HWG and GZ. Grant 5T U24 HG012070 supported in part the work of DL, GZ, and TW. This publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author Contributions

Study concept, design, and supervision: GZ. Bioinformatics analysis and database creation and maintenance: TMG, JX, DL, and GZ. Data curation and comparison: TMG. In ovo experiments: CS, SB, and OA. Critical revision of the manuscript for important intellectual content: DL, HWG, and TW. Manuscript writing: TMG, OA, and GZ, with feedback from all authors.

Competing interests

The authors have no conflicts of interest or financial ties to disclose.

References

Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol. 2015;4:59–84. 10.1002/wdev.168.
Ben-Tabou de-Leon S, Davidson EH. Gene regulation: gene control network in development. Annu Rev Biophys Biomol Struct. 2007;36:191. 10.1146/annurev.biophys.35.040405.102002.
Weinhold N, Jacobsen A, Schultz N, Sander C, Lee W. Genome-wide analysis of noncoding regulatory mutations in cancer. Nat Genet. 2014;46:1160–5. 10.1038/ng.3101.
Murakawa Y, et al. Enhanced Identification of Transcriptional Enhancers Provides Mechanistic Insights into Diseases. Trends Genet. 2016;32:76–88.
Carullo NVN, Day JJ. Genomic Enhancers in Brain Health and Disease. Genes (Basel). 2019;10. 10.3390/genes10010043.
Claringbould A, Zaugg JB. Enhancers in disease: molecular basis and emerging treatment strategies. Trends Mol Med. 2021;27:1060–73. 10.1016/j.molmed.2021.07.012.
Hardison RC, Taylor J. Genomic approaches towards finding cis-regulatory modules in animals. Nat Rev Genet. 2012;13:469–83. 10.1038/nrg3242.
Gasperini M, Tome JM, Shendure J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat Rev Genet. 2020;21:292–310. 10.1038/s41576-019-0209-0.
Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004;5:276–87. 10.1038/nrg1315.
Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer Browser–a database of tissue-specific human enhancers. Nucleic Acids Res. 2007;35:D88–92. 10.1093/nar/gkl822.
Buffry AD, Mendes CC, McGregor AP. The Functionality and Evolution of Eukaryotic Transcriptional Enhancers. Adv Genet. 2016;96:143–206. 10.1016/bs.adgen.2016.08.004.
Mundade R, Ozer HG, Wei H, Prabhu L, Lu T. Role of ChIP-seq in the discovery of transcription factor binding sites, differential gene regulation mechanism, epigenetic marks and beyond. Cell Cycle. 2014;13:2847–52. 10.4161/15384101.2014.949201.
Halfon MS. Studying Transcriptional Enhancers: The Founder Fallacy, Validation Creep, and Other Biases. Trends Genet. 2019;35:93–103. 10.1016/j.tig.2018.11.004.
Thurman RE, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489:75–82. 10.1038/nature11232.
Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr Protoc Mol Biol. 2015;109:21.29.1–21.29.9. 10.1002/0471142727.mb2129s109.
Arnold CD, et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science. 2013;339:1074–7. 10.1126/science.1232542.
White MA, Myers CA, Corbo JC, Cohen BA. Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks. Proc Natl Acad Sci U S A. 2013;110:11952–7. 10.1073/pnas.1307449110.
Yan F, Powell DR, Curtis DJ, Wong NC. From reads to insight: a hitchhiker's guide to ATAC-seq data analysis. Genome Biol. 2020;21:22. 10.1186/s13059-020-1929-3.
Yang Y, et al. Leveraging biological replicates to improve analysis in ChIP-seq experiments. Comput Struct Biotechnol J. 2014;9:e201401002. 10.5936/csbj.201401002.
Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges in bioinformatics approaches for enhancer identification. Brief Bioinform. 2016;17:967–79. 10.1093/bib/bbv101.
Gao T, Qian J. EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species. Nucleic Acids Res. 2020;48:D58–D64. 10.1093/nar/gkz980.
ENCODE Project Consortium, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710. 10.1038/s41586-020-2493-4.
Zheng R, et al. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res. 2019;47:D729–35. 10.1093/nar/gky1094.
Li YE, et al. An atlas of gene regulatory elements in adult mouse cerebrum. Nature. 2021;598:129–36. 10.1038/s41586-021-03604-1.
Zhang K, et al. A single-cell atlas of chromatin accessibility in the human genome. Cell. 2021;184:5985–6001.e19. 10.1016/j.cell.2021.10.024.
Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat Rev Genet. 2010;11:751–60. 10.1038/nrg2845.
Weirauch MT, et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158:1431–43. 10.1016/j.cell.2014.08.009.
Wang T, Stormo GD. Identifying the conserved network of cis-regulatory sites of a eukaryotic genome. Proc Natl Acad Sci U S A. 2005;102:17400–5. 10.1073/pnas.0505147102.
Zhao G, et al. Conserved Motifs and Prediction of Regulatory Modules in Caenorhabditis elegans. G3 (Bethesda). 2012;2:469–81. 10.1534/g3.111.001081.
Wingender E, Dietze P, Karas H, Knuppel R. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 1996;24:238–41. 10.1093/nar/24.1.238.
Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003;19:2369–80. 10.1093/bioinformatics/btg329.
Oki S, et al. ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO Rep. 2018;19. 10.15252/embr.201846255.
Suzuki A, et al. Nanog binds to Smad1 and blocks bone morphogenetic protein-induced differentiation of embryonic stem cells. Proc Natl Acad Sci U S A. 2006;103:10294–9. 10.1073/pnas.0506945103.
Wilkinson AC, et al. Single-cell analyses of regulatory network perturbations using enhancer-targeting TALEs suggest novel roles for PU.1 during haematopoietic specification. Development. 2014;141:4018–30. 10.1242/dev.115709.
Moignard V, et al. Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis. Nat Cell Biol. 2013;15:363–72. 10.1038/ncb2709.
Kraus D, et al. Retinal expression of the X-linked juvenile retinoschisis (RS1) gene is controlled by an upstream CpG island and two opposing CRX-bound regions. Biochim Biophys Acta. 2011;1809:245–54. 10.1016/j.bbagrm.2011.03.001.
Reuveni E, Getselter D, Oron O, Elliott E. Differential contribution of cis and trans gene transcription regulatory mechanisms in amygdala and prefrontal cortex and modulation by social stress. Sci Rep. 2018;8:6339. 10.1038/s41598-018-24544-3.
Charital YM, van Haasteren G, Massiha A, Schlegel W, Fujita T. A functional NF-kappaB enhancer element in the first intron contributes to the control of c-fos transcription. Gene. 2009;430:116–22. 10.1016/j.gene.2008.10.014.
Keilani S, et al. Egr-1 induces DARPP-32 expression in striatal medium spiny neurons via a conserved intragenic element. J Neurosci. 2012;32:6808–18. 10.1523/JNEUROSCI.5448-11.2012.
Fernandez-Tresguerres B, et al. Evolution of the mammalian embryonic pluripotency gene regulatory network. Proc Natl Acad Sci U S A. 2010;107:19955–60. 10.1073/pnas.1010708107.
Jash A, Yun K, Sahoo A, So JS, Im SH. Looping mediated interaction between the promoter and 3' UTR regulates type II collagen expression in chondrocytes. PLoS ONE. 2012;7:e40828. 10.1371/journal.pone.0040828.
Melanson BD, et al. A novel cis-acting element from the 3'UTR of DNA damage-binding protein 2 mRNA links transcriptional and post-transcriptional regulation of gene expression. Nucleic Acids Res. 2013;41:5692–703. 10.1093/nar/gkt279.
Ryan BC, et al. Mapping the Pax6 3' untranslated region microRNA regulatory landscape. BMC Genomics. 2018;19:820. 10.1186/s12864-018-5212-x.
Yoo SH, et al. Period2 3'-UTR and microRNA-24 regulate circadian rhythms by repressing PERIOD2 protein accumulation. Proc Natl Acad Sci U S A. 2017;114:E8855–64. 10.1073/pnas.1706611114.
Wissink EM, Fogarty EA, Grimson A. High-throughput discovery of post-transcriptional cis-regulatory elements. BMC Genomics. 2016;17. 10.1186/s12864-016-2479-7.
Repele A, Krueger S, Bhattacharyya T, Tuineau MY, Manu. The regulatory control of Cebpa enhancers and silencers in the myeloid and red-blood cell lineages. PLoS ONE. 2019;14:e0217580. 10.1371/journal.pone.0217580.
Doni Jayavelu N, Jajodia A, Mishra A, Hawkins RD. Candidate silencer elements for the human and mouse genomes. Nat Commun. 2020;11:1061. 10.1038/s41467-020-14853-5.
Farrell CM, West AG, Felsenfeld G. Conserved CTCF insulator elements flank the mouse and human beta-globin loci. Mol Cell Biol. 2002;22:3820–31. 10.1128/MCB.22.11.3820-3831.2002.
Wang Y, et al. A locus control region adjacent to the human red and green visual pigment genes. Neuron. 1992;9:429–40. 10.1016/0896-6273(92)90181-c.
Liu C, et al. An ATAC-seq atlas of chromatin accessibility in mouse tissues. Sci Data. 2019;6:65. 10.1038/s41597-019-0071-0.
Hao Y, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573–3587.e29. 10.1016/j.cell.2021.04.048.
Rico B, et al. Control of axonal branching and synapse formation by focal adhesion kinase. Nat Neurosci. 2004;7:1059–69. 10.1038/nn1317.
Pinero J, Sauch J, Sanz F, Furlong LI. The DisGeNET cytoscape app: Exploring and visualizing disease genomics data. Comput Struct Biotechnol J. 2021;19:2960–7. 10.1016/j.csbj.2021.05.015.
Vue TY, et al. ASCL1 regulates neurodevelopmental transcription factors and cell cycle genes in brain tumors of glioma mouse models. Glia. 2020;68:2613–30. 10.1002/glia.23873.
McLachlan T, et al. B-cell Lymphoma 6 (BCL6): From Master Regulator of Humoral Immunity to Oncogenic Driver in Pediatric Cancers. Mol Cancer Res. 2022;20:1711–23. 10.1158/1541-7786.MCR-22-0567.
Curry RN, Glasgow SM. The Role of Neurodevelopmental Pathways in Brain Tumors. Front Cell Dev Biol. 2021;9:659055. 10.3389/fcell.2021.659055.
Nyati KK, Kishimoto T. Recent Advances in the Role of Arid5a in Immune Diseases and Cancer. Front Immunol. 2021;12:827611. 10.3389/fimmu.2021.827611.
Kent LN, Leone G. The broken cycle: E2F dysfunction in cancer. Nat Rev Cancer. 2019;19:326–38. 10.1038/s41568-019-0143-7.
Hickey SL, Berto S, Konopka G. Chromatin Decondensation by FOXP2 Promotes Human Neuron Maturation and Expression of Neurodevelopmental Disease Genes. Cell Rep. 2019;27:1699–1711.e9. 10.1016/j.celrep.2019.04.044.
Simon R, Wiegreffe C, Britsch S. Bcl11 Transcription Factors Regulate Cortical Development and Function. Front Mol Neurosci. 2020;13. 10.3389/fnmol.2020.00051.
Ben Ayed I, et al. 8q21.11 microdeletion syndrome: Delineation of HEY1 as a candidate gene in neurodevelopmental and cardiac defects. Mol Genet Genomic Med. 2021;9:e1811. 10.1002/mgg3.1811.
Forrest MP, et al. The Psychiatric Risk Gene Transcription Factor 4 (TCF4) Regulates Neurodevelopmental Pathways Associated With Schizophrenia, Autism, and Intellectual Disability. Schizophr Bull. 2018;44:1100–10. 10.1093/schbul/sbx164.
Wang SM, et al. Astrocytic CCAAT/Enhancer-binding protein delta contributes to reactive oxygen species formation in neuroinflammation. Redox Biol. 2018;16:104–12. 10.1016/j.redox.2018.02.011.
Forrest MP, Waite AJ, Martin-Rendon E, Blake DJ. Knockdown of human TCF4 affects multiple signaling pathways involved in cell survival, epithelial to mesenchymal transition and neuronal differentiation. PLoS ONE. 2013;8:e73169. 10.1371/journal.pone.0073169.
Satterstrom FK, et al. Large-Scale Exome Sequencing Study Implicates Both Developmental and Functional Changes in the Neurobiology of Autism. Cell. 2020;180:568–584.e23. 10.1016/j.cell.2019.12.036.
Cummings CT, Rowley MJ. Implications of Dosage Deficiencies in CTCF and Cohesin on Genome Organization, Gene Expression, and Human Neurodevelopment. Genes (Basel). 2022;13. 10.3390/genes13040583.
Le Douarin N, Kalcheim C. The Neural Crest. Vol. 36. Cambridge University Press; 1999.
Garcia-Castro M, Bronner-Fraser M. Induction and differentiation of the neural crest. Curr Opin Cell Biol. 1999;11:695–8. 10.1016/s0955-0674(99)00038-1.
Timmer J, Johnson J, Niswander L. The use of in ovo electroporation for the rapid analysis of neural-specific murine enhancers. Genesis. 2001;29:123–32. 10.1002/gene.1015.
Avraham O, et al. Transcriptional control of axonal guidance and sorting in dorsal interneurons by the Lim-HD proteins Lhx9 and Lhx1. Neural Dev. 2009;4. 10.1186/1749-8104-4-21.
Shen Y, et al. A map of the cis-regulatory sequences in the mouse genome. Nature. 2012;488:116–20. 10.1038/nature11243.
Ni P, Su Z. Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom Bioinform. 2021;3:lqab052. 10.1093/nargab/lqab052.
Chi KR. The dark side of the human genome. Nature. 2016;538:275–7. 10.1038/538275a.
Hammal F, de Langen P, Bergon A, Lopez F, Ballester B. ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res. 2022;50:D316–25. 10.1093/nar/gkab996.
Landt SG, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012;22:1813–31. 10.1101/gr.136184.111.
Benton ML, Talipineni SC, Kostka D, Capra JA. Genome-wide enhancer annotations differ significantly in genomic distribution, evolution, and function. BMC Genomics. 2019;20:511. 10.1186/s12864-019-5779-x.
Maricque BB, Dougherty JD, Cohen BA. A genome-integrated massively parallel reporter assay reveals DNA sequence determinants of cis-regulatory activity in neural cells. Nucleic Acids Res. 2017;45:e16. 10.1093/nar/gkw942.
Klein JC, et al. A systematic evaluation of the design and context dependencies of massively parallel reporter assays. Nat Methods. 2020;17:1083–91. 10.1038/s41592-020-0965-y.
Matys V, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31:374–8. 10.1093/nar/gkg108.
Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16:284–7. 10.1089/omi.2011.0118.
Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–77. 10.1093/bioinformatics/15.7.563.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. 10.1093/bioinformatics/btq033.
Yu G, Wang LG, He QY. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics. 2015;31:2382–3. 10.1093/bioinformatics/btv145.

Tables 1 to 3 are available in the Supplementary Files section.
