Genome sequence and orthologous genes
The unmasked genomic DNA sequences and annotation of mouse (Mus musculus) assembly GRCm38 (mm10) and human (Homo sapiens) assembly GRCh38 (hg38) from the Genome Reference Consortium were downloaded from the Ensembl website (https://nov2020.archive.ensembl.org/Mus_musculus/Info/Index and https://useast.ensembl.org/Homo_sapiens/Info/Index). Mouse genome was used as the anchor genome. Human and rat orthologs of mouse genes were downloaded from BioMart (https://useast.ensembl.org/info/data/biomart/index.html). Intergenic region sequences of up to 5 kb in length upstream of the start codon ATG of each gene in the corresponding genome were retrieved. If the distance to the next upstream gene is less than 5 kb, only the intergenic region was obtained.
PWM prediction using PhyloNet and motif consolidation
Using mouse genome as the anchor, we obtained a set of 18,215 genes with at least one ortholog in the rat or human genome. We retrieved up to 5 kb sequences upstream of the ATG codon for all of the genes. Each mouse gene and its orthologs formed a data entry and was used as input by PhyloNet28 to query the database, which includes all the orthologous gene sets to identify conserved cis-regulatory elements that were associated with multiple genes. PhyloNet was run with options “-q 2000 -iq 500 -id 500 -s 4 -c1 -o2 -pf 50”. Up to 50 of the most significant PWMs from each query gene set were saved for further analysis. Because these initial predicted motifs in the PhyloNet output files are highly redundant, we took two steps to consolidate predicted motifs using the ALLR statistics31 as described previously29. Briefly, the first step compares matrices in each query output file to consolidate matrices obtained from the same query sequences that significantly overlap. The unique PWMs obtained from each query gene at the first step were pooled together and further consolidated to generate the final set of 5,143 distinct motifs (p-value < 10−10) with lengths between 5 and 30 bases (average 18 bp).
Comparison with transcription factor PWMs in the TRANSFAC and CIS-BP database
Predicted PWMs were compared with PWMs of CIS-BP (Version 2.00) and TRANSFAC (version 10.2)27,79 using MatAlign-v4a (Wang and Stormo; http://stormo.wustl.edu/MatAlign/). Matrices that had an ALLR score > 6.57 and the percentage of overlap between two matrices (OLAP score) > 68.1% were considered redundant29. For each round of comparison, the best PWM was picked first (the one with the highest total ALLR score in the PhyloNet output). It was compared with the rest of the matrices using ALLR statistics, and any matrix that appeared redundant to the chosen matrix was removed. Then, the second best one was picked, and the process was repeated until all the matrices had been analyzed.
ChIP-Atlas database mouse transcription factor target enrichment analysis
ChIP-Atlas32 has a collection of analysis results from 2,540 ChIP-seq data sets including TFs and their potential target genes, totaling 723 TFs from 369 different cell types/tissues for mouse (mm10) (ftp://ftp.biosciencedbc.jp/archive/chip-atlas/data/mm10/target). The peaks used in this study from ChIP-Atlas were +/- 5kb from the transcription start site (TSS) of RefSeq. The threshold of significance used to determine whether a target gene is enriched for a TF was > 500 binding scores of MACS2. Significant overlap between predicted PWM-associated genes and TF targets from ChIP-Atlas were evaluated using a hypergeometric test. FDR-corrected p-value < 0.05 was considered significant.
Functional enrichment analysis of PWM-associated genes
Each PWM-associated gene set was compared with GO and KEGG annotation using the ClusterProfiler R package80. Terms and pathways with FDR corrected q-value < 0.001 were considered significant.
Whole-genome-wide CRM identification
The 5,143 mammalian PWMs identified by PhyloNet were used as input for the algorithm implemented in CerMOD29 to identify CRMs in the mouse and human genomes, respectively. First, Patser81 was used to identify all predicted binding sites for the 5,143 PWMs using default cutoff scores. Next, the algorithm calculates the average number of binding sites per position in each chromosome and Z score for each position. Peak positions that have a Z score > 2.33 (corresponding to p-value < 0.01, one-tailed) were identified. For each peak position, we extended it in both 5’ and 3’ directions if the next Z score > 0 position is fewer than 30 bp away (the longest motif length). Peak positions used in a previous extension step were not extended.
MrMOD identifies DNA regions that have TF binding sites significantly more than average (motif abundance z score > 2.33, p < 0.01, one-tailed) and automatically determines the boundaries by z score = 0 positions at each end of the DNA region. These regions correspond to putative regulatory CRMs.
Control regions development and annotation
To develop a set of control regions from each genome with the exact same genomic coverage and distribution, we first obtained regions not covered by CRMfull using BEDTools82 complement function excluding regions with more than 50% “Ns”. Because it is impossible to predict the exact boundaries of CRMs, regions smaller than 400 bp that were located between two predicted CRMs, were filtered out to remove small regions that could be part of the nearby CRMs. Next, we defined a subset of CRMs (CRMsub) that are smaller than 250bp as input. Then we trimmed the control regions by 40 bp each end, padding 5bp between fragments, to generate a distribution with exactly the same number/length as the CRMsub. The R package ChIPseeker83 was used to annotate CRMfull and control regions as well as calculate feature genomic distribution.
First, regions not covered by CRMfull were obtained. Then, we subset the CRMs by eliminating the longest CRMs, resulting in predicted CRM subsets (CRMsub) that are smaller than 250 bp. Because of the lack of space to sample the control region, using small fragments gives sufficient space to sample the non-functional regions to get the control with the exact same number/length as the subset. In addition, it is impossible to predict the exact boundaries of CRMs. For this reason, small regions that were located between two predicted CRMs were filtered out as they could be part of the nearby functional CRMs. These regions were predicted to be non-functional, covering 21.8% and 20.8% of mouse and human genomes, respectively, and were used as the control regions (referred to as CTRLs) to evaluate the accuracy of predicted CRMs. CRMsub had the exact same genomic coverage and distributions relative to annotated genes and transcripts to the CTRLs for both genomes (Fig. 2B, 2C).
Comparison with experimentally defined CRMs, chromatin accessibility, and epigenomics data
Experimentally defined modules came from two sources. 1) We collected 97 experimentally defined CRMs for mouse and 60 modules for human through an extensive literature search (Supplementary Tables 1 and 2). 2) We downloaded 634 and 998 experimentally defined enhancers that exhibited enhancer activity10 from VISTA Enhancer Browser for mouse and human, respectively. Human and mouse predicted enhancers were obtained from Enhancer Atlas 2.0 (http://www.enhanceratlas.org/data/download/species_enh_csv.tar.gz). From the Cistrome database, chromatin accessibility (ATAC-seq and DNAse-seq) peaks and ChIP-seq histone marks (H3K27ac, H3K4me1, and H3K4me3) peaks were downloaded (http://cistrome.org/db/batchdata) for both mouse and human genomes. ATAC-seq Atlas is only available for mouse genome50. ENCODE mouse and human cCREs22 are available at the web-based server Search Candidate cis-Regulatory Elements by ENCODE V3 (SCREEN; http://screen.encodeproject.org). Mouse24 and human25 scATAC-seq can be accessed at their portal (http://catlas.org/mousebrain/#!/ and http://catlas.org/humanenhancer/#!/, respectively). Each dataset included peaks from hundreds of different tissues and cell types. Files from the same type of data were merged using bedtools merge to eliminate overlapping elements. Modules bigger than 2.5 kb after merge were eliminated before comparison. We define an experimentally defined module as being correctly predicted at two different cutoffs: 1) experimentally defined modules overlap with predicted CRMs by 50% of the length of the shorter one; 2) they overlap by 1 bp. The odds ratio (OR), confidence interval, and p-value (chi-square independent test) were calculated comparing CRMsub and CTRL of each dataset of mouse and human. Peak-detection sensitivity and OR were calculated using the R package fmsb. The difference was considerate significant with an OR > 1 and p-value < 0.05.
Unsupervised machine learning to annotate CRM functions
We scanned the mouse genome for TFBS using mouse and human PWMs from the CIS-BP database. We obtained TFBS abundance in each CRM for every transcription factor in the CIS-BP database. Unsupervised clustering of CRMs based on TFBS abundance using single-cell genomics tool Seurat on chromosome 17, one of the smallest chromosome of the mouse genome. Each CRM cluster linked putative TFs (TF marker genes whose binding sites are enriched in the CRM cluster) with the putative target genes (genes associated with the CRMs in the cluster). The CRM associated-genes of each cluster were annotated using ChIPseeker R package.
Functional enrichment analysis of all CRM associated-genes
The annotated chr17 CRM associated-genes of each cluster were used to perform functional enrichment analysis: GO, KEGG, and DisGenet enrichments using ClusterProfiler R package. For DisGenet enrichment analysis, the mouse ENTREZ IDs were converted to human ENTREZ IDs. DisGenet was also tested with two other methods (EnrichR and disgenet2r) to compare and ensure the results of C17. Parameters used in the ClusterProfiler functional enrichment analysis were FDR < 0.05, minGSSize = 10, maxGSSize = 5000, minimum number of counts in a pathway = 5. Genes of chromosome 17 (chr17) were used as universe background for these analyses.
Functional enrichment and network analysis of TF markers of C17
The TF markers of C17 were also enriched using the same parameters described above. The only difference is that we did not include chr17 as universe, because TFs are not restricted to bind only CRMs of chr17. TFs that were enriched in three neurodevelopmental pathways (intellectual disability, neurodevelopmental disorder, autistic disorder, and schizophrenia) were combined (total of 57 TFs) and used to build a STRING network to visualize the connections between them.
CRM associated-genes of C17 are enriched for mouse ChIP-Atlas TF targets
TF targets enrichment analysis was performed using the same previous parameters used for the motif analysis: +/- 5kb from the TSS of RefSeq; threshold of significance used to determine whether a target genes is enriched for a TF was > 500 binding scores of MACS2; significant overlap between predicted CRM associated-genes of C17 and mouse TF targets from ChIP-Atlas were evaluated using hypergeometric test. FDR-corrected p-value < 0.05 was considered as significant.
Animals and procedures
Fertilized White Leghorn chicken eggs (Hy-line North America, Mansfield, GA) were incubated at 38.5–39°C and 50-80% humidity. A DNA solution of 2-5 mg/ml was injected into the lumen of the neural tube at HH stage 17–18 (E2.75-E3). Electroporation was performed using 3 × 50 ms pulses at 25 V, applied across the embryo using a 0.5-mm tungsten wire and a BTX electroporator (ECM 830). 100 unit/ml penicillin in Hank's Balanced Salt Solution was added on top of the embryos and embryos were incubated for 24 hours prior to analysis.
Immunohistochemistry
Embryos were fixed overnight at 4°C in 4% paraformaldehyde/0.1 M phosphate buffer, incubated in 30% sucrose/PBS for 24 h, and embedded in Optimal Cutting Temperature (O.C.T.). Cryostat sections (12 μm) were collected on Superfrost Plus slides and kept at −20°C. The following primary antibodies were used- rabbit anti FABP7 (Thermo Fisher Scientific Cat #PA5-24949) and rabbit anti SCG10/STMN2 (Novus catalog #NBP1-49461). Secondary antibody used- 647 goat (Thermo Fisher #A-21245). Images were taken under a microscope (SMZ-745T Zoom Stereo Photo Microscope, Nikon and Lionheart LX automated microscope, BioTek) and images were analyzed with Nikon NIS-Elements and Gen5 software.
DNA
The hs52 element was amplified by PCR from a genomic human DNA utilizing the primers [GCCAATTGCAATTTGGAATAACTTTCCCTACCC] and [GCGCTAGCTAAAAAGTGACCTGGGAAAACTCAG]. hs52.2 element was amplified utilizing the primers [GCCAATTGAACGCACCCTCTGTTCTTCAGT] and [GCGCTAGCGCACTAAGTACTATTATGTAGCACA]. hs52.3 element was amplified utilizing the primers [GCCAATTGCAGGCTTGGAAATGGGGCCAGG] and [GCGCTAGCACGTGGCAGTGAAAACGAGTGG]. The enhancers were cloned into 5'MfeI/3'NheI sites of the Cre plasmid. RPBSA: GFP plasmid used as positive control for electroporation (Addgene #60511).
hs52
CAATTTGGAATAACTTTCCCTACCCagtaaattgagcattactctaggattctgagacagagagaaagcacaattttaaaagctttgcagagttcctttgtaattagtcgcagctttccttgaatattaattttccctgcatccctttcaagtggttgagagactgtctctacaactacagagatgcaccctcagaacaacgacagcaagcggcatggtctcaaacacctatgattaagtagtctacagaacgcaccctctgttcttcagtgcagtgtgtagcttatcagtgcaaacagtttaatatttatgctaagaggattgtcaaaagcagcttctgttgctttaattcttgttttaaataaataatgagaacatttaaacacattactcttcttggggccccggggtcagctaatcttattatttatgaagtgatgtgctacataatagtacttagtgcatgttaacagacgctattatcagggccggatgcagagagctgaagatatattagaatgttatgtgtaatgtacgacggattgagtgcataggatgccggtgtagcaattaaccacactcgaaaataggtgtaaagttgaagtatgttttccccggggggatcccctcaccattaataattccccagagaagaagatgtctttcagttaggaaccctctctaccatcaggcttggaaatggggccaggatattccattctttgatctcttcatagtcagtcctacacagtcagaagacaaatagtgagcatgaccactttttaattgatttagacaaaaatggagagaaggcgggggtggagggaggcacatgtgcaatgctcccaagtgtcctcatagtgtttggttttgatccactcgttttcactgccacgtactccaggagagtcgagaaattgttcattccttaatgcaatctgtttccttctctctgatcctcattttgcagataaataaagctaaagccaCTGAGTTTTCCAGGTCACTTTTTA
hs52.1
TACCCAGTAAATTGAGCATTACTCTAGGATTCTGAGACAGAGAGAAAGCACAATTTTAAAAGCTTTGCAGAGTTCCTTTGTAATTAGTCGCAGCTTTCCTTGAATATTAATTTTCCCTGCATCCCTTTCAAGTGGTTGAGAGACTGTCTCTACAACTACAGAGATGCACCCTCAGAACAACGACAGCAAGCGGCATGGTCTCAAACACCTATGATTA
hs52.2
CAGAACGCACCCTCTGTTCTTCAGTGCAGTGTGTAGCTTATCAGTGCAAACAGTTTAATATTTATGCTAAGAGGATTGTCAAAAGCAGCTTCTGTTGCTTTAATTCTTGTTTTAAATAAATAATGAGAACATTTAAACACATTACTCTTCTTGGGGCCCCGGGGTCAGCTAATCTTATTATTTATGAAGTGATGTGCTACATAATAGTACTTAGTGCATGTTAACAGA
hs52.3
CCTCTCTACCATCAGGCTTGGAAATGGGGCCAGGATATTCCATTCTTTGATCTCTTCATAGTCAGTCCTACACAGTCAGAAGACAAATAGTGAGCATGACCACTTTTTAATTGATTTAGACAAAAATGGAGAGAAGGCGGGGGTGGAGGGAGGCACATGTGCAATGCTCCCAAGTGTCCTCATAGTGTTTGGTTTTGATCCACTCGTTTTCACTGCCACGTACTCCAGGAGAGTCGAGAAATTGTTCA