Cell type-specific enhancers, cis-regulatory elements that up-regulate gene transcription in a given cell type, play a key role in determining the regulatory landscape of the human genome (1). Enhancers are commonly located in the introns of their target genes or immediately upstream of the transcription start site (TSS), but they are also known to populate gene deserts (2), reside in introns of neighboring genes (3) and co-localize with coding exons (4). Enhancer mutations are often associated with diseases (5–7). Accurate prediction of enhancers from DNA sequences is the basis for assessing whether one or more mutations could disrupt an enhancer’s activity, a mechanism underlying genetic diseases.
Predicting enhancers based on transcription factor binding sites (TFBS) was proposed because TFBS tend to be conserved over vertebrate evolution (8–10). To mitigate the uncertainty in conservation and TFBS information, direct sequence features such as k-mers were used to build enhancer prediction models (11, 12). These early studies neither achieved high prediction accuracy nor were able to distinguish enhancers of different cell types.
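As an illustration of what a k-mer feature representation looks like, the following is a minimal sketch that converts a DNA sequence into a fixed-length vector of k-mer counts. The function name `kmer_counts` and the choice to skip k-mers containing ambiguous bases are assumptions made for this example; they are not taken from the cited studies.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Return counts of all 4**k DNA k-mers in seq, in lexicographic order.

    Hypothetical helper for illustration. k-mers containing characters
    other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Emit counts in a fixed order so vectors are comparable across sequences.
    return [observed["".join(p)] for p in product("ACGT", repeat=k)]


features = kmer_counts("ACGTAC", 2)
```

For k = 2 the vector has 16 entries ordered AA, AC, AG, and so on; a classifier would typically take such vectors, possibly normalized by sequence length, as input.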
With the wide application of ChIP-seq technologies, enhancers have frequently been profiled on a genome-wide scale (13). The ENCODE project produced genome-wide profiles of various epigenetic marks for multiple human cell types (14). By applying a hidden Markov model (ChromHMM) to these epigenetic marks, the human genome has been segmented into more than ten chromatin states, including enhancers (15, 16). The “strong enhancer” state, shown to be associated with increased gene expression, provides genome-wide positioning of active enhancers for a cell type (15). Although the availability of these datasets renders positioning of enhancers unnecessary, the sequence structures of enhancers, especially their subtle differences among cell types, can be useful in understanding cell type-specific gene regulation and should be explored.
The effectiveness of enhancer classifiers depends on proper generation of negative sequences. Negative sequences should share basic sequence features with enhancers, such as length distribution, GC content and repeat content (12, 17, 18); otherwise, enhancer classifiers may learn differences in nucleotide composition rather than occurrences of key DNA motifs. Although many studies on sequence-based enhancer prediction have been published, it is still unknown whether these enhancer classifiers can distinguish enhancers from different cell types or tissues.
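The matching requirement can be sketched in code. The example below draws, for a given enhancer, a background window of the same length whose GC content falls within a tolerance of the enhancer's. It is a simplified sketch under stated assumptions: the names `gc_content` and `sample_matched_negative`, the tolerance of 0.05 and the rejection-sampling strategy are all hypothetical, repeat content is not matched, and this is not presented as the procedure used in the cited studies or in this work.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sample_matched_negative(enhancer, background, tolerance=0.05,
                            max_tries=1000, rng=None):
    """Draw a background window matching the enhancer's length and GC content.

    Hypothetical helper: uses simple rejection sampling and returns None
    if no window within `tolerance` is found in `max_tries` draws.
    """
    rng = rng or random.Random()
    n = len(enhancer)
    if n == 0 or n > len(background):
        return None
    target = gc_content(enhancer)
    for _ in range(max_tries):
        start = rng.randrange(len(background) - n + 1)
        candidate = background[start:start + n]
        if abs(gc_content(candidate) - target) <= tolerance:
            return candidate
    return None


# Toy data: a real background would be non-enhancer genomic sequence.
background = "ACGT" * 50
negative = sample_matched_negative("AACCGGTT", background,
                                   rng=random.Random(0))
```

In practice, candidate windows overlapping known enhancers would also need to be excluded, which this sketch does not attempt.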
The sequence structures of enhancers may not be linear or additive. In fact, there could be complex grammar or semantics among the different DNA elements that compose an enhancer (19, 20). In recent years, deep learning technologies have gained greater popularity than conventional machine learning methods and have been adapted in biomedical research to address complex research questions (21–29). Deep learning may therefore be more powerful in classifying enhancers. In this study, we propose SeqEnhDL, a deep learning framework for classification of cell type-specific enhancers based on sequence features. The effectiveness and advantages of SeqEnhDL are demonstrated using the chromatin state segmentation data of nine cell types from the ENCODE project (14).