Enhancer identification is an important problem in genomics. Previous work has cataloged and explored many complementary approaches to enhancer prediction and validation; however, prioritizing functional sequences remains challenging [1], [6]. In this project, we sought to evaluate whether enhancers predicted by multiple identification strategies (shared enhancers) have systematically different sequences from those identified by only a small number of approaches. We hypothesized that sequences supported by a larger body of evidence would correspond to higher-confidence, functional sequences, while those identified by fewer methods would be more likely to be false positives.
We trained several random forest classification models to distinguish shared from unique enhancer sequences, and to distinguish predicted enhancer sequences from sequences drawn from the genomic background. We found that the shared vs. unique models achieved significantly lower accuracy than either the shared vs. random or the unique vs. random models. We suspect that this is because both the shared and the unique enhancer regions are identified by at least one identification strategy, suggesting that they have at least some characteristics of functional sequences. The random regions were not identified by any approach, and thus were less likely to overlap regulatory sequence patterns. However, the poor performance of the shared vs. unique classifier suggests that the commonly used practice of overlapping regions to identify true enhancers may not provide a significant benefit.
A critical step of our analysis was the standardization of all enhancer sequences to 500 bp, a length chosen to fall within the typical 200–1000 bp range of enhancers [16]. Before this step was added, our preliminary RF models performed with high accuracy (in the 90–95% range). However, an important confounder had been inflating model performance: sequence length. Extending shorter sequences and trimming longer sequences to the 500 bp standard allowed us to remove length as a confounding factor. Limitations remain in the interpretation of our results, however, because no other enhancer lengths were tested. Sequences were trimmed or extended about their midpoint, with reference to the genome, to capture the core "signal" that could aid the random forest model in identifying enhancer sequences. It is possible that this core signal was not captured upon trimming. Furthermore, the signal could be diluted by extending enhancers to the 500 bp mark. Although our model results are thorough for a 500 bp enhancer standard, further testing is required to determine whether they hold for other enhancer lengths within the typical 200–1000 bp range.
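The midpoint-centered standardization described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this project: the function name `resize_interval`, the 0-based half-open coordinate convention, and the clamping behavior at chromosome ends are all assumptions made for the example.

```python
def resize_interval(start, end, target_len=500, chrom_len=None):
    """Return a (start, end) interval of target_len bp centered on the
    midpoint of the input interval.

    Coordinates are assumed to be 0-based, half-open (BED-style).
    Shorter intervals are extended and longer ones trimmed, both
    symmetrically about the midpoint. The hypothetical chrom_len
    argument, if given, keeps the window inside the chromosome.
    """
    mid = (start + end) // 2
    new_start = mid - target_len // 2
    new_end = new_start + target_len

    # Shift the window right if it would start before the chromosome.
    if new_start < 0:
        new_start, new_end = 0, target_len

    # Shift the window left if it would run past the chromosome end.
    if chrom_len is not None and new_end > chrom_len:
        new_end = chrom_len
        new_start = max(0, chrom_len - target_len)

    return new_start, new_end


# A 200 bp region is extended; a 2000 bp region is trimmed.
short_region = resize_interval(1000, 1200)
long_region = resize_interval(1000, 3000)
```

The resized coordinates would then be used to fetch sequence from the reference genome, so that extended regions contain real flanking sequence rather than padding. Because every output has the same length, length can no longer separate the classes.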
Although the model distinguishes well between random and actual enhancer regions, further analysis is needed to determine which factors contribute to the different classifications. We aim to use a set of software tools known as the MEME Suite. Specifically, SEA is a tool that uses a database of known TFBSs and tests each of them for enrichment in a set of sequences [17]. Our input set of sequences is the overall set of enhancer regions, both shared and unique. By running these regions through SEA, we can find the TFBSs enriched within our enhancer regions. The overall k-mer frequencies for the enriched TFBSs can then be calculated with the algorithm described above to determine which 3-mers or 4-mers are most frequent within enhancer regions. This analysis will also be instrumental in determining the TFBSs enriched within enhancers specific to liver cells. Malfunctioning liver enhancers could then be pinpointed using the TFBS frequency distributions and studied as targets for genetic therapy.
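For reference, the k-mer frequency calculation mentioned above can be sketched in a few lines. This is an assumed, simplified version and may differ from the algorithm used in this project: the function name `kmer_frequencies` is hypothetical, windows containing non-ACGT characters (such as N) are skipped, and the two DNA strands are not merged.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return the relative frequency of every possible DNA k-mer in seq.

    Overlapping windows are counted. Windows containing characters
    other than A, C, G, T are skipped (an assumption of this sketch).
    The result has 4**k entries, so absent k-mers appear with 0.0.
    """
    seq = seq.upper()
    alphabet = "ACGT"
    counts = Counter()

    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in alphabet for base in kmer):
            counts[kmer] += 1

    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {
        kmer: (counts[kmer] / total if total else 0.0)
        for kmer in all_kmers
    }


# Toy sequence for illustration only, not project data.
freqs = kmer_frequencies("ACGTACGT", k=3)
top_kmers = sorted(freqs, key=freqs.get, reverse=True)[:2]
```

Applied to sequences containing enriched TFBSs, the resulting vectors could be averaged per class to compare 3-mer or 4-mer distributions between enhancer and background regions.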
Within the genomics community, it is commonly assumed that regions identified by multiple enhancer identification methods are more likely to be actual enhancers. However, our results show that, for the 500 bp standard and the random forest pipeline used here, there is likely no significant difference in sequence composition between shared enhancers identified by numerous methods and unique enhancers identified by only a few. Therefore, given that enhancers are notoriously difficult to locate within the genome, it is likely beneficial to classify regions identified by any of the eight established methods described above as true enhancer regions. A more inclusive selection process for true enhancers could help avoid omitting potential disease-associated regulatory regions from available enhancer databases.