IPEV: a web server for inferring pathogenic enhancers with variants

Background Enhancer has been recognized as an important driver whose genetic alterations contribute to disease progression. However, there is still no easy-to-use tools to identify pathogenic enhancers, allowing for deciphering functional influence of genetic variants on enhancer. Results We developed a user-friendly one-stop shop platform, named inferring pathogenic enhancer with variant (IPEV), only requiring variants as input, to quickly infer the pathogenic enhancers that harbor variants affecting their activities. Results of IPEV are explored in an interactive, user-friendly web environment, which is designed to highlight the most probable pathogenic enhancers and their target genes. Furthermore, IPEV provides intuitive visualizations of how a variant affects the corresponding enhancer activity by mediating TF binding changes. Conclusions IPEV is specially designed to prioritize the potentially pathogenic enhancers with genetic variants, and provides intuitive visualizations how a variant affects the corresponding enhancer activity by mediating which transcription factor binding changes. The use of IPEV does not require any specialized computer skills. We believe that IPEV will be useful in interpreting non-coding variants by the inferring pathogenic enhancers. It is freely available at http://biocc.hrbmu.edu.cn/IPEV/ or http://210.46.80.168/IPEV and supports recent versions of all major browsers.

main drivers contributing to tumor progression [5,6], providing a new understanding of the pathogenesis of cancer. Thus, inferring disease enhancers from substantial non-coding mutations is urgently needed for uncovering novel drivers underlying cancer and deciphering the mechanisms of tumorigenesis.
However, the identification of disease enhancers using non-coding mutations still remain a challenge due to a large number of genomic and epigenetic features required, as well as the binding of many TFs. Furthermore, it is a daunting task to process the large volume and high dimensional data, and build a machine learning model to infer the pathogenic enhancers, especially for biologists and clinicians without a programming background. To date, as there is still no tools for this purpose, we believe it is desirable to develop an easily used online tool to infer pathogenic enhancers from non-coding variants and interpret the potential effect of these variants in the enhancers.
The Inferring Pathogenic Enhancers with Variants (IPEV) is a one-stop shop to quickly infer and return the potential pathogenic enhancers that harbor genetic variants affecting their activities in an interactive and user-friendly web environment. IPEV infers the pathogenic enhancers based on more than 540 genomic and epigenetic features and a random forest classification model, and only requires variant data as input. Further visualization provided by IPEV allows users to investigate which TF bindings in the enhancer were affected by the non-coding variant. With a rapidly increasing interest in enhancers, we believe that IPEV is a timely and valuable tool for the understanding of enhancer deregulation in tumor pathogenesis.

Experimental datasets
In this study, we built a random forest classifier based on a range of genetic and epigenetic features to infer the pathogenic enhancers that harbor noncoding genetic variants. A total of 3865 enhancer-variant pairs were used in this study for training. The training set was composed of 438 disease-associated positive pairs and 3,427 negative pairs.  [14][15][16][17]. For each nonpathogenic enhancer, the associated variant was randomly selected from the 1000 Genomes Project (minor allele frequency ≥5%) [18].

Feature annotation integration
For each enhancer-variant pair, a 548-dimensional annotations were generated (Supplementary Table S1).

(1) Variant-based features
Mutation pattern. We took five-nucleotide window sequence centered on each variant site from the R package BSgenome.Hsapiens.UCSC.hg19 which contained full genome sequences for homo sapiens as provided by UCSC; Conservation. To determine the conservation of the variant base in the case of substitutions, PhastCons, phyloP and GERP++ were used [19][20][21]; Changes of TF bindings. Based on DeepBind [22], we calculated potential change scores of 515 transcription factor bindings mediated by the variant. First the binding score for the reference and variant sequence were calculated using the sequence around the variant (+/-25bp). Then the potential change scores were defined as .
Negative selection. SNP density was expressed as the average number of SNPs in each candidate enhancer region based on 1000 Genomes Project phase 3 data [18]. Meanwhile, we took 338,198 regions belonging to "sensitive" and "ultrasensitive" categories found by Recurrent features. We calculated frequency of overlap between the candidate enhancer with 6 classes of recurrent regulatory regions (Transcription factor binding peak; DNase I hypersensitive sites; Segway/ChromHMM predicted enhancers; Enhancer distal regulatory modules; site; COSMIC recurrent regulatory variants) which obtained from FunSeq2 [23];

Feature selection and outlier removal
Feature selection and outlier removal were employed to achieve the best performance.
The optimal feature set was selected depend on the largest area under the receiver operating characteristic curve (ROC-AUC) value as described in previous study [27].
Briefly, the confidence of each feature was measured by p values based on Wilcoxon rank sum test. Then we reduced one feature with the largest p-value at a time iteratively and evaluated the performance of classifier based on the mean ROC-AUC of 5-fold crossvalidation (5FCV). Further, we remove outlier based on proximities to all other cases within each class as described by Yang et al. [28]. For each class, 5% samples with the smallest proximities to all other cases were removed.

IPEV classifier
At last, we built a random forest classifier with 1000 trees, each of which was constructed by randomly selecting the same number of negative pairs as in the positive set, using R package randomForest.

Robustness
To be evaluate whether and how much the disproportionate samples affected the model, we firstly divided the variants into two set based on whether the variant were derived from cancers. The corresponding benign variants were selected randomly as the same size of positive set. Using the new sets of variants, we trained cancer-associated (CM) and noncancer-associated model (NCM) with the same features and parameters, and obtain the average ROC-AUC for 5FCV. Furthermore, we also compared the predicted ROC-AUC values and other measures, such as Matthews Correlation Coefficient (MCC), for each model with each test set as input (Table S2). These procedures were repeated 100 times.

Visualization
To help users intuitively understand how variants affect the activity of enhancers, we provided visualizations for the TFs which were both identified as changes of binding by DeepBind [22] and motifbreakR [29]. For results from motifbreakR [29], we only considered the changes of TF bindings with "strong" effect, and defined "gain of motif" if the alleleRef score is greater than alleleAlt score, otherwise, it is "loss of moitf".

Overview of the IPEV approach
In this study, we built a random forest classifier (IPEV) based on genetic and epigenetic features to infer the pathogenic enhancers that harbor non-coding genetic variants.  Table S1). After feature selection and outlier removal, we built a random forest classifier with 1000 trees, each of which was constructed through randomly selecting the same number of negative pairs as in the positive set, using R package randomForest ( Figure 1A). The detailed procedures are described in Materials and Methods.

Performance and robustness evaluation
The performance of IPEV was evaluated using 5-fold cross validation (5FCV) and leaveone-out cross validation (LOOCV). The average area under the receiver operating characteristic curve (AUC-ROC) was 0.912 for 5-fold cross validation and 0.916 for LOOCV ( Figure 2A). Considering the imbalance of positive and negative set, we further evaluated the method using balanced dataset by down-sampling the negative set. In this situation, the average AUC-ROC and MCC was 0.907 and 0.70 for 5FCV (Supplementary Table S2

IPEV input and interface
As show in Figure 1B, IPEV prompts the user to submit single nucleotide variants in tabseparated values (TSV) format or variant call format (VCF). Optionally, users can consider providing an email address for communicating job status. Once variant data submitted, IPEV performs filtering and maps variants to known enhancers. Next, a large number of genomic and epigenetic features for each enhancer-variant pair will be calculated and the potential impacts of TF bindings mediated by variants will be computed using motifbreakR [29]. Finally, according to these annotation features, IPEV calculates risk scores for all enhancer-variant pairs through the random forest classification model and then prioritizes pathogenic enhancers affected by variants. Figure 1C shows the inferred result table of the example provided by the website. A summary (top panel) provides basic statistic information after the mapping of variants to known enhancers. The result table displays an interactive spreadsheet, recording the inferred pathogenic enhancers and their associated variants. What's more, the gain or loss of TF bindings mediated by variants and the corresponding visualizations are provided in the column "Changes of TF bindings", which can help users intuitively understand how variants affect the activity of these enhancers. Any visualization in this table can be enlarged by clicking the images. In Figure 1C, a C>A substitution at chr6:28949469 was inferred to affect a pathogenic enhancer (chr6:28941202-28949600), in which the variant was predicted to create the consensus binding motif for the interferon-regulatory factor (IRF) proteins family and disrupt the binding motif for the transcription factor TFCP2 and Gamma-butyrobetaine dioxygenase (BBOX). The details for the variant affecting TF consensus binding sites are shown in motif sequence logo plots.
In addition, the result table also provides some extra annotation information for each pathogenic enhancer, including the target genes identified by enhancer-promoter looping across tissue-/cell-types, recurrent mutation in cancers, the presence of sensitive or ultra-sensitive regions, the SNP density calculated using data from 1000 Genome project, the coefficient of variation of epigenetic modifications (including H3K4me1 and H3K27ac) across 98 tissue-/cell-types, and the average conservation score (phastCons) [19]. All of the result table and the visualizations could be downloaded as a compressed file.

IPEV could predict deleterious regulatory variants in enhancers
To highlight the usefulness and performance of IPEV, we reanalyzed the experimentally validated pathogenic enhancer-variant pairs in the updated DiseaseEnhancer (version , which were not presented in the training set. Since the majority of enhancervariant pairs in the updated DiseaseEnhancer were derived from breast cancer, we thus used these variants as independent test set. We applied IPEV to the 71 single nucleotide variants collected from breast cancer. Among these variants, 3 were filtered out due to the inconsistency between the reference nucleotide and reference genome, and 18 not located within any candidate enhancer regions were also removed. Thus, the remaining 50 variants were further processed by IPEV. As a result, 90% of variants (45/50) were predicted to be located in 26 pathogenic enhancers. Furthermore, IPEV also provides a visual indication of how these variants affect these enhancers by changing TF binding. For example, the down-regulation of ATF7IP mediated by C-T transition at chr12:14410634 (rs11055880) was experimentally validated, but how this variant involved in the regulation was still unknown [30]. IPEV indicated that this variant may lead to gain of FOXK1 motif, recruiting epigenetic modifying complex [31], and thus repressing gene expression. Similarly, applying IPEV to experimentally validated pathogenic enhancer-variant pairs in chronic kidney disease, 71.43% of the variants (10/14) were predicted to be linked with pathogenic enhancers.
The above cases indicated that IPEV was highly sensitive and unbias for disease types.
IPEV is designed to quickly infer the potential pathogenic enhancers by only requiring variants as input. IPEV further provides the target genes of the pathogenic enhancers and intuitive visualizations of which TFs may be affected by the variants. The easy usage of IPEV allows any investigators to easily infer the pathogenic enhancers and interpret the associated variants without additional expertise in computer programming. It is particular important for biologists and clinicians, as they may reach different insights through hands-on analysis using IPEV rather than working through an intermediary quantitative scientist, due to the different mindset between them.
The next update of IPEV will include more non-coding annotations. A genome browser will be also integrated to support the visualization of multi-omics data, including the chromatin interactions identified by Hi-C, for intuitively deciphering the potential pathogenesis. We believe that IPEV will serve as a powerful platform for inferring the pathogenic enhancers and interpreting the associated non-coding variants.