Experimental datasets
In this study, we built a random forest classifier based on a range of genetic and epigenetic features to infer the pathogenic enhancers that harbor noncoding genetic variants. A total of 3865 enhancer-variant pairs were used in this study for training. The training set was composed of 438 disease-associated positive pairs and 3,427 negative pairs.
The positive set was compiled from the DiseaseEnhancer database (Version1.0.1) [7], which manually collected 847 disease-associated enhancers and their associated variants. Enhancers with multiple variants or indels were eliminated to restrict our set to single nucleotide substitutions. To acquire the nonpathogenic enhancer-variant pairs, an initial catalog of enhancer came from a combination of the Roadmap Epigenomics Mapping Consortium (REMC) [8], the Encyclopedia of DNA Elements (ENCODE) [9] Project (identified using ChromHMM [10]), and FANTOM5 [11] database as described in ELMER [12]. Target genes of these enhancers were identified based on the chromatin interactions from 4DGenome database [13]. Then enhancers were eliminated if they were included in the positive set, or their target genes were included in OMIM, DisGeNET, MalaCards or DISEASE database [14-17]. For each nonpathogenic enhancer, the associated variant was randomly selected from the 1000 Genomes Project (minor allele frequency ≥5%) [18].
Feature annotation integration
For each enhancer-variant pair, a 548-dimensional annotations were generated (Supplementary Table S1).
(1) Variant-based features
Mutation pattern. We took five-nucleotide window sequence centered on each variant site from the R package BSgenome.Hsapiens.UCSC.hg19 which contained full genome sequences for homo sapiens as provided by UCSC;
Conservation. To determine the conservation of the variant base in the case of substitutions, PhastCons, phyloP and GERP++ were used [19-21];
Changes of TF bindings. Based on DeepBind [22], we calculated potential change scores of 515 transcription factor bindings mediated by the variant. First the binding score for the reference and variant sequence were calculated using the sequence around the variant (+/- 25bp). Then the potential change scores were defined as .
(2) Region-based features
Conservation. Region-based conservation scores (i.e. PhastCons, phyloP and GERP) were evaluated by averaging over all base pairs for each candidate enhancer [19-21].
Negative selection. SNP density was expressed as the average number of SNPs in each candidate enhancer region based on 1000 Genomes Project phase 3 data [18]. Meanwhile, we took 338,198 regions belonging to "sensitive" and "ultrasensitive" categories found by Fu et al [23]. Frequency of overlap between the candidate enhancer with sensitive or ultrasensitive regions were then be calculated.
Potential TF binding. We took 161 transcription factor binding sites (TFBS) across 91 cell lines from UCSC [24], and motif instances of more than 1000 known and de novo TF motif in human genome (hg19) derived from encode-motif [25]. Then we calculated frequency of overlap between the candidate enhancer with them, respectively.
Epigenetic Activity. We used Coefficient of variation (CV, defined as the ratio of the standard deviation to the mean) and reads rer kilobase million (RPKM) of 5 histone modifications (H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3) across 98 tissue-/cell-types from NIH Roadmap [8].
Disease features. We calculated frequency of overlap between the candidate enhancer GWAS disease SNP [26];
Recurrent features. We calculated frequency of overlap between the candidate enhancer with 6 classes of recurrent regulatory regions (Transcription factor binding peak; DNase I hypersensitive sites; Segway/ChromHMM predicted enhancers; Enhancer distal regulatory modules; site; COSMIC recurrent regulatory variants) which obtained from FunSeq2 [23];
Feature selection and outlier removal
Feature selection and outlier removal were employed to achieve the best performance. The optimal feature set was selected depend on the largest area under the receiver operating characteristic curve (ROC-AUC) value as described in previous study [27]. Briefly, the confidence of each feature was measured by p values based on Wilcoxon rank sum test. Then we reduced one feature with the largest p-value at a time iteratively and evaluated the performance of classifier based on the mean ROC-AUC of 5-fold cross-validation (5FCV). Further, we remove outlier based on proximities to all other cases within each class as described by Yang et al. [28]. For each class, 5% samples with the smallest proximities to all other cases were removed.
IPEV classifier
At last, we built a random forest classifier with 1000 trees, each of which was constructed by randomly selecting the same number of negative pairs as in the positive set, using R package randomForest.
Robustness
To be evaluate whether and how much the disproportionate samples affected the model, we firstly divided the variants into two set based on whether the variant were derived from cancers. The corresponding benign variants were selected randomly as the same size of positive set. Using the new sets of variants, we trained cancer-associated (CM) and noncancer-associated model (NCM) with the same features and parameters, and obtain the average ROC-AUC for 5FCV. Furthermore, we also compared the predicted ROC-AUC values and other measures, such as Matthews Correlation Coefficient (MCC), for each model with each test set as input (Table S2). These procedures were repeated 100 times.
Visualization
To help users intuitively understand how variants affect the activity of enhancers, we provided visualizations for the TFs which were both identified as changes of binding by DeepBind [22] and motifbreakR [29]. For results from motifbreakR [29], we only considered the changes of TF bindings with "strong" effect, and defined "gain of motif" if the alleleRef score is greater than alleleAlt score, otherwise, it is "loss of moitf".