5.1 Data and preprocessing
We used the same Hi-C data as [2]: the Hi-C data of seven cell lines, K562 (mesoderm-lineage cells from a patient with leukemia), GM12878 (lymphoblastoid cells), HeLa-S3 (ectoderm-lineage cells from a patient with cervical cancer), HUVEC (umbilical vein endothelial cells), IMR90 (fetal lung fibroblasts), NHEK (epidermal keratinocytes) and HMEC (mammary epithelial cells), were downloaded from Gene Expression Omnibus (GEO) GSE63525. The human reference genome hg19 was used to define genomic locations. Promoters and active enhancers in the first four cell lines were identified using segmentation-based annotations from both ENCODE Segway [46] and ChromHMM of Roadmap Epigenomics [47]; only ChromHMM annotations were used for the other three cell lines. Then, RNA-seq data from ENCODE were used to select active promoters for each cell line, requiring a mean FPKM > 0.3 with an irreproducible discovery rate < 0.1. The genome-wide Hi-C measurements were used to annotate all enhancer-promoter pairs as interacting or non-interacting in each cell type. Only pairs whose enhancer-promoter distance is greater than 10 kb and less than 2 Mb were considered [2]. To exclude the effect of distance on determining EPIs, interacting enhancer-promoter pairs were assigned to one of five bins by quantile discretization of the enhancer-promoter distance. Random non-interacting pairs of active enhancers and promoters were assigned to their corresponding bins and then subsampled within each bin to match the number of positive samples; the subsampled non-interacting pairs were used as negative samples. Table 4 gives the numbers of positive and negative pairs in each cell line.
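To make the bin-balanced subsampling concrete, the following is a minimal pandas-based sketch; the data frame 'pairs' with columns 'distance' (bp) and 'label' (1 = interacting, 0 = non-interacting) is our assumption, not the authors' actual data structure.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of the distance-matched negative subsampling; we also
# assume each distance bin contains at least as many negatives as positives.
def subsample_negatives(pairs, n_bins=5, seed=0):
    rng = np.random.default_rng(seed)
    pos = pairs[pairs['label'] == 1].copy()
    neg = pairs[pairs['label'] == 0].copy()
    # Bin edges from quantiles of the positive enhancer-promoter distances.
    edges = np.quantile(pos['distance'], np.linspace(0, 1, n_bins + 1))
    pos['bin'] = np.digitize(pos['distance'], edges[1:-1])
    neg['bin'] = np.digitize(neg['distance'], edges[1:-1])
    sampled = []
    for b, pos_bin in pos.groupby('bin'):
        candidates = neg.index[neg['bin'] == b]
        # Draw as many negatives as there are positives in this bin.
        sampled.append(neg.loc[rng.choice(candidates, len(pos_bin), replace=False)])
    return pos.drop(columns='bin'), pd.concat(sampled).drop(columns='bin')
```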
For each positive/negative sample, the enhancer sequence is extended or trimmed to a 3 kb region centered on the enhancer location, and the promoter sequence is extended or trimmed to a 2 kb region centered on the promoter location. The one-hot encoded enhancer and promoter sequences are used as the input data of the model.
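A minimal sketch of this fixed-length one-hot encoding follows; treating 'N' bases and out-of-range padding as all-zero rows is our assumption, not necessarily the authors' choice.

```python
import numpy as np

BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_fixed(seq, target_len):
    """Encode a sequence as a (target_len, 4) one-hot matrix centered on seq."""
    seq = seq.upper()
    start = len(seq) // 2 - target_len // 2   # window centered on the sequence
    mat = np.zeros((target_len, 4), dtype=np.float32)
    for i in range(target_len):
        j = start + i
        if 0 <= j < len(seq) and seq[j] in BASE_INDEX:
            mat[i, BASE_INDEX[seq[j]]] = 1.0
    return mat

# e.g. one_hot_fixed(enhancer_seq, 3000) -> (3000, 4) matrix;
#      one_hot_fixed(promoter_seq, 2000) -> (2000, 4) matrix.
```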
We examined the number of overlapping positive EPIs between any two cell lines. Two positive EPI pairs from two cell lines are considered to overlap if both the positions of their enhancers and the positions of their promoters are the same. By this comparison, the positive samples of different cell lines have very little overlap (Table S9).
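This overlap criterion amounts to exact matching of both coordinate pairs; a one-function sketch under that assumption:

```python
# Each EPI is assumed to be a pair of genomic coordinate tuples,
# ((chrom, start, end) for the enhancer, (chrom, start, end) for the promoter).
def count_overlap(epis_a, epis_b):
    # Two positive pairs overlap when both the enhancer coordinates and the
    # promoter coordinates match exactly.
    return len(set(epis_a) & set(epis_b))
```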
For comparison with RIPPLE based on epigenetic data, we used data sets from the Roadmap project for the six cell lines. Because we want to make predictions across cell lines, we downloaded the peak files of the 14 data sets that are measured in all six cell lines: CTCF, POLR2A, H2AZ, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1 and DNase-seq. An enhancer or promoter sample is represented as a binary vector in which each dimension corresponds to one of the epigenetic data sets. The feature vectors of the enhancer and the promoter are concatenated to represent each EPI pair. In addition, we used two further features to represent each EPI pair: the Pearson's correlation of the 14 signals associated with the enhancer-promoter pair, and the mRNA level of the gene associated with the promoter.
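A hypothetical assembly of this feature vector (14 binary enhancer features + 14 binary promoter features + the signal correlation + the promoter's mRNA level, 30 dimensions in total) is sketched below; all variable names are ours.

```python
import numpy as np
from scipy.stats import pearsonr

def epi_features(enh_marks, prom_marks, enh_signals, prom_signals, mrna_level):
    # enh_marks/prom_marks: length-14 binary peak-overlap vectors;
    # enh_signals/prom_signals: length-14 signal vectors for the pair.
    corr = pearsonr(enh_signals, prom_signals)[0]   # correlation of the 14 signals
    return np.concatenate([enh_marks, prom_marks, [corr, mrna_level]])
```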
5.2 SEPT
In domain adaptation [28], the source domain and the target domain share the same feature space and categories, but their feature distributions differ. The source domain has abundant supervised label information, while the target domain has no or few labels; the information-rich source domain samples are used to improve the performance of the model on the target domain. Because SEPT adopts the idea of domain adaptation, we describe the input data in domain adaptation terms. Focusing on the task of predicting EPIs across cell lines, we assume that no labeled training data are available in the test cell line and that only the locations of enhancers and promoters are provided. We can therefore exploit the abundant label information of other cell lines (called the source domain) to improve EPI prediction in new cell lines without labeled data (called the target domain).
An overview of SEPT is shown in Figure 4(a). To predict the EPIs in cell line #A, unlike the two existing approaches, which extract cell-line-specific features (Figure 4(b)) or shared features (Figure 4(c)), SEPT uses the rich information of cell lines #B and #C to extract the features relevant to the EPIs in cell line #A through transfer learning (TL). As shown in Figure 4(a), SEPT mainly consists of a feature extractor, a domain discriminator and an EPI predictor. SEPT simultaneously trains two classifiers, the main label classifier and the domain discriminator, which share the feature extractor layers. It is worth mentioning that we used a gradient reversal layer (GRL) to design the domain discriminator. The GRL reverses the direction of the gradient during back propagation but acts as the identity during forward propagation. The mixed labeled EP pairs of cell lines #B and #C are used as the source domain data, and the unlabeled EP pairs of cell line #A are used as the target domain data. Each training sample has a domain label: 0 indicates that the sample belongs to the source domain, and 1 indicates that it belongs to the target domain. Each mini-batch contains an equal number of samples from the source and target domains. In each training iteration, the parameters of the feature extractor layers and the EPI label predictor layers are updated on the source domain data, while the parameters of the feature extractor layers and the domain discriminator layers are updated on both the source and target domain data. In other words, in each training iteration the feature extractor layers learn EPI-related features from the samples of cell lines #B and #C during the first back propagation, while during the second back propagation the GRL pushes the features learned by the feature extractor layers to become unable to distinguish which cell line the samples come from. As training goes on, SEPT gradually learns features that are related to EPIs but not to cell lines.
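The GRL admits a standard DANN-style implementation: identity in the forward pass, gradient multiplied by a negative constant in the backward pass. The following Keras/TensorFlow sketch is a common pattern, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras.layers import Layer

@tf.custom_gradient
def reverse_gradient(x, lam=1.0):
    # Forward pass: identity. Backward pass: flip (and scale) the gradient.
    def grad(dy):
        return -lam * dy
    return tf.identity(x), grad

class GradientReversal(Layer):
    def __init__(self, lam=1.0, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam

    def call(self, inputs):
        return reverse_gradient(inputs, lam=self.lam)
```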
The feature extractor consists of two convolution layers, two max-pooling layers, two dropout layers, and one recurrent long short-term memory (LSTM) layer. The domain discriminator consists of the GRL, one dense layer, one dropout layer, and the output. The EPI predictor consists of one dense layer, one dropout layer, and the output. Since informative features may differ between enhancers and promoters, we use separate convolution, rectified linear unit (ReLU) and max-pooling layers for enhancers and promoters. Thus, the inputs are two one-hot matrices representing the enhancer and promoter sequences, respectively. Because a large number of kernels can extract the features sufficiently, and motif features of DNA sequences are shorter than 40 bp, each convolution layer consists of 300 kernels of length 40. The max-pooling layers reduce the output dimension with pool length 20 and stride 20. The outputs of the two branches are concatenated into one tensor, which is fed into a dropout layer with dropout rate 0.25; the dropout layer randomly drops part of its input to avoid overfitting. The recurrent LSTM layer, with output dimension 100, extracts feature combinations of the two branches. For the domain discriminator, the output of the LSTM layer feeds into the GRL, and a dense layer of 50 units with ReLU activations maps the learned features to the domain label space; after a dropout layer with rate 0.5, the output feeds into a sigmoid unit to predict the domain probability. For the EPI predictor, the output of the LSTM layer feeds into a dense layer of 100 units with ReLU activations, which maps the learned features to the sample label space; after a dropout layer with rate 0.5, the output feeds into a sigmoid unit to predict the interaction probability.
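The architecture can be sketched in Keras with the hyper-parameters given above, reusing the GradientReversal layer from the previous sketch; the concatenation axis is our assumption.

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import (Input, Conv1D, MaxPooling1D, Concatenate,
                                     Dropout, LSTM, Dense)

def build_sept(lam=1.0):
    enh_in = Input(shape=(3000, 4), name='enhancer')
    prom_in = Input(shape=(2000, 4), name='promoter')

    def branch(x):
        x = Conv1D(300, 40, activation='relu')(x)         # 300 kernels, length 40
        return MaxPooling1D(pool_size=20, strides=20)(x)  # pool length 20, stride 20

    merged = Concatenate(axis=1)([branch(enh_in), branch(prom_in)])
    merged = Dropout(0.25)(merged)
    features = LSTM(100)(merged)                          # shared feature extractor

    # Domain discriminator head, behind the gradient reversal layer.
    d = Dense(50, activation='relu')(GradientReversal(lam)(features))
    domain_out = Dense(1, activation='sigmoid', name='domain')(Dropout(0.5)(d))

    # EPI label predictor head.
    y = Dense(100, activation='relu')(features)
    epi_out = Dense(1, activation='sigmoid', name='epi')(Dropout(0.5)(y))

    return Model([enh_in, prom_in], [epi_out, domain_out])
```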
Fig. 4. Deep neural network-based methods for predicting EPIs. (a) SEPT architecture. SEPT uses the feature extractor to learn EPI-related features from the mixed cell line data; meanwhile, the domain discriminator with the transfer learning mechanism removes the cell-line-specific features and retains the features independent of the cell line. The EPI label predictor identifies the EPIs in the new cell line based on the learned features. (b) A previous type of model is trained on data of a specific cell line, in which the training and test data both come from the same cell line. (c) Mixed cell line data are used to train a general model for predicting EPIs in a new cell line.
5.2.1 SEPT model training
We trained SEPT for 80 epochs with mini-batches of 64 samples by back-propagation. In the training phase, the source domain data were used to train the feature extractor and the EPI predictor, and both the source and target domain data were used to train the feature extractor and the domain discriminator. SEPT seeks to minimize both the EPI label loss and the domain discriminator loss. The binary cross-entropy loss is used for both the EPI label predictor and the domain discriminator and is minimized by stochastic gradient descent (SGD) with a learning rate initialized to 0.001. In view of the two optimization objectives, SEPT learns a representation that is discriminative for EPI prediction and indistinguishable for domain prediction. The objective function of SEPT is defined as follows:

$$E(\theta_f,\theta_y,\theta_d)=\frac{1}{N}\sum_{i=1}^{N}L_y\big(G_y(G_f(x_i;\theta_f);\theta_y),\,y_i\big)-\lambda\,\frac{1}{N+M}\sum_{i=1}^{N+M}L_d\big(G_d(G_f(x_i;\theta_f);\theta_d),\,d_i\big)$$

Here, $L_y$ is the loss of the EPI label predictor, $L_d$ is the loss of the domain discriminator, $G_f$ is a mapping from the input $x$ to a feature vector, $G_y$ is a mapping from the feature vector to the EPI label, $G_d$ is a mapping from the feature vector to the domain label, $x_i$ is the $i$-th sample, $\theta_f$, $\theta_y$ and $\theta_d$ are the parameters of $G_f$, $G_y$ and $G_d$, respectively, $y_i$ is the EPI label of the $i$-th sample, $d_i$ is the domain label of the $i$-th sample, $N$ is the number of labeled EPI training samples, $M$ is the number of unlabeled training samples (which still carry domain labels), and $\lambda$ is a constant that controls the tradeoff between the two objectives.
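A hypothetical compilation matching this description is shown below: binary cross-entropy on both heads and SGD with learning rate 0.001. Because the GRL already flips the domain gradient (and scales it by $\lambda$), both losses can simply be minimized here; during training, target-domain samples would receive a zero sample weight on the 'epi' output, since they carry no EPI label.

```python
from tensorflow.keras.optimizers import SGD

model = build_sept(lam=1.0)
model.compile(optimizer=SGD(learning_rate=0.001),
              loss={'epi': 'binary_crossentropy',
                    'domain': 'binary_crossentropy'})
```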
This is a minimax optimization problem; that is, we attempt to find a saddle point of the functional $E(\theta_f,\theta_y,\theta_d)$ delivered by the parameters $\hat{\theta}_f$, $\hat{\theta}_y$ and $\hat{\theta}_d$. At the saddle point, the losses of the EPI label predictor and the domain discriminator are minimized with respect to $\theta_y$ and $\theta_d$, respectively. The loss of the EPI label predictor is also minimized with respect to the feature mapping parameters $\theta_f$, while the loss of the domain discriminator is maximized with respect to $\theta_f$ on account of the GRL, which changes the sign of the gradient during back-propagation. In the end, SEPT learns features that are both discriminative and domain-invariant, and the features learned from the cell lines with label information (source domain) are effective for the new cell lines (target domain).
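Written out, this saddle point takes the standard DANN-style form, which is an equivalent way of expressing the optimization just described:

$$(\hat{\theta}_f,\hat{\theta}_y)=\arg\min_{\theta_f,\theta_y}E(\theta_f,\theta_y,\hat{\theta}_d),\qquad \hat{\theta}_d=\arg\max_{\theta_d}E(\hat{\theta}_f,\hat{\theta}_y,\theta_d).$$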
The training procedure of SEPT can be described as follows: i) randomly split the target domain dataset into two approximately equal parts, one used as training data, in which each sample has a domain label but no EPI label, and the other used as test data; ii) mix the data from the other six cell lines and randomly shuffle them, and use the mixed data as the source domain dataset, in which each sample has both a domain label and an EPI label; iii) train SEPT with the source domain and target domain data; iv) evaluate the performance of SEPT on the test data of the target domain.
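Steps i) and ii) can be sketched as follows; 'target_pairs' (an array of encoded EP pairs for the target cell line) and 'sources' (a list of per-cell-line arrays) are assumed inputs, and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# i) Randomly split the target cell line into two approximately equal halves.
perm = rng.permutation(len(target_pairs))
half = len(target_pairs) // 2
target_train = target_pairs[perm[:half]]   # domain labels only, no EPI labels
target_test = target_pairs[perm[half:]]    # held out for evaluation

# ii) Mix the six labeled source cell lines and shuffle.
source = np.concatenate(sources)
source = source[rng.permutation(len(source))]
```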
All the experiments are implemented in Python using the scikit-learn machine learning library [48] and the Keras framework (https://keras.io/) with TensorFlow as the back-end [49].
5.3 Evaluation metrics
We adopted the metrics of Accuracy, Precision, Recall, F1-score, AUC and AUPR to assess the performance of SEPT. These metrics are defined as follows [50-53]:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{Precision}=\frac{TP}{TP+FP},$$
$$\mathrm{Recall}=\frac{TP}{TP+FN},\qquad \mathrm{F1\text{-}score}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},$$

where TP is the number of correctly predicted EPIs, TN is the number of correctly predicted non-EPIs, FP is the number of incorrectly predicted EPIs and FN is the number of incorrectly predicted non-EPIs. AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true-positive rate (i.e., sensitivity) as a function of the false-positive rate (i.e., 1 − specificity) over various thresholds. AUPR is the area under the precision-recall curve, which plots precision as a function of recall over various thresholds.
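All six metrics can be computed with scikit-learn, which the experiments already use; taking average_precision_score as the AUPR estimator and 0.5 as the classification threshold are our assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

def evaluate(y_true, y_score, threshold=0.5):
    # y_score holds the predicted interaction probabilities.
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {'Accuracy':  accuracy_score(y_true, y_pred),
            'Precision': precision_score(y_true, y_pred),
            'Recall':    recall_score(y_true, y_pred),
            'F1-score':  f1_score(y_true, y_pred),
            'AUC':       roc_auc_score(y_true, y_score),
            'AUPR':      average_precision_score(y_true, y_score)}
```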