Accounting for trans acting mechanisms in addition to cis regulatory mechanisms improved gene expression prediction significantly.
We built gene expression prediction models using information from GRNs, applying TF-TG edge-weights from these networks as predictors, under the hypothesis that accounting for trans-acting mechanisms would improve model accuracy. We then compared our results to those obtained from using TF-TG affinity scores calculated based on cis-regulatory features using the TEPIC algorithm. The workflow for our approach is provided in Figure 1. The number of TFs, TGs and TFBS corresponding to different ChIP-seq processing algorithms for both cell lines is provided in Table 1. All validation results within the HepG2 cell-line are shown in the supplementary section and in Supplementary Figure S1.
Table 1: Number of TFs, TGs and TFBS obtained from different ChIP-seq processing algorithms for GM12878 and K562 cell lines. All the ChIP-seq data for the analysis was downloaded from the ENCODE database. We have also included the TFBS and interactions identified for latter analysis.
|
GM12878
|
K562
|
|
TFs
|
TGs
|
TFBS
|
Unique TF-TG Pairs
|
TFs
|
TGs
|
TFBS
|
Unique TF-TG Pairs
|
Pos ChIP-Seq
|
149
|
17,106
|
4,209,133
|
1,216,272
|
309
|
18,190
|
11,614,248
|
2,372,274
|
FIMO
|
85
|
16,850
|
2,444,195
|
714,167
|
110
|
18,173
|
7,349,429
|
1,138,823
|
TEPIC
|
80
|
11,784
|
-
|
517,226
|
86
|
10,239
|
-
|
880,554
|
Promoter
|
149
|
11,509
|
458,959
|
276,138
|
308
|
15,668
|
1,293,933
|
681,847
|
Distal
|
149
|
16,964
|
3,750,174
|
1,128,079
|
309
|
18,152
|
10,320,315
|
2,312,490
|
Intronic
|
149
|
17,106
|
5,896,338
|
1,378,129
|
309
|
18,224
|
14,764,766
|
2,820,604
|
We created binary (binding/no-binding) TF-TG adjacency matrices using the positional and FIMO TFBS. For the TEPIC based adjacency matrix, we used affinity scores as weights. We combined these matrices with PPI data and cell type specific co-expression to fit a GRN using the PANDA algorithm. We fit four models using elastic-net based regularized linear regression (ENET) for each cell line. The first three models (FIMO GRN, Pos GRN, TEPIC GRN) used the edge weights from the respective GRN as input features to predict expression for test set TGs. The fourth model (TEPIC) used only TF-TG affinity scores without GRN information. Predictive performance was measured using mean-squared error (MSE), and Pearson’s correlation coefficient (PCC) between predicted and observed expression values within a 5-fold cross-validation framework repeating for 20 iterations.
We built the GRNs using motif networks derived from CTCF boundary defined cis-regulatory TFBS(Figure-1). We observed that models using such GRNs performed statistically significantly better than or similar to those built using a 50Kb window to define the cis-regulatory regions of the TGs (Supplementary Figure S2A). Furthermore, the model performance for GRNs built using binary adjacency matrices for motif networks was comparable to those utilizing a weighted motif network based on the number of TFBS for each TF-TG interaction (Supplementary Figure S2B).
As shown in Figure 2 and in Supplementary Figure S1, GRN based prediction models containing cis and trans regulatory mechanisms were more accurate than models built using only cis-regulatory TF-TG TEPIC affinity scores. Specifically, the median PCC for TEPIC GRN based models was higher compared to that of TEPIC for GM12878 (0.42 vs. 0.30, p=1.45e-11 2A), K562(0.30 vs. 0.28, p=3.50e-2 2C) while the median MSE for the former was lower than that for the latter for GM12878(0.83 vs. 0.91, p = 4.35e-10 2B), K562 (0.91 vs. 0.93, p=3.26e-02 2D). Additionally, we achieved a statistically significantly better prediction performance while using the GRNs constructed from TFs, for which we could calculate TEPIC affinity scores, used in the original TEPIC paper for each cell-line compared to TEPIC affinity scores as shown in Supplementary Figure S3. These results generalized to the HepG2 cell-line as well (Supplementary Figure S1).
While the top three and bottom three ranked TFs, based on their average ENET effect estimates, did not change in our TEPIC GRN models compared to the TEPIC models, 33 of the 80 for GM12878 and 38 of 86 K562 had a change of rank greater than 10 positions (SupplementaryTablesS3). TFs that improved in rank were associated with transcriptional activation, and those that decreased in rank were associated with transcriptional repression. Thus, capturing trans-acting regulatory mechanisms in their cellular context produces better predictions of gene expression and a better understanding of the relative importance of TFs in the transcription process. Results from all analyses (median measures, p-values and TF ranks) are provided in Supplementary Tables S4A – S4I and S5A-B.
Performance of all GM12878 based models was better overall compared to those obtained from the other two cell lines. This observation was true even when the Pos GRN models were constructed using a common set of TFs among the three cell-lines (Supplementary Figure S4A-B). Moreover, these models also performed the best when used to predict cross-cell type TG expression (Supplementary Figure S5C-D). Furthermore, the GM12878 Pos GRN also contained the highest proportion of literature annotated TF-TG interactions of the three cell-lines (Supplementary Figure S6). Thus, GM12878 GRNs were capturing the most accurate regulatory information between TFs and TGs and the models based on them were explaining the most variance in the gene expression trait. We hypothesized that this may have occurred due to the better estimates of the expression correlation network in the GM12878 line. To explicitly test this, we examined the influence of co-expression networks over the TF-TG regulatory relationships learned by the PANDA GRN by replacing them with networks built using identity matrices to build prediction models (Supplementary FigureS7). The models built using motif and co-expression information performed statistically significantly better compared to those built using just motif information or motif and PPI information, illustrating the importance of having a robust co-expression dataset for generating the GRN weights.
Pos GRN models for GM12878 and K562 had the best performance of all models tested. This could be due to overfitting caused by the higher number of TF features in these models. To test this, we restricted the analysis to GRNs built using the same number of TFs. The performance of Pos GRN in this restricted analysis was similar to that of FIMO GRN for all cell-lines (Supplementary Figure S4). Furthermore, in this sensitivity analysis we found that TEPIC GRN-based models were the most accurate for K562 and HepG2. In other words, when restricted a common set of TFs, TEPIC affinity scores were able to capture more regulatory information between TFs and TGs in comparison to simple positional ChIP-seq data and statistically defined FIMO-based TFBS for K562 and HepG2. Thus, the difference in performance of TF binding assessed by the three methods is either due to the specific collection of TFs assayed for a particular cell-type or just biological differences between the cell-types.
Expression prediction highlights the regulatory roles of transcription factors
Transcription factors may influence gene expression as core activating factors, as responsive factors to environmental stimuli, or as repressors. ENET regression models allow for this heterogeneity by linearly combining two penalizing terms, LASSO(L1) and Ridge(L2), that identify the most influential features(TFs) and shrink the weights of lesser features by either reducing them to 0 (L1) or to a very small number (L2). We examined the influence of the hyper-parameter α, which controls the ratio between the two terms in ENET models, on the prediction performance of genes for each cell type. As shown in the Supplementary Figure S8, moving towards a higher value of α improved the gene expression with a plateau at 0.5 after which improvement was not significant. This indicates a balance between a sparse regulatory model where certain TFs have large effects on gene regulation, and a distributed regulatory model where multiple TFs contribute small effects.
Using an α of 0.5 (balancing L1 and L2 penalties) and equation (1), we averaged the feature weights of 149 TFs(GM12878) and 309 TFs(K562) for the Pos GRN models fit for 20 iterations, which are shown in Figure 3 and in Supplementary Tables S6A and S6B.
For some collections of correlated TFs, collinearity was reduced in the learned models by shrinking effect estimates of some TFs close to zero (in the range [-0.01, 0.01]) (Supplementary Figure S9), as such these TFs may be representative of larger TF complexes. Histograms in Figure 3 show 5 roughly equal bins created using mean effect estimates. The overall range of the mean effect estimates for both cell types, as well as the ranges within each bin, are provided in Supplementary TableS7. We performed a GO enrichment analysis for TFs in each bin and reported the top 5 enrichment terms for biological processes and molecular functions in the Supplementary Tables S6C and S6D and in Supplementary Figure S10 for both cell types. We observed that as we moved from positive to negative TF effect coefficients (bin 5 to bin 1), the corresponding GO terms changed from reflecting transcriptional activation to those indicating transcriptional repression. Thus, we could derive functions of unannotated TFs based on the bins in which they are placed. For instance, K562 bin 1 contained MYNN(βK562 = -0.0059) whose function is largely unknown. However, based on its placement in the bin containing strong repressors such as CBX1(βK562 = -0.0188), HDAC6(βK562 = -0.0045) and BMI1(βK562 = -0.0341), we predict its function is related to transcriptional repression. Similarly, bin 5 for both K562 and GM12878 contained TFs related to core promoter activity and positive gene expression regulation such as TAF1(βGM12878 = 0.6334), TBP(βGM12878 = 0.2142), ELF1(βGM12878 = 0.2249), POLR2A(βK562 = 0.1123), POLR2G(βK562 = 0.0233), CHD1(βK562 = 0.0492)and MYC(βGM12878 = 0.1481). Relatively lesser known TF ZZZ3(βGM12878 = 0.1359; βK562 = 0.0375), which was also present in that bin may most likely play a similar transcriptional activation role. We also note that TFs with mean effect estimates very close to or equal to zero were present in bin 2 for GM12878 and in bins 2 and 3 for K562. These TFs were enriched for cofactor activity, and their functional annotations reflected their roles as secondary TFs that required binding of the primary TFs to the DNA in order to exert their influence. Thus, analyzing the prediction models and characterizing TFs based on their effect estimates enables us to hypothesize about the transcriptional regulatory roles of lesser known TFs.
Accounting for chromatin interactions between TFBS and gene promoters improves expression prediction
In order to determine the effect of TF binding within different regulatory regions on gene expression, we built prediction models using GRNs containing TFBS found in those regions and calculated their predictive performance. We first analyzed TFBS within the promoter regions (5Kb upstream of the TSS of the genes), intronic TFBS, and distal ones present outside these areas as shown in Supplementary Figure S11. The promoter region near the TSS of the gene is important for transcription initiation and regulation and it contains binding sites for pivotal pioneer TFs such as TAFs, POL2 subunits, and TBP. As expected (Figures 4A,4B and the Supplementary Tables S4A- S4I), the median PCC and MSE for the promoter TFBS based ENET models were significantly better than that of the ones containing the distal TFBS alone for GM12878(MSE p = 3.26e-02; PCC p =2.92e-04), K562(MSE p =3.75e-02, PCC p = 3.26e-02). Also, models containing intronic TFBS performed significantly better than those without (Figures 4C, 4D) with respect to median MSE (GM12878 p = 4.72e-04) and median PCC(GM12878 p = 1.33e-08; K562 p = 2.45e-02). The results corresponding to the HepG2 cell-line reflected similar predictive patterns (Supplementary Figure S1).
We next used Hi-C data corresponding to GM12878 and K562 in order to capture long distance interactions between distal TF binding and gene promoters. We used the motif adjacency matrices and weighted them based on the number of normalized Hi-C contacts between TF peaks and TG promoters for both cell lines using equation (2) as shown in Figure 5A. Prediction models including Hi-C adjusted distal TFBS were significantly more accurate compared to the ones built using normal distal TFBS as shown in Figure5B and Supplementary TablesS4A – S4I with regards to both PCC(GM12878 p =7.33e-03; K562 p = 2.00e-06) and MSE(GM12878 p = 1.43e-03; K562 p =5.61e-03) for both cell types.
Next, we expanded this weighting scheme to include promoter TFBS. As promoters are regions of high TFBS activity (as seen in our models, Figure 4A and 4B), we expected a high degree of Hi-C contact points within promoter regions. Unexpectedly, these models performed significantly worse; we observed a large number of promoter TFBS (59% for GM12878 and 90% for K562) that showed no evidence of within-promoter contacts, and using this weighing approach effectively down-weighted promoter TF-TG interactions (Hi-C DP). We therefore also considered an approach that applies the maximum Hi-C weight to all promoter TFBS (Hi-C UP), shown in Figure 5A. These Hi-C UP based prediction models significantly outperformed all the other models for both cell types as shown in Figure 5C and Supplementary Figure S12. Thus, Hi-C data added important regulatory information to our models capturing the effect of long distance interactions between TFs binding to distal regulatory elements and the TG promoter.
Weighting rare variants using GRN derived effect estimates enriches the SKAT based identification of significant TGs
Determining the impact of rare non-coding variants on TG regulation is a major challenge in the field of human genetics[23]. Here, we present the utility of our GRN based ENET models for weighting rare variants within kernel-based association tests to improve their power. We used the DGN dataset[43] containing HRC-imputed variant genotypes and RNA-seq from the whole blood of 922 individuals in order to perform SKAT[24] based rare variant analysis(Supplementary Figure S13). We generated a PANDA GRN for GM12878 based on intronic TFBS motif network weighted using HiC-UP weighting scheme described earlier and then used it to build ENET prediction models and subsequently derived average TF effect estimates. We extracted approximately 9.4 million rare SNPs(MAF < 0.01) from the DGN dataset and scored them based on their impact on TF binding intensity using the QBiC-Pred algorithm[25]. By merging this score with the average effect estimates of the corresponding TFs, based on equations (3) and (4), we created a variant scoring metric representing the estimated average effect of a base-pair change on TF-TG regulation. We used these merged scores to perform SKAT for the normalized expression of TGs in the DGN dataset. We compared the performance of this model to that obtained from aggregated QBiC-Pred z-scores, representing the effect of rare variants on TF-binding alone. As shown in Supplementary Figure S14A, both SKAT models were able to detect 175 common TGs at the multiple hypothesis correction significance threshold of p-value < 4.18e-06. Merge score based SKAT model was able to detect 158 unique TGs while z-score based model detected 56 unique TGs at this threshold. We also performed a replication analysis using the whole blood sequencing and expression data from 369 individuals within the GTEx dataset[26]. We were able to replicate 32% of the TGs uniquely identified by merge score based SKAT model (p-value < 0.05), while only 21% of the TGs uniquely identified by the QBiC-Pred z-score SKAT model replicated (Supplementary Figure S14B). We have provided the results from all the SKAT models in Supplementary Tables S8A-S8C. Thus, utilizing TF-TG regulatory information learned from our GRN framework for weighting rare variants enriched the identification of TGs, which would have been missed if we had only utilized variant influence over TF binding.