Conformation capture and E-P interactions
We used Hi-C followed by chromatin immunoprecipitation (HiChIP) targeting H3K27Ac in LNCaP cells (androgen-sensitive prostatic carcinoma cell line) across 5 biological replicates including 1 billion reads as previously described 14. HiChIP is an efficient protein-mediated chromatin-conformation assay 13. H3K27Ac is a marker of active enhancers and promoters. Briefly, we used HiC-Pro 48 to map HiCHiP reads and extract unique interactions; FitHiChIP 49 was used to identify significant interactions with a predefined set of peaks from H3K27ac ChIP-seq in LNCaP to refine accurate anchor ranges. We used q-value < 0.01 and a 5 kb resolution and considered only interactions between 5 kb and 3 Mb. In this analysis, we restricted to a stringent global background estimation to reduce as much as possible the number of false-positive interactions. The corresponding FitHiChIP specifications used were “IntType = 3” (the peak-to-all) for the foreground, meaning at least one anchor to be in the H3K27 peak, and “UseP2PBackgrnd = 1” (the peak-to-peak (stringent)) for the global background estimation of expected counts and contact probabilities for each genomic distance for learning the background and spline fitting. We identified 49,565 significant interactions (FitHiChIP, FDR < 0.01).
We categorized interactions by overlapping anchors with transcription start sites (TSS) and enhancers identified by H3K27ac ChIP-seq as previously described 14. Briefly, we first extended anchors by 5 kb on either side; we defined promoter regions around the TSS (+/- 500 bases) using RefSeq hg19 (see Data Availability); we defined enhancer regions using 49,638 regions from H3K27ac LNCaP in regular media (union of narrow and broad peaks). Out of the 49,565 significant interactions, we considered only the 24,547 E-P interactions. Specifically, we labeled the promoters and enhancer regions that overlap either right or left anchors and considered E-P if only one anchor overlaps a promoter and the other an enhancer region. The enhancer anchors at this stage of the analysis are of length 15 kb (5 kb resolution of the HiChIP data analysis and additional 5 kb padding added to anchors on either side).
We further prioritized E-P interactions to 1 kb regions and discarded from enhancers the 1 kb bins with fewer HiChIP interactions with the promoter (see E-P HiChIP prioritization section). The 15 kb original E-P interactions dataset contained a mean of 1.6 (1.3 s.d.) promoter anchors per enhancer anchor (after prioritization of enhancer anchor to 1 kb region, mean of 1.4 (0.9 s.d.) promoters per enhancer). There were 11,127 (17,683 prioritized 1 kb regions) enhancer anchors in total; 7,341 (12,385 prioritized 1 kb regions) enhancer anchors are contacted by one promoter anchor with a maximum of 21 promoter anchors (15 using prioritized enhancer regions) sharing the same enhancer.
E-P HiChIP prioritization
In order to reduce experimental artifacts in the context of our EPINs, we developed a specific prioritization method. This prioritization start by normalizing the data assuming, as most used capture-C normalizations (ICE 50, Vanilla, or KR 51) that all biases (e.g. GC content, number of restriction sites, mappability, or in the case of HiChIP, immunoprecipitation bias) can be corrected together. For this normalization step, we assume that there is a specific bias per any 1 kb genomic loci (ꞵx for loci x; see Figures S5A and S5B). This bias causes the difference between a theoretical expected number of interactions (EXY between loci X and Y) and the observed number of interactions (OXY between loci X and Y). In this representation we can define a system of 9 equations involving three 1 kb loci in the promoter and three 1 kb loci on the enhancer side. This system of equations is then solved using Sequential Quadratic Programming (SQP) 52. The procedure is repeated in an overlapping window manner along the 15 kb of the enhancer, alway against the target 1 kb of the promoter and its two 1 kb neighboring loci.
Before the normalization step, we observed a different interaction pattern for interactions below 10 kb (Figure S5C) due, in part, to the contiguity of restriction-enzyme fragments or chromatin persistence length. As these interactions may also be a source of bias in the construction of a PPI network, we removed them from our study.
We applied the normalization to the remaining interactions and observed a better correlation between genomic distance and interaction count (Figures S5D and S5E).
The normalized profile of interactions was finally used to prioritize most interacting 1 kb loci on the 15 kb enhancer (Figure S5F). The selected 1 kb regions are referred to as prioritized enhancer regions.
DNA binding motifs
DNA binding motifs were retrieved from JASPAR, an open-access database of curated, non-redundant binding profiles of DBPs (a.k.a. motifs) stored as position frequency matrices (PFMs). To detect the binding motifs, we used FIMO from the MEME-suite software, with p-value < = 1e-4 and q-value < = 5e-2 cutoffs. JASPAR contains 810 DNA binding motifs of 640 proteins that overlap the E-P contacts identified with HiChIP.
Gene expression data
We assayed RNA sequencing (RNA-seq) in the cell line LNCaP for two replicates using the VIPER pipeline as previously described 14, and fragments per kilobase of transcript per million mapped reads (FPKM) values were calculated for 20,114 RefSEQ genes. Genes with expression levels above the threshold of 0.003 in both replicates were considered in the entire analysis (Figure S4C).
Protein-protein interaction network
We obtained protein-protein interactions (PPIs) from the Integrated Interactions Database (IID) 53. To better contextualize the interactome information, we combined the annotations of the PPIs from IID database with the LNCaP gene expression data. As for the IID annotations, we applied the following selection criteria. First we selected interactions annotated as “experimental” in the “evidence type” field and identified by at least two independent biological assays reported in the “methods” field. Then, we filtered only for interactions in the prostate or in prostate cancer cells and between nuclear proteins. Finally, we retain proteins whose gene expression levels were FPKM > 0.003 in both replicates (this cut-off removes ~ 30% of the genes). In total, 14,221 proteins from a pool of 20,111 human protein coding genes meet the gene expression criteria. The combination of the above filtering criteria (gene expression, using only nuclear, prostate cancer or prostate and experimentally by 2 methods) resulted in an unweighted network of 31,944 prostate-specific nuclear PPIs among 4,295 proteins.53
PENGUIN pipeline
We set up graph-based approach, called Promoter-ENhancer-GUided Interaction Networks (PENGUIN), to reconstruct individual networks of protein interactions that might occur between one promoter (P) and its contacting enhancers (E), that we call E-P protein-protein Interaction Networks (EPINs). To reconstruct the EPINs, PENGUIN integrates information about chromatin contacts, protein-DNA binding, and protein-protein interactions (PPIs). For the case under study in this work (prostate cancer, PrCa), chromatin contacts information comes from H3K27Ac HiChIP of LNCaP cells (4,314 promoters and 5,789 enhancer regions; see Methods, “Conformation capture and E-P interactions''), protein-DNA binding information 51,52 comes from the JASPAR database (810 DNA binding motifs of 640 proteins; see Methods, “DNA binding motifs”), and PPIs information comes from the IID database (31,944 prostate-specific nuclear PPIs among 4,295 proteins; see Methods, “Protein-protein interaction network”).
The reconstruction of EPINs follows these steps: for each E-P contact, (1) DNA binding motifs are detected in the corresponding sequences of promoter and enhancer regions; (2) a subnetwork of PPIs is selected containing all promoter-bound proteins, all enhancer-bound proteins, and all their intermediate interactors, with a maximum of 1 intermediate node between enhancer and promoter bound DNA binding proteins; (3) intermediate interactors are discarded if they only connect promoter-bound proteins or enhancer-bound proteins. Using the provided PrCa information, PENGUIN reconstructed 4,314 EPINs consisting of a total of 9,141 PPIs among 885 proteins of which 751 are intermediate proteins linking promoter-bound and enhancer-bound proteins.
Node centrality measures
We employed two measures of node centrality, namely betweenness and degree. Betweenness is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that the number of edges that the path passes through is minimized. The degree of a node in a network is the number of connections it has to other nodes and the degree distribution is the probability distribution of these degrees over the whole network.
Clustering EPIN
We defined clusters by considering their edge content. We collected the full universe of edges using all existent edges between all promoter EPINs (the union graph) and we performed clustering using a binary representation that encodes this particular edge list. Clustering was performed over the overlap index distance matrix, by calculating Euclidean distance and using Ward’s linkage method. Each leaf in the obtained cluster is a promoter EPIN.
Identification of transcription factors binding directly enhancer to promoter
The TFs binding in the Enhancer region (TFs.E) and the TFs binding in the Promoter region (TFs.P) were identified. The PPIs networks were then searched to see if there is a path linking the (TFs.E) and the (TFs.P).
Identifying enriched clusters using functional annotations
We performed Fisher tests on the obtained clustering for every single branch of the dendrogram. We examined if there is an enrichment of any feature (CTCF, GWAS) for the leaves under the branch of interest compared to those in the rest of the tree. For the CTCF binding feature, we require a CTCF binding (see CTCF ChiP-Seq peaks) to Promoter and at least one of the Enhancer regions of the promoter EPIN. For the GWAS feature, we require the presence/overlap of a GWAS SNP (see Genome-wide association data) in at least one of the Enhancers of the promoter EPIN.
Fisher tests were used to calculate the odds ratio (OR) and enrichment p-values for presence of PrCa annotations within the identified clusters.
Druggability information
We extracted information for target druggability from DrugBank 54. The use of each drug was extracted from the Therapeutic Target Database 55. We annotated as PrCa druggable each protein node that is target for drugs that are assigned as Approved or under Clinical Trials (Phase 1, 2, 3) or Investigable for Prostate Cancer.
CTCF ChiP-Seq peaks
CTCF ChiP-seq peaks were retrieved from ENCODE project (https://www.encodeproject.org/) for the same Genome assembly, hg19 (Experiment ID: ENCSR315NAC, file ID: ENCFF155SPQ). These narrow peaks were mapped on the enhancer regions using the python package pyranges (see “E-P contacts” section). We found overlaps of the CTCT binding sites with enhancer and promoter anchors allowing a 10 kb gap between them.
PrCa SNPs
To explore enrichment of GWAS across the identified clusters, and to identify the SNP paths, we used the 95% credible set from fine-mapping of the largest PrCa genome-wide association studies (GWAS) collected from Schumacher et al. 2019 (N = 79,148 cases and 61,106 controls), which includes 20,370,946 SNPs, as previously described 56. Briefly, we fine-mapped 137 GWAS regions using PAINTOR 57, a Bayesian statistical method, with no functional annotations and specifying a maximum of 1 causal SNP. We then constructed a 95% credible set by taking the cumulative sum of the posterior probability until a cumulative 95% posterior probability was reached. This set was composed of 5412 distinct SNPs (rsid). We will refer to these 95% credible set SNPs as PrCa SNPs for brevity. Note that this set also includes SNPs that do not reach genome-wide-filters of p-value significance. We mapped the SNP location to prioritized enhancer regions anchor locations with a window of 10kb. 518 out of 5412 overlap our prioritized enhancer regions; 18 of them overlap our promoter regions. In total 218 prioritized enhancers and 14 promoters overlap a PrCa SNP.
SNP paths (PrCa SNPs in enhancer binding motifs)
A path in a network is a sequence of edges joining a sequence of nodes. We detected PrCa SNPs located in the DNA binding motifs in the enhancers and identified the corresponding SNP paths (linked edges and nodes) for each EPIN promoter. For SNP paths analyses and the web-browser, we used all PrCa SNPs in the 95% credible set. There were 36 PrCa SNPs falling in enhancer binding motifs across clusters 3, 4, 5, 6, 7, 8. To report the most interesting cases in the Tables and Results, we used the subset of those passing genome-wide significance of p-value for PrCa association < 5e-8. There were 15 PrCa SNPs falling in enhancer binding motifs across clusters 3, 5, 6, 7, 8.
SNP paths (PrCa SNPs in intermediate proteins)
We detected PrCa SNPs falling within genes that encode for intermediate nodes, and identified the corresponding SNP paths (linked edges and nodes) for each EPIN promoter. For SNP paths analyses and the web-browser, we used all PrCa SNPs in the 95% credible set. To report the most interesting cases in the Tables and Results, we used the subset of those passing genome-wide significance of p-value for PrCa association < 5e-8.
PrCa GWAS enrichment using GWAS Catalog and comparison with other diseases
This analysis had two aims: 1) explore whether we could replicate our finding and identify the GWAS enriched cluster using a different source for the GWAS; 2) to compare the GWAS signal for different diseases. We estimated enrichment of SNPs overlapping the enhancers in each of the identified clusters by exploring the NHGRI GWAS Catalog associations 24. First, we retrieved GWAS data and filtered the traits according to their “umlsSemanticTypeName” as defined in DisGeNet database 58 to one of the following: "Mental or Behavioral Dysfunction", "Neoplastic Process", "Disease or Syndrome", "Congenital Abnormality; Disease or Syndrome", "Disease or Syndrome; Congenital Abnormality", "Disease or Syndrome; Anatomical Abnormality". We considered only traits with at least 10 genome-wide-significant SNPs (unadjusted p-value < 5e-8). We mapped the SNP location to prioritized enhancer anchor locations with a window of 10kb. 104 diseases had SNPs overlaps and 17 of them have more than 10 SNP overlapping (Table S5). For each cluster, we tested enrichment of disease-associated SNPs using Fisher tests and considered significant p-value < 0.01 and OR > 1.
Oncogenes Gene list
We used a previously identified list of 122 Genes ("PrCa_GeneList_Used.csv") known to be somatically mutated in PrCa oncogenesis (37 out of 4,314 promoters considered). As previously described 14, the 122 oncogenes are a set of prostate cancer–genes curated from three large-scale PrCa studies that show evidence of somatically acquired mutations, at both localized and advanced prostate cancer, known and recurrently altered in localized prostate cancer and metastatic prostate cancer.
Enriched edges within each cluster
Fisher tests were used to compute odds ratios and p-values of the edges and nodes in the eight different clusters. Specifically, each edge or node was tested for presence/absence in a cluster compared to all others. Therefore, one edge or node can be enriched in one or more than one cluster, it cannot be enriched in all clusters.
Enriched intermediate nodes within each cluster
We computed protein importance for each cluster in terms of two network centrality measures: betweenness and degree. For each protein we obtain both betweenness and degree specificity ratios in order to equitably quantify internal protein centrality differences between the clusters. For each of the found clusters we independently estimated the specificity of the observed protein centrality measures (“Betweenness” and “Degree”). For a given protein (Pi) in a particular cluster (Cj), we define the specificity as the ratio between the mean centrality value of Pi inside the fraction of networks belonging to Cj ; divided by the mean centrality value of Pi for the fraction of networks outside of the cluster Cj.
Specificity ratio (Pi, Cj) = (mean (Pi centrality in Cj networks) + 1) / (mean (Pi centrality in non Cj networks) + 1)
We assessed protein specificity ratio significance for each cluster upon random network cluster generation. Aiming to assess the significance of the different specificity ratios for the proteins within each cluster, we developed a significance analysis test based on random cluster subsamplings. In order to compute the significance of a given protein specificity ratio (Pi) within a particular cluster of analysis (Cj), we performed 1000 random network samplings to produce random network clusters containing the same amount of networks as the real cluster being analyzed (i.e. if the real cluster contains 100 networks, the random clusters generated will contain 100 random networks out of the 4,314 clustered networks). Within each of those 1000 random clusters, we compute the corresponding protein specificity ratios, with the p-value representing the probability of finding the protein specificity ratio to be higher or equal to the real value computed for the particular cluster of interest (Cj).
We also performed Fisher tests to assess enrichment for the presence of the node in the cluster (Fisher test p-value < 0.01). EP300 was excluded from the enrichment test as the presence of that node was not significantly enriched (Fisher test p-value < 0.01). 22 proteins (SMAD2, KAT5, NCOR2, MAPK8, SMAD4, CREBBP, CTNNB1, PGR, HDAC3, HDAC2, GSK3B, UBA52, UBE2I, JUND, PIAS1, XRCC5, CDK6, XRCC6, MAPK1, FOS, HIF1A and MAPK3) were found to be significantly specific for both betweenness and degree ratios (p-value < 0.01 for both centrality measures and over-represented in this cluster using Fisher tests).
Functional gene set enrichment analysis
Functional enrichment analysis was performed using the g:GOST module from g:Profiler, a web tool to perform simultaneous gene set enrichment analysis across multiple biomedical databases 27. g:GOST performs cumulative hypergeometric tests of an input geneset against preprocessed database-specific gene sets. We run the web service considering only annotated genes for the statistical domain scope. Reported adjusted p-values correspond to Benjamini-Hochberg correction for multiple testing, with adjusted p-values ≤ 0.05 considered to be significant. Gene set enrichment analysis results are provided for KEGG pathways, Reactome, Gene Ontology, Wikipathways, TRANSFAC, miRTarBase, Human Protein Atlas, CORUM and Human Phenotype Ontology.
For the enrichment analysis of significantly specific proteins of the GWAS + cluster, we provided as input the 22 previously described proteins. For the enrichment analysis of the GWAS + cluster, we provided as input all genes associated with the EPIN promoters in cluster GWAS+.
Differential Gene Expression
We integrated data from EPIN promoters with differential gene expression (DE) from two sources.
DE analysis on prostate cancer tumor versus normal was downloaded from GEPIA: http://gepia2.cancer-pku.cn/#degenes, which use the TCGA and GTEx projects databases to compare gene expression between tumor and normal tissues under Limma, both under and over expressed. We used the default thresholds of log2FC of 1 and qvalue cut-off of 0.01. These data covered 84 out of 885 genes encoding for intermediates in PENGUIN and 413 out of 4,314 promoter EPINs.
DE analysis of RNA-Seq on LHSAR (an immortalized prostate epithelial line overexpressing androgen receptor) versus LNCaP was performed as previously described. Briefly, RNA-seq data were processed using the VIPER pipeline 59. Reads were aligned to the hg19 human genome built with STAR. FPKM values were calculated with Cufflinks for 20,114 RefSEQ genes included in the VIPER repository. Differential expression analysis was performed with the DESeq2 R package 60. 15,650 genes with DE data covered 884 of the 885 genes encoding for intermediates in PENGUIN and 3,286 genes out of 4,314 promoter EPINs.
We annotated whether the EPIN promoters themselves and the genes encoding the intermediate proteins in our data were DE using either of the two databases. We considered as DE those genes passing |log2 fold change| > 1 and adjusted p-value < = 0.01. For the LNCAP/LHSAR dataset, we could compute a Fisher test of enrichment of differentially expressed genes encoding for intermediate proteins within each EPIN promoter versus within the SNP paths (we could not compute this for the GEPIA since we did not have the full dataset of covered genes). The genes that were not passing these filters were considered non-DE and the genes not covered by the two datasets were excluded from the enrichment analysis described next. For each EPIN we calculated the fraction of DE intermediates within the SNP paths and we estimated the enrichment of those compared to the fraction of DE intermediates in the full EPIN network.
To find the enrichment of DE genes in SNP paths (PrCa SNPs in intermediate proteins) compared to those in the entire EPIN, we computed as enrichment the ratio of Fraction1 / Fraction2, where:
Fraction1 = (number of DE intermediates within SNP paths) / (number of covered intermediates within SNP paths), and
Fraction2 = (number of DE intermediates the EPIN) / (number of covered intermediates in the EPIN).
We report the EPIN genes passing enrichment (“enrichment_DE_deseq_SNP.bs.TF.path”) > 1.
pQTL look-up
We downloaded summary statistics with genome-wide association between SNPs and 4907 proteins reported in the deCODE study 61 and annotated with pQTL association the SNPs we identified falling in either in enhancer binding sites or in node genomic locations. The deCODE pQTL summary statistics data contained information on 4,907 proteins and 186 (201 PrCa SNPs out of the 213 PrCa SNPs we looked up were in the data and 186 also matched by alleles). 808 out of the 4,314 genes promoters ("Gene_network") and 278 out of the 885 gene intermediates (in total 997 out of 4,918 genes promoters and coding for intermediates in our networks) have information on associations with their respective coded proteins covered by the pQTL deCODE data.