MethNet: a robust approach to identify regulatory hubs and their distal targets in cancer

Aberrations in the capacity of DNA/chromatin modifiers and transcription factors to bind non-coding regions can lead to changes in gene regulation and impact disease phenotypes. However, identifying distal regulatory elements and connecting them with their target genes remains challenging. Here, we present MethNet, a pipeline that integrates large-scale DNA methylation and gene expression data across multiple cancers, to uncover novel cis regulatory elements (CREs) in a 1Mb region around every promoter in the genome. MethNet identifies clusters of highly ranked CREs, referred to as ‘hubs’, which contribute to the regulation of multiple genes and significantly affect patient survival. Promoter-capture Hi-C confirmed that highly ranked associations involve physical interactions between CREs and their gene targets, and CRISPRi based scRNA Perturb-seq validated the functional impact of CREs. Thus, MethNet-identified CREs represent a valuable resource for unraveling complex mechanisms underlying gene expression, and for prioritizing the verification of predicted non-coding disease hotspots.

because of the high cost of whole genome sequencing and the fact that coding sequences are easily identifiable and can be directly linked to changes in gene expression. In contrast, it is difficult to connect cis regulatory elements (CREs) in non-coding regions to their target genes as these can be located many hundreds of kilobases away on the linear chromosome. Nonetheless, noncoding regulatory elements like promoters, enhancers, silencers and structural elements cannot be ignored since variants have been shown to be more likely to contribute to disease than non-synonymous coding variants as they occupy a greater proportion of the genome compared to coding sequences, and alter the binding capability of factors that are the key drivers of gene regulation 1,2 . In addition, and equally important, epigenetic changes that alter the ability of a transcription factor (TF) to bind a regulatory element can have the same effect.
CREs are marked by active histone marks, DNA hypomethylation, and enriched for the binding of transcription factors. Distally located CREs rely on cohesin-mediated loop formation to bring them into physical contact with the promoters of genes they control 3 . The chromatin contacts can be stable (found in all cell types), or cell-type specific with contacts mediated by cell-type specific TFs, in a CTCF dependent or independent manner. Although gene regulation commonly occurs between a single CRE-promoter pair, transcriptional control can be complicated by CRE redundancy. Indeed, we and others have shown that enhancers can control the expression of more than one gene 4,5 , and similarly, promoters can act as enhancers that regulate other distally located target genes 6 . Thus, gene regulation can occur in 'hubs', which encompass multiple CREs and the promoters of their gene targets. CRE hubs connect regulatory elements that can be widely separated on the linear chromosome, but with interactions largely restricted to the same topologically associated domain (TAD), suggesting they rely on a loop extrusion mechanism for their formation 7 . Hubs are strongly enriched for promoters of target genes and super-enhancers critical for cell identity and are associated with high transcriptional activity indicating a probable important role in gene regulatory networks that control cell fate [8][9][10][11] . Understanding the mechanisms by which genes are controlled is challenging for a number of reasons. First, as mentioned above, regulatory elements can be located hundreds of kilobases away on the linear chromosome and they do not necessarily control the nearest neighboring gene. Second, even though regulatory elements and their target genes are generally in close contact in 3D space as a result of chromatin looping, not all elements that are in contact have a functional impact on gene regulation 8 . Third, although enrichment of active histone marks, DNA hypomethylation and transcription factors are all hallmarks of regulatory elements, their presence does not imply functional impact. Finally, there is no 'one rule for all' and every gene can be controlled by a combination of regulatory elements (enhancers / silencers) and unique chromatin folding constraints (mediated by binding of CTCF and cohesin). Thus, in order to better understand mechanisms underlying transcriptional control, we need to take advantage of datasets that couple gene expression with epigenetic marks to construct functional models.
DNA methylation is an epigenetic mark associated with the regulation of gene expression 12 . It has long been appreciated that methylated CpG islands (CGI) in the promoter of a gene have a strong silencing effect, but beyond that, the functional impact of methylation on gene expression is context dependent. Intragenic methylation occurs during transcript elongation in gene bodies to prevent the initiation of spurious transcripts [13][14][15] , while methylation of intronic and intergenic regions impacts the activity of regulatory elements 16 . Bisulfite converted sequence-based techniques such as whole genome bisulfite sequencing (WGBS), reduced-representation bisulfite sequencing, (RRBS) and array-based techniques like Illumina's BeadChip, have facilitated the genome-wide screening of chromatin for methylation marks, allowing for a systematic investigation of their methylation status. The ENCODE 17 and TCGA projects are particularly useful resources for identifying novel putative regulatory elements as they provide paired DNA methylation and gene expression data across multiple cell lines and patient-derived samples.
Several approaches have been proposed to investigate the role of methylation in gene regulation in a systematic way. These methods can be divided into two broad categories based on the manner in which they interrogate possible connections: (i) association mining and (ii) predictive modeling. In association mining, all possible pairwise connections between candidate regulatory sites and genes are tested independently, while in predictive modeling all possible associations are considered simultaneously. The association mining category includes methods like ELMER 18,19 and TENET 20 , in which correlations between differentially methylated sites and differentially expressed genes in tumor versus normal samples, are tested. The predictive modeling category includes methods like ME-Class 21 and the use of Random Forests (RFs) to predict whether a gene will be differentially expressed based on changes in methylation status and other chromatin features 22 . However, a limitation of this method is that the analysis is restricted to promoters and gene bodies. Other predictive methods like REPTILE 23 , use RFs to directly identify putative enhancers instead of modeling enhancer-gene associations. Aside from their strengths and weaknesses, all the above methods focus on a narrow region around the promoter and the gene body.
In this paper we present MethNet, a pipeline to uncover regulatory networks linking CpG sites to gene expression. In contrast to other methods, we identify regulatory elements within a 1Mb region around the promoter of every protein coding gene in the genome, taking into account the overall decay of regulatory connections related to the distance separation from a gene. The resulting regulatory network recapitulated known regulatory mechanisms like the silencing effect of methylation at gene promoters and the enrichment of CpG islands in CREs. Importantly, MethNet also identified novel CREs whose methylation was correlated with transcriptional activation or repression of their associated target genes. We characterized these networks and used them to identify cis-regulatory hubs involving multiple associations with a robust regulatory potential that are predictive of patient survival across all cancers as well as in particular tumor types. Promoter-capture Hi-C confirmed that highly ranked associations are mediated through physical interactions between CREs and their target gene promoters. Further, CRISPRi based scRNA Perturb-seq confirmed that MethNet was able to accurately identify functional regulatory elements. Thus, MethNet is a powerful and cost-effective tool that can be used to understand the underlying mechanisms of distal gene regulation and to prioritize the verification of predicted noncoding disease hotspots.

MethNet constructs regulatory networks using TCGA data
Most of the published epigenome-wide association studies (EWAS) have focused on links between differentially methylated regions and the expression of the closest gene. However, it is important to account for long range interactions when modeling gene expression, since regions that are distal on the chromatin fiber can be brought into close proximity by chromatin folding. Indeed, gene regulation occurs largely within highly self-interacting, megabase sized 'topologically associated domains' (TADs) [24][25][26] , that are separated by 'insulating boundaries' enriched for CTCF and cohesin. The boundaries are functionally important as they limit inter-TAD interactions, so that enhancers predominantly contact promoters within the same TAD. TADs form via 'loop-extrusion', with cohesin rings extruding DNA until they encounter two convergently oriented CTCF binding sites, which form the base of a loop 27 .
To account for this, MethNet considers as potential regulatory elements all the CpG probes that are within a 1Mbp radius of the TSS. It uses gene expression and methylation data from TCGA samples to construct predictive models that quantify the contribution of each site to a gene's regulation (Figure 1). TCGA is the largest resource with both genome-wide gene expression and DNA methylation data from the same patient samples and is therefore well-suited for this type of Paired RNA-seq and DNA-methylation data were accessed from TCGA. The expression of every protein coding gene across every cancer dataset was modelled as a function of the methylation status in a 1Mb radius surrounding the gene to generate a set of regulatory networks across all cancers. These were aggregated to produce a network of robust associations that were used to identify novel regulatory elements and hubs.
analysis. Gene expression is modeled separately for each cancer-type to account for the contextspecificity of gene regulation. Elastic-net regularization is used to restrict spurious associations. A consensus network was constructed by aggregating associations across all cancers, so that robust associations found in multiple cancers were ranked higher. Finally, we computed the regulatory potential of every CRE as a function of all the associations it is involved in, and characterized the epigenetic features that contribute (Figure 1).

MethNet identifies putative activating and repressing distal associations
The challenge for any CRE-discovery method is that the number of potential associations grows exponentially with the radius of their regulatory window (Figure 2a). A-priori there are approximately 8 million potential regulatory associations between CpG sites in a window size of 1Mb on either side of every protein coding promoter. MethNet uses an average of 400,000 (5%) associations per individual cancer to model the expression of every gene. A pan-cancer analysis indicated that there is strong context-specificity in the regulatory network, with up to one third (2.6 million) of all potential CREs having a putative functional impact in at least one cancer. The strength of these regulatory associations varied across different cancer types, with the majority of associations being specific to a particular cancer, consistent with lineage specific transcription Mean performance of MethNet models, as measured by the ratio of variance as a function of dataset size. d) MethNet association effect as a function of distance. Associations are grouped based on their ranked distance to a gene, where -1 includes associations from the first CpG island or non-island upstream of the promoter, and positive distance refers to associations downstream of the TES. For every group of potential associations, we calculated the average distance, mean coefficient and probability of a MethNet association. e-f) Examples of regulatory effects recovered by MethNet. Left: A repressive association between IFNγ and a CTCF binding site 250 kb upstream of the promoter. Right: An activating association between GSTT1 and a non-coding RNA 10kb downstream of the promoter. factor-mediated gene regulation (Figure 2b). The context-specificity of the regulatory networks identified are associated with varying degrees of accuracy and reliability across different cancer types. Notably, larger sample sizes enhance MethNet's capacity to uncover robust and reliable associations (Figure 2c). This observation underscores the potential scalability of MethNet and its ability to leverage larger datasets to further elucidate the complex regulatory mechanisms governing gene expression in cancer.
MethNet successfully recovers known regulatory mechanisms such as the well documented effect of gene silencing when a promoter is methylated (Figure 2d). Specifically, methylation of the first CpG island upstream of the TSS has a strong negative correlation with expression. Furthermore, MethNet identifies CpG islands as being more likely to associate with and have a stronger effect on gene expression than inter-island regions in an unbiased way 28 (Figure 2d). On average, the closer a CRE to its target gene, the greater the probability of its regulatory impact on transcriptional output. Importantly however, the probability of CRE-target gene association does not diminish to zero, indicating that CRE methylation beyond the immediate vicinity of the promoter can have an impact on gene expression in a context specific manner.
We define MethNet associations that have a negative or positive coefficient as activating or repressing, respectively. In activating associations, a gain of methylation is predicted to decrease gene expression, while in repressive associations, a gain of methylation is predicted to increase gene expression. For example, MethNet identified a CTCF binding site located 250 kb upstream of the IFNγ promoter whose demethylation is linked to transcriptional repression (Figure 2e and S1). This suggests that CTCF binding to the unmethylated DNA sequence could be acting as an insulator preventing the IFNγ promoter from coming into contact with elements that activate its expression. Silencing of this inflammatory cytokine could have profound effects on immune responses to cancer, and its regulation is thus important in the context of immunotherapy. In contrast, MethNet identified the promoter of a non-coding gene located downstream of GSTT1 that when demethylated activates Glutathione S-transferase1 expression. This suggests that the promoter of the non-coding gene may be acting as an enhancer for GSTT1 (Figure 2f and S2). Glutathione S-transferases (GSTs) are phase II metabolizing enzymes that play a key role in protecting against cancer by detoxifying numerous potentially cytotoxic/genotoxic compounds.
These two examples highlight the complex interplay between DNA methylation and gene expression, providing valuable insight into the mechanisms underlying the regulation of cancer relevant genes.

The regulatory potential of CREs is correlated with chromatin context and contact frequency
The number of potential CREs controlling transcriptional output varies by gene, and the number of genes controlled by a single CRE varies by element. To study individual CREs and rank them by their impact on transcriptional output we defined a metric to capture the intrinsic regulatory potential. In particular, we aggregated a CRE's contribution to the regulation of all genes in its vicinity. We quantified the importance of an association as the excess of its relative effect size with respect to a null model, where all elements have the same intrinsic potential and the only factor differentiating them is their distance to the gene promoter. The potential of a CRE was calculated by summing all the distance-adjusted contributions to genes in its vicinity (Figure 3a).
We next conducted a series of enrichment analyses to uncover distinctive characteristics that are linked to a CRE's regulatory potential. First, we analyzed the chromatin state of each CRE using ChromHMM 29 (Figure 3b). These investigations indicate that promoters acting as enhancers controlling other distal genes, have the highest overall regulatory potential. Our observations are in line with the known regulatory role of distal gene promoters 6 . Indeed, the associations detected by MethNet transcend linear distance, highlighting its capacity to identify distal regulatory networks.
Methylation has been shown to directly affect the presence of the transcriptional machinery, chromatin modifiers, and binding of transcription factors (TFs) whose chromatin occupancy in turn can act as a barrier to methylation [30][31][32][33] . To identify which factors could contribute to the regulatory Figure 3: a) Schematic depiction of regulatory potential. We quantified the importance of an association as the excess of its relative effect size with respect to a null model where all elements contribute proportionally to their distance from the promoter. b-d) Enrichment of regulatory potential in ChromHMM state (b), chromatin remodelers, the transcription machinery, and transcription factor binding sites (c), and H3K27ac chromatin loops from CD4-Naive, GM12878 and K562 cells (d.) potential of a CRE, we performed an enrichment analysis and identified members of the RNA polymerase complex (such as POLR2A and POLR2G) as the most enriched factors (Figure 3c). Additionally, we identified other highly enriched factors known to be involved in chromatin remodeling and altering the methylation status, including EGR1, HDAC1, EZH2 and PHF8. Moreover, using Hi-ChIP data from the benchmark study of Bhattacharyya et al., 34 we observed a strong positive correlation between the regulatory potential of a region and the number of active chromatin loops (H3K27ac loops) anchored at it, further highlighting the dynamic interplay between chromatin structure and regulatory activity (Figure 3d).

MethNet hubs that control multiple genes have an impact on patient survival
The distribution of the regulatory potential is long tailed, suggesting the existence of CREs with exponentially high regulatory potential (Figure 4a). Intriguingly, these elements exhibited low methylation variance across the different TCGA datasets, indicating that they could be under robust and stringent regulation. We used the elbow method to show that above a select threshold, CREs were significantly enriched for regulatory associations. This approach identified 6137 CREs, that we refer to as 'MethNet hubs', since their profile aligns with a model in which CREs exert control over multiple genes as expected (Figure 4b). A survival analysis, using the Cox proportional hazard model, across TCGA cancers with clinical data, revealed that methylation of MethNet hubs has a bigger impact on the overall survival of patients in comparison to non-hub elements of similar variance, both in a pan-cancer and cancer-specific context (Figure 4c and

S3).
This finding provides evidence for MethNet hubs having a pivotal role in the context of cancer biology and underscores their potential clinical relevance.
To gain insight into the characteristics of MethNet hubs, we repeated the previous enrichment analyses, this time comparing hubs to other CREs with positive potential (Figure 4d-f). As anticipated, we observed that hubs share many of the characteristics exhibited by high-ranking non-hub elements, however, key differences emerged upon closer examination. In the ChromHMM and binding site analysis, insulating elements and CTCF were respectively enriched in MethNet hubs compared to non-hubs (Figure 4d-e). This finding is consistent with the known role that insulating elements play in gene regulation via chromatin looping and insulated topologically associated domain (TAD) boundaries 35,36 . This hypothesis is further supported by our data showing that hubs are depleted in elements lacking chromatin loops (Figure 4f). In summary, we identified MethNet hubs that are characterized by high ranking regulatory potential and whose methylation status has low variance. Hubs are enriched for regulatory associations and insulating elements and show significant clinical relevance with respect to cancer patient survival.

MethNet hubs uncover known and potentially novel regulatory elements
Examples of two regulatory hubs are shown in Figure 5. The regulation of the Protocadherin alpha (PCDHA) cluster of genes has been shown to be controlled by a regulatory hub (HS5-1) that overlaps a CTCF binding site, which stochastically activates different PCDHA genes by cohesin-mediated looping 37,38 . Using MethNet we were able to unbiasedly identify this regulatory hub (highlighted in blue). We also found another hitherto unknown regulatory hub that is enriched for H3K27ac and H3K4Me3 (highlighted in orange) upstream of PCDHA. This hub has predicted regulatory associations with genes from all three clusters of the Protocadherin family, PCDHA, PCDHB and PCDHG, and the region has been characterized as a schizophrenia risk locus that is linked with the regulation of all three protocadherin families in the context of brain 39 . In-situ Hi-C data reveals the high contact frequency of this CRE with all three Protocadherin families (as shown in red in the Hi-C heatmap).

High scoring MethNet associations are mediated by long-range chromatin interactions
To determine whether distal CRE associations identified by MethNet are brought into contact with their target genes by chromatin looping, we performed a promoter-capture Hi-C experiment that identified chromatin interactions from all promoters in the genome in two distinct well characterized A549 and K562 cell lines. The QC for the promoter-capture Hi-C is shown in Figure   S4. Promoter-capture Hi-C, which enriches for loops anchored at gene promoters, allows us to simulate parallel 4C-seq experiments and recover regions that are in physical contact with each promoter. Given that MethNet consists of common and cell type specific associations, we would not expect all associations from our pan cancer analysis to be validated with the promoter-capture Hi-C data from the A549 and K562 cell lines. Indeed, an example of multi-locus interactions for the TP53 gene promoter in A549 and K562 shown in Figure 6a, highlights the cell type specific chromatin contacts found in the two cell lines.
To determine whether associations with higher scores are more likely to be facilitated via chromatin interactions, we used the union of loops from the A549 and K562 cell lines. We found a robust correlation between association score and probability of loop formation across all levels (Figure 6b). This data demonstrates that stronger MethNet associations are more likely to act via chromatin contacts, while weaker connections may represent indirect effects.
Next, we investigated whether the regulatory potential of MethNet is predictive of chromatin hubs, i.e. CREs that form multi-locus loops 40,41 . Remarkably, MethNet's regulatory potential demonstrated a high predictive power for identifying multi-locus loops, achieving an area under the receiver operating characteristic curve (AUC) of 86%, when applying the strictest criteria (Figure 6c). Although our analysis was primarily focused on promoter hubs (due to the experimental bias of promoter-capture Hi-C), the predictive power of MethNet extended to nonpromoter regions when less stringent criteria were used for calling hubs (maximum AUC 84%, Figure S5). In sum, our data indicate that high-ranking distal MethNet associations are brought into close physical proximity by long-range chromatin interactions.

Perturbation of MethNet hubs results in altered target gene expression
To functionally validate the predictions generated by MethNet, we performed perturb-seq that combines targeted perturbation of genomic regions with single-cell RNA sequencing (scRNAseq). Compared to a regular CRISPR interference (CRISPRi) assay, perturb-seq enables the simultaneous investigation of multiple genomic regions, using a pool of guide RNAs (sgRNAs). For the CRISPRi, we used the dCas9-KRAB-MeCP2 system that blocks binding of other factors and methylates the DNA as well as histones of the targeted region 42 , inducing robust silencing. The transcriptomic readout aligns well with the MethNet pipeline, which predicts changes in the expression of target genes. Furthermore, perturbation of distal regulatory elements is more likely to lead to subtle gene expression changes rather than cell death 43 , particularly when there is more than one CRE controlling a gene target. Figure S6 shows the percentage of A549 cells transfected with dCas9-KRAB-MeCP2 at day 14 after puromycin selection.
In total we targeted 55 potential regulatory elements with 2 to 5 guides each. To address the inherent limitations of the assay, targets were selected based on the following criteria: (i) CREs were unmethylated in A549 cells to allow for methylation by the dCas9-KRAB-MeCP2, (ii) 2 to 5 high-quality guides could be selected using the CRISPick 44,45 scoring system, and (iii) putative target-genes were expressed at levels that were high enough to be detected by scRNA-seq. An outline of the perturb-seq validation experiment is shown in Figure 7a.
Although we designed our experiment so that each cell received a single guide, some cells contained multiple guides ( Figure S7). Therefore, we used a linear model with a complex design matrix to deconvolve the individual effects of each guide on gene expression, similar to the approach used by Dixit et al 46 . The results of this analysis are illustrated in Figure 7b. Among the 55 targeted CREs, 17 were validated as being involved in 22 functional associations predicted by MethNet. To assess the significance of the identification of 17/55 functional CREs, we performed a bootstrap analysis, by randomly shuffling the sgRNA labels across cells, while maintaining the total number of detected guides. We generated 20,000 bootstrap samples to estimate the null distribution (Figure 7b) and estimated that the probability of detecting 17 or more regulatory regions was highly unlikely to have arisen by chance (p = 0.0004). This statistical evaluation supports the robustness and significance of the regulatory regions identified through perturb-seq.
Out of the 17 functional CREs identified, 4 were found to be associated with more than one target gene. In Figure 8a we highlight a regulatory hub that corresponds to the promoter of BNIP2 (highlighted in orange). BNIP2 itself is not highly expressed so it was not captured by the perturbseq, but expression of two predicted target genes, GCNT3 and ANXA2 (red loops) were found to be significantly downregulated. Both of these genes are associated with poor prognosis in multiple cancers. ANXA2 is a member of the calcium-mediated phospholipid-binding protein family of annexins, involved in epithelial mesenchymal transition, cell proliferation and survival 47 , while GCNT3, is a member of the N-acetylglucosaminyltransferase family that is associated with cell proliferation, migration and invasion in non-small-cell lung cancer 48 . Our data indicate that the hub corresponds to a promoter region that is predicted to act as an enhancer for GCNT3 and ANXA2, and thus methylation leads to a drop in their expression as we observed. We also identified two genes, MYO1E and LDHAL6B that are predicted by MethNet to be targets of the hub (highlighted by gray loops in the screenshot) that we were unable to validate. Although both gene targets are expressed and unmethylated in A549 cells, unlike ANXA2 and GCNT3 they were not connected to BNIP2 by chromatin loops in our promoter-capture Hi-C. Chromatin looping is likely to be important for long-range hub mediated regulation, and we speculate that in the case of these two target genes, contacts are regulated in a cell type specific manner. Other MethNet associations validated by the perturb-seq experiment are shown in Figure S8. These include the target gene AMIGO2, a cell adhesion protein that is linked with cell survival and metastasis of multiple adenocarcinomas 49,50 GFBP4, a tumor suppressor acting as a double-negative feedback in AKT and EZH2 signaling 51 (Figure S8). Taken together, these data confirm the robustness of the MethNet pipeline in predicting regulatory associations.

Discussion
Here we introduce MethNet, a pipeline that combines gene expression and methylation data from the same TCGA cancer samples to identify novel regulatory elements that can control genes beyond their immediate genomic vicinity. MethNet's unbiased approach led to the identification of numerous regulatory features commonly associated with the role of methylation in gene expression. In addition, it revealed the existence of novel regulatory elements with potential clinical significance. MethNet's most intriguing attribute is the ability to uncover the presence of regulatory hubs that can influence the expression of multiple genes. These hubs displayed the expected characteristics of previously identified hubs 7 such as enrichment in active chromatin marks and chromatin looping connections between hub CREs and their target genes. MethNet also revealed other hitherto unanticipated features of hubs, including a relatively high proportion of insulator elements compared to regular CREs and low methylation variance, suggesting that hub regulation is influenced by TAD structure and that they are tightly regulated. Moreover, the methylation status of hubs showed a significant correlation with overall patient survival as well as cancer specific survival.
It should be noted that our modeling of epigenetic context was limited by the availability of data. The DNA methylation profiles provided by TCGA were generated using the 450k array, whereas larger 850k arrays are now widely used. Furthermore, TCGA lacks other relevant data modalities such as chromatin accessibility and protein binding profiles that are important for a more complete understanding of the regulatory landscape. As the volume and diversity of genomic data expand, MethNet can be adapted to incorporate additional data modalities, further enhancing its capacity to unravel the complexities of gene regulation. Any method that identifies regulatory elements by linking gene expression with methylation, has to deal with the problem of spurious correlations. This issue is exacerbated because nearby methylation sites tend to be correlated and can act synergistically. Current methods try to mitigate these issues using permutation or cross-validation techniques. Our modeling approach is similar to that of Methylation-eQTL in that it also uses TCGA data and penalized regression to identify novel regulatory associations. Both methods assume a linear additive model which is a pragmatic choice given the size of the currently available data set. However, while Methylation-eQTL uses a sequential lasso approach to deal with promoter sparsity in individual tumors, we employed elastic-net regularization which leads to a less sparse solution and combats spurious CRE associations by aggregating results across multiple cancers. Another important difference between the two approaches is that MethNet, in addition to focusing on individual CRE-promoter connections, uncovers regulatory hubs and highlights their relevance in the context of normal gene regulation and cancer. In contrast to Methylation-eQTL, which only performed cross-validation using an independent data set, our experiments provide causal and mechanistic support for our findings, by validating the functional and physical connections between hub CREs and their target genes.
MethNet identified 37,740 novel distal regulatory elements, including the novel regulatory hub in the PCDHA cluster. This hub mapped to a GWAS (genome-wide association study) schizophrenia risk locus, supporting the functional relevance of the pipeline. While GWAS studies have identified more than 100,000 disease-associated SNPs (single nucleotide polymorphism) over the past decades 52 , the identification and prioritization of causal germline SNPs remains challenging due to linkage disequilibrium and the size of the non-coding genome. Nonetheless, identifying causal SNPs is of paramount importance in understanding the underlying mechanisms of disease susceptibility and leveraging GWAS data. Towards this end, MethNet provides a valuable resource for prioritizing the verification of predicted non-coding disease variants which are likely to disrupt regulatory elements, versus the numerous 'proxy' variants in high linkage disequilibrium that share the same disease-association statistics, but have no functional effect.
The results from the Pan-Cancer Analysis of Whole Genomes (PCAWG) and TCGA consortiums, which include more than 2500 cancer samples, were somewhat disappointing as only a few noncoding somatic driver mutations were identified 53 . This can be partly explained by the fact that the statistical approaches to identify coding driver mutations are not tailored to the non-coding genome. Furthermore, it has been shown that methods based on functional screening followed by experimental validation are better able to identify bona fide non-coding driver mutations 54 , compared to approaches that focus on the accumulation/recurrence of non-coding mutations. In this context, MethNet, in addition to being able to prioritize candidate non-coding driver mutations, can also identify the target genes of disrupted regulatory elements, which is crucial for developing novel therapeutic drugs.
Many studies have pointed out that regulation is more complex than simple enhancer-promoter loops 9 . For example, super-enhancers (SEs), which are large regions (around 8kb) characterized by strong enrichment in chromatin marks like Med1 or H3K27ac 55 , are associated with highly expressed, tissue specific genes 56 . In addition, there are CREs with multi-locus contacts that influence the expression of numerous genes. Our analysis of the consensus regulatory network revealed the existence of both individual CRE-promoter contacts, as well as MethNet hubs that have an outsized influence on gene expression.
In this study, we took advantage of a high-throughput perturb-seq assay which combines CRISPRi screening with scRNA. This assay is well suited to the functional screening of distal regulatory elements since it allows the identification of relatively modest gene expression changes upon perturbation of these elements. While coding mutations or disruption of gene promoters can lead to dramatic changes in cell fitness, it has been shown that the regulatory potential of distal elements might be more subtle, although these are likely to be important in disease processes. Indeed, in our study, we did not observe effects on cell survival, reflected by a loss of sgRNAs in the final pool (data not shown).
Using perturb-seq we were able to demonstrate that about one third of the targeted regulatory hubs were associated with transcriptional changes in cancer-related genes, suggesting that a substantial proportion of MethNet hubs contribute to the cancer cell phenotype by activating or silencing oncogenic and tumor-suppressor genes, respectively. In addition, the validated MethNet associations were not trivial, but linked genes to regulatory elements over long distances (mean distance 366kbp) bypassing multiple potential regulatory elements in between. Among these validated MethNet hubs, we observed expression changes in at least one of the predicted target genes, while in others no alterations were detected. This could be because (1) Gene expression levels could be below the threshold detectable by scRNA-seq, and (2) MethNet hubs are defined using a pan-cancer approach, and while this allows us to prioritize the most robust candidates, it does not rule out context-specific gene regulation, as we highlight in Figure 7, where predicted hub-target gene associations did not overlap with chromatin loops in the cell line we analyzed. Therefore, although MethNet is useful in identifying high-confidence regulatory hubs, studying their gene regulation network in a tissue-specific manner is also important.
In conclusion, MethNet represents a powerful computational framework for the integrative analysis of DNA methylation and gene expression data. Our study demonstrates the effectiveness of MethNet in identifying novel regulatory associations across multiple cancer types as well as context-specific connections, highlighting the importance of functional integrative analysis over simple correlations. The performance of MethNet scales with data, indicating its potential for further expansion with larger datasets and inclusion of other data modalities. Identification of previously unreported hubs and their association with clinical outcomes sheds light on the intricate interplay between chromatin structure and global transcriptional regulation. Overall, MethNet represents a valuable resource for deciphering the complex regulatory mechanisms underlying gene expression in cancer, and for prioritizing the validation of germline and somatic non-coding disease-associated variants.

Author contributions
These studies were designed by Theodore Sakellaropoulos, Jane Skok, Aristotelis Tsirigos and Catherine Do. All the analysis was performed by Theodore Sakellaropoulos. The Perturb-seq experiment was performed by Guimei Jiang; the promoter-capture Hi-C by Giulia Cova, Sitharam Ramaswami and Dacia Dimartino, supervised by Adriana Heguy (GTC). scRNA-seq for the perturb-seq was performed by Peter Meyn (GTC). The paper was written by Theodore Sakellaropoulos, Catherine Do and Jane Skok.

Funding
These studies were supported by a P01CA229086 (JS, AT, AH) and 2R35GM122515 (JS). GC was supported by a fellowship from the NCC.

Promoter Capture HiC sample and library preparation
Promoter Capture Hi-C data was generated in K562 and A549 cell lines using the Arima Hi-C+ kit, the Arima Promoter Capture Module, and the Arima Library Prep Module according to the Arima Genomics manufacturer's protocols. The A549 and K562 cell lines were purchased from ATCC. Two replicates of the Hi-C were performed in each cell line, and for each replicate 1 million cells were collected and double cross-linked using 3mM DSG (disuccinimidyl glutarate), followed by 1% formaldehyde. Samples were sequenced with Novaseq Illumina technology according to standard protocols with around 300 million (150bp paired-ends) reads per sample. The library preparation and sequencing were conducted by NYU Langone's Genome Technology Center.
The 248 sgRNAs targeting high-confidence MethNet hubs and 5 non-targeting sgRNAs (negative controls) were designed using CRISPick 44,45 for the Human GRCh37 (hg19) assembly using parameter settings: CRISPRi, SpyoCas9/Chen (2013) tracrRNA (Suppl. Table 1). The lentiviral libraries were transduced into the A549-dCAS9-KRAB-MeCP cells according to Cellecta® protocols. Briefly, 10 5 cells/well were seeded into a 6 well-plate. The optimal MOI of viral particles (to reach ~30-40% of infected cells) were added. On day 3, 2 ug/ml of puromycin was added, resulting in >90% transduced cell selection as confirmed by cytofluorometry using RFP ( Figure   S7). Cells were expanded under puromycin selection until day 14. For scRNA-seq using the 10X Genomics technology, 25,000 cells were harvested. For optimal multiplet detection and optimal signal-to-noise ratios, cells were hashtagged using 5 cell multiplexing oligos purchased from 10X Genomics. The sequencing library was, then, prepared using the Chromium Next GEM Single Cell 3ʹ Kit according to the manufacturer's protocol and sequenced by NYU Langone's Genome Technology Center.

Public Data Collection
The results published here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. We collected paired gene expression and methylation data for all TCGA cancers with at least 100 samples, via the xenahubs portal 57 , yielding 8,264 samples in total. We used the pan-cancer batch-corrected normalized gene expression and beta values for methylation from Illumina's HumanMethylation450 BeadChip. Clinical data for those samples was downloaded from xenhubs where available.

Metadata for the CpG probes were collected from Illumina's annotation of HumanMethhylation450
BeadChip. The annotation was augmented by overlapping clusters with tracks from the UCSC genome browser. In particular, we used the ChromHMM chromatin annotation and the transcription factor binding site cluster tracks. Methylation clusters were annotated based on their most common labeling across all cell types. Hi-ChIP loops were downloaded from the supplementary material of the FitHiChIP paper 34 We used the combined loose and merged replicate (L+M) loops for 2.5kb bins for all cell lines (CD4-Naive, GM12878, K562). The primary data for Hi-ChIP loops were generated by Mumbach MR et al., 58 (GSE101498).

Predictive Modeling
We limited our analysis to protein coding genes. Using the gencode v30 gene annotation 59 and the Illumina CpG probe coordinates for hg19, we constructed a gene-probe adjacency network by connecting a gene with all probes within 1Mbp on either side of its transcription start site (TSS). This resulted in a network of 13M interactions between 20k genes and 450k probes. We collapsed probes that were within a diameter of 200bp (complete linkage) into probe clusters by averaging their beta value and linking them with all the genes interacting with the original probe set. This resulted in 300k CpG clusters, of a single probe or more, and a new network of 8M interactions.
We fitted a linear model for each gene and TCGA cancer independently, removing gene-cancer pairs with low variance (standard deviation less than 1). The variables of the model included all the methylation clusters neighboring the gene in the adjacency network as well as the sample type (tumor or metastatic vs normal) wherever there were multiple sources. To promote sparsity and better generalization in our models we used elastic net regularization to limit the number and effect size of the cluster-gene interactions using the model specification: Here i, g and c are indexes for the sample, gene, and cluster respectively. ygi is the log-normalized expression of g in sample i, βg0 is the basal level expression for sample i given it's clinical profile zi (tumor or normal), and xic are the coefficient and methylation status (beta value) of cluster c respectively, and kg is the number of clusters neighboring g. is the error term of the model. The R package glmnet 60 was used to fit the models and determine the trade-off ( , ) between accuracy and sparsity via 10-fold cross validation. This process resulted in a series of regulatory networks, one per cancer, connecting genes with methylation clusters if in the corresponding gene model the cluster had a non-zero coefficient.
For our pan-cancer analysis, we combined the interaction coefficients across all cancers by averaging across all cancers weighted by the corresponding model's performance as measured by R 2 .

MethNet Score
To quantify the contribution of each gene-cluster interaction to overall gene expression, we calculated a contribution score based on the following function: In contrast to the coefficient ( ), which quantifies how much the expression of gene g is altered if cluster c switches from a completely unmethylated to a completely methylated state, the score ( ) intends to capture the information gain of using MethNet's regulatory network instead of a naive model where all neighboring elements contribute equally (1/ ). Scores were computed based on the consensus network (pan-cancer analysis).
Finally, to identify the intrinsic potential of a regulatory element we regressed out the effect of the distance ( ), which act in an element agnostic manner, from the MethNet score using a GAM model with the absolute distance between TSS and methylation cluster as the only predictor, and taking residuals. Thus for the purposes of characterizing regulatory hubs we assume a distancebased, instead of an equivalence contribution of each region.
The MethNet potential of a cluster is defined as the sum of scores of all the interaction it's part of:

Regulatory effect as a function of distance
We analyzed the relationship between MethNet associations and distance to gene separately for CpG sites located within and outside the gene body.
For interactions occurring outside the gene body, we aggregated CpG clusters into two categories: CpG island and non-island regions. Subsequently, for each gene, we separately ranked these categories based on their distance to the gene body, where -1 is the closest upstream region, -2 the second closest etc, and accordingly 1, 2 for upstream regions. Two metrics were used: the mean coefficient of interaction and the probability of interaction, which were calculated for all clusters within the respective region.

Enrichment of Regulatory Potential
To assess the enrichment of regulatory potential, we performed an overlap analysis between CpG clusters and ChromHMM states. A linear model was fitted, with the "Low Signal" state serving as the reference baseline. Similarly, we conducted a similar analysis for transcription factor binding sites, allowing for the possibility of multiple factors binding to a single site to account for confounding effects. In this case, the basal state represented unbound chromatin.
The resolution of the H3K27ac loops was 2.5kb. We quantified the potential of each bin by aggregating the potential of the clusters contained within it. Bins were subsequently divided into quartiles based on the number of loops anchored within each. The enrichment of potential within each quartile was determined using a linear regression model.

Survival Analysis
We excluded from this analysis elements with low methylation variance (bottom 25%) and from the rest we selected all the hubs (730) and 16190 at random. A Cox proportional hazard model on overall survival (OS) was fitted for each element using the survival R package 61 . The methylation status (beta) was used as a predictor and we fitted a varying-slope model (coxph(Surv(OS.time, OS) ~ beta*strata(cancer)), to estimate both the mean effect of methylation across all cancers (Figure 4c) and the cancer specific effect ( Figure S3 for coefficients with less than adjusted p-value 0.05).

Analysis of promoter capture HiC
We called loops using the Arima pipeline with default parameters: https://github.com/ArimaGenomics/CHiC, which is based on HiCUP 62 and CHiCAGO 63 . We ran the analysis independently for the K562 and A549 cells and then used the union of loops for subsequent investigations. Quality control metrics and loop replication are shown in Figure S5.
The resolution of loop anchors was 5kb, so we only considered associations of length 10kb and above. We considered an association overlapping a loop if loop anchors overlapped with both the regulatory element and the gene. Enrichment was computed on the basis of associations: associations were grouped into bins based on their score and the ratio of associations overlapping with a loop was computed for each bin separately.
To estimate the predictive power of the MethNet potential to call hubs, we first computed the number of loops anchored at each bin and then we computed the potential of the bin by summing the potential of all the clusters contained in it. Finally, we called hubs for different thresholds and estimated the predictive power of the bin's potential using the AUC of the ROC curve using pROC packages 64 .

Analysis of perturb-seq
Cells were called and analyzed using the 10x Genomics Cell Ranger 7.0.0 for initial cell calling and protospacer calling. The results were further filtered based on the percentage of mitochondrial reads and total number of genes, to remove lysed cells, and based on the HTO tags to remove duplicates using the "hashedDrops" function of the DropletUtils package 65 and manually filtering to remove what appeared to be triplets per drop. After these filtering steps we ended up with 36,601 cells. The distribution of the number of expressed genes and number of reads per cell as well as the number of guides per cell are shown in Figure S5.
Subsequent analysis was based on the method suggested by Dixit et al. 46 In particular, we focused only on genes that were within 1Mb of a targeted region (as this is the maximum radius of MethNet association) and genes with a mean normalized expression above 0.0002. Since most cells had more than one sgRNA guide we analyzed the results for all the targets simultaneously. In particular, we identified sgRNA-gene interactions by fitting a linear model ( ) ∼ , where ( ) are the log-normalized counts (logNormCounts) and is a binary matrix where = 1 if the guide j is detected in the cell i. The models were fitted using the limma package 66 with bayesian shrinkage, and interactions were called using a threshold of 0.05 on the adjusted P-value. All recovered associations are shown in Figure S6. For the bootstrap analysis, we shuffled the rows of independently for each column and refitted the model. The metric we used to quantify the performance of MethNet was the number of targeted regions forming at least a single interaction at the 0.05 threshold. Since the confidence interval is affected by the number of cells transfected with each guide, this process controls for spurious interactions due to the differences of the cellbase of each guide.

Availability of data and materials
All the data generated in this paper have been submitted to the Gene Expression Omnibus (GEO) database under the superfamily accession number GSE236305. The promoter capture Hi-C and Perturb-seq data accession numbers are GSE235851 and GSE236304, respectively.    Figure S5: MethNet potential predictive power for chromatin hubs. We analyzed intergenic hubs separately from promoter hubs because the latter are enriched in our data because of the experimental design. As a result, we used thresholds that were more lenient.