The mutational burden within PDAC cis -regulatory regions
To identify likely pathogenic NCMs in PDAC, we retrieved SSM data from the ICGC Pancreatic Cancer Genome Project Australia (AU, n = 391 patients) and the Canada (CA, n = 268 patients) cohort. 1,379,638 and 2,211,000 somatic mutations were identified in the AU and CA cohort, respectively. After filtering out non-synonymous somatic mutations, 1,358,342 (98.5%) and 2,179,517 (98.6%) somatic NCMs were retained from the AU and CA cohort, respectively, for further analysis. This corresponds to an average of 3,701 (AU) and 8,132 (CA) NCMs per patient.
We wanted to focus on the NCM burden within CREs, specifically those enriched with H3K27ac, a chromatin feature associated with active enhancers19. We hypothesised that NCMs within these CREs may contribute to altering their function and target gene expression17,20. Using ChIP-seq datasets from seven PDAC cell lines21 and two patient-derived organoid samples22, we identified 404,415 enriched H3K27ac peaks across all samples. To consolidate H3K27ac peaks across the nine samples into one representative consensus region per loci, we stitched together quality peaks residing within 2,000bp of another (inter-peak distance), resulting in a total of 65,168 H3K27ac consensus peaks for further analysis (average peak length = 4,639 and SD = 7,949). This allowed us to narrow the search space for potentially important NCMs to ~ 10% of the genome. Patient somatic NCMs were then mapped to the consensus H3K27ac coordinates to obtain a list of NCMs in PDAC-specific CREs. From the AU cohort of patients, 101,209 somatic mutations were observed within 36,409 (55.9%) consensus peaks and 166,541 CA cohort mutations within 43,002 (66.0%) consensus peaks. Therefore, capturing 7.45% and 7.64% of all AU and CA NCMs, respectively (Fig. 1b).
Prioritisation of cis -regulatory regions enriched with putative functional NCMs
We next aimed to interrogate somatic NCMs residing within consensus H3K27ac marked regions. To ensure the study of a significant proportion of PDAC patients, we retained CREs with a patient mutation incidence of 2% or above (n ≥ 8), leaving 30,826 somatic mutations (AU cohort) across 1,711 consensus peaks/CREs and 64,867 somatic mutations (CA cohort) across 3,964 peaks (Fig. 1b). In total, 2.26% (AU) and 2.97% (CA) of the NCM burden remained to interrogate, similarly to observation in a previous study focusing on H3K27ac enriched elements18.
To prioritise the remaining CREs, we utilised two independent approaches: one measuring the functional effect of each NCM within a CRE and ranking them based on the median functional score of all NCMs; the other identifying CREs with significantly recurrent NCMs accounting for local background mutation rate, and replication timing (Fig. 1a). We carried out the first approach using the IW-scoring algorithm23, an integrative weighted scoring framework to score NCMs and prioritised elements with a median IW-score of two or above (corresponding to a p-value ≤ 0.1). From the remaining 1,711 (AU) and 3,964 (CA) peaks after filtering, we identified 14 CREs from the AU-cohort and 32 elements in the CA-cohort using the median threshold (Extended Data Fig. 1 and S1). This method prioritised CREs annotated to cancer-related genes such as the AP-1 transcription factor (TF) JUNB expressed in low-grade PDAC cells21,24, and GATA2, associated with high-grade PDAC21. Of the 46 prioritised CREs, five regions were shared between the AU and CA cohorts (Fig. 1c). These five CREs reside within the introns of the oncogenic long non-coding RNAs (lnRNA) MIR100HG25 and HOTAIR26 from and including the HOXC cluster of homeobox genes27, PDAC associated TFs FOXA128 and FOXP129 and ferroptosis related TF, NFE2L130 (Fig. 1c).
To further validate the putative significance of the NCMs within these five CREs, we compared the IW-score of NCMs residing within the H3K27ac positive regions to NCMs in immediate flanking sequences negative for H3K27ac marks. We observed a statistically significant higher IW-score of NCMs within H3K27ac enriched regions compared to those in flanking H3K27ac negative sequences (Extended Data S2), indicating the putative enhancer-associated NCMs have higher predicted functional consequences than mutations located outside these CREs. We also verified these findings with an independent scoring algorithm LINSIGHT, which scores variants on the likelihood of deleterious fitness consequences based on patterns of polymorphism and divergence from closely related species31. The LINSIGHT model demonstrated a significant increase in the selective constraint (i.e., more deleterious on fitness) of H3K27ac-associated NCMs compared to NCMs in nearby H3K27ac negative regions (Extended Data Fig. 2).
Using the second approach to identify significantly recurrently mutated CREs, we implemented LARVA32. The LARVA model yielded 68 (AU cohort) and 71 (CA cohort) candidate CREs which were significantly recurrently mutated in relation to nearby background sequences (Benjamini-Hochberg (BH) adjusted p ≤ 0.01). These significant regions collectively harboured 1,842 and 2,258 NCMs in the AU and CA cohorts. Many NCMs were located proximally to several well-known genes implicated in PDAC, for example, an intergenic regulatory region in proximity to the miRNA: miR-21 and the Wnt/β-catenin signalling protein gene WNT7b (Extended Data S3). Nine significantly mutated CREs were shared between AU and CA cohorts. These recurrent CREs included regions proximal to the TF genes TBX3 and BNC1, previously reported in PDAC33,34. NCMs were also located proximal to the adhesion molecule PXDN35 and transmembrane protein TENM336, the lncRNA gene TBX5-AS137, and microRNA, miR-1305 (Fig. 1c).
Notably, the MIR100HG enhancer cluster was the only one prioritised in the two approaches, but consisting of two separate CREs (Fig. 1c, Extended Data Fig. 3). Overall, our computational strategy has revealed NCMs enriched within or proximal to PDAC or cancer-related genes, including candidates identified from a previous non-coding study in PDAC16.
Proximal genes to enhancer NCMs are associated with transcription and PDAC-linked biological processes
We next performed pathway enrichment analysis based on the annotated genes proximal to CREs identified by the two in-silico approaches using the DAVID tool38. Inputting 95 annotated genes associated with 41 CREs identified by the IW-scoring approach, we observed significant enrichment in several gene families and regulatory processes, including homeobox genes, pattern specification, embryogenesis and transcriptional regulation pathways (Fig. 1d). Additional pathway analysis based on 212 genes annotated to the 130 recurrently mutated CREs identified significant enrichment in core molecular pathways including cell adhesion, epithelium development, cell proliferation, transcription, apoptotic processes and regulation of chemotaxis (Fig. 1d). The involvement of biological processes, such as embryogenesis, apoptosis and cell adhesion, has been reported in a previous genomic landscape study39. Furthermore, our findings complement Feign et al. in identifying NCMs significantly associated with homeobox genes and transcriptional regulation16. Our results suggest a convergent mode for CRE-associated NCMs in relation to biologically relevant coding genes in PDAC.
Enhancer NCMs show altered transcriptional reporter activity
To determine the effect of NCMs on the transcriptional regulatory activity, we performed luciferase-based enhancer reporter assays for a subset of NCMs. We selected twelve NCMs from two CREs identified from the first approach (IW score), comprising 11 single nucleotide variants (SNV) and a single 4bp deletion. Five SNVs were selected from the third intron of the FOXP1 gene, and seven NCMs in the third intron of the lncRNA MIR100HG (Fig. 1e). Interestingly, the 2kb region surrounding the seven NCMs at the MIR100HG locus lack detectable H3K27ac and H3K4me1 marks in most of the cell lines, except those derived from high-grade PDAC cells PANC-1 and PT45P1, suggesting this putative active enhancer is specific to high-grade PDAC (Fig. 1e, Extended Data Fig. 3). Luciferase reporter assays were carried out in the high-grade PDAC cell line PANC-1 and easily transfectable cell line HEK293T. Within the MIR100HG CRE, NCMs (MUT 3, 6 and 7) and (MUT 1–3) showed significant increases in reporter activity in HEK293T and PANC-1 cells, respectively (Fig. 1f). Overall, all NCMs at this MIR100HG CRE showed an increase in luciferase activity compared to WT sequences, suggesting NCMs within this CRE are potentially gain-of-function, i.e., increase regulatory activity. The ~ 2kb regulatory element surrounding five selected NCMs within the third intron of FOXP1 was positive for H3K27ac marks in six PDAC cell lines (except for MIA-PaCa2 cells), and two patient-derived organoid samples (Extended Data Fig. 4). Among the five NCMs tested, two NCMs in HEK293T cells and three in PANC-1 cells significantly altered luciferase expression. Most notably, mutation 3 (chr3:71104908:C > T, IW-score = 5.20, p = 0.006, LINSIGHT score = 97.2%) significantly decreased reporter gene expression in both cell lines (Fig. 1f). Interestingly, all five NCMs within the FOXP1 putative enhancer demonstrated concordance in the overall transcriptional regulatory activity in both cell lines.
STARR-seq assays prioritise a subset of 43 NCM candidates for further validation
Next, we screened a larger set of NCMs within consensus CREs using the high-throughput approach, Self-Transcribing Active Regulatory Region sequencing (STARR-seq)40. To focus on NCMs with the strongest evidence of predicted function, we retained 504 NCMs with a variant allele frequency above 20% and strong TF binding strength as predicted by motifbreakR41. Of the 504 NCMs, binding motifs of 258 TFs were strongly predicted to occupy these mutation sites. Moreover, among the 73 NCMs identified by the first approach (IW score), 47 (64%) NCMs were predicted to cause TF-motif gain and 26 (36%) loss-of-motif (break). Among the 431 NCMs selected from the second approach (LARVA), 216 (50%) NCMs caused predictive gain and 215 (50%) loss of motif changes. We included 83 single base indels, resulting in 587 candidate NCMs in the final STARR-seq library (Fig. 2a).
We designed ten 230bp oligos per NCM, five for each NCM and five for the corresponding wild type (WT). One oligo represented the NCM in the middle and four oligos had a 10 bp sliding genomic window (SW) in either direction from the centre of the oligo (Fig. 2a, see Methods). A further 210 positive (PDAC enhancers) and negative (no enhancer features) control oligos were included in the library, resulting in a pool of 6,082 oligos. Sequencing and quality analysis of the cloned STARR-seq plasmid library demonstrated good complexity and accuracy (Extended data Fig. 5), with comparable outcomes to a previous MPRA study using synthetically designed oligos42,43. We performed two biological replicates of STARR-seq by transfecting the PANC-1 cell line (see methods)44. After filtering low-quality reads across samples, we observed a good concordance between replicates (Fig. 2b). As expected, positive control sequences showed significantly higher reporter activity compared to negative controls (Fig. 2c).
We next tested the significance between mutant (MUT) and WT constructs on reporter gene expression across replicates. A total of 217 plasmids (representing 155 NCMs) showed significant differential enhancer activity (log2 fold change − 1.54 to 3.53, Student’s t-test, p < 0.05). 95 (61.3%) NMCs showed significantly increased enhancer activity, while 60 (38.7%) mutations showed a significant reduction in enhancer activity in comparison to WT sequences (Fig. 2c and 2d). Interestingly, 36 CREs harbouring indels showed significant fold changes at similar activity to SNVs (mean log2 FC 1.07). Despite the differences in assays and genomic context, we observed concurrent directional changes in enhancer activity at NCMs assayed by luciferase reporter assays and sequencing-based high-throughput STARR-seq (Extended data Fig. 6).
Focusing on the most significant alterations between MUT and WT alleles (t-test, p < 0.01), we highlighted 43 mutations, 33 of which demonstrated an increase in reporter activity and 10 with an observed reduction (Fig. 2e). Notably, the differential activity changes between MUT and WT in 13 NCMs were significantly altered in three or more independent STARRs-seq plasmids (p < 0.05). Similarly, 31 NCMs were significantly altered in two independent SWs demonstrating concurrent directional activity changes. Eight of the 43 NCMs were located within an enhancer cluster (observed in low-grade and MiaPaCa2 cells) upstream of the BNC1 gene (Extended data Fig. 7a). The NCMs proximal to BNC1 significantly increased reporter gene expression in PANC-1 cells in comparison to WT sequences (Fig. 2d). Assessing the expression of genes within 1Mb of this consensus peak by comparing MUT and WT patient GEPs, we did not observe a difference in the expression of BNC1, previously reported to be methylated in early stage PDAC patients45. However, we observed a significant increase in the expression of nearby genes BTBD1 (p = 0.003, ~ 234kb from the middle of the consensus peak to BTBD1 TSS), important in cell survival, the ubiquitin/proteosome degradation pathway and mesenchymal differentiation46 and FAM103A1 (p = 0.008, ~ 316kb) which encodes an important subunit for the 7-methylguanosine cap added to the 5' end of mRNA and an essential component for gene expression47. Patient GEP analysis also revealed a significant decrease in the transmembrane protein TM6SF1 (frequently hypermethylated48,49, p = 0.0003, ~ 137kb) between MUT and WT patients (AU cohort, Extended data Fig. 7b), overall suggesting these NCMs may exert their regulatory potential in a more distal manner. Additional significant increases in reporter gene expression were observed proximal to the PDAC-associated TF TBX50 (7 NCMs) and in the introns of lncRNA MIR100HG25 (3 NCMS, Fig. 2d).
To assess the putative biological implications of these top-performing STARR-seq NCMs, we took a closer look at the TF-motif binding predictions. From the 35 NCMs in the top 43 STARR-seq performing mutations with TF-motif predictions, 21 were characterised as TF binding motif-gain (creating de novo TF binding motifs), while 14 were TF binding motif-loss. For example, one gain-of-function NCM proximal to TBX3 (chr12:115067012:C > A) was predicted to create a binding motif for the oncogenic TF JUN (Fig. 2d and Extended Data Fig. 7c). This NCM led to a mean log2 fold-change of 3.69 in STARR-seq reporter gene expression across all five SWs (Mann Whitney U test, p = 0.016). As expected, JUN was highly expressed in PDAC patients based on the patient GEP in the AU cohort (Fig. 2f)51. The most significant loss-of-function was observed in a NCM located in the intron of FOXP1 (chr3:71123616:G > T) supported by three significant SWs (p < 0.05, average log2 fold change across SWs = -1.36). At this site, the binding motif of an unfolded protein response (UPR) mediating TF, the activating TF-3 (ATF3)52, was predicted to be disrupted (Extended Data Fig. 7d) and was found to be moderately expressed in the PDAC patient GEP (AU cohort, Fig. 2f). Furthermore, the top two NCMs located in the MIR100HG enhancer cluster also showed strong effects on TF binding: the first mutation (chr11:122010557:C > T) demonstrated a gain of TF motif, creating a de novo binding motif for NR6A1, a nuclear receptor family member; while the second mutation (chr11:122025440:G > C) was predicted to disrupt the binding motif for SOX10 (Fig. 2g), a reported tumour suppressor through the suppression of the Wnt/β-catenin pathway in digestive cancers53. We observed that NR6A1 and SOX10 TFs were expressed at moderate levels in PDAC patients (AU cohort, Fig. 2f). Overall, using the STARR-seq assay enabled the prioritisation of CRE-associated NCMs for further investigation.
CRE cluster harbouring NCMs located at the MIR100HG locus regulates genes in cis
The two computational approaches used in this study identified the lncRNA MIR100HG locus as a significant candidate for harbouring NCMs in separate CREs in each approach. Notably, MIR100HG is host to the oncogenic miR-s pre-miR125b-1 and pre-miR-100, previously implicated in PDAC25,54, and they modulate (including MIR100HG) in a pro or anti-tumourigenic manner depending on the cancer25,54−59. It hosts the tumour suppressors pre-miR-Let7a-225 and the pro-apoptotic protein BLID60, located within intron three of MIR100HG (Fig. 3a).
Next, we investigated the functionality of three CREs harbouring NCMs at the MIR100HG enhancer cluster using a CRISPRi approach recruiting the dCAS9/KRAB repressor to NCMs and CREs of interest61 (Fig. 3a and b). The first region located ~ 2kb away from the hosted pre-miR-125b-1 in the third intron of MIR100HG harboured NCMs identified from the first in silico approach. CRISPRi with a pool of four independent lentiviral guide RNAs (G 1–4) were selected close to NCMs that were shown to alter enhancer activity in either luciferase or STARR-seq experiments (Fig. 1e, 3b). Two guides (G 5–6) were designed to target region two harbouring five NCMs (CRE-two), including the most significant NCM identified to drive reporter enhancer activity using STARR-seq (M20 in Fig. 2e and Fig. 3b). An additional two guides (G 7–8) were designed to target the third region harbouring six NCMs (CRE-three), including a gain-of-function NCM from the most significant STARR-seq candidates (M37 in Fig. 2e).
CRISPRi, followed by RT-qPCR, showed a significant reduction in MIR100HG expression in all three CREs in this enhancer cluster in comparison to dCAS9/KRAB negative controls (Fig. 3b). This data suggests that these CREs function as active enhancers to regulate the expression of MIR100HG. Analysis of looping interactions from the 4D genome62 and integrated method for predicting enhancer targets (IM-PET)63 data in PANC cells indicated interactions between CRE-two and the promoter of UBASH3B located upstream of MIR100HG (Extended Data Fig. 8a). UBASH3B has been reported to inhibit the endocytosis of the epidermal growth factor (EGFR), an essential component in the development of pancreatic precursor lesions64–66. RT-qPCR analysis demonstrated a significant decrease in UBASH3B expression with the CRE-two CRISPRi compared to controls (Fig. 3b). CRE-three shows interactions with the promoter of ARHGEF12 (Extended Data Fig. 8b). ARHGEF12, a guanine nucleotide exchange factor (GEF), activates Rho A, a key regulator of cytoskeleton organisation and ROCK1/2 induced extracellular matrix remodelling, associated with poor outcomes in PDAC patients67. CRE-three CRISPRi resulted in a significant decrease in ARHGEF12 levels compared to controls (Fig. 3b). These results suggest that CRISPRi-based perturbation of CRE-two and three leads to downregulation of genes located in cis, although to a less extent compared to the reduction in MIR100HG expression.
CRISPRi perturbation of MIR100HG CREs alters core PDAC signalling pathways and cell motility.
We performed RNA-seq to evaluate the global mRNA changes in CRISPRi-targeted CRE-two and -three clones (Fig. 3b). Principle component and correlation analyses showed CRISPRi of CRE-two and -three shared similar gene expression programmes (Fig. 3c and 3d). Differential expression (DE) analysis identified 98 and 102 significant genes in the perturbation of CRE-two and -three clones compared to the control, respectively (FDR < 0.05 and absolute log2 FC > 1). Of them, 59 DE genes were shared between the two clones (Fig. 3e). We also observed a significant reduction in MIR100HG RNA-seq expression in both targeted CREs, consistent with the qPCR data (Fig. 3f).
Gene set enrichment analysis (GSEA)68 against the MsigDB Hallmark69 and oncogenic signature gene sets were then performed between the two CRISPRi groups and the dCas9-KRAB control (Fig. 4a). In both CRISPRi perturbations; we observed a comparable and significant downregulation of important PDAC hallmark gene sets involved in KRAS signalling70, UPR, reactive oxygen species (ROS)71 and TNFα signalling72 (Fig. 4a and 4b). Oncogenic signatures associated with critical drivers KRAS73, P53, epithelial-to-mesenchymal transition (EMT) inducing TGF-β and cell survival and proliferation-related MTOR73 pathway genes were significantly reduced in both inhibited cis-regions. In contrast, migration inhibiting cAMP74 and interestingly pro-EMT related LEF175 signatures were significantly upregulated (Fig. 4a). Collectively, the CRISPRi perturbation of two CREs at MIR100HG led to a significant reduction in key oncogenic molecular mechanisms observed in PDAC, resulting in a more favourable phenotype.
TGF-β regulates MIR100HG transcription and thus the release of its hosted miRs, inducing EMT, encouraging cell motility and metastasis25. Here, we identified many TGF-β related genes such as FGF176, KDM6B77, LIF78, PIK3CD79, PXDC1 and TAGLN80 were significantly downregulated in the two CRISPRi groups compared to the control (Extended Data Fig. 9). Hence, we further aimed to validate the reduction in TGF-β signalling observed with GSEA enrichment by using wound healing assays (Fig. 4c). Over 48-hours, the inhibition of CRE-two (G 5–6) resulted in a significant reduction in cell motility in comparison to controls, corroborating with a stronger gene enrichment reduction in TGF-β and EMT signalling compared to CRE-three inhibition (Fig. 4a). Similar but not significant changes in cell motility were observed in PANC-1 cells inhibited at CRE-three (G 7–8) (Fig. 4d). These results suggest a CRISPRi perturbation of CREs harbouring NCMs in the third intron of MIR100HG can decrease the migration ability in PANC-1 cells.
Mutation occurrence of functional CREs in other solid cancers
Lastly, we explored the NCM burden of our top five prioritised regions (obtained from the first approach) in other cancers. We analysed the mutational frequency of these CRE-associated loci in seven other solid tumours using SSM data from the ICGC in oesophageal (ESAD), liver (LIHC), breast (BRCA-UK), ovarian (OV), prostate (PRAD-CA, PRAD-UK), colorectal (COAD) and gastric cancer (STAD) cohorts. The HOTAIR/HOXC CRE had the highest mutation frequency of NCMs across oesophageal (16.6%), liver (13.2%), prostate (7.5%) and ovarian (19.4%) cancers along with PDAC (5–12%, Extended data Fig. 10). However, a low mutation frequency was observed in gastric, breast and colorectal cancers below 2%. The FOXA1 CRE was predominately mutated in prostate cancer at an incidence of ~ 16%, followed by liver, ovarian and oesophageal cancers at a frequency of ~ 5%, higher than that observed in PDAC (2%). Interestingly, this regulatory region and NCMs have been recently reported in prostate cancer and are correlated with decreases in FOXA1 expression and cell growth81. For the MIR100HG CRE, oesophageal and prostate cancer (UK cohort) showed the highest incidence at 14.2% and 5.7%, respectively, and liver and ovarian cancers showed a similar mutational incidence to the PDAC cohorts (2–3%). Other cancer types, such as breast, gastric and colorectal, had a very low to no mutational burden within this MIR100HG CRE (Extended Data Fig. 10c). The FOXP1 CRE had the highest mutation frequencies in the liver, oesophageal and ovarian cancers (6–8%), but the NFEL2 CRE generally had a much lower mutation frequency across all cancers, with a mutation burden of 2–3% in liver and oesophageal cancers, similar to that in PDAC. Our results suggest that several CREs identified in this study were also frequently mutated in other cancers. NCMs within these CREs may also play a functional role in contributing to these malignancies, as already documented in prostate cancer81.