CAGE-seq identifies non-promoter associated capped 3'UTR-derived RNAs
We and others [7–12] have previously reported the presence of CAGE-seq signals outside of annotated promoter regions in thousands of protein-coding genes. However, their origin or biological relevance has not been thoroughly interrogated. Here we first confirmed the prevalence of these signals in human cell lines using CAGE data provided by the ENCODE consortium. As expected, we detected a similar proportion of CAGE signals per genomic region in two different human cell lines (HeLa and K562) and showed that the CAGE signal is highly reproducible across replicates. This included the library size, number of uniquely mapped CAGE reads and distribution of the 5’ CAGE reads mapping to different genomic regions (Fig. 1a, S1a, b, c, d). A similar genomic distribution has also been detected by other groups before, using the same CAGE-seq protocol [19].
The relative intensities of CAGE signal detected at different genomic regions depend on the priming method for reverse transcription (oligo-dT, random hexamers, or mixtures thereof in different ratios) [12]. Oligo-dT priming quantitatively favours shorter transcripts, while the reverse is true for random priming. We subsequently verified that CAGE signals within 3’UTRs are most readily detected when a combination of oligo-dT and random primers is used, with an optimal ratio of 1 to 4 of oligo-dT to random primers [19, 20] (Fig. S1e). Notably, the same ratio was used in the ENCODE CAGE samples analysed in this study.
The CAGE signal is the strongest at 5'UTRs of known protein-coding genes [19] (Fig. 1a, ~ 65% of total reads). While low-level non-promoter CAGE signal (sometimes referred to as “exon painting” [4, 21]) can be detected along the entire length of transcripts, the signal at 3’UTRs is consistently present and occurs in localised clusters, similar to CAGE signals at promoters (see Fig. S1l for examples). We focused on the 3’UTR region, which contains a substantial proportion (~ 11%) of the total CAGE reads (Fig. 1a), the implications of which are unknown. To identify robust CAGE signals with sufficient sensitivity, we used a 20 nt window requiring at least two 5' reads overlapping from two different replicates for each cell line separately. This revealed 32,065 3’UTR CAGE clusters across all samples (Table S1). As expected, correlation between technical replicates was high (Pearson correlation (PC) > 0.9) for all CAGE signals, independently of genomic location (Fig. S1g-j). Moreover, expression of the 3’UTR CAGE clusters was highly reproducible between HeLa and K562 samples (PC ~ 0.98, Fig. S1f), suggesting biological relevance. Correlation across cell types was much higher for 3’UTR clusters than for the 5’UTR CAGE (PC ~ 0.79, Fig. S1g) and CDS CAGE (PC of 0.61, Fig. S1h) signals, and comparable to that of the intronic CAGE signal (PC ~ 0.93, Fig. S1i). This may suggest that 3’UTR CAGE signals originate from a more stable and/or less tissue-specific subset of transcripts. Together these analyses show that the transcripts whose 5’ ends map to the 3’UTRs of protein-coding genes are highly reproducible across cell types, and that CAGE is a robust method for their quantitative detection.
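The replicate-support filter described above can be illustrated with a minimal Python sketch. It assumes fixed, non-overlapping 20 nt bins and interprets the criterion as requiring 5' read ends from at least two distinct replicates within a bin; the function name, the input layout and the binning scheme are illustrative assumptions, not the pipeline used in this study.

```python
from collections import defaultdict

WINDOW = 20  # window width in nt, as stated in the text


def supported_windows(read_starts_by_replicate, window=WINDOW):
    """Return sorted (chrom, strand, window_start) keys that contain
    CAGE read 5' ends from at least two different replicates.

    read_starts_by_replicate: dict mapping a replicate name to an iterable
    of (chrom, strand, pos) tuples, one per read 5' end (0-based).
    Fixed bins are an assumption made for this sketch.
    """
    replicates_seen = defaultdict(set)
    for replicate, starts in read_starts_by_replicate.items():
        for chrom, strand, pos in starts:
            key = (chrom, strand, (pos // window) * window)
            replicates_seen[key].add(replicate)
    # Support from two replicates implies at least two reads in the window.
    return sorted(k for k, reps in replicates_seen.items() if len(reps) >= 2)
```

A sliding window or a dedicated peak caller would place window boundaries differently; the fixed bins here only serve to make the support criterion concrete.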
3'UTR-derived RNAs are confirmed by RNA-seq, qPCR and long-read CAGE
Next, we aimed to confirm the existence of these 3’UTR capped RNAs using independent methods. First, we asked if these fragments could be identified in transcriptomic (RNA-seq) data. For this we compared the CAGE signal with the RNA-seq signal of two different cell lines. To categorise CAGE peaks we first used the paraclu [22] peak caller to identify clusters of 5’ ends of capped RNAs, and within each cluster we selected the position with the highest signal as the dominant CAGE peak. For comparison, we processed paired-end RNA-seq data from the same K562 and HeLa cell lines, then plotted read-starts and read-ends relative to the dominant 3’UTR CAGE peak per transcript (Fig. 1b, in blue, and S1j). Both RNA-seq samples showed highly reproducible enrichments of read-ends coinciding with dominant 3’UTR CAGE peaks. Thus, the 3’UTR CAGE peaks are supported by the read-ends from reverse-stranded RNA-seq data, which suggests that the signal could originate from post-transcriptional cleavage sites. Notably, there is also a small enrichment of RNA-seq read-starts downstream from the 3’UTR CAGE peaks, which could represent the same RNA fragments detected in the CAGE samples (Fig. 1b, in yellow). More importantly, these findings demonstrate that 3’UTR capped fragments identified by CAGE can also be detected by other, methodologically independent, high-throughput sequencing methods such as RNA-seq.
We next aimed to confirm the presence of transcripts initiating at the 3’UTR CAGE peaks by an alternative experimental approach, not dependent on RNA library creation or high-throughput sequencing. We focussed on two genes, CDKN1B/p27kip1 (p27) and JPT2, each of which contains a dominant CAGE peak located within the 3’UTR region, supported by highly reproducible read coverage for CAGE and RNA-seq in both K562 and HeLa cells (see Fig. S1m for example). Separate sets of primers were designed to quantitatively PCR-amplify ~ 150 bp sequences within 300 nucleotides upstream and downstream of the 3’UTR CAGE peaks in CDKN1B and JPT2 (see Methods, Fig. S1k). In agreement with CAGE and RNA-seq data (Fig. S1m), RT-qPCR detected higher levels of these transcripts with the downstream primers than with the upstream primers (Fig. S1k). A similar enrichment in RT-qPCR signal (Fig. S1k) was observed with downstream primers in comparison to primers designed to amplify a ~ 150 bp region spanning the CAGE peak in CDKN1B, suggesting an accumulation of 3’UTR fragments in comparison to full-length mRNAs.
Treatment of the samples with Terminator™ 5’-Phosphate-Dependent Exonuclease (TEX), a 5′→3′ exonuclease that digests RNA with a 5′ monophosphate but not RNA with a 5′-triphosphate, 5′-cap or 5′-hydroxyl group, had little or no effect on the amount of JPT2 and CDKN1B transcripts detected with primers amplifying either side of the 3’UTR CAGE peak within these cells. This was in sharp contrast with the known uncapped 3’ fragment of SLC38A2 mRNA, previously described by Malka et al. [8], which was, as expected, sharply reduced upon TEX treatment (Fig. 1c). These results lend further support to the conclusion that all the quantified transcripts, including the 3’UTR fragments, are capped.
We further confirmed that 3’UTR-derived RNAs could be detected by long-read Nanopore-sequencing CAGE (Fig. 1d). We were provided with data from cortical neuron samples by the FANTOM6 consortium for 10 genes that contain HeLa and K562 3’UTR CAGE peaks (Fig. 1d, S1n). In all 10 examples, full-length long-read CAGE identified reads spanning from the start of our identified 3’UTR CAGE peaks to the end of the annotated transcripts, whereas for most of these genes, reads spanning between the 5’CAGE and the 3’CAGE signal were absent (Fig. S1n). These observations suggest that the capped 3’UTR-derived RNAs originate from the full-length mRNA, whilst fragments upstream of the 3’CAGE may not be stable. Notable exceptions are DDX17 and GHITM, but it is unclear whether these sequences upstream of the 3’CAGE result from alternative polyadenylation or are products of the cytosolic cleavage of the full-length mRNA.
Capped 3’UTR-derived RNAs are evolutionarily conserved and generated post-transcriptionally
We next wanted to investigate whether the 3’UTR CAGE signals originate from post-transcriptionally capped RNA fragments. First, we explored whether there is evidence of nuclear Cap Binding Complex (CBC) binding to the capped 5’ ends of 3'UTR fragments, as this complex is known to bind to 5’ ends of nascent protein-coding mRNA transcripts in the nucleus. Individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) is a method that identifies protein-RNA crosslinking interactions with nucleotide resolution in a transcriptome-wide manner. We examined CBC-iCLIP data from HeLa cells, in which the authors targeted the nuclear cap-binding subunit CBP20 [23]. CBP20 is a nuclear component of the cap-binding complex (CBC), which binds co-transcriptionally to the 5' cap of pre-mRNAs and interacts directly with the m7-G cap [24, 25]. The CBP20 RNA binding data were analysed using a standard iCLIP processing pipeline, in which the nucleotide preceding the cDNA-start position after PCR duplicate removal is reported as the crosslinking position (see Methods). The CBP20 crosslinking positions were then screened across all dominant 5’UTR and 3’UTR CAGE peaks per transcript. As expected, CBP20 crosslinks were enriched around the dominant 5’UTR CAGE peaks where the TSS of full-length transcripts is positioned. However, the enrichment was very weak at the non-promoter 3’UTR CAGE peaks (Fig. S2a). This strongly indicates that the 3’UTR capped fragments identified by CAGE are not bound by the nuclear CBC, further suggesting that they are likely a product of an independent post-transcriptional processing pathway.
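The crosslink-assignment rule mentioned above (the nucleotide preceding the cDNA start, after PCR duplicate removal) can be sketched in Python as follows. This sketch assumes that reads are given in 0-based half-open coordinates and that PCR duplicates are reads sharing the same crosslink position and the same random barcode (UMI); the record layout and function name are hypothetical and do not describe the actual pipeline referenced in Methods.

```python
from collections import Counter


def crosslink_counts(reads):
    """Count unique cDNAs per crosslink position.

    reads: iterable of (chrom, strand, start, end, umi) with 0-based,
    half-open coordinates (an assumed layout for this sketch).
    The crosslink is the nucleotide preceding the cDNA start in transcript
    orientation: start - 1 on the plus strand, end on the minus strand.
    """
    seen = set()
    counts = Counter()
    for chrom, strand, start, end, umi in reads:
        xl = start - 1 if strand == "+" else end
        key = (chrom, strand, xl, umi)
        if key in seen:  # same position and same UMI: treat as PCR duplicate
            continue
        seen.add(key)
        counts[(chrom, strand, xl)] += 1
    return counts
```

The resulting per-position counts are the kind of input that could then be aligned to dominant CAGE peaks to draw enrichment profiles.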
To further explore whether the 3’UTR capped fragments were generated co-transcriptionally, we investigated the location of cap signals in nascent RNAs identified in global nuclear run-on sequencing experiments of 5’ capped RNAs (GRO-cap) [26]. As anticipated, strong GRO-cap signals overlapped with CAGE peaks in 5’UTRs and, to a lesser extent, with introns and upstream CDSs but were notably absent around 3’UTR CAGE peaks (Fig. S2b). This also indicates that capping of 3’UTR fragments occurs post-transcriptionally.
Additionally, we analysed capCLIP data from HeLa cells. capCLIP is a version of CLIP that targets the translation initiation factor eIF4E, a cytoplasmic protein which binds the 7-methyl-GTP moiety of the 5′-cap structure of RNAs to facilitate the efficient translation of almost all mRNAs [27, 28]. The capCLIP data were analysed following the same methodology as CBP20-iCLIP. The enrichment of capCLIP signal at the non-promoter 3’UTR CAGE peaks was much stronger than in the CBC-iCLIP (Fig. S2a, S2c), which demonstrates that the cap of the 3’UTR-derived RNAs is strongly bound by cytoplasmic eIF4E, but not the nuclear cap-binding protein CBP20, suggesting that these RNAs are predominantly cytoplasmic. Furthermore, we investigated ribosome footprinting data [29] to interrogate whether the 3'UTR-derived RNAs, which are bound by eIF4E, are translated. However, we did not find evidence of ribosomal binding to these RNAs that would suggest active translation (data not shown).
Next, we investigated the evolutionary conservation of 3’UTR-derived RNAs. Utilising UCSC conservation tracks, we computed conservation scores around 3’UTR CAGE peaks. To exclude the influence of coding regions and transcript termination sites, we specifically selected 21,831 3’UTR CAGE peaks positioned at least 150 bps away from the 3’UTR bordering region (≥ 150 bps downstream from CDS and ≥ 150 bps upstream from transcript termination). Remarkably, our findings reveal that the exact 3’UTR CAGE peaks exhibit lower conservation compared to the surrounding regions. However, the region immediately downstream of the 3’UTR CAGE peaks, corresponding to the "body" of 3’UTR-derived RNAs shows a notable enrichment in conservation scores, suggesting a potential functional contribution (Fig. S2d).
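The distance filter used to select internal 3’UTR CAGE peaks can be written as a short Python check. This is a sketch under the assumption that all positions are expressed in transcript coordinates (increasing 5' to 3'), which avoids strand handling; the function and argument names are illustrative only.

```python
MIN_DISTANCE = 150  # minimum distance in bp, as stated in the text


def is_internal_peak(peak_pos, cds_end, transcript_end, min_distance=MIN_DISTANCE):
    """Return True if a 3'UTR CAGE peak lies at least `min_distance`
    downstream of the CDS end and at least `min_distance` upstream of
    the transcript end (all positions in transcript coordinates)."""
    far_from_cds = peak_pos - cds_end >= min_distance
    far_from_end = transcript_end - peak_pos >= min_distance
    return far_from_cds and far_from_end
```

For a transcript whose CDS ends at position 1000 and which terminates at position 2000, a peak at 1200 passes the filter, whereas peaks at 1100 or 1900 do not.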
Altogether, these analyses confirm the presence of abundant, evolutionarily conserved, capped 3’UTR-derived non-coding RNAs that may originate from cytosolic cleavage of full-length mRNAs.
5' ends of 3'UTR-derived RNAs are enriched for G-rich motifs and strong secondary structures
Next, we wanted to understand the sequence features that distinguish the CAGE peaks corresponding to co-transcriptional capping at TSSs from those originating from post-transcriptional capping of 3'UTR-derived RNAs. We first explored the possibility that 3’UTR fragments might be a by-product of nuclear polyadenylation and associated endonucleolytic cleavage. If this were the case, the identified 3'UTR CAGE peaks should be preceded by enrichment of the canonical polyA signal (A[A/U]UAAA hexamers), which recruits the nuclear polyadenylation machinery. However, we only found such enrichment at the annotated 3'UTR ends, and not upstream of the 3'UTR CAGE peaks (Fig. S2e). We did observe a notable enrichment downstream of the 3’UTR CAGE peaks (Fig. S2e, red line), which most likely corresponds to the canonical polyA site, as some of the 3'UTR-derived RNAs are relatively short and their 5’ ends are close to the annotated 3'UTR ends.
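As an illustration of how the canonical polyA signal could be located relative to a CAGE peak, the following Python sketch scans a sequence for A[A/U]UAAA hexamers and reports their offsets from an anchor position. The function name and the convention of reporting hexamer start positions are assumptions made for this example; DNA input is accepted by converting T to U.

```python
import re

# Lookahead so that overlapping hexamers would also be reported.
POLYA_SIGNAL = re.compile(r"(?=(A[AU]UAAA))")


def polya_signal_offsets(seq, anchor):
    """Return the offsets (hexamer start minus anchor) of every
    A[A/U]UAAA hexamer in `seq`; negative values lie upstream of the
    anchor (for example a CAGE peak position within the sequence)."""
    rna = seq.upper().replace("T", "U")
    return [m.start() - anchor for m in POLYA_SIGNAL.finditer(rna)]
```

Pooling such offsets over many peaks would give a positional profile of the hexamer around the peaks.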
Next, we explored whether there were additional distinctive sequence characteristics between the two types of CAGE peaks. Consistent with previous studies [9, 13], we detected a strong G-enrichment overlapping the 5’ end of the CAGE reads present in non-promoter regions (Fig. 2a, S2f), distinct from the YR dinucleotide characteristic of signals at 5’ ends of genes. More surprisingly, CAGE peaks within the 3’UTR region showed a strong increase in internal pairing probability (see Methods: Secondary structure) in comparison to CAGE peaks in other regions (Fig. 2b, S2g), suggesting that structural preference may be important for the generation of 3’UTR-derived RNAs. Notably, the surrounding (within 100 bps) region of CAGE peaks in 5’UTRs is more structured (light blue line in Fig. 2b, S2g), representing the higher GC content that is present around all 5’UTRs in vertebrates [30], with a distinctive drop at -25 bps coinciding with the canonical TATA box position.
Motifs with G-rich repeats in the transcriptome can form non-canonical four-stranded structures (G4s) implicated in transcriptional regulation, mRNA processing, the regulation of translation and RNA translocation [31]. Similar to the web-logo motif analyses of CAGE peaks from different mRNA regions (Fig. 2a, S2f), the nucleotide enrichment plot of GGG sequences showed the highest enrichment overlapping 3'UTR CAGE peaks (Fig. 2c, S2f). This raises the possibility that the sequence around the 3’UTR CAGE peaks may have an increased propensity to form RNA-G4 structures via the canonical G4 motif (G3-N1–7-G3-N1–7-G3-N1–7-G3) [32]. To further explore the RNA G-quadruplex formation profile, we integrated RNA-G-quadruplex sequencing (rG4-seq) data from HeLa cells [33] and ran G4-Hunter predictions [34] around CAGE peaks. Both the rG4-seq data (HeLa) and G4-Hunter predictions (K562) showed the highest G4 enrichment around CAGE peaks in the 3'UTR region (Fig. S2h-l), with the highest percentage of rG4-seq hits within 3’UTRs (Fig. S2m). Nevertheless, it is worth noting that the number of 3’CAGE sites overlapping with rG4-seq sites was relatively small (~ 3800 out of ~ 133900). In sharp contrast, 8 of the 10 example genes with 3’UTR CAGE peaks validated by long-read CAGE (selected for having the highest 3’UTR CAGE signal) contained rG4-seq clusters coinciding with 3’UTR CAGE peaks (Fig. S1n). It is important to note that whether these sites are genuinely in the G4-folded state remains uncertain, as the rG4-seq method employs G4 stabilisers to artificially enhance G4 structures.
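The canonical G4 motif quoted above translates directly into a regular expression. The Python sketch below is a simple pattern scan, assuming exactly three guanines per tract and loops of 1 to 7 nucleotides of any identity; it is not the G4-Hunter scoring scheme and not the rG4-seq analysis, and the function name is illustrative.

```python
import re

LOOP = "[ACGU]{1,7}"  # N1-7 loop, any nucleotide
# G3-N1-7-G3-N1-7-G3-N1-7-G3, wrapped in a lookahead so that
# overlapping candidate motifs are all reported.
G4_MOTIF = re.compile("(?=(" + LOOP.join(["G{3}"] * 4) + "))")


def g4_motif_starts(seq):
    """Return 0-based start positions of canonical G4 motif matches."""
    rna = seq.upper().replace("T", "U")
    return [m.start() for m in G4_MOTIF.finditer(rna)]
```

A sequence match only indicates the potential to fold into a G4; it says nothing about whether the structure forms in cells.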
3’UTR CAGE sites are flanked by enriched UPF1 binding
The evidence outlined so far is consistent with our hypothesis that capped 3’UTR-derived RNAs are formed post-transcriptionally. Next, we aimed to determine whether specific RNA-binding proteins (RBPs) were involved in the mechanism of their generation. To that end, we analysed publicly available enhanced CLIP (eCLIP) data for 80 different RBPs in the K562 cell line, produced by the ENCODE consortium [35]. For each RBP, we calculated normalised cross-linking enrichment compared to other RBPs around maximum CAGE peaks per annotated gene region (5’UTR, CDS, intron, 3’UTR). This identified a specific set of RBPs around CAGE peaks, with UPF1 (Up-frameshift protein 1) as the top candidate in 3’UTRs (Fig. 2c), DDX3X (DEAD-Box Helicase 3 X-Linked) in 5’UTRs (Fig. S2n), KHSRP (KH-type splicing regulatory protein) in introns (Fig. S2o), and less protein-specific enrichment in CDS, with YBX3 (Y-Box-Binding Protein 3) as the top candidate (Fig. S2o,p). UPF1 is involved in a variety of RNA degradation pathways [36], including Nonsense-Mediated Decay (NMD) [37] and normal mRNA decay, in which UPF1 stalled at CUG and GC-rich motifs activates decay [38]. KHSRP plays a well-characterised role in pre-mRNA splicing, but has also been implicated in several other aspects of RNA biology, such as mRNA decay and the editing and maturation of miRNA precursors [39]. YBX3, on the other hand, has been implicated in the regulation of mRNA translation as well as stability, likely in a transcript-dependent manner [40]. As a positive control for our enrichment score approach, we noted DDX3X enrichment around 5’UTR CAGE peaks. This is consistent with known roles for DDX3X in transcription and pre-mRNA splicing through interactions with transcription factors and Spliceosomal B Complexes [41].
Interestingly, the crosslinking of UPF1 is enriched within 20 nt upstream of the 3’UTR CAGE peaks, followed by a steep depletion within ~ 10 nt downstream (Fig. 2d). Additionally, a substantial correlation (R = 0.654) was observed between the 3’UTR CAGE signal and UPF1 binding, but no such association was observed with gene expression or 3’UTR length (Fig. S2q). More specifically, the degree of UPF1 binding coincides with the intensity of the 3’UTR CAGE peaks and with proximity to the peaks (Fig. S2r, s). However, transfecting K562 cells with UPF1-targeting small interfering RNAs (siRNAs) for 48 hours did not lead to changes in the enrichment of RT-qPCR signal obtained with primers targeting downstream of the 3’UTR CAGE peak in CDKN1B or JPT2 when compared to upstream-targeting primers (Fig. S2t). Thus, it remains unclear whether the precise binding position of UPF1 relative to the re-capping position is important for the generation of the 3’UTR capped fragments, or whether accumulation of UPF1 is an indirect result of the presence of other factors that contribute to the cleavage.
mRNA cleavage by small interfering RNAs generates newly capped RNA fragments
mRNAs can be cleaved post-transcriptionally through RNA interference (RNAi). Indeed, a common way to artificially accomplish gene silencing is to utilise siRNAs to induce endonucleolytic degradation of the target transcripts [42, 43]. siRNAs are usually 21–23 nt long and their sequence is antisense to their mRNA target sequence. Silencing by siRNAs is induced through the endonuclease activity of Argonaute 2 (AGO2), a subunit of the RNA-induced silencing complex (RISC) in the cytoplasm [44].
We hypothesised that siRNA silencing through AGO2 cleavage could lead to cytoplasmic capping of the cleaved RNA fragments instead of degradation. To test this hypothesis, we first investigated if CAGE-seq could detect cleaved RNA fragments guided by siRNA. We analysed CAGE data from siRNA-treated samples from the FANTOM5 dataset [45], which included samples from the TC-YIK human cell line transfected with siRNAs targeting mRNAs of 28 different transcription factors (20 siRNAs designed by ThermoFisher and 8 by the study authors) and 5 non-targeting control samples, in triplicates. We detected CAGE signal at the exact position targeted by the siRNA in at least two replicates in 20 out of the 28 samples (Fig. 3a). The strongest enrichment in CAGE signal relative to the siRNA target site was detected in the Islet-1 knockdown (ISL1-KD) samples, with no signal detected in control samples (Fig. 3b,c,S3a). More interestingly, the dominant CAGE 5' end signal was present in the middle of the siRNA target sequence (Fig. 3e, S3a), where the AGO2 cleavage is known to take place [46, 47]. As expected, the TSS CAGE signal in the 5'UTR of the corresponding protein-coding gene dropped by ~ 75% compared to the control samples in all 3 replicates (Fig. 3b,c), confirming that the silencing of the ISL1 transcript was efficient. Together these results indicate that siRNA-mediated recruitment of AGO2 can lead to the generation of post-transcriptionally capped RNA fragments following mRNA cleavage.
3'UTR CAGE peaks coincide with AGO2 and UPF1 binding sites alongside G-rich motifs
Since the endonuclease activity of AGO2 facilitates mRNA cleavage guided by siRNAs, we investigated whether AGO2 binding also occurred at the endogenous 3'UTR CAGE peaks. There was no publicly available AGO2 binding data for either HeLa or K562 cells, so we produced ‘enhanced individual nucleotide resolution’-CLIP (eiCLIP) [48] data for AGO2 (AGO2-eiCLIP) in HeLa cells. Our analysis revealed that 32.8% of the crosslinking positions mapped to the 3’UTR region (Fig. S3b), with a higher binding enrichment in known microRNA (miRNA)-regulated transcripts, and with a clear miRNA-seed matching-sequence enrichment downstream of the crosslinking site (Fig. S3c,d). Similarly to UPF1, AGO2 crosslinks were enriched immediately upstream from the 3'UTR CAGE peaks but, unlike UPF1, they were not depleted in the downstream region (Fig. 3d, S3e).
In animals, endogenous RNAi is mainly mediated by microRNAs (miRNAs). miRNAs are ~ 21–23 nucleotide (nt) long RNAs that, in contrast to siRNAs, recruit the miRNA-induced silencing complex (miRISC) containing AGO1-4 to mRNAs with partial sequence complementarity. As a result, miRNA action induces translational repression and/or exonucleolytic degradation of the target mRNAs [42, 43]. Thus, miRNA-mediated degradation of target mRNAs in animals usually involves deadenylation, decapping, and degradation by the major cytoplasmic 5’-to-3’ exonucleases, rather than direct endonucleolytic cleavage by AGO2 [5, 49]. Nevertheless, it has been demonstrated that extensive miRNA-mRNA pairing can also trigger AGO2 catalytic activity [50–52]. We thus hypothesised that AGO2 miRNA-guided cleavage of mRNA targets might lead to the generation of recapped fragments in a similar manner to that observed for siRNAs. To test this, we first identified genomic sequences with extensive complementarity (fewer than 2 mismatches) to human miRNAs. We identified 29 such targets that mapped within 3’UTRs, but there was no CAGE signal present around any of them (data not shown). In line with this, AGO2 crosslinking enrichment around 3’UTR CAGE signals was considerably weaker for AGO2 binding sites overlapping with predicted miRNA binding sites (see Methods, Fig. S3f). Intriguingly, CRISPR/Cas9-mediated elimination of AGO2 in K562 cells did not change the enrichment in RT-qPCR signal detected for JPT2 and CDKN1B with primers downstream of the 3’CAGE versus upstream primers (Fig. S3g-i). Altogether, our observations suggest that AGO2 binds immediately upstream of the site of cleavage that generates 3’UTR-derived RNAs independently of miRNA-directed recruitment, and that endogenous 3’UTR-derived RNAs are not produced as a result of AGO2 cleavage activity.
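The search for sites with extensive complementarity to a miRNA (fewer than 2 mismatches) can be sketched in Python as a scan for near-exact matches to the reverse complement of the miRNA. The sketch assumes that only Watson-Crick pairs count as matches (G:U wobble pairs are scored as mismatches) and that no insertions or deletions are allowed; the function names are illustrative, the short test sequences are made up, and this is not the procedure described in Methods.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA (or DNA) sequence, returned as RNA."""
    return rna.upper().replace("T", "U").translate(COMPLEMENT)[::-1]


def near_perfect_sites(target, mirna, max_mismatches=1):
    """Return (start, mismatches) for every window of `target` that pairs
    with `mirna` with at most `max_mismatches` mismatches.
    The default of 1 corresponds to 'fewer than 2 mismatches'."""
    site = reverse_complement(mirna)
    rna = target.upper().replace("T", "U")
    hits = []
    for i in range(len(rna) - len(site) + 1):
        window = rna[i:i + len(site)]
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits
```

Hits found in this way could then be intersected with 3’UTR annotation and CAGE peak positions.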
Accordingly, we instead explored the binding specificity of AGO2-eiCLIP data, and performed a motif analysis using HOMER motif finder. When analysing the 15 bp flanking region around AGO2-crosslinking peaks (see Methods), one of the most prominent motifs was highly enriched in Gs (Fig. S3j − 2nd and 3rd). Notably, this also agrees with one of the first AGO2-CLIP studies performed on mouse embryonic stem cells, where the authors showed that, without the miRNA present, AGO2 binds preferentially to G-rich motifs [53].
As we had previously demonstrated that G-rich motifs, which have the capability to form RNA-G-quadruplexes, are enriched around 3’UTR-derived RNAs, we next investigated whether AGO2 and UPF1 could be attracted to these specific G-rich motif structures independently of their location relative to CAGE peaks. We first aligned AGO2-eiCLIP and UPF1-eCLIP cross-linking positions relative to the 3' end of rG4-seq sites in different regions of primary transcripts. Both AGO2 and UPF1 crosslink-binding sites are much more highly enriched at rG4-seq sites in the 3’UTRs relative to 5’UTRs, introns and coding sequence, although we noted that the binding of UPF1 occurred at the 3’ end of the rG4-seq sites and AGO2 bound immediately upstream (Fig. S3k-n). To further explore the relationship between AGO2 and UPF1 binding at 3’UTR CAGE peaks and their association with rG4-seq signals, we categorised the 3’UTR CAGE peaks into four classes, depending on the presence or absence of these elements. We observed that the majority of 3’UTR CAGE peaks contained both AGO2 and UPF1 binding but not rG4-seq signal; on the other hand, the majority of 3’UTR CAGE sites overlapping with rG4-seq sites also contained AGO2/UPF1 sites (Fig. S3o). We then further explored the binding position of UPF1 and AGO2 relative to the 3'UTR CAGE peaks, in the presence or absence of an rG4-seq site. Interestingly, both proteins exhibited a distinct shift in position influenced by the G4 motif enrichment; whilst AGO2 showed a pronounced shift to the upstream region of the 3’UTR CAGE peak in the presence of rG4-seq sites, UPF1 displayed a downstream shift (Fig. S3p-q). An important direction for future studies will be to experimentally investigate the mechanistic implications of the overlap between sites with the ability to form RNA-G4 structures and AGO2 and UPF1 binding for the generation of the capped 3’UTR-derived RNAs.
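One possible way to tabulate such a four-class breakdown is sketched below in Python. It assumes that the two features defining the classes are the presence of an AGO2/UPF1 binding site and the presence of an rG4-seq site at each peak; this reading of the classification, together with the class labels and function names, is an assumption made for illustration.

```python
from collections import Counter


def classify_peak(has_rbp_site, has_rg4_site):
    """Assign a peak to one of four presence/absence classes."""
    if has_rbp_site and has_rg4_site:
        return "RBP+/rG4+"
    if has_rbp_site:
        return "RBP+/rG4-"
    if has_rg4_site:
        return "RBP-/rG4+"
    return "RBP-/rG4-"


def class_counts(peaks):
    """peaks: iterable of (has_rbp_site, has_rg4_site) boolean pairs.
    Returns a Counter of class label to number of peaks."""
    return Counter(classify_peak(rbp, rg4) for rbp, rg4 in peaks)
```

The overlap flags themselves would come from intersecting peak coordinates with crosslink clusters and rG4-seq sites, using a distance cut-off that is not specified here.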
Capped 3’UTR fragments of CDKN1B and JPT2 transcripts do not co-localise with the parental mRNAs
Finally, we examined the potential implications of 3’UTR-derived RNAs. Specifically, we sought to understand how 3’UTR-derived RNAs might localise either together or independently from the parental mRNAs. To test this, we designed smFISH (single molecule fluorescence in situ hybridization) probes to simultaneously image the RNA upstream and downstream of the proposed post-transcriptional cleavage and capping site in CDKN1B and JPT2 using hybridisation chain reaction RNA-fluorescence in situ hybridization (HCR-FISH 3.0) [54]. To account for technical biases in detection, we also designed probes against the coding sequence (hereafter upstream) and 3’UTR (hereafter downstream) of a control mRNA, PGAM1, which does not contain CAGE peaks in the 3’UTR and contained a similar 3’UTR length to our targets.
We performed HCR-FISH in HeLa cells to determine whether putative 3’UTR-derived RNAs can be found independently of the RNA upstream of the cleavage site (Fig. 4a, b). In the control transcript, PGAM1, we observed that 17.3% of upstream signals did not have a colocalising downstream signal and 21.3% of downstream signals did not have a colocalising upstream signal (Fig. 4c, S4a). However, the mRNAs that contain a 3’UTR CAGE signature were significantly more likely to show independent signals from the RNA downstream of the proposed cleavage site (CDKN1B: 53.3%, p adj. < 0.05; JPT2: 52.3%, p adj. < 0.05; Fig. 4c). In the case of JPT2, we also observed significantly more independent signals from the upstream probes (29.3%, p adj. < 0.05; Fig. 4c). These observations are consistent with the existence of cleaved 3’UTR fragments in the cell, and they reveal that these products may localise differently from their host transcripts.