AQR knockdown induced the highest number of cryptic splice sites
Five cell lines from three organisms were used to analyze cryptic splice sites: K562 and HepG2 (human), S2 (fruit fly), and N2A and CGR8 (mouse). The shRNA-seq/siRNA-seq data for the K562, HepG2, and S2 cell lines were obtained directly from the ENCODE project. The mouse N2A and CGR8 data comprise ~1,400 RBP/TF siRNA-seq datasets; these are pooled sequencing data in which 52 cassette splicing events were sequenced [8]. The general workflow is shown in Figure 1A. For each shRNA-seq/siRNA-seq dataset, I used the STAR aligner [9] to obtain all junctions (both novel and annotated). These junctions were then converted into 5' splice sites and 3' splice sites for each dataset. To exclude known junctions and noisy background junctions, the collected 5' and 3' splice sites were filtered against the 5' and 3' splice sites from the gene annotation and from all control RNA-seq datasets (Figure 1A). The same workflow was applied to the other cell lines.
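The filtering step of this workflow can be sketched as a set subtraction. The sketch below is illustrative only and is not the exact pipeline used here: it assumes junctions come from STAR's SJ.out.tab output (column 1 chromosome, columns 2 and 3 the first and last intron bases, column 4 the strand code), and the function names and toy rows are hypothetical.

```python
from typing import Iterable, Set, Tuple

Site = Tuple[str, int, str]  # (chromosome, 1-based position, strand)

STRAND = {"1": "+", "2": "-"}  # STAR strand codes; "0" means undefined


def splice_sites_from_sj(lines: Iterable[str]) -> Tuple[Set[Site], Set[Site]]:
    """Split SJ.out.tab rows into 5' (donor) and 3' (acceptor) splice sites."""
    five_prime: Set[Site] = set()
    three_prime: Set[Site] = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[3] not in STRAND:
            continue  # skip malformed rows and junctions without a strand
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        strand = STRAND[fields[3]]
        # On the plus strand the donor is the first intron base;
        # on the minus strand it is the last one.
        donor, acceptor = (start, end) if strand == "+" else (end, start)
        five_prime.add((chrom, donor, strand))
        three_prime.add((chrom, acceptor, strand))
    return five_prime, three_prime


def cryptic_sites(knockdown: Set[Site], annotated: Set[Site],
                  control: Set[Site]) -> Set[Site]:
    """Keep sites seen after knockdown but absent from annotation and controls."""
    return knockdown - annotated - control


# Toy rows (hypothetical data) in SJ.out.tab layout.
kd_rows = [
    "chr1\t101\t200\t1\t1\t1\t9\t0\t30",
    "chr1\t101\t180\t1\t1\t0\t4\t0\t28",
    "chr1\t301\t400\t2\t1\t0\t3\t0\t25",
]
control_rows = [
    "chr1\t101\t200\t1\t1\t1\t12\t0\t31",
    "chr1\t301\t400\t2\t1\t0\t2\t0\t20",
]
kd5, kd3 = splice_sites_from_sj(kd_rows)
ctrl5, ctrl3 = splice_sites_from_sj(control_rows)
cryptic5 = cryptic_sites(kd5, {("chr1", 101, "+")}, ctrl5)
cryptic3 = cryptic_sites(kd3, {("chr1", 200, "+")}, ctrl3)
```

In this toy case only the alternative acceptor at position 180 survives the filter, because every other site also appears in the annotation or in the control sample.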
Among the shRNA-seq datasets for 241 RBPs in K562 cells, knockdown of AQR induced the highest number of cryptic splice sites (Figure 1B). Besides AQR, knockdown of several other RBPs induced relatively high numbers of cryptic 3' splice sites and cryptic 5' splice sites: EJC components, SF3A and SF3B complex components, the spliceosome core protein PRPF8, the poly-U binding proteins hnRNP C, U2AF2, and PUF60, and the m6A demethylase FTO (Additional file 1). Similar results were obtained using the HepG2 cell line (Figure 1C and Additional file 2).
The mouse dataset contains ~1,400 siRNA-seq datasets and covers almost all spliceosome components. Among the IBC components (AQR, ISY1, SYF1, ZNF830, and CRNKL1), three were the top-ranked genes in the dataset for cryptic 3' splice sites (Figure 1D). Other top genes included SLU7 and DHX8/PRP22 (Additional file 3). For cryptic 5' splice sites, AQR was again the top gene (Figure 1E). The second highest was Med26, which belongs to the mediator complex (Additional file 3). One study reported that mediator and polymerase II can influence cryptic 5' splice site recognition in yeast [10]. These results were consistent between the two cell lines, N2A and CGR8 (Additional file 3).
Previous studies on the ENCODE fruit fly siRNA-seq dataset showed that knockdown of spliceosome components caused smaller PSI changes than knockdown of RBPs that regulate alternative splicing [11], whereas the results above showed that cryptic splice sites are mainly caused by knockdown of spliceosome components. I therefore tested whether knockdown of spliceosome components also induced the highest number of cryptic splice sites in this fly siRNA-seq dataset, which would indicate that spliceosome components are more closely related to splicing fidelity than to alternative splicing regulation. The RBPs whose knockdown induced the highest number of cryptic sites in fruit flies were indeed spliceosome components (Figure 1F and Additional file 4). Knockdown of Pea/DHX8/Prp22 induced the highest number of cryptic 3' splice sites (Additional file 4), and knockdown of this protein also induced one of the highest numbers of cryptic 3' splice sites in the mouse dataset, as shown above.
Sequence and motif properties of the cryptic splice sites
Next, I analyzed the properties of these cryptic sites. The number of cryptic 3' splice sites was greater than the number of cryptic 5' splice sites for all RBPs in K562 and HepG2 cells (Figure 2A). This result suggests that 3' splice sites undergo more regulation than 5' splice sites by the RBPs studied here. I calculated the maxEntScore [12] for three types of splice sites: all AG/GT sites in introns, AG/GT sites that are used when proteins are knocked down, and canonical intron splice sites. The results indicated that sequence preference has a strong effect on cryptic splice site usage (Figure 2B&C). I then performed motif analysis on the cryptic 5' splice sites and cryptic 3' splice sites of all RBPs in K562 cells. All the cryptic 5' splice site motifs were similar to AGGTAAG. The corresponding sequence in U1 snRNA is UCCAUUC (written 3' to 5'), so the motifs around the cryptic 5' splice sites are complementary to U1 snRNA (Figure 2D&E). This complementarity indicates that 5' splice sites are mostly regulated by U1 snRNA. For cryptic 3' splice sites, only the motifs of hnRNP C contained an obvious poly-U tract. This result is consistent with a previous study showing that hnRNP C competes with U2AF2 to repress cryptic 3' splice sites in the poly-U regions of ALU elements [13]. Poly-U is bound by U2AF2 to initiate spliceosome assembly, so the lack of poly-U in the cryptic 3' splice site motifs suggests that regulation of splicing fidelity by these RBPs mainly occurs after U2AF2 has joined the spliceosome. Consistent motif results were observed using the fruit fly dataset. The cryptic splice sites induced by knockdown of most RBPs were strongly enriched near the canonical 3' or 5' splice sites (Additional file 5).
The enrichment of cryptic sites near canonical sites suggests that they usually do not disrupt exon definition. This agrees with the interpretation that, near canonical splice sites, these RBPs mainly regulate splice site choice after U2AF2 has joined the spliceosome.
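The complementarity between the 5' splice site motif AGGTAAG and U1 snRNA noted above can be verified with a few lines of code. This is a minimal sketch: the U1 5' end sequence used below (AUACUUACCUG) is the commonly reported one and is an assumption that does not come from the datasets analyzed here, and the helper names are hypothetical.

```python
# Watson-Crick partners for RNA bases.
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}

# Assumed 5' end of human U1 snRNA, written 5' to 3'.
U1_5P_END = "AUACUUACCUG"


def to_rna(seq: str) -> str:
    """Upper-case a DNA or RNA string and write T as U."""
    return seq.upper().replace("T", "U")


def paired_strand_3to5(motif: str) -> str:
    """Base-pairing partner of the motif, read 3' to 5' (position by position)."""
    return "".join(COMPLEMENT[base] for base in to_rna(motif))


def reverse_complement(motif: str) -> str:
    """Base-pairing partner of the motif, read 5' to 3'."""
    return paired_strand_3to5(motif)[::-1]


motif = "AGGTAAG"
partner_3to5 = paired_strand_3to5(motif)   # "UCCAUUC"
partner_5to3 = reverse_complement(motif)   # "CUUACCU"
pairs_with_u1 = partner_5to3 in U1_5P_END  # True
```

The partner strand read 3' to 5' is UCCAUUC, the form quoted in the text; read 5' to 3' it is CUUACCU, which occurs within the assumed U1 5' end.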
Some of the RBPs analyzed above may not be directly involved in splicing fidelity; instead, they may act as a regulatory layer for other RBPs. To identify RBPs with potentially similar functions in splicing fidelity, and potential regulation between them, I overlapped the cryptic junctions from the shRNA-seq datasets of every pair of RBPs and performed a Poisson test. As shown in Additional file 1, junction overlaps among EJC components had highly significant p-values, as did junction overlaps between U2AF1 and U2AF2. Similar results were obtained using the HepG2 cell line (Additional file 2). Overlapping junctions were also determined for the N2A and CGR8 cell lines to test whether cryptic junctions were consistent between these cell lines. Junction overlaps among the IBC components had the highest ranks (Additional file 3).
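One possible form of such a pairwise overlap test is sketched below. The text does not specify the null model, so the sketch assumes that, under independence, the number of junctions shared by two RBPs is Poisson distributed with mean |A| x |B| / N, where N is the size of a junction universe. The universe size, the function names, and the toy sets are all hypothetical.

```python
import math
from typing import Hashable, Set


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    log_lam = math.log(lam)
    total = 0.0
    i = k
    while True:
        # Poisson probability of exactly i events, computed in log space.
        term = math.exp(-lam + i * log_lam - math.lgamma(i + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if i > lam and term <= total * 1e-16:
            break
        i += 1
    return min(total, 1.0)


def overlap_pvalue(a: Set[Hashable], b: Set[Hashable], universe_size: int) -> float:
    """P-value for observing at least |a & b| shared junctions by chance."""
    expected = len(a) * len(b) / universe_size  # assumed null expectation
    return poisson_upper_tail(len(a & b), expected)


# Toy example: two sets of 10 junctions sharing 5, in a universe of 1,000.
set_a = set(range(0, 10))
set_b = set(range(5, 15))
p_shared = overlap_pvalue(set_a, set_b, universe_size=1000)
p_disjoint = overlap_pvalue(set_a, set(range(20, 30)), universe_size=1000)
```

With an expected overlap of 0.1 junctions, sharing five junctions yields a very small p-value, whereas disjoint sets give a p-value of 1.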
Analyzing the cryptic splice sites in different regions
The regulatory mechanisms of splicing differ between deep introns and regions near canonical splice sites [14]. I therefore classified the cryptic splice sites by region and analyzed four classes separately: cryptic splice sites in ALU elements, in exons, near canonical splice sites (< 200 bp), and in deep introns (> 200 bp from canonical splice sites). Knockdown of hnRNP C induced the highest number of cryptic 3' splice sites in ALU elements (Additional files 1 and 2). Knockdown of RAVER1 and SRSF3 induced the highest numbers of cryptic 3' splice sites in deep introns in HepG2 cells (Additional file 2). In deep introns, the cryptic 3' splice site motifs of most RBPs contained a more obvious poly-U tract, and the cryptic 5' splice site motifs were more complementary to U1 snRNA (Additional file 5). Taken together, this analysis suggests that cryptic 3' splice sites in deep introns need their own poly-U tract to initiate spliceosome assembly, whereas cryptic 3' splice sites near canonical splice sites share the poly-U tract of the canonical sites.
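The region classification described above can be sketched as follows. This is a simplified illustration for a single chromosome and strand. The precedence order (ALU, then exon, then distance to the nearest canonical site) and the assignment of sites at exactly 200 bp to the deep intron class are assumptions, since the text does not specify them. All names and coordinates are hypothetical.

```python
import bisect
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) on one chromosome


def in_any(pos: int, intervals: Sequence[Interval]) -> bool:
    """True if the position lies inside any interval."""
    return any(start <= pos <= end for start, end in intervals)


def nearest_distance(pos: int, sorted_sites: List[int]) -> float:
    """Distance to the closest canonical splice site (inf if none given)."""
    if not sorted_sites:
        return float("inf")
    i = bisect.bisect_left(sorted_sites, pos)
    neighbours = sorted_sites[max(i - 1, 0):i + 1]
    return min(abs(pos - site) for site in neighbours)


def classify_site(pos: int, alu: Sequence[Interval], exons: Sequence[Interval],
                  canonical_sites: List[int], window: int = 200) -> str:
    """Assign a cryptic splice site to one of four regions."""
    if in_any(pos, alu):
        return "ALU"
    if in_any(pos, exons):
        return "exon"
    if nearest_distance(pos, canonical_sites) < window:
        return "near_canonical"
    return "deep_intron"  # assumption: a distance of exactly 200 bp lands here


# Toy annotation: one ALU element, one exon, and its two canonical sites.
alu_elements = [(5000, 5300)]
exon_list = [(1000, 1100)]
canonical = sorted([1000, 1100])
labels = [classify_site(p, alu_elements, exon_list, canonical)
          for p in (5100, 1050, 1150, 3000)]
```

For the four toy positions, the labels are ALU, exon, near_canonical, and deep_intron, in that order.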
Metaprofiling of cryptic exons and cryptic splice sites suggests that cryptic 5' splice sites are induced by cryptic exons
Exons can be accurately identified from sequencing reads that span two or more junctions [6] (Figure 3A). I applied this method to all shRNA-seq datasets and found that AQR knockdown again induced the highest number of cryptic exons (Figure 3B). To determine why AQR knockdown simultaneously induced the highest numbers of cryptic 5' splice sites, cryptic 3' splice sites, and cryptic exons, I generated metaprofiles of all three and aligned them in Figure 3C. The 5' splice site positions of the cryptic exons were highly consistent with the cryptic 5' splice sites. This result suggests that cryptic 5' splice sites are induced by cryptic exons. Note that cryptic exons are much harder to detect than cryptic splice sites, because a single read must contain at least two junctions to reveal a cryptic exon and read length is limiting (100 bp for ENCODE shRNA-seq). The similarity between the motifs of cryptic 5' splice sites and the 5' splice sites of cryptic exons further indicates that they are produced by the same mechanism (Figure 3C).
To test this further, I generated the same profiles for PRP17 in HepG2 cells, whose knockdown (by CRISPR) induced one of the highest numbers of cryptic 3' splice sites in HepG2 cells (Additional file 2). The profile was similar to that of AQR in K562 cells (Figure 3D). The motifs of the cryptic exons' 5' splice sites after knockdown of AQR and PRP17 were highly complementary to the U1 snRNA recognition motif.
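The idea of identifying exons from reads that span two or more junctions, as described above, can be sketched as follows. This is not the implementation from [6]; it is a minimal illustration that assumes each read is represented by the ordered list of introns it spans, with 1-based first and last intron bases as in STAR output. The function names and coordinates are hypothetical.

```python
from typing import List, Set, Tuple

Intron = Tuple[int, int]  # (first intron base, last intron base), 1-based
Exon = Tuple[int, int]    # (first exon base, last exon base), 1-based


def exons_from_read(introns: List[Intron]) -> List[Exon]:
    """Internal exons fully defined by consecutive junctions in one read."""
    ordered = sorted(introns)
    exons = []
    for (_, upstream_end), (downstream_start, _) in zip(ordered, ordered[1:]):
        # The exon lies between the end of one intron and the start of the next.
        exons.append((upstream_end + 1, downstream_start - 1))
    return exons


def cryptic_exons(reads: List[List[Intron]], annotated: Set[Exon]) -> Set[Exon]:
    """Exons supported by multi-junction reads but absent from the annotation."""
    found: Set[Exon] = set()
    for introns in reads:
        if len(introns) >= 2:  # a read with one junction cannot define an exon
            found.update(exons_from_read(introns))
    return found - annotated


# Toy reads: the first spans two junctions, the second only one.
reads = [
    [(101, 200), (251, 400)],
    [(101, 400)],
]
novel = cryptic_exons(reads, annotated={(401, 500)})
```

Here the first read defines a 50 bp exon at positions 201 to 250. The second read carries a single junction and contributes nothing, which illustrates why detection depends on read length.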