BRCA1
We initially analysed BRCA1 as proof-of-principle because its mutational landscape in cancer is well described and includes splicing mutations that have been repeatedly analysed (10,11). We first downloaded the RNA-seq data for BRCA1 from snaptron into a spreadsheet (see Materials and Methods). This spreadsheet lists over 6000 differently spliced transcripts of BRCA1, although the large majority of these are background splicing events that are supported by only very few reads. Fig 1A lists the splicing events with the highest reads; these include intron removal and major alternative splicing events. At least 8 isoforms of BRCA1 have been identified (12) and the major alternative splice sites (but not the isoforms) can easily be identified in Figure 1A (shaded).
Figures 1B, C and D illustrate how background RNA sequencing data can be used to predict css or exon skipping events that are likely to result from splicing mutations of BRCA1 or any gene. Fig 1B examines the theoretical effect of mutation of the BRCA1 intron 5’ss 41222944 (shown in red). This might be expected to enhance the use of alternative 5’ss partners for the non-mutated 3’ss 41219713, as illustrated in figure 1D. Figure 1B lists the 5’ss partners for 3’ss 41219713 that have been identified in snaptron. As expected there are a large number of reads (148299) for splicing between 3’ss 41219713 and its normal 5’ss partner 41222944 of BRCA1 (blue shading). Other 5’ss partners of the 3’ss 41219713 are also used but at much lower background levels in wild type BRCA1 transcripts. These include single and multiple exon skipping events (yellow shading) between the 3’ss 41219713 and the 5’ss of other upstream introns (compare Figure 1A and B). In addition there are 2 reads for a rare splicing event between 3’ss 41219713 and an exonic 5’ss that is located -93 bases upstream of the normal 5’ss 41222944 and further low level reads for seven background 5’ss that are located downstream within the intron.
Mutation of BRCA1 5’ss 41222944 is known to activate a css at +69 (13-15) or at +65 (16). These two css exactly match the bss with the most supporting reads (Figure 1B, red shading). The background splicing information is therefore a good match to the slightly different experimental results of both groups.
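The lookup described for Figure 1B can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used to build the figure. It assumes that the snaptron rows for one fixed 3’ss have been reduced to (partner 5’ss coordinate, reads) pairs, that BRCA1 lies on the minus strand so that intronic positions downstream of the 5’ss have lower coordinates, and that offsets are plain coordinate differences. The function name rank_candidate_css, the 1000-base window (taken from the candidate definition used for Table 1) and every read count other than 148299 and 2 are hypothetical placeholders.

```python
def rank_candidate_css(partners, normal_ss, minus_strand=True, window=1000):
    """Return (offset, reads) for background sites near the normal site, most reads first.

    partners: list of (coordinate, reads) for 5'ss that splice to one fixed 3'ss.
    Offsets are relative to the normal 5'ss; positive means downstream (intronic).
    """
    candidates = []
    for coord, reads in partners:
        if coord == normal_ss:
            continue  # normal intron removal is not a candidate css
        offset = normal_ss - coord if minus_strand else coord - normal_ss
        if abs(offset) <= window:  # more distant partners are exon skips, not css
            candidates.append((offset, reads))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


partners = [
    (41222944, 148299),  # normal 5'ss partner (reads quoted in the text)
    (41223037, 2),       # exonic site at -93 (reads quoted in the text)
    (41222875, 40),      # site at +69; read count is a placeholder
    (41222879, 30),      # site at +65; read count is a placeholder
    (41227944, 500),     # hypothetical upstream 5'ss (exon skip), outside the window
]
ranked = rank_candidate_css(partners, normal_ss=41222944)
for offset, reads in ranked:
    print(f"{offset:+d}\t{reads}")
```

With these placeholder counts the sites at +69 and +65 rank first and second, which is the pattern reported for the activated css; the real ranking depends on the reads in the downloaded spreadsheet.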
Similarly, Figure 1C examines the possible effect of mutation of the 3’ss 41203135 (red shading) by showing all of the splicing events involving its normal partner 5’ss 41209068, as illustrated in Figure 1D. Mutation of 3’ss 41203135 is known to activate exon skipping between the normal partner 5’ss 41209068 and the downstream intronic 3’ss 41201212 plus weaker activation of the 3’css 41203127 (13,14). Figure 1C shows that these two splicing events have the most reads of the background splicing events involving the 5’ss 41209068 of the wild type BRCA1 gene.
The data for Figures 1B and C are summarised in Table 1 (rows 13 and 32), which includes all mutations of the splice sites of BRCA1. Fig S1 shows these data in full in the same format as Fig 1B, C. From the literature we identified 17 different css that are activated by mutations of the indicated BRCA1 splice sites and Table 1 shows that 15 of these css exactly match bss of wild type BRCA1; the two exceptions are shaded in column 3 and discussed in Table S1, which also provides references. Twelve of the 15 bss that match css have the highest reads of all candidate bss, as listed under column 4 and as illustrated in Figs 1B, C. Sites that are candidates for css activation are defined here as bss within 1000 bases of the intronic ss that is mutated (see Discussion).
Many of the splice site mutations of BRCA1 in Table 1 activate exon skipping rather than css and eight of the splice site mutations do both (Table 1, column 2). The ratio of css reads to exon skip reads from the background RNA sequencing data (Table 1, columns 5,6) appears to correlate with the experimental finding of whether splice site mutations activate css or exon skipping. There are six exceptions to this that are shaded as pairs in columns 5 & 6 and are discussed (Table S1). Also shaded are some possible false positive bss reads for both css activation (column 5 rows 5, 24, 31 and 35) and for a double exon skip (column 7 row 16), see Table S1 and Discussion. This data (Table 1) suggests that the effect of splice site mutations upon css activation and even exon skipping can be inferred from background splicing data. In order to test this hypothesis we undertook analyses of further experimental databases that include over 300 medical syndromes caused by splice site mutations.
DBASS, BRCA2 and DMD
We next compared the snaptron database with the database of aberrant splice sites (DBASS). DBASS lists the experimental results for splicing mutations that cause a wide range of human genetic diseases (4). Table 2 is a summary of Table S2, which is an index of all the splicing mutations in DBASS. Table 2 shows that the DBASS mutations are subdivided into those that activate aberrant 5’ or 3’ splice sites (DBASS5 and DBASS3) and that the most common mutations activate css, although mutations can also generate de novo css or pseudoexons.
We first compared the DBASS5 experimental results for 5’ css activation with the snaptron RNA splicing data. Table S2 shows how 199 of the 459 mutations in DBASS5 that activate css were systematically chosen to cover every listed medical syndrome. We generated similar tables of background splicing to those illustrated in Fig 1A,B,C for each of the 199 mutations and compared these with the experimental results. Each analysis is summarised in single rows in Table S3 sheet 1. The background splicing tables (see Fig 1B or C) are not shown but the key results are recorded in Table S3 and the raw data can easily be generated as described in Materials and methods. Table 3 row DBASS5 summarises Table S3 sheet 1 and shows that 201 out of 237 (85%) of the 5’css identified by experiment (some mutations activate more than one css) exactly match bss in snaptron and are therefore already in use at low levels by normal genes. 150 out of 201 (75%) of the bss that match the position of css have the greatest number of supporting reads compared to other bss (Table 3, S3). Similar results were found for the analysis of the 3’css listed in DBASS3 where 97 out of 110 (87%) 3’css matched bss in snaptron (Tables 3, S2, S3).
The reason why 15% or so of the experimentally identified 5’ css or 3’css did not match a background ss was usually because there were no background ss reads for comparison (Table S3). Where background ss data was available, we found that background ss did not match the experimentally reported 5’ or 3’ css in only 2 to 3% of cases, listed as poor matches in Tables 3 and S3. Table 3 also includes summaries for similar analyses of BRCA1 (Table 1), BRCA2 and DMD (Tables S4, S5). DBASS5* and DBASS3* of Table 3 summarise an analysis of a subcategory of css that are activated by mutations that occur outside the highly conserved regions of the normal 5’ or 3’ss (Tables 2, S2, S6). The activated css of DBASS5* and DBASS3* tend to match bss with particularly high reads (Table S6, see Discussion). Overall the very large majority of css originate from bss (see Discussion) and usually the bss that is activated is the one with the most reads relative to other bss (Table 3).
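The categories used in Tables 3 and S3 can be expressed as a small tally. The sketch below is illustrative only: the function name classify_css, the category labels and the toy records are hypothetical, and positions are written as offsets from the mutated splice site purely for readability.

```python
from collections import Counter


def classify_css(css_pos, background):
    """Classify one experimentally reported css against candidate background sites.

    background: dict mapping candidate bss position -> supporting reads.
    """
    if not background:
        return "no background data"   # nothing in snaptron to compare with
    if css_pos not in background:
        return "poor match"           # background sites exist, but none at the css
    top_reads = max(background.values())
    if background[css_pos] == top_reads:
        return "match, most reads"
    return "match"


# Toy records: (reported css position, candidate background sites with reads).
records = [
    (69, {69: 40, 65: 30, -93: 2}),
    (65, {69: 40, 65: 30}),
    (12, {}),
    (7, {20: 5}),
]
tally = Counter(classify_css(pos, bg) for pos, bg in records)
for category, count in tally.items():
    print(category, count)
```

Applied to one record per experimentally reported css, a tally of this kind yields the counts from which the matching proportions are calculated.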
Exon skipping
We next asked whether background splicing data can indicate whether splice site mutations might cause exon skipping rather than css activation. Some of the papers referenced in DBASS report whether or not exon skipping accompanied css activation (Table S3, column N). Table 4 column 1 summarises that there are 39 reports of both exon skipping and css activation and 71 reports of css activation only for the 5’ss mutations analysed in Table S3. For the reports of css activation only, the total number of background single exon skip reads from the 71 examples is 6621, which is much smaller than the total background skip reads (251128) from the 29 reports of both css and skip activation, so confirming the correlation seen for Table 1. Similar results were found for DBASS3 (Table 4).
Table 4 also summarises an analysis of a second database of splicing mutations (Tables S7, S8) that generally cause exon skipping rather than css activation (5). Table 4 shows that we analysed 79 experimental reports of 5’ss mutations that cause exon skipping only. Of these, 71 examples have higher background splicing reads for exon skipping than reads for potential css (background ss within 1000 bases of the intronic ss). Conversely, the 71 experimental reports in DBASS5 of 5’ss mutations that only caused css activation (column 1, line 5) had higher reads for the css than for background exon skipping in 60 out of 71 examples. Similar results are found by comparing the 64 examples of 3’ss mutations that cause exon skipping only with the 18 examples of 3’ss mutations in DBASS3 that cause css activation only (Table 4). Overall these results confirm that the likely effect of splicing mutations upon css activation or exon skipping can in general be inferred from their background splicing ratios. The exceptions to this general finding are shaded in Table 4 and discussed in more detail in Tables S3 and S7. This analysis shows that when the background reads for single exon skipping are greater than the background reads for any candidate css then exon skipping preferentially occurs in response to a splice site mutation (Fig 1D).
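The comparison underlying Table 4 amounts to a simple rule, sketched here in Python. This is a simplified illustration rather than a validated predictor: it returns a single outcome although both events can occur together, the function name predict_outcome is hypothetical, and the handling of ties as "undetermined" is an assumption that the text does not address.

```python
def predict_outcome(skip_reads, css_reads):
    """Compare background single exon skip reads with candidate css reads.

    skip_reads: background reads for the single exon skip.
    css_reads: reads for each candidate bss within 1000 bases of the mutated ss.
    """
    best_css = max(css_reads, default=0)  # no candidates counts as zero reads
    if skip_reads > best_css:
        return "exon skipping"
    if best_css > skip_reads:
        return "css activation"
    return "undetermined"  # equal reads, including no reads for either event


# Placeholder read counts, not taken from Table 4.
print(predict_outcome(500, [40, 30, 2]))   # skip reads exceed every candidate css
print(predict_outcome(3, [40]))            # a candidate css exceeds the skip reads
print(predict_outcome(0, []))              # no background reads for either event
```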
Multiple exon skipping
Table 5 lists all experimental reports of multiple exon skipping events that we found and compares these to the background splicing reads from snaptron. We also included experiments that did not detect the multiple skipping events indicated by snaptron but used RT-PCR primers that were capable of doing so (rows 33 to 42). We did not include predictions of multiple exon skipping from snaptron where experiments were restricted to single skip analyses. The first three examples are taken from a report about the LAMP2A, B and C variants which are generated by alternative splicing from a common 5’ss and three alternative 3’ss (17). The authors report that the same mutation of the common 5’ss has different effects upon single or double exon skipping by each 3’ alternative ss. It can be seen that these differences in skipping correlate well with the relevant background splicing reads (Table 5, Appendix 1). Other notable features of Table 5 include reports of double exon skips only (rows 19 and 26) or mainly double exon skipping (rows 3, 7, 9, 22 and 24) and how this correlates with the higher background reads for double skips than single exon skips in snaptron. Similarly the reports of css and triple exon skipping (row 18) and single and quadruple exon skipping (row 23) are a good match to the background splicing reads.
There are ten examples (rows 33 to 42) where the experimental results do not match the multiple exon skip predictions from snaptron and six examples (rows 8, 12, 13, 15, 18 and 30) where there is some but not exact agreement between the experimental results and background splicing reads. There are also six css listed that did not match snaptron background ss reads. For the css of row 5, snaptron has no splicing variants with which to compare and for row 2 the css has a non-consensus sequence, which is filtered from snaptron (3). The other four non-matching css are discussed at the bottom of the source tables. This analysis shows that high background reads for multiple exon skips are a good indication that these events will occur in response to splice site mutations.
De novo ss and pseudoexons
Table 6 summarises our comparison of the snaptron database with mutations in DBASS that generate de novo splice sites (also known as de novo css) or pseudoexons (Table 2). Here we have divided the de novo mutations into two types, created or enhanced. Created refers to a mutation that creates the GT or GC dinucleotides of a 5’ de novo ss or that creates the AG dinucleotide of a 3’ de novo ss. Enhanced refers to mutations that enhance already existing GT, GC or AG dinucleotides. As expected, none of the 34 and 123 created de novo ss of DBASS5 or DBASS3 match bss in snaptron (Table 6, row 1). Even if there were reads for the original dinucleotide these would have been filtered from this database (3). There are 95 reports of mutations that enhance de novo ss in DBASS5 (Table 2) and we analysed the first 40 medical syndromes caused by this mutation type and report that 29 of these de novo css positions exactly match bss from snaptron (Table 6, row 1). Similar results were found for mutations that generated 3’ de novo ss (row 2), although a far bigger proportion of the mutations created an AG dinucleotide splice site rather than enhanced existing AG sites.
Pseudoexons are most commonly generated when a mutation that creates a 5’ or 3’ de novo ss also causes the activation of a partner pseudoexon ss (Fig 2A,B). The 5’ and 3’ de novo ss that initiate pseudoexon formation matched background ss at a similar level to the de novo mutations only (Table 6). For the 3’pss that partner the 5’ de novo mutations, 59 out of seventy-one 3’pss match background ss (Table 6). Of the twelve 3’pss that did not have a match in snaptron, ten were partnered to 5’ de novo sites that were created from non-GT or non-GC dinucleotides (Table S9). Table S9 also shows that 54 out of the fifty-nine 3’pss that matched background ss were the nearest upstream background 3’ss to the downstream mutation that created the de novo 5’ pss. Four of the remaining five 3’pss were only marginally more distant than an inner background 3’ss and all had far more reads than the inner 3’bss (Table S9).
Table 6 row 4 shows that a smaller proportion of 5’ pss matched background ss (10/22). In all cases the matching bss are the nearest of all bss to the upstream 3’ de novo ss mutation (Table S9).
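The "nearest background site" comparison used for the partner pss can be sketched as follows. This is a minimal illustration that assumes positions have already been converted to transcript orientation, so that larger numbers are further downstream; the function name nearest_background_site and the example positions are hypothetical.

```python
def nearest_background_site(mutation_pos, background, upstream):
    """Return the background site closest to a de novo ss on one side, or None.

    background: iterable of background ss positions in transcript orientation.
    upstream=True looks for a partner 3'pss upstream of a de novo 5'ss;
    upstream=False looks for a partner 5'pss downstream of a de novo 3'ss.
    """
    if upstream:
        side = [pos for pos in background if pos < mutation_pos]
        return max(side, default=None)
    side = [pos for pos in background if pos > mutation_pos]
    return min(side, default=None)


# Hypothetical background 3'ss positions around a de novo 5'ss at position 1000.
background_3ss = [400, 850, 1200]
partner = nearest_background_site(1000, background_3ss, upstream=True)
print(partner)
```

Comparing the returned position with the experimentally reported pss indicates whether the pseudoexon used the nearest background site or a more distant one with more reads.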
Pseudoexons that were created by means other than de novo css mutations (Fig 2C) had the best match to bss (Table 6 row 5, Fig 2C). These were mainly mutations within the pseudoexon, some of which are known to create splicing enhancers, but also included five mutations outside the pseudoexon that enhance the polypyrimidine tract or the branch point recognition sites for the 3’pss. In addition, some of the pseudoexons were activated by mutations of flanking 5’ or 3’ ss (Table S9). 25 out of 26 pairs of these pseudo splice sites matched background ss in snaptron and 48 of these 50 pss matched bss with the highest reads of all background ss within the intron in which the pseudoexon was formed (Table S9).
Spliceosome mutations and cancer
Mutations of the spliceosome, in particular of SF3B1, have been reported to activate novel aberrant splicing events in leukaemia and other cancers (18-20). We report that all tested novel cancer ss caused by SF3B1 mutations matched background ss in snaptron with relatively high read numbers (Table 7, Table S10). Nevertheless, the background read numbers for the aberrant 3’ or 5’ css that are activated by spliceosomal mutations are still in the order of 1000 fold less than the reads for normal intron removal, as indicated in the last column of Table 7 and see Table S10. By contrast, the rarer exon skipping or exon inclusion events that are enhanced by SF3B1 mutations have background splicing reads only 20 to 40 fold lower on average than normal intronic splicing (Table 7).
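The fold differences quoted above are ratios of reads for normal intron removal to reads for the background event. A short worked example, using placeholder read counts rather than values from Table 7 and a hypothetical function name:

```python
def fold_below_normal(normal_reads, event_reads):
    """Fold difference between normal intron removal and a background event."""
    if event_reads <= 0:
        raise ValueError("event_reads must be positive to compute a fold difference")
    return normal_reads / event_reads


# Placeholder counts chosen only to illustrate the two scales described in the text.
css_fold = fold_below_normal(100000, 100)     # aberrant css
skip_fold = fold_below_normal(100000, 4000)   # exon skipping or inclusion
print(css_fold, skip_fold)
```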
Mutations of the splicing components U2AF and SRSF2 are reported to cause quantitative rather than qualitative changes in splicing (21,22), whereas mutations of the small non-coding RNA U1 are reported to activate novel splicing events in SHH medulloblastomas (23). However, we found that 23 out of 24 of the most novel aberrant splice sites caused by U1 mutations matched background splice sites, including matches to aberrant splice sites for PTCH1, GLI2, CCND2 and PAX5, which are implicated in this cancer (Tables 7, S10). Sixteen out of the 23 css caused by U1 mutations matched background ss within the top three background reads (Tables 7, S10).
Recursive splicing
Large introns are removed in sections by a process called recursive splicing that uses internal splice sites within introns (24-27). We analysed the first 20 of over 2000 recursive splice sites discovered by a screen of the human genome (25) and Tables 8 and S11 show that all of these sites matched background ss, as would be expected (26), and that in 12/20 cases the matching background ss had the highest reads of all bss within an individual intron. Similarly, Tables 8 and S11 show that 82 and 86% of 5’ and 3’ recursive splices identified in human DMD introns (27) matched background ss and that the background ss with the most reads matched 3’ and 5’RS from DMD introns on 23/34 and 26/36 occasions.