Identification of commonly regulated alternative exon splicing events from RNA-seq data using multiple CRC sample cohorts.
We collected two public RNA-seq data related to CRC from GEO database (Supplementary table 1): CRC18P and CRC10P, in which paired tumor (CRC) and normal colon tissue (NC) were collected from 18 and 10 patients, respectively. We applied percent-spliced-in (PSI) analysis (Figure 1A, see Methods for details) to identify splicing events, which are different in CRC compared to NC tissues with PSI change>20% and P<0.05. Overall the ∆PSI values are correlated between CRC18P and CRC10P cohorts (supplementary figure 10A). As results, six exons showed significant regulation in both CRC18P and CRC10P cohorts. Three exons in three genes (CLK1, COL6A3 and CD44) showed relatively more magnitude of PSI change among the six and were followed up in more detail (Figure 1B, Supplementary table 2-3).
The exon 4 of CLK1 has a median inclusion level (PSI) of about 50% in NC samples and higher inclusion level (median 80%-90%) in CRC samples in CRC18P and CRC10P (Figure 1C-E, supplementary figure 1). In CRC18P, MC samples maintain the higher PSI compared to NCs (Figure 1D).
The exon 6 of COL6A3 has a median inclusion level (PSI) of about 10% in NC samples and higher inclusion level (median 40%-50%) in CRC samples in CRC18P and CRC10P (Figure 1F-H, supplementary figure 2). In CRC18P, MC samples maintain the higher PSIs as CRCs compared to NCs (Figure 1G).
Complex AS of CD44 gene in CRC
Exon v10 of the CD44 gene was found to be upregulated in CRC compared to NCs (Figure 1B, Supplementary table 2-3, supplementary figure 3). Exon v10 is the last exon of the 9 alternative exons between exon 5 and exon 16 (Figure 2C). We noticed multiple alternative exons were included in CRC tissues. The complexity of the AS in CD44 prompted us to use a modified PSI calculation method. As shown in Figure 2A-B, PSI_junc5’ represents the usage/expression of an exon-exon junction compared to all junctions sharing the same 5’ splice site. PSI_junc3’ represents the usage/expression of an exon-exon junction compared to all junctions sharing the same 3’ splice site. Results showed an increased expression of junc5’_E5-v8 (Figure 2D left), junc3’_v10-E16 (Figure 2D right), and decreased expression of junc5’_E5-E16 (Figure 2D middle) in CRC tissues in both cohorts. All other junctions showed less significant changes (data not shown). These data indicate an increase of CD44v8-10 (inclusion of exons v8, v9, and v10) and the decrease of CD44s (standard isoform that skips all 9 alternative exons) isoform in CRC tissues. Metastatic tissues also showed similar changes compared to NCs (Figure 2D top).
Alternative first exon regulation in CRC
Alternative terminal exon regulations including alternative first exons or alternative last exons are types of AS, which couples with alternative transcription start site and alternative polyadenylation, respectively. Regular method of calculating PSI for an exon (Figure 1A) does not apply to these situations since only one side of the terminal exon has splice junctions. We used DEXSeq [38], PSI_junc5’ and PSI_junc3’ (Figure 2A-B) to study these events. The basic criteria for DEXSeq data are adjusted P-value<0.001 and fold change>2 (Supplementary table 2,4). The criteria for PSI_junc5’ and PSI_junc3’data are Wilcoxon rank-sum test P-value<0.05 and PSI change>20% (Supplementary table 2, 5-6). After manual inspection of the raw results, we finally identified four events (Figure 3A-D, supplementary figure 4-7) regulated in CRC compared to NC. All the four events are alternative first exon events. ARHGEF9 showed down-regulation of the more upstream first exon E1a (significant in both DEXSeq and PSI_junc3’ methods, Figure 3A, supplementary figure 11A). HKDC1 E1a-E3a showed up-regulation (significant in DEXSeq method only, Figure 3B). CHEK1 first exon E1a also showed downregulation (Figure 3C, significant in DEXSeq method only). HNF4A exon 1A (more downstream than 1D) showed downregulation, relative to exon 1D (significant in PSI_junc3’ method only, Figure 3D, supplementary figure 11B). In all the four cases, changes of the first exon will alter the protein sequences thus will have a potential functional impact on the genes.
AS associated with metastasis of CRC
Taking the advantage of the metastasis data from CRC18P, we sought to identify splicing events associated with metastasis of CRC. The ∆PSI values are correlated between CRC (vs. NC) and MC (vs. NC) in CRC18P cohort (supplementary figure 10B). We selected a subset of splicing events with the further increase or decrease of PSI values in metastatic tissues compared to CRC by at least 15%. The following criteria were used to select these events: MC vs. NC |ΔPSI|>20% and P<0.001 (Wilcoxon rank-sum test), and MC vs. NC and CRC vs. NC in CRC18P and CRC10P data showing a consistent trend (same sign). 6, 3 and 25 events were identified from PSI_exon, PSI_junc5’ and PSI_junc3’ data, respectively (Figure 4A, Supplementary table 2-3, 5, 6). We manually selected two splicing events in SERPINA1 and CALD1 based on the fact that they were picked up by multiple analysis methods. The results indicated a higher magnitude of PSI change in metastatic samples (Figure 4A-C). In metastatic samples, the first exon of SERPINA1 switched from 1a to more downstream one 1b (Figure 4B, supplementary figure 8, 11C). CALD1 tended to express a shorter version of exon 5 and had exon 6 skipped by alternative 5’ splice site usage and exon skipping respectively (Figure 4C, supplementary figure 9, 11C).
Validation of splicing events using TCGA data
To validate the splicing events identified in this study in an independent dataset, we used a recently developed tool TSVdb [39], which integrates and visualizes AS data based on TCGA samples for 33 tumor types. We searched all the 9 genes identified in this study (CLK1, COL6A3, CD44, ARHGEF9, HKDC1, CHEK1, HNF4A, SERPINA1, CALD1) against TSVdb and downloaded the junction usage value for colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ). Junction usage is calculated by dividing junction quantification value to mean junction quantification value of that person for a specific gene [39]. We combined junction usage for 379 primary solid tumor samples with 1 metastatic and 2 recurrent solid tumor samples (382 CRC tumors samples in total) and compared with 51 normal colon or rectal tissue samples. Results indicated that, except for one gene CHEK1 (not significant, nominal P=0.2), all other 8 genes showed significant splicing changes in tumors samples compared to normal tissues (Wilcoxon test, Bonferroni adjusted P-value range from 2×10-15 to 8×10-44, supplementary figure 12, supplementary table 2, 7), consistent with what we observed in the two GEO datasets. ARHGEF9 E1-E3 junction and HNF4A E1b-E3 junction showed most significant P-values (Figure 5A-B). The highly consistent results among several cohorts of patient data indicate that the splicing events identified in this study are reliable markers of CRCs.
TCGA tumor samples may contain a fraction of stromal cells which may confound the splicing analysis. We tested the correlation of junction usage and percent of stromal cells for 382 CRC or metastatic tissues for the junctions from the 9 genes. Most junctions do not show any correlation with percent of stromal cells except for a junction in CD44 (Rho=-0.17, Bonferroni corrected P=0.013), in which high stroma content is associated with lower expression of the isoform that is more highly expressed in tumors (Supplementary table 2, 8). These indicated that stromal content in general does not confound the splicing analysis but in some cases, it may compromise the analysis.
To test whether the splicing events for multiple genes identified from this study can be used as biomarkers in CRC, we performed logistic regression analysis. We selected the 8 genes with significant junction usage in CRC compared to normal (supplementary figure 12, supplementary table 2, 7). For each gene, only one junction with the most significant P-value was selected. The resulting 8 junctions from 8 genes were separated to 3 groups based on the correlation of their junction usage in all samples (supplementary figure 13A). The group 1 is more distinct from group 2 and 3. To build a logistic model, we tried to identify the best combination of junctions/genes as predictors. We performed the analysis using single predictors (using the 8 genes separately), two predictors (using one from group 1 and 1 from group 2 or 3) and three predictors (using one from each of the three groups). In total, 41 sets of predictors were evaluated. Whether the sample is a CRC (value=1) or normal (value=0) is the dependent variable in the model. For each predictor set, 5 times of training and testing process were performed with each time using randomly selected 50% of the samples for training and the rest of samples for testing. The average AUC scores range from 0.873 to 0.999 (supplementary figure 13B). The models using more predictors perform better than less predictors but it’s not always the case. 17 of the 41 predictor sets have an average AUC above 0.99 including 7 two-predictor sets and 10 three-predictor sets. The best predictor set is the one using three junctions from CALD1, COL6A3 and HNF4A. 3 of the 5 models using this set of predictors has an AUC score of 1 (100% true positive and 0% false positive rate). One of the 3 models is shown in Figure 5C (ROC curve) and D (logistic regression curve) using the testing data only.
Relevance of splicing events to the survival of patients
To assess the clinical significance of these findings, we compared the overall survival (OS) data for patients with different splicing profile for the 10 genes. For each junction usage data, we separated all of patients with OS data into two equal sized groups (high and low junction usage groups). We found that two junctions (COL6A3 E5-E6 junction and HKDC1 E1-E2 junction) showed a significant difference in OS (Figure 5E-F). The patients with higher expression of both junctions, which are also more expressed in CRC cancers (Figure 1GH and Figure 3B), are associated with better survival.