Identification of commonly regulated alternative splicing events from RNA-seq data using multiple CRC sample cohorts.
We collected three public RNA-seq data related to CRC from GEO database (Figure 1, Supplementary table 1): CRC18P, CRC10P, and CRC3P, in which paired tumor (CRC) and normal colon tissue (NC or NM) were collected from 18, 10 and 3 patients, respectively. In CRC18P, 18 matched liver metastatic (MC) tissue were also collected. We applied percent-spliced-in (PSI) analysis (see Methods for details) and DEXSeq [26] to identify splicing events, which are different in CRC or MC compared to NC tissues. CRC3P data has relatively lower sequencing depth and higher 3’ bias in two NM samples (Supplementary table 2). Therefore, we focused on the analysis of CRC18P and CRC10P. As results, three exons in three genes (CLK1, COL6A3 and CD44) were identified to show splicing changes in CRC vs. NC in both CRC18P and CRC10P data (Figure 2A-B, Supplementary table 3).
The exon 4 of CLK1 has a median inclusion level (PSI) of about 50% in NC samples and higher inclusion level (median 80%-90%) in CRC samples in CRC18P and CRC10P (Figure 2C-E, supplementary figure 1). In CRC18P, MC samples maintain the higher PSI compared to NCs (Figure 2D). The exon4-skipping isoform, predominantly expressed in NC samples, matches the isoform 3 of CLK1 transcripts from Refseq annotation. Interestingly, isoform 3 is annotated as a non-coding transcript because skipping the 91 nt exon 4 is predicted to cause out-of-frame translation and the nonsense mediated decay (NMD) of the transcript (Figure 2C). The increase of exon4-inclusion isoform (isoform 1) at the expense of isoform 3 in CRC and MC may thus increase the productive transcript level of CLK1 and potentially produce more proteins. CLK1 encodes a member of the CDC2-like family of protein kinases (CLKs). In the cell nucleus, CLKs phosphorylate serine/arginine-rich proteins, release them into the nucleoplasm and then regulate alternative splicing of genes. Small molecule inhibitors against these CLKs were developed and exhibited growth suppression and apoptosis induction effect [28]. Here, we found that CLK1 itself can be alternatively spliced. CRC cells may increase the CLK1 expression by regulating the inclusion of exon 4. If the function of this splicing event can be validated by further evidence, exon 4 skipping could be a new target for cancer therapy.
The exon 6 of COL6A3 has a median inclusion level (PSI) of about 10% in NC samples and higher inclusion level (median 40%-50%) in CRC samples in CRC18P and CRC10P (Figure 2F-H, supplementary figure 2). In CRC18P, MC samples maintain the higher PSIs as CRCs compared to NCs (Figure 2G). This splicing change has been reported previously in colon, bladder, prostate and pancreatic cancers tissues compared to normal tissues [11, 29], indicating that the splicing change may play a role in multiple cancer types. High expression of COL6A3 in stroma were associated with poor prognosis in CRC [30]. COL6A3 was also found to be a key hub gene in the cell migration/extracellular matrix module that was associated with poor prognosis in CRC [31]. COL6A3 knockout decreases cell proliferation and invasion, increases cell apoptosis in cancer cell lines [31]. Taken together, the relevance and importance of COL6A3 to CRC tumorigenesis were highlighted. Our finding of exon 6 splicing change added more complexity of the role of COL6A3 in CRC and this may serve as an additional biomarker in the diagnosis of CRC.
Alternative splicing of CD44 gene in CRC
Exon v10 of the CD44 gene was found to be upregulated in CRC compared to NCs (Figure 2B, Supplementary table 3, supplementary figure 3). Exon v10 is the last exon of the 9 alternative exons between exon 5 and exon 16 (Figure 3C). We noticed multiple alternative exons were included in CRC tissues. The complexity of the alternative splicing in CD44 prompted us to use a modified PSI calculation method. As shown in Figure 3A-B, PSI_junc5’ represents the usage/expression of an exon-exon junction compared to all junctions sharing the same 5’ splice site. PSI_junc3’ represents the usage/expression of an exon-exon junction compared to all junctions sharing the same 3’ splice site. Results showed an increased expression of junc5’_E5-v8 (Figure 3D top), junc3’_v10-E16 (Figure 3D bottom), and decreased expression of junc5’_E5-E16 (Figure 3D middle) in CRC tissues in all of the three cohorts. All other junctions showed less significant changes (data not shown). These data indicate an increase of CD44v8-10 (inclusion of exons v8, v9, and v10) and the decrease of CD44s (standard isoform that skips all 9 alternative exons) isoform in CRC tissues. Metastatic tissues also showed similar changes compared to NCs (Figure 3D left).
The expression of alternatively spliced CD44 adhesion molecules has been implicated in the pathogenesis and metastasis of colorectal cancer. mRNA expression, different splice isoform expression or protein isoform expression have been used in different studies. However, the results are usually conflicting. Either positive correlation [23, 32-35] or no correlation [36] to CRC or metastasis has been reported. Our study suggests that isoform CD44v8-10 is upregulated in CRC, while CD44s are relatively down-regulated in liver metastasis tissues . Furthermore, increasing of CD44v8-10 expression has not been shown in metastasis tissues as well. Therefore,CD44v8-10 or maybe the ratio of CD44v8-10/CD44s could be used as a CRC biomarker.
Alternative first exon regulation in CRC
Alternative terminal exon regulations including alternative first exons or alternative last exons are types of alternative splicing, which couples with alternative transcription start site and alternative polyadenylation, respectively. Regular method of calculating PSI for an exon (Figure 2A) does not apply to these situations since only one side of the terminal exon has splice junctions. We used DEXSeq, PSI_junc5’ and PSI_junc3’ (Figure 3A-B) to study these events. After manual inspection of the raw results (Supplementary table 4-6), we finally identified four events (Figure 4A-D, supplementary figure 4-7) regulated in CRC compared to NC. All of the four events are alternative first exon events. In ARHGEF9 and CHEK1, the more upstream first exons showed downregulation (Figure4 4A, C), whereas in HKDC1 and HNF4A the upstream first exons showed upregulation (Figure4 4B, D), relative to the downstream first exons. In all of the four cases, changes of the first exon will alter the protein sequences thus will have a potential functional impact on the genes.
Although some studies indicated that many of these genes linked to tumorigenesis or metastasis, the detailed functions of these splicing events in CRC have not been reported. ARHGEF9 has been shown to play a role in tumor cell migration, invasion and metastasis by linking oncogene CHD1L and Cdc42 pathway in hepatocellular carcinoma (HCC) [37]. HKDC1 encodes the hexokinase domain containing 1, which catalyzes the phosphorylation of glucose. Its high expression in hepatocarcinoma (HCC) is related to poor overall survival (OS) [38]. And it is also predicted to be a novel therapeutic target in lung cancer [39]. CHEK1 contributes to CDC25C-mediated Docetaxel resistance and can also be a therapeutic target in prostate cancer [40]. In HNF4A gene, the promoter P1 driven isoforms (expressing exon 1A) were decreased in CRCs and MCs compared to the promoter P2-driven isoforms (expressing exon 1D) (Figure 4D, supplementary figure 7). It was reported that P1-HNF4a is expressed primarily in the differentiated compartment of the mouse colonic crypt and P2-HNF4a in the proliferative compartment. The mice that could only produce P2-HNF4a experienced more colitis and developed more tumors than normal mice [41]. Taken together, the splicing changes of these genes identified in this study may contribute to tumorigenesis and can be better biomarkers than gene expression in CRC.
Alternative splicing associated with metastasis of CRC
Taking the advantage of the metastasis data from CRC18P, we sought to identify splicing events associated with metastasis of CRC. We selected a subset of splicing events with the further increase or decrease of PSI values in metastatic tissues compared to CRC. We also required that the CRC and MC samples in CRC18P and the CRC samples in CRC10P data have the consistent trend of PSI change compared to the corresponding NC samples. Nine, eight and fifty events were identified from PSI_exon, PSI_junc5’ and PSI_junc3’ data, respectively (Figure 5A, Supplementary table 3, 5, 6). We manually selected three splicing events in SERPINA1, CALD1 and FBLN2 based on the fact that they were picked up by multiple analysis methods. The results indicated a higher magnitude of PSI change in metastatic samples (Figure 5A-D).
We found that the first exon of SERPINA1switch from 1a to more downstream one 1b in metastatic samples (supplementary figure 8). Although the two splice isoforms have the same start codon in exon 4, their different 5’UTR sequences may contribute to different RNA decay or protein translation of the gene. Elevated expression of SERPINA1 has been associated with the advanced stage, lymph node metastasis, poor prognosis and shorter overall survival in CRC [42] and gastric cancer [43].
Splicing of a shorter version of exon 5 of CALD1 has been identified by using the alternative 5’ splice sites. And skipping of exon 6 of CALD1 in metastatic CRC samples has also been found (Figure 5C, supplementary figure 9). The same splicing event was reported to be present in colon, bladder and prostate tumors compared to normal tissues [11]. The isoform expressed in metastatic CRC samples matches the low-molecular-weight isoforms (L-CAD), specifically encoded by WI-38 L-CAD II (transcript variant 2). The same isoform was significantly associated with poorer prognosis in urothelial bladder carcinoma (BC) [44]. L-CaD was also linked to lymph node metastasis and poorer prognosis in oral cavity squamous cell carcinoma (OSCC) [45].
Skipping of exon 9 of FBLN2 was identified in metastasis CRC samples (Figure 5D, supplementary figure 10). Preferentially skipping of exon 9 of FBLN2 has been reported in five cancer types [12]. The FBLN2 short isoform expression was suggested to drive malignant progression and metastasis in lung adenocarcinoma [46].
Validation of splicing events using TCGA data
To validate the splicing events identified in this study in an independent dataset, we used a recently developed tool TSVdb [27], which integrates and visualizes alternative splicing data based on TCGA samples for 33 tumor types. We searched all the 10 genes identified in this study (CLK1, COL6A3, CD44, ARHGEF9, HKDC1, CHEK1, HNF4A, SERPINA1, CALD1, FBLN2) against TSVdb and downloaded the junction usage value for colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ). Junction usage is calculated by dividing junction quantification value to mean junction quantification value of that person for a specific gene [27]. We combined junction usage for 379 primary solid tumor samples with 1 metastatic and 2 recurrent solid tumor samples and compared with 51 normal colon or rectal tissue samples. Results indicated that, except for one gene CHEK1 (not significant, P=0.2), all of other 9 genes showed significant splicing changes in tumors samples compared to normal tissues (Wilcoxon test, P-value range from 1×10-16 to 1×10-45, supplementary figure 11), consistent with what we observed in the three GEO datasets. ARHGEF9 E1-E3 junction and HNF4A E1b-E3 junction showed most significant P-values (Figure 6A-B). The highly consistent results among several cohorts of patient data indicate that the splicing events identified in this study are reliable markers of CRCs.
Relevance of splicing events to the survival of patients
To assess the clinical significance of these findings, we compared the overall survival (OS) data for patients with different splicing profile for the 10 genes. For each junction usage data, we separated all of patients with OS data into two equal sized groups (high and low junction usage groups). We found that two junctions (COL6A3 E5-E6 junction and HKDC1 E1-E2 junction) showed a significant difference in OS (Figure 6C-D). The patients with higher expression of the junctions, which are also more expressed in CRC cancers, are associated with better survival. The high gene expression of COL6A3 in stroma has been linked to poor overall survival in CRC [30], while high expression of HKDC1 is related to poor overall survival in hepatocarcinoma [38]. However, it’s still unclear whether the splicing isoforms of these two genes have the same functions. It might suggest that the splice isoforms have variable functions in different cancer types. More data is required to illustrate the functional link between the isoforms and patient survival.