Identification of RNA splicing junctions across different human tissues and cells
To construct the landscape of splicing events across different human sample types, we collected RNA-Seq data of 17,382 normal samples (30 tissue types), 9,581 cancer samples and 702 matched normal samples (33 tissue types), 1128 extracellular vesicle (EV) samples (4 biofluids), 1046 cancer cell lines and 4936 single cells (6 cell types) (Supplementary Table 1). We developed an optimized RNA splicing junction detection tool (Assembling Splice Junctions Analysis, ASJA) to analyze the RNA-seq data (Fig. 1a). ASJA identified unique splicing junction position in all regions of the genome and quantified junction expression based on assembling transcripts allowing comparison among samples. In total, we identified 870,389 splicing events including 333,349 in normal tissues, 667,215 in cancer tissues, 472,193 in EVs and 526,982 in cells (Fig. 1b). There are 246,652 splicing events present across all the sample types (Fig. 1b).
For each sample type, we obtained an average of 186,567 splicing events per normal tissue, 181,711 per adjacent normal tissue, 173,747 per cancer tissue, 80,109 per EV sample, 189,171 per cell line and 25,983 per single cell (Fig. 1c). Generally, splicing junctions detected in normal tissues were more numerous compared to cancer. Notably, about half of the splicing events (324,651, 48.65%) detected in cancer tissues are unannotated (Fig. 1d). Among annotated junctions, 306,426 (89.45%) were from protein-coding genes, and only 36,138 were from lncRNA (Fig. 1d). Among unannotated junctions, 300,058 (92.42%) were overlapped with gene which may represent new isoforms for host genes, and 24,593 were from intergenic regions (Fig. 1d).
To demonstrate the robustness of splicing events detected using ASJA from the short-read RNA-seq data, we made the comparison with long-read RNA-seq data. About 75% of the junctions could be observed in the long-read RNA-seq data from GTEx project. We also performed the short-read and direct full-length RNA-seq of HuH-7 and HCT 116 cells. For the detected junctions from short-read RNA-seq, about 80% in HCT 116 and 90% in HuH-7 could be validated from direct RNA-seq data (Fig. 1e). Moreover, the analysis of chromatin accessibility profiles of 23 types of primary human cancers, represented by 410 tumor samples derived from 404 donors from TCGA, revealed that clear ATAC-seq peaks were detected around the transcriptional start site of parental transcripts with unannotated junctions (Fig. 1f). Those results suggested that the identified RNA splicing junctions were derived from bona fide transcripts.
Characterization of RNA splicing junctions
To identify splicing type for each junction, alternative splicing events from those junctions were extracted. We identified a total of 114,490 alternative splicing events, of which alternative promoter (AP) events accounted for the majority (45,955, 40.14%) (Fig. 2a). Notably, a substantial number of splicing events exit in the intronic and intergenic regions (Fig. 2b). Even in spliced exonic regions including coding sequences (CDS) and untranslated region (UTR, 3’ and 5’), there could still occur splicing events (Fig. 2b). Supporting this observation, some recent studies provided experimental evidences for the true existence and functional roles of atypical splicing sites10,11, suggesting these junctions are attractive potential tumor targets and worth further investigation.
We further investigate splicing characteristics for different gene types and found that protein-coding genes had many more splicing events than lncRNA genes (median: 24 vs median: 3) (Fig. 2c). Even after excluding exon number, protein-coding genes still had higher exon usage rates than lncRNA genes (Wilcox test, p < 2.2×10 − 16) (Fig. 2d). Notably, intronic lengths of lncRNA genes were observed to be longer than those of protein-coding genes (Fig. 2e), suggesting that intronic length may affect splicing. Among these spliced genes, a group of genes were spliced frequently (exon usage ratio > 2). These genes were found to be involved in different biosynthesis and metabolic pathways (Fig. 2f).
Additionally, we found 143,463 splicing events under tissue-specific splicing regulation across GTEx tissue types (Fig. 2g). We next asked whether the splicing events marked tissue-specific genes. Tissue-specific (TS) score was calculated for each junction and gene across 30 normal tissues (see Methods). We found 90,768 splicing events under strong tissue-specific regulation (TS score > 0.95, Supplementary Table 2). By checking the tissue specificity of gene expression patterns of their parental genes, we found that tissue-specific splicing junctions were significantly enriched for tissue-specific genes compared to random expectation (p-value < 2.2e-16, Fisher’s exact test). However, we observed that some genes did not show tissue-specific expression, but their splicing sites had obvious tissue preferences. For example, the splicing site chr22:37688360|37688850:+ from protein-coding gene NOL12 was specifically expressed in testis (Extended Data Fig. 1a), while its parental gene did not show tissue specificity (Extended Data Fig. 1b).
Discover tumor-specific transcripts (TSTs) in pan-cancer
Although the number of splicing junctions in each cancer tissue was less than that of normal tissue, a substantial number of splicing junctions exclusively exist in cancers, suggesting the presence of abundant tumor-specific transcripts (TSTs). We performed stringent criteria to identify the tumor-specific splicing events. Only those junctions frequently expressed (> 5%) and specifically expressed or 10-fold higher in cancer compared to the maximum of adjacent and normal tissues are considered as tumor-specific junctions (TSJs) (Fig. 3a). In total, we discovered 32,848 TSJs derived from 29,051 TSTs (Supplementary Table 3). TSTs are frequently detected across 33 cancer types from 270 to 6842, of which ovary cancer (OV) and testicular germ cell tumors (TGCT) have the largest numbers of TSTs (Fig. 3b). Most of the TSTs (28,200, 97.07%) are un-annotated (Fig. 3c), and more than half of TSTs (16,284, 56.12%) were generated from intergenic region (Fig. 3d). The transcriptional start site of TST exhibits open chromatin features, implicating disrupted epigenetic regulation in tumorigenesis as a possible mechanism underlying aberrant TST expression (Fig. 3e). Moreover, most of these TSTs identified from cancer tissues were also present in cancer cell lines (25,808, 88.95%), suggesting that TSTs mainly from cancer cells. It is noteworthy that these TSTs included LIN28B-TST and MARCO-TST that we previously reported13,14.
Most TSTs are recurrent, with approximately 4882 TSTs appearing in at least 100 (> 1%) samples (Fig. 3f). For instance, a TST derived from XCL1 was expressed across several types of cancer, including bladder urothelial carcinoma (BLCA), esophageal carcinoma (ESCA), head and neck squamous cell carcinoma (HNSC), lung squamous cell carcinoma (LUSC), and other types of cancer (Extended Data Fig. 2a). Moreover, TSTs exhibit cancer-specific expression patterns (Fig. 3g), as exemplified by a TST from ID4 specifically expressed in OV (Extended Data Fig. 2b). These cancer-specific features suggest the potential utility of TSTs as diagnostic biomarkers and therapeutic targets in cancer.
TSTs are associated with poor prognosis and act as potential cancer drivers
We next explored the clinical significance of cancer patients with the presence of TSTs. A total of 3,130 TSTs were found to be associated with survival in at least one cancer (Extended Data Fig. 3a), and some of these TSTs were clinically relevant in multiple cancers (Extended Data Fig. 3b). To assess the degree of tumor-induced TST production, we developed a scoring system for each tumor sample on the basis of expression and frequency of TSTs (Fig. 4a, see Methods). Pan-cancer analysis of 9,555 tumor patients showed that higher TST level was significantly associated with worse overall survival (OS) and disease-specific survival (DSS), shorter progression-free interval (PFI) and disease-free interval (DFI) (Fig. 4b and Extended Data Fig. 3c). We found that TST level was associated with multiple clinical end points in several cancer types including pancreatic adenocarcinoma (PAAD), liver hepatocellular carcinoma (LIHC), brain lower grade glioma (LGG) and uveal melanoma (UVM) (Extended Data Fig. 3d).
We investigated the associations between TST score and other molecular features. The results showed that TST levels were positively correlated with genome aneuploidy and negatively correlated with tumor mutational burden (TMB) (Fig. 4a). Interestingly, a positive correlation was observed between TST level and stemness (Fig. 4a). The correlation between TST and stemness was calculated in 33 cancer types from TCGA. In 16 cancer types, TST levels showed a strong and significant positive correlation with stemness, with correlation coefficients ranging from 0.318 to 0.683 (Extended Data Fig. 4). To substantiate the correlation between TST level and stemness, we scrutinized a single-cell lung cancer dataset. 1969 cancer cells with a junction number greater than 15000 were retained for analysis using CytoTRACE, an algorithm that infers stem cell properties by scoring each cell based on its transcriptional diversity. As shown in Fig. 4c, Cells with more numerous TSTs displayed lower general differentiation (higher stemness, higher CytoTRACE score) compared to other cells (Fig. 4c). To reconstruct the acquisition of cell fate over time, we analyzed Monocle 2 to scRNA-Seq data, which provides pseudo-temporal ordering of cells relative to developmental progression (‘pseudotime’). It was also found that TST number changed along with cell differentiation trajectory (Fig. 4c). These results indicate that abnormal transcription of tumors is positively correlated with stemness.
To determine whether TSTs have functional effects, TST-derived genes were examined. We observed that some cancer driver genes such as EGFR and ERBB2 can produce TSTs. It was further investigated whether TSTs can function as potential cancer driver events by focusing on genes affected by somatic mutations and/or TSTs. A clear pattern was observed wherein significantly mutated genes (SMGs) such as TP53, CTNNB1, ARID1A, and RB1, and significantly TST-produced genes (STGs) such as TGM3, TRIM24, and LIN28B were bifurcated into two groups in LIHC cohort (Fig. 4d), suggesting SMGs and STGs are mutually exclusive. Among STGs, LIN28B-TST was previously reported to encode a protein isoform with additional N-terminal amino acids and was critical for cancer cell proliferation and tumorigenesis14. This supported the notion that STGs could represent an independent catalog of genes with functionally important tumor-specific transcripts alterations in cancer. To explore whether lack of mutations in STGs is a general feature, we investigate the relationship between TST scores and TMB in the pan-cancer cohort. As expected, the results showed that samples with high TST levels had low TMB, while the low TST group had a higher level of mutation (Fig. 4e). When examining genes affected by somatic mutations and/or TST splicing such as CNTNAP2 (one of the largest genes in the human genome, occupying 1.5% of chromosome 7) and PLD1 (tumor-associated gene) in pan-cancer cohort, we confirmed this pattern of mutual exclusivity (Fig. 4e). The mutual exclusivity between TST and mutation suggests that TSTs may act as drivers similar to mutations in cancer development.
TSTs are enriched in transposable elements derived transcripts with oncogenic function
Transposable elements (TEs) are major components of the human genome, comprising almost 50% of the total genomic DNA. We identified junctions derived from TE sites in both tumor and normal tissues. In total 154505 splicing TE junctions(spTEs) were detected in tumor tissues and 40737 were in normal tissues. The number of spTEs in cancer was much greater than in normal tissues (Fig. 5a), indicating that TE expression is activated in cancer. Strikingly, the TSTs were significantly enriched in TEs (tumor-specific TEs, tsTEs), particularly in the LTR class (Fig. 5b and Fig. 5c). Further analysis showed that tsTEs evolved much more recently than the background (Fig. 5d), possibly because young TEs are less affected by silencing mechanisms in tumors and have stronger transcriptional potential.
TEs are generally believed to activate the immune system by their transcript-derived components15. However, we found that high expression of tsTE was associated with lower immune activity in tumor samples (Fig. 5e) and correlated with poor prognosis (Fig. 5f), indicating a potential role for tsTE in promoting tumor development. Further analysis of tsTEs in cancer tissues revealed that a substantial number of tsTEs are frequently expressed (Fig. 5g). Among these, tsTE1 had the highest expression frequency in various cancer tissues (Fig. 5g and extended Data Fig. 5a). tsTE1 was a novel transcript located in cytoplasm (Extended Data Fig. 5b). RACE analysis confirmed that tsTE1 included a 511nt LTR sequence and was predicted to be a non-coding RNA (Extended Data Fig. 5c). Inhibition of tsTE1 expression by siRNA significantly inhibited the proliferation of cancer cells expressing tsTE1, while having no effect on cells that did not express tsTE1 (Fig. 5h and Fig. 5i). Conversely, overexpression of tsTE1 using a lentivirus significantly promoted colony formation in various cancer cells (Fig. 5j and Extended Data Fig. 5d). Next, pathway analysis was performed using differentially expressed genes from samples with and without tsTE1 expression. The findings revealed that tsTE1 was significantly correlated with cell cycle pathways and may promote tumorigenesis by regulating cell cycle-related genes (Fig. 5k). These results suggest that tsTEs could have oncogenic functions in tumor development.
TSTs represent a source of tumor neoantigens
Tumor neoantigens are currently important targets for cancer immunotherapy, but the discovery of neoantigens from tumor mutated proteins is limited. We investigated whether TSJs/TSTs derived tumor-specific peptides/proteins could potentially produce specific neoepitopes. We developed a workflow for predicting tumor neoantigens from RNA-seq (Extended Data Fig. 6). Among a total of 29,051 TSTs across 33 TCGA cancer types, 12,348 TSTs had the potential to encode proteins, accounting for 42.5% of all TSTs. We quantitatively assess the immunogenic potential of peptides across splicing junctions with coding capacity with a previously published score tool16. The immunogenic-related scores of 23,394 predicted translated-TSJ-derived peptides were ranked (mean: 0.263, range: 0-0.50) (Extended Data Fig. 7a). About half of peptides have strong affinity with MHC-I molecules in more than 5 patients (Extended Data Fig. 7b). Notably, TSJ introduced more potential neoantigens in tumors with low nsSNV-derived neoantigen load such as testicular germ cell tumors (TGCT), while mutation introduced more potential neoantigens in tumors with low TSJ-derived neoantigen load such as lung adenocarcinoma (LUAD) (Fig. 6a). In a further correlation analysis, we found a significant positive correlation between coding TSJ number (Extended Data Fig. 7c) and TSJ-derived antigen load, but a weak negative correlation between TMB and TSJ-derived antigen load (Extended Data Fig. 7d). Therefore, for cancer patients with a lack of non-synonymous mutation-derived neoantigens, TSJ-derived antigens may offer potential therapeutic targets.
To validate the expression of these predicted peptides, we used MS-GF + 17 to search mass spectrometry (MS) data in ovarian cancer (OV) and breast invasive carcinoma (BRCA) samples available from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) project, and identified a total of 37 neoepitopes derived from TSJs (Supplementary Table 4). Compared to non-synonymous single nucleotide variants (nsSNVs) and insertions/deletions (indels), TSJs introduced a higher number of CPTAC-confirmed neoantigens per sample in both OV and BRCA (Fig. 6b). Although CPTAC proteomic data confirmed the expression of peptides derived from TSJs, they were not able to demonstrate whether these peptides were processed and presented on major histocompatibility complex (MHC). To validate this, we conducted an analysis on 352 immunopeptidome samples of different cancer cells or tissues (Supplementary Table 4). Due to limited detection sensitivity of immunopeptidome and TSJs may cause in-frame deletions of functional protein domains or generate novel reading frames through frameshifts, we identified neoepitopes derived from TSTs. Altogether, 257 TST neoantigens were confirmed to be presented by MHC-I molecules, and both TSTs and TST-derived immunopeptides could be reproducibly detected in corresponding samples (Fig. 6c and Fig. 6d). The TST-derived peptides and UniProt annotated peptides had similar peptide length distributions, with the 9-mer peptides being most frequently presented by MHC-I molecules (Extended Data Fig. 7e), and the TST-derived 9-mer peptides had similar binding motifs with UniProt annotated peptides (Extended Data Fig. 7f).
Among these TST-derived immunopeptides, some peptides from TEs are derived from validated endogenous viral elements (EVEs) in the gEVE database18. These EVEs were identified by processing both RepeatMasker annotations and conserved known motifs from viral proteins such as Gag and Pol. ERVL-MaLR and ERV1 elements show selectivity in providing immunopeptides presented by MHC-I molecules (Fig. 6e). For example, DPVPDIIPV is a novel epitope produced by RNGTT-derived TST containing ERV1 elements, which was confirmed by mass spectrometry-based immunopeptidome analysis and induced increased levels of CD8 + T cell immune infiltration (Fig. 6f). Another example is SRVNGAWQY produced by TST from non-TE regions, which also causes differences in CD8 + T cell infiltration levels (Fig. 6g). These results suggest that TST-derived peptides are credible and may serve as a valuable source for neoantigen discovery.
We next sought to examine whether TST-derived neoantigen load was associated with clinical response to immune checkpoint blockade (ICB). To test this, we analyzed data from one cohort of metastatic gastric cancer patients receiving pembrolizumab treatment. We found patients with clinical benefit had a higher number of TST neoantigen load than patients without benefit (Extended Data Fig. 8a). The expression level of immune checkpoint molecules was considered one of the indications for immune checkpoint inhibitor therapy. We compared the expression of 16 literature-curated immune checkpoint molecules between high and low neoantigen load patients. The expression of PD-1, CD276, HAVCR2, LAG3, SIGLEC7 and SIGLEC9 was significantly higher in high neoantigen load group (Extended Data Fig. 8b). In addition, we found that six patients had a higher level of TST-derived neoantigen load, four of whom achieved complete or partial remission (CR/PR) after receiving immunotherapy (Extended Data Fig. 8c). Conversely, among 17 patients with moderate or low TST-predicted neoantigen loads, immunotherapy was not effective in 12 cases (Extended Data Fig. 8c).
RNA splicing junctions in extracellular vesicles for cancer diagnosis
Extracellular vesicles (EVs) are small heterogeneous vesicles that are secreted by virtually all cells and are abundant in all body fluids. They carry various molecular cargo, including RNA, and act as a reliable liquid biopsy for disease diagnosis19. We performed EV RNA-Seq analysis of 1128 samples from four different body fluids. We identified a total of 472,193 linear-splicing sites and 121,816 back-splicing sites in EVs. We found 46.83% (349,613/746,461) of the tissue/cell expressed linear-splicing junctions are detectable in EVs (Fig. 7a). RNA splicing junctions in EVs and tissues have a similar genomic distribution (Fig. 7b), and circRNAs are significantly enriched in EVs (Fig. 7c).
It has been demonstrated that specific cargo of EVs is reflective of donor cell origin and its physiological state20. We performed a tracking analysis with RNA splicing sites for investigating the origins of plasma EVs at tissue level. Figure 7D showed the deconvoluted tissue-proportion analysis of EV sources based on the expression of tissue-specific splicing sites (TSS) in the 1,076 individual plasma samples. We found that plasma EVs were mainly contributed by the spleen tissue in addition to blood (Fig. 7d). When considering cancer types, we observed that TSS of the liver were enriched in liver cancer compared to control samples (Fig. 7e), while TSS of adipose tissue were enriched in breast cancer compared to control samples (Fig. 7f). Furthermore, we observed that tumor-exclusive splicing junctions-parent genes were enriched for metabolic and tumor-related pathways, while immune-related signatures were exclusively enriched in EVs (e.g., T cell and B cell receptor signaling pathways) (Extended Data Fig. 9a).
To investigate whether RNAs with specific splicing sites can be used as cancer biomarkers, we first examined the expression of TSTs in plasma EVs. Interestingly, TSTs could be detected in EVs and were expressed in multiple samples (Fig. 7g). Nevertheless, the overall frequencies of TSTs were low (293 out of 747 cancer samples expressed TSTs) (Fig. 7g), possibly due to insufficient sequencing sensitivity. Additionally, we identified 194 specific RNA junctions that had significantly different levels between patients and healthy individuals (Extended Data Fig. 9b), of which 178 (17 were from tumor-associated genes) were upregulated in cancer and only 16 downregulated (Extended Data Fig. 9b and Extended Data Fig. 9c). For different types of cancer, up-regulated splicing sites were varied. For example, the splicing sites from APOH and HRG genes were specifically expressed in liver cancer (Extended Data Fig. 9d), while ETV4 and EAF2 are specifically expressed in pancreatic cancer (Extended Data Fig. 9e) and breast cancer EVs (Extended Data Fig. 9f), respectively. We therefore identified cancer-specific splicing sites (Fig. 7h) and investigated whether they can distinguish cancers. The EV datasets, encompassing seven cancers, was partitioned into training (70%) and validation (30%) subsets. A multi-class Support Vector Machine (SVM) model was employed to construct a splicing classifier using these cancer-specific junctions. The efficacy of the splicing classifiers was assessed via Receiver Operating Characteristic (ROC) analysis, focusing on the Area Under the Curve (AUC). The diagnostic ROC curves demonstrated outstanding AUC values: 91.1% for the training set (Fig. 7i) and 89.0% for the validation set (Fig. 7j), underscoring the robust predictive power of the model. These results suggest that RNA splicing sites in EVs are rich sources of diagnostic targets for different types of cancer.