An overview of DESJ-detection
Revealing splicing differences at the single-cell level would deepen our understanding of cell heterogeneity, function, and phenotype. Some major challenges of differential splicing analysis at the single-cell level include that scRNA-seq data has a high rate of dropout events and low sequencing depth compared to bulk RNA-Seq. These two features hinder our ability to accurately reveal the splicing structure of genes. In addition, splicing analysis is mainly limited to exon skipping (SE) and mutually exclusive exons (MXE). To address these challenges, we proposed DESJ-detection, an algorithm that uses junction-spanning reads to detect DESJs (Fig. 1A). First, we input all the junction read counts of each cell and output a junction-cell count matrix for each gene. Second, we applied iterative K-means to cluster cells and removed the clusters with low expression (SD < 0.2 and mean < 1) of all junctions resulting from low coverage and high dropout rate. Next, we utilized a new normalization method at the gene level to eliminate the interference of DEGs on DESJ detection. Specifically, this normalized the junction read count with the read count of each gene rather than uniquely mapped reads of each cell. Finally, we identified DESJs based on the Limma-tread algorithm with fold change and adjusted p-values. DESJ-detection can detect DESJs at any regions of a given gene; therefore, it can discover any type of AS, rather than being limited to SE and MXE events. We also developed a convenient pipeline (https://github.com/liushang17/DESJ-detection), which covers the generation of junctions, filtering and annotation of junctions, preparation of junction count matrices, and detection of DESJs (Fig. S1A).
To assess the performance of the software in terms of DESJ detection, we simulated scRNA-seq data with a pipeline based on Spanki considering different factors including read coverage, dropout rate, and isoform usage difference. Our method proved to be effective. For example, the simulated cells were divided into five clusters by the expression of two isoforms of PPT1. Four cell clusters showed differential junction expression and another cluster with low gene expression was removed by iterative K-means clustering (Fig. S1B). We observed sensitivity up to ~70%, even at the lowest coverage level (RPK = 25) when the isoform usage difference is more than the control and without dropout events (Fig. 1B). The sensitivity was essentially maintained at 85% at the general coverage level (RPK >= 50). In addition, the sensitivity exceeds 70% when the dropout rate is < 0. Over 95% of identified genes exhibited DESJ. Further, we also evaluated DESJ-detection by comparing it with other software including Outrigger, DEXSeq, rMATs, and Limma-trend (Fig. 1C). DESJ-detection performed the best (high TPR and low FPR) under all conditions tested. Outrigger had similarly low FPR but divergent TPR, and was especially influenced by coverage. High coverage led to better performance in Outrigger. rMATS and DEXseq were heavily influenced by dropout events. When the dropout ratio = 0.4, no gene was detected in rMATs, and DEXseq failed to run successfully. Limma-trend exhibited lower TPR but similar FPR to that of DESJ-detection. Taken together, DESJ-detection is robust and highly sensitive to DESJs.
Differential usage of junctions in UTRs across T cell clusters
We performed DESJ-detection on a published scRNA-seq data set 1. This dataset included 5,063 T cells from tumor tissues, normal tissues, and peripheral blood of six HCC patients that had been assigned to 11 T cell subsets including naïve T cells (C01_CD8.LEF1, C06_CD4.CCR7), effector T cells (C02_CD8.CX3CR1, C11_CD4.GNLY), exhausted T cells (C04_CD8.LAYN, C10_CD4.CXCL13), Tregs (C07_CD4.FOXP3, C08_CD4.CTLA4), mucosal-associated invariant T cells (C03_CD8.SLC4A10), and intermediate T cells (C05_CD8.GZMK, C09_CD4.GZMA). We obtained a set of 134,414 junctions that were characterized by read counts < 4 in at least 10 cells, covering 12,587 genes (Fig. S2 A & Fig. S2B). The junctions that were annotated to one gene were retained. In the end, we retained 119,311 junctions from 10,556 genes. Using DESJs analysis, we identified 1,176 DESJs across 11 clusters (log2(FC) ≥ 1, adjusted p-value ≤ 0.01; Supplementary Table 2).
To characterize the distribution of DESJs across the genome, we investigated the frequency of DESJs in different genomic regions. We found a significant higher frequency of DESJs in UTRs than in coding regions between clusters (p-value = 0.004 for CD8+ T cells and 6.456e-13 for CD4+ T cells; Student’s t-test; Fig. 2A). AS in the 5′ UTR occurs more frequently than in the 3′ UTR (Fig. S2C), in line with the findings from previous studies 18. There are similar phenomena in the human reference transcriptome. A junction is considered to be involved in alternative splicing when this junction does not appear in all isoforms of the gene. AS in the UTRs (98.7%) occurs potentially more frequently than in the coding regions (83.6%). Total 6115 AS junction happened in 5′ UTR while 4946 in 3′ UTR. AS events in UTRs might involves TTSs and TSSs (Fig. S2E). Higher frequency of DESJs in UTRs may be due to longer junction lengths in UTRs. Junction length refers to the genomic position of the last base of the intron minus the first base (Fig. S2F). We additionally observed that DESJs are significantly longer than non-DESJ in both UTRs and coding regions. The DESJs in UTRs were also longer than those in coding regions (Fig. 2B). UTRs are usually longer than coding regions. Thus, these two phenomena might be explained by the fact that longer junctions would provide more possible splice sites and potential regulatory regions. Therefore, our results highlight the generality of AS in UTRs.
In the meantime, we noticed that a few DEGs between clusters were also DESJ genes in T cells (Fig. 2C). The proportion ranged from 11–37% across T cell clusters, which is similar to findings from a previous study 19. For example, ARHGAP9, a member of RhoGAP family that is associated with good prognosis, was a differential expressed gene (highly expressed in C04_CD8.LAY, C10_CD4.CXCL13, and C08_CD4. CTLA4), and also showed differential splicing in UTRs across CD8+ T cell clusters (C01_CD8.LEF1, C02_CD8.CX3CR1, C04_CD8.LAYN, C05_CD8.GZMK). Specifically, ARHGAP9-203 was upregulated in exhausted T cells and tumor-infiltrated Tregs, while ARHGAP9-204 was mainly expressed in naïve T cells and peripheral blood Tregs (Fig. 2D). Therefore, our results indicate that AS in UTRs may play a role in regulating gene expression between cell clusters.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of genes with DESJs in their UTRs revealed involvement of the VEGF signaling pathway, T cell receptor signaling pathway, spliceosome, P53 signaling pathway, and cell apoptosis (Fig. 2E). Meanwhile, the genes with DESJs in the coding region were associated with innate immune pathways and spliceosomes (Fig. S2G). This emphasizes that AS in UTRs may be related to the specific function(s) of cells. Taken together, AS in UTRs is common and may contribute to the regulation of gene expression and cell heterogeneity.
T cell heterogeneity at the splicing level
To explore the association between AS and the function of T cell heterogeneity, we examined DESJs across T cell clusters to obtain cell-type-specific DESJs. In this study, we detected 335 DESJs from 165 genes among CD8+ sub-clusters and 484 junctions from 239 genes among CD4+ sub-clusters (Supplementary Table 2). We used two distinct indices to hierarchically cluster T cells, the number of DESJ genes and the expression of DESJs across all cell clusters. Both indices revealed that cells with a similar function rather than lineage exhibited a similar AS pattern. (Fig. 3A). For example, tumor-infiltrating Tregs and exhausted T cells (C04_CD8.LAYN, C08_CD4.CTLA4, C10_CD4.CXCL13) clustered together, demonstrating a huge difference between these cells and others. In addition, naïve T cells (C01_CD8.LEF1, C06_CD4.CCR7), effector T cells (C02_CD8.CX3CR1, C11_CD4.GNLY), and intermediate state T cells (C05_CD8.GZMK, C09_CD4.GZMA) clustered together respectively. Exhausted T cells showed the highest number of DESJs compared to other T cells, indicating that exhausted T cells exhibit the greatest changes in AS. These results indicate that junction usage differences between cell clusters mainly depends on the functional state of the clusters.
We next focused on the DESJ genes in four functional states including naïve T cells, effector T cells, exhausted T cells, and intermediate T cells. Naïve and exhausted T cells mainly showed differential splicing in genes relating to splicing and immunity such as CD45, HSPB1, CLK1, SRSF2, SNRNP70, PRDM1, NOSIP; effector T cells were characterized by differential splicing in ZEB2, FYB1, and SYNE1 (Fig. 3B). WARS was highly expressed in exhausted T cells and is a maker of exhaustion that showed differential splicing between exhausted T cells and other T cells (Fig. 3C, D & Fig. S3A). The junction representing WARS-202 (chr14_100369259_100376259_2) showed widespread expression in all T cells while the junction representing WARS-204 (chr14_100369259_100375282_2) was widely expressed in Tregs (C08_CD4.CTLA4) and exhausted T cells (C04_CD8.LAYN, C10_CD4.CXCL13). Prognostic analysis using TCGA LIHC data revealed that elevated expression of WARS was associated with poor prognosis (Fig. 3E). We found that elevated expression of the WARS-204 isoform was correlated with poor prognosis (Fig. 3F). Prognostic analyses with the TCGA LIHC data at the isoform level also supported our results (Fig. S3B). We hypothesized that various immunity therapy-related target genes might also show this pattern, and identified several T cell immunity checkpoint genes whose elevated expression was related to poor prognosis, such as TNFRSF4 (Fig. S3C). Expression of the two isoforms is mutually exclusive in T cells, rather than one isoform being more highly expressed than the other (Fig. S3D). In summary, these results demonstrate that AS significantly affects the function and phenotype of T cells and could be used as a potential marker for cancer prognosis and treatment.
Two novel functional subpopulations identified by CD103-201 and ARHGAP15-205
To further reveal the heterogeneity of T cell clusters, we utilized DESJs to identify functional subpopulations. ITGAE, also known as CD103, is a tissue-resident T cell marker and is highly expressed in exhausted T cells. One of its isoforms, CD103-201, was upregulated in CD8+ exhausted T cells (CD4_CD8.LAYN), while another isoform, CD103-202, was universally expressed in all CD8+ T cells. Differential splicing of CD103 does not fall into any common AS category (SE, MXE, IR, A5SS, and A3SS), since CD103-201 is expressed over nine more exons than CD103-202 (Fig. 4A). In addition, CD103-202 showed widespread expression in C04_CD8.LAYN, while CD103-201 exhibited a bimodal pattern in C04_CD8.LAYN, implying that different isoforms of CD103 likely play different roles in exhausted T cells (Fig. 4B). Meanwhile, the two populations (CD103-201+ and CD103-201-) displayed obvious differences in expression patterns. Specifically, the CD103-201+ population exhibited high expression of the exhausted marker ENTPD1, while the CD103-201- population showed high expression of ribosome proteins including RPL27, RPL35A, RPS29, and RPS21 (Fig. 4C). Gene ontology (GO) enrichment analysis revealed that the CD103-201- population has strong expression of translation factors, indicating that the CD103-201- population may be in a state of transition (Fig. S4A). To verify this hypothesis, we performed pseudotime analysis among all four CD8+ clusters. The CD103-201- population and C05_CD8.GZMK were located more centrally in the Monocle trajectory and had a significantly lower exhaustion score than the CD103-201+ population (Fig. 4D). Thus, this result further suggests that the CD103-201- population was in a “pre-exhaustion” state. As expected, we also observed that the CD103-201- population was associated with better prognosis than the CD103-201+ population according to the TCGA LIHC data (p = 0.009, Cox regression, Fig. S4 B). In summary, these results support CD103-201 may be associated with T cell exhaustion and thus may have clinical applications.
ARHGAP15, a Rac1-specific GAP, was reported to be associated with the development of diverse tumors, including colorectal cancer20, glioma21 and pancreatic ductal adenocarcinoma 22. However, little is known about the relationship between T cell state and ARHGAP15 at the isoform level. Our study discovered that ARHGAP15-201 was universally expressed in all cell clusters, but ARHGAP15-205 exhibited elevated expression in C06_CD4.CCR7 and C09_CD4.GZMA (Fig. 4E). Further, ARHGAP15-205 shows a striking bimodal expression distribution in both CD4 naïve T cells (C06_CD4.CCR7) and CD8 naïve T cells (C01_CD8.LEF1) (Fig. 4F & Fig. S4E). This implies that ARHGAP15-205 may affect the functional state of naïve T cells. We identified 174 genes that were highly expressed in ARHGAP15-205+ naïve T cells (FDR < 0.01, log2(FC) ≥ 1; Supplementary Table 1). These genes significantly overlapped with genes that are markers of an activated state as defined by previous studies (Fig. S4C). Thus, the ARHGAP15-205+ population may represent an activated state. Signature genes of ARHGAP15-205+ include S100A4, ITGB1, S100A6, and LGALS1, supporting that the ARHGAP15-205+ population trends towards an activated state (Fig. 4G). In contrast, the ARHGAP15-205- population was characterized by high expression of genes related to a resting state including CCR7, SELL, and LEF1. Meanwhile, GO biological process enrichment analysis showed that the ARHGAP15-205+ population signature genes were enriched in cell differentiation (including leukocyte and lymphocyte differentiation) and cell activation (Fig. S4D). In addition, pseudotime analysis of cells in C06_CD4.CCR7, C09_CD4.GZMA, C10_CD4.CXCL13, and C11_CD4.GNLY showed that ARHGAP15-205+ cells clustered more closely to cells in C09_CD4.GZMA and had a lower naïve score compared with the ARHGAP15-205- population (Fig. 4H). These results suggest that ARHGAP15-205+ CD4 naive T cells might be in the “pre-activation” state and possess immune killing function. Similar results were associated with respect to CD8 naïve T cells (C01_CD8-LEF1; Fig. S4E & Fig. S4F). In summary, ARHGAP15-205 may play a role in T cell activation.
We used Seurat to cluster cells in C04_CD8.LAYN and C06_CD4.CCR7 with a TPM expression matrix. The clustering results based on gene expression were dissimilar to the classification results using AS of CD103 and ARHGAP15 (Fig. S4G). This indicates that the novel cell subtype may be indeed determined by AS. Altogether, these results emphasize that AS analysis at single-cell level would reveal cell heterogeneity and facilitate the discovery of cell sub-clusters in higher resolution than at the gene expression level.