Pervasive Intronic Polyadenylation Serves as a Potential Source of Cancer Neoantigens

doi:10.21203/rs.3.rs-1537870/v1

This work is licensed under a CC BY 4.0 License

Version 1

Tumor-specific neoantigens have emerged as a promising source of targets for cancer immunotherapy. They can arise from somatic mutations, aberrant splicing and RNA editing. Because intronic polyadenylation has a similar potential to generate tumor-specific transcripts and peptides, it may serve as another, so far unexplored, neoantigen source. We developed a novel computational pipeline for identifying tumor-specific transcripts derived from intronic polyadenylation and their translated neoantigens. Applying it to RNA-seq data from 5,654 tumor samples of various cancers and more than 11,000 normal samples of different tissues, we observed widespread tumor-specific intronic polyadenylated transcripts and their corresponding neoantigens. In addition, we discovered complementary effects of neoantigens derived from different sources, identified neoantigens arising from recurrent intronic polyadenylated transcripts, and validated their immunogenicity. These results show that neoantigens can be identified and predicted from intronic polyadenylated transcripts using RNA sequencing data, allowing such neoantigens to be explored as potential candidates for cancer immunotherapy.

Neoantigens are antigens that are presented specifically on tumor cells but not on normal cells and are recognized by CD8+ T cells, which elicit a potent anti-tumor response¹. Currently, most neoantigens applied in immunotherapy are derived from somatic mutations (SMs). Early clinical trials in melanoma patients treated with cancer vaccines demonstrated the immunogenicity of mutation-derived neoantigens^2–4. In addition to SMs, alternative neoantigen sources have recently been proposed, including aberrant splicing and RNA editing^5–8. Two forms of aberrant RNA splicing have been shown to potentially generate neoantigens: one translated from neo-junctions linking two coding exons, and the other from retained introns (RIs). Notably, the loads of neoantigens derived from RIs or other forms of splicing have been shown to be comparable to those from SMs in melanoma and other cancers^5,6.

In addition to splicing and RNA editing, polyadenylation is another important post-transcriptional step of RNA processing, and alternative polyadenylation (APA) can generate RNA transcripts with variable 3’ ends. Previous studies showed that APA within the 3’UTR can activate oncogenes in cis by escaping microRNA targeting, as well as deactivate tumor suppressors in trans by competing for microRNA binding in cancer cells^9,10. Recently, it has been shown that abnormal polyadenylation that alters the coding sequence is associated with human diseases, including cancers^11,12. Furthermore, widespread intronic polyadenylation (IPA) sites have been reported in several studies^13,14. Since aberrant IPA in cancer cells has a similar potential as SM and splicing/RI to generate tumor-specific transcripts and peptides, we postulate that IPA may serve as a hitherto unexplored source of neoantigens.

To evaluate polyadenylation sites (PASs) genome-wide, various experimental techniques have been developed that enrich fragments at transcript 3’ ends using the polyA tail as a hook, followed by high-throughput sequencing^15–18. Because read coverage at transcript 3’ ends is relatively lower in RNA-seq data than in 3’ end sequencing data, unbiased and precise genome-wide detection of PASs from RNA-seq can be challenging. However, the 3’-enriched approaches are rather laborious and/or costly to perform on a large scale compared with conventional RNA-seq. By contrast, RNA-seq is routinely performed for transcriptome analysis of normal and disease/cancer samples, and an increasing number of such datasets is now available and accessible. Hence, it would be advantageous to use conventional RNA-seq data, in particular those with high sequencing depth, for the detection of PASs. To date, a number of computational methods have been developed for APA prediction^19–24. Most of them are either based purely on DNA sequence, using cis-elements located around PASs, or rely on RNA-seq read coverage transitions at the 3’ ends using various mathematical models (Supplementary Table 1). The former can predict general putative PASs without considering their specificity in different cell/tissue types^19,20, while the latter have demonstrated good performance in identifying PASs within 3’UTRs and can deliver sample-specific APA results^21–24. However, almost none of them can effectively identify de novo IPA events.

Employing polyA-spanning reads for PAS identification²⁵ and RNA-seq coverage transitions around PASs for APA-expression estimation^21–24, we developed a de novo computational method termed PolyA-Spanning Reads Analysis Method (PASRAM), which can use RNA-seq data for more accurate and comprehensive genome-wide identification of PASs, including IPA sites. Applying the pipeline to 5,654 samples from 12 major cancers and more than 11,000 samples from normal tissues, we observed widespread IPA transcripts and IPA-derived neoantigens across different cancers. The loads of predicted neoantigens derived from IPA transcripts were comparable and complementary to those of other neoantigen sources such as SM and RI. Importantly, we were able to identify recurrent, tumor-specific IPA transcripts and neoantigens across cancer types, making them clinically attractive. Lastly, we demonstrated using an ex vivo co-culture platform that the IPA-derived neoantigens are highly immunogenic and hence can be considered as potential candidates for cancer vaccines.

Pipeline for identification of IPA-derived neoantigens. To comprehensively characterize neoantigens derived from IPA in human cancers, we used RNA-seq data of 5,654 tumor samples from 12 cancer types in The Cancer Genome Atlas (TCGA) project as well as more than 11,000 normal samples from different normal or adjacent normal tissues in the Genotype-Tissue Expression (GTEx) project, the BLUEPRINT epigenome project or TCGA (Fig. 1a and Supplementary Table 2). Applying our computational method PASRAM to these data, we were able to identify tumor-specific IPA sites de novo. Briefly, we extracted soft-clipping information from mapped RNA-seq reads to identify polyA-spanning reads covering both the 3’ end of the mRNA and the polyA tail (Supplementary Fig. 1), and then clustered those reads to predict the representative PAS for each cluster (Fig. 1b). We further filtered the identified PASs by requiring a significant drop in RNA-seq read coverage in the downstream region compared with the upstream region. Lastly, we estimated the expression of the transcript for each PAS using both the read coverage drop and the number of polyA-spanning reads. To extract tumor-specific IPA transcripts, we discarded any IPA sites that were also found in any of the more than 11,000 normal samples of various tissues or in the GENCODE annotation, and performed further filtering based on expression (Fig. 1b and Methods).

In parallel, the genotypes and expression levels of HLA alleles were estimated by the seq2HLA tool²⁶. Finally, binding affinities between the peptides encoded by IPA transcripts and their corresponding HLA alleles were predicted²⁷, and the candidates with high binding affinities were further filtered with UniProt proteome database to acquire a list of putative IPA-derived neoantigens (Fig. 1c and Methods). Lastly, we employed an ex vivo co-culture platform for assessing activation and expansion of cytotoxic T cells and immunogenicity of the predicted IPA-derived neoantigens (Fig. 1d).
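
The proteome-filtering step above can be illustrated with a short sketch. It assumes 9-mer peptides (the length synthesized for validation later in the text) and exact-match filtering against reference protein sequences; the function names are hypothetical, and the binding-affinity prediction itself is performed by an external tool and is not reproduced here.

```python
# Hypothetical sketch: enumerate peptides from an IPA-encoded protein and keep
# only those absent from a reference proteome. Not the authors' implementation.

def peptide_windows(protein, k=9):
    """Return all overlapping k-mers of a protein sequence, in order."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def novel_peptides(ipa_protein, reference_proteins, k=9):
    """Return k-mers of ipa_protein that occur in no reference protein."""
    reference_kmers = set()
    for ref in reference_proteins:
        reference_kmers.update(peptide_windows(ref, k))

    seen = set()
    result = []
    for pep in peptide_windows(ipa_protein, k):
        if pep not in reference_kmers and pep not in seen:
            seen.add(pep)
            result.append(pep)
    return result
```

In this sketch, only peptides spanning the point where the IPA-encoded sequence diverges from the annotated protein survive the filter, which is the property that makes them candidate neoantigens.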

Comprehensive evaluation of PASRAM for APA detection. We first evaluated PASRAM by using MCF7 and HEK293 RNA-seq datasets. A total of 12,343 and 14,557 PASs were identified with ~ 3.3% located within introns for MCF7 and HEK293, respectively (Supplementary Figs. 2a, 3a). We assessed predicted IPA sites as well as genome-wide predicted PASs step by step from the following five aspects. Firstly, we overlapped these identified PASs with available PacBio long reads for MCF7 as well as two annotated PAS databases: PolyASite²⁸ and PolyA_DB²⁹ for both cell lines. High overlapping rates (~ 80%) with the two databases were observed, while 57% of predicted PASs were supported by MCF7 PacBio long-read transcriptome, which contains only 55,770 assembled transcripts (Supplementary Fig. 2b, 3b). Four predicted IPA sites from genes ANKEF1, NCOA3, KCTD20, and LUC7L2 were depicted in genome browser with the co-occurrence of PacBio long reads for MCF7 along with detected polyA-spanning reads (Fig. 2e, j, k and Supplementary Fig. 4a), demonstrating high accuracy of our predictions.

Secondly, we showed that both intronic and genome-wide PASs identified by PASRAM shared similar nucleotide compositions as canonical/annotated PASs^18,30, with adenosine as the most dominating nucleotide at the cleavage sites for MCF7 (Fig. 2a, b and Supplementary Fig. 2c). Thirdly, the polyA signals were found in about 80% of PASs within the 60 nt upstream region, and the most prevalent motif was AATAAA followed by ATTAAA with significant enrichment over the background (Fig. 2c, d and Supplementary Fig. 2d), which were consistent with the findings from 3’ end sequencing method¹⁸. Similar results were also observed in HEK293 cell line (Fig. 3a-d and Supplementary Fig. 3c-e). Fourthly, RT-qPCR and 3’ RACE validation experiments were conducted to further confirm the presence/expression of our predicted IPA transcripts (Fig. 2f-i and Supplementary Fig. 4b-e). Finally, a publicly available A-seq2 dataset for HEK293 cell line³¹ was utilized to benchmark PASRAM along with two published methods (DaPars²³ and APAtrap²⁴) capable of de novo PAS identification. As shown in Supplementary Fig. 3f, PASRAM demonstrated better performance in de novo PAS identification than the other two methods by achieving higher true positives almost without any false positives. It should be however noted that a low true positive rate was generally obtained when using RNA-seq data. This was mainly due to low read coverages for the missed PASs which were enriched in reads sequenced using A-seq2. Collectively, the above findings demonstrate higher quality of the IPA transcripts identified by PASRAM.
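
The polyA-signal check described above can be sketched as a scan of the 60 nt upstream of a cleavage site. This is a minimal illustration, assuming the upstream sequence has already been extracted on the transcript strand; the helper name and the restriction to the two motifs named in the text are assumptions.

```python
# Hypothetical helper, not part of PASRAM: scan the upstream window for a
# polyA signal and report the occurrence closest to the cleavage site.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")  # the two most prevalent motifs in the text
UPSTREAM_WINDOW = 60                  # nt upstream of the cleavage site


def find_polya_signal(upstream_seq, motifs=POLYA_SIGNALS, window=UPSTREAM_WINDOW):
    """Return (motif, distance) for the motif nearest the cleavage site, or None.

    upstream_seq must end immediately before the cleavage site (5'->3').
    distance is counted from the first base of the motif to the cleavage site.
    """
    region = upstream_seq.upper()[-window:]
    best = None
    for motif in motifs:
        pos = region.rfind(motif)  # rightmost hit is closest to the cleavage site
        if pos == -1:
            continue
        distance = len(region) - pos
        if best is None or distance < best[1]:
            best = (motif, distance)
    return best
```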

Landscape of neoantigen loads in 12 TCGA cancer types. After evaluating our method for PAS detection, we applied it to RNA-seq data of 5,654 tumor samples from 12 cancer types and 11,383 normal samples (Fig. 1a and Supplementary Table 2). Widespread IPA transcripts were detected across different cancers. On average, cancer samples had about seven times more IPA transcripts than adjacent normal tissues (p < 0.001) (Fig. 4a). After applying the filtering steps described above for tumor-specific events, widespread tumor-specific IPA transcripts could also be identified across different cancers. Gastric and ovarian cancers showed the highest average numbers of tumor-specific IPA transcripts per sample (Fig. 4b), probably because they had the highest sequencing depths among the 12 cancers. To ensure that the predictions were both tumor- and IPA-specific, we also inspected their IGV plots across different tumor and normal samples as well as the polyA-spanning reads (Fig. 4c and Supplementary Fig. 5).

To demonstrate that tumor-specific transcripts were potentially translated to generate stable neoantigens, we analyzed proteomic data of 38 breast cancer and 59 ovarian cancer samples from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) with the matched TCGA RNA-seq samples³². A total of 802 polypeptides derived from the predicted IPA transcripts were supported by the CPTAC data with false discovery rate (FDR) of 0.01 as cutoff (Supplementary Table 3). Among the 802 IPA-derived polypeptides identified by MS data, five were predicted to have strong binding affinities to the corresponding HLA alleles (Fig. 4d and Supplementary Fig. 6). Similar numbers of tumor-specific polypeptides were identified for all three sources of IPA-, SM- and RI-derived neoantigens. Comparable numbers were also observed in other studies for different sources of neoantigens^5,6,33.

Next, we investigated whether the identified IPA neoantigens were redundant with, or already covered by, existing SM- or RI-derived neoantigens for each patient. To do so, we correlated the neoantigen loads of the three sources across all 5,654 patients and within individual cancers. No correlations were observed across the 5,654 patients (Fig. 4g,h) or within individual cancers (Supplementary Fig. 7), indicating that the different neoantigen sources vary largely independently of one another rather than being redundant. Hence, IPA-derived neoantigens add to the repertoire of neoantigens available for cancer vaccines.

IPA neoantigen loads correlate with anti-tumor immune response and prognosis. To determine the ability of neoantigens to activate an anti-tumor immune response, we correlated the neoantigen load from each source (SM/RI/IPA), as well as the total neoantigen load from all three sources, with the fractions of different tumor-infiltrating immune cells (TIICs) across tumor samples in each cancer type. As with other neoantigen sources^6,34, positive correlations between neoantigen loads and different TIICs (e.g. CD8+ T cells and activated NK cells) could be observed for various cancer types (Fig. 5a). Notably, higher positive correlations were observed with the fraction of T follicular helper (Tfh) cells, consistent with previous data reporting a positive association of Tfh levels with tumor mutational burden (TMB) in lung adenocarcinoma patients³⁵. As Tfh cells are crucial for antigen-specific B cell development, we also observed that the neoantigen loads were positively correlated with the fraction of memory B cells (Fig. 5a). Interestingly, the total neoantigen load of the three sources (SM + RI + IPA) showed slightly stronger correlations with these TIIC fractions than any of the three neoantigen sources individually for most of the 12 TCGA cancers (Supplementary Fig. 8a). This might further indicate potential complementary contributions of the different neoantigen sources to antitumor immune responses. As quality control for the above correlation analyses, we also correlated the neoantigen loads of IPA, RI and SM with the reported mutation-related neoantigens and mutation rates across the 12 TCGA cancers³⁴. High correlations were observed between our predicted SM loads and the published SM-neoantigen loads as well as silent/non-silent mutation rates, while both IPA- and RI-derived neoantigens showed no or poor correlations with those mutation-related terms (Supplementary Fig. 8b).

To further investigate potential complementary or synergistic effects of different neoantigen sources and the prognostic value of IPA neoantigens, we first performed survival analysis for TCGA patients stratified by SM- or IPA-neoantigen load individually (Fig. 5b, left and middle panels). Next, we combined the IPA and SM neoantigen loads and performed survival analysis only for the two patient groups in which both the SM- and IPA-neoantigen loads were greater than, or both were lower than, their median values. Significant improvements in prognosis were observed for TCGA bladder cancer patients (Fig. 5b, right panel) as well as for the TCGA lung squamous cell carcinoma cohort (Supplementary Fig. 9).
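
The combined stratification can be sketched as follows. This is an illustrative grouping only, assuming per-patient loads are available as dictionaries keyed by patient ID; the function name is hypothetical and the survival test itself is not shown.

```python
from statistics import median


def stratify_by_joint_load(sm_load, ipa_load):
    """Group patients whose SM and IPA loads are both above or both below
    the respective cohort medians; all other patients are left out."""
    sm_median = median(sm_load.values())
    ipa_median = median(ipa_load.values())
    groups = {"high": [], "low": []}
    for patient in sorted(sm_load):
        sm, ipa = sm_load[patient], ipa_load[patient]
        if sm > sm_median and ipa > ipa_median:
            groups["high"].append(patient)
        elif sm < sm_median and ipa < ipa_median:
            groups["low"].append(patient)
    return groups
```

Patients with discordant loads, or with a load exactly at the median, fall into neither group, which mirrors the restriction to two patient groups described above.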

We further investigated the specificity of IPA neoantigens to cancers and examined their association with patient survival within a subgroup of patients. To this end, survival analysis stratified by IPA-neoantigen load was first performed for all patients in a cohort (Fig. 5c, upper panel and Supplementary Fig. 10a). We then extracted only the samples with low SM-neoantigen loads and repeated the survival analysis using the IPA-neoantigen loads of those patients. Notably, the IPA-neoantigen load was significantly associated with better prognosis for AML and gastric cancer patients with low SM-neoantigen loads (Fig. 5c, bottom panel and Supplementary Fig. 10b), indicating that the IPA-neoantigen load could be a prognostic marker only when IPA is the prevailing source of neoantigens. To determine the correlation of the T cell response with the IPA-neoantigen load, we focused only on the extracted patients and evaluated the differences in expression of immune response markers, including T cell markers (CD8A, CD8B, PD1, CTLA4) and immunogenetic markers (GZMA, PRF1), between patients with high and low IPA-neoantigen loads. Most of them were significantly upregulated in AML and gastric cancer patients with high IPA-neoantigen loads (Fig. 5d and Supplementary Fig. 10c). While both PD1 and CTLA4 were positively correlated with the IPA-neoantigen load in the TCGA gastric cancer cohort, only CTLA4 (but not PD1) had elevated expression in AML patients with high IPA-neoantigen loads, indicating that additional combinatory immunotherapy with different immune-checkpoint inhibitors (ICIs) might be beneficial for AML and gastric cancer patients with high IPA- but low SM-neoantigen loads.

Next, we investigated the high and low load grouping of IPA neoantigens for the TCGA melanoma cohort as well as two independent cohorts of melanoma patients undergoing ICI immunotherapy^36,37. Slightly poorer survival was observed for the group of high IPA-neoantigen loads in the TCGA cohort while significantly higher IPA neoantigen loads were found in the non-responders for both independent cohorts (Supplementary Fig. 11a-c). These negative correlations could potentially be explained by the lower ratios of IPA neoantigens to the bulk of SM, RI and IPA (Supplementary Fig. 11d), suggesting that a source of low abundance neoantigens may not properly represent the total neoantigen reservoir triggering immune response. Furthermore, this supports the rationale that focusing on only one neoantigen source may not be sufficient to provide a more holistic understanding of a patient’s immune response to neoantigens for increased successful patient treatment.

Identification and functional validation of recurrent IPA-derived neoantigens. As neoantigens are derived specifically from individual genome and/or transcriptome, neoantigen immunotherapy needs to be personalized to treat individual patients. Due to high costs and time delays associated with the manufacture of individualized neoantigens, it is desirable to derive recurrent neoantigens among the cancer patients. Thus, we investigated whether there were any recurrent IPA events across the patients in each of the 12 TCGA cancers and compared the recurrent transcripts as well as neoantigens among the three sources (IPA, RI and SM). Recurrent transcripts or neoantigens in a cancer cohort were defined as those tumor-specific events with identical genomic locations or amino acid sequences identified in more than 10% of patient samples of the cohort. For six out of the 12 TCGA cancer types, several recurrent IPA transcripts could be detected in more than 20% of patients (Fig. 6a), while a number of highly recurrent RI transcripts were observed in TCGA AML patients (Fig. 6b). However, there were hardly any SM events with more than 5% recurrence rate except for BRAF V600E in skin and colon cancers (Fig. 6c). To ensure the reliability of predicted recurrent tumor-specific IPA transcripts, we inspected a number of representative candidates associated with HLA-A*11:01 using IGV plots across different tumor and normal samples as well as examined the presence of polyA-spanning reads (Supplementary Fig. 12). Similar to transcripts, recurrent IPA-derived neoantigens were also identified in the six different cancers, recurrent RI-derived neoantigens were observed mainly in TCGA AML patients, and hardly any recurrent SM-derived neoantigens could be found (Supplementary Fig. 13a-c).
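
The recurrence definition above (identical events found in more than 10% of patient samples in a cohort) can be expressed as a short counting sketch. The event keys and the function name are illustrative assumptions; an event could be a genomic location or a peptide sequence.

```python
from collections import Counter


def recurrent_events(events_by_patient, min_fraction=0.10):
    """Return {event: fraction_of_patients} for events seen in strictly more
    than min_fraction of the patients in the cohort."""
    n_patients = len(events_by_patient)
    counts = Counter(
        event
        for events in events_by_patient.values()
        for event in set(events)  # count each event once per patient
    )
    return {
        event: count / n_patients
        for event, count in counts.items()
        if count / n_patients > min_fraction
    }
```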

As our evaluation system of peptide immunogenicity was based on HLA-A*11:01 (Fig. 1d), we selected only those recurrent IPA-derived neoantigen peptides that were predicted to bind HLA-A*11:01 with strong affinity. To investigate their immunogenicity, we synthesized 9-mer peptides for ten selected recurrent tumor-specific neoantigens (Supplementary Table 4). Mature DCs from HLA-A*11 healthy donors were pulsed with the pooled peptides and co-cultured with autologous T cells. We observed a significant increase in IFNγ-secreting T cell colonies in the peptide-pulsed cells relative to the non-pulsed control, as measured by ELISpot (p = 0.008; Fig. 6d,e). In addition, 14.6% of the co-cultured cells were IPA neoantigen-specific as determined by tetramer staining (Fig. 6f). Next, we co-cultured neoantigen-expanded T cells with target cells expressing a granzyme reporter that emits IFP only after cleavage of the reporter substrate by granzyme B³⁸. When the IPA neoantigen-expanded T cells were co-cultured with the target reporter cells, we observed a significant increase in IFP-positive target reporter cells (26.4%) relative to the negative control (8.77%; Fig. 6g). With the positive control (EBV antigen), 27.7% of CD8+ T cells were tetramer-positive and 28.1% of the target reporter cells were IFP-positive (Supplementary Fig. 13d,e), indicating first that the granzyme reporter is a reliable readout and second that IPA neoantigens achieve good T cell expansion (14.6%) and cytolytic activity (26.4%) compared with the positive control. To further investigate the immunogenicity of individual peptides, we performed ELISpot analysis of IFNγ secretion using the IPA neoantigen-expanded T cells. Notably, we confirmed that three of the ten synthesized IPA neoantigen peptides were significantly immunogenic (Fig. 6h,i).

Together, these results demonstrate that the IPA-derived neoantigens are immunogenic, resulting in activation and expansion of antigen-specific T cells as well as peptide-specific killing.

Currently, SM is the best-known source for neoantigen identification. Although neoantigen immunotherapies or personalized vaccines using patient-specific neoantigens have demonstrated feasibility, safety, and immunogenicity in clinical trials of cancer treatment, patient response rates remain limited^3,39,40. Here we propose IPA as an alternative source of neoantigens, thus increasing the availability of immunogenic neoantigens that can be used for cancer vaccines. To facilitate the identification of IPA transcripts, we developed a computational method for de novo IPA identification based on RNA-seq data (PASRAM). PASRAM uses two features: a) soft-clipping information to derive polyA-spanning reads and b) the read coverage difference around a PAS to estimate IPA transcription levels. For tumor specificity, we applied these two features strictly. No polyA-spanning reads should be detected for a candidate PAS in any normal sample, and a high coverage-difference cutoff was set for tumor samples harboring a PAS but a much lower cutoff for all normal samples. We specifically avoided using expression fold changes between tumor and normal samples for tumor specificity, as high fold changes could still include IPA transcripts with moderate/high expression in normal samples or with low expression in tumor samples. Using two cell lines, we validated PASRAM from several key aspects, including overlap with two known APA databases, known features of nucleotide composition around PASs, and occurrence of polyA signals within 60 nt upstream of PASs. Importantly, PASRAM was able to detect more IPA transcripts with high confidence than other published methods. It should also be noted that, because read coverage of RNA-seq data in 3’UTR regions is lower than that of dedicated 3’ end sequencing data, any coverage-based computational method will miss IPA sites with background-level or low read counts.

As shown in Supplementary Fig. 3f, which uses A-seq2 data as the benchmark³¹, the false negative rate was high even for PASRAM. This issue can be addressed by increasing sequencing depth. Interestingly, it also means that the real reservoir of IPA-derived neoantigens might be much larger than our current predictions.

For intronic neoantigens based on RNA-seq data, there are currently two major concerns. The first is whether the intronic transcripts are real or just artifacts, and the second is whether the intronic transcripts can be translated into functional peptides. We addressed the first concern by assessing the overlap with PacBio long reads and by performing qPCR and 3’ RACE validation experiments to confirm the presence/expression of such transcripts (Fig. 2e-i and Supplementary Fig. 4). To validate translated peptides derived from IPA transcripts, the CPTAC proteome dataset was employed. We identified similar numbers of peptides for the three sources (IPA, RI and SM), consistent with published results^5,6,33. So far, only a handful of neoantigen peptides have been identified in different studies using CPTAC or HLA-pulldown proteome datasets, regardless of whether they derive from mutation-altered, splicing or intronic transcripts. The validation rates are therefore rather limited, considering that a large number of SMs, splicing events and IPA transcripts have been discovered. In particular, the low validation rate also holds for SMs, whose transcripts are likely translated into neoantigen peptides without undergoing nonsense-mediated decay (NMD). It should be noted that IPA transcripts, like SM but unlike RI transcripts, may not undergo NMD because, under the canonical rules, they lack a downstream exon junction complex⁴¹. Thus, it seems unlikely that the predicted peptides are not translated; rather, current proteomic datasets may be suboptimal for neoantigen validation. The low validation rates could also be partially due to false positives generated at the step of predicting peptide-HLA binding affinities.

Profiling IPA neoantigens in the 12 TCGA cancers identified widespread IPA neoantigens across different patients, and the predicted IPA neoantigens could be complementary rather than redundant to other published neoantigen sources such as SM- and RI/splicing-derived neoantigens. IPA-neoantigen loads could be associated with antitumor immune responses and cancer prognosis. However, we also demonstrated that if a neoantigen source contributes only minimally to the overall neoantigen load, its correlation with patient immune response or prognosis might not be accurate.

As different sources of neoantigens are complementary to each other, all neoantigens should be explored for effective immunotherapy. However, it is worth noting a significant advantage of IPA-derived neoantigens: their recurrence. Due to the extremely high inter-tumor heterogeneity of somatic mutations, recurrent mutation sites are rare. Although there are some hotspots, their recurrence rates are often limited to ~5% except for late stages of a limited number of cancer types⁴². Thus, cancer vaccines using SM-derived neoantigens will largely need to be personalized for each patient, which could be rather costly and time-consuming. By contrast, we identified dozens of recurrent IPA-derived neoantigens that were present in more than 10% of cancer patients for several cancer types. Vaccines combining multiple recurrent IPA-derived neoantigens could be applied to a wide range of patients, which could potentially lead to cost-effective neoantigen-based immunotherapy.

Regardless of whether they originate from SM, RI/splicing, RNA editing or IPA, clinically relevant neoantigen peptides should be presented by antigen-presenting cells to activate and expand T cells. Our ELISpot and tetramer staining results demonstrate that the predicted peptides are able to significantly activate and expand CD8+ T cells. Using an engineered HEK293T cell line with deletion of endogenous HLA genes and overexpression of HLA-A*11:01, we observed peptide-specific immunogenicity for our predicted neoantigens associated with HLA-A*11:01. These results demonstrate the potential clinical relevance of our predicted IPA peptides.

In summary, we developed and validated a computational pipeline for the identification of IPA neoantigens. Applying the pipeline to 5,654 TCGA RNA-seq samples, we identified widespread tumor-specific IPA events across various cancers. We also revealed that IPA could be another prospective source of neoantigens in addition to SM and splicing/RI, and that these different neoantigen sources could complement each other. Interestingly, a number of recurrent tumor-specific IPA transcripts/neoantigens were identified in several cancer types, potentially leading to cost-effective immunotherapy. Most importantly, we experimentally demonstrated that the IPA neoantigens were able to activate a cytotoxic T cell response for target tumor cell killing. Thus, IPA may serve as a novel source for the development of immunotherapeutic tumor vaccines.

Sequencing, clinical and other data. We used 5,654 samples of 12 cancers from TCGA and 11,383 normal samples from TCGA (adjacent normal), GTEx, and BLUEPRINT (Fig. 1a and Supplementary Table 2). Next-generation sequencing and clinical data for the 12 cancers in TCGA were downloaded from the dbGaP TCGA repository, including RNA-seq data and VCF files across 12 major cancer types (n = 5,654), RNA-seq data of all adjacent normal samples (n = 742), and clinical information for all patients in the 12 cancers. RNA-seq data of 9,777 normal tissues were downloaded from the GTEx Portal (www.gtexportal.org), while 864 RNA-seq datasets of normal blood cells were downloaded from the BLUEPRINT DCC Portal (http://dcc.blueprint-epigenome.eu/#/home). RNA-seq data of MCF7 and HEK293 cell lines were downloaded from GEO (GSE126365 and GSE56010, respectively). The GFF file of PacBio Iso-seq data for the MCF7 transcriptome was obtained from https://github.com/PacificBiosciences/DevNet/wiki/IsoSeq-Human-MCF7-Transcriptome. A-seq2 data of the HEK293 cell line were downloaded from the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/) under accession number SRP065825. Raw RNA-seq datasets of melanoma patients prior to ICI immunotherapy in two independent cohorts were obtained from GEO (GSE78220 and GSE91061). Proteomics mass spectrometry (MS) data for TCGA ovarian and breast cancers were downloaded from the CPTAC data portal (https://cptac-data-portal.georgetown.edu).

Identification of PASs by PASRAM. After mapping raw RNA-seq FASTQ reads onto hg19 with STAR (v2.7.1a)⁴³, uniquely mapped reads with at least 3 soft-clipped nucleotides were extracted from BAM files based on their CIGAR strings using SAMtools⁴⁴. Next, genomic locations of tentative cleavage sites were determined from the mapping information and CIGAR strings of the soft-clipped reads. Soft-clipped fragments with ≥ 3 A residues at the 3’ end or ≥ 3 T residues at the 5’ end, which cover part of both the RNA 3’ end and the polyA tail sequence, were defined as polyA-spanning reads. To illustrate this selection process, a number of representative cases are shown as IGV plots (Fig. 2e,j,k and Supplementary Figs. 4a,5,12). Any tentative cleavage site with fewer than two supporting polyA-spanning reads was removed. As the identified polyA-spanning reads resemble the polyA-supporting reads from the 3’ READS sequencing method, we applied the same method for PAS identification by grouping neighboring cleavage sites into a cluster¹⁶. In other words, when multiple cleavage sites were present, we clustered all cleavage sites within 24 nt of each other and designated the site with the highest read coverage as the representative cleavage site for the cluster. If the width of a cluster was larger than 24 nt, the cleavage sites located more than 24 nt away from the representative PAS were re-clustered. The clustering process was repeated until all tentative PASs were assigned.
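
The two steps described above, selecting polyA-spanning reads from soft clips and clustering cleavage sites within 24 nt, can be sketched as follows. This is an illustrative reading of the text rather than the PASRAM source: the function names are hypothetical, reads are represented only by their CIGAR string and sequence, cleavage sites are assumed to lie on one chromosome and strand, the A/T run is assumed to sit directly next to the aligned bases, and ties in read support are broken towards the smaller coordinate.

```python
import re

MIN_CLIP = 3   # minimum soft-clipped nucleotides (from the text)
MAX_DIST = 24  # clustering distance in nt (from the text)
MIN_READS = 2  # minimum polyA-spanning reads per cleavage site (from the text)


def polya_softclip(cigar, seq, min_len=MIN_CLIP):
    """Return "A" if the 3' soft clip begins with >= min_len A's, "T" if the
    5' soft clip ends with >= min_len T's, otherwise None."""
    trail = re.search(r"(\d+)S$", cigar)
    if trail:
        n = int(trail.group(1))
        if n >= min_len and seq[-n:][:min_len] == "A" * min_len:
            return "A"
    lead = re.match(r"(\d+)S", cigar)
    if lead:
        n = int(lead.group(1))
        if n >= min_len and seq[:n][-min_len:] == "T" * min_len:
            return "T"
    return None


def cluster_cleavage_sites(site_counts, max_dist=MAX_DIST, min_reads=MIN_READS):
    """Cluster cleavage sites and return sorted (representative, members) pairs.

    site_counts maps a cleavage position to its number of polyA-spanning reads.
    """
    sites = {p: c for p, c in site_counts.items() if c >= min_reads}
    clusters = []
    pending = [sorted(sites)]
    while pending:
        positions = pending.pop()
        if not positions:
            continue
        # Split into runs of neighbouring sites no more than max_dist apart.
        runs = [[positions[0]]]
        for pos in positions[1:]:
            if pos - runs[-1][-1] <= max_dist:
                runs[-1].append(pos)
            else:
                runs.append([pos])
        for run in runs:
            # Representative: highest support, smaller coordinate on ties.
            rep = max(run, key=lambda p: (sites[p], -p))
            members = [p for p in run if abs(p - rep) <= max_dist]
            leftover = [p for p in run if abs(p - rep) > max_dist]
            clusters.append((rep, members))
            if leftover:
                pending.append(leftover)  # re-cluster sites far from rep
    return sorted(clusters)
```

Each pass assigns at least the representative site to a cluster, so the re-clustering loop always terminates.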

Based on GENCODE annotation (V34lift37), we selected the PAS clusters located in introns as IPA sites. To ensure the authenticity of IPA transcripts, we filtered out those intronic PASs whose distances to any annotated CDS or upstream exon-intron junction were less than 24 nt. We also checked the upstream exon-intron junction reads and the coverage rate of the intronic part for each IPA transcript. An IPA with fewer than five reads supporting the upstream exon-intron junction, or with an intronic coverage rate below 90%, was discarded from further analysis. As background sequences for characterizing PASs, we utilized randomized genome-wide and intronic sites generated by BEDtools shuffle⁴⁵.
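As a rough illustration of these filters, the sketch below applies the three thresholds to one candidate IPA site. The function and argument names are hypothetical, and the coverage rate is assumed to be the fraction of intronic positions between the upstream junction and the IPA site that are covered by at least one read.

```python
def coverage_rate(depths):
    """Fraction of positions with at least one read (assumed definition)."""
    return sum(d > 0 for d in depths) / len(depths) if depths else 0.0

def passes_ipa_filters(dist_to_cds, dist_to_junction, junction_reads, depths,
                       min_dist=24, min_junction_reads=5, min_coverage=0.9):
    """Apply the distance, junction-read and intronic-coverage thresholds."""
    return (dist_to_cds >= min_dist
            and dist_to_junction >= min_dist
            and junction_reads >= min_junction_reads
            and coverage_rate(depths) >= min_coverage)
```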

Identification of tumor-specific IPA sites/transcripts. To estimate the expression level of an IPA transcript, we normalized the read counts to RPM (reads per million mapped reads). RPM_pAs denotes the normalized number of identified polyA-spanning reads. As the abundance of polyA-spanning reads crossing an IPA site is usually much lower than the number of mapped reads around the site, RPM_pAs may not accurately represent the expression of the IPA transcript. Thus, the RNA-seq reads in a window of 100 nt up- and downstream of the IPA site were also used. As a significant drop in read coverage is expected across a real PAS from the upstream to the downstream region, the difference in RPM between the upstream and downstream regions of the IPA site (RPM_UpS-DwS) was also determined. Finally, the expression level of an IPA transcript (i.e. RPM_IPA) was defined as the maximum of RPM_pAs and RPM_UpS-DwS.
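The definition of RPM_IPA can be written compactly as below. This is a sketch under stated assumptions: the read counts in the upstream and downstream windows and the total number of mapped reads are supplied by the caller, and the function names are illustrative.

```python
def rpm(count, total_mapped):
    """Reads per million mapped reads."""
    return count * 1e6 / total_mapped

def rpm_ipa(polya_reads, upstream_reads, downstream_reads, total_mapped):
    """RPM_IPA = max(RPM_pAs, RPM_UpS-DwS) for one IPA site in one sample."""
    rpm_pas = rpm(polya_reads, total_mapped)
    rpm_ups_dws = rpm(upstream_reads, total_mapped) - rpm(downstream_reads, total_mapped)
    return max(rpm_pas, rpm_ups_dws)
```

For a sample with 20 million mapped reads, 3 polyA-spanning reads, 60 upstream reads and 10 downstream reads, RPM_pAs is 0.15 and RPM_UpS-DwS is 2.5, so the coverage drop rather than the sparse polyA-spanning reads determines the estimate.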

To obtain tumor-specific IPA transcripts, we first eliminated any identified IPA sites that could also be found in GENCODE. We also removed those IPA sites for which polyA-spanning reads could be identified in any normal sample. Next, we estimated the expression of each remaining IPA transcript in all tumor and normal samples using the above-mentioned method. For tumor specificity, we used two different expression cutoff values: for an IPA transcript to be tumor-specific in a tumor sample, its expression level (RPM_IPA) in that sample had to be greater than or equal to one, while its expression in every normal sample could not exceed 0.1. We specifically avoided using expression fold changes between tumor and normal samples to define tumor specificity, because high fold changes could still include IPA transcripts with moderate or high expression in normal samples or those with low expression in tumor samples. To avoid potential artifacts, we further excluded any IPA transcript that was tumor-specific in only one of the 5,654 tumor samples.
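The two-cutoff rule can be illustrated with the following sketch for a single IPA transcript. It assumes that sites annotated in GENCODE or supported by polyA-spanning reads in normal samples have already been removed, and the dictionary-based inputs and function name are hypothetical.

```python
def tumor_specific_samples(tumor_rpm, normal_rpm,
                           tumor_min=1.0, normal_max=0.1, min_tumor_samples=2):
    """Return the tumor samples in which the transcript is tumor-specific.

    tumor_rpm / normal_rpm map sample IDs to RPM_IPA values.
    """
    # Any normal sample above the cutoff disqualifies the transcript.
    if any(value > normal_max for value in normal_rpm.values()):
        return []
    hits = sorted(s for s, value in tumor_rpm.items() if value >= tumor_min)
    # Transcripts seen in a single tumor sample are treated as possible artifacts.
    return hits if len(hits) >= min_tumor_samples else []
```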

Determination of patient HLA alleles. The 4-digit HLA alleles of each patient in TCGA were identified by seq2HLA²⁶ based on the RNA-seq data. The expression levels of HLA-A, -B and -C were also calculated using seq2HLA by considering the variability in the MHC class I locus. When correlating HLA expression with the neoantigen load, the pooled expression levels of HLA-A, -B and -C were used. We also downloaded HLA typing results for 4,333 patients, predicted by POLYSOLVER⁴⁶ from whole-exome sequencing data of the 12 TCGA cancers. We observed that 94% of the 4-digit HLA alleles predicted by seq2HLA overlapped with those from POLYSOLVER, and assumed no significant difference in HLA typing between these two tools.

Generation of peptide fragments from the three sources. The custom script from our previous work⁴⁷ was employed to predict the genomic loci of the retained intron transcripts for 5,654 cancer samples and 11,383 normal samples of different tissues. We applied the following filtering criteria to obtain a set of reliable RI transcripts: a) at least 5 exon-intron junction reads on both sides of an RI; b) a minimal retention splicing index of 0.1 for any RI; c) at least 90% of the genomic sequence of any RI covered with RNA-seq reads.

Criteria similar to those for IPA transcripts were imposed to obtain tumor-specific RI transcripts. All the tumor-specific RI transcripts were further subjected to analysis of NMD according to the canonical rules⁴¹. If the first intron is retained, a premature termination codon (PTC) found 200 bp beyond the downstream exon junction complex (EJC) would lead to NMD, while for a retained intron located between the 2nd and the penultimate intron, the transcript would undergo decay if a PTC is identified 50 bp beyond the downstream EJC. If the last intron is retained, the RI transcript would escape from NMD, like IPA transcripts, as no downstream EJC exists to trigger NMD⁴¹. Only NMD-escaped RI transcripts were considered further.
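The position-dependent NMD rules above can be expressed as a small decision function. This is only one reading of the text: the distance argument is assumed to be the separation in nucleotides between the PTC and the downstream EJC, "beyond" is interpreted as a distance strictly greater than the threshold, and the function name and 1-based intron indexing are illustrative.

```python
def escapes_nmd(intron_index, n_introns, ptc_to_ejc_nt):
    """Return True if a retained-intron transcript is predicted to escape NMD.

    intron_index is 1-based; ptc_to_ejc_nt is the assumed PTC-to-EJC distance.
    """
    # Last intron retained: no downstream EJC remains to trigger NMD.
    if intron_index == n_introns:
        return True
    # First intron uses the 200-nt threshold, internal introns the 50-nt one.
    threshold = 200 if intron_index == 1 else 50
    return ptc_to_ejc_nt <= threshold
```

Note that the last-intron check is applied first, so a gene with a single intron is treated as escaping NMD under this sketch.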

IPA/RI-derived cancer-specific peptides of 9 amino acids, each containing at least one amino acid from an intron, were generated by extending open reading frames (ORFs) into the intronic sequence until reaching an in-frame stop codon or the IPA site, similar to a previous approach⁵. On average, fewer than 0.5% of IPA/RI-derived peptides contained only one amino acid from the intronic region. For SM, single nucleotide variants of the 12 TCGA cancers were derived from the downloaded VCF files. The peptide FASTA sequences were built using one to eight flanking amino acids on each side of the mutant amino acid. If the mutation was at the beginning or end of a transcript, the succeeding or preceding 8 amino acids, respectively, were taken to build the peptide FASTA sequence.
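A minimal sketch of the two peptide-generation steps is given below. It assumes the exonic and intronic parts of the ORF have already been translated (with the intronic part ending at the in-frame stop codon or the IPA site), and the function names are hypothetical.

```python
def intron_containing_kmers(exonic_aa, intronic_aa, k=9):
    """All k-mers of the extended ORF with at least one intronic residue."""
    full = exonic_aa + intronic_aa
    boundary = len(exonic_aa)  # index of the first intron-encoded residue
    # A window starting at boundary - k + 1 ends exactly on that residue.
    first_start = max(0, boundary - k + 1)
    return [full[s:s + k] for s in range(first_start, len(full) - k + 1)]

def mutation_window(protein, index, alt_aa, flank=8):
    """Mutant residue plus up to `flank` residues on each side (0-based index)."""
    mutant = protein[:index] + alt_aa + protein[index + 1:]
    return mutant[max(0, index - flank):index + flank + 1]
```

With a 10-residue exonic part and a 3-residue intronic part, the first function yields three 9-mers, containing one, two and three intron-derived residues respectively. The window returned by the second function is truncated on one side when the mutation lies near the start or end of the protein.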

Prediction of neoantigens from IPA, RI and SM. The binding affinities between cancer-specific peptides and the corresponding HLA alleles were estimated by NetMHCpan v4.0²⁷. Peptides with a rank percentage ≤ 2% were considered putative HLA-presented neoantigens. We then filtered out the neoantigens that were found in the normal human proteome (UniProt), and eliminated the neoantigens with fewer than five supporting RNA-seq reads for any one of the nine amino acids, to obtain a final list of neoantigens.
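The three filters can be chained as in the sketch below. The percentile ranks are assumed to have been computed beforehand with NetMHCpan (which is not called here), presence in the normal proteome is assumed to mean an exact substring match against a reference protein sequence, and the record layout and function name are hypothetical.

```python
def filter_neoantigens(candidates, normal_proteome, max_rank=2.0, min_reads=5):
    """Keep peptides that pass the binding, proteome and read-support filters.

    Each candidate is a dict with keys "peptide", "rank" (percentile rank)
    and "reads_per_residue" (supporting reads for each amino acid).
    """
    kept = []
    for cand in candidates:
        if cand["rank"] > max_rank:
            continue  # not a predicted binder
        if any(cand["peptide"] in protein for protein in normal_proteome):
            continue  # also present in the normal proteome
        if min(cand["reads_per_residue"]) < min_reads:
            continue  # at least one residue lacks sufficient read support
        kept.append(cand["peptide"])
    return kept
```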

Tumor-specific recurrent events. Recurrent transcripts or neoantigens in a cancer cohort were defined as those identified tumor-specific events with exactly the same genomic locations or amino acid sequences, which were found in more than 10% of the tumor samples in the TCGA cancer cohort.
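As an illustration of this definition, the following sketch counts in how many samples of a cohort each event occurs and keeps those found in more than 10% of samples. Events are represented by hashable keys (a genomic location or a peptide sequence), and the names used are assumptions made for the example.

```python
from collections import Counter

def recurrent_events(events_by_sample, cohort_size, min_fraction=0.10):
    """Map each recurrent event to the fraction of cohort samples carrying it."""
    # set() ensures an event is counted at most once per sample.
    counts = Counter(e for events in events_by_sample.values() for e in set(events))
    return {e: n / cohort_size for e, n in counts.items()
            if n / cohort_size > min_fraction}
```

The cohort size is passed explicitly because samples without any tumor-specific event still count toward the denominator.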

A-seq2 data analysis. A-seq2 data were processed as described³¹ with the software in the GitLab repository (https://git.scicore.unibas.ch/zavolan_public/A-seq2-processing), and the PASs determined from A-seq2 reads were used to benchmark PASRAM performance along with two published methods capable of de novo PAS identification, namely DaPars²³ and APAtrap²⁴.

Identification of expressed peptides from CPTAC. Utilizing proteomics mass spectrometry (MS) data for TCGA ovarian and breast cancers generated by CPTAC³², we validated the predicted neoantigen peptides. Comet sequence database search tool⁴⁸ was applied to identify any polypeptides by searching CPTAC MS2 spectra against a reference database containing both the human proteome and the predicted amino acid sequences derived from SM-, RI- and IPA-transcripts. The Percolator algorithm was employed to score the identified peptides⁴⁹. Overlapping these MS-identified peptide candidates with the predicted neoantigens, CPTAC-supported peptides were identified and visualized by MaxQuant⁵⁰.

RNA extraction, RT-qPCR and Sanger sequencing. Total RNA was extracted using RNeasy mini kit (Qiagen) and treated with DNase I (Qiagen) on column. cDNA was synthesized using the Advantage RT-for-PCR kit (Takara), followed by qPCR analysis. The first strand synthesis of cDNA was performed using oligo dT. The relative expression levels of target genes were normalized to those of the housekeeping gene (β-actin). The purified PCR products were sent for direct sequencing and visualized by SnapGene Viewer. Sequences of primers were listed in Supplementary Table 5. We designed four pairs of primers: crossing the exon-intron junction upstream of an IPA site (IPA1), within the IPA intronic part (IPA2), in the intronic part downstream of the IPA (Long), and crossing the exon-exon junction of the intron harboring the IPA site (Short) (Supplementary Fig. 4b, left panel).

3’ RACE. cDNA of MCF7 cells was synthesized using an anchored oligo dT primer to anchor the poly(A) ends. The cDNA end was then amplified by PCR using a gene-specific forward primer and an anchor reverse primer. Sequences of primers are listed in Supplementary Table 6. PCR products were separated on an agarose gel and purified with the Gel DNA Extraction mini kit (Vazyme). Purified products were sent for Sanger sequencing. The 3’ RACE was performed in triplicate.

Quantification of immune cell infiltration. Estimates of immune cell infiltration of TCGA primary tumors were downloaded from the TIMER web server⁵¹ and cross-referenced to our sample classification as high or low neoantigen group. Wilcoxon rank sum test was used to evaluate the statistical difference in the levels of immune cell infiltrations between high and low neoantigen groups. Cytolytic activity was computed as the geometric mean of expressions of GZMA and PRF1 (in RPKM).
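The cytolytic activity score is a two-gene geometric mean, which can be written as follows (a minimal sketch; the function name is illustrative and the inputs are assumed to be RPKM values).

```python
import math

def cytolytic_activity(gzma_rpkm, prf1_rpkm):
    """Geometric mean of GZMA and PRF1 expression (both in RPKM)."""
    return math.sqrt(gzma_rpkm * prf1_rpkm)
```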

Correlation between neoantigen loads and fractions of TIICs across tumor samples. The predicted fractions of tumor-infiltrating immune cells (TIICs) in each TCGA tumor sample were downloaded from the publication³⁴. Next, the neoantigen loads derived from the three sources (RI, SM and IPA) were correlated with the fraction of each TIIC for each cancer type.

In vitro activation and expansion of T cells. Monocytes were isolated from peripheral blood mononuclear cells (PBMC) from HLA-A*11 donors by allowing the cells to adhere to the surface of the tissue culture flask in RPMI containing 1% human serum (Sigma-Aldrich) for 2 h at 37 °C in a CO₂ incubator. CD8 + T cells were isolated from the same donor’s PBMC using the EasySep™ Human CD8 + T Cell Isolation Kit (STEMCELL Technologies) following the manufacturer’s protocol. Monocytes were differentiated into mature dendritic cells (DCs) using the ImmunoCult™ DC Culture Kit (STEMCELL Technologies) following the manufacturer’s instructions. Matured DCs were dissociated and co-cultured with the isolated autologous CD8 + T cells in the presence of the peptides encoding IPA-neoantigens (GenScript; Supplementary Table 4) at a concentration of 2 µg per ml per peptide for 14 days. As a positive control, CD8 + T cells were co-cultured with matured DCs pulsed with the Epstein-Barr virus (EBV) peptide (IVTDFSVIK, GenScript) at a concentration of 2 µg per ml and cultured for 14 days.

ELISpot assay. ELISpot plates pre-coated with antibodies specific for IFNγ (Human IFNγ ELISpot PLUS kit, Mabtech) were washed with PBS and blocked with PBS containing 0.5% fetal calf serum (Thermofisher Scientific) for 2 h at room temperature. A total of 2 × 10⁴ expanded T cells were stimulated with 2 × 10³ K562 cells expressing HLA-A*11:01 in the presence or absence (control) of peptides encoding IPA neoantigens for 24 h. All tests were performed in duplicate. Spots were visualized with a biotin-conjugated anti-IFNγ antibody followed by incubation with Streptavidin-ALP and BCIP/NBT-plus substrate following the manufacturer’s protocol. Plates were scanned and analyzed using the CTL ImmunoSpot® S6 ENTRY Analyzer.

Tetramer staining. Flex-T™ HLA-A*11:01 monomer UVX (BioLegend) was irradiated with UV in the presence of the peptides encoding IPA-neoantigens following the manufacturer’s protocol. The monomers were then assembled into tetramers with PE-conjugated streptavidin. The expanded T cells were subsequently stained with the peptides-loaded tetramers and APC anti-CD8 antibody (BioLegend). The samples were acquired on the BD LSR Fortessa™ Flow Cytometer (Becton-Dickinson) and analyzed using the FlowJo™ software.

Cytolytic reporter assay. HEK293T cells (ATCC) were engineered to lack HLA through CRISPR-mediated deletion of the endogenous HLA genes. The HLA-A*11:01 transgene was then overexpressed in the HLA-deleted HEK293T cells together with an infrared fluorescent protein (IFP)-based granzyme reporter. The granzyme reporter was constructed by replacing the caspase cleavage site of the fluorogenic protease reporter with a granzyme-specific cleavage sequence³⁸. The HLA-A*11:01 reporter cells were co-cultured with the IPA neoantigen-expanded T cells in the absence (negative control) or presence of the IPA neoantigen peptide(s) for 6 h at 37 °C in a CO₂ incubator to show the peptide-specific killing (Fig. 1d). As a positive control, the reporter cells were co-cultured with EBV antigen-expanded T cells in the presence or absence of the EBV antigen peptide. The cells were then acquired on the BD LSR Fortessa™ Flow Cytometer (Becton-Dickinson) and analyzed using the FlowJo™ software.

Data availability

Data were downloaded from various publicly available websites. The detailed downloading information is described in Methods.

Code availability

Pipeline codes are available at GitHub https://github.com/RENXI-NUS/IPA-neoantigen-prediction.

Acknowledgments

This work was supported by Singapore Ministry of Education under its Research Centres of Excellence initiative and by RNA Biology Center at the Cancer Science Institute of Singapore, National University of Singapore, as part of funding under the Singapore Ministry of Education's Tier 3 grants [grant number MOE2014-T3-1-006]. XR and TM are supported by NUS research scholarship from the Singapore Ministry of Education. GC is supported by National Research Foundation (NRF) Singapore, under the NRF fellowship [NRF-NRFF12-2020-0007]. The computational work for this article was mainly supported by National Supercomputing Centre, Singapore (https://www.nscc.sg).

Author contributions

H.Y. and X.R. conceived the study. H.Y., X.R. and G.C. wrote the manuscript. X.R. initiated the study. H.Y. supervised the study. X.R., B.Z. and J.L. designed the pipeline and performed bioinformatics analyses with the help of T.K.T., Y.L., O.A., C.S.W., and Y.L. Validation of transcripts and experimental design were performed by Y.Y.S., S.Y.T., L.W.D., M.L., S.J.T., Y.H.H., W.C. and L.L.C. G.C. designed and supervised the experiments for validating antigen immunogenicity. T.M., B.L., B.H.T. and L.J. performed the experimental validation of antigen immunogenicity. All authors have read and approved the final manuscript.

Competing interests

The authors declare no competing interests.

Supplementary information

Supplementary figures and tables

Matsushita, H. et al. Cancer exome analysis reveals a T-cell-dependent mechanism of cancer immunoediting. Nature 482, 400–404 (2012).
Khodadoust, M.S. et al. Antigen presentation profiling reveals recognition of lymphoma immunoglobulin neoantigens. Nature 543, 723-727 (2017).
Ott, P.A. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547, 217–221 (2017).
Sahin, U. & Türeci, Ö. Personalized vaccines for cancer immunotherapy. Science 359, 1355-1360 (2018).
Smart, A.C. et al. Intron retention is a source of neoepitopes in cancer. Nat. Biotechnol. 36, 1056–1058 (2018).
Wang, T.Y. et al. A pan-cancer transcriptome analysis of exitron splicing identifies novel cancer driver genes and neoepitopes. Mol. Cell 81, 2246-2260 (2021).
Lu, S.X. et al. Pharmacologic modulation of RNA splicing enhances anti-tumor immunity. Cell 184, 1–16 (2021).
Zhang, M. et al. RNA editing derived epitopes function as cancer antigens to elicit immune responses. Nat. Commun. 9, 3919 (2018).
Berkovits, B.D. & Mayr, C. Alternative 3′ UTRs act as scaffolds to regulate membrane protein localization. Nature 522, 363–7 (2015).
Tian, B. & Manley, J.L. Alternative polyadenylation of mRNA precursors. Nat Rev Mol Cell Biol. 18, 18–30 (2016).
Mayr, C. & Bartel, D.P. Widespread shortening of 3′ UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673-684 (2009).
Xiang, Y. et al. Comprehensive characterization of alternative polyadenylation in human cancer. J. Natl. Cancer Inst. 110, djx223 (2018).
Tian, B. et al. Widespread mRNA polyadenylation events in introns indicate dynamic interplay between polyadenylation and splicing. Genome Res. 17, 156–165 (2007).
Lee, S.H. et al. Widespread intronic polyadenylation inactivates tumour suppressor genes in Leukaemia. Nature 561, 127-131 (2018).
Gruber A.J. et al. A comprehensive analysis of 3' end sequencing data sets reveals novel polyadenylation signals and the repressive role of heterogeneous ribonucleoprotein C on cleavage and polyadenylation. Genome Res. 26, 1145-1159 (2016).
Hoque, M. et al. Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing. Nat. Methods 10, 133–139 (2013).
Jenal, M. et al. The poly(A)-binding protein nuclear 1 suppresses alternative cleavage and polyadenylation sites. Cell 149, 538–553 (2012).
Martin, G. et al. 3' end sequencing library preparation with A-seq2. J. Vis. Exp. 128, 56129 (2017).
Bogard, N. et al. Deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 19-106 (2019).
Li, Z. et al. DeeReCT-APA: Prediction of alternative polyadenylation site usage through deep learning. Genom. Proteo. Bioinformatics on-line (2021).
Ha, K. et al. QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biol. 19, 1-18 (2018).
Wang, R. & Tian, B. APAlyzer: a bioinformatic package for analysis of alternative polyadenylation isoforms. Bioinformatics 36, 3907-3909 (2020).
Xia, Z. et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types. Nat. Commun. 5, 1-13 (2014).
Ye, C. et al. APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics 34, 1841-1849 (2018).
Smibert, P. et al. Global patterns of tissue-specific alternative polyadenylation in Drosophila. Cell Rep. 1, 277-289 (2012).
Boegel, S. et al. In silico HLA typing using standard RNA-Seq sequence reads. Methods Mol. Biol. 1310, 247-258 (2015).
Jurtz, V. et al. NetMHCpan-4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J. Immunol. 199, 3360-3368 (2017).
Herrmann, C.J. et al. PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3′ end sequencing. Nucleic acids Res. 48(D1), D174-D179. (2020).
Wang, R. et al. A compendium of conserved cleavage and polyadenylation events in mammalian genes. Genome Res. 28, 1427-1441 (2018).
Tian, B, et al. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 33, 201-212 (2005).
Gruber, A.R. et al. Global 3’ UTR shortening has a limited effect on protein abundance in proliferating T cells. Nat. Commun. 5, 5465 (2014).
Ellis, M.J. et al. Clinical proteomic tumor analysis consortium (CPTAC) connecting genomic alterations to cancer biology with proteomics: The NCI clinical proteomic tumor analysis consortium. Cancer Discov. 3, 1108–1112 (2013).
Kahles, A. et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell 34, 211-224 (2018).
Thorsson, V. et al. The immune landscape of cancer. Immunity 48, 812-830 (2018).
Ng, K.W. et al. Somatic mutation-associated T follicular helper cell elevation in lung adenocarcinoma. Oncoimmunol. 7, e1504728 (2018).
Hugo, W. et al. Genomic and transcriptomic features of response to anti-PD-1 therapy in metastatic melanoma. Cell 165, 35-44 (2016).
Riaz, N. et al. Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell 171, 934-949 (2017).
Kula, T. et al. T-Scan: a genome-wide method for the systematic discovery of T cell epitopes. Cell 178, 1016-1028 (2019).
Keskin, D.B. et al. Neoantigen vaccine generates intratumoral T cell responses in phase Ib glioblastoma trial. Nature 565, 234-239 (2019).
Cai Z. et al. Personalized neoantigen vaccine prevents postoperative recurrence in hepatocellular carcinoma patients with vascular invasion. Mol. Cancer 20, 164 (2021).
Lindeboom, R.G.H. et al. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat. Genet. 48, 1112-1118 (2016).
Lang F, Schrörs B, Löwer M, Türeci Ö, Sahin U. Identification of neoantigens for individualized therapeutic cancer vaccines. Nat Rev Drug Discov. On-line (2022).
Dobin A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21 (2013).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Quinlan, R. et al. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842 (2010).
Shukla, S. A. et al. Comprehensive analysis of cancer-associated somatic mutations in class I HLA genes. Nat. Biotechnol. 33, 1152-1158 (2015).
Madan, V. et al. Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome. Nat. Commun. 6, 6042 (2015).
Eng, J.K. et al. Comet: an open‐source MS/MS sequence database search tool. Proteomics 13, 22–24 (2013).
Käll, L. et al. Semi-supervised learning for peptide identification from shotgun proteomics datasets, Nat. Methods 4, 923-925 (2007).
Tyanova, S. et al. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protocols 11, 2301 (2016).
Li, T. et al. TIMER: a web server for comprehensive analysis of tumor-infiltrating immune cells. Cancer Res. 77, e108–e110 (2017).

