Sequencing, clinical and other data. We utilized 5654 samples of 12 cancers from TCGA and 11,383 normal tissues from TCGA (adjacent normal), GTEx, and BLUEPRINT (Fig. 1a and Supplementary Table 2). Next-generation sequencing and clinical data for the 12 cancers in TCGA were downloaded from dbGaP TCGA repository, including RNA-seq data and VCF files across 12 major cancer types (n = 5,654), RNA-seq data of all adjacent normal samples (n = 742), and clinical information for all patients in the 12 cancers. RNA-seq data of 9,777 normal tissues were downloaded from GTEx Portal (www.gtexportal.org), while 864 RNA-seq data of normal blood cells were downloaded from BLUEPRINT DCC Portal (http://dcc.blueprint-epigenome.eu/#/home). RNA-seq data of MCF7 and HEK293 cell lines were downloaded from GEO (GSE126365 and GSE56010, respectively). The GFF file of PacBio Iso-seq data for MCF7 transcriptome was obtained from https://github.com/PacificBiosciences/DevNet/wiki/IsoSeq-Human-MCF7-Transcriptome. A-seq2 data of HEK293 cell line was downloaded from the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/) under accession number SRP065825. Raw RNA-seq datasets of melanoma patients prior to ICI immunotherapy in two independent cohorts were obtained from GEO (GSE78220 and GSE91061). Proteomics mass spectrometry (MS) data for TCGA ovarian and breast cancers were downloaded from the CPTAC data portal (https://cptac-data-portal.georgetown.edu).
Identification of PASs by PASRAM. After mapping of raw RNA-seq FASTQ reads onto hg19 with STAR (v2.7.1a)43, the uniquely mapped reads with at least 3 soft-clipped nucleotides were extracted based on CIGAR string from bam files using SAMtools44. Next, genomic locations of tentative cleavage sites were determined by utilizing mapping information as well as CIGAR string of soft-clipped reads. The soft-clipped fragments with ≥ 3 A residues at 3’ end or ≥ 3 T residues at 5’ end, which covered a part of both 3’end RNA and polyA tail sequences, were defined as polyA-spanning reads. To elucidate this selection process of polyA-spanning reads, a number of representative cases were shown with IGV plots (Fig. 2e,j,k and Supplementary Figs. 4a,5,12). Any tentative cleavage sites with less than two supporting polyA-spanning reads were removed. As the identified polyA spanning reads resemble the polyA supporting reads from the 3’ READs sequencing method, we applied the same method for PAS identification by grouping the neighboring cleavage sites into a cluster16. In other words, when multiple cleavage sites were presented, we clustered all cleavage sites within 24 nt together and designated the site with the highest read coverage as the representative cleavage site for the cluster. If the width of a cluster was larger than 24 nt, the cleavage sites located more than 24 nt away from the representative PAS were re-clustered. The process of clustering was repeated until all tentative PASs were assigned.
Based on GENCODE annotation (V34lift37), we selected the PAS clusters located in introns as IPA sites. To ensure that the authenticity of IPA transcripts, we filtered out those intronic PASs whose distances to any annotated CDSs or upstream exon-intron junctions were less than 24 nt. We also checked the upstream exon-intron junction reads and coverage rate of the intronic part for each IPA transcript. For an IPA with less than five reads supporting the upstream exon-intron junction or intronic coverage rate below 90%, the IPA was discarded for further analysis. As the background sequences for charactering PASs, we utilized randomized genome-wide and intronic sites generated by BEDtools shuffle45.
Identification of tumor-specific IPA sites/transcripts. In order to estimate the expression level of an IPA transcript, we normalized the read counts to RPM (Reads Per Million mapped reads). RPMpAs was denoted as the number of identified polyA-spanning reads. As the abundancy of polyA-spanning reads crossing an IPA site is usually much lower than the number of mapped reads around the site, RPMpAs may not accurately represent the expression of the IPA transcript. Thus, the RNA-seq reads up- and down-stream of the IPA site in a window of 100 nt were determined were also used. As a significant drop in read coverage was expected across a real PAS from the upstream to the downstream region, the difference (RPMUpS-DwS) in the RPM between the upstream and downstream of the IPA site was also determined. Finally, the expression level of an IPA transcript (i.e. RPMIPA) was defined as the maximum of RPMpAs and RPMUpS-DwS.
To obtain tumor-specific IPA transcripts, we first eliminated any identified IPA sites that could be also found in GENCODE. We also removed those IPA sites for which polyA-spanning reads could be identified in any of normal samples. Next, we estimated the expression for each of the remaining IPA transcripts in all tumor and normal samples using the above-mentioned method. For tumor specificity, we used two different expression cutoff values. In other words, to be tumor-specific for an IPA transcript in a tumor sample, the expression level (RPMIPA) should be greater than and equal to one, while the IPA transcript expression in any of normal samples should not be greater than 0.1. We specifically avoided using expression fold changes between tumor and normal samples for tumor specificity, because high fold changes could still include the IPA transcripts of moderate/high expression in normal samples or those of low expression in tumor samples. We further excluded any IPA transcripts if tumor specificity could be found only in a single one of the 5654 tumor samples to avoid potential artifacts.
Determination of patient HLA alleles. The 4-digit HLA alleles of each patient in TCGA were identified by Seq2HLA26 based on the RNA-seq data. The expression levels of HLA-A, -B and -C were also calculated using Seq2HLA by considering the variability in the MHC class I locus. When correlating the HLA expression with the neoantigen load, the pool of HLA-A, -B and –C expression levels was used. We also downloaded HLA typing results for 4,333 patients predicted based on POLYSOLVER46 based on whole-exome sequencing data of the 12 TCGA cancers. We observed that 94% of the 4-digit HLA alleles predicted by seq2HLA overlapped with those from POLYSOLVER, and assumed no significant difference in HLA typing between these two tools.
Generation of peptide fragments from the three resources. The custom script from our previous work47 was employed to predict the genomic loci of the retained intron transcripts for 5,654 cancer samples and 11,383 normal samples of different tissues. We applied the following filtering criteria to obtain a set of reliable RI transcripts: a) at least 5 exon-intron junction reads on both sides of a RI; b) minimal retention splicing index of 0.1 for any RIs; c) at least 90% of genomic sequences covered with RNA-seq reads for any RIs.
Similar criteria to IPA transcripts were imposed to obtain tumor-specific RI ones. All the tumor-specific RI transcripts were further subjected to analysis of NMD according to the canonical rules41. If the first intron is retained, a premature termination codon (PTC) found 200bps beyond the downstream exon junction complex (EJC) would lead to NMD, while for a retained intron located from the 2nd to the penultimate intron, retained transcripts would undergo decay if a PTC is identified 50bps beyond the downstream EJC. For the last retained intron, the RI transcript would escape from NMD, like IPA transcripts, as no downstream exon junction complex exists to trigger NMD41. Only NMD-escaped RI transcripts were considered further.
With at least one amino acid from introns, the IPA/RI-derived cancer-specific peptides of 9 amino acids were generated by extending open reading frames (ORFs) into intronic sequence until reaching an in-frame stop codon or IPA site similar to the previous approaches5. On average, less than 0.5% of IPA/RI-derived peptides contained only one amino acid from the intron region. For SM, single nucleotide variants of the 12 TCGA cancers were derived from the downloaded VCF files. The peptide FASTA sequences were built using one to eight flanking amino acids on each side of the mutant amino acid. If the mutation was at the beginning or end of a transcript, the preceding or succeeding 8 amino acids were then taken to build the peptide FASTA sequence.
Prediction of neoantigens from IPA, RI and SM. The binding affinities between cancer-specific peptides and corresponding HLA alleles were estimated by NetMHCpan v4.027. Peptides with rank percentage ≤ 2% were considered as putative HLA-presenting neoantigens. Then we filtered out the neoantigens that were found in normal human proteome (UniProt), and eliminated the neoantigens with less than five RNA-seq supporting reads for any one of the nine amino acids to obtain a final list of neoantigens.
Tumor-specific recurrent events. Recurrent transcripts or neoantigens in a cancer cohort were defined as those identified tumor-specific events with exactly the same genomic locations or amino acid sequences, which were found in more than 10% of the tumor samples in the TCGA cancer cohort.
A-Seq2 data analysis. A-seq2 data were processed as described31 with the software in gitlab repository (https://git.scicore.unibas.ch/zavolan_public/A-seq2-processing), and the PASs determined from A-seq2 reads were used to benchmark PASRAM performance along with two published methods capable of de novo PAS identification i.e. DaPars23 and APAtrap24.
Identification of expressed peptides from CPTAC. Utilizing proteomics mass spectrometry (MS) data for TCGA ovarian and breast cancers generated by CPTAC32, we validated the predicted neoantigen peptides. Comet sequence database search tool48 was applied to identify any polypeptides by searching CPTAC MS2 spectra against a reference database containing both the human proteome and the predicted amino acid sequences derived from SM-, RI- and IPA-transcripts. The Percolator algorithm was employed to score the identified peptides49. Overlapping these MS-identified peptide candidates with the predicted neoantigens, CPTAC-supported peptides were identified and visualized by MaxQuant50.
RNA extraction, RT-qPCR and Sanger sequencing. Total RNA was extracted using RNeasy mini kit (Qiagen) and treated with DNase I (Qiagen) on column. cDNA was synthesized using the Advantage RT-for-PCR kit (Takara), followed by qPCR analysis. The first strand synthesis of cDNA was performed using oligo dT. The relative expression levels of target genes were normalized to those of the housekeeping gene (β-actin). The purified PCR products were sent for direct sequencing and visualized by SnapGene Viewer. Sequences of primers were listed in Supplementary Table 5. We designed four pairs of primers: crossing the exon-intron junction upstream of an IPA site (IPA1), within the IPA intronic part (IPA2), in the intronic part downstream of the IPA (Long), and crossing the exon-exon junction of the intron harboring the IPA site (Short) (Supplementary Fig. 4b, left panel).
3’ RACE. cDNA of MCF7 cells was synthesized using anchored oligo dT, to anchor its poly(A) ends. cDNA end was then amplified by PCR using gene specific forward primer and anchor reverse. Sequences of primers were listed in Supplementary Table 6. PCR products were separated by agarose gel and purified by Gel DNA Exaction mini kit (Vazyme). Purified products were sent for Sanger sequencing. Triplicates were performed for the 3’ RACE.
Quantification of immune cell infiltration. Estimates of immune cell infiltration of TCGA primary tumors were downloaded from the TIMER web server51 and cross-referenced to our sample classification as high or low neoantigen group. Wilcoxon rank sum test was used to evaluate the statistical difference in the levels of immune cell infiltrations between high and low neoantigen groups. Cytolytic activity was computed as the geometric mean of expressions of GZMA and PRF1 (in RPKM).
Correlation between neoantigen loads and fractions of TIICs across tumor samples. The predicted fractions of tumor infiltrated immune cells (TIICs) in each TCGA tumor samples were downloaded from the publication34. Next the neoantigen loads derived from the three sources (RI, SM an IPA) were correlated with the fraction of each TIIC for each cancer type.
In vitro activation and expansion of T cells. Monocytes were isolated from peripheral blood mononuclear cells (PBMC) from HLA-A*11 donors by allowing the cells to adhere to the surface of the tissue culture flask in RPMI containing 1% human serum (Sigma-Aldrich) for 2h at 37oC in a CO2 incubator. CD8 + T cells were isolated from the same donor’s PBMC using EasySep™ Human CD8 + T Cell Isolation Kit (STEMCELL Technologies) following the manufacture’s protocol. Monocytes were differentiated to mature dendritic cells (DCs) using the ImmunoCult™ DC Culture Kit (STEMCELL Technologies) following the manufacturer’s instructions. Matured DCs were dissociated and co-cultured with the isolated autologous CD8 + T cells in the presence of the peptides encoding IPA-neoantigens (GenScript; Supplementary Table 4) at a concentration of 2µg per ml per peptide for 14 days. As a positive control, CD8 + T cells were co-cultured with matured DCs pulsed with the Epstein-Barr virus (EBV) peptide (IVTDFSVIK, GenScript) at a concentration of 2µg per ml and cultured for 14 days.
ELISpot assay. ELISpot plates pre-coated with antibodies specific for IFNγ (Human IFNγ ELISpot PLUS kit, Mabtech) were washed with PBS and blocked with PBS containing 0.5% fetal calf serum (Thermofisher Scientific) for 2h at room temperature. 2 x 104 expanded T cells were stimulated with 2 x 103 K562 expressing HLA-A*11:01 in the presence of peptides encoding IPA neoantigens or absence (control) for 24h. All tests were performed in duplicates. Spots were visualized with a biotin-conjugated anti-IFNγ antibody followed by incubation with Streptavidin-ALP and BCIP/NBT-plus substrate following the manufacturer’s protocol. Plates were scanned and analyzed using CTL ImmunoSpot® S6 ENTRY Analyzer.
Tetramer staining. Flex-T™ HLA-A*11:01 monomer UVX (BioLegend) was irradiated with UV in the presence of the peptides encoding IPA-neoantigens following the manufacturer’s protocol. The monomers were then assembled into tetramers with PE-conjugated streptavidin. The expanded T cells were subsequently stained with the peptides-loaded tetramers and APC anti-CD8 antibody (BioLegend). The samples were acquired on the BD LSR Fortessa™ Flow Cytometer (Becton-Dickinson) and analyzed using the FlowJo™ software.
Cytolytic reporter assay. HEK293T cells (ATCC) were engineered to lack HLA through CRISPR-mediated deletion of the endogenous HLA genes. The HLA-A*11:01 transgene was then overexpressed in the HLA-deleted HEK293T cells together with an infrared fluorescent protein (IFP)-based granzyme reporter. The granzyme reporter was constructed by replacing the caspase cleavage site of the fluorogenic protease reporter with a granzyme-specific cleavage sequence38. The HLA-A*11:01 reporter cells were co-cultured with the IPA neoantigens-expanded T cells in the absence (negative control) or presence of the IPA neoantigen peptide(s) for 6h at 37oC in a CO2 incubator to show the peptide-specific killing (Fig. 1d). Again, using the EBV-antigen as a positive control, the reporter cells were co-cultured with EBV-antigen expanded T cells in the presence or absence of the EBV-antigen peptide. The cells were then acquired using BD LSR Fortessa™ Flow Cytometer (Becton-Dickinson) and analyzed using the FlowJo™ software.
Data availability
Data were downloaded from various publicly available websites. The detailed downloading information is described in Methods.
Code availability
Pipeline codes are available at GitHub https://github.com/RENXI-NUS/IPA-neoantigen-prediction.