Standard pipelines filter out thousands of potential novel gene predictions
We processed the UHRR Iso-Seq data using 5 different pipelines to compare the effect of each method on gene discovery and model prediction accuracy (Fig. 1e). The first pipeline, which we will refer to as the “Cupcake pipeline”, uses inter-read error correction (in the form of clustering long reads and using the alignment to polish the sequences) along with Cupcake Collapse (a tool for collapsing redundant transcript models). The Cupcake pipeline represents the current standard pipeline, as is run by default on PacBio machines. The second pipeline, which we will refer to as the “Polish pipeline”, is a modified version of the Cupcake pipeline where TAMA Collapse is used in place of Cupcake Collapse. The third pipeline, which we will refer to as the “Lordec pipeline”, uses inter-read error correction in the form of aligning short read RNAseq data to long reads along with TAMA Collapse. For the Lordec pipeline we used short read RNA-seq data from the UHRR but from another study(11). The fourth pipeline, which we will refer to as the “TAMA Low pipeline”, uses no inter-read error correction. The settings used for the TAMA Collapse run in TAMA Low are also the same settings used in the Polish and Lordec pipelines. The fifth pipeline, which we will refer to as the “TAMA High pipeline”, is identical to the TAMA Low pipeline with the exception of the use of the local density error (LDE) feature which is currently unique to TAMA Collapse. In the TAMA High pipeline, the LDE algorithm is used to remove transcripts with more than 1 error within a 20 bp range of a splice junction. High densities of errors near the splice junctions often cause erroneous splice junction mapping leading to inflated transcript numbers. We allowed for 1 bp of error due to possible true genomic variation between the UHHR samples and the reference genome assembly. We then only kept transcript models with read support from both SMRT cells (the Iso-Seq data was generated from 2 SMRT cells) to avoid using reads that originated from PCR artefacts.
Out of the 5 pipelines, the Polish and Cupcake pipelines had the lowest sensitivity in terms of the number of predicted genes (25,731 and 25,239 genes, respectively) and transcripts (126,288 and 128,389, respectively) (Table 1). As expected, due to the similarity of algorithms between low stringency TAMA Collapse and Cupcake collapse, these pipelines produced similar numbers of genes and transcripts. The primary reason for the reduced sensitivity in these pipelines is that the Cluster/Polish steps remove any reads that do not cluster with at least one other read (also known as singletons). Thus, low expression transcripts can be filtered out due to a lack of read depth.
The TAMA Low and Lordec pipelines resulted in the greatest sensitivity in terms of the number of genes (168,328 and 166,766 genes, respectively) and transcripts (752,996 and 753,756 transcripts, respectively) predicted (Table 1). The high sensitivity of the Lordec pipeline is not surprising as Lordec does not filter any reads. Thus, the same number of reads were used during genome mapping for the Lordec pipeline as compared to the FLNC Low pipeline and the same TAMA Collapse parameters were used in both pipelines. In theory, the Lordec pipeline should have resulted in greater numbers of genes, since short read error correction is supposed to correct low quality Iso-Seq reads which should then increase the number of mapped reads that pass the filtering thresholds of TAMA Collapse. Surprisingly, we do not see a significant increase in sensitivity from using Lordec.
Table 1 Pipeline Comparison
Comparison of gene and transcript number across pipelines broken down into different categories. Gene level matches refer to gene models which overlap on the same strand. Transcript level matches refer to transcript models with similar exon structures.
Match Type | Polish | Cupcake | Lordec | TAMA Low | TAMA High |
Total Genes | 25,731 | 25,241 | 166,766 | 168,328 | 40,821 |
Total Transcripts | 126,288 | 128,389 | 753,756 | 752,996 | 135,218 |
Gene Level Matches | 19,348 | 19,342 | 30,835 | 30,947 | 21,284 |
Transcript Level Matches | 20,364 | 20,428 | 28,143 | 28,234 | 17,932 |
Novel Genes | 8,519 | 8,068 | 139,769 | 141,097 | 23,302 |
Novel Transcripts | 103,052 | 104,824 | 716,241 | 713,546 | 115,893 |
The TAMA High pipeline resulted in gene (38,743) and transcript (135,218) totals that were greater than the Polish and Cupcake pipeline but not nearly as high as the TAMA Low and Lordec pipelines.
To test whether the number of genes identified in the 5 pipelines correlated to true sensitivity (as opposed to erroneous gene predictions from sequencing noise), we compared the resulting annotations from the five pipelines with the Ensembl v96 human annotation(12) using TAMA merge (Fig. 2a-b). The Ensembl human annotation is created by a combination of a semi-automated gene build pipeline and manual annotation by the Havana team. As a result, the Ensembl human annotation is considered among the most complete and accurate annotations for the human genome. We used two definitions for matching: gene-level matches and transcript-level matches. Gene-level matches are defined as the number of genes with overlap between gene loci (between each pipeline and the Ensembl annotation) on the same strand. Gene level matches represent the upper bound estimate of the number of true known gene models identified in each pipeline. Transcript-level matches are defined as the number of transcripts that have the same exon structures between annotations (see Methods). These represent an estimate of the accuracy of the transcript models predicted by each pipeline since an erroneous transcript model would still be considered a match on the gene level.
The pipelines which predicted the highest number of genes (TAMA Low and Lordec) also resulted in the highest number of both gene-level and transcript-level matches. This suggests that the increased stringency of the other pipelines significantly decreases the detection of real genes and transcripts. However, both the TAMA Low and Lordec pipelines had over 140K mono-exonic genes which is more than 5 times the number of mono-exonic genes as the Polish pipeline (Fig. 2c). This vastly greater number of mono-exonic genes could represent either real genes, RNA sample noise or DNA contamination. Therefore, the increased sensitivity of these pipelines could come at the cost of having a significant number of possible false positives.
The TAMA High pipeline had more gene-level matches as compared to the Polish and Cupcake pipelines but fewer transcript-level matches. This may be because the TAMA High filtration of transcript models with only single SMRT cell read support removed reads representing lowly expressed transcripts. However, the number of mono-exonic genes within the TAMA High annotation appeared to be more reasonable. Therefore, it appears that the TAMA High pipeline resulted in a reasonable balance between sensitivity and specificity.
TAMA HIGH identifies more accurate transcript models without reducing gene discovery
To understand the accuracy of each pipeline for predicting transcript models, we looked at both TAMA’s predicted error rates as well as splice junction wobble. Wobble refers to mis-mapping of splice junctions causing small differences in the genomic loci of mapped features such as exon boundaries and splice junction donor/acceptor sites (Fig. 3a). While the error rates of mapped reads are often used to assess the improvement of long read data from different pipelines(13), this metric is actually not as useful for understanding the overall improvement in the transcriptome annotation. In genome-based transcriptome annotations, typically the most important features to identify are the transcription start sites (TSS), transcription end sites (TES), splice junctions, and exon chaining. These features allow for predictions of coding and promoter regions which are often crucial for downstream analyses. Thus, for transcript structure identification, errors near the splice junctions have a greater probability of altering the resulting transcript model than errors occurring farther away from the splice junctions. This means that the percentage of errors within a read may not be as impactful as the distribution of errors. Thus a more accurate metric for the performance of error correction methods is to assess the amount of “splice junction wobble” between the predicted transcripts and known transcripts.
To demonstrate this concept we looked at the error profiles for each mapped read for the TAMA Low, TAMA High, Polish, and Lordec pipelines. Note that the mapped FLNC reads are the same for the FLNC High and FLNC Low pipelines. The Cupcake pipeline was omitted in this analysis because Cupcake Collapse does not provide a report on the errors in the mapped reads whereas TAMA Collapse provides detailed information on the type and amount of errors for each read.
Using the output from TAMA Collapse we looked at length of coverage, mapping identity, clipping, insertions, deletions, and substitution errors. These values represent the comparison of the mapped reads to the genome assembly and thus only serve as an estimate of the true rates of error. We also looked at average error rates between the pipelines by counting the number of base pairs that were not matching between the mapped read and the genome sequence and dividing this number by the length of the mapped read. This includes soft clipping, insertion, deletion, and substitution errors but does not include hard clipping.
The TAMA Low pipeline had the highest average error rate (2.83%) and the highest amount of each type of error while the Polish pipeline had the lowest error rates (0.52%) with the lowest amount of each type of error. The Lordec pipeline coverage and identity were unexpectedly similar to the FLNC pipelines suggesting that Lordec correction did not provide a large gain in error correction. The Lordec pipeline had a similar amount of clipping errors as compared to the FLNC pipelines but lower rates of insertion, deletion, and substitution errors (Fig. 2d). This indicates that the Lordec correction may have some issues correcting the ends of reads.
The Polish pipeline had longer average mapped read lengths compared to other pipelines. This is most likely due to the Cluster/Polish algorithm which merges read sequences with up to 100 bp length differences on the 5’ end. This behavior essentially absorbs the shorter reads into the longer reads effectively removing their length representation.
We then looked at transcript model accuracy by measuring the wobble at splice junctions with respect to transcript models annotated in the Ensembl human annotation for the 5 different pipelines (Fig. 3b and 3d). Wobble typically occurs due to high error density near the splice junctions leading to small shifts in mapping the ends of each exon(14). The amount of wobble between the transcript models of each pipeline compared to the reference annotation provides a metric for the accuracy of the transcript models produced by each pipeline. We ignored wobble at the transcript start and end sites due to the high variance of these features in natural RNA(15),(16). We also only assessed Ensembl transcript models that had coverage from all assessed pipelines to account for the differences in sensitivity between the pipelines.
The TAMA Low pipeline produced the highest average wobble per splice junction (0.72 start wobble and 0.69 end wobble) while the TAMA High pipeline had the lowest average wobble values (0.26 start wobble and 0.24 end wobble).
We also looked at the number of transcripts with perfect splice junction matches to the reference annotation. The TAMA Low pipeline had the lowest number of perfect transcript matches, while the TAMA High pipeline had the highest number of perfect transcript matches (Fig. 3c). The Polish pipeline had the second highest number of perfect transcripts. Thus, despite the lower overall error rates in the mapped reads from the Polish pipeline, the TAMA High pipeline produced more accurate transcript models. This suggests that the LDE filtration in the TAMA High pipeline resulted in more accurate identification of splice junctions.
Inter-read error correction leads to read jumbling affecting thousands of genes and transcripts
One of the major concerns when using inter-read error correction methods such as Cluster/Polish and Lordec is the possibility of creating sequencing errors as a result of combining read sequences from different transcripts. The different transcripts can either be from different genes (gene-level jumble) or a combination of alternative transcripts within the same gene (transcript-level jumble). Gene-level jumble typically occurs due to the sequence similarity of paralogues within gene families(17). In both gene-level and transcript-level jumble, it is more likely that the highest expressed gene or transcript within the read clusters will mask the lower expressed genes. This is because the final cluster sequence is determined by sequence coverage. However, in cases where the read coverage within a jumble cluster is similar across unique transcripts, it is more likely that the resulting cluster read will have a mixture of sequences from each unique transcript within the cluster.
To investigate how often these jumble events occur, we compared the read mappings from the TAMA Low pipeline to the inter-read error correction pipelines to find reads which mapped to different genes and transcripts in each comparison. While it is possible that the TAMA Low read mappings are erroneous, they represent the read sequences without any over-correction. Also reads which have different mapping loci between the TAMA Low pipeline and an inter-read error correction pipeline indicate that there is enough sequence ambiguity to call into question the effect of the inter-read error correction.
Since the Cupcake pipeline uses the same Cluster/Polish step as the Polish pipeline, there should be no differences in the read mappings. Similarly, the TAMA Low and TAMA High pipelines used the same read mappings.
Comparing the TAMA Low pipeline to the Polish pipeline, we found 34,637 reads that switched from one gene locus to another after Cluster/Polish correction (Fig. 4b). This gene loci switching involved 6,774 genes, 3,230 of which were only found with the FLNC Low pipeline while 104 genes were only found with the Polish pipeline. The asymmetry of the number of unique genes between the pipelines suggests that Cluster/Polish is incorrectly clustering reads from different genes.
To gain a more detailed understanding, we focus on the PReferentially expressed Antigen of MElanoma (PRAME) gene family. The PRAME gene family is highly associated with cancer development(18),(19),(20),(21) and is used as a biomarker for identifying various forms of cancer. Within the PRAME gene family there are 24 annotated paralogues(22). In this example, the Polish pipeline caused a false negative for the detection of one of the PRAME paralogues, PRAMEF8. The TAMA Low pipeline finds 9 reads mapping to PRAMEF8 (Fig. 4a) while the Polish pipeline shows no reads mapping to PRAMEF8. Of the 9 PRAMEF8 reads from the TAMA Low pipeline, 5 were mapped to another paralogue, PRAMEF15, in the Polish pipeline. We analyzed the sequence similarity between the two paralogues by aligning the PRAMEF8 and PRAMEF15 transcript sequences with Muscle(23) and found that they had 76% identity. While the sequences of the two genes are similar, the genome mapping identity for the reads were higher than the sequence similarity between the two paralogues. The PRAMEF8 read with the lowest identity score during genome mapping in the TAMA Low pipeline had an identity of 89% and 6 PRAMEF8 reads had mapping identities over 98%. Thus, there is strong evidence that the reads mapped correctly in the TAMA Low pipeline and were altered to the point of mis-mapping in the Polish pipeline. This particular type of error could have major consequences for studies aimed at identifying gene biomarker expression.
To assess the effect of short/hybrid inter-read error correction on gene level read jumbling, we compared the TAMA Low pipeline to the Lordec pipeline. There were 19,064 reads which switched from one gene locus to another (Fig. 4c), involving a total of 3,476 genes, 775 of which were only found with the TAMA Low pipeline while 675 genes were only found with the Lordec pipeline. The number of genes found only in the Lordec pipeline is much higher than the number of genes found only in the Polish pipeline which may indicate that Lordec correction is more inclined to produce false positives.
We also examined how erroneous inter-read error correction can lead to transcript level jumbling. In this case, when reads from different transcripts from the same gene are grouped for error correction, the resulting sequence will, at best, represent only the more highly expressed transcript and, at worst, represent an erroneous jumbled sequence.
Comparing the TAMA Low pipeline to the Polish pipeline, we found 477,351 reads that mapped to different transcript models within the same gene. There were 112,891 transcripts affected by transcript-level jumbling, 44,852 of which were found only in the TAMA Low annotation while 1,372 transcript were found only in the Polish annotation. This represents a large difference between the two annotations with the TAMA Low pipeline predicting far more transcript models than the Polish pipeline.
Comparing the TAMA Low pipeline to the Lordec pipeline, we found 187,829 reads that mapped to different transcript models. This involved 142,704 transcripts with 7,117 transcripts found only in the FLNC Low annotation and 11,732 transcript found only in the Lordec annotation. Again, it appears that the Lordec pipeline is more prone to producing false positives.
To summarize, in both the long and short inter-read error correction pipelines we saw a significant number of gene-level and transcript-level read jumbling which may result in the prediction of gene and transcript models that are not biologically accurate. Hence, we argue that the best approach would be to forego inter-read error correction and instead focus on methods, such as the TAMA Collapse LDE algorithm, for removing reads with error profiles that could lead to erroneous transcript model predictions.
Novel genes are dominated by mono-exonic non-coding and coding genes
To gain insight into the 141,097 novel gene models found in the TAMA Low pipeline, we looked at several features: coding potential, number of exons, intronic overlap with other genes, overlap with regulatory features, and the presence of immediately downstream genomic poly-A stretches. Coding potential and splice junctions are often used as strong evidence of a functional gene. Conversely, overlap with introns (from other genes), genomic poly-A stretches immediately downstream of a gene model, and the absence of splice junctions (single exon transcripts) provide evidence that the source of the model could be from either non-functional transcribed products or genomic contamination.
Coding potential was assessed using three complementary methods. First, we used an open reading frame sequence analysis tool, CPAT(24), to detect coding potential. This method only works when the transcripts models do not contain frame shifts caused by erroneous splice junction calling. Second, we used TAMA merge to identify gene models that overlapped the genomic loci (on the same strand) of protein coding genes within the Ensembl annotation. Third, we used the TAMA ORF/NMD pipeline which is a frame shift-tolerant method of matching transcript sequences to peptide sequences from the Uniprot(25) database. We combined these three methods to account for the various errors which can cause false negatives in protein coding gene prediction.
Only a small number of the novel genes (122 out of 141,097) were supported by all features which are considered evidence for functionality (multi-exonic, coding, intergenic, and processed poly-A) (Fig. 5a). This is expected given that these features are used by short read RNA-seq annotation pipelines for validation. Therefore, many of the gene models with these features are likely to have already been annotated by Ensembl.
The two most common sets of features for the TAMA Low predicted novel genes are “single exonic, non-coding, intronic gene overlap, and genomic poly-A” at 32% (45,820) and “single exonic, coding, intronic gene overlap, and genomic poly-A” at 20% (27,524). The first set of features are indicators for lncRNA while the second set of features are indicators for processed pseudogenes or real coding genes. Together, these account for over 52% of the novel genes. While these features are also associated with non-functional sequences such as pre-processed RNA or degraded RNA fragments, the fact that they do not overlap with the exons of known genes suggests that many of these putative novel gene models may represent real genes which were overlooked or undetected in previous annotations efforts.
The third most common set of features (“single exonic, non-coding, intronic gene overlap, and processed poly-A”) are indicators for real mono-exonic lncRNA and make up 10% (14,036) of the novel gene models (Fig. 5a). Since there are no genomic poly-A stretches downstream of these models, the source RNA must have had poly-A tails added during RNA processing which would suggest the RNAs truly exist in that form and may have functional roles. Given that these models did not overlap any gene models in the Ensembl annotation, this would represent a large increase in the number of predicted lncRNA for the human genome.
There were an additional 11,255 (8%) genes with features (single exonic, non-coding, intergenic, and genomic poly-A) that are indicative of either mono-exonic intergenic lncRNA or genomic contamination. With the UHRR being one of the most carefully prepared RNA samples, this would indicate that researchers would require more advanced methods of either RNA preparation or sequencing analysis to confidently identify novel genes.
The Universal Human Reference RNA contains a significant amount of degraded RNA
We developed a metric called the “Degradation Signature” (DegSig) to measure the 5’ transcript completeness. When both a capped selected cDNA library and non-capped selected cDNA library are available on the same sample, we can measure the amount of multi-read-supported, multi-exonic transcripts at a given loci that show a cascading pattern of TSSs along the length of the longest transcript model (Fig. 5b). A higher percent difference indicates that the sample RNA had a greater proportion of degraded RNA.
The DegSig of the Ensembl human reference annotation is 1.5% which represents an estimate of the lower limit for DegSig values. However, since 5’ degraded transcripts are difficult to distinguish from real TSS cascade pattern gene models, it may be that these types of gene models are under-represented in all annotations since most annotation pipelines remove 5’ shorter transcripts models. This analysis supports the idea that a DegSig of 0% is likely impossible due to true TSS cascade models.
We applied the DegSig metric to chicken brain Iso-Seq data, where libraries were prepared using both a non-cap selecting method and the TeloPrime(26) 5’ cap selection. The TeloPrime library should contain a lower percentage of degraded transcript sequences since it selects for complete capped RNAs. The non-cap selected data had a DegSig of 56.3% while the DegSig for the TeloPrime library data was 23.6%, suggesting a large difference in the proportion of degraded RNA sequences captured as cDNA by the two different methods.
We ran DegSig on the UHRR Iso-Seq dataset individually by SMRT cell and chromosome. Almost all chromosomes had a DegSig between 32% and 41% (Fig. 5c). However, the Y chromosome had a DegSig of 26.7% and 27.2% for SMRT Cell 1 and 2, respectively. One explanation for the much lower DegSig on the Y chromosome may be due to the lack of read depth for the Y chromosome (only 629 and 588 reads from SMRT cells 1 and 2, respectively). Lower read depths can decrease the DegSig values due to the lack of coverage for each gene.
The range of DegSig for the human data is higher than that for the chicken 5’ cap selected RNA data, suggesting that there may be a significant number of truncated models in the UHRR Iso-Seq transcript annotation. This could also be a source of the unusually high number of novel alternative transcripts predicted in the FLNC Low pipeline.