Eukaryotic transcriptomes are large and complex: most genes can produce multiple isoforms, which may differ in their splicing pattern, localization of 5’ and 3’ ends and protein-coding potential [1]. RNA polymerase II continues well beyond the polyadenylation site as part of the mechanism of transcriptional termination, yet this read-through transcription is missing from gene models [2]. Moreover, transcripts generated from non-coding regions of the genome are often poorly annotated yet may exert regulatory functions even despite the absence of a stable RNA product [3-5]. Steady-state RNA sequencing methods offer little information on transient non-coding RNA species and read-through transcription. However, such transcription events result in overlapping transcription units which may contribute to gene expression regulation [6].
In the past, building a gene model for a species from EST libraries and cDNA clones sequenced by the Sanger method could require years of labor by an international consortium. Genome-wide detection of transcribed exons from RNA-seq data became feasible with short-read sequencing platforms such as Illumina. Indeed, transcript models for many less characterized genomes are primarily based on short-read RNA-seq data, e.g. Pisum sativum [7], Oryza sativa [8] and Fragaria vesca [9]. Even for the model plant species Arabidopsis thaliana, the commonly used transcriptome annotations TAIR10 and Araport11 are to a large degree based on Illumina RNA-seq datasets [10, 11].
However, the short read RNA-seq has some fundamental limitations. First, the RNA-seq coverage gradually decreases towards gene borders, which limits accurate definition of 5' and 3‘ ends of transcripts [12]. Second, although the positions of splice sites can be determined with high accuracy, the correct resolution of isoform splicing patterns is limited. Third, sequencing of steady-state RNA in wild type samples gives little information about transient non-coding transcripts. Taken together, these considerations offer a cautionary tale for gene annotations based on RNA-seq data.
Third generation sequencing techniques from Oxford Nanopore (ONT) and PacBio recently revolutionized the field of transcriptomics. Theoretically, a long RNA-seq read may cover the whole isoform from start to end, directly informing on the exon structure and splicing patterns. ONT also allows for direct sequencing of RNA molecules, thus eliminating biases associated with cDNA synthesis. This attractive feature makes ONT Direct RNA-seq a promising choice for de novo characterization of novel transcriptomes and validation of existing transcript models.
However, ONT Direct RNA-seq has four key limitations. First, up to 30-40% of bases can be called with errors [13, 14]. To tolerate the sequencing errors, the dedicated aligners allow for more mismatches and thus inevitably sacrifice the accuracy of alignments. As a result, the alignment software may fail to detect true genomic origin of the read or correctly define the exon-intron borders. Second, a fraction of long reads cover only 3' portions of the original mRNA molecule, for example due to either RNA fragmentation, or premature termination of the sequencing reaction. Even when an intact mRNA molecule is fully sequenced from 3' end to 5' end, the last base of the read usually aligns a few tens of nucleotides downstream from the true transcription start site (TSS) [15]. Third, ONT sequencing is prone to homopolymeric tract skipping [16], which may appear as short exitrons (exonic introns) in Direct RNA-seq data. Fourth, the existing full-length RNA-seq protocols rely on RNA poly(A) tails for sequencing library construction. Therefore, non-polyadenylated RNA species which include many long non-coding RNAs (lncRNAs) will not be detected. Taken together, multiple factors preclude accurate transcript and gene model calling solely from the long reads.
Complementary genomics methods circumvent many of these limitations. For example, CAGE-seq [17] and PAT-seq [18] can detect RNA 5’ and 3’ ends, respectively, with high resolution, while nascent RNA methods such as GRO-seq [19] or NET-seq [20] can detect transient transcripts. Therefore, we posit that the highest quality transcript models can be constructed using integrative analysis levering the complementary strength of each of these methods.
We developed a de novo gene and transcript model construction pipeline TranscriptomeReconstructoR which takes three datasets as input: i) full-length RNA-seq (e.g. ONT Direct RNA-seq) to resolve splicing patterns; ii) 5' tag sequencing (e.g. CAGE-seq) to detect TSS; iii) 3' tag sequencing (e.g. PAT-seq) to detect polyadenylation sites (PAS). Optionally, it can also take a nascent RNA-seq dataset (e.g. NET-seq) to find transient RNAs. The pipeline returns the discovered genes and transcripts. We also included the option to refine the de novo gene and transcript models by the existing transcriptome annotation. TransctiptomeReconstructoR thus can be used for validation or improvement of existing gene models, as well as for data-driven annotation of non-model species with no prior gene model available.
Implementation
We implemented the pipeline as an R package, available from the dedicated repository on GitHub (https://github.com/Maxim-Ivanov/TranscriptomeReconstructoR). The package is based on Bioconductor packages (GenomicRanges, GenomicAlignments, rtracklayer), as well as on tidyverse and collections packages from CRAN [21, 22]. It takes aligned BAM files as input and returns a set of GRanges and GRangesList objects which represent the gene and transcript models. These GenomicRanges objects can be either exported as BED files for visualization in genomic browsers, or directly used as input for downstream analysis by various packages available from the Bioconductor. The TranscriptomeReconstructoR workflow is streamlined and includes 6 consecutive function calls (optionally 8, if the nascent RNA-seq dataset is used). The accompanying vignette contains the user manual, as well as in-depth description of the algorithm (Additional File 1).
The underlying basis for TranscriptomeReconstructoR is the correction and validation of long RNA-seq reads by the positions of TSS and PAS that are called from the independent short read 5' and 3' tag datasets. Long reads from ONT Direct RNA-seq or PacBio Iso-Seq are extended towards nearby TSS and/or PAS, given that the extension distance does not exceed the reasonable limit (100 bp by default). After the extension, the long reads are classified as "complete" or "truncated", depending on the overlap with the independently called TSS and PAS (Fig. 1A). The extension procedure allows to rescue a substantial fraction of long reads and simultaneously decrease the number of artifact transcript isoforms with alternative 5'- or 3' terminal exons.
The method also suppresses the alignment noise of long reads by the adjustment of 5'- and 3' splice sites. The true alternative splice sites are assumed to be separated by a certain minimal distance (10 bp by default). Exonic subalignments of long reads with 5' or 3' borders differing by less than this value are grouped together, and their coordinates are unified by the majority vote (Fig. 1B). Otherwise the "fuzzy" borders of subalignments might inflate the number of alternative 5' and 3' splicing events.
A common problem with full-length RNA sequencing reads aligned by dedicated aligners such as Minimap2 [23] is the under-splitting, i.e. erroneous extension of an exon over the adjacent intronic region [24]. As a result, the next exon appears as missing from such read, although it is present in other reads aligned within the same locus. The most probable explanation for this phenomenon is the inherent low sensitivity of long read aligners [24]. Assuming that the majority of reads still align correctly, we detect such alignment errors by comparing each subalignment in a long reads to the linear sequence of constitutive exons, that we extract from the whole set of long reads aligned to given locus. Subalignment with an alternative 5’- and/or 3-border relative to a constitutive exon are considered valid alternative exons, only if the next constitutive exon is also present in the read (Fig. 1C). Otherwise, the alignment is marked as a possible alignment error. Reads containing at least one alignment error are skipped from the transcript calling procedure, thus further decreasing the number of artifact alternative 5' and 3' splice sites.
Finally, identical long reads without alignment errors are collapsed into transcripts. Such transcripts are divided into High Confidence (HC), Medium Confidence (MC) and Low Confidence (LC) groups, depending on the support from TSS and PAS datasets. HC transcripts are constructed from long reads which start in a TSS and end in a PAS (i.e. are supported by all three datasets). MC transcripts have either TSS or PAS, whereas LC transcripts are supported by long reads only. The MC and LC transcripts may originate from partially fragmented RNA molecules and thus may have unreliable outer borders. They are called to rescue the loci where no complete reads were discovered, perhaps due to low expression level. Furthermore, the called transcripts are clustered into HC, MC and LC genes.
If a nascent RNA-seq dataset is available, the gene model can be further improved by intervals of nascent transcription which can be found outside of the called gene boundaries (Fig. 1D). These often represent transient and/or non-polyadenylated RNAs which escape detection by the poly(A)-dependent steady-state RNA sequencing methods. Another phenomenon which can be observed only in the nascent RNA-seq track is read-through transcription. The read-through (RT) "tails" immediately downstream from the protein-coding genes are explained by the "torpedo" termination model where RNAPII elongation may continue for up to a few Kb beyond the cleavage and polyadenylation site [2]. If such "tails" were identified, we appended them to the respective called gene.
TranscriptomeReconstructoR returns a set of GRanges and GRangesList objects which exhaustively annotate the transcriptome. They contain the following information: i) The coordinates of HC, MC and LC genes; ii) The length of RT tails; iii) The exon-intron structure of HC, MC and LC transcripts; iv) The coordinates of intergenic and antisense transient RNA; v) The exon-intron structure of fusion transcripts (i.e. transcripts which cover two or more adjacent genes due to inefficient termination). These GenomicRanges objects can be easily exported as BED files for visualization in genomic browsers.