TAMA commit 39c1270c6e1ef2cf5d39f7f047fa15e0f1a6c790 was used for this study.
More detailed information on how TAMA works can be found here:
https://github.com/GenomeRIK/tama/wiki
Wobble
Wobble is defined in this text as the distance, measured in bases, between the mapped starts and ends of exons. The term is used to describe small differences (<50 bp) in predicted starts/ends based on mapped reads. These differences can occur due to real differences in starts/ends or due to errors in the reads flanking the starts/ends. For example, if a read has a number of missing bases immediately flanking a splice junction (an SJ comprises one exon start and one exon end), the splice junction predicted from mapping may be offset by the same number of missing bases. TAMA Collapse and TAMA Merge both use wobble to allow reads to be grouped and collapsed into a single transcript model. This is assessed by comparing every pair of transcript models within the same genomic locus (a locus being all reads connected by at least 1 bp of same-strand overlap). In each pairwise assessment, each exon start and end from each predicted transcript model is compared to determine whether they occur within the user-defined wobble threshold.
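As a minimal sketch of this pairwise comparison (illustrative Python, not TAMA's actual implementation; all names are hypothetical), assuming plus-strand models given as sorted lists of (exon_start, exon_end) tuples:

    def exons_match(model_a, model_b, sj_wobble=10, tss_wobble=50, tes_wobble=50):
        # Capped-style matching: both models need the same exon count and
        # every exon boundary must fall within the relevant wobble threshold.
        if len(model_a) != len(model_b):
            return False
        last = len(model_a) - 1
        for i, ((a_start, a_end), (b_start, b_end)) in enumerate(zip(model_a, model_b)):
            # TSS/TES wobble applies to the outermost coordinates,
            # SJ wobble to all internal exon boundaries.
            start_thresh = tss_wobble if i == 0 else sj_wobble
            end_thresh = tes_wobble if i == last else sj_wobble
            if abs(a_start - b_start) > start_thresh or abs(a_end - b_end) > end_thresh:
                return False
        return True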
Due to this allowance of wobble between predicted starts and ends of exons, a phenomenon termed in this text as wobble walking can occur (Fig. 1c). Wobble walking is defined as a situation where 3 or more transcript models have exon starts or ends where the most upstream and most downstream exon start/end predictions occur at a distance greater than the wobble threshold, yet, when the positions are ordered by genomic coordinate, no consecutive pair of exon starts/ends is farther apart than the wobble threshold. For example, with a wobble threshold of 50 bp, exon starts at positions 100, 140, and 180 would all be grouped: each consecutive pair is 40 bp apart even though the outermost pair is 80 bp apart. Thus, by using the pairwise non-stochastic method of matching transcript models, all transcript models in this situation would match due to the linking effect across all represented exon starts/ends. When this situation occurs, the distance between the exon starts/ends of the grouped transcripts used for collapsing can be greater than the user-defined wobble threshold.
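This chaining behaviour can be reproduced with the sketch above (again, illustrative only):

    # Three single-exon models whose starts "walk" across the threshold.
    a = [(100, 1000)]
    b = [(140, 1000)]
    c = [(180, 1000)]
    print(exons_match(a, b))  # True  (starts 40 bp apart)
    print(exons_match(b, c))  # True  (starts 40 bp apart)
    print(exons_match(a, c))  # False (starts 80 bp apart), yet a, b, and c
                              # all end up in one group via the a-b and b-c links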
TAMA Collapse
TAMA Collapse performs multiple functions: transcriptome assembly, variant calling, genomic downstream poly-A detection, and transcript/gene level quantification. Its primary function is to create a non-redundant, error-corrected, genome-reference-based transcriptome annotation. TAMA Collapse takes as input a sorted SAM/BAM file representing long read RNA sequencing data mapped onto a reference genome assembly, as well as a fasta file of the reference genome assembly used for mapping. TAMA Collapse is designed to be highly tunable and relies on four main parameters to define its behaviour: wobble thresholds, collapse mode, splice junction ranking, and the amount of mapping mismatch allowed around splice junctions (local density error, LDE).
The wobble thresholds and collapsing modes define how mapped reads are grouped for collapsing. Wobble thresholds can be defined for the TSS, TES, and SJ and are given as integer base pair distances. These thresholds define the maximum distance at which two features (such as two TSS) are considered matching. There are two collapsing modes, termed capped and non-capped (no_cap). The capped mode requires that all grouped transcript models (mapped reads) have the same number of exons and that all their exons have matching start and end sites as per the user-defined wobble thresholds. The non-capped mode additionally allows models with missing 5’ exons or shortened 5’ ends, as expected from 5’ degraded RNA, to be grouped with longer models. Matches are performed pairwise in a non-stochastic algorithm; this pairwise matching is what leads to wobble walking.
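A corresponding sketch of the non-capped comparison (a hypothetical simplification of the idea, not TAMA's code; models as above):

    def no_cap_match(shorter, longer, sj_wobble=10, tes_wobble=50):
        # A 5'-truncated model matches a longer model when its exons align
        # with the 3'-most exons of the longer model; the truncated model's
        # 5' end only needs to fall within the corresponding exon.
        offset = len(longer) - len(shorter)
        if offset < 0:
            return False
        last = len(shorter) - 1
        for i, (s_start, s_end) in enumerate(shorter):
            l_start, l_end = longer[offset + i]
            if i == 0:
                # 5' degradation: the start may fall anywhere within the exon.
                if not (l_start - sj_wobble <= s_start <= l_end):
                    return False
            elif abs(s_start - l_start) > sj_wobble:
                return False
            end_thresh = tes_wobble if i == last else sj_wobble
            if abs(s_end - l_end) > end_thresh:
                return False
        return True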
The splice junction ranking and local density error algorithms are designed to identify the most likely real splice junctions given a group of matching transcript models. Both rely on a user-defined threshold for the distance from the SJ to assess. The LDE feature can be turned on or off. When turned on, the user can specify the distance from the splice junction to assess and the number of allowed mismatches within that distance; if the number of mismatches exceeds the threshold, the read is discarded. This is intended to prevent erroneous splice junction predictions. The splice junction ranking can also be turned on or off by the user. When turned off, splice junctions are selected based on the highest read coverage. When turned on, TAMA Collapse ranks the splice junction read support based on the amount of mismatches flanking the splice junctions. In this method, a splice junction supported by a read with no mismatches flanking the junction is given the highest rank and chosen as the final predicted splice junction.
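The LDE check can be sketched as follows (illustrative only; function and variable names are hypothetical):

    def passes_lde(error_positions, sj_positions, sj_dist=20, lde_threshold=5):
        # A mapped read is discarded if any of its splice junctions has more
        # than lde_threshold mismatches/indels within sj_dist bp of the junction.
        # error_positions: genomic positions of mismatches/indels in the alignment.
        # sj_positions: genomic positions of the read's splice junctions.
        for sj in sj_positions:
            local_errors = sum(1 for p in error_positions if abs(p - sj) <= sj_dist)
            if local_errors > lde_threshold:
                return False
        return True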
While TAMA Collapse has multiple file outputs, the main output is a bed12 formatted annotation file containing all non-redundant transcript models.
TAMA Merge
TAMA Merge is designed to remove transcript model redundancy either between multiple input annotations or within a single input annotation. TAMA Merge accepts as input one or more annotations in bed12 format. It has multiple output files, but the main output is an annotation file in bed12 format. TAMA Merge also keeps track of the transcript models and their source annotations which were “merged”, meaning that for each transcript model it reports which input annotations had matching transcripts. TAMA Merge uses the same wobble parameters/algorithm and collapsing modes as TAMA Collapse; however, individual input files can be assigned different collapsing modes. This is useful for merging long read data, which is likely to contain 5’ truncated transcript models, with a reference annotation. In addition to collapsing mode and wobble thresholds, TAMA Merge allows users to assign priority to different input annotations for features such as TSS, TES, and SJ. For instance, a short read derived annotation can be given priority for SJ, while a long read annotation can be given priority for TSS and TES.
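These per-file settings are supplied via the filelist.txt input file. A hypothetical example for the scenario above (tab-separated columns: file name, cap flag, merge priority as start,junction,end with 1 being the highest priority, and source name; the file and source names here are placeholders):

    long_read_annotation.bed     no_cap    1,2,1    long_read
    short_read_annotation.bed    capped    2,1,2    short_read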
TAMA Read Support Levels
The tama_read_support_levels.py tool generates a file that relates each transcript and gene model to the IDs of the reads which were used to generate those models. This can also be thought of as producing read count information for transcripts and genes. The tool works on all annotation output files from all TAMA modules as well as on PacBio annotation files. This tool was used to identify reads that were involved in read jumbling.
TAMA Filter Fragments
The tama_remove_fragment_models.py tool removes transcript models that appear to be fragments of full-length models. The criterion for a fragment model is that it contains the same internal exon-intron structure as a transcript that is longer on both the 5’ and 3’ ends. The splice junction wobble can be adjusted by the user.
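The fragment criterion can be sketched as follows (illustrative Python, assuming plus-strand models as sorted lists of (exon_start, exon_end) tuples; exact intron matching is shown here, whereas the tool allows a user-set SJ wobble):

    def is_fragment(candidate, full_model):
        # candidate is a fragment if its introns form a contiguous subset of
        # full_model's introns and full_model extends beyond it on both ends.
        def introns(model):
            return [(model[i][1], model[i + 1][0]) for i in range(len(model) - 1)]
        cand_introns, full_introns = introns(candidate), introns(full_model)
        n = len(cand_introns)
        contiguous = n > 0 and any(
            full_introns[i:i + n] == cand_introns
            for i in range(len(full_introns) - n + 1)
        )
        longer_both_ends = (full_model[0][0] < candidate[0][0]
                            and full_model[-1][1] > candidate[-1][1])
        return contiguous and longer_both_ends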
TAMA Remove Single Read Models
The tama_remove_single_read_models_levels.py tool filters a transcriptome annotation based on the amount of read support for each transcript model. It can be run on either the results of TAMA Collapse or the results of TAMA Merge. When used with a TAMA Merge run with multiple input annotations, tama_remove_single_read_models_levels.py can filter out models based on the number of supporting sources for each transcript model. When TAMA Merge is used to merge a long read based annotation with a reference annotation, tama_remove_single_read_models_levels.py can be used to filter out models in the long read annotation that do not match the reference annotation. This is how TAMA performs guided annotation.
TAMA Find Model Changes
The tama_find_model_changes.py tool identifies reads which have different transcript/gene model assignments between different pipelines; this is referred to as read jumbling in this study. The tool takes as input a TAMA Merge annotation generated by merging the annotations from the two pipelines to be compared, along with a read support file generated by tama_read_support_levels.py. Read jumbles are identified by taking the read IDs and comparing the transcript models they are assigned to within the TAMA Merge annotation file. Any read that supports more than one transcript model is considered to be involved in a read jumbling event.
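The underlying check is simple; a minimal sketch (illustrative, not the tool's code):

    from collections import defaultdict

    def find_jumbled_reads(read_support):
        # read_support maps each merged transcript ID to its supporting read
        # IDs. A read supporting more than one merged transcript model was
        # assigned to different models by the two pipelines: a jumbling event.
        models_per_read = defaultdict(set)
        for transcript_id, read_ids in read_support.items():
            for read_id in read_ids:
                models_per_read[read_id].add(transcript_id)
        return {read: models for read, models in models_per_read.items()
                if len(models) > 1}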
TAMA ORF/NMD Pipeline
The TAMA ORF/NMD pipeline is a method for identifying open reading frames (ORF) from transcript models and relating them to known protein coding genes. Nonsense mediated decay (NMD) product predictions are also made by identifying stop codons which occur more than 50 bp upstream of a splice junction. The first step of the pipeline is the conversion of the transcript nucleotide sequences into amino acid sequences. This is done by looking for all ORFs which have a stop codon and selecting the longest ORF from each frame (the three forward strand frames). Start codons are not required for an ORF prediction; however, if a start codon is not found, the corresponding ORF is labeled as evidence that the transcript is from a degraded RNA. BlastP is then used to relate the resulting amino acid sequences to a protein database. The ORF from each transcript with the best hit to the database is then selected as the predicted true ORF. Using the ORF information, the transcripts are then labeled with attributes based on the protein hit.
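The per-frame ORF search and the NMD rule can be sketched as follows (illustrative only, not the pipeline's code):

    def longest_orfs(seq):
        # Return the longest stop-terminated ORF in each of the three
        # forward-strand frames; a start codon is not required.
        stops = {"TAA", "TAG", "TGA"}
        orfs = []
        for frame in range(3):
            codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
            best, start = "", 0
            for i, codon in enumerate(codons):
                if codon in stops:
                    orf = "".join(codons[start:i + 1])
                    if len(orf) > len(best):
                        best = orf
                    start = i + 1
            orfs.append(best)
        return orfs

    def is_nmd_candidate(stop_pos, last_sj_pos):
        # NMD signature: the stop codon lies more than 50 bp upstream of the
        # transcript's final splice junction (transcript coordinates).
        return last_sj_pos - stop_pos > 50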
TAMA Degradation Signature
The TAMA Degradation Signature (DegSig) score is intended to provide a metric for the relative amount of sequencing reads originating from degraded RNA. The DegSig score is calculated by the following formula:
DegSig = (CT-NT) / CT
Where CT is the number of multi-exon transcript models from genes with more than one supporting read after running TAMA Collapse in capped mode, and NT is the number of multi-exon transcript models from genes with more than one supporting read after running TAMA Collapse in no_cap mode.
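Because capped mode keeps 5’ degraded reads as separate models while no_cap mode collapses them into longer models, CT is expected to be at least as large as NT, so the score runs from 0 (no excess capped-mode models) towards 1 with increasing degradation. A minimal worked sketch:

    def degsig(ct, nt):
        # ct: capped-mode multi-exon transcript count (genes with >1 read)
        # nt: no_cap-mode multi-exon transcript count (genes with >1 read)
        return (ct - nt) / ct

    print(degsig(1000, 700))  # 0.3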
Simulated long read datasets and processing for benchmarking
The simulated PacBio and Nanopore datasets (https://figshare.com/articles/RNA_benchmark_datasets/5360998) were produced in another study [12] using PBSIM [13]. These datasets were also used and described in the Stringtie2 paper [9].
Both datasets were mapped to chromosome 19 of the human reference genome as provided in the simulated dataset. Minimap2 [31] (version 2.15-r915-dirty) with the parameters “--secondary=no -ax splice -uf” was used for mapping. Samtools [32] (version 1.9) was used for all SAM/BAM file handling.
For the TAMA Low processing, TAMA Collapse was used with the parameters “-d merge_dup -x ${capflag} -a 200 -z 200 -sj sj_priority -log log_off -b BAM”. For the TAMA High processing, TAMA Collapse was used with the parameters “-d merge_dup -x no_cap -a 300 -m 20 -z 300 -sj sj_priority -lde 3 -sjt 10 -log log_off -b BAM”. After TAMA Collapse, both TAMA Low and TAMA High shared the same processing: tama_remove_fragment_models.py was used with default parameters to remove transcript models that appear to be fragments of longer models. This produced the final annotations for both pipelines.
For the TAMA Guided pipeline, the output from the TAMA Low TAMA Collapse run was merged with the reference annotation containing both expressed and non-expressed transcript models using TAMA Merge with “-a 300 -z 300 -m 20 -d merge_dup” parameters. The input filelist.txt file for TAMA Merge set both annotations to capped mode with full priority (1,1,1) given to the reference annotation. The tama_remove_single_read_models_levels.py tool was then used with “-l transcript -k remove_multi -s 2” parameters resulting in the final annotation. The tama_read_support_levels.py tool was used at each step of processing to keep track of read support for each transcript model.
For the Stringtie2 pipeline, Stringtie2 (v2.1.3b) was used with the “-L” parameter after mapping.
For the Stringtie2 Guided pipeline, Stringtie2 (v2.1.3b) was used with the “-L -G <reference annotation>” parameters. The reference annotation was the same annotation as used in TAMA Merge for the TAMA Guided pipeline.
For the TALON pipeline (unguided), a blank database was created using “talon_initialize_database” with default settings and an empty GFF file. Then “talon_label_reads” was used with “--t 1 --ar 20 --deleteTmp” parameters. Then default “talon” was used. This was followed by “talon_filter_transcripts” using “--maxFracA 0.5 --minCount 5 --minDatasets 1” parameters. The default “talon_create_GTF” was used to create a GTF file for the annotation.
For the TALON guided pipeline, a database was created using “talon_initialize_database” with default settings and the same GFF reference annotation file used for TAMA Guided and Stringtie2 Guided. Then default “talon” was used. This was followed by “talon_filter_transcripts” using “--maxFracA 0.5 --minCount 1 --minDatasets 2” parameters. The default “talon_create_GTF” was used to create a GTF file for the annotation.
All resulting annotations were compared to the annotation file containing all expressed transcript models using GffCompare (v0.11.2).
Universal Human Reference RNA and PacBio sequencing
RNA and cDNA library preparation and sequencing were undertaken by Pacific Biosciences, who made the data available for public use via a Github repository (https://github.com/PacificBiosciences/DevNet/wiki/Sequel-II-System-Data-Release:-Universal-Human-Reference-(UHR)-Iso-Seq). The RNA library was created by pooling the Universal Human Reference RNA (Agilent) with SIRV Isoform Mix E0 (Lexogen). cDNA was prepared from the RNA using the Clontech SMARTer kit. The sequencing library was prepared using the Iso-Seq Template Preparation for Sequel Systems protocol (PN 101-070-200) and sequenced on the Sequel II System with the "Early Access" binding kit (101-490-800) and chemistry (101-490-900), using two Sequel II SMRT Cells.
Iso-Seq Processing
The UHRR Sequel II Iso-Seq data was processed into CCS reads using the ccs tool with the parameters “--noPolish --minPasses=1”. CCS reads with cDNA primers and polyA tails were identified as full-length, non-concatemer (FLNC) reads using lima (--isoseq --dump-clips) and isoseq3 refine (--require-polya).
Lexogen SIRV Iso-Seq dataset benchmarking
The UHRR Sequel II Iso-Seq data also contained a spike-in of Lexogen SIRV RNA. For the Cupcake pipeline, we took the FLNC reads from each SMRT Cell and used Cluster/Polish for long read inter-read error correction. We then mapped the resulting reads using Minimap2 (--secondary=no -ax splice -uf -C5) to the “SIRV_isoforms_multi-fasta_170612a.fasta” reference genome assembly provided by Lexogen. After mapping, we ran Cupcake Collapse “collapse_isoforms_by_sam.py” with the settings recommended in the Cupcake manual (“--dun-merge-5-shorter”). We then used Cupcake “chain_samples.py” to merge the assemblies from each SMRT Cell. This resulted in the final annotation for the Cupcake pipeline.
For all the other pipelines (TAMA Low, TAMA High, TAMA Guided, Stringtie2, Stringtie2 Guided, TALON, and TALON Guided), we mapped the FLNC reads to the same reference genome as above using the same parameters for Minimap2.
For the TAMA Low processing, TAMA Collapse was used with the parameters “-d merge_dup -x no_cap -sj sj_priority -log log_off -b BAM -lde 5 -sjt 20 -a 100 -z 100”. For the TAMA High processing, TAMA Collapse was used with the parameters “-d merge_dup -x no_cap -sj sj_priority -log log_off -b BAM -lde 1 -sjt 20 -a 100 -z 100”. After TAMA Collapse, both the TAMA Low and TAMA High pipelines used TAMA Merge (-a 100 -z 100 -d merge_dup) to merge the TAMA Collapse outputs from each SMRT Cell. The tama_remove_single_read_models_levels.py tool was then used with “-l transcript -k remove_multi -s 2” parameters, resulting in the final annotation. The tama_read_support_levels.py tool was used at each step of processing to keep track of read support for each transcript model.
For the TAMA Guided pipeline, TAMA Collapse (-d merge_dup -x capped -sj sj_priority -log log_off -b BAM -a 0 -m 0 -z 0) was used on the Minimap2 output files for each SMRT Cell. TAMA Merge (-d merge_dup -a 0 -m 0 -z 0) was then used to combine the TAMA Collapse outputs from each SMRT Cell. TAMA Merge (-d merge_dup -a 0 -m 0 -z 0) was then used again to match the output with the SIRV annotation file (SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf). The tama_remove_single_read_models_levels.py tool was then used with “-l transcript -k remove_multi -s 2” parameters, resulting in the final annotation. The tama_read_support_levels.py tool was used at each step of processing to keep track of read support for each transcript model.
For the Stringtie2 pipeline, Stringtie2 (v2.1.3b) was used with the “-L” parameter after mapping.
For the Stringtie2 Guided pipeline, Stringtie2 (v2.1.3b) was used with the “-L -G <reference annotation>” parameters. The reference annotation was the same annotation as used in TAMA Merge for the TAMA Guided pipeline.
For the TALON pipeline (unguided), a blank database was created using “talon_initialize_database” with default settings and an empty GFF file. Then “talon_label_reads” was used with “--t 1 --ar 20 --deleteTmp” parameters. Then default “talon” was used. This was followed by “talon_filter_transcripts” using “--maxFracA 0.5 --minCount 10 --minDatasets 2” parameters. The default “talon_create_GTF” was used to create a GTF file for the annotation.
For the TALON guided pipeline, a database was created using “talon_initialize_database” with default settings and the same GFF reference annotation file used for TAMA Guided and Stringtie2 Guided. Then default “talon” was used. This was followed by “talon_filter_transcripts” using “--maxFracA 0.5 --minCount 5 --minDatasets 2” parameters. The default “talon_create_GTF” was used to create a GTF file for the annotation.
All resulting annotations were compared to the Lexogen SIRV annotation file (https://www.lexogen.com/wp-content/uploads/2018/08/SIRV_Set2_Sequences_170612a-ZIP.zip) using GffCompare (v0.11.2).
Chicken Brain RNA and PacBio sequencing
The non-cap selected chicken brain Iso-Seq data is from the European Nucleotide Archive submission PRJEB13246 which was previously analyzed and published [4].
The cap selected chicken brain Iso-Seq data was from an adult Advanced Intercross Line chicken whole brain sample. The RNA was extracted from the tissue sample using the Qiagen RNeasy Mini Kit. The RNA was converted to cDNA using the Lexogen TeloPrime kit. The resulting cDNA library was sent to Edinburgh Genomics for sequencing on the Sequel system using 2.0 chemistry.
TAMA Low Pipeline for UHRR
Full descriptions of the TAMA algorithms can be found in the wiki pages of the Github repository (https://github.com/GenomeRIK/tama/wiki). FLNC reads were mapped to GRCh38 (Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa) using Minimap2 (--secondary=no -ax splice -uf -C5 -t 8). The resulting bam files were then split into 12 smaller bam files using tama_mapped_sam_splitter.py, which splits bam files by chromosome and thus avoids splitting reads from the same gene across files. Split bam files were annotated using TAMA Collapse (-d merge_dup -x no_cap -a 100 -z 100 -sj sj_priority -lde 5 -sjt 20 -log log_off) and then merged into a single bed file using TAMA Merge (-a 100 -z 100). The tama_read_support_levels.py tool was used at each step of processing to keep track of read support for each transcript model.
TAMA High Pipeline for UHRR
TAMA Collapse was run on the split bam files using more stringent parameters that filter out any mapped read with more than 1 error within 20 bp of a splice junction (-d merge_dup -x no_cap -a 100 -z 100 -sj sj_priority -lde 1 -sjt 20 -log log_off). Merging was done in the same manner as in the TAMA Low pipeline. Transcript models supported only by reads from a single SMRT Cell were filtered out using tama_remove_single_read_models_levels.py (-l transcript -k remove_multi -s 2). The tama_read_support_levels.py tool was used at each step of processing to keep track of read support for each transcript model.
Polish Pipeline for UHRR
FLNC reads from the isoseq3 refine step were clustered using isoseq3 cluster and isoseq3 polish with default parameters. The output high-quality transcripts were mapped to the genome using Minimap2 (--secondary=no -ax splice -uf -C5 -t 8) and processed using TAMA Collapse (-d merge_dup -x no_cap -a 100 -z 100 -sj sj_priority -lde 5 -sjt 20 -log log_off). The tama_read_support_levels.py tool was used at each step of processing to keep track of read support for each transcript model.
Lordec Pipeline for UHRR
FLNC reads from the isoseq3 refine step were error corrected using LoRDEC (-k 31 -s 3) with short read RNA-seq data from the Universal Human Reference RNA (Agilent) (https://www.ncbi.nlm.nih.gov/sra/SRX1426160) (https://rnajournal.cshlp.org/content/22/4/597.full.pdf). The resulting error-corrected reads were processed in the same way as in the TAMA Low pipeline, starting from the mapping step. The tama_read_support_levels.py tool was used at each step of processing to keep track of read support for each transcript model.
Finding transcript matches and loci overlap between Iso-Seq annotations and the Ensembl annotation
We used TAMA Merge to compare the annotations from each Iso-Seq pipeline (TAMA Low, TAMA High, Polish, and Lordec) to the Ensembl v94 annotation. All input annotations were set to capped mode in the input filelist.txt files. The “-a 300 -z 300 -m 0 -d merge_dup” parameters were used to run TAMA Merge. Transcript matches were identified from the trans_report.txt file while gene loci overlap was identified from the gene_report.txt file.
Comparing 5’ completeness between the TAMA High, Polish, and Ensembl v94 annotations
We used TAMA Merge to compare pairs of annotations (TAMA High-Polish, TAMA High-Ensembl, Polish-Ensembl). Both annotations in each merge were given no_cap parameters in the filelist.txt input file. We used the same TAMA Merge settings as were used for identifying matching transcript models between annotations. We used the TAMA Merge trans_report.txt output file to identify which source annotation had the longer 5’ representation for each matching transcript model.
Degradation Signature Analysis
We split the SAM files from the mapping by chromosome. We then used these single chromosome SAM files as inputs to two TAMA Collapse runs, one using the capped mode and the other using the no_cap mode. Both runs used “-a 100 -z 100 -sj sj_priority -lde 5 -sjt 20 -log log_off -b BAM” parameter settings. We then used the trans_read.bed files from each pair of TAMA Collapse runs as inputs for the tama_degradation_signature.py tool, which calculated the DegSig scores.
Mismatch and Wobble Analysis
The mismatch profiles for the mapped FLNC, Cluster/Polish corrected, and LoRDEC corrected reads were extracted from the TAMA Collapse read.txt output files generated in each pipeline.
To assess the wobble between each pipeline and the Ensembl annotation, we used TAMA Merge with parameter settings (-a 300 -z 300 -m 30 -d merge_dup) which consider any transcripts with up to 300 bp difference in their transcription starts and ends and up to 30 bp difference in their splice junction starts and ends to have “nearly identical structures”. This is the definition of matching at the transcript level.
Read Jumbling Analysis
Read IDs were tracked through each processing step using the tama_read_support_levels.py tool. TAMA Merge was used to combine the annotations from the different pipelines (TAMA Low-Polish, TAMA Low-Lordec) using the same parameters as were used in the wobble analysis. The TAMA Merge output and tama_read_support_levels.py outputs were used as input for the tama_find_model_changes.py tool, which identified reads that had different transcript model assignments between each pair of pipelines.
Coding Potential Analysis
For the Ensembl match evidence of coding potential, we labelled the Iso-Seq annotation genes as coding if they had any overlap on the same strand as an Ensembl-annotated protein coding gene.
CPAT was used with default parameters and the built-in Human Hexamer models. A cutoff score of 0.364 (suggested by the CPAT creators [28]) was used to classify transcripts as coding or non-coding.
We used the TAMA ORF/NMD pipeline for the third source of coding evidence. The transcript models were converted into fasta sequences using Bedtools [33]. ORFs were predicted for each transcript from the fasta file and then translated into amino acid sequences. BlastP [34] (-evalue 1e-10 -ungapped -comp_based_stats F) was used to match the amino acid sequences to the UniRef90 database, and the top hits were selected as the best ORF predictions. Transcripts with no hits were considered to be non-coding.
Matching TAMA High annotation to Ensembl v100
To identify gene models in the Ensembl v100 human annotation matching gene models predicted in the TAMA High annotation but not present in the Ensembl v94 human annotation, we used TAMA Merge with “-m 0 -a 300 -z 300” parameters in capped mode for all three annotations (TAMA High, Ensembl v94, and Ensembl v100). These parameters group transcript models between annotations if they share the exact same splice junctions and exon chaining, with an allowance of up to 300 bp difference in TSS and TES. We then identified all gene models which were the product of merging a TAMA High annotation gene with an Ensembl v100 gene, with no Ensembl v94 gene represented at that locus.