DEEP-LONG: A Fast and Accurate Aligner for Long RNA-Seq

Background In recent years, advances in sequencing technology have made long reads widely used in many studies, including transcriptomics studies. Long reads offer clear advantages over short reads, but aligning them also differs from aligning short reads. Many tools can already process long RNA-Seq reads, yet several problems remain to be solved. We developed Deep-Long, a fast and accurate aligner for long RNA-Seq reads. Deep-Long handles the difficulties caused by complicated gene structures and sequencing errors well, and it performs especially well on alternative splicing and small exons. When the sequencing error rate is low, Deep-Long rapidly produces more accurate results. As the sequencing error rate rises, Deep-Long takes more time but remains faster and more accurate than most other tools. Conclusions

especially suitable for the detection of novel splice sites. In recent years, long reads have been widely used to analyze the genomes of various organisms, including prokaryotes and eukaryotes (7).
RNA-Seq alignment is a fundamental step in RNA sequence analysis. Because long reads and short reads have different characteristics, their alignment methods differ in many ways. Many long RNA-Seq aligners can now handle third-generation sequencing data. BLAT is an early alignment method. GMAP is widely used in many studies; compared with BBMap and STAR, GMAP performs better and uses less time (9). GraphMap, Magic-BLAST, Minimap2 and deSALT are more recent works, and they all use reference indexes that differ from the traditional hash index in order to reduce seeding time. Minimap2 collects the minimizers shared between the reference and the reads, finds the optimal seed chain, and then performs DP-based global alignment between adjacent anchors in the chain. Minimap2 can map DNA or long mRNA sequences against a large reference database. Its optimized seeding and DP (dynamic programming) strategies let Minimap2 align reads in a very short time. deSALT uses a de Bruijn graph-based index to align seeds from all reads, uses these seed alignments to construct an optimal transcript, and then gives the final alignment of each read according to that transcript. Because the probability that different reads carry the same sequencing error at the same site is very low, constructing an optimal transcript helps reads span high-error regions and yields more credible alignments, so deSALT performs well when reads have a high sequencing error rate.
Deep-Long pays more attention to hard-to-align regions, such as small exons and exons with many SNPs and RNA editing sites. Deep-Long aligns reads at two granularities with two reference indexes. At the first granularity, Deep-Long uses a BWT (Burrows-Wheeler transform) index to align long consecutive regions of the reads. The most significant advantage of the BWT index is its small storage footprint, so Deep-Long can easily run on a desktop computer. For hard-to-align regions, Deep-Long uses an 8 bp hash index of the reference in order to find more seed clues for exons and thus give more credible alignments. In most cases Deep-Long rapidly produces more accurate results, especially on alternative splicing and small exons. When the sequencing error rate is very high, it is hard to find long consecutive seeds within a single read, so Deep-Long spends more time on hard regions, and similar sequences can cause the 8 bp hash seeds to give wrong guidance.
With the development of sequencing technology, various strategies have been applied to reduce the sequencing error rate: the accuracy of PacBio ROI reads exceeds 98%, and ONT with the INC-Seq strategy obtains nanopore reads with a median accuracy above 97%. More high-accuracy long reads will therefore be used in future RNA-Seq studies. When the sequencing error rate is very high, we do not recommend Deep-Long; deSALT is a better choice.

Method
Deep-Long adopts a seed-and-extend strategy, like most aligners. Seed-and-extend is effective for spliced RNA-Seq alignment because combining seeds from different regions reveals the splice sites.
The seeding step usually takes a large share of the running time, and different alignment tools choose different ways to reduce it.

1) Finding MEM seeds using the BWT index.
Deep-Long uses the BWT index to find overlapping Maximal Exact Match (MEM) seeds from the beginning to the end of each read. The MEM concept has been used in STAR, MUMmer and MAUVE. Searching for MEM seeds sequentially, and only in the unmapped regions of the read, makes the STAR algorithm extremely fast. Because of the high sequencing error rate of TGS reads, however, many MEM seeds may align to the wrong regions of the reference and the correct alignment of the entire read may finally be missed. A small overlap between seeds prevents this deviation at almost no extra time cost. Deep-Long then performs a finer-grained deep alignment of the unmapped regions between MEM seeds. Thanks to BWA, MEM seeds can be found easily with the BWT index.
For each read R and a given reference genome sequence G, MEM seeds (i, j, l) are searched starting from the first base of R, where R_i is the i-th base of the read, G_j is the j-th base of the reference, and l is the length of the seed. A seed (i, j, l) is the Maximal Exact Match of length l that starts at the i-th base of the read and the j-th base of the reference, so R_{i+l} differs from G_{j+l}. The next MEM seeds are then searched starting from bases i + l - 4, i + l and i + l + 4 of the read, and so on until the end of the read. DEEP-LONG accepts MEM seeds longer than 16 bp; when a MEM seed (i, j, l) is shorter than 16 bp, a new MEM seed is searched starting from base i + 1 of the read.
Sometimes this seed-finding strategy may introduce bias, so DEEP-LONG uses non-overlapping 25 bp seeds to correct it. For reads with a high sequencing error rate, DEEP-LONG suggests taking a 15 bp seed every 10 bases.
Each seed is extended in both the forward and backward directions to make sure that each seed (i, j, l) is the longest exact match, so that R_{i-1} differs from G_{j-1} and R_{i+l} differs from G_{j+l}.
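The seeding loop described in this step can be sketched as follows. This is a minimal illustration, not DEEP-LONG's implementation: it replaces the BWT index with a brute-force scan of the reference, uses 0-based coordinates, and treats a seed of at least `min_len` bases as accepted. The function names and the `min_len` and `step` parameters are hypothetical.

```python
def longest_match_at(read, ref, i):
    """Return (j, l): start on ref and length of the longest exact match of read[i:]."""
    best_j, best_l = -1, 0
    for j in range(len(ref)):  # brute force stands in for a BWT lookup
        l = 0
        while i + l < len(read) and j + l < len(ref) and read[i + l] == ref[j + l]:
            l += 1
        if l > best_l:
            best_j, best_l = j, l
    return best_j, best_l


def find_mem_seeds(read, ref, min_len=16, step=4):
    """Collect overlapping MEM seeds (i, j, l) along the read, 0-based."""
    seeds = set()
    pending = [0]   # read positions from which a seed search is still due
    seen = set()    # each read position is searched at most once
    while pending:
        i = pending.pop()
        if i in seen or i < 0 or i >= len(read):
            continue
        seen.add(i)
        j, l = longest_match_at(read, ref, i)
        if l >= min_len:
            # Restart slightly before, at, and slightly after the seed end,
            # so that consecutive seeds may overlap a little.
            pending.extend([i + l - step, i + l, i + l + step])
            # Extend backwards so the seed is maximal on both sides.
            while i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                i, j, l = i - 1, j - 1, l + 1
            seeds.add((i, j, l))
        else:
            pending.append(i + 1)  # seed too short: retry from the next base
    return sorted(seeds)
```

For a read made of two exons joined together, the sketch returns one seed per exon, each pointing to its own position on the reference.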
2) Finding extend seeds on the reference.
Because of SNPs, RNA editing and sequencing errors, an RNA-Seq read is not continuous with respect to the reference. For each seed (i, j, l), DEEP-LONG finds Extend seeds from both tails. For example, seed1 (i1, j1, l1) is an Extend seed of seed (i, j, l) from the back tail if i + l ≤ i1 < i + l + 10, j + l - 5 ≤ j1 < j + l + 15 and l1 ≥ 4. DEEP-LONG chooses the best combination of Extend seeds, adds them to the MEM seed set, and continues to find Extend seeds of seed1 from the back tail recursively. Extend seeds help the alignment cross regions with SNPs, SNVs (single-nucleotide variants), RNA editing and sequencing errors, provide more information about the exon area, and thereby help to pick the correct seed chain and obtain the correct alignment.
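The back-tail window above can be written as a small predicate. The sketch below assumes that the length condition applies to the candidate seed (l1 ≥ 4); the function names are hypothetical, and choosing the best combination of candidates is not modeled.

```python
def is_back_extend_seed(seed, cand, min_len=4):
    """True if cand = (i1, j1, l1) lies in the back-tail window of seed = (i, j, l)."""
    i, j, l = seed
    i1, j1, l1 = cand
    in_read_window = i + l <= i1 < i + l + 10      # at most 9 bases further on the read
    in_ref_window = j + l - 5 <= j1 < j + l + 15   # small shift allowed on the reference
    return in_read_window and in_ref_window and l1 >= min_len


def back_extend_candidates(seed, candidates, min_len=4):
    """Keep only the candidates that qualify as back-tail Extend seeds."""
    return [c for c in candidates if is_back_extend_seed(seed, c, min_len)]
```

For a seed (0, 100, 20), the candidate (22, 121, 6) qualifies, while (30, 121, 6) starts too far along the read and (22, 121, 3) is too short.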
DEEP-LONG calculates a score for every seed. The initial score of a seed (i, j, l) is l × match score. After the seed chain has been found, the score of a seed is the score of the best seed chain ending with that seed. The score of seed S_i is calculated from the scores of the seeds S_j that can precede it in a chain,
where f(i) is the final score of seed S_i, f(j) is the final score of seed S_j, s(i) is the initial score of seed S_i, (i, j) is the overlap of S_i and S_j, and β(i, j) is the gap score between S_i and S_j.
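One way to read these definitions is as a standard chaining recurrence. The sketch below is an assumption, not the authors' formula: it takes f(i) = max(s(i), max over j of f(j) + s(i) - overlap(i, j) × match score - gap(i, j)), treats the gap term as a penalty supplied by the caller, and uses 0-based seeds (i, j, l). The function name and the `gap_cost` callback are hypothetical.

```python
def chain_scores(seeds, match_score=1, gap_cost=lambda read_gap, ref_gap: 0):
    """Score the best chain ending at each seed; returns (sorted_seeds, f, prev)."""
    seeds = sorted(seeds)
    f = []      # f[a]: score of the best chain ending with seeds[a]
    prev = []   # prev[a]: index of the predecessor in that chain, or -1
    for a, (i, j, l) in enumerate(seeds):
        s = l * match_score            # initial score of the seed
        best, best_prev = s, -1        # a chain may consist of this seed alone
        for b in range(a):
            ib, jb, lb = seeds[b]
            if ib >= i or jb >= j:
                continue               # predecessor must start earlier on both sequences
            # Bases counted twice when the two seeds overlap on read or reference.
            overlap = max(0, ib + lb - i, jb + lb - j)
            read_gap = max(0, i - (ib + lb))
            ref_gap = max(0, j - (jb + lb))
            cand = f[b] + s - overlap * match_score - gap_cost(read_gap, ref_gap)
            if cand > best:
                best, best_prev = cand, b
        f.append(best)
        prev.append(best_prev)
    return seeds, f, prev
```

Following `prev` back from the highest entry of `f` recovers the best chain. A real gap penalty would have to tolerate long reference gaps, since introns separate the exons of a spliced read.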
To produce the final alignment, DEEP-LONG considers chains whose score exceeds 0.7 × read length × match score, or the best chain if its score exceeds 0.3 × read length × match score. If the sequencing error rate is very high, it is hard to find enough seeds to support a correct result, and the read then fails to align. After finding 8 bp seeds, DEEP-LONG also finds the extend seeds of these seeds, finds the best candidate chain, and uses a DP strategy to fill the remaining gaps.
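The 8 bp hash seeds can be illustrated with a plain dictionary from every 8-mer of the reference to its positions. This is a minimal sketch under stated assumptions: the function names are hypothetical, coordinates are 0-based, and the lookup scans a caller-supplied read interval, which stands in for an unmapped region between MEM seeds.

```python
from collections import defaultdict


def build_kmer_index(ref, k=8):
    """Map each k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        index[ref[pos:pos + k]].append(pos)
    return index


def kmer_seeds(read, index, start, end, k=8):
    """Return seeds (i, j, k) for every k-mer of read[start:end] found in the index."""
    seeds = []
    for i in range(start, min(end, len(read)) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for j in index.get(read[i:i + k], ()):
            seeds.append((i, j, k))
    return seeds
```

Because an 8-mer occurs many times in a large genome, a practical lookup would also restrict the reference positions to a window around the neighbouring seeds; that restriction is left out here.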

5) Checking exons and splice sites.
Most splice sites carry the splice signal GT-AG or CT-AC. DEEP-LONG takes the splice signal into account: a splice signal adds an extra score to the alignment score. A long RNA-Seq read usually spans more than two exons, and GT-AG and CT-AC come from different DNA strands, so DEEP-LONG checks the splice sites to make sure that all splice sites of the same read carry signals from the same strand.
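The strand check can be sketched as follows, assuming that each intron is given as a 0-based, end-exclusive interval on the reference and that junctions without a canonical signal are simply ignored. The function names are hypothetical and the score bonus is not modeled.

```python
def junction_strand(ref, intron_start, intron_end):
    """Return '+' for GT-AG, '-' for CT-AC, or None if no canonical signal is present."""
    donor = ref[intron_start:intron_start + 2]      # first two intron bases
    acceptor = ref[intron_end - 2:intron_end]       # last two intron bases
    if donor == "GT" and acceptor == "AG":
        return "+"
    if donor == "CT" and acceptor == "AC":
        return "-"
    return None


def consistent_strand(ref, introns):
    """True if all canonical junctions of one read point to the same strand."""
    strands = {junction_strand(ref, start, end) for start, end in introns}
    strands.discard(None)   # assumption: non-canonical junctions do not vote
    return len(strands) <= 1
```

A read whose junctions mix GT-AG and CT-AC signals fails the check, which marks at least one of its splice sites as suspicious.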
Because of the high error rate, exons shorter than 40 bp cause more wrong results. DEEP-LONG checks exons shorter than 40 bp, and if removing a short exon increases the alignment score, DEEP-LONG discards that exon. When the error rate of the input reads is somewhat high, the tail exon of a read is more likely to be wrong, so DEEP-LONG suggests cutting off tail exons shorter than 25 bp.
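The short-exon check can be sketched as a filter over the exon list of one read. The sketch assumes that exons are 0-based, end-exclusive reference intervals in read order, that a tail exon may sit at either end of the read, and that re-scoring is done by a caller-supplied function; `filter_exons`, `rescore` and `trim_tails` are hypothetical names.

```python
def filter_exons(exons, rescore, short_len=40, tail_len=25, trim_tails=False):
    """Drop short exons whose removal raises the score; optionally trim short tails."""
    kept = list(exons)
    for exon in exons:
        if exon[1] - exon[0] < short_len and len(kept) > 1:
            without = [e for e in kept if e != exon]
            # Discard the exon only if the alignment scores higher without it.
            if rescore(without) > rescore(kept):
                kept = without
    if trim_tails:
        # Suggested for noisier reads: cut tail exons shorter than tail_len.
        while len(kept) > 1 and kept[0][1] - kept[0][0] < tail_len:
            kept.pop(0)
        while len(kept) > 1 and kept[-1][1] - kept[-1][0] < tail_len:
            kept.pop()
    return kept
```

With a toy `rescore` that sums the exon lengths and subtracts 30 for every exon shorter than 40 bp, a 10 bp exon between two 100 bp exons is removed, because the score rises from 180 to 200.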

Results
deSALT provides abundant data and a comprehensive evaluation program. We used the deSALT simulation scripts to simulate 5 long-read RNA-seq datasets of human (GRCh38, version 94) and 5 long-read RNA-seq datasets of mouse (GRCm38, version 94) at a depth of 30, termed "PacBio ROI reads", "PacBio subreads", "ONT 2D reads", "ONT 1D reads" and "NS-ONT reads", each with a different sequencing error rate and read length. Except for the "NS-ONT reads", which were generated by NanoSim (21), the datasets were generated by PBSim (22), and all detailed parameters follow deSALT. DEEPL (Deep-Long) and three other frequently used tools, deSALT, Minimap2 and GMAP, were applied to the 10 simulated datasets and 2 real datasets. All tools were run without any gene annotations. Software versions and command lines are provided in the Supplementary. Following deSALT, we also use five metrics to describe the sensitivity, accuracy and performance of the aligners.
Base%: the proportion of bases being correctly aligned to their ground truth positions.
Exon%: the proportion of exons being correctly mapped.
can hardly find seed combinations that score above the processing threshold, and finally fails to give alignment results. But it is especially able to handle many difficult issues.
For reads containing small exons, DEEPL also performs well. DEEPL uses an 8 bp k-mer hash index to gather more information about the read, and 8 bp k-mers find small exons more easily, so DEEPL can readily process reads with small exons. Furthermore, for reads with alternative splicing, DEEPL still performs well, although sometimes below deSALT. DEEPL puts considerable effort into checking whether exons are credible, and it aligns each read independently, so the splice patterns of other reads hardly affect the alignment. DEEPL can therefore process alternative splicing just like normal splicing, with no special handling.
Moreover, the number of exons within a read has no effect on DEEPL's alignment results; reads with different numbers of exons perform nearly the same. DEEPL's alignment of a read depends on seed finding: if the seeds confirm that an exon exists, DEEPL reports it in the alignment. DEEPL searches 8 bp hash seeds on the reference intensively, trying to find all exons in the reads.
Overall, sequencing error is the major factor affecting DEEPL's performance. The number of exons, the length of exons and the transcript structure have little effect on the alignment results, and DEEPL processes them easily. Sequencing error also affects the speed of DEEPL. DEEPL is generally slower than deSALT and Minimap2 but faster than GMAP. However, when the sequencing error rate is very low, DEEPL can even use less time than deSALT and Minimap2, because the structure of the BWT index lets DEEPL find accurate and unique MEM seeds very quickly.
Here we also test DEEPL on two human real datasets, SRR11638299 and SRR3476690, and two mouse real datasets, SRR7345558 and SRR7345562. SRR11638299 and SRR3476690 were sequenced by PACBIO_SMRT (PacBio RS II) from different samples. SRR7345558 and SRR7345562 are single-cell isoform sequencing of whole mouse tissue RNA from the same sample, but SRR7345558 was sequenced by OXFORD_NANOPORE (MinION) and SRR7345562 by PACBIO_SMRT (Sequel).
We evaluate the real-data alignments with a program different from the one used by deSALT: deSALT considers only the first alignment of each read, whereas we prefer to consider the best alignment of each read. All the aligners we tested align most reads well. DEEPL also performs well on real data. DEEPL consistently finds most of the transcripts and exons that agree with the annotation. DEEPL's ability to find exons is unaffected by exon length. DEEPL also finds more small exons shorter than 20 bp. For exons shorter than 15 bp, GMAP always performs well. Unlike on simulated data, DEEPL performs as well on OXFORD_NANOPORE real data as on PACBIO_SMRT real data. GMAP, however, has clearly lower exon-level and transcript-level results on OXFORD_NANOPORE real data. The time cost on real data follows the same trend as on simulated data, except that GMAP uses notably less time on SRR11638299.
We notice that the GMAP alignment results on some datasets show unanticipated performance. We repeated each unexpected result many times for all aligners and tried to eliminate human and other factors to make sure the results are credible, but we still do not know why this happens. At present, many tools, such as LCS, can use SGS reads to reduce the sequencing error rate of long reads.
As sequencing technology develops, more studies are working on reducing sequencing error rates, for example with PacBio ROI reads and INC-Seq, and more reads with low sequencing error rates will be used in future studies. Deep-Long will be more efficient on such data.

Availability of data and material
The source code of DEEPL is available at: https://github.com/cathy-houli/DEEPL.
The data simulation and benchmarking scripts and evaluation programs are available at: https://github.com/hitbc/deSALT.