Ultrasensitive gene fusion detection reveals fusion variant associated tumor heterogeneity


Gene fusions are common drivers and therapeutic targets in cancers, but clinical-grade bioinformatics callers are lacking. Here we introduce a novel method, SplitFusion, which is fast by leveraging BWA-MEM split alignments, can detect cryptic splice site fusions, and can infer frame-ness and exon-boundary alignments for functional prediction and for minimizing false positives. SplitFusion demonstrates superior sensitivity, specificity and accuracy, and consumes minimal computing resources. In our study of 1,076 formalin-fixed paraffin-embedded lung cancer samples, SplitFusion detected not only common fusions (EML4 4.7%, ROS1 2.0% and RET 1.1%) with various partners, but also rare (KLC1-ALK, CD74-NRG1, and TPR-NTRK1) and novel (FGFR3-JAKMP1, CLIP2-BRAF, and ITPR2-ETV6) fusions. In 35 glioblastoma samples, SplitFusion-Target detected six (17%) EGFR vIII (exons 2-7 deletion) cases. Furthermore, we find that the EML4-ALK variant 3 is significantly associated with the occurrence of multiple breakpoint-defined subclones, that is, high intratumor heterogeneity. In conclusion, SplitFusion is well-suited for clinical use and for studying fusion-defined tumor heterogeneity.


(with all its split alignments) will be kept among the duplicated UMI-LS reads, resulting in a consolidated BAM file (Fig. 1b).

SplitFusion uses the lengths of these unmapped sequences (soft-clipped length) to adjust for the ligation site mapping position (encoded in read ID) for later accurate calculation of the number of partner ends (ligation sites). SplitFusion then sorts all the split alignments from the same query read by their original segment orders on the query read.
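The ordering step can be illustrated with a minimal Python sketch. It is not SplitFusion's own code: the `SplitAlignment` record and the helper names are hypothetical, and the sketch assumes that each split alignment carries a SAM-style CIGAR string in which leading and trailing clips (`S` or `H`) mark the unaligned parts of the query. Because SAM reports reverse-strand CIGARs relative to the forward reference strand, the offset on the original read is taken from the trailing clip for reverse-strand alignments.

```python
import re
from dataclasses import dataclass

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


@dataclass
class SplitAlignment:
    """Hypothetical record for one split alignment of a query read."""
    chrom: str
    pos: int      # 1-based leftmost reference position
    strand: str   # "+" or "-"
    cigar: str    # SAM CIGAR string


def clip_lengths(cigar):
    """Return (leading_clip, trailing_clip) summed over S and H operations."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    lead = 0
    for n, op in ops:
        if op not in "SH":
            break
        lead += n
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return lead, trail


def query_start(aln):
    """0-based offset of the aligned segment on the original read."""
    lead, trail = clip_lengths(aln.cigar)
    # Reverse-strand CIGARs are written in reference orientation, so the
    # clip that precedes the segment on the read is the trailing one.
    return lead if aln.strand == "+" else trail


def sort_by_query_order(alignments):
    """Sort split alignments by their position on the query read."""
    return sorted(alignments, key=query_start)
```

For example, a read whose first 60 bases map to one locus (`60M40S`) and whose last 40 bases map elsewhere (`60S40M`) is returned with the `60M40S` segment first, whatever order the aligner emitted them in.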

From the transformed split-read alignments, SplitFusion infers breakpoint candidates. For any two neighboring alignments on the same query read, if they align to different chromosomes, or to the same chromosome but more than 750,000 bases apart (the largest intron size used by BLAT 22), they are deemed candidate fusion partners and output to separate files (left and right). The partners are then merged by their read ID, and their junction-corresponding chromosomal alignment positions are deemed candidate breakpoints (Fig. 1c). For a typical two-split alignment read, the merged new record has the

As an initial step to minimize false positive results, SplitFusion filters the candidate breakpoints by minimum alignment length, minimum non-overlapping length of split parts, maximum gap length, and minimum number of ligation sites (UMI-LS) (Fig. 1d). SplitFusion can apply different minimum mapping lengths to different partner alignments. Typically, for anchored-end NGS data, this feature allows high fusion-calling specificity by specifying a high mapping length (default 25; Fig. 1d) on the ligation end, and high fusion-calling sensitivity by specifying a short mapping length (default 18) on the anchored end (additional specificity on the anchored end was already imposed in the wet-lab protocol by the outer gene-specific primers [GSP1s; Fig. 1a], which, however, are not present in the NGS data).
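The candidate rule and the asymmetric length filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the SplitFusion implementation: the `Segment` record and function names are hypothetical, the same-chromosome distance is assumed to be the gap between the two aligned intervals, and only the mapping-length filter is shown (the non-overlapping length, gap length and UMI-LS count filters are omitted). The thresholds of 750,000 bases, 25 and 18 are the values quoted in the text.

```python
from dataclasses import dataclass

MAX_INTRON = 750_000        # distance threshold quoted in the text
MIN_LEN_LIGATION_END = 25   # default minimum mapping length, ligation end
MIN_LEN_ANCHORED_END = 18   # default minimum mapping length, anchored end


@dataclass
class Segment:
    """Hypothetical record for one aligned segment of a query read."""
    chrom: str
    start: int        # reference start of the aligned interval
    end: int          # reference end of the aligned interval
    aligned_len: int  # number of query bases aligned


def is_candidate_pair(left, right, max_intron=MAX_INTRON):
    """True if two neighboring segments look like fusion partners."""
    if left.chrom != right.chrom:
        return True
    # Assumed distance definition: gap between the two intervals.
    gap = max(left.start, right.start) - min(left.end, right.end)
    return gap > max_intron


def candidate_pairs(segments):
    """Scan neighboring segments, given in query order, for candidates."""
    return [(a, b) for a, b in zip(segments, segments[1:])
            if is_candidate_pair(a, b)]


def passes_length_filter(anchored, ligation,
                         min_anchored=MIN_LEN_ANCHORED_END,
                         min_ligation=MIN_LEN_LIGATION_END):
    """Apply a different minimum mapping length to each partner end."""
    return (anchored.aligned_len >= min_anchored
            and ligation.aligned_len >= min_ligation)
```

Keeping the two minimum lengths as separate parameters mirrors the trade-off in the text: the anchored end can be short because primer specificity was already imposed in the wet lab, while the ligation end carries the burden of specificity.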

Breakpoint gene annotation, frame-ness, exon boundary, further filtering and target reporting

Next, SplitFusion annotates candidate breakpoints with gene names, exon numbers, and cDNA positions (Fig. 1e). It is common that split alignments of the same query read share a few identical ending DNA bases, which are "double counted" typically because they belong to Gene A exon and is

together with any potential partners, will be annotated. Following annotation, SplitFusion infers the frame-ness of fusion transcript (in-frame/out-frame or not applicable; Fig. 1f) and judges whether the breakpoints are on known exon boundary (both/one/none) according to RefSeq 26 (Fig. 1g).
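One common way to make such a frame-ness call is to compare codon phases at the junction, sketched below in Python. The text does not spell out SplitFusion's exact rule, so this is an assumption: the function names are hypothetical, positions are taken to be 1-based coordinates within each gene's coding sequence, bases inserted at the junction are ignored, and a breakpoint outside the coding sequence is passed as `None`.

```python
def frame_status(cds_pos_5p, cds_pos_3p):
    """Classify a fusion junction as in-frame, out-frame or not applicable.

    cds_pos_5p: 1-based coding position of the last base kept from the
                5' partner, or None if the breakpoint is non-coding.
    cds_pos_3p: 1-based coding position of the first base kept from the
                3' partner, or None if the breakpoint is non-coding.
    """
    if cds_pos_5p is None or cds_pos_3p is None:
        return "not applicable"
    # Bases already consumed in the current codon of the 5' partner.
    phase_5p = cds_pos_5p % 3
    # Codon offset at which the 3' partner resumes.
    phase_3p = (cds_pos_3p - 1) % 3
    return "in-frame" if phase_5p == phase_3p else "out-frame"


def exon_boundary_status(on_boundary_5p, on_boundary_3p):
    """Summarize how many breakpoints fall on a known exon boundary."""
    count = int(bool(on_boundary_5p)) + int(bool(on_boundary_3p))
    return {2: "both", 1: "one", 0: "none"}[count]
```

For instance, a 5' partner ending at coding position 300 (a complete codon) joined to a 3' partner resuming at position 301 keeps the reading frame, whereas resuming at position 302 shifts it.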

Optionally, a backend "whitelist" database can be used for targeted output to report alternative splicing (e.g. for MET exon 14 skipping), exon deletion (e.g. EGFR vIII exons 2-7 deletion) and gene truncation (e.g. FGFR1/2/3 exon 18 truncation), as well as a "blacklist" database to remove recurring

detected, ten fusion-supporting split-reads (in FASTQ) will be randomly extracted from the consolidated BAM, saved as a text file that can be examined manually, and also converted into a BAM file that can be visualized with third-party software or with an IGV API (Fig. 1h)

reported six EGFR vIII cases (Fig. 3 & Extended Data Fig. 3).

ETV6 adenocarcinoma (Fig. 4a & Extended Data Fig. 5). We did not attempt to call MET exon 14-

unknown. Using SplitFusion, we found that multiple gene fusion variants with in-frame exons of partner genes, with or without involving cryptic splice site sequences, frequently co-occurred in the same tumors (Fig. 4b). Among the kinase genes analyzed, lung tumors harboring EML4-ALK v3 and CD74-ROS1 had the most frequent co-occurrences of multiple fusion variants.

We demonstrate that SplitFusion is fast, highly sensitive, specific and computationally efficient in the

The performance of the four software tools was scored based on the following metrics: 1) Sensitivity:

Implementation and availability.

The code implementing all steps is included in the SplitFusion software package. SplitFusion is