DeBreak: Deciphering the exact breakpoints of structural variations using long sequencing reads

18 Long-read sequencing has demonstrated great potential for characterizing all types of structural 19 variations (SVs). However, existing algorithms have insufficient sensitivity and precision. To 20 address these limitations, we present DeBreak, a novel method for comprehensive and 21 accurate SV discovery. Based on alignment results, DeBreak employs a density-based 22 approach for clustering SV candidates together with a local de novo assembly approach for 23 reconstructing long insertions. A partial order alignment algorithm ensures precise SV 24 breakpoints with single base-pair resolution, and a k-means clustering method can report multi25 allelic SV events. DeBreak outperforms existing tools on both simulated and real long-read 26 sequencing data from both PacBio and Nanopore platforms. An important application of 27 DeBreak is analyzing cancer genomes for potentially tumor-driving SVs. DeBreak also 28 demonstrates excellent performance in supplementing whole-genome assembly methods. 29

. For SVs located in repetitive regions, the clustering window is automatically adjusted 137 to tolerate shifts of raw SV signals caused by repeated segments (Fig S5). 138 Previous work 35 has highlighted the functional importance of multi-allelic copy number 139 variations (CNVs) in gene dosage and gene expression. DeBreak can accurately identify multi-140 allelic SVs (mSVs). After density-based clustering, raw SV signals of candidate mSVs are 141 further clustered through a k-means clustering algorithm to characterize two non-reference 142 alleles (Fig S6). We applied this method to identify putative mSVs in HG002. In total, we 143 identified 802 multi-allelic SVs in a single genome. 144 We also benchmarked genotyping accuracy of the four tested SV callers with the high-145 confidence SV callset. On the three datasets, DeBreak and cuteSV performed better than pbsv 146 and Sniffles (Table S1). DeBreak achieved the highest genotyping accuracy in PacBio CLR and 147 Nanopore datasets, while cuteSV achieved slightly higher genotyping accuracy in the PacBio 148 HiFi dataset. 149 We then assessed the SV discovery accuracy for SVs of different sizes. DeBreak 150 achieved stable and high accuracy for small and large SVs, especially in detecting insertions 151 (Fig 2a). Notably, for ultra-large insertions longer than the sequencing reads (>10kbp), DeBreak 152 achieved higher accuracy, recall, and precision than the other three SV callers (Fig S7), 153 benefiting from its large-insertion detection module with local de novo assembly. In the PacBio 154 HiFi and Nanopore datasets, DeBreak also achieved relatively high accuracy for SVs of different 155 sizes (Fig S8). We next evaluated the accuracy of SV breakpoint positions reported by 156 DeBreak. The high sequencing error rate of long reads often causes imprecise inference of SV 157 breakpoints. We compared SV callsets from the four tested SV callers to high-confidence 158 benchmark SV callset to assess for shifts in breakpoint positions. With the breakpoint 159 refinement module, DeBreak identified 59.90% of SVs that were consistent with exact SV 160 breakpoints reported in GIAB, higher than pbsv (49.17%), Sniffles (4.99%), and cuteSV (5.18%) using the PacBio CLR dataset (Fig 2b). For PacBio HiFi and Nanopore datasets, DeBreak also 162 achieved the highest SV breakpoint accuracy among the four evaluated SV callers (Fig S9). 163 To assess the effect of sequencing depth on SV discovery accuracy, we downsampled 164 the PacBio CLR, HiFi, and Nanopore datasets by a series of depths by randomly sampling the 165 reads. At each depth, DeBreak and pbsv were applied with the default parameters. In contrast, 166 a set of parameters were tested for Sniffles and cuteSV, and the SV calls with the highest F1 167 scores were selected for comparison. Overall, SV discovery was more accurate for datasets 168 having higher sequencing depth (Fig 2c). Starting from 20x, DeBreak already achieved 169 accuracy of over 90% for PacBio CLR, HiFi, and Nanopore datasets. For datasets with depth 170 ≥20x, DeBreak consistently identified SVs with the highest accuracy among the four tested SV 171 callers. Note that DeBreak and pbsv automatically adapted to lower depths using default 172 settings, while Sniffles and cuteSV both required extra effort in manually tuning parameters to 173 optimize performance (Fig S10). Taken together, these results highlight that DeBreak can 174 accurately identify different types of SVs with precise breakpoints in real human genomes. 175 176 Comparison to assembly-based SV discovery 177 Currently, de novo assembly is used to comprehensively characterize genome-wide SVs 26-29 . To 178 compare alignment-based with assembly-based SV discovery approaches, we applied 179 DeBreak, pbsv and cuteSV on six samples from the Human Genome Structural Variation 180 Consortium (HGSVC). Three of these six samples were sequenced with the PacBio CLR 181 platform, and the other three were sequenced with the PacBio HiFi platform. For these samples, 182 highly accurate assembly-based SV callsets were generated by performing haplotype-resolved 183 de novo assembly with phased sequencing reads and subsequent SV discovery from whole-184 genome assembly with the PAV pipeline 26 . Overall, the assembly-based SV approach 185 discovered a slightly higher number of SV events (22,897 -27,187) compared with alignment-186 based methods. By treating these SVs as the "ground truth", we evaluated the SV discovery accuracy of three alignment-based SV callers. DeBreak identified SV with an average F1 score 188 of 80.09% in the six samples, which was higher than pbsv (72.68%) and cuteSV (77.38%) 189 (Table S2). In each sample, DeBreak achieved both higher recall and precision than pbsv and 190 cuteSV (Fig 3a), suggesting higher consistency with the assembly-based SV discovery than the 191 other two alignment-based SV callers. As assembly-based SV calls usually have accurate SV 192 breakpoints inferred from assembled contigs, we also compared the SV breakpoint accuracy of 193 three SV callers. With the breakpoint-refinement module, DeBreak identified 32,813 SVs with 194 exact SV breakpoints on the three PacBio CLR datasets, while pbsv reported 27,345 and 195 cuteSV reported 1,927 precise SV breakpoints, respectively (Fig 3b). pbsv or cuteSV (Fig 3c, Fig S11). In contrast, 71.9% of deletions and 71.5% of insertions 205 reported by DeBreak but not PAV were also reported by either pbsv or cuteSV, suggesting that 206 alignment-based SV callers identified SV calls more consistently. We further characterized the 207 3,385 SVs only reported by PAV and 1,292 SVs only reported by DeBreak in all six samples. 208 Note that DeBreak reported the fewest number of unique SVs. By examining the SV locations 209 on the genome, we found that there was strong enrichment in telomere regions for PAV-unique 210 SVs (43.5% located near telomeres, 5.8% located near centromeres, and 46.4% located in 211 repetitive regions) (Fig 3d, Fig S12). While DeBreak-unique SVs were enriched in the telomere 212 and centromere regions (31.2% located near telomeres, 27.7% located near centromeres, and 38.7% located in repetitive regions). Although DeBreak controls read depth and minimum 214 number of supporting reads, alignment-based SV discovery may have inaccurate read 215 alignment in these regions. However, it is also challenging to assemble reads with abnormal 216 coverage and ascertain phasing status of individual reads without bias. Additional efforts are 217 needed to validate these SVs.  (Fig S13). Among the four SV callers, DeBreak reported relatively fewer singleton SV 229 calls, especially in insertion/duplication detection, suggesting high precision of DeBreak SV 230 callsets. We also compared the SV callset of DeBreak to previously reported SV lists from long-231 read and short-read data 42 (Fig S14). DeBreak reported 1,333 deletions, 3,073 232 insertion/duplications, 51 inversions, and 91 translocations that were not previously discovered. 233 To validate potential cancer-related SVs reported only by DeBreak, we designed primers 234 flanking breakpoints for 15 randomly selected SVs (6 deletions, 4 duplications, 3 inversions, and 235 2 translocations) that spanned more than 10kbp (Fig S15). Polymerase chain reaction (PCR) 236 experiments validated 12 out of 15 DeBreak-novel SVs, with a validation rate of 80% (Supp file 237

2). 238
We further analyzed SVs in the SKBR3 breast cancer cell line by annotating breakpoints 239 and identified 41 putative gene fusions. By cross-validating these gene fusion events with Iso-240 Seq data (Methods), we found 11 gene fusions that can be validated at the transcripts level 241 (Supp file 3). 6 out of 11 gene fusions have been previously reported using transcriptomic 242 data 42-45 . Therefore, SV discovery using DNA-seq data with DeBreak identified 5 novel gene 243 fusions: WDR82-PBRM1, PDE4D-DEPDC1B, CPNE1-PHF20, CSE1L-KCNB1, and CSNK2A1-244 NCOA3. The fusion of WDR82 and PBRM1 was caused by a hemizygous deletion of 392kbp on 245 chromosome 3, with the fusion junction located in the intronic region of both genes (Fig S16).

Runtime and memory usage 254
DeBreak and other SV callers were tested on Intel Xeon E5-2680 v3 CPUs with 12 cores and 255 2.5GHz of frequency. It took 12.4 hours for DeBreak to identify SVs from a human genome 256 (SKBR3 cell line) using the 67x PacBio CLR dataset with peak memory of 63 GB (Table S3). 257 Due to the local assembly module and partial order alignments, DeBreak consumed more 258 runtime and memory than Sniffles (3.0h, 13GB) and cuteSV (1.5h, 3GB). However, DeBreak 259 was much faster and consumed less memory than pbsv (45.1h, 72GB). 260

262
In this work we present DeBreak, a method for efficient and accurate structural variation 263 detection from long-read sequencing data. Based on simulation data, real human genome data, 264 and cancer cell line data, DeBreak has demonstrated excellent performance when compared 265 with several state-of-the-art long-read SV callers. The improved performance is due to several 266 innovative design features: 1) the density-based clustering method can accurately identify 267 candidate SV events with a variety of sizes; 2) the partial order alignments can produce a 268 consensus sequence for accurate breakpoint inference, which is helpful for experimental 269 validation and mechanism inference 20-23 ; 3) local de novo assembly facilitates discovery of long 270 insertion events, which usually cannot be inferred within individual reads; 4) k-means approach 271 can accurately identify multi-allelic SVs, which are functionally important; and 5) multiple 272 functions can be applied to both healthy and unhealthy genomes. 273 Although the benchmark was based on human genomes, DeBreak can be applied to 274 other diploid or haploid non-human long-read resequencing data. The overall workflow may be 275 applied to polyploid genomes as well. Based on our knowledge and experience, SV discovery in 276 polyploid genomes is challenging for any currently available tools. More sophisticated 277 benchmarking work is needed. 278 We observed that SV callers reporting more accurate breakpoint positions (DeBreak and 279 pbsv) required more computational resource than SV callers with less accurate breakpoints 280 (Sniffles and cuteSV). During SV discovery for DeBreak, breakpoint refinement and ultra-large 281 insertion detection were the two most time-consuming steps, accounting for approximately 45% 282 and 32% of total runtime, respectively. When we disabled these two features, DeBreak

Raw SV signal detection and clustering 300
Raw SV signals are detected from read-to-contig alignment. DeBreak scans all read alignments 301 for intra-alignment and inter-alignment SV signals. Smaller indels can be contained within a 302 single alignment. For larger indels, inversions, duplications, and translocations, DeBreak utilizes 303 split-read information and classifies SV type based on orientation and clipping location of two 304 segments from the same read. Once a raw SV signal meets criteria, DeBreak records features 305 of the signal and the read supporting it, including SV type, SV coordinate, SV size, read name, 306 and read mapping quality. As it scans through read alignments, DeBreak also estimates 307 sequencing depth of the input dataset and automatically adjusts parameters used in the 308 following clustering and filtering processes. defined on both side of the summit when density drops to 10% of the summit or lower than the 314 predefined threshold. All raw signals located within one SV region are then merged into one SV 315

candidate. 316
For each SV candidate, DeBreak determines whether it is a multi-allele SV based on the 317 first quartile (Q1) and third quartile (Q3) of SV size from all raw signals. If Q3 is smaller than 318 twice of Q1, all raw signals are merged into a single-allele SV, excluding outliers of extremely 319 large or small size. If Q3 is larger than twice Q1, DeBreak separates raw SV signals for each 320 allele with k-means clustering (k=2 for diploid genomes) and merges signals from each cluster 321 separately as a multi-allele SV candidate. The detection and clustering of SV signals are 322 processed separately for each chromosome, allowing DeBreak to perform multi-thread SV 323 detection, drastically reducing runtime. 324 325

SV breakpoint refinement 326
After SV signal clustering, DeBreak assigns each SV candidate a breakpoint coordinate by 327 computing the mean value of raw signals. Raw signals can be highly imprecise due to the high 328 error rate of long-read sequencing and the presence of low-complexity regions in the genome. 329 DeBreak implants the POA algorithm from wtdbg2 46 to refine breakpoint locations. For each SV 330 candidate, DeBreak collects all reads containing this SV candidate and performs POA to 331 generate accurate consensus sequences. DeBreak then aligns these consensus sequences to 332 the reference genome with minimap2 and detects SVs from consensus sequence alignments. 333 The breakpoint location detected from consensus sequence is used to refine the breakpoint 334 coordinates of SV candidates. If POA fails to generate consensus sequences for an SV 335 candidate, or the consensus sequence cannot be properly aligned back to the genome, 336 DeBreak will keep the mean value of the raw signals as breakpoint coordinates.

Large insertion detection via local assembly 350
DeBreak utilizes a local de novo assembly approach to detect ultra-large insertions that are too 351 long to be spanned within a single read. While scanning read alignments for raw SV signals, 352 DeBreak also records positions of clipped end of read alignments. After scanning through a 353 chromosome, DeBreak identifies candidate insertion breakpoint regions with enriched clipped 354 alignment. It collects all the reads clipped at each candidate breakpoint region and performs 355 local de novo assembly with these reads to reconstruct the inserted sequence. DeBreak aligns 356 assembled contigs to the reference genome with minimap2 and detects insertions from these 357 contigs. Detected insertions are filtered out if 1) multiple contigs are assembled during local de 358 novo assembly, 2) a detected insertion is located in another chromosome or too far away from 359 the candidate insertion breakpoint, or 3) the detected insertion is smaller than 1kbp. 360

Duplication identification 362
DeBreak includes an optional duplication-rescuing module that distinguishes tandem 363 duplications from insertion calls, as smaller tandem duplications are often treated as insertions 364 by aligners. The inserted sequence of tandem duplication shows high similarity with the 365 duplicated region, while insertions usually consistent of novel sequence or sequences from 366 distinct regions of the genome. For each insertion call, DeBreak collects reads supporting the 367 SV event and extracts inserted sequence from each read. It utilizes minimap2 to re-align these 368 inserted sequences back to the local region (1kbp flanking the insertion breakpoint) on the 369 reference genome. If more than 50% of inserted sequences can be aligned back to the local 370 region, DeBreak corrects the SV type to tandem duplication for this insertion call. 371 372 2. Benchmark in simulated dataset 373

Simulated dataset generation 374
Three simulated datasets with ground-truth SVs were generated for benchmarking. For each 375 dataset, a total of 22,200 SVs (10,000 deletions, 10,000 insertions, 1,000 duplications, 1,000 376 inversions and 200 translocations) were randomly simulated on Chr1 to Chr22 and ChrX. The 377 sizes of simulated SVs followed the geometric distribution as observed in real human genomes, 378 including peaks at ~350bp and ~6000bp. These simulated SVs were assigned as heterozygotes 379 and homozygotes with a ratio of 2:1, and heterozygous SVs were randomly assigned to two 380 haplotypes. The human reference genome GRCh38 (autosomes and the X chromosome) were 381 modified according to the type and size of simulated SVs to generate haplotype 1 and haplotype 382 2. Long reads were simulated from the modified genome using pbsim with options "--data-type 383 CLR --model_qc model_qc_clr --depth 25 --accuracy-mean 0.85". The depth was set to 25X for 384 each haplotype, generating a simulated dataset of 50X when merging all reads from both 385 haplotypes. The average read length was set to 10kbp, 15kbp, and 20kbp for three simulated 386 datasets. Figure 1 Workflow of DeBreak. The major steps of DeBreak SV discovery include SV signal detection, signal clustering, breakpoint refinement, and filtering and genotyping. Detailed descriptions of each step can be found in Methods. The unit for recall, precision, and F1 score is %. The highest recall, precision, and F1 score among four tested SV callers are marked in bold. Rec = Recall. Pre = Precision. F1 = F1 score.