Copy-number Analysis by Base-level Normalization (CABANA): An Intuitive Visualization Tool for Confirming True Copy Number Variations

Next-generation sequencing (NGS) facilitates comprehensive molecular analyses, allowing accurate diagnosis of unsolved disorders. In addition to detecting single-nucleotide variations and small insertions/deletions, bioinformatics tools can identify copy number variations (CNVs) in NGS data, which improves the diagnostic yield. However, due to the possibility of false positives, subsequent confirmation tests are generally performed. Here, we introduce Copy-number Analysis by BAse-level NormAlization (CABANA), a computational tool that can intuitively visualize true CNVs using the normalized single-base-level read depth calculated from NGS data. To demonstrate how CABANA works, NGS data were obtained from 474 patients with neuromuscular disorders. CNVs were screened using a conventional bioinformatics tool, ExomeDepth, and then we normalized and visualized those data at the single-base level using CABANA, followed by manual inspection by geneticists to exclude false positives. In this way, we identified 31 true-positive CNVs (7%) in 474 patients and subsequently confirmed all of them to be true using multiplex ligation-dependent probe amplification. The performance of CABANA was deemed acceptable by comparing its diagnostic yield with previous data about neuromuscular disorders. Despite some limitations, we expect CABANA to help researchers accurately identify CNVs and reduce the need for subsequent confirmation testing.


Introduction
Next-generation sequencing (NGS), a massively parallel sequencing technology, is one of the most important analytical tools in molecular genetics [1][2][3][4]. NGS technology enables rapid, cost-effective, comprehensive molecular analyses and contributes significantly to detection of a broad range of pathogenic variants, especially small variants such as single-nucleotide variants (SNVs) and small insertions/deletions (INDELs) [5][6][7]. Therefore, NGS can help diagnose patients who have phenotypically and genetically diverse disorders, such as neuromuscular disorders (NMDs) 2,8,9. In addition to the small variants, copy number variations (CNVs), structural variations that usually range from 1 kb to 3 Mb, can be detected using NGS data and can play an important role in diagnosing patients [9][10][11][12].
Various computational tools have been developed to enhance the sensitivity of CNV detection in NGS data 7,13. However, most tools produce many false-positive CNV calls, necessitating subsequent confirmatory tests such as multiplex ligation-dependent probe amplification (MLPA) and array comparative genomic hybridization 11,14,15. For SNVs, the need for additional confirmatory tests has been much reduced by powerful visualization tools such as the Integrative Genomics Viewer, but few such tools are available for CNVs 15,16.
Here, we introduce Copy-number Analysis by BAse-level NormAlization (CABANA), a CNV visualization tool using the normalized single-base-level read depth to identify true CNVs. In a comparison with gold-standard conventional testing, we confirmed that CABANA is both effective and accurate in determining true CNVs.

Methods
Patients and data.
We retrospectively collected NGS data from 474 patients who underwent targeted NGS related to NMDs in our laboratory between July 2016 and December 2020. Using electronic medical records, the disorders suspected when the NGS tests were ordered were largely classified into muscular disorders and neurological disorders and then subsequently subdivided into specific disorders. Three targeted NGS panels were used in this cohort: a muscular panel, neurological panel, and neuromuscular panel consisting of 212, 172, and 599 genes, respectively (Fig. 1; Supplementary Table S1). This study protocol was approved by the Institutional Review Board/Ethics Committee of the Severance Hospital, Seoul, Korea (IRB No. 4-2020-0715), and the requirement for informed consent was waived by the Institutional Review Board/Ethics Committee of the Severance Hospital due to the retrospective study design. All methods were performed in accordance with the relevant guidelines and regulations.
Processing of NGS data.
NGS data were generated using a NextSeq 550Dx System (Illumina, San Diego, CA, USA) with 2×151 bp reads. HaplotypeCaller and MuTect2 in the GATK package (3.8-0) and VarScan2 (2.4.0) were used to detect SNVs and small INDELs. Read depth metrics, including allele depth per sample and overall depth of coverage, were obtained through the GATK package. All samples showed a mean depth of at least 500×, and the target region coverage was greater than 99.9% at 30×. ExomeDepth (version 1.1.10), a read depth-based algorithm with high sensitivity, was used to screen exonic CNVs in the target regions 17-19.
Calculation of base-level normalized depths and graphical visualization by CABANA.
Using depths of coverage at the base level from the GATK package, the adjusted read depth (ARD) at base position i of sample j (ARD [i, j]) was calculated by multiplying the read depth at base position i of sample j by an adjustment factor (retrieved from the sum of the read depths of all samples in the same batch at base position i divided by the sum of the read depths at all base positions in sample j). A case sample is defined as the sample whose copy numbers are to be examined, and control samples are all the other samples in the same batch.
Confirmation of CNVs.

Results
Graphical visualization with CABANA.
Raw depths, normalized read depths (NRDs), and CNV calls were plotted together. Base-level read depths before normalization were plotted on top, with those of case samples in light green and those of controls in light yellow. In the middle space, NRDs were plotted, with those of case samples in violet and those of controls in light yellow. The CNVs of exons called from ExomeDepth are visualized below the NRD plot, with red and blue boxes representing copy number gains and losses, respectively (Fig. 2). A true-positive CNV typically shows a distinct deviation between the NRD lines of cases and controls, with steady NRD lines of controls around zero with very low variability across all positions in the regions of interest (Fig. 2A). In contrast, false-positive CNVs show an unclear deviation between the NRD line of cases and the highly variable and widely fluctuating NRD lines of the controls (Fig. 2B, Supplementary Fig. S1).
CABANA effectively discriminated a heterozygous deletion in a single or a few exons (Fig. 3A and B). For hemizygous deletions, CABANA provided a clearer and more intuitive illustration of the absent reads in the deleted region (Fig. 3C). For genes with homologous regions or genes, such as SMN1 and SMN2, detecting CNVs can be very difficult using conventional CNV algorithms that adopt read-depth comparisons. Despite that limitation, CABANA successfully discriminated some CNVs in homologous regions, as shown in a case with a homozygous deletion of exon 7 in SMN1, which is highly homologous to SMN2 (Fig. 3D). Examples of CNV visualization with explanation are provided in Supplementary Fig. S2.
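As a rough illustration of the quantities being plotted, the following Python sketch computes the adjusted read depth (ARD) by one literal reading of the description in Methods, and then derives a normalized value per base. The NRD step is an assumption: the text above does not give the NRD formula, so the sketch takes NRD to be the log2 ratio of each sample's ARD to the mean ARD of the controls, which is one choice that would place control lines around zero. The function names and the toy depths are hypothetical and are not part of CABANA.

```python
import math
from statistics import mean


def adjusted_read_depth(depths):
    """depths[j][i] is the raw read depth of sample j at base position i.

    Literal reading of Methods: RD[i, j] is multiplied by
    (sum of RD over all samples at position i) /
    (sum of RD over all positions in sample j).
    Assumes every sample has a non-zero total depth.
    """
    n_pos = len(depths[0])
    pos_totals = [sum(sample[i] for sample in depths) for i in range(n_pos)]
    sample_totals = [sum(sample) for sample in depths]
    return [
        [sample[i] * pos_totals[i] / sample_totals[j] for i in range(n_pos)]
        for j, sample in enumerate(depths)
    ]


def normalized_read_depth(ard, case):
    """ASSUMPTION (not from the paper): NRD = log2(ARD / mean control ARD).

    Controls are all samples in the batch except the case sample.
    """
    controls = [row for j, row in enumerate(ard) if j != case]
    refs = [mean(c[i] for c in controls) for i in range(len(ard[0]))]
    nrd = []
    for row in ard:
        out = []
        for value, ref in zip(row, refs):
            if ref <= 0:
                out.append(float("nan"))   # no usable control depth
            elif value <= 0:
                out.append(float("-inf"))  # no reads at this base
            else:
                out.append(math.log2(value / ref))
        nrd.append(out)
    return nrd


def pattern_summary(nrd, case, start, end):
    """Mean case NRD and largest absolute control NRD over [start, end)."""
    case_mean = mean(nrd[case][start:end])
    control_max = max(
        abs(v) for j, row in enumerate(nrd) if j != case for v in row[start:end]
    )
    return case_mean, control_max


# Toy batch: sample 0 is the case with halved depth at positions 2 and 3;
# the three controls differ only in overall sequencing depth.
depths = [[100, 100, 50, 50, 100, 100], [100] * 6, [200] * 6, [50] * 6]
nrd = normalized_read_depth(adjusted_read_depth(depths), case=0)
print(pattern_summary(nrd, case=0, start=2, end=4))
```

In this toy batch the control lines are flat at zero despite their different sequencing depths, and the case's halved-depth positions sit one log2 unit below its other positions, which is the kind of contrast the true-positive pattern describes. The summary function is only a numerical stand-in for the visual inspection performed by geneticists.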
Small deletions and partial exon deletion.
Because it plots CNVs per base, CABANA can distinguish small deletions undetectable by conventional CNV algorithms. As shown in Figure 4A, a deletion of 15 bases in PRX (NM_181882.2:c.1389_1403del) showed a sharp and distinct decrease in the NRD in CABANA, and a deletion of 3 bases in the same gene (NM_181882.2:c.416_418del) showed an apparent decrease in the NRD in the middle of an exon. In addition, CABANA can identify partial exon deletions. Figure 4B shows a distinctly deviated line with constant, decreased NRDs across part of an exon, which indicates the partial exon 4 deletion of FLCN (NM_144997.5:c.-24-1345_35delinsG).
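The idea that a per-base profile can expose deletions of only a few bases can be sketched as a simple scan for contiguous runs of decreased NRD. This is a minimal illustration rather than CABANA's method, which relies on visual inspection: the function name, the threshold of -0.5, and the minimum run length of 3 bases are all hypothetical, and the sketch assumes NRD values on a log2-ratio scale.

```python
def low_nrd_runs(nrd, threshold=-0.5, min_len=3):
    """Return half-open (start, end) index pairs of contiguous positions
    whose NRD is below `threshold` for at least `min_len` bases.

    `threshold` and `min_len` are illustrative values, not taken from the paper.
    """
    runs = []
    start = None
    for i, value in enumerate(nrd):
        if value < threshold:
            if start is None:
                start = i  # a new run of decreased NRD begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last position.
    if start is not None and len(nrd) - start >= min_len:
        runs.append((start, len(nrd)))
    return runs


# Toy profile: a 15-base dip inside an otherwise flat region.
profile = [0.0] * 10 + [-1.0] * 15 + [0.0] * 10
print(low_nrd_runs(profile))
```

A window-based tool would average such a short dip with its flat neighbours, which is why a per-base representation is better suited to events of this size.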
Proportion of CNVs with the true-positive pattern.
Among all CNVs called by ExomeDepth, only about 5% showed true-positive CNV patterns when visualized by CABANA. ExomeDepth called CNVs in TTN, NEB, TNXB, SMN2, PIEZO2, and ASAH1 in more than 40% of the patients tested with the neuromuscular panel, but only one of them in SMN2 showed the true-positive pattern when visualized by CABANA. The proportion of true-positive CNVs identified by CABANA was high in DMD, where hemizygous mutations in males are frequent. Interestingly, 11% of the CNVs that ExomeDepth detected in SMN1, which is highly homologous with SMN2, were visualized as true-positive CNVs by CABANA and confirmed as true by MLPA 21,22.
Diagnostic utility of CABANA.

Discussion
Some patients with genetic disorders have structural abnormalities, of which CNVs are the most common type. NGS can detect both SNVs and CNVs, and many detection algorithms for CNVs have been developed using different principles. One popular method involves comparing the depths of coverage among cases and controls. The performance of those tools has been evaluated in several benchmark studies and was found to depend significantly on the dataset 12 15. Zhao et al. conducted a performance evaluation of four CNV tools (CoNIFER, cn.MOPS, CNVkit, and exomeCopy) with whole exome sequencing data and found that performance differed according to targeted CNV size and type 24. Of note, CNV algorithms based on the depth of coverage method commonly have problems with false-positive calls, which arise mainly in regions with high GC content and poor mappability. In addition, accurate detection of small CNVs remains challenging because most target regions of whole exome sequencing and targeted NGS data are small and noncontiguous 12,23.
To improve the performance of CNV detection, some CNV visualization tools have been developed (Table 1). Users can recognize true CNVs more intuitively by visually inspecting the depth of coverage in the regions of interest. Most CNV visualization tools use a window of a specific length to reduce variability in read depths, and they generally visualize CNVs at the chromosome or gene level [25][26][27][28][29][30][31][32]. In contrast, CABANA visualizes CNVs with high resolution based on normalized single-base-level read depth. To the best of our knowledge, only one previous tool visualizes CNVs using normalized read depths per base 14.
With that higher resolution, users can efficiently discriminate true CNVs, both small and large, from false CNVs. Unlike other CNV visualization tools, CABANA produces uniform, steady lines plotted using the NRDs, which are an important factor in filtering out false-positive calls and greatly increase specificity. In support of that specificity, 31 pathogenic CNVs determined as true by CABANA were all confirmed to be true by MLPA. Therefore, CABANA visualization can decrease the need for additional confirmatory testing to increase the cost-effectiveness of NGS and reduce the burden on laboratories. In addition, small deletions and partial exon deletions that were not identified by the conventional CNV algorithm were detected by CABANA. Because visual inspection with CABANA is very intuitive, even inexperienced users can easily identify true CNVs. Nonetheless, we recommend that confirmation tests be applied in specific instances, such as a single exon deletion.
CNVs in TTN, NEB, and TNXB were frequently called by ExomeDepth, but none of them showed a true-positive pattern in CABANA. The presence of tandem repeat regions in TTN and NEB and a highly homologous pseudogene in TNXB might have influenced the performance of CABANA [33][34][35]. Similar to other bioinformatic tools that use the read-depth approach, CABANA seems to have difficulty in determining true CNVs in regions with high GC content, where highly variable NRDs tend to appear 7,13,36. Although the CNVs in TTN, NEB, and TNXB called by ExomeDepth were not confirmed by MLPA, the similar patterns recurrently observed in specific regions of those genes in normal healthy controls suggest a very low likelihood that they are true pathogenic CNVs.
In this study, we found that about 36% of patients with NMD harbored molecular abnormalities on the targeted NGS panels. Previous studies reported that clinically significant variants were detected in 20-49% of NMD patients, but that diagnostic yield varied with the NGS panel and cohort group tested 9,37,38. A large-scale study on the diagnosis of NMD using multigene panels showed that pathogenic CNVs were identified in 7.6% of NMD patients, with the majority being on SMN1, PMP22, DMD, and SPAST 9. Using our bioinformatics pipelines with CABANA, we found pathogenic CNVs in 7% of patients with NMD; in concordance with the previous large-scale study, most of them were in DMD, PMP22, SPAST, and SMN1.
The most commonly mutated gene in our cohort was DMD, a causative gene for Duchenne muscular dystrophy and a major cause of inherited muscular disorders in Korea 39. Of 39 pathogenic variants in DMD, 16 (41%) were CNVs. Considering that approximately 70% of Duchenne/Becker muscular dystrophy patients with molecular defects had pathogenic variants of DMD in the form of CNVs, the proportion of CNVs detected by CABANA seems to be low 40,41. However, there might have been selection bias in our patient cohort because some patients had proven to be negative for CNVs by MLPA or quantitative PCR before NGS testing. SPAST, a major causative gene for autosomal dominant spastic paraplegia 42, was the third most commonly mutated gene in our patients with NMD, with about 29% being pathogenic CNVs. Previous studies on hereditary spastic paraplegia reported that the proportion of pathogenic CNVs was 2.5-37.5% depending on the characteristics of each cohort 43,44. PMP22 is related to Charcot-Marie-Tooth disease (CMT) type 1A and hereditary neuropathy with pressure palsies, with most patients having duplication and deletion CNVs, respectively 45. Consistent with that, all the pathogenic variants found in PMP22 in this study were CNVs. Collectively, the mutation spectrum and proportion of CNVs in these disorders found using CABANA were concordant with the literature. In most patients, the phenotype was consistent with disorders related to the gene harboring the pathogenic CNV. This evidence supports our CABANA algorithm as robust and accurate.
Our study has some limitations. First, the performance of CABANA could not be thoroughly evaluated due to the limited availability of confirmatory tests and practical considerations, such as an uncertain false-negative rate. Nonetheless, its performance was deemed to be acceptable compared with previously reported CNV data in patients with NMD and clinical correlations with our patients' results. Second, CNV visualization with CABANA was performed only on CNVs called by ExomeDepth, which might have missed true CNVs 15. Third, as described above, it can be challenging for CABANA to identify true CNVs in repeat regions, highly homologous regions, and GC content-rich regions 7,13,36.
In summary, we developed a base-level visualization software, CABANA, as a confirmatory tool for CNVs called by other algorithms. With its high resolution, CABANA showed excellent fidelity and specificity and could help exclude false CNVs and identify true CNVs without additional confirmation tests. In patients with NMD, CABANA effectively detected pathogenic CNVs, demonstrating its high utility with clinical samples.

CABANA workflow using next-generation sequencing data obtained from patients with neuromuscular disorders. The numbers in parentheses indicate the number of samples involved. The formula for normalizing read depth is described in the blue box, where i, j, and n represent a certain base position, a certain sample, and the number of samples in a batch, respectively. RD, ARD, and NRD stand for read depth, adjusted read depth, and normalized read depth, respectively. A case sample is one whose copy numbers are to be tested, and control samples are all the samples in a single batch excluding the case sample. In calculating ARD, the sum of read depths from the start to end positions indicates the sum of read depths in all the targeted regions examined.
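The boxed formula itself does not survive in this text. Based only on the verbal description in Methods and the symbol definitions in the caption, the adjustment step can tentatively be written as below. The summation indices k (over samples in the batch) and p (over targeted base positions) are names introduced here for illustration, and the expression should be read as a reconstruction under those assumptions, not as the published formula. The NRD step is not described in enough detail in the text to be reconstructed.

```latex
\mathrm{ARD}[i,j] \;=\; \mathrm{RD}[i,j] \times
\frac{\displaystyle\sum_{k=1}^{n} \mathrm{RD}[i,k]}
     {\displaystyle\sum_{p=\mathrm{start}}^{\mathrm{end}} \mathrm{RD}[p,j]}
```

Here the numerator sums the read depth at base position i over all n samples in the batch, and the denominator sums the read depth of sample j over all targeted positions, so that samples sequenced to different overall depths become comparable at each base.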