CNVoyant: A Highly Performant and Explainable Multi-Classifier Machine Learning Approach for Determining the Clinical Significance of Copy Number Variants

The precise classification of copy number variants (CNVs) presents a significant challenge in genomic medicine, primarily due to the complex nature of CNVs and their diverse impact on genetic disorders. This complexity is compounded by the limitations of existing methods in accurately distinguishing between benign, uncertain, and pathogenic CNVs. Addressing this gap, we introduce CNVoyant, a machine learning-based multi-class framework designed to enhance the clinical significance classification of CNVs. Trained on a comprehensive dataset of 52,176 ClinVar entries across pathogenic, uncertain, and benign classifications, CNVoyant incorporates a broad spectrum of genomic features, including genome position, disease-gene annotations, dosage sensitivity, and conservation scores. Models to predict the clinical significance of copy number gains and losses were trained independently. Final models were selected after testing 29 machine learning architectures and 10,000 hyperparameter combinations each for deletions and duplications via 5-fold cross-validation. We validate the performance of CNVoyant on a set of 21,574 CNVs from the DECIPHER database, a highly regarded resource known for its extensive catalog of chromosomal imbalances linked to clinical outcomes. Compared to alternative approaches, CNVoyant shows marked improvements in precision-recall and ROC AUC metrics for binary pathogenic classifications while going one step further, offering multi-classification of clinical significance and corresponding SHAP explainability plots. This large-scale validation demonstrates CNVoyant’s superior accuracy and underscores its potential to aid genomic researchers and clinical geneticists in interpreting the clinical implications of real CNVs.


Introduction
The establishment of reference genomes, sequencing technologies, and post-processing algorithms has ushered in an era where genetic variation is reliably detectable. Databases are maintained to define functional regions of the genome (O'Leary et al. 2016; Howe et al. 2021), catalog observed genetic variants (Sherry 2001; Exome Aggregation Consortium et al. 2016), record variant frequencies in different human populations (1000 Genomes Project Consortium et al. 2015; The UK10K Consortium et al. 2015; Koch 2020), and provide annotations regarding clinical significance (Landrum et al. 2018). However, entries in these resources favor smaller genetic changes, specifically single nucleotide variations (SNVs) and short insertions and deletions (indels). To date, these short variants have been the focus for clinical germline diagnoses in rare genetic diseases (RGDs); however, this may be a symptom of the limited capacity to discern the clinical significance of larger structural variants (SVs).
SVs cover larger segments of DNA and include, but are not limited to, copy number variants (CNVs), translocations, and inversions, all of which span at least 50 base pairs (bp) (1000 Genomes Project et al. 2011; MacDonald et al. 2014; Coutelier et al. 2022). The recent clinical adoption of genome sequencing (GS) has led to more reliable identification of SVs, at a much finer resolution than was possible with microarray technology (Sanchis-Juan et al. 2018; Gross et al. 2019; Liu et al. 2022). In contrast to exome sequencing (ES), which focuses on coding sequences, GS extends the breadth of detection to intronic and intergenic regions. This attribute is crucial given that SVs frequently occur in non-coding regions and can encompass multiple genes. Despite the newly available data, understanding the clinical significance of detected SVs remains a challenge. Compared to recurrent SVs with well-defined breakpoints, rare SVs are particularly difficult to interpret. Even when rare SVs are observed in population frequency databases, they are often not annotated for clinical significance (NHGRI Centers for Common Disease Genomics et al. 2020).
Despite the advent of next-generation sequencing (NGS), diagnostic genetic variants are typically only identified in 25-45% of patients undergoing GS for suspicion of having an RGD (Yang et Kumar et al. 2023). Reanalysis of undiagnosed cases has yielded additional diagnoses after considering CNVs, confirming reports of CNV involvement in RGDs (Hegele 2007; Weischenfeldt et al. 2013; Bergant et al. 2018). Recent efforts have been made to standardize the interpretation of CNVs, culminating in the American College of Medical Genetics and Genomics (ACMG) technical standards for interpreting CNVs (Riggs et al. 2020). These guidelines consider population frequency, the impact of overlapping functional regions, and previous clinical interpretation (Collins et al. 2020). Within these guidelines, these features are evaluated to determine haploinsufficiency (HI) and triplosensitivity (TS), the sensitivity to regional losses or gains in the genome, respectively. Both concepts fall under the broader category of dosage sensitivity.
The ACMG guidelines, while highly valued in the clinical setting, tend to classify CNVs as variants of uncertain significance (VUS), as observed in the algorithmic implementation of the guidelines, ClassifyCNV (Gurbich and Ilinsky 2020). In the interpretation setting, this limited specificity results in lengthy candidate CNV sets requiring review, with many benign CNVs being classified as VUS. To address this problem, several machine learning (ML) approaches have been proposed to enhance the precision of classifying the clinical significance of CNVs. These algorithms statistically learn from data elements related to dosage sensitivity, overlapping genes, population frequencies, regulatory elements, topologically associating domains, and genomic position to predict pathogenic potential (Zhang et al. 2021; Gažiová et al. 2022; Sharo et al. 2022; Hertzberg et al. 2022; Lv et al. 2023). None of these algorithms combine all informative features in a single model. Moreover, as is common for ML-based classifiers, they do not explain why a given classification was chosen for a CNV, which limits interpretability. Together, these two limitations motivated the development of an improved ML approach.
Here we introduce CNVoyant, a tree-based, multi-class clinical significance classifier that combines previously reported features with novel features to classify CNVs more accurately than previously published methods. CNVoyant provides prediction explanations and enhances the accuracy and granularity of clinical significance classifications, enabling rapid identification and interpretation of potentially pathogenic CNVs.

Methods
The capacity of CNVoyant to classify the clinical significance of CNVs was tested in 21,574 CNVs curated from DECIPHER after training on 52,176 CNVs published in ClinVar (Fig. 1). This approach is consistent with previously reported comparisons of pathogenicity, where a set of CNVs is examined in a general context rather than focusing on individual patients. This also aligns with the previously cited ACMG technical guidelines, which recommend uncoupling CNV pathogenicity classification from the implications for a specific patient (Riggs et al. 2020). Features were generated to capture information related to genomic position, variant composition, overlapping functional annotation, population frequency, conservation, and dosage sensitivity.

Training Dataset Curation
CNVoyant is trained on CNVs included in the January 2023 XML release of ClinVar (Landrum et al. 2018). This XML file was parsed, and extracted variants were limited to CNVs (variant type of "copy number gain" or "copy number loss") that did not have duplicated variant positions. The Reference ClinVar Accession Number (RCV) entry was chosen to represent each CNV to avoid training on duplicates that can arise in cases of multiple submitters. 40,837 of the extracted CNVs were aligned to the GRCh37 reference genome, all of which required a genomic coordinate mapping via the UCSC liftOver command line tool (Hinrichs 2006) to be combined with the 12,641 CNVs that were aligned to GRCh38. Following liftOver, 1,126 variant entries were identified as duplicates of entries originally aligned to GRCh38, and they were omitted. 850 CNVs were omitted due to ambiguous clinical significance labeling or conflicting clinical significance annotation in entries with matching genomic coordinates. 20 variants were removed for having a size of less than 50 bp. The remaining ClinVar CNVs with at least one pathogenic or likely pathogenic designation were labeled as pathogenic for training purposes. Non-pathogenic variants with at least one VUS designation were labeled as VUS, and remaining variants containing only benign or likely benign classifications were labeled as benign. Altogether, 52,176 CNVs were included in model training (Fig. 2 (a)
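The three-way labeling rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `training_label` and the exact designation strings are hypothetical, and entries with ambiguous or conflicting annotations are assumed to have been removed beforehand, as the text describes.

```python
from typing import Iterable, Optional

# Designation strings are assumed for illustration; real ClinVar values may differ.
PATHOGENIC = {"pathogenic", "likely pathogenic"}
UNCERTAIN = {"uncertain significance"}
BENIGN = {"benign", "likely benign"}


def training_label(designations: Iterable[str]) -> Optional[str]:
    """Collapse the designations attached to one CNV into a training label."""
    seen = {d.strip().lower() for d in designations}
    if seen & PATHOGENIC:
        return "pathogenic"  # at least one pathogenic or likely pathogenic designation
    if seen & UNCERTAIN:
        return "VUS"  # non-pathogenic with at least one VUS designation
    if seen and seen <= BENIGN:
        return "benign"  # only benign or likely benign designations
    return None  # anything else is treated as unlabeled (assumption)


print(training_label(["Likely benign", "Uncertain significance"]))
```

The order of the checks carries the precedence stated in the text: pathogenic first, then VUS, then benign.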

Genomic position (2 features)
Centromere distance: The number of bp separating the centromere from the candidate CNV. Distance from the centromere to the CNV is determined by selecting the CNV boundary closest to the centromere: the end coordinate on the P arm or the start coordinate on the Q arm.
Telomere distance: The number of bp separating the telomere from the candidate CNV. Distance from the telomere to the CNV is calculated by selecting the CNV boundary furthest from the centromere: the start coordinate on the P arm or the end coordinate on the Q arm.
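A minimal sketch of the two position features follows. The function name, the coordinate convention (0-based, with the P-arm telomere at position 0 and the Q-arm telomere at the chromosome length), and the handling of CNVs that overlap the centromere are assumptions made for illustration; the text does not specify them.

```python
from typing import Tuple


def position_features(
    cnv_start: int, cnv_end: int, cen_start: int, cen_end: int, chrom_len: int
) -> Tuple[int, int]:
    """Return (centromere_distance, telomere_distance) in bp for one CNV."""
    if cnv_end <= cen_start:
        # P arm: the end coordinate is closest to the centromere,
        # the start coordinate is furthest from it.
        return cen_start - cnv_end, cnv_start
    if cnv_start >= cen_end:
        # Q arm: the start coordinate is closest to the centromere,
        # the end coordinate is furthest from it.
        return cnv_start - cen_end, chrom_len - cnv_end
    # CNV overlaps the centromere (assumed handling, not from the text).
    return 0, min(cnv_start, chrom_len - cnv_end)


# Toy chromosome of 1,000 bp with a centromere at 400-500.
print(position_features(100, 200, 400, 500, 1000))
print(position_features(700, 900, 400, 500, 1000))
```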

CNV composition (2 features)
GC Content: Percentage of nucleotides in the genomic region encompassed by the candidate CNV that are guanine or cytosine.
BP Length: Total bp spanning the candidate CNV.
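The two composition features reduce to simple arithmetic, sketched below. The helper names are hypothetical, and the half-open coordinate convention and the choice to count every base in the sequence (including any ambiguous bases) in the denominator are assumptions.

```python
def gc_content(sequence: str) -> float:
    """Percentage of bases in the sequence that are G or C."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def bp_length(start: int, end: int) -> int:
    """Span of the CNV in bp, assuming half-open [start, end) coordinates."""
    return end - start


print(gc_content("GGCCAATT"), bp_length(1000, 1500))
```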

Functional annotation (5 features)
Count of genes: The gene count is defined as the total number of genes that overlap the candidate CNV. Genes overlap the candidate CNV if at least one bp is shared between the candidate CNV's location and the gene's genomic coordinates drawn from the RefSeq database (O'Leary et al. 2016). All overlap calculations were made via the Bedtools intersect function (Quinlan and Hall 2010).
Count of diseases: The disease count is defined as the total number of diseases associated with the gene(s) that overlap the candidate CNV. This allows genes with more disease associations to be distinguished from genes with only a single or no disease association. Disease-gene associations are referenced from the curated annotations provided by Online Mendelian Inheritance in Man (OMIM) (Amberger et al. 2015).
Count of exons: Overlapping exon count is calculated by summing exons, across all genes, that overlap the candidate CNV. The exonic boundaries were padded by 10 base pairs to account for canonical splice regions.
Count of promoter regions: CNVs may overlap the promoter region of a gene rather than the gene itself. To address this, we include a count of promoter regions, defined as the interval between the transcription start site (TSS) and 1,000 base pairs upstream of the TSS.
Count of ClinVar pathogenic SNVs and indels: A sum of overlapping pathogenic SNV/indels is included to capture potentially relevant pathogenicity interpretation. To obtain a set of overlapping pathogenic SNV/indels, ClinVar (Landrum et al. 2014) is intersected with the candidate CNV and limited to variants interpreted as having "Pathogenic" or "Likely Pathogenic" significance.
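The interval logic behind the exon and promoter counts can be illustrated in plain Python. This is a stand-in for the Bedtools-based overlap calculation, not a reproduction of it; the helper names, the half-open coordinates, and the strand handling for promoters are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one bp."""
    return a[0] < b[1] and b[0] < a[1]


def count_exons(cnv: Interval, exons: List[Interval], pad: int = 10) -> int:
    """Count exons overlapping the CNV after padding each exon by `pad` bp."""
    return sum(overlaps(cnv, (max(0, s - pad), e + pad)) for s, e in exons)


def promoter_interval(tss: int, strand: str, upstream: int = 1000) -> Interval:
    """Interval from the TSS to `upstream` bp upstream of it."""
    if strand == "+":
        return max(0, tss - upstream), tss
    return tss, tss + upstream  # upstream lies at higher coordinates on "-"


exons = [(100, 200), (300, 400)]
print(count_exons((205, 250), exons), count_exons((205, 250), exons, pad=0))
```

In this toy case the CNV misses both exons by a few bp, so only the padded count registers the first exon.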

Population frequency (1 feature)
GnomAD SV popmax: To estimate population frequency, we identify the highest frequency across all gnomAD SV (V4) (Gudmundsson et al. 2022) entries that match the CNV's variant type (deletion or duplication) and exhibit at least 50% reciprocal overlap in genomic coordinates. This means that the overlapping segment must cover at least half of the candidate CNV and at least half of the gnomAD SV entry, ensuring substantial agreement between the observed CNV and those described in gnomAD SV.
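The reciprocal-overlap filter can be sketched as below. The record layout (`start`, `end`, `svtype`, `popmax_af`) is hypothetical and does not reflect the actual gnomAD SV schema, and returning 0.0 when no entry qualifies is an assumption.

```python
from typing import Dict, List


def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Smaller of the two fractions of each interval covered by the overlap."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0 or a_end <= a_start or b_end <= b_start:
        return 0.0
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


def popmax_feature(cnv: Dict, entries: List[Dict], min_overlap: float = 0.5) -> float:
    """Highest frequency among same-type entries meeting the overlap threshold."""
    matches = [
        e["popmax_af"]
        for e in entries
        if e["svtype"] == cnv["svtype"]
        and reciprocal_overlap(cnv["start"], cnv["end"], e["start"], e["end"]) >= min_overlap
    ]
    return max(matches, default=0.0)  # 0.0 when nothing matches (assumption)


cnv = {"start": 100, "end": 200, "svtype": "DEL"}
entries = [
    {"start": 150, "end": 250, "svtype": "DEL", "popmax_af": 0.01},  # 50% both ways
    {"start": 100, "end": 400, "svtype": "DEL", "popmax_af": 0.20},  # covers 1/3 of entry
    {"start": 100, "end": 200, "svtype": "DUP", "popmax_af": 0.30},  # wrong type
]
print(popmax_feature(cnv, entries))
```

The second entry fully contains the CNV but is excluded, because the overlap covers only a third of the larger entry.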

Conservation (2 features)
PhyloP: To estimate the conservation of a candidate CNV, PhyloP scores are referenced. PhyloP scores are available at single nucleotide resolution. Single nucleotide scores are highly variable, so an extreme value is more likely to be observed in a longer interval, and the maximum raw score is thus correlated with the size of the candidate CNV. To mitigate this correlation, we employ a centered moving average that considers all scores within a specified reading frame (Supplemental Fig. 1). This effectively smooths the otherwise volatile conservation score curve. The maximum value of this smoothed curve within the genomic coordinates of the candidate CNV is returned as the PhyloP feature. A maximum was chosen considering that higher PhyloP scores indicate higher estimated conservation.
phastCons: The same procedure used to calculate the PhyloP feature was used to estimate conservation according to the phastCons score. Similar to PhyloP, the maximum was chosen as the aggregate function considering that higher phastCons scores indicate higher estimated conservation.
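The smoothing-then-maximum procedure can be sketched on a toy score track. The window size, the truncation of the window at the ends of the track, and the function names are assumptions; the text does not state the window used.

```python
from typing import List


def centered_moving_average(scores: List[float], window: int) -> List[float]:
    """Mean of the scores within window // 2 positions on either side."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - half): i + half + 1]  # truncated at the edges
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def conservation_feature(scores: List[float], start: int, end: int, window: int) -> float:
    """Maximum of the smoothed curve within the CNV's [start, end) span."""
    return max(centered_moving_average(scores, window)[start:end])


track = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0]  # one isolated high-scoring base
print(max(track), conservation_feature(track, 0, len(track), window=3))
```

A single isolated spike dominates the raw maximum but is damped in the smoothed one, which is the behavior the paragraph above relies on.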

Dosage sensitivity (5 features)
HI Score: Dosage sensitivity was estimated by overlapping the candidate CNV with a curated set of dosage sensitive regions described by ClinGen (Rehm et al. 2015). Manually curated HI scores are available for all curated regions. HI scores were one-hot encoded to handle each unique value and split into binary features. In the case of multiple overlapping regions, HI score features were summed.
TS Score: As with HI Score, one-hot encoded binary features were generated from the curated TS scores of annotated dosage sensitive regions. The sum is also taken across all binary features when multiple dosage-sensitive regions overlap with the candidate CNV.
HI Index: The minimum HI Index score (Huang et al. 2010) observed in overlapping dosage-sensitive regions is obtained to represent the HI Index value for the candidate CNV.
The test set was passed to CNVoyant and all comparator algorithms to test for generalizability and generate accuracy metrics for benchmarking. Given that X-CNV, TADA, dbCNV, and ClassifyCNV take GRCh37-aligned variants as input, UCSC liftOver was again called to lift DECIPHER variants from GRCh38 to GRCh37. X-CNV pathogenic probabilities yielded only three unique values across the entirety of the test set, which was unexpected given the generation of a relatively heterogeneous feature set. Rather than attempting to amend the source code to produce more continuous output, X-CNV was omitted from benchmarking.

Measuring Performance
To generate granular performance data in the test cohort, accuracy metrics are reported for each of the three CNV prediction classes: benign, VUS, and pathogenic. The classification probability for each class was utilized to sort the list of CNVs and generate precision-recall (PR) and receiver operating characteristic (ROC) curves. The area under these curves (PR AUC, ROC AUC) is referenced to measure model accuracy, in addition to the average F1 score and overall accuracy for multi-class predictions. F1 and overall accuracy are only reported for CNVoyant, dbCNV, and ClassifyCNV, as these algorithms are the only three that provide multi-class output. dbCNV provides likely pathogenic and likely benign classification designations in addition to pathogenic, VUS, and benign designations. Likely pathogenic predictions were mapped to pathogenic classification and likely benign predictions were mapped to benign classification to ensure a fair comparison to CNVoyant and ClassifyCNV.
TADA, ISV, and StrVCTVRE all output a single score to estimate CNV pathogenic probability. The complement of the pathogenic probability score (1 - Pr(pathogenic)) was calculated to estimate benign significance scores for these comparator algorithms. The ClassifyCNV output is a score rather than a probability, but the complement was still chosen to represent benign significance, as higher scores represent a higher pathogenicity. CNVoyant is the only algorithm that provides benign and VUS probabilities; these values were used in plotting corresponding benign and VUS classification curves. dbCNV does not provide probabilities or a continuous confidence score for classification, so there is no value to reference in plotting ROC and PR curves. As such, dbCNV was not included in the ROC and PR curve comparisons.
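The five-to-three class mapping and the multi-class metrics can be sketched as follows. Interpreting "average F1" as the unweighted (macro) mean of per-class F1 scores is an assumption, as are the label strings and function names; the example labels are invented.

```python
from typing import List, Sequence

# Collapse five-tier output to the three classes used for comparison.
FIVE_TO_THREE = {
    "pathogenic": "pathogenic",
    "likely pathogenic": "pathogenic",
    "VUS": "VUS",
    "likely benign": "benign",
    "benign": "benign",
}
CLASSES = ["benign", "VUS", "pathogenic"]


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], classes: List[str]) -> float:
    """Unweighted mean of per-class F1 (assumed meaning of 'average F1')."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


truth = ["pathogenic", "VUS", "benign", "benign"]
raw = ["likely pathogenic", "VUS", "likely benign", "VUS"]
mapped = [FIVE_TO_THREE[label] for label in raw]
print(round(macro_f1(truth, mapped, CLASSES), 4), accuracy(truth, mapped))
```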
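To illustrate how a single pathogenicity score can be evaluated for both the pathogenic and benign classes, the sketch below computes a one-vs-rest ROC AUC from its rank interpretation (the probability that a randomly chosen positive outscores a randomly chosen negative) and reuses the complement of the score for the benign class. The pairwise implementation is chosen for clarity, not efficiency, and the scores and labels are invented.

```python
from typing import Sequence


def roc_auc(is_positive: Sequence[bool], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count half
    return wins / (len(pos) * len(neg))


labels = ["pathogenic", "pathogenic", "benign", "benign"]
p_pathogenic = [0.9, 0.6, 0.7, 0.1]

# Pathogenic-vs-rest uses the score directly.
auc_pathogenic = roc_auc([l == "pathogenic" for l in labels], p_pathogenic)

# Benign-vs-rest uses the complement, 1 - Pr(pathogenic).
benign_scores = [1.0 - p for p in p_pathogenic]
auc_benign = roc_auc([l == "benign" for l in labels], benign_scores)

print(auc_pathogenic, auc_benign)
```

With only two classes present, as in this toy case, the two AUCs coincide; they can differ once VUS entries are added to the "rest" group.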
Deciphering Feature Influence on CNV Classification with SHAP Analysis
SHAP (SHapley Additive exPlanations) values offer a qualitative analysis tool for understanding how each feature influences the clinical significance prediction for distinct CNV classes. For CNVoyant, we generated SHAP beeswarm plots across all classes to visualize the effect of training features on model prediction (Lundberg and Lee 2017). These plots rank features by their importance and use color coding to depict the direction of their influence on the model's output. Each point on a plot represents a feature's SHAP value for an individual observation, quantifying its contribution to moving the model's prediction from the base value (the dataset's average prediction) toward the actual prediction.
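The additive property described above (base value plus the per-feature contributions equals the prediction) can be checked on a toy model with a brute-force Shapley computation. This is not the SHAP library and not the method used for tree models in practice; it enumerates every feature ordering, which is feasible only for a handful of features. The model, background data, and function names are invented for illustration.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    background: List[Sequence[float]],
) -> Tuple[List[float], float]:
    """Exact Shapley values for one observation, plus the base value."""
    n = len(x)

    def value(fixed: frozenset) -> float:
        # Average prediction with the `fixed` features set to x and the
        # remaining features taken from each background row.
        preds = [
            model([x[i] if i in fixed else row[i] for i in range(n)])
            for row in background
        ]
        return sum(preds) / len(preds)

    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        fixed: set = set()
        for i in order:
            before = value(frozenset(fixed))
            fixed.add(i)
            phi[i] += value(frozenset(fixed)) - before
    return [p / len(orderings) for p in phi], value(frozenset())


def toy_model(z: Sequence[float]) -> float:
    return 2.0 * z[0] + 3.0 * z[1]


x = [3.0, 1.0]
phi, base = shapley_values(toy_model, x, [[0.0, 0.0], [2.0, 2.0]])
print(base, phi, base + sum(phi), toy_model(x))
```

The second feature receives no contribution because its value equals the background mean, while the first feature accounts for the entire shift from the base value to the prediction.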
Table 2
Benchmarking Algorithmic Classification of CNV Clinical Significance. The classification score performance of CNVoyant was compared to four algorithms (ISV, StrVCTVRE, TADA, ClassifyCNV) in determining the clinical significance of deletions, duplications and combined CNVs. The effectiveness of each algorithm was assessed by calculating the area under the curve (AUC) for both the precision-recall (PR AUC) and the receiver operating characteristic (ROC AUC). These metrics were selected to provide a comprehensive evaluation of each classifier's ability to discriminate between clinically significant and non-significant CNVs under various threshold settings. CNVoyant demonstrated superior performance to all compared algorithms across most CNV subsets as evaluated by these metrics, except for StrVCTVRE, which exhibited a higher PR AUC in classifying benign deletion variants. dbCNV was excluded from this comparison due to the absence of a continuous variable necessary for plotting PR and ROC curves.
The most informative features in the SHAP beeswarm plots for pathogenic classification differed between deletion and duplication events (Fig. 4). Pathogenic SNV/indel overlap was the most important feature for deletions, and exon count was the most important feature for duplications. For deletions, the HI index and a curated HI score of "sufficient evidence" were the second and third most important features, respectively. For duplications, bp length and promoter region count were the second and third most important features, respectively. The top five most informative features for both deletions and duplications included exon count and disease count. PhyloP was more informative than phastCons in both variant types but was more important in predicting pathogenic deletions (6th most important feature) than pathogenic duplications (9th most important feature). SHAP beeswarm plots for benign and VUS classification indicated more similar feature importance between duplication and deletion variants (Supplemental Fig. 3, benign (a-b) and VUS (c-d)). For benign classification, the top four features were shared between duplication and deletion variants in the same order of importance. Bp length was the most important feature, followed by pathogenic SNV/indel overlap, PhyloP, and exon count. For VUS classification, bp length was the most important feature, followed by exon count for both duplication and deletion variants. PhyloP was the third most important feature for deletion events, followed by gene count. Pathogenic SNV/indel overlap was the third most important feature for VUS classification in duplications, followed by PhyloP. Population frequency and GC content were relatively uninformative across benign, VUS, and pathogenic predictions.

Discussion
CNVoyant sets a new standard in the classification of clinical significance for CNVs. Our novel algorithm outperformed the five leading algorithms for classifying CNVs (ClassifyCNV, ISV, StrVCTVRE, TADA, dbCNV) across nearly all accuracy metrics and clinical significance classes in the DECIPHER test set, the exception being the PR AUC for benign deletions, where StrVCTVRE scored higher (Table 2). This accuracy underscores CNVoyant's advanced analytical capabilities, especially when considering the complexity and variability resulting from individual provider submissions within the DECIPHER data set. Furthermore, for the first time, our comprehensive evaluation of feature performance within the predictive model has uncovered novel insights into the determinants of CNV clinical significance, offering a deeper understanding of the underlying drivers of classification.
CNVoyant prediction probabilities closely align with the observed class distributions in the DECIPHER test set (Supplemental Fig. 4), supporting the generalizability of these predictions. Regarding explainability, the SHAP values generated from the test set also reflect intuitive reasoning driving predictions. Larger CNVs overlap with more functional and dosage sensitive regions, which are logically more likely to be pathogenic, and this was clearly reflected in the pathogenic SHAP beeswarm plots (Fig. 4 and Supplemental Fig. 3). Conversely, an inverse relationship exists where smaller CNVs that overlap with fewer regions drive benign predictions. The length of a candidate CNV was a simple but highly important feature omitted from ISV, StrVCTVRE, and TADA. Specifically, bp length was the most informative feature for both deletions and duplications in both VUS and benign variants. We hypothesize that a portion of the overall performance gained over the comparator algorithms was due to the addition of this feature.
In deletions, the count of pathogenic SNVs and indels contained within the CNV boundaries was the most important feature in predicting pathogenic significance and the second most important feature in predicting benign significance. This is also to be expected, as regions more intolerant to variation have more disease-causing variants. Given that loss of function is the most common variant type of pathogenic or likely pathogenic ClinVar SNV and indel annotations (72.5% of such variants), the emphasis placed on deletion events aligns with expectations. This trend was further observed in the context of conservation, with deletion variants spanning highly conserved regions having more pathogenic potential. Interestingly, HI and conservation metrics showed predictive value in classifying duplication variants. On further investigation, we found a recent report that HI and TS features largely overlap, which is consistent with our observed trend (Collins et al. 2022).
As previously stated, ClassifyCNV is an algorithm that encodes the logic driven by the most recent ACMG technical standards for CNV interpretation. While there is value in minimizing false positive predictions, especially in clinical settings, the consequence is less accuracy in identifying true negatives. We observed this lack of true negative recall in the DECIPHER cohort, where ClassifyCNV predicted that only 4.7% of the 2,651 benign CNVs had benign clinical significance. This effectively leaves 98.9% of called CNVs to be interpreted by clinical genomicists, a value that does not significantly reduce the burden of variant interpretation. ML methods can address this issue, as they can consider features that are not included in the current clinical guidelines. Comparator models each have certain blind spots that CNVoyant aims to account for. TADA fails to contextualize the genomic position of the CNV itself and instead focuses on overlapping topologically associating domains. ISV and StrVCTVRE address these shortcomings but fail to consider reported pathogenic SNVs and indels in ClinVar. Echoing the principle of Occam's Razor, CNVoyant underscores the power of simplicity by leveraging a concise set of features to outperform more complex models. This approach streamlines the analytical process and enhances the model's explainability, affirming the notion that simplicity often leads to superior outcomes.
The potential for variability in the rigor with which clinical significance is assigned within the DECIPHER dataset reflects one potential shortcoming of our study. CNVs in this test set were submitted by individual providers, and submitters likely used varying methods to assess the clinical significance of a given CNV. Given the vast size of this test dataset (21,574 CNVs) and the challenges of reassessing all these CNVs with a standardized set of guidelines, we had to accept the assigned significance label in our test data.
In the realm of clinical genomics, experts frequently encounter CNVs that may straddle the line between different clinical significance classes. In these cases, understanding the rationale behind ML classification can be invaluable. CNVoyant addresses this need by providing a novel feature amongst existing CNV classification algorithms, SHAP force plots (Supplemental Fig. 5). This approach enables CNVoyant to effectively highlight the CNV features with the greatest influence on the model's classification decision, providing critical insights to guide clinical interpretation.

Conclusions
The advent of GS technologies and advanced algorithms has revolutionized our ability to detect genetic variants, including segmental duplications and deletions, reliably. Clinical genomics experts must painstakingly interpret these CNVs to determine their relevance to a patient's suspected genetic condition.
To aid this process, we introduce CNVoyant, a highly accurate algorithm for classifying the clinical significance of CNVs. CNVoyant's accuracy in classifying CNVs' clinical significance is driven by a unique ML architecture and a carefully selected set of features that capture the multitude of factors that should be considered when evaluating the impact of a CNV. Importantly, CNVoyant demystifies ML decisions through SHAP force plots, providing the rationale behind the algorithm's classification for any given CNV and enhancing transparency for clinicians. With the source code publicly available, CNVoyant invites continuous evolution, allowing for retraining with new data or specific populations. This adaptability will help CNVoyant remain at the forefront of genomic medicine, simplifying variant prioritization and scaling to meet the demands of expanding GS applications.
CNVoyant not only sets a new standard for accuracy and explainability, but also advances the capability to discern pathogenic significance, marking a significant leap in genomics.
ne functional regions of the genome (O'Leary et al. 2016; Howe et al. 2021), catalog observed genetic variants (Sherry 2001; Exome Aggregation Consortium et al. 2016), record variant frequencies in different human populations (1000 Genomes Project Consortium et al. 2015; The UK10K Consortium et al. 2015;
location and the gene's genomic coordinates drawn from the RefSeq database (O'Leary et al. 2016). All overlap calculations were made via the Bedtools intersect function (Quinlan and Hall 2010).

Figure 2 Training

Table 1
Distribution of Variant Type and Clinical Significance in Training and Test Sets. 52,176 total CNVs were included in the ClinVar training set, and 21,574 total CNVs were included in the DECIPHER test set. The training set generally favored variants of benign significance, with pathogenic significance encompassing the fewest number of variants. This trend was reversed in the test set, which heavily favored VUS and pathogenic CNVs over those with benign significance. Clinical significance class distribution was generally consistent between duplication and deletion events except for more pathogenic variants being present in deletions.
... ClinVar SNVs/indels, and bp length). These features are log-transformed to reflect more normal distributions. All features are normalized via the sklearn.preprocessing.MinMaxScaler Python class prior to training or prediction (Pedregosa et al. 2011).
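The log transform followed by min-max normalization can be illustrated without any dependencies. This is a plain-Python stand-in for the scaling step, not the scikit-learn class itself; the use of log(1 + x) to accommodate zero counts is an assumption, since the text does not state the exact transform, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def log_min_max(values: Sequence[float]) -> List[float]:
    """Apply log(1 + x), then rescale linearly to the [0, 1] range."""
    logged = [math.log1p(v) for v in values]  # log1p keeps zero counts defined
    low, high = min(logged), max(logged)
    if high == low:
        return [0.0 for _ in logged]  # constant feature (assumed handling)
    return [(v - low) / (high - low) for v in logged]


# A skewed count feature: the middle value lands halfway on the log scale.
print(log_min_max([0, 9, 99]))
```

In a real pipeline the minimum and maximum would be learned from the training set and reused at prediction time, which is what fitting a scaler once and applying it later achieves.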
al. 2010) observed in overlapping dosage-sensitive regions is obtained to represent the HI Index value for the candidate CNV. HI Index estimates the probability of loss intolerance, where lower values predict haploinsufficiency.
Training Procedure and Model Selection
For model training, features were generated for CNVs included in the ClinVar training set before partitioning into deletion and duplication sets. Separate multi-class models were trained for duplications and deletions, each predicting whether a candidate CNV has benign, uncertain, or pathogenic clinical significance. 29 common ML architectures were tested via 5-fold cross-validation for each CNV type before selecting the top performers according to multi-class F1 score. Following architecture selection, 5-fold cross-validation was again employed to tune hyperparameters and find the most accurate model. 10,000 sets of hyperparameter values were tested for each CNV type via the sklearn.model_selection.RandomizedSearchCV class. The top-performing models were calibrated to class distributions in the training set via the isotonic method implemented in the sklearn.calibration.CalibratedClassifierCV class.
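The search-then-calibrate sequence can be sketched with the two scikit-learn classes named above. Everything else here is an assumption made to keep the example small and runnable: the synthetic three-class data, the random forest stand-in for the selected architecture, the parameter grid, the macro-averaged F1 scoring, and the five sampled configurations in place of 10,000.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for one CNV type's feature matrix and 3-class labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Randomized hyperparameter search scored by multi-class F1 with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [10, 25, 50],
        "max_depth": [2, 4, 8, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=5,  # the text reports 10,000 per CNV type
    scoring="f1_macro",
    cv=5,
    random_state=0,
)
search.fit(X, y)

# Isotonic calibration of the best configuration found by the search.
calibrated = CalibratedClassifierCV(search.best_estimator_, method="isotonic", cv=5)
calibrated.fit(X, y)

probabilities = calibrated.predict_proba(X[:3])  # one column per class
print(search.best_params_, probabilities.shape)
```

In the setup described in the text, this sequence would be run twice, once for deletions and once for duplications.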
pLI: The maximum pLI score (Exome Aggregation Consortium et al. 2016) observed in overlapping dosage-sensitive regions is obtained from gnomAD to represent the pLI value for the candidate CNV. pLI estimates the probability of loss intolerance, where higher values predict haploinsufficiency.
LOEUF: The minimum LOEUF score (Karczewski et al. 2020) observed in overlapping dosage-sensitive regions is obtained to represent the LOEUF value for the candidate CNV. LOEUF (the loss-of-function observed/expected upper bound fraction) measures loss intolerance, where lower values predict haploinsufficiency.
Comparator Algorithm Selection
CNVoyant predictions were compared to five published ML-based CNV pathogenicity classifiers, X-CNV (Zhang et al. 2021), TADA (Hertzberg et al. 2022), dbCNV (Lv et al. 2023), StrVCTVRE (Sharo et al. 2022), and ISV (Gažiová et al. 2022), as well as the algorithmic implementation of the ACMG technical standards for CNV interpretation, ClassifyCNV (Gurbich and Ilinsky 2020). The test set was passed to
classification decision, providing critical insights to guide clinical interpretation. With this in mind, we engineered CNVoyant to export the plots into portable static image files, which can easily be attached to clinical notes or reports.
It should be noted that CNVoyant alone cannot predict the diagnostic significance of a patient's CNVs. Other factors must be considered, including variant zygosity, phenotypic overlap with associated diseases, mode of inheritance of associated diseases, presence of additional variants in trans, and differences in CNV frequencies in clinical settings enriched with patients affected by genetic conditions. CNVoyant should instead be included in diagnostic classification architectures to limit candidate diagnostic CNVs to only those with requisite probabilities of pathogenicity. CNVoyant outputs probabilities of benign, uncertain, and pathogenic significance, which are predicated on label distributions in the training set. In clinical cases, there are far more benign variants than variants with uncertain or pathogenic significance. Often, a patient will only have CNVs of benign clinical significance. To combat this class imbalance, CNVoyant should be retrained on interpreted CNVs from real patients before implementation in a clinical decision-support setting. While the current catalog of annotated CNVs is limited, it is expected to grow rapidly as more CNVs are detected and interpreted in clinical settings. In anticipation of new data, we have open-sourced CNVoyant's source code to allow users to train models with new data using the same feature set and architecture.
Finally, it should be reiterated that CNVs are only one category within the larger domain of structural variation. Additional pathogenicity prediction algorithms are required to predict the clinical significance of other types of structural variation, including inversion and translocation events. Comprehensive variant prioritization algorithms must account for all structural variant types and simultaneously consider shorter variants, including SNVs and indels.
Declarations
... and objectives were aligned with the ethical guidelines for research and data use, including respecting privacy, intellectual property rights, and the integrity of the data.