Gene-specific artificial intelligence-based variant classification engine: results of a time-capsule experiment

Background: Interpretation of genetic variation remains an impediment to cost-effective application of genomics to medicine. An advanced artificial intelligence (AI)-based Variant Classification Engine (aiVCE), rooted in ACMG/AMP guidelines, employs data-driven methods to expedite gene-specific classification (franklin.genoox.com). In this blinded study, the aiVCE’s overall and rule-level performance was evaluated using ClinVar (v. 2018-10) variants with creation dates after 5/01/2017. Any prior knowledge of these variants was removed from the aiVCE training data so that they were treated as novel variants. Using a ‘Full’ dataset (75,801 variants with ≥1 star) and an ‘Increased-Certainty’ dataset (3,993 variants with ≥2 stars), the aiVCE classified variants as pathogenic (P), likely-pathogenic (LP), uncertain significance (VUS), likely-benign (LB), or benign (B). VUS with sufficient supporting data were subclassified as VUS-leaning benign or VUS-leaning pathogenic. aiVCE results were evaluated to determine concordance with final ClinVar classification and rule-level determinations. Results: The aiVCE demonstrated >97% concordance among Increased-Certainty variants. Concordance was >95% across variant effects (e.g., missense, null, splice region), and was >93.5% for the Full dataset. When assessing the aiVCE’s application of specific ACMG rules, significant differences were observed between the rule-met proportions of ClinVar P/LP and B/LB variants (all P<0.00001), thus supporting gene-specific rule selections. Evaluation of discordance between the aiVCE and ClinVar uncovered evidence that might have been unavailable to submitting laboratories, highlighting AI utility in variant classification. Conclusions: The aiVCE exhibited robust performance, despite lacking past evidence, in determining whether variants would be categorized as P/LP. Applying the latest computational advances to existing guidelines may help scientists and clinicians interpret variants with limited clinical information and greatly reduce analytical bottlenecks.


Background
The science of human genomics has greatly benefited from the advent of high-throughput next-generation sequencing (NGS) technology. Information derived from the classification of variants is critical to discovering or confirming disease etiologies and guiding treatment guidelines and patient-specific plans. Variant classification, however, is a complex task, requiring assemblage and assessment of currently available information. 1 As the use of NGS and shared archives has expanded, so has the volume of variant classification data and the need for novel analytical approaches. 2 In 2015, the American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP) jointly published standards and guidelines for variant classification to homogenize methods and reduce discordance between clinical laboratories. 3 These guidelines apply weighted rules across multiple categories such as variant frequency, variant type, association to previous reports for pathogenicity, and consistency with inheritance model. Subsequently, the Clinical Genome Resource (ClinGen) Expert Gene/Disease Panels (EPs) were tasked with defining application of ACMG/AMP guidelines in specific genes/diseases. 4 The ACMG/AMP guidelines have been adopted worldwide. While the guidelines provide important direction to clinical and research geneticists, there remains a great deal of evidence data to assimilate into the classification process. Further, implementation of new recommendations for more accurate interpretation of variants involves demanding and complex bioinformatics work. As such, artificial intelligence (AI)-based solutions can be used to facilitate and scale the process of implementing gene-specific recommendations, enhancing current interpretation guidelines.
A novel AI-based Variant Classification Engine (aiVCE) algorithm has been developed to integrate knowledge from available databases and published literature. The aiVCE is data-driven, extracting existing evidential data from various sources on an ongoing basis, allowing for accurate, consistent, and rapid variant classification per ACMG/AMP standards and guidelines. The aiVCE algorithm can assess all classification criteria amenable to automation, while case level information, e.g., de novo evidence, segregation data, or allelic data, can be provided as additional input by the user. The aiVCE places particular emphasis on considering gene-specific evidence at the gene level, consistent with the latest efforts by ClinGen EPs. [5][6][7][8][9][10][11] Further, frequency-related rule thresholds for different variants of a specific gene can be customized.
Recent guidelines set forth by the AMP and College of American Pathologists emphasize the importance of validating pipeline tools and algorithms. 12 As such, we sought to validate the novel aiVCE using the ClinVar database 13 and conducted a blinded time-capsule experiment to predict the ability of our algorithm to classify variants that were only uploaded to ClinVar after the time-capsule cutoff date.

Implementation
Artificial intelligence-based variant classification engine (aiVCE)
The proprietary aiVCE is based upon ACMG/AMP standards and guidelines. 3 While these guidelines provide a clear and detailed framework for variant classification, implementation can vary. The aiVCE employs a data-driven, machine-learning process to implement variant classification rules for novel, as well as previously reported, variants.
Specifically, the aiVCE continually assimilates information at the variant level from ClinVar, UniProt, gnomAD, other public data sources, and in-house manually-curated variant databases to establish variant-level knowledge that is extrapolated to the gene level to propose a variant classification of benign (B), likely benign (LB), variant of uncertain significance (VUS), likely pathogenic (LP), or pathogenic (P). For users desiring additional information about VUS, the aiVCE has the capability to further classify such variants into three categories: VUS-leaning benign (VUS-LB), VUS, and VUS-leaning pathogenic (VUS-LP). Specifically, the VUS-LB and VUS-LP variants would be those considered VUS according to the ACMG/AMP guidelines, but that have additional evidence available for consideration (i.e., the aiVCE would subclassify VUS as VUS-LB for variants with evidence for being B but without enough support for being classified as LB, or as VUS-LP for variants with evidence of being P yet not enough for being classified as LP). The aiVCE automates 17/28 ACMG rules: PVS1, PS1, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BA1, BS1, BS2, BP1, BP3, BP4, BP6, and BP7. The remaining rules cannot be automated, as they require clinical information specific to the patient genotype, e.g., familial data, de novo evidence (PS2, PM6), segregation data (PP1, BS4), and/or allelic data (PM3, BP2). The evidence for rules that cannot be automated, however, can be provided by the user as an input for the aiVCE. In addition, the aiVCE provides the option to override automated criteria. Gene symbols were derived from the HUGO Gene Nomenclature Committee database. 14 Transcript and exon information was based on the Reference Sequence (RefSeq) database. 15 To moderate the propensity for discordance and conflicting interpretation inherent in variant classification, 1 we developed a model for assigning a confidence level to variant classification (Confidence Model).
Specifically, an in-house curated dataset of thousands of variants, for which classification was manually curated and thus highly confident, was employed to identify variant features, e.g., number of submissions, submitting organization, conflicting evidence, and date of submission, contributing to variant classification certainty. The Confidence Model assigned a Confidence Score to submitted variants consolidated from ClinVar, UniProt, and other sources to create an internal Variant KnowledgeBase (VarKB), which forms the basis of classifying any new variant in the aiVCE. Further, VarKB data were employed to generate aggregated models at the gene level, i.e., a Gene KnowledgeBase (geneKB), to determine frequency thresholds and known disease mechanisms for the gene (Figure 1).
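As a rough illustration of how met rules translate into the five-tier classes and VUS sub-tiers described above, the following sketch applies the published ACMG/AMP 2015 combining criteria to a list of rule codes. It is not the aiVCE's implementation: the function names are hypothetical, rule strength is read from the rule prefix only (strength modifications such as PVS1_Moderate are ignored), and the sub-tier logic (any pathogenic-only or benign-only evidence shifts a VUS toward VUS-LP or VUS-LB) is a simplifying assumption.

```python
from collections import Counter

# Default evidence strength encoded in the ACMG/AMP rule prefix.
PATHOGENIC_PREFIXES = {"PVS": "very_strong", "PS": "strong",
                       "PM": "moderate", "PP": "supporting"}
BENIGN_PREFIXES = {"BA": "stand_alone", "BS": "strong", "BP": "supporting"}


def _strength_counts(rules, prefixes):
    """Count met rules per strength level, e.g. 'PM2' -> 'moderate'."""
    counts = Counter()
    for rule in rules:
        prefix = rule.rstrip("0123456789")
        if prefix in prefixes:
            counts[prefixes[prefix]] += 1
    return counts


def classify(rules_met):
    """Combine met rules into P/LP/VUS-LP/VUS/VUS-LB/LB/B (illustrative)."""
    p = _strength_counts(rules_met, PATHOGENIC_PREFIXES)
    b = _strength_counts(rules_met, BENIGN_PREFIXES)
    pvs, ps, pm, pp = (p["very_strong"], p["strong"],
                       p["moderate"], p["supporting"])
    ba, bs, bp = b["stand_alone"], b["strong"], b["supporting"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2)
                         or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs == 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp >= 1) or bp >= 2

    p_call = "P" if pathogenic else "LP" if likely_pathogenic else None
    b_call = "B" if benign else "LB" if likely_benign else None
    if p_call and b_call:
        return "VUS"  # contradictory evidence
    if p_call or b_call:
        return p_call or b_call

    # Assumed sub-tier logic: one-sided evidence leans the VUS.
    has_p, has_b = sum(p.values()) > 0, sum(b.values()) > 0
    if has_p and not has_b:
        return "VUS-LP"
    if has_b and not has_p:
        return "VUS-LB"
    return "VUS"
```

Under these assumptions, a null variant meeting only PVS1 stays in the VUS range (as VUS-LP), while PVS1 together with PM2 reaches LP.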

Frequency threshold determinations
With a focus on not missing any P variants, the aiVCE emphasizes robust sensitivity in detecting P variants, even at the expense of possibly lowering specificity (i.e., slightly more false P determinations). For instance, in some cases where confidence of the evidence is suboptimal, the aiVCE would still consider a P-supporting rule as met, but could assign a lower strength or Confidence Score. Even in cases where statistical modeling or an AI approach determines whether a rule was met, the aiVCE provides comprehensive annotations for the clinician's use when determining the final classification. For example, if the PM1 rule (hotspot region) is met, the aiVCE would provide the numbers and examples of P variants found in that region.
geneKB was used to determine frequency thresholds for the rules related to Population Data (PM2, BA1, BS1, BS2). Consistent with recent ClinGen EP methods, 4 the threshold for BS1 (Allele frequency is greater than expected for disorder) was defined first. Comparing multiple models for predicting this threshold showed that the most frequent pathogenic (MFP) variants for each gene, coupled with the P Confidence Score, provided the most robust results.
Specific to this experiment, strict thresholds were balanced against the high weight of the B-supporting frequency rules, including possible mild phenotypes. The BS1 threshold was predicted using MFP variants and the observed frequencies in each subpopulation, but not <0.1%. The thresholds for PM2 and BA1 were then set as one order of magnitude lower and higher, respectively, than the predicted threshold for BS1, but not <0.5%. To avoid P variants with much higher frequency than others (e.g., GJB2:p.Val37Ile) having undue influence, and consistent with the recent ClinGen EP BA1 recommendations, 5  and PM4 rules, the repetitive nature of the region was determined via RepeatMasker (http://www.repeatmasker.org/). Throughout, the effect of the variant on the protein was determined using RefSeq database transcripts. 15 Specific to hotspot regions (PM1 rule) for each exon/domain, a sliding window within the aiVCE initially extracts candidate regions between each pair of B variants or at the edges of the region to be clear of B variants; candidate regions without P variants are ignored.
Within each resulting candidate region, the aiVCE further detects inner borders of P variants contained within, and determines the number of P variants. Based on the density and overall number of P variants, the aiVCE evaluates each region for the presence of hotspots and then assigns a weight to the PM1 rule as 'supporting,' 'moderate,' or 'strong.' The aiVCE's weighting algorithm differentiates between the inner P region and the region between P and B variants.
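The candidate-region step for PM1 can be sketched as follows. This is an illustrative reading of the description above rather than the proprietary algorithm: positions are plain integer coordinates within one exon or domain, the function names are hypothetical, and the count and density cutoffs used to assign a strength are invented placeholders.

```python
def candidate_hotspots(region_start, region_end, pathogenic_pos, benign_pos):
    """Split a region at B variant positions and keep segments holding P variants.

    Returns one dict per candidate region with its outer borders (clear of
    B variants), the inner borders spanned by P variants, the number of
    P variants, and their density over the inner span.
    """
    inside_b = sorted({b for b in benign_pos if region_start <= b <= region_end})
    borders = [region_start - 1] + inside_b + [region_end + 1]
    regions = []
    for left, right in zip(borders, borders[1:]):
        inside = sorted(p for p in pathogenic_pos if left < p < right)
        if not inside:
            continue  # candidate regions without P variants are ignored
        inner_start, inner_end = inside[0], inside[-1]
        span = inner_end - inner_start + 1
        regions.append({
            "outer": (left + 1, right - 1),
            "inner": (inner_start, inner_end),
            "n_pathogenic": len(inside),
            "density": len(inside) / span,
        })
    return regions


def pm1_strength(n_pathogenic, density, min_count=3):
    """Map count and density to a PM1 weight (cutoffs are hypothetical)."""
    if n_pathogenic < min_count:
        return None
    if density >= 0.5:
        return "strong"
    if density >= 0.2:
        return "moderate"
    return "supporting"


regions = candidate_hotspots(1, 100, pathogenic_pos=[10, 12, 15, 60],
                             benign_pos=[30, 80])
```

In this toy input, the segment before the first B variant holds three P variants within six positions, while the middle segment holds a single P variant and would not reach the assumed minimum count.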
For rules related to in silico predictions (PP3, BP4), we developed an in-house training dataset (based on VarKB), rather than relying on external algorithms trained on different data, to enhance prediction capabilities. Specifically, training data comprised P and B variants, excluding those previously used to train constituent tools. Employing a logistic regression algorithm to determine if "multiple lines of computational evidence support a deleterious effect on the gene or gene product," a single aggregated score was generated to determine if the variant was to be considered deleterious.
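To make the idea of a single aggregated in silico score concrete, the sketch below shows only the inference step of a logistic model: several tool scores are combined through a weighted sum and a sigmoid, and the result is compared with cutoffs for PP3 and BP4. The tool names, weights, intercept, and cutoffs are all hypothetical placeholders; in the approach described above, the coefficients would be fitted on the VarKB-derived training set.

```python
import math

# Hypothetical coefficients; a real model would learn these from training data.
WEIGHTS = {"tool_a": 2.0, "tool_b": 1.5, "tool_c": 1.0}
INTERCEPT = -2.25


def aggregated_score(tool_scores):
    """Logistic combination of per-tool scores (each assumed scaled to 0..1)."""
    z = INTERCEPT + sum(WEIGHTS[name] * tool_scores[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def in_silico_rule(score, pp3_cutoff=0.8, bp4_cutoff=0.2):
    """Return 'PP3', 'BP4', or None for intermediate scores (assumed cutoffs)."""
    if score >= pp3_cutoff:
        return "PP3"
    if score <= bp4_cutoff:
        return "BP4"
    return None


deleterious = aggregated_score({"tool_a": 1.0, "tool_b": 1.0, "tool_c": 1.0})
tolerated = aggregated_score({"tool_a": 0.0, "tool_b": 0.0, "tool_c": 0.0})
```

Leaving an intermediate band in which neither rule is called is a design choice of this sketch; the source does not state how borderline scores are handled.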
Application of the BP1 rule was determined based on the number of null P variants and the number of non-P variants in the gene, as well as their ratio. Similarly, calling of the PP2 rule was determined based on the number of missense P variants and missense B variants in the gene and their ratio.
Recent opinions suggest that the use of reputable source data rules (PP5, BP6) is preferable to expert opinion in the absence of primary data. 17 Given that primary data sharing remains a challenge, the PP5 and BP6 rules contribute to the aiVCE's ability to classify variants for which evidence, but not primary data, exists. Note that in the experiments described in this paper, these rules were disabled, as the purpose of the time-capsule experiment was to assess the aiVCE's classification of a novel variant with no previously published data. Reputable sources were considered those used for creating VarKB. Rule strength was adjusted according to the Confidence Score, as described previously.

Benchmarking experiments
We benchmarked the aiVCE classification model using ClinVar 13  Laboratory Improvement Amendments (CLIA)-certified laboratory submitters without conflicts. In addition, because the ClinVar database does not represent a gold-standard database, a subset of "Increased-Certainty" variants, i.e., variants with ≥2 stars from ≥2 submitters with no conflicts or that were EP-reviewed (https://www.ncbi.nlm.nih.gov/clinvar/docs/details/), was also interrogated. Further, evidence from clinical databases specific to the 'Test' set was excluded from the algorithm during training so as not to bias the results.
As VUS classification does not represent the actual behavior of a variant, but only that its true classification is unknown, we also assessed concordance when excluding ClinVar's VUS (i.e., P/LP vs. B/LB). To evaluate the impact of the aiVCE's VUS subclassification system on final classification, we further examined how many of the ClinVar VUS variants were prioritized correctly. A two-tailed z-test was used to compute P-values when comparing observed proportions.
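For readers who want to reproduce this kind of comparison, a two-tailed z-test for two proportions can be computed as in the sketch below. It uses the standard pooled-variance form; whether the authors pooled the variance is not stated, and the counts in the example are invented for illustration only.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test comparing proportions x1/n1 and x2/n2 (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p1 - p2) / se
    # Two-tailed P-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: a rule met in 620/1000 P/LP and 70/1000 B/LB variants.
z, p_value = two_proportion_z_test(620, 1000, 70, 1000)
```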
Concordance between the aiVCE and ClinVar classifications was also assessed by variant effect (e.g., missense, null, splice region), noting that each region is distinct, with no overlap. As such, canonical splice sites (± 1-2) are not included in splice regions (± 3-10), and intronic regions do not contain either of these regions. We further examined specific variants with conflicting classifications (i.e., discordance), to delineate mechanisms of differentiation between ClinVar and the aiVCE, and summarized the distribution of ClinVar variants according to the aiVCE application of ACMG rules.
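The non-overlapping region definitions can be expressed as a small lookup on the distance from the nearest exon boundary. The sketch assumes an HGVS-style intronic offset (for example, +5 in c.123+5G>A or -1 in c.389-1G>C) as input; the function name and category labels are hypothetical, and exonic positions are deliberately out of scope.

```python
def intronic_category(offset_from_exon):
    """Classify an intronic position by its offset from the nearest exon.

    Offsets of 1-2 are canonical splice sites, 3-10 are splice region,
    and anything farther is plain intronic, so the categories never overlap.
    """
    distance = abs(offset_from_exon)
    if distance == 0:
        raise ValueError("offset 0 is exonic, not intronic")
    if distance <= 2:
        return "canonical_splice_site"
    if distance <= 10:
        return "splice_region"
    return "intronic"
```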
As frequency-related rules pose a challenge owing to the different disease mechanisms, we assessed aiVCE performance for such rules across diseases using the following six gene panels from the Genomics England PanelApp

Datasets
Our experiments employed a 'Full' dataset and an 'Increased-Certainty' dataset. rules not applied) synonymous variants or variants located in non-coding regions, but not within the splice region. Given that additional case-specific evidence, based on the ACMG/AMP guidelines, would be required to classify such variants as LB rather than VUS, For the alternate scenario of moderately conflicted variants, i.e., when a variant was classified as VUS by the aiVCE but was P/LP in ClinVar, 48/58 variants were missense variants (n = 41) or variants located in splice region (n = 7) (Additional file 1). Manual examination of several variants, for which detailed classification information was available in ClinVar, indicated these variants were classified as P/LP based on additional evidence that was not available to the aiVCE, including patient-level data extracted manually from the literature (e.g., clinical information from the affected patient, de novo variant, segregation data) (Additional file 1). For example, the variant NM_004863.3:c.547C>T in the SPTLC2 gene (p.Arg183Trp) was called as P, as it has been reported to segregate with autosomal dominant hereditary sensory and autonomic neuropathy type 1C in two families (https://www.ncbi.nlm.nih.gov/clinvar/variation/487224/). Given that this information is not available to, and associated rules are not applied automatically by, the aiVCE, the classification remained VUS. Of note, the aiVCE subclassified 34/58 variants as VUS-LP, suggesting a greater likelihood the variant is P/LP.
In addition, seven of the 58 moderately conflicted variants classified as VUS by the aiVCE but as P/LP in ClinVar were null variants. Five of these null variants were splice donor/acceptor variants called as PVS1_Moderate based on the recent ClinGen EP recommendations, 16 as the reading frame was not disrupted and the altered region was not known to be critical to protein function. For the remaining two null variants, although they were called as PVS1_Very Strong, their frequency was above their specific gene threshold to meet the PM2 rule. Specifically, for NM_012144.3:c.389-1G>C (p.Gly134Arg), the aiVCE suggested a threshold of 0.00111 for applying PM2 in the DNAI1 gene, yet due to a frequency of 0.00128 in the gnomAD (https://gnomad.broadinstitute.org) 'Other' population, the PM2 rule was not met. For NM_199292.2:c.457C>T (p.Arg153Ter), the suggested frequency for applying PM2 for the TH gene exceeded 0.0005, yet it appeared at a frequency of 0.0007 in the gnomAD East Asian population. The observed frequency for both variants was very close to their gene-specific suggested PM2 threshold. Although these variants did not meet the PM2 rule, they were still below suggested thresholds for applying BS1, owing to the fact that the aiVCE has an uncertain frequency region for determining PM2 or BS1 application in cases where no rule is applied.
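The uncertain frequency region mentioned above can be illustrated with a short sketch. It assumes, following the Methods, that the BS1 threshold sits one order of magnitude above the PM2 threshold and the BA1 threshold one order above BS1; the lower bounds on thresholds and the BS2 rule are ignored, and the function name is hypothetical. Actual aiVCE thresholds come from geneKB and may differ.

```python
def frequency_rule(max_subpopulation_af, pm2_threshold):
    """Return the frequency rule met for an allele frequency, or None.

    Assumes BS1 = 10 x PM2 threshold and BA1 = 10 x BS1 threshold, which
    leaves an uncertain band between PM2 and BS1 where no rule applies.
    """
    bs1_threshold = pm2_threshold * 10
    ba1_threshold = bs1_threshold * 10
    if max_subpopulation_af < pm2_threshold:
        return "PM2"
    if max_subpopulation_af > ba1_threshold:
        return "BA1"
    if max_subpopulation_af > bs1_threshold:
        return "BS1"
    return None  # uncertain region: too common for PM2, too rare for BS1


# DNAI1 example from the text: PM2 threshold 0.00111, observed 0.00128.
dnai1_rule = frequency_rule(0.00128, pm2_threshold=0.00111)
```

With the DNAI1 values, the observed frequency exceeds the PM2 threshold but stays well below the assumed BS1 threshold, so neither rule is called, matching the behavior described for that variant.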
Of particular note, five discordant variants, which were classified as VUS by the aiVCE and

Variants and ACMG rules
The distribution of ClinVar variants, according to the aiVCE application of ACMG rules and ClinVar submitter classifications, is shown in Figure 3. A significant difference (p<0.0001) was observed for use of P-supporting rules, as well as for application of B-supporting rules, to P/LP vs. B/LB variants.
When considering gene-specific rules for missense variants (Figure 4a), which are among the most difficult variants to classify, the aiVCE differentially applied P-supporting (PS1, PM1, PM5, PP2, PP3) rules to P/LP variants, and B-supporting (BP1) rules to B/LB variants (P<0.00001 for each rule). Further, application of the PVS1 rule was significantly different between P/LP vs. B/LB LOF variants (P<0.00001) (Figure 4b). The aiVCE effectively differentiated between P/LP vs. B/LB variants for both missense variants (Figure 4a) and variants located in splice regions (Figure 4c). The PP3 rule was applied for 62.18% of the missense and 76.19% of the splice region P/LP variants but for only 7.0% and 1.6% of the missense and splice region B/LB variants, respectively (P<0.00001). Similarly, the BP4 rule was applied for 60.7% and 66.7% of the B/LB missense and splice region variants, respectively, but for only 9.1% and 4.7% of the corresponding P/LP variants (P<0.00001). As expected, the BP7 rule was applied to all synonymous variants (Figure 4d). While B-supporting frequency rules were applied to more B/LB than P/LP variants, the difference was not significant.
When assessing the aiVCE's application of ACMG frequency-related rules (BA1, BS1, BS2, PM2), significant differences were observed between ClinVar P/LP vs. B/LB missense, LOF and splice region variants (P<0.00001 for each rule), thus supporting the gene-specific thresholds selected for these rules. Given that more rare (and thus difficult to classify) than common variants are submitted to ClinVar, it is not surprising that the aiVCE applied the PM2 rule to >50% of variants within each type of variant effect and for 99.67% and 83.18% of the ClinVar P/LP vs. B/LB variants, respectively (Figure 4).

aiVCE classification performance across disease categories
To examine the robustness of the aiVCE across different disease groups, the following six gene panels from the Genomics England PanelApp The PM2 rule (extremely low frequency in population databases) was met in >99.7% of the P/LP variants, the BS1 rule was not met or met for only 1-2 (<0.1%) of the P/LP variants, and the BA1 rule was never met for P/LP variants (Figure 5), indicating that gene-specific and varied gene frequencies can be incorporated into the AI.
Specific to PM2 thresholds used for the genes in the different disease panels (Figure 6)

Discussion
Herein, we benchmarked an aiVCE algorithm, previously shown to be a robust platform for comprehensive downstream analysis and identification of DNA variants responsible for disease. [21][22][23][24] Specifically, we assessed aiVCE concordance with final ClinVar classifications and aiVCE rule-level determinations. Results reported herein suggest that the aiVCE has the potential to streamline variant classification for the variant scientist by automating ACMG rules for which supporting evidence is available. Despite the exclusion of ClinVar-derived data from the algorithm to guard against potential bias, and automating only some of the rules, the aiVCE demonstrated robust (>97%) concordance in determining whether future variants would be categorized as P/LP. Further, the aiVCE accurately predicted thresholds for variant/allele frequency-based rules. The aiVCE-determined PM2 (extremely low frequency) thresholds, which averaged <0.0015 across disease categories, were several orders of magnitude lower than the typical default hard threshold (0.005 or 0.010) suggested by other tools, 25,26 while BA1 (common allele) was not met (0%) and BS1 (frequency greater than expected) was met for <0.1% of the P/LP variants.
Observations related to the PP3 and BP4 rules suggest that the aiVCE's AI-based prediction model was sensitive and specific in classifying variants. For example, for splice region variants, the aiVCE called BP4 for 67.0% of the B variants and only 4.7% of the P variants, while the PP3 rule was met for 76.1% of the P variants and only 1.5% of the B variants. Application of other gene-specific rules, including PVS1, PM1, PP2, and BP1, also exhibited significant differentiation between B/LB and P/LP variants. These results show the utility of the aiVCE in applying gene-specific evidence and knowledge that may prove useful to the variant scientist.
Many of the rare B/LB variants in ClinVar were classified as VUS by the aiVCE. Most of these variants are very rare non-coding region variants or synonymous variants not located near a splice region, for which additional case-specific evidence is required for classification as B/LB. With the advent of deep splice/synonymous prediction tools, 27 the PP3/BP4 rules should be useful in providing evidence required to re-classify these VUS to LB. The ACMG/AMP criteria comprise more P-supporting than B-supporting rules, and inclusion of additional B-supporting evidence may be warranted. For example, similar to the PM5/PS1 rules, the same codon with the same/different amino acid change as a known B variant could be employed as B-supporting evidence, in line with the Evidence-based Network for Interpretation of Germline Mutant Alleles' (https://enigmaconsortium.org) classification criteria for BRCA1/2 genes. In the same way, although the PVS1 rule is applied to genes in which LOF is a common mechanism for disease, no B-supporting rule exists for null variants of genes in which LOF is known not to be a mechanism for a disease. Thus, such B-supporting evidence should be considered as well.
Examination of discordance between the aiVCE and ClinVar highlights the importance of performing re-analysis on a regular basis. Specifically, several variants classified as VUS by ClinVar submitters were found to be LP by the aiVCE, which might suggest evidence was unknown to the submitter at the time, e.g., different amino acid change than in a known P variant (PM5), or lack of clinical domain expert knowledge such as a gene's hotspot region or critical functional domain (PM1), further demonstrating the utility of the aiVCE. On the other hand, development of a more structured approach to incorporate case level information by databases such as ClinVar would provide additional evidence currently missing in the aiVCE, such as segregation or de novo data. In the current era of AI, such evidence can be rapidly assimilated into existing and in-development algorithms.
The aiVCE is a data-driven, AI-based tool that relies on previous evidence derived from various data sources, making data-sharing and community initiatives like ClinVar and gnomAD essential. The availability of more evidence for the aiVCE to train its models only improves its ability to provide accurate variant assessments. The time-capsule experiment illustrates how data sharing is critical to reducing uncertainty in variant classification, not only for the specific variant that was shared, but also for novel variants of the gene.
Without any prior evidence, the aiVCE begins with a default threshold for the PM2 rule of 1%, and as more evidence is gathered, the threshold is further refined. Even evidence from only a few variants can impact thresholds, as evidenced by the frequency-related rule thresholds for the novel P variants NM_012144.3:c.389-1G>C and NM_199292.2:c.457C>T (p.Arg153*), which at the time of the time-capsule experiment were not known to be P, and as such were not considered as part of the model. Similarly, the increase in concordance from 85.74% of variants with ≥1 star (Full dataset) to 99.51% of variants with ≥2 stars (Increased-Certainty dataset) can be attributed to the greater certainty variants occurring in genes for which more evidence existed, yielding more accurate classification. The increased concordance can also be attributed to, among other features of the aiVCE tool, improved application of the PVS1 rule, which requires the aiVCE to correctly determine whether LOF is a known mechanism of disease if P null variants exist in the gene, as well as whether a skipped/truncated region is critical to gene function.
As the aiVCE is data- and AI-driven at the gene level, the increased availability of evidence for a specific gene improves the aiVCE's accuracy. Conversely, the aiVCE's utility for unknown genes is constrained for application of several rules, as well as for determining more global thresholds. Such limitations also exist for clinical laboratories evaluating genes with less available clinical information and, thus, are not unique to the aiVCE.

Conclusions
Our aiVCE, even without input from clinical databases specific to the 'Test' set, could
a Splice region defined as the region within ± 3-10 bases of the exon-intron boundary, i.e., does not include splice acceptor/donor regions.
B, benign; P, pathogenic; UTR, untranslated region; VUS, variant of uncertain significance
Figure 1 Building the training data for the aiVCE. In-house curated data were employed to
Additional file 1.xlsx