ClinVar-identified MCVD genetic variants
ClinVar was used to identify pathogenic, likely pathogenic, likely benign, and benign variants, variants of uncertain significance (VUS), and variants with conflicting interpretations (treated as VUS) in 47 MCVD genes. These 47 genes were selected based on their inclusion on cardiovascular genetic testing panels and on a literature review for established association with the following MCVD: cardiomyopathies, familial hypercholesterolemia, arrhythmias, connective tissue disorders/aortopathies, congenital heart disease, and skeletal myopathies. ClinVar annotations were downloaded on February 2, 2022, and subset to the 47 genes using bedtools v2.30.0 [23].
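The grouping of ClinVar labels described above can be sketched as a simple lookup. This is a minimal illustration, not the study's code: the label strings are assumed to follow ClinVar's clinical-significance wording, and the names `GROUPS` and `group_significance` are hypothetical. The collapse of pathogenic/likely pathogenic and benign/likely benign into two classes anticipates the grouping used for model training.

```python
# Assumed ClinVar clinical-significance strings (lower-cased) mapped to the
# groups used in the study. Conflicting interpretations are treated as VUS.
GROUPS = {
    "pathogenic": "P/LP",
    "likely pathogenic": "P/LP",
    "pathogenic/likely pathogenic": "P/LP",
    "benign": "B/LB",
    "likely benign": "B/LB",
    "benign/likely benign": "B/LB",
    "uncertain significance": "VUS",
    "conflicting interpretations of pathogenicity": "VUS",
}


def group_significance(label):
    """Return 'P/LP', 'B/LB' or 'VUS'; None for labels outside the study's scope."""
    return GROUPS.get(label.strip().lower())
```

Labels that fall outside these categories (for example, drug-response annotations) return `None` and would be dropped before analysis.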
Machine learning model building
Ensembl Variant Effect Predictor (VEP, version 106) and CADD version 1.6 were used for variant annotation. Predictors used in model training included: global gnomAD MAF (scaled by multiplying by 50,000), the absolute value of the Maximum Entropy (MaxEnt) model score (http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html) [24], CADD PHRED score [6], the distance to the nearest ClinVar pathogenic/likely pathogenic (P/LP) variant, and median exon-level cardiac expression in counts per million (from both left ventricle and aorta, as identified from the Genotype-Tissue Expression [GTEx] database, https://gtexportal.org/home/) [25].
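As an illustration of how these predictors could be assembled for one variant, the sketch below applies the stated transformations (MAF multiplied by 50,000, absolute MaxEnt score) and computes the distance to the nearest P/LP variant. The field names, the choice to keep left ventricle and aorta expression as two separate features, and the assumption that all positions lie on the same chromosome are ours; the text does not specify them, nor whether a P/LP variant's own position is excluded from its distance calculation.

```python
from bisect import bisect_left

MAF_SCALE = 50_000  # scaling factor stated in the text


def nearest_plp_distance(pos, plp_positions):
    """Distance in bp from pos to the closest entry of a sorted, non-empty list."""
    i = bisect_left(plp_positions, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = plp_positions[max(i - 1, 0):i + 1]
    return min(abs(pos - p) for p in candidates)


def build_features(variant, plp_positions):
    """Assemble one feature row; all dictionary keys are hypothetical names."""
    return {
        "maf_scaled": variant["gnomad_maf"] * MAF_SCALE,
        "maxent_abs": abs(variant["maxent"]),
        "cadd_phred": variant["cadd_phred"],
        "dist_nearest_plp": nearest_plp_distance(variant["pos"], plp_positions),
        "exon_expr_lv": variant["expr_lv"],
        "exon_expr_aorta": variant["expr_aorta"],
    }
```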
A training set consisting of 25,971 ClinVar-annotated variants from 47 MCVD genes (n = 2,583 pathogenic; n = 3,414 likely pathogenic; n = 4,167 benign; n = 15,807 likely benign) was used to build the model using Random Forest supervised machine learning. Variants were collapsed into two groups: pathogenic and likely pathogenic (P/LP), and benign and likely benign (B/LB). VUS (n = 27,559) and variants with conflicting interpretations (n = 6,223) were together treated as VUS and held out of model building. Five-fold cross validation was performed, with P/LP and B/LB variants distributed evenly across folds. For model tuning, six different numbers of trees (25, 50, 100, 250, 500, 750) were used with six different seeds on each fold, resulting in 30 models trained per number of trees, or 180 models in total. Model training was performed in R version 4.1.2 (2021-11-01) using the randomForest package (v4.7-1) [26]. Performance was assessed by calculating the receiver operating characteristic area under the curve (ROC AUC) for each model, using the ClinVar pathogenicity assignment as truth (pROC package v1.18.0). The top-performing model was chosen as the one with the highest median ROC AUC across all trees for a given seed and was termed the Cardiovascular Disease Pathogenicity Predictor (CVD-PP).
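The tuning design can be sketched by enumerating the grid of tree counts, seeds and folds and by assigning variants to folds so that each class is spread evenly. This is an illustration in Python, not the R code used in the study. It assumes five folds, which is what a total of 180 models implies for six tree counts and six seeds; the seed values are placeholders, and `stratified_folds` is a hypothetical helper.

```python
import random
from itertools import product

N_FOLDS = 5  # assumption: 180 models / (6 tree counts x 6 seeds) = 5 folds
NTREE_GRID = [25, 50, 100, 250, 500, 750]
SEEDS = [1, 2, 3, 4, 5, 6]  # placeholder values; the actual seeds are not given


def stratified_folds(labels, n_folds, seed):
    """Return a fold index per sample, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for rank, i in enumerate(idx):
            folds[i] = rank % n_folds
    return folds


# One model per (number of trees, seed, held-out fold) combination.
grid = list(product(NTREE_GRID, SEEDS, range(N_FOLDS)))
models_per_ntree = len(grid) // len(NTREE_GRID)
```

Under these assumptions `grid` holds 180 combinations, 30 for each number of trees.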
Validation and comparison of the top-performing machine learning model
Variants whose pathogenicity assignment in ClinVar changed between 2014 and 2022 were used to validate CVD-PP performance using precision-recall (PR) AUC (PRROC package v1.3.1) [27], under the assumption that the more recent classification is the more accurate one. These variants were held out from the training set for this validation (total n = 663). Of these, 36 variants changed from P/LP to B/LB, 146 changed from B/LB to P/LP, and 481 changed from VUS to P/LP. Pathogenicity was predicted as a linear outcome using CVD-PP.
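To make the validation metric concrete, the sketch below computes an area under the precision-recall curve by step-wise summation (average precision), with the most recent ClinVar class as truth (1 for P/LP, 0 for B/LB). This is a stand-in for illustration only: the PRROC package uses its own interpolation scheme, so values from this function need not match those reported by PRROC. The function name is hypothetical.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve.

    scores: predicted pathogenicity, higher means more likely pathogenic.
    labels: 1 for P/LP and 0 for B/LB under the most recent classification.
    Tied scores are treated as a single threshold.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every variant sharing the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area
```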
Performance of CVD-PP was also compared, by PR AUC, with available in silico methods for predicting pathogenicity, including CADD [6], SIFT [7], PolyPhen [8], MetaSVM [9], MetaLR [10], Mutation Assessor [11], Mutation Taster [12], PROVEAN [13] and FATHMM [14]. For this, we subset to variants for which annotations from these models were available (n = 443). Within this set of variants, 6 changed from P/LP to B/LB, 81 changed from B/LB to P/LP, and 356 changed from VUS to P/LP. Because many models also report a binary outcome derived from the predictive score, a comparison was also made using a binary outcome (pathogenic vs benign) based on the naïve Bayes optimal cut-off established during model training and performance.
Finally, the predicted pathogenicity of 33,782 VUS within the 47 MCVD genes in ClinVar (n = 27,559 VUS and n = 6,223 with conflicting interpretation) was determined using CVD-PP. CVD-PP MCVD variant classifications are publicly available (https://github.com/meganramaker/CVD-PP) and the full code is available upon request.
CATHeterization GENetics cohort
To further support clinical utility, and as proof of concept of model accuracy, we used the CATHeterization GENetics (CATHGEN) cohort [28] to: (1) compare the burden of model-predicted pathogenic VUS within 19 dilated cardiomyopathy (DCM) genes between patients with clinically diagnosed DCM and patients without DCM; (2) perform ACMG/AMP review of the highest- and lowest-ranked VUS and assess phenotype expression by deep chart review in individuals harboring these VUS in CATHGEN; and (3) assess the potential clinical utility of the model by evaluating reclassification of VUS in CATHGEN patients with a disease known to have high genetic penetrance (hypertrophic cardiomyopathy [HCM]).
CATHGEN includes 9,334 individuals referred to Duke University Hospital for cardiac catheterization between 2001 and 2010 who consented to participate in the study. Whole exome sequencing was performed as previously described [29]. CATHGEN was approved by the Duke University Institutional Review Board (IRB).
Burden of DCM variant pathogenicity in DCM patients
The burden of CVD-PP-predicted pathogenic VUS within DCM genes was compared between patients with and without a clinical diagnosis of DCM. DCM cases were defined as having a left ventricular ejection fraction (LVEF) of 35% or less prior to index cardiac catheterization, the presence of ICD-10 code I42.0 or ICD-9 code 425.4 in the electronic health record (EHR), and no history or diagnosis of coronary artery disease (CAD) or hypertrophic cardiomyopathy (ICD-10 I42.1, I42.2; ICD-9 425.1, 425.11) (n = 257). Ten DCM cases harboring known DCM gene P/LP variants were removed from analysis, resulting in n = 247 DCM cases. Non-DCM controls were defined as having an LVEF greater than 50% prior to index cardiac catheterization, no prior LVEF ≤ 35%, and no prior diagnosis of DCM (ICD-10 I42.0; ICD-9 425.4), HCM (ICD-10 I42.1, I42.2; ICD-9 425.1, 425.11) or chronic heart failure (n = 5,046). Twenty-seven non-DCM controls harboring DCM gene P/LP variants were removed from analysis, resulting in n = 5,019 non-DCM controls. Variants classified as VUS or conflicting by ClinVar, or not previously annotated in ClinVar but present in the CATHGEN cohort, were identified within 19 DCM genes (ACTC1, ACTN2, BAG3, CRYAB, DES, DSC2, DSG2, DSP, FLNC, LMNA, MYBPC3, MYH6, MYH7, SCN5A, TNNC1, TNNI3, TNNT2, TPM1, TTN). These genes were selected because they had the strongest evidence supporting association with DCM and were found on clinical genetic testing panels. The random forest model-predicted pathogenicity score of each observed alternate allele was multiplied by the allele count to obtain a distribution of scores for each gene; the distribution of scores was compared between DCM cases and non-DCM controls using a one-tailed Wilcoxon rank sum test. As further supportive evidence for the model, the burden of VUS in DCM genes was also compared with the burden of VUS in all non-DCM genes using the same method.
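The burden comparison can be illustrated as follows. The sketch weights each variant's predicted score by its allele count and compares two resulting distributions with a one-sided Wilcoxon rank sum test, using the normal approximation with tie correction and no continuity correction. How the study aggregated scores (per variant or per individual) and which implementation of the test it used are not stated, so the per-variant weighting, the field names and the function names here are assumptions.

```python
import math


def weighted_scores(variants):
    """Predicted pathogenicity score multiplied by allele count, per variant."""
    return [v["score"] * v["allele_count"] for v in variants]


def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_greater(x, y):
    """One-sided test of H1: values in x tend to be larger than values in y.

    Returns (U, p), where U is the Mann-Whitney statistic for x.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return u, 1.0  # every value is tied: no evidence of a shift
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return u, 0.5 * math.erfc(z / math.sqrt(2))
```

A call such as `rank_sum_greater(weighted_scores(case_variants), weighted_scores(control_variants))` would then test whether weighted scores are shifted upward in cases; for small samples an exact test would be preferable to the normal approximation.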
ACMG/AMP validation of highest- and lowest-ranked variants in CATHGEN participants
As a second proof of concept of model accuracy, we performed ACMG/AMP review of the 50 highest- and 50 lowest-ranked ClinVar VUS and performed deep clinical chart review to determine expression of disease in CATHGEN individuals harboring one of these VUS. ClinVar VUS or conflicting variants present in CATHGEN were annotated with the CVD-PP predictors. CVD-PP was then used to predict pathogenicity, and variants were ranked by pathogenicity score from 0 (most likely benign) to 1 (most likely pathogenic).
ACMG/AMP review was then performed based on the 2015 updated guidelines [4]. For this curation, a semi-automated workflow was established. Specifically, we obtained variant annotations from publicly available datasets, namely whether a variant leads to loss of function (PVS1), results in the same amino acid change as a previously established pathogenic variant (PS1), is within a mutational hotspot or critical domain (PM1), is absent from large population databases (PM2), is protein length changing (PM4), or results in a different amino acid change at the same position as a previously known pathogenic variant (PM5). Additionally, in silico annotations for the variant (PP3) were obtained. Evidence for criteria that could not be automated was acquired from literature review and ClinVar variant reports from reputable sources. These criteria included whether a variant has well-established in vitro and in vivo functional studies (PS3) or case-control studies (PS4), whether there is cosegregation with disease in multiple family members with the variant (PP1), and whether a reputable source recently reported the variant as pathogenic (PP5). Criteria not used in our ACMG/AMP review (because they are patient- or family-specific) included whether the variant was de novo in a patient without a family history (PS2), was in trans with a pathogenic variant in recessive disorders (PM3), was assumed to be de novo without confirmation of maternity and paternity (PM6), was a missense variant in a gene with a low rate of benign variation (PP2), or occurred with a phenotype or family history highly specific for disease (PP4).
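Once criteria have been assigned, they are combined into a classification. The sketch below encodes our reading of the combining rules for pathogenic evidence in the 2015 ACMG/AMP guidelines and is offered only as an illustration; it is not the workflow used in this study. Benign criteria (BA1, BS, BP) are not handled, so anything short of likely pathogenic is reported as uncertain. The function name is hypothetical.

```python
def classify_pathogenic_evidence(criteria):
    """Combine met pathogenic criteria, e.g. {"PVS1", "PM2"}, into a class.

    Counts evidence by strength prefix: PVS (very strong), PS (strong),
    PM (moderate), PP (supporting). Benign evidence is not considered.
    """
    pvs = sum(c.startswith("PVS") for c in criteria)
    ps = sum(c.startswith("PS") for c in criteria)
    pm = sum(c.startswith("PM") for c in criteria)
    pp = sum(c.startswith("PP") for c in criteria)

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    if pathogenic:
        return "Pathogenic"

    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    if likely_pathogenic:
        return "Likely pathogenic"
    return "Uncertain significance"
```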
VCFtools (v0.1.17) was used to filter for these variants in CATHGEN, requiring a quality greater than 20 (QUAL > 20) and a depth greater than 10 (DP > 10). In total, 2 participants harbored a predicted benign, ACMG/AMP pathogenic variant; 533 individuals harbored a predicted benign, ACMG/AMP-confirmed benign variant; 64 individuals harbored a predicted pathogenic, ACMG/AMP benign variant; and 43 individuals harbored a predicted pathogenic, ACMG/AMP-confirmed pathogenic variant. For feasibility, we randomly sampled n = 45 individuals harboring ACMG/AMP benign variants for downstream analysis. Of the resulting 90 individuals (45 harboring ACMG/AMP pathogenic variants and 45 sampled individuals harboring ACMG/AMP benign variants), we retained those harboring variants that cause autosomal dominant forms of disease (n = 81). Chart review to identify expression of disease was then conducted in those with a substantial EHR, including the cardiac imaging data necessary for diagnosis (n = 67).
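The quality and depth thresholds can be illustrated with a small filter over VCF data lines. This sketch is not the VCFtools invocation used in the study, which is not given. It assumes that depth is read from a site-level `DP` entry in the INFO column, whereas the study may have applied the depth threshold per genotype; the function name is hypothetical.

```python
def passes_filters(vcf_line, min_qual=20.0, min_depth=10):
    """True if a tab-separated VCF data line has QUAL > min_qual and DP > min_depth.

    Assumes the standard column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER,
    INFO) and a site-level DP in INFO. Lines lacking either value fail.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual, info = fields[5], fields[7]
    if qual == ".":
        return False
    depth = None
    for entry in info.split(";"):
        if entry.startswith("DP="):
            depth = int(entry[3:])
    return depth is not None and float(qual) > min_qual and depth > min_depth
```

Both comparisons are strict, matching the "greater than" wording of the thresholds.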
Disease expression in CATHGEN HCM patients harboring predicted pathogenic variants
Lastly, we assessed the potential clinical utility of the CVD-PP model by evaluating reclassification of VUS in CATHGEN patients with HCM. Individuals with HCM were initially identified in CATHGEN using ICD-9 codes (425.1 and 425.11) and ICD-10 codes (I42.1) for HCM. The EHR of individuals with these codes were then manually reviewed by experts to ensure that patients had a definitive diagnosis of HCM. Sixty-seven individuals with a clinical diagnosis of HCM did not harbor a known P/LP HCM genetic variant. Eight of these sixty-seven individuals harbored a ClinVar VUS, or a variant not previously annotated in ClinVar, that was predicted to be pathogenic by CVD-PP.