Prediction performance comparison of gene-specific and disease-specific machine learning
The ratio of pathogenic to benign BRCA1/2 variants in the test variant set of our study is imbalanced because it reflects the actual class distribution (see Methods). Therefore, we used the AUPRC to evaluate the performance of pathogenicity predictors. The AUPRC is more informative for imbalanced classification datasets than other measures, such as accuracy and the area under the receiver operating characteristic curve (AUROC)44–46. Eight machine learning methods—ridge, lasso, elastic net, RFs, XGBoost, Linear- and RBF-SVMs, and DNNs—were employed for the performance comparison. We also compared four popular genome-wide pathogenicity predictors: REVEL15, BayesDel14 with and without maximum allele frequency (MaxAF), and ClinPred13. In a recent study, REVEL and BayesDel performed better than other in-silico predictors47; ClinPred is a more recently developed tool trained on ClinVar variants.
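As a concrete illustration of why the AUPRC is preferred here, the following sketch (using scikit-learn, with synthetic labels and scores that are not from our study) shows how the two metrics can diverge on an imbalanced dataset:

```python
# Illustrative only: synthetic imbalanced data, not the study's variants.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical imbalanced labels: ~10% "pathogenic" (1), ~90% "benign" (0).
y_true = (rng.random(1000) < 0.1).astype(int)
# A weakly informative score: positives get a modest +0.3 shift over noise.
y_score = 0.3 * y_true + rng.random(1000)

auprc = average_precision_score(y_true, y_score)  # sensitive to class imbalance
auroc = roc_auc_score(y_true, y_score)            # can look optimistic
print(f"AUPRC={auprc:.3f}, AUROC={auroc:.3f}")
```

With roughly 10% positives, the AUROC can appear fairly high while the AUPRC remains much closer to the positive-class prevalence, which is why the AUPRC is the more conservative summary for this kind of class distribution.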
Figure 2 compares the performance of each method on BRCA1. We did not observe a remarkable difference in prediction performance between gene-specific and disease-specific machine learning. Disease-specific learning outperformed gene-specific learning for the lasso, XGBoost, and Linear- and RBF-SVMs; for the other four methods, gene-specific learning was better. However, the difference between gene-specific and disease-specific learning was statistically significant (paired t-test P < 0.05) for only one machine learning model, RFs (see Supplementary Table S5). This result is noteworthy because the disease-specific training variant set was more than seven (= 982/139; see Methods) times larger than the gene-specific one and included all the gene-specific variants. In other words, variants from disease-associated genes other than BRCA1 generally did not improve the pathogenicity prediction performance for BRCA1. Instead, the choice of machine learning model influenced prediction performance more than the type of training variants. For BRCA1, the gene-specific RF achieved the highest AUPRC (0.9835 ± 0.0156). XGBoost trained on the disease-specific and the gene-specific variant sets ranked second (AUPRC 0.9783 ± 0.0187) and third (AUPRC 0.9727 ± 0.0176), respectively, showing performance comparable to the best method (paired t-test P = 0.1062 and 0.0801, respectively). All the other models were statistically significantly worse than the gene-specific RF (see Supplementary Table S6). The popular pathogenicity predictors trained on all genes, i.e., REVEL, BayesDel with and without MaxAF, and ClinPred, performed worse than the gene- and disease-specific machine learning approaches, except that the Linear-SVM model performed worse than ClinPred and BayesDel with MaxAF.
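The paired comparisons reported above can be reproduced in outline as follows. This is a hedged sketch: we assume each model yields one AUPRC per trial across the ten trials, and the per-trial values below are made up for illustration, not the study's numbers:

```python
# Illustrative paired t-test on per-trial AUPRCs of two models.
# The AUPRC values are hypothetical, not the study's actual results.
from scipy.stats import ttest_rel

auprc_model_a = [0.981, 0.990, 0.962, 0.975, 0.995, 0.988, 0.970, 0.992, 0.985, 0.997]
auprc_model_b = [0.975, 0.984, 0.960, 0.968, 0.990, 0.981, 0.965, 0.988, 0.979, 0.993]

# Paired (not independent) test: trials share the same train/test splits.
t_stat, p_value = ttest_rel(auprc_model_a, auprc_model_b)
print(f"t = {t_stat:.3f}, P = {p_value:.4f}")
```

A paired test is appropriate here because the two models are evaluated on the same splits in each trial, so per-trial differences, not pooled distributions, carry the signal.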
We show the comparison results for BRCA2 in Fig. 3. Unlike the case of BRCA1, disease-specific learning generally performed better than gene-specific learning for BRCA2: except for XGBoost and RFs, disease-specifically trained models achieved higher AUPRC values than gene-specific ones. Moreover, the performance difference was statistically significant (paired t-test P < 0.05) for all machine learning models except RFs and RBF-SVMs (see Supplementary Table S7). Nevertheless, the gene-specific RF achieved the best AUPRC (0.9467 ± 0.0483). Four other methods obtained performance comparable (paired t-test P = 0.1436, 0.2504, 0.1575, and 0.1035, respectively) to this: disease-specific RFs (AUPRC 0.9398 ± 0.0515), disease-specific DNNs (AUPRC 0.9361 ± 0.0326), disease-specific Linear-SVMs (AUPRC 0.9209 ± 0.0676), and gene-specific XGBoost (AUPRC 0.9167 ± 0.0581). All the other methods were statistically significantly worse than the gene-specific RF model (Supplementary Table S8). This result suggests that gene-specific learning is sufficient to obtain the optimal pathogenicity predictor for BRCA2 if an appropriate machine learning algorithm is used. The popular pathogenicity predictors did not attain comparably high performance, although, unlike the case of BRCA1, ClinPred and BayesDel with MaxAF showed higher AUPRC values than many gene-specific and disease-specific machine learning approaches (see Fig. 3).
We also compared the variance over ten trials between gene-specific and disease-specific machine learning. Because the gene-specific training variant set is much smaller than the disease-specific one (see Methods), the variance of gene-specific models is expected to be larger than that of disease-specific models. We show the comparison results for BRCA1 and BRCA2 in Supplementary Tables S5 and S7. For BRCA1, gene-specific learning showed statistically significantly higher variances (Pitman-Morgan test P < 0.05) than disease-specific learning for four of the eight machine learning models: the lasso, elastic net, and Linear- and RBF-SVMs. The difference in variance was not statistically significant for the other four models, including RFs, which achieved the highest AUPRC. We observed a different result for BRCA2: the difference in variance was statistically significant (Pitman-Morgan test P < 0.05) for all but one machine learning method, meaning that BRCA2-specific training datasets were likely to produce more inconsistent results than the much larger disease-specific training datasets. Interestingly, the variance of RFs, the best-performing predictor on BRCA2, was not statistically significantly different between gene-specific and disease-specific learning (Pitman-Morgan test P = 0.6321). These results suggest that if a suitable machine learning method is adopted, gene-specific variants are sufficient to obtain the optimal pathogenicity predictor for rare BRCA1 and BRCA2 missense variants.
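Because the Pitman-Morgan test is less widely known and SciPy does not ship it, a minimal implementation could look like the sketch below. This is our own illustration, not the study's code, and the paired data are synthetic; the test exploits the fact that for paired samples, Var(x) = Var(y) if and only if the sums x + y and differences x − y are uncorrelated:

```python
# Hedged sketch of the Pitman-Morgan test for equality of variances of
# paired samples (e.g., ten paired AUPRC trials). Not a library routine.
import numpy as np
from scipy.stats import t as t_dist

def pitman_morgan(x, y):
    """Two-sided Pitman-Morgan test: H0 is Var(x) == Var(y) for paired x, y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s, d = x + y, x - y                       # sums and differences
    r = np.corrcoef(s, d)[0, 1]               # corr(s, d) != 0 implies unequal variances
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = 2 * t_dist.sf(abs(t_stat), df=n - 2)  # two-sided P from t with n-2 df
    return t_stat, p

# Illustrative paired trials: x has a visibly larger spread than y.
rng = np.random.default_rng(1)
x = rng.normal(0.95, 0.05, size=10)
y = rng.normal(0.95, 0.01, size=10)
print(pitman_morgan(x, y))
```

With only ten trials (eight degrees of freedom), the test has limited power, which is worth keeping in mind when interpreting the non-significant variance differences reported above.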
Comparison of important features identified by gene-specific and disease-specific machine learning
As demonstrated in the preceding subsection, the choice of machine learning algorithm plays a more significant role than the type of training variants in achieving optimal pathogenicity predictions for rare BRCA1/2 missense variants. Specifically, among the three predictors exhibiting optimal performance for BRCA1, two were trained on gene-specific variants and one on disease-specific variants. For BRCA2, two of the top five performing predictors employed gene-specific training variants (see Figs. 2 and 3). Notably, the top-performing model group for BRCA1 comprised both gene-specific and disease-specific XGBoost models, and for BRCA2, the RF algorithm demonstrated the best performance regardless of the type of training variants. We therefore compared the important features identified by top-performing models obtained using the same machine learning algorithm but different types of training variants.
Figure 4 shows the top ten important features identified by gene-specific and disease-specific XGBoost learning for BRCA1. The XGBoost feature importance values of all features across the ten trials are shown in Supplementary Tables S9 and S10 for gene-specific and disease-specific learning, respectively. In the gene-specific and disease-specific XGBoost models for BRCA1, the top ten features accounted for 93.3% and 88.9% of the total feature importance, respectively. In addition, the variance across the ten trials was much higher for gene-specific learning than for disease-specific learning, possibly due to the smaller size of the gene-specific training dataset (see Methods). The most important feature learned from BRCA1-specific training variants was dbNSFP_phyloP100way_vertebrate (a site conservation score; feature importance 35.03 ± 19.24%), followed by dbNSFP_SIFT4G_score (a predicted functional-impact score; feature importance 20.69 ± 9.66%) and gnomAD2_AF (a MAF; feature importance 13.64 ± 6.80%). In comparison, the most important feature learned from disease-specific variants was gnomAD2_AF_male (a MAF; feature importance 29.46 ± 2.18%), followed by gnomAD2_AF (a MAF; feature importance 23.80 ± 3.34%) and dbNSFP_LRT_score (a predicted functional-impact score; feature importance 8.41 ± 1.09%). Thus, the two most important features in gene-specific learning for BRCA1 were site conservation and predicted functional-impact scores, whereas the first and second most important features in the disease-specific XGBoost models for BRCA1 were MAF features, i.e., gnomAD2_AF_male and gnomAD2_AF.
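The per-trial mean ± standard deviation importances reported above can be aggregated along the following lines. This is an assumed reconstruction, not the authors' code: the feature names are real, but the importance values are illustrative, and we assume gain-based importances such as those returned by XGBoost's `Booster.get_score(importance_type="gain")`:

```python
# Illustrative aggregation of per-trial feature importances into
# mean ± SD values (only two hypothetical trials shown for brevity).
import numpy as np

# trials[i] maps feature name -> (already normalized) importance in trial i.
trials = [
    {"dbNSFP_phyloP100way_vertebrate": 0.38, "dbNSFP_SIFT4G_score": 0.22, "gnomAD2_AF": 0.12},
    {"dbNSFP_phyloP100way_vertebrate": 0.31, "dbNSFP_SIFT4G_score": 0.19, "gnomAD2_AF": 0.15},
]
features = sorted({f for t in trials for f in t})
# Features absent from a trial's trees get importance 0 in that trial.
mean_imp = {f: float(np.mean([t.get(f, 0.0) for t in trials])) for f in features}
std_imp = {f: float(np.std([t.get(f, 0.0) for t in trials], ddof=1)) for f in features}
for f in sorted(features, key=mean_imp.get, reverse=True):
    print(f"{f}: {100 * mean_imp[f]:.2f} ± {100 * std_imp[f]:.2f}%")
```

Averaging over trials, rather than reporting a single fit, is what makes the large gene-specific standard deviations (e.g., 35.03 ± 19.24%) visible in the first place.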
We observed similar trends when comparing the top ten important feature groups, which shared six features between the BRCA1-specific and disease-specific XGBoost models. The differences between the two groups were as follows (see Fig. 4). MAF features ranked higher in disease-specific learning (first, second, and fourth) than in BRCA1-specific learning (third, fifth, and seventh). It seems that the BRCA1-specific training variants were insufficient to learn a reliable MAF pattern for discriminating between pathogenic and benign variants, compared to the disease-specific training variants. Two genomic position features, EXON and cDNA_position, were among the top ten features in gene-specific learning, but their feature importance values in disease-specific learning for BRCA1 were much lower (ranked 27th and 24th, respectively; see Supplementary Table S10). The position features exhibit relatively high importance in gene-specific learning, likely because positional information is only meaningful within a specific gene and not across a group of genes, even if those genes are linked to the same or similar diseases.
Figure 5 shows the twenty most important features learned by gene-specific and disease-specific RF learning for BRCA2. We provide all features' normalized variable importance values in Supplementary Tables S11 and S12 for BRCA2-specific and disease-specific learning, respectively. RF variable importance values were normalized so that their sum over all features equals 100%. The top twenty features accounted for 76.9% and 82.4% of the total variable importance for gene-specific and disease-specific RF learning for BRCA2, respectively. Similar to the result for BRCA1, the variance across the ten trials was generally higher for gene-specific RF learning than for disease-specific RF learning, possibly due to the smaller training dataset of gene-specific learning. The three most important features for disease-specific RF learning for BRCA2 were gnomAD2_AF (a MAF; normalized variable importance 7.98 ± 0.95%), gnomAD2_AF_male (a MAF; 7.77 ± 0.58%), and gnomAD2_AF_female (a MAF; 6.89 ± 0.66%). Among these three MAF features, only gnomAD2_AF was among the top three features for BRCA2-specific RF learning (ranked third; normalized variable importance 4.78 ± 0.90%). The first and second most important features for BRCA2-specific RF learning were dbNSFP_phyloP100way_vertebrate (a site conservation score; normalized variable importance 5.38 ± 0.50%) and dbNSFP_LRT_score (a predicted functional-impact score; normalized variable importance 4.93 ± 0.93%), respectively. We note that dbNSFP_phyloP100way_vertebrate was also the most important feature learned by BRCA1-specific XGBoost training (see Fig. 4(a)). This suggests that the site conservation score is a critical gene-specific source of information for discriminating between pathogenic and benign variants.
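The normalization described above (RF variable importances rescaled so they sum to 100%) can be sketched as follows, using scikit-learn's RandomForestClassifier on synthetic data rather than the study's variant features:

```python
# Illustrative RF fit and importance normalization on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a variant feature matrix (not the study's data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rescale so importances sum to 100% across all features, as in Fig. 5.
norm_imp = 100.0 * rf.feature_importances_ / rf.feature_importances_.sum()
print(np.round(norm_imp, 2))
```

Normalizing to a common 100% scale is what makes the gene-specific and disease-specific importance profiles directly comparable despite being fit on differently sized training sets.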
The comparison of the top twenty feature groups between gene-specific and disease-specific RF learning for BRCA2 is as follows. Fourteen features were common to the top twenty gene-specific and disease-specific feature groups. However, each feature's rank differed, meaning that different optimal RF models were learned from BRCA2-specific and disease-specific variants. MAF features were more influential in disease-specific learning (ranked first (gnomAD2_AF), second (gnomAD2_AF_male), third (gnomAD2_AF_female), 11th (gnomAD2_AF_nfe), 15th (gnomAD2_AF_oth), and 17th (gnomAD2_AF_amr)) than in BRCA2-specific learning, where the same MAF features ranked lower: third, sixth, seventh, 19th, 22nd, and 29th, respectively (see Supplementary Table S11). This result parallels that of XGBoost learning for BRCA1 (see Fig. 4). Another parallel is that position features were more important in BRCA2-specific learning than in disease-specific learning. In the BRCA2-specific RF models, the four position features, i.e., EXON, cDNA_position, CDS_position, and protein_position, ranked 16th, 17th, 14th, and 15th, respectively. In contrast, the same features ranked 26th, 27th, 29th, and 28th in the disease-specific RF models for BRCA2 (see Supplementary Table S12).
To summarize, we observed common properties in the important features identified by gene-specific and disease-specific learning for the pathogenicity prediction of rare BRCA1/2 missense variants. First, MAF features were more critical in disease-specific learning than in gene-specific learning. This indicates that MAF is a major factor discriminating pathogenic from benign variants, with similar patterns across genes, at least when the genes are associated with the same disease; however, gene-specific training variants seem insufficient to capture this discriminating pattern reliably. Instead of MAF features, predicted functional-impact and site conservation scores can serve as the key elements for distinguishing between pathogenic and benign variants, as shown by the optimal performance of gene-specific learning. Additionally, the position of a variant can play an essential role only in gene-specific learning, because the meaning of a position differs between genes.