Prediction performance comparison of gene-specific and disease-specific machine learning
The ratio of pathogenic to benign BRCA1/2 variants in the test variant set of our study is imbalanced because it reflects the actual class distribution (see Methods). Therefore, we used the AUPRC to evaluate the performance of pathogenicity predictors. The AUPRC is more informative for imbalanced classification datasets than other measures, such as accuracy and the area under the receiver operating characteristic curve (AUROC)44–46. Eight machine learning methods—ridge, lasso, elastic net, RFs, XGBoost, Linear- and RBF-SVMs, and DNNs—were employed for the performance comparison. We also compared four popular genome-wide pathogenicity predictors: REVEL15, BayesDel14 with and without maximum allele frequency (MaxAF), and ClinPred13. In a recent study, REVEL and BayesDel performed better than other in-silico predictors47; ClinPred is a more recently developed tool trained on ClinVar variants.
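As a concrete illustration of why the AUPRC is preferred here, the following sketch (using scikit-learn, with synthetic labels and scores that are not from our study) shows how the two metrics can diverge on an imbalanced dataset:

```python
# Illustrative only: synthetic imbalanced data, not the study's variants.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical imbalanced labels: ~10% "pathogenic" (1), ~90% "benign" (0).
y_true = (rng.random(1000) < 0.1).astype(int)
# A weakly informative score: positives get a modest +0.3 shift over noise.
y_score = 0.3 * y_true + rng.random(1000)

auprc = average_precision_score(y_true, y_score)  # sensitive to class imbalance
auroc = roc_auc_score(y_true, y_score)            # can look optimistic
print(f"AUPRC={auprc:.3f}, AUROC={auroc:.3f}")
```

With roughly 10% positives, the AUROC can appear fairly high while the AUPRC remains much closer to the positive-class prevalence, which is why the AUPRC is the more conservative summary for this kind of class distribution.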
Figure 2 compares the performance of each method on BRCA1. We did not observe a remarkable difference in prediction performance between gene-specific and disease-specific machine learning. Disease-specific learning outperformed gene-specific learning for the lasso, XGBoost, and Linear- and RBF-SVMs; for the other four methods, gene-specific learning was better. However, the difference between gene-specific and disease-specific learning was statistically significant (paired t-test P < 0.05) for only one machine learning model, RFs (see Supplementary Table S5). This result is noteworthy because the disease-specific training variant set was more than seven (= 982/139; see Methods) times larger than the gene-specific one and included all the gene-specific variants. In other words, variants from disease-associated genes other than BRCA1 generally did not improve the pathogenicity prediction performance for BRCA1. Instead, the choice of machine learning model influenced prediction performance more than the type of training variants. For BRCA1, the gene-specific RF achieved the highest AUPRC (0.9835 ± 0.0156). XGBoost trained on the disease-specific and the gene-specific variant sets ranked second (AUPRC 0.9783 ± 0.0187) and third (AUPRC 0.9727 ± 0.0176), respectively, showing performance comparable to the best method (paired t-test P = 0.1062 and 0.0801, respectively). All the other models were statistically significantly worse than the gene-specific RF (see Supplementary Table S6). The popular pathogenicity predictors trained on all genes, i.e., REVEL, BayesDel with and without MaxAF, and ClinPred, performed worse than the gene- and disease-specific machine learning approaches, except that the Linear-SVM model performed worse than ClinPred and BayesDel with MaxAF.
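The paired comparisons reported above can be reproduced in outline as follows. This is a hedged sketch: we assume each model yields one AUPRC per trial across the ten trials, and the per-trial values below are made up for illustration, not the study's numbers:

```python
# Illustrative paired t-test on per-trial AUPRCs of two models.
# The AUPRC values are hypothetical, not the study's actual results.
from scipy.stats import ttest_rel

auprc_model_a = [0.981, 0.990, 0.962, 0.975, 0.995, 0.988, 0.970, 0.992, 0.985, 0.997]
auprc_model_b = [0.975, 0.984, 0.960, 0.968, 0.990, 0.981, 0.965, 0.988, 0.979, 0.993]

# Paired (not independent) test: trials share the same train/test splits.
t_stat, p_value = ttest_rel(auprc_model_a, auprc_model_b)
print(f"t = {t_stat:.3f}, P = {p_value:.4f}")
```

A paired test is appropriate here because the two models are evaluated on the same splits in each trial, so per-trial differences, not pooled distributions, carry the signal.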
We show the comparison results for BRCA2 in Fig. 3. Unlike the case of BRCA1, disease-specific learning generally performed better than gene-specific learning for BRCA2: except for XGBoost and RFs, disease-specifically trained models achieved higher AUPRC values than gene-specific ones. Moreover, the performance difference was statistically significant (paired t-test P < 0.05) for all machine learning models except RFs and RBF-SVMs (see Supplementary Table S7). Nevertheless, the gene-specific RF achieved the best AUPRC (0.9467 ± 0.0483). Four other methods obtained performance comparable (paired t-test P = 0.1436, 0.2504, 0.1575, and 0.1035, respectively) to this: disease-specific RFs (AUPRC 0.9398 ± 0.0515), disease-specific DNNs (AUPRC 0.9361 ± 0.0326), disease-specific Linear-SVMs (AUPRC 0.9209 ± 0.0676), and gene-specific XGBoost (AUPRC 0.9167 ± 0.0581). All the other methods were statistically significantly worse than the gene-specific RF model (Supplementary Table S8). This result suggests that gene-specific learning is sufficient to obtain the optimal pathogenicity predictor for BRCA2 if an appropriate machine learning algorithm is used. The popular pathogenicity predictors did not attain comparably high performance, although, unlike the case of BRCA1, ClinPred and BayesDel with MaxAF showed higher AUPRC values than many gene-specific and disease-specific machine learning approaches (see Fig. 3).
We also compared the variance over ten trials between gene-specific and disease-specific machine learning. Because the gene-specific training variant set is much smaller than the disease-specific one (see Methods), the variance of gene-specific models is expected to be larger than that of disease-specific models. We show the comparison results for BRCA1 and BRCA2 in Supplementary Tables S5 and S7. For BRCA1, gene-specific learning showed statistically significantly higher variances (Pitman-Morgan test P < 0.05) than disease-specific learning for four of the eight machine learning models: the lasso, elastic net, and Linear- and RBF-SVMs. The difference in variance was not statistically significant for the other four models, including RFs, which achieved the highest AUPRC. We observed a different result for BRCA2: the difference in variance was statistically significant (Pitman-Morgan test P < 0.05) for all but one machine learning method, meaning that BRCA2-specific training datasets were likely to produce more inconsistent results than the much larger disease-specific training datasets. Interestingly, the variance of RFs, the best-performing predictor on BRCA2, was not statistically significantly different between gene-specific and disease-specific learning (Pitman-Morgan test P = 0.6321). These results suggest that if a suitable machine learning method is adopted, gene-specific variants are sufficient to obtain the optimal pathogenicity predictor for rare BRCA1 and BRCA2 missense variants.
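Because the Pitman-Morgan test is less widely known and SciPy does not ship it, a minimal implementation could look like the sketch below. This is our own illustration, not the study's code, and the paired data are synthetic; the test exploits the fact that for paired samples, Var(x) = Var(y) if and only if the sums x + y and differences x − y are uncorrelated:

```python
# Hedged sketch of the Pitman-Morgan test for equality of variances of
# paired samples (e.g., ten paired AUPRC trials). Not a library routine.
import numpy as np
from scipy.stats import t as t_dist

def pitman_morgan(x, y):
    """Two-sided Pitman-Morgan test: H0 is Var(x) == Var(y) for paired x, y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s, d = x + y, x - y                       # sums and differences
    r = np.corrcoef(s, d)[0, 1]               # corr(s, d) != 0 implies unequal variances
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = 2 * t_dist.sf(abs(t_stat), df=n - 2)  # two-sided P from t with n-2 df
    return t_stat, p

# Illustrative paired trials: x has a visibly larger spread than y.
rng = np.random.default_rng(1)
x = rng.normal(0.95, 0.05, size=10)
y = rng.normal(0.95, 0.01, size=10)
print(pitman_morgan(x, y))
```

With only ten trials (eight degrees of freedom), the test has limited power, which is worth keeping in mind when interpreting the non-significant variance differences reported above.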
Comparison of important features identified by gene-specific and disease-specific machine learning
As demonstrated in the preceding subsection, the choice of machine learning algorithm plays a more significant role than the type of training variants in achieving optimal pathogenicity predictions for rare BRCA1/2 missense variants. Specifically, among the three predictors exhibiting optimal performance for BRCA1, two were trained on gene-specific variants and one on disease-specific variants. For BRCA2, two of the top five performing predictors employed gene-specific training variants (see Figs. 2 and 3). Notably, the top-performing model group for BRCA1 comprised both gene-specific and disease-specific XGBoost models, and for BRCA2, the RF algorithm demonstrated the best performance regardless of the type of training variants. We therefore compared the important features identified by top-performing models obtained using the same machine learning algorithm but different types of training variants.
Figure 4 shows the top ten important features identified by gene-specific and disease-specific XGBoost learning for BRCA1. The XGBoost feature importance values of all features across the ten trials are shown in Supplementary Tables S9 and S10 for gene-specific and disease-specific learning, respectively. In the gene-specific and disease-specific XGBoost models for BRCA1, the top ten features accounted for 93.3% and 88.9% of the total feature importance, respectively. In addition, the variance across the ten trials was much higher for gene-specific learning than for disease-specific learning, possibly due to the smaller size of the gene-specific training dataset (see Methods). The most important feature learned from BRCA1-specific training variants was dbNSFP_phyloP100way_vertebrate (a site conservation score; feature importance 35.03 ± 19.24%), followed by dbNSFP_SIFT4G_score (a predicted functional-impact score; feature importance 20.69 ± 9.66%) and gnomAD2_AF (a MAF; feature importance 13.64 ± 6.80%). In comparison, the most important feature learned from disease-specific variants was gnomAD2_AF_male (a MAF; feature importance 29.46 ± 2.18%), followed by gnomAD2_AF (a MAF; feature importance 23.80 ± 3.34%) and dbNSFP_LRT_score (a predicted functional-impact score; feature importance 8.41 ± 1.09%). Thus, the two most important features in gene-specific learning for BRCA1 were site conservation and predicted functional-impact scores, whereas the first and second most important features in the disease-specific XGBoost models for BRCA1 were MAF features, i.e., gnomAD2_AF_male and gnomAD2_AF.
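The per-trial mean ± standard deviation importances reported above can be aggregated along the following lines. This is an assumed reconstruction, not the authors' code: the feature names are real, but the importance values are illustrative, and we assume gain-based importances such as those returned by XGBoost's `Booster.get_score(importance_type="gain")`:

```python
# Illustrative aggregation of per-trial feature importances into
# mean ± SD values (only two hypothetical trials shown for brevity).
import numpy as np

# trials[i] maps feature name -> (already normalized) importance in trial i.
trials = [
    {"dbNSFP_phyloP100way_vertebrate": 0.38, "dbNSFP_SIFT4G_score": 0.22, "gnomAD2_AF": 0.12},
    {"dbNSFP_phyloP100way_vertebrate": 0.31, "dbNSFP_SIFT4G_score": 0.19, "gnomAD2_AF": 0.15},
]
features = sorted({f for t in trials for f in t})
# Features absent from a trial's trees get importance 0 in that trial.
mean_imp = {f: float(np.mean([t.get(f, 0.0) for t in trials])) for f in features}
std_imp = {f: float(np.std([t.get(f, 0.0) for t in trials], ddof=1)) for f in features}
for f in sorted(features, key=mean_imp.get, reverse=True):
    print(f"{f}: {100 * mean_imp[f]:.2f} ± {100 * std_imp[f]:.2f}%")
```

Averaging over trials, rather than reporting a single fit, is what makes the large gene-specific standard deviations (e.g., 35.03 ± 19.24%) visible in the first place.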
We observed similar trends when comparing the top ten important feature groups, which shared six features between the BRCA1-specific and disease-specific XGBoost models. The differences between the two groups were as follows (see Fig. 4). MAF features ranked higher in disease-specific learning (first, second, and fourth) than in BRCA1-specific learning (third, fifth, and seventh). It seems that the BRCA1-specific training variants were insufficient to learn a reliable MAF pattern for discriminating between pathogenic and benign variants, compared to the disease-specific training variants. Two genomic position features, EXON and cDNA_position, were among the top ten features in gene-specific learning, but their feature importance values in disease-specific learning for BRCA1 were much lower (ranked 27th and 24th, respectively; see Supplementary Table S10). The position features exhibit relatively high importance in gene-specific learning, likely because positional information is only meaningful within a specific gene and not across a group of genes, even if those genes are linked to the same or similar diseases.
Figure 5 shows the twenty most important features learned by gene-specific and disease-specific RF learning for BRCA2. We provide all features' normalized variable importance values in Supplementary Tables S11 and S12 for BRCA2-specific and disease-specific learning, respectively. RF variable importance values were normalized so that their sum over all features equals 100%. The top twenty features accounted for 76.9% and 82.4% of the total variable importance for gene-specific and disease-specific RF learning for BRCA2, respectively. Similar to the result for BRCA1, the variance across the ten trials was generally higher for gene-specific RF learning than for disease-specific RF learning, possibly due to the smaller training dataset of gene-specific learning. The three most important features for disease-specific RF learning for BRCA2 were gnomAD2_AF (a MAF; normalized variable importance 7.98 ± 0.95%), gnomAD2_AF_male (a MAF; 7.77 ± 0.58%), and gnomAD2_AF_female (a MAF; 6.89 ± 0.66%). Among these three MAF features, only gnomAD2_AF was among the top three features for BRCA2-specific RF learning (ranked third; normalized variable importance 4.78 ± 0.90%). The first and second most important features for BRCA2-specific RF learning were dbNSFP_phyloP100way_vertebrate (a site conservation score; normalized variable importance 5.38 ± 0.50%) and dbNSFP_LRT_score (a predicted functional-impact score; normalized variable importance 4.93 ± 0.93%), respectively. We note that dbNSFP_phyloP100way_vertebrate was also the most important feature learned by BRCA1-specific XGBoost training (see Fig. 4(a)). This suggests that the site conservation score is a critical gene-specific source of information for discriminating between pathogenic and benign variants.
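The normalization described above (RF variable importances rescaled so they sum to 100%) can be sketched as follows, using scikit-learn's RandomForestClassifier on synthetic data rather than the study's variant features:

```python
# Illustrative RF fit and importance normalization on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a variant feature matrix (not the study's data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rescale so importances sum to 100% across all features, as in Fig. 5.
norm_imp = 100.0 * rf.feature_importances_ / rf.feature_importances_.sum()
print(np.round(norm_imp, 2))
```

Normalizing to a common 100% scale is what makes the gene-specific and disease-specific importance profiles directly comparable despite being fit on differently sized training sets.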
The comparison of the top twenty feature groups between gene-specific and disease-specific RF learning for BRCA2 is as follows. Fourteen features were common to the top twenty gene-specific and disease-specific feature groups. However, each feature's rank differed, meaning that different optimal RF models were learned from BRCA2-specific and disease-specific variants. MAF features were more influential in disease-specific learning (ranked first (gnomAD2_AF), second (gnomAD2_AF_male), third (gnomAD2_AF_female), 11th (gnomAD2_AF_nfe), 15th (gnomAD2_AF_oth), and 17th (gnomAD2_AF_amr)) than in BRCA2-specific learning, where the same MAF features ranked lower: third, sixth, seventh, 19th, 22nd, and 29th, respectively (see Supplementary Table S11). This result parallels that of XGBoost learning for BRCA1 (see Fig. 4). Another parallel is that position features were more important in BRCA2-specific learning than in disease-specific learning. In the BRCA2-specific RF models, the four position features, i.e., EXON, cDNA_position, CDS_position, and protein_position, ranked 16th, 17th, 14th, and 15th, respectively. In contrast, the same features ranked 26th, 27th, 29th, and 28th in the disease-specific RF models for BRCA2 (see Supplementary Table S12).
To summarize, we observed common properties in the important features identified by gene-specific and disease-specific learning for the pathogenicity prediction of rare BRCA1/2 missense variants. First, MAF features were more critical in disease-specific learning than in gene-specific learning. This indicates that MAF is a major factor discriminating pathogenic from benign variants, with similar patterns across genes, at least when the genes are associated with the same disease; however, gene-specific training variants seem insufficient to capture this discriminating pattern reliably. Instead of MAF features, predicted functional-impact and site conservation scores can serve as the key elements for distinguishing between pathogenic and benign variants, as shown by the optimal performance of gene-specific learning. Additionally, the position of a variant can play an essential role only in gene-specific learning, because the meaning of a position differs between genes.