Among the cancer-specific methods, TransFIC (applied to the PolyPhen-2 predictions, which had the fewest missing values) ranked second for AUC but fifth for AUPR. CHASM achieved an AUC of 0.61, a sensitivity of 0.74, and a specificity of 0.15. Similarly, CanDrA achieved an AUC of only 0.51, a sensitivity of 0.76, and a specificity of 0.07. Thus, both CHASM and CanDrA performed poorly on the negative samples, and this severe imbalance between sensitivity and specificity yielded extremely low AUC values (as discussed below).
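To make the relationship between these metrics concrete, the following minimal sketch computes sensitivity, specificity, and a rank-based AUC from binary labels and predictor scores. The function names, the 0.5 decision threshold, and the toy data are assumptions made for illustration; none of them come from the benchmark itself.

```python
# Illustrative sketch only: labels and scores are made-up toy data,
# not values from the benchmark described in the text.

def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Return (sensitivity, specificity) after thresholding the scores.

    Assumes y_true contains at least one positive (1) and one negative (0).
    The default threshold of 0.5 is an arbitrary choice for this sketch.
    """
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: most negatives receive high scores, so specificity is low
# even though sensitivity is high.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.85, 0.75, 0.6, 0.2]
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores)
toy_auc = auc(toy_labels, toy_scores)
```

In this toy case the predictor recovers most positives but also scores most negatives highly, which qualitatively mirrors the pattern of high sensitivity, low specificity, and near-chance AUC reported for CHASM and CanDrA.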
The CHASM training set consisted of a balanced set of positive and negative samples; however, only 0.6% overlapped at the transcript level [3]. We therefore speculated that CHASM may be confounded by type 2 circularity, i.e., that variant status was largely predicted from other variants in the same protein [28]. As expected, 53% of the false negatives in the CHASM predictions were on transcripts that fully overlapped with positive data in the CHASM training set, whereas only 0.9% were on transcripts that fully overlapped with negative data in the CHASM training set. The reverse was found for the true negatives of the CHASM predictions, i.e., more samples were located on transcripts that overlapped exclusively with negative data in the CHASM training set. Thus, CHASM was affected by type 2 circularity.
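The transcript-level overlap check described above can be sketched as follows. This is a minimal illustration under stated assumptions: "fully overlapped with positive data" is interpreted here as a transcript that carries only positive labels in the training set, and the function names and transcript identifiers are hypothetical rather than taken from CHASM or from our pipeline.

```python
from collections import defaultdict

# Assumption for this sketch: training data is a list of (transcript_id, label)
# pairs, with label 1 = driver (positive) and 0 = passenger (negative).

def transcript_label_sets(training):
    """Map each training transcript to the set of labels observed on it."""
    labels = defaultdict(set)
    for transcript, label in training:
        labels[transcript].add(label)
    return labels


def mixed_fraction(labels):
    """Fraction of training transcripts carrying both positive and negative labels."""
    if not labels:
        return 0.0
    mixed = sum(1 for seen in labels.values() if seen == {0, 1})
    return mixed / len(labels)


def overlap_fractions(variant_transcripts, labels):
    """For a group of test variants (e.g. one confusion-matrix category),
    return the fractions lying on positive-only and negative-only
    training transcripts."""
    n = len(variant_transcripts)
    if n == 0:
        return 0.0, 0.0
    pos_only = sum(1 for t in variant_transcripts if labels.get(t) == {1})
    neg_only = sum(1 for t in variant_transcripts if labels.get(t) == {0})
    return pos_only / n, neg_only / n


# Hypothetical toy data (transcript IDs are placeholders).
toy_training = [("T1", 1), ("T1", 1), ("T2", 0), ("T3", 1), ("T3", 0)]
toy_labels = transcript_label_sets(toy_training)
toy_group = ["T1", "T1", "T2", "T4"]  # transcripts of one prediction category
toy_pos_only, toy_neg_only = overlap_fractions(toy_group, toy_labels)
```

A strong skew of one prediction category toward positive-only or negative-only training transcripts is the kind of signal that would be read as evidence of type 2 circularity.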
CanDrA assumed that driver mutations occur recurrently in close proximity (hotspots) across different cancer types, whereas passenger mutations never fall in Cancer Gene Census (CGC) genes [16, 29]. Based on our results, we suspected type 2 circularity in CanDrA, because these training set screening criteria inevitably lead to little overlap between positive and negative samples at the transcript level. When we intersected the genes of the negative samples in the independent test set with the CGC genes, the genes common to both sets, which could not have supplied negative samples to the CanDrA training set, contained 95% of the negative samples in the independent test set, of which only 3% were true negatives. In contrast, the genes found exclusively among the negative-sample genes of the independent test set, which could have supplied negative samples to the CanDrA training set, contained 5% of the negative samples in the independent test set, of which > 80% were true negatives. Thus, CanDrA predicted the status of a variant based on other variants in the same protein, i.e., type 2 circularity. These results indicate that the low AUC values obtained by both CHASM and CanDrA are primarily attributable to type 2 circularity. In addition, from the perspective of training data quality, we propose that the negative samples used in CHASM and CanDrA cannot model the broad spectrum of passenger mutations.
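The gene-level stratification above can be sketched as a short calculation. The code below is an illustrative sketch, not the analysis script used in this study: the function names and gene identifiers are hypothetical, and the input is assumed to be a list of (gene, predicted label) pairs for the negative samples only. As a consistency check, weighting the per-stratum true-negative rates by the reported shares gives 0.95 × 0.03 + 0.05 × 0.80 ≈ 0.0685 when "> 80%" is taken at its lower bound of 0.80, which agrees with the overall CanDrA specificity of 0.07.

```python
# Assumption for this sketch: `negatives` holds (gene, predicted_label) pairs
# for test-set negative samples, with predicted_label 0 = passenger, 1 = driver.

def stratify_negatives(negatives, cgc_genes):
    """Split negative samples by CGC membership of their gene and report, per
    stratum, its share of all negatives and its specificity (TN rate)."""
    counts = {"in_cgc": [0, 0], "not_in_cgc": [0, 0]}  # [samples, true negatives]
    for gene, predicted in negatives:
        key = "in_cgc" if gene in cgc_genes else "not_in_cgc"
        counts[key][0] += 1
        counts[key][1] += 1 if predicted == 0 else 0
    total = len(negatives)
    summary = {}
    for key, (n, tn) in counts.items():
        summary[key] = {
            "share": n / total if total else 0.0,
            "specificity": tn / n if n else 0.0,
        }
    return summary


def overall_specificity(summary):
    """Overall specificity as the share-weighted sum of stratum specificities."""
    return sum(s["share"] * s["specificity"] for s in summary.values())


# Figures reported in the text; 0.80 is the lower bound of "> 80%".
reported = {
    "in_cgc": {"share": 0.95, "specificity": 0.03},
    "not_in_cgc": {"share": 0.05, "specificity": 0.80},
}
approx_specificity = overall_specificity(reported)

# Hypothetical toy input (gene names are placeholders).
toy_negatives = [("G1", 1), ("G1", 1), ("G1", 0), ("G2", 0)]
toy_summary = stratify_negatives(toy_negatives, cgc_genes={"G1"})
```

The decomposition makes explicit that the low overall specificity is driven almost entirely by negatives located in CGC genes.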
In the comparison with the general-disease deleterious mutation predictors, CDMPred achieved the best overall predictive ability, followed by CADD, PolyPhen-2, and REVEL. Interestingly, these methods also outperformed the second-best cancer-specific predictor. PolyPhen-2 achieved an AUPR of 0.75 and a sensitivity of 0.83, indicating relatively high predictive ability on positive samples compared with CDMPred. However, for both the positive and the negative samples of the independent test set, many PolyPhen-2 predictions were "positive", which may reflect a collection of diseases rather than cancer drivers alone [30]. For example, one of the true positives predicted by PolyPhen-2, “GATA2:p.R398W”, is related not only to acute myeloid leukemia but also to alveolar proteinosis [31, 32]. In addition, one of the false negatives predicted by PolyPhen-2, “HMBS:p.D359N”, is related not only to cancer but also to acute intermittent porphyria [33, 34]. We therefore examined the genes corresponding to the correct predictions (true negatives and true positives) and to the incorrect predictions (false negatives and false positives) of PolyPhen-2. Enrichment analysis was performed using the online tool DAVID to test these suppositions [35]. We retained pathways that were related only to general diseases, not cancer, and that had an adjusted P-value < 0.05, as calculated by a hypergeometric test followed by Benjamini-Hochberg correction. After mapping the enrichment results to the mutation level, 65% of all true negatives and true positives in the PolyPhen-2 predictions were relevant to general disease, compared with 54% of all false negatives and false positives. In conclusion, these results support the view that PolyPhen-2, and general-disease predictors more broadly, show a systematic bias in driver mutation prediction.
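For readers unfamiliar with the statistics behind the pathway filter, the following minimal sketch shows a one-sided hypergeometric enrichment test followed by Benjamini-Hochberg adjustment. It is a generic textbook formulation written for illustration and is not DAVID's own implementation; the function names, the toy counts, and the toy P-values are assumptions.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, drawn, overlap):
    """One-sided P(X >= overlap) for a hypergeometric variable.

    population: background genes; annotated: background genes in the pathway;
    drawn: genes in the query list; overlap: query genes in the pathway.
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, drawn)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy numbers only (not from the DAVID analysis in the text).
toy_p = hypergeom_enrichment_p(population=10, annotated=3, drawn=4, overlap=2)
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q < 0.05 for q in toy_adjusted]  # same cutoff as in the text
```

Pathways passing the adjusted cutoff would then be screened manually for relevance to general disease rather than cancer, a step that is not automated in this sketch.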