The use of bootstrap simulation generates 10,000 training and test-set combinations and thus also 10,000 model accuracy statistics and covariate gain statistics29. This method allows for empirical evaluation of the variability in model accuracy, increasing the transparency of reported model efficacy21,25,29.
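The repeated-split procedure can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the synthetic cohort, the trivial threshold "model" standing in for XGBoost, and the 200-seed count are all assumptions made for demonstration.

```python
import random
import statistics

def split_accuracy(data, seed, test_frac=0.3):
    """Shuffle with a given seed, split into train/test, 'fit' a trivial
    threshold model on the training set, and return test-set accuracy."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # "Fitting": use the training-set mean of x as the decision threshold.
    threshold = sum(x for x, _ in train) / len(train)
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)

# Synthetic cohort: feature x is shifted upward for positive cases.
rng = random.Random(0)
cohort = [(rng.gauss(1.0, 1.0), 1) for _ in range(300)] + \
         [(rng.gauss(0.0, 1.0), 0) for _ in range(300)]

# One accuracy statistic per seed: the spread across seeds is the
# variability attributable solely to the train/test split.
accuracies = [split_accuracy(cohort, seed) for seed in range(200)]
print(min(accuracies), max(accuracies), statistics.stdev(accuracies))
```

Even with an identical cohort and identical "model", the minimum and maximum accuracies differ, which is the phenomenon the simulations quantify.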

**Overall Variability in Model Accuracy:**

From these simulations, we observed that the AUROC ranged from 0.771 to 0.947, a difference of 0.176. This highlights that for smaller datasets (<10,000 patients), there may be considerable variation in the classification efficacy of the XGBoost model depending on the training-test set combination. At the higher end, an AUROC of 0.947 implies a near-perfect fit, while an AUROC of 0.771, although still significantly more predictive than random chance, provides a much lower level of confidence in the model's predictions. This points to a potential issue in replicating machine-learning methods on similar cohorts21,25,30. Two studies may find vastly different predictive accuracy even if they use near-identical models, covariates, and model summary statistics, solely due to the choice of train-test sets (which is determined strictly by random number generation)31. As a result, this study highlights the importance of using multiple different train and test sets when applying machine-learning to predict clinical outcomes, in order to represent the variance attributable to the train-test split alone. This more accurately characterizes model accuracy and allows for better replication of the study. While the only accuracy metric discussed in this section is AUROC, these findings were similar for the other accuracy metrics provided in Table 2.
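To make the gap between 0.947 and 0.771 concrete: AUROC equals the probability that a randomly chosen case is scored above a randomly chosen control. A minimal sketch of that pairwise (Mann-Whitney) formulation, with hypothetical model scores:

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (case, control) pairs ranked correctly;
    ties count as half-correct (the Mann-Whitney formulation)."""
    pairs = correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            pairs += 1
            if p > n:
                correct += 1
            elif p == n:
                correct += 0.5
    return correct / pairs

# Hypothetical risk scores for diseased (pos) and healthy (neg) patients.
pos = [0.9, 0.5]
neg = [0.7, 0.6, 0.4]
print(auroc(pos, neg))  # 4 of 6 pairs ranked correctly -> 0.666...
```

An AUROC of 0.771 thus means roughly 23% of case-control pairs are mis-ranked, versus about 5% at 0.947, which is why confidence in the model's predictions differs so sharply across splits.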

**Overall Variability in Covariate Gain Statistics:**

In addition to the variability in model efficacy, there is also significant variability in the gain statistics for each of the covariates. We observed that the gain for Angina ranged from 0.225 to 0.456, a difference of 0.231. Since the gain statistic measures the percentage contribution of a variable to the model, a covariate can make vastly different contributions to the final predictions depending on the train and test set. This variability highlights the potential dangers of training-set bias: depending on which training set is drawn, a covariate can be twice as important to the final result of the model. This result underscores the need to set multiple different "seeds" prior to model training when splitting the training and test sets, in order to avoid training-set biases and to ensure the model is at least representative of the cohort it is trained and tested on (if not of the population from which the cohort is sampled). As with the model accuracy statistics, this also highlights the difficulty of replicating machine-learning results from study to study. Even in our simulations with identical cohorts, identical model parameters, and identical covariates, we observed significant variation in which covariates were weighted highly in the final model output. This highlights the need to evaluate model results carefully and not rely on a single seed for the train-test split, to avoid pitfalls that stem from training-set bias. While the only covariate discussed in this section is Angina, these findings were similar for the other covariates provided in Table 3.
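The per-seed percentage contribution of a covariate can be obtained by normalizing each trained model's raw gain table, roughly as follows. The covariate names and gain values below are hypothetical, chosen only to illustrate the computation of the range across seeds:

```python
# Hypothetical raw gain tables from models trained under different seeds.
gain_by_seed = {
    11: {"Angina": 3.2, "MaxHeartRate": 2.1, "Sex": 1.1},
    42: {"Angina": 1.9, "MaxHeartRate": 2.4, "Sex": 1.0},
    77: {"Angina": 2.8, "MaxHeartRate": 1.5, "Sex": 1.6},
}

def gain_shares(raw):
    """Normalize raw gains to fractional contributions (sum to 1)."""
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

shares = {seed: gain_shares(raw) for seed, raw in gain_by_seed.items()}
angina = [s["Angina"] for s in shares.values()]
# The spread of one covariate's share across seeds is the quantity
# reported as the gain range in the text.
print(min(angina), max(angina), max(angina) - min(angina))
```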

**Utility of SHAP for Model Explanation and Allowing for Augmented Intelligence:**

Given the high variability in model accuracy metrics and covariate importance across different combinations of training and test sets, algorithms that explain the model are necessary to reduce the potential for algorithmic bias28. After simulating the model accuracy and covariate gain metrics, a seed can be chosen that accurately represents the center of the distributions of model accuracy metrics and covariate gain statistics. SHAP may then be applied for model explanation, allowing interpretation of the model covariates.
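Selecting a representative seed from the simulated distribution might look like the following sketch, where the per-seed AUROC values are hypothetical and a fuller version would also weigh the covariate gain distributions:

```python
import statistics

# Hypothetical AUROC per seed from the simulation step.
auroc_by_seed = {1: 0.79, 2: 0.86, 3: 0.93, 4: 0.85, 5: 0.88}

median_auroc = statistics.median(auroc_by_seed.values())
# Pick the seed whose accuracy sits closest to the center of the distribution.
representative_seed = min(
    auroc_by_seed, key=lambda s: abs(auroc_by_seed[s] - median_auroc)
)
print(median_auroc, representative_seed)
```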

In traditional parametric methods such as linear regression, each covariate can be interpreted clearly (e.g., for each 1-unit increase in x, we observe a 2-unit increase in y)18. However, due to the complexity of the non-parametric algorithms common in machine-learning methods, it is infeasible for a human to analyze each tree and explain how the model arrives at its predictions. SHAP therefore allows a covariate interpretation similar to that of linear regression, even if the exact effect sizes of the covariates cannot be interpreted the way they can in linear regression. Figure 1A highlights the relationship between increasing values of a covariate (purple) and increased odds of heart disease. Additionally, Figures 1B, 1C, and 1D allow observation of the effect sizes of individual covariates. Within these plots we observe that patients with Angina have a significantly increased risk of heart disease, patients who are male have an increased chance of heart disease, and patients with greater maximum heart rates have a decreased risk of heart disease. In evaluating these three covariates, a researcher or clinician can judge whether they are concordant with the medical literature (prospective clinical trials, retrospective analyses, physiological mechanisms) to validate the results of the model. If the results are not concordant with the medical literature, either a potentially new interpretation of the covariate should be investigated, or the model should be further evaluated for confounders that may explain the observed discrepancies.
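Underlying SHAP are Shapley values, which attribute a prediction by averaging each covariate's marginal contribution over all orderings in which covariates could be "revealed" to the model. A brute-force sketch for a hypothetical three-covariate additive risk score follows; the covariate names and effect sizes are invented, and SHAP's TreeExplainer computes the same quantity efficiently for real tree ensembles:

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values by averaging marginal contributions
    over all feature orderings (tractable only for few features)."""
    names = list(features)
    phi = {k: 0.0 for k in names}
    orderings = list(permutations(names))
    for order in orderings:
        present = {}
        prev = value_fn(present)  # prediction with no covariates revealed
        for name in order:
            present[name] = features[name]
            cur = value_fn(present)
            phi[name] += cur - prev  # marginal contribution of this covariate
            prev = cur
    return {k: v / len(orderings) for k, v in phi.items()}

# Hypothetical additive risk score: baseline 0.1 plus per-covariate effects.
EFFECTS = {"angina": 0.30, "male": 0.10, "max_hr": -0.15}

def risk(present):
    return 0.1 + sum(EFFECTS[k] * v for k, v in present.items())

patient = {"angina": 1, "male": 1, "max_hr": 1}
print(shapley_values(patient, risk))
# For an additive model the attributions recover the per-covariate effects.
```

The attributions sum to the difference between the patient's prediction and the baseline, which is what lets a clinician read a SHAP plot covariate by covariate, as with Figures 1B-1D.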

**Limitations:** This study has several strengths and weaknesses. One weakness is that it uses only one cohort, which may not have complete electronic health record data (charts, most labs, diagnoses, or procedural codes), to evaluate model variance. However, since the goal was to evaluate methods for increasing transparency in machine-learning rather than to develop models for heart disease, this is less of a concern. Furthermore, the use of a publicly available dataset already built into an R package increases the replicability of this study, which is concordant with the general recommendations of this paper32. Another weakness is that this methodology needs to be replicated with other machine-learning methods (neural networks, random forests) and in other cohorts, both smaller and larger, to better understand how random chance in selecting training and test sets can significantly impact the perceived model accuracy and the perceived importance of model covariates.