Increasing Transparency in Machine Learning through Bootstrap Simulation and Shapely Additive Explanations

doi:10.21203/rs.3.rs-2075948/v2

Article

This work is licensed under a CC BY 4.0 License

Version 2

Importance:

Machine learning methods are widely used within the medical field. However, the reliability and efficacy of these models are difficult to assess. We assessed whether variance calculations of model metrics (e.g., AUROC, sensitivity, specificity) through bootstrap simulation, combined with SHapley Additive exPlanations (SHAP), could increase model transparency.

Methods

Data from the England National Health Services Heart Disease Prediction Cohort were used. XGBoost was the machine-learning model of choice in this study. Bootstrap simulation (N = 10,000) was used to empirically derive the distribution of model metrics and covariate Gain statistics. SHapley Additive exPlanations (SHAP) were used to provide explanations for machine-learning output, and simulation was used to evaluate the variance of model accuracy metrics.

Result

Across the 10,000 simulations, the AUROC ranged from 0.771 to 0.947 (a difference of 0.176), the balanced accuracy from 0.688 to 0.894 (0.205), the sensitivity from 0.632 to 0.939 (0.307), and the specificity from 0.595 to 0.944 (0.349). The gain for Angina ranged from 0.225 to 0.456 (a difference of 0.231), for Cholesterol from 0.148 to 0.326 (0.178), for MaxHR from 0.081 to 0.200 (0.119), and for Age from 0.059 to 0.157 (0.098).

Conclusion

Using simulations to empirically evaluate the variance of model metrics, together with explanatory algorithms to check whether covariates match the literature, is necessary for increased transparency, reliability, and utility of machine learning methods.

Biological sciences/Computational biology and bioinformatics

Health sciences/Biomarkers

Health sciences/Cardiology

Health sciences/Health care

Health sciences/Medical research

Health sciences/Risk factors

Introduction

Machine learning (ML) algorithms generate predictions from sample data without explicit directions from the user^1–4. Common ML algorithms (e.g., XGBoost, Random Forest, Neural Networks) have been found to be more accurate than traditional parametric methods (linear regression, logistic regression)^1–5. It has been hypothesized that this increase in accuracy can be attributed to non-linear relationships between the independent and dependent variables and to interactions between multiple covariates^6–8. However, the increased accuracy of ML algorithms compared to traditional parametric methods comes at a significant cost: interpretability^9–12. Linear regression and logistic regression have clearly interpretable output that has been widely studied^13–15. Machine-learning algorithms are often non-interpretable, leading to their reputation as “black box” algorithms^3,6,12. As a result, the interpretability, reliability, and efficacy of machine-learning models are often difficult to assess^6–9.

Without methods that explain how machine learning algorithms reach their predictions, clinicians will not be able to identify whether models are reliable and generalizable or are just replicating the biases within the training datasets^16–18. Explaining how model predictions are reached and providing accurate summary statistics for model accuracy metrics (e.g., AUROC, Sensitivity, Specificity, F1, Balanced Accuracy) will increase the transparency of machine learning methods and increase confidence when using their predictions^19–20. Potential solutions to these weaknesses that have been applied within the field of computer science are SHapley Additive exPlanations (SHAP) for model interpretability and bootstrap simulation for quantifying the statistical distribution of model accuracy metrics^21–23. However, little is known about the efficacy of SHAP and bootstrap simulation in evaluating machine-learning methods for medical outcomes such as heart disease. Given these limitations in the literature, we used data from the England National Health Services Heart Disease Prediction Cohort and leveraged SHAP to provide explanations for machine-learning output and bootstrap simulation to evaluate the variance of model accuracy metrics.

Methods

A retrospective cohort study was conducted using the publicly available Heart Disease Prediction cohort (from the England National Health Services database).

All methods in this research were carried out in accordance with guidelines detailed by the Data Alliance Partnership Board (DAPB) approved national information standards and data collections for use in health and adult social care.

Independent Variable:

The demographic covariates collected were age and sex. The clinical covariates collected were resting blood pressure, fasting blood sugar, cholesterol, resting electrocardiogram (ECG), presence of Angina, and maximum heart rate.

Dependent Variable:

The dependent variable of interest was heart disease.

Model Construction And Statistical Analysis:

Descriptive statistics for all patients, patients with heart disease, and patients without heart disease were computed for all covariates and compared using chi-squared tests for categorical variables and t-tests for continuous variables.

XGBoost was the machine-learning model evaluated in this study as it is one of the most widely used machine-learning methods in medicine^26. The model metrics were the Area under the Receiver Operating Characteristic Curve (AUROC), Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, F1, Accuracy, and Balanced Accuracy. Additionally, the distribution of the Gain statistic, a measure of the percentage contribution of the variable to the model, was assessed for each covariate^26–27.
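The threshold-based metrics listed above can all be derived from the four cells of a confusion matrix. The following is a minimal sketch of those standard definitions, not the code used in the study; the class name `ConfusionCounts`, the function name, and the example counts are hypothetical, and AUROC is omitted because it depends on the ranking of scores rather than on a single threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # predicted heart disease, has heart disease
    fp: int  # predicted heart disease, no heart disease
    tn: int  # predicted no heart disease, no heart disease
    fn: int  # predicted no heart disease, has heart disease


def classification_metrics(c):
    """Standard threshold-based metrics; assumes every denominator is non-zero."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    ppv = c.tp / (c.tp + c.fp)
    npv = c.tn / (c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for illustration only (not taken from the study).
example = classification_metrics(ConfusionCounts(tp=40, fp=10, tn=40, fn=10))
```

Balanced accuracy is the mean of sensitivity and specificity, so it is less affected than plain accuracy when the two outcome classes differ in size.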

Bootstrap simulation (N = 10,000 simulations) was carried out by varying the train and test sets (70:30), rerunning the model, and assessing model metrics on the test set. The results of the 10,000 simulations were used to construct the distribution of each model metric and of the gain statistic for each independent covariate. The distribution of each statistic was evaluated visually through histograms, and analytically through summary statistics (minimum, 5th percentile, 25th percentile, 50th percentile, 75th percentile, 95th percentile, maximum, mean, standard deviation) and the Anderson-Darling test^28.
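The simulation loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the data are synthetic, the model is a simple class-mean scorer standing in for XGBoost, the number of simulations is reduced for speed, and each simulation is assumed to be a fresh random 70:30 split drawn without replacement (the text does not say whether rows are also resampled). All function names are hypothetical.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference_scorer(rows):
    """Stand-in model (NOT XGBoost): project onto the difference of the class means."""
    n_features = len(rows[0][0])

    def class_mean(label, j):
        return statistics.fmean(x[j] for x, y in rows if y == label)

    direction = [class_mean(1, j) - class_mean(0, j) for j in range(n_features)]
    return lambda x: sum(d * v for d, v in zip(direction, x))


def simulate_splits(rows, n_sim, train_frac=0.7, seed=0):
    """Refit on a new random train set each time and record the test-set AUROC."""
    rng = random.Random(seed)
    n_train = int(train_frac * len(rows))
    results = []
    for _ in range(n_sim):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        score = fit_mean_difference_scorer(train)
        results.append(auroc([y for _, y in test], [score(x) for x, _ in test]))
    return results


def summarize(values):
    """Summary statistics matching those listed in the text."""
    cuts = statistics.quantiles(values, n=20)  # cut points at 5%, 10%, ..., 95%
    return {
        "min": min(values), "p5": cuts[0], "p25": cuts[4], "p50": cuts[9],
        "p75": cuts[14], "p95": cuts[18], "max": max(values),
        "mean": statistics.fmean(values), "sd": statistics.stdev(values),
    }


def make_synthetic_cohort(n, seed=0):
    """Synthetic two-feature cohort with alternating labels (illustrative only)."""
    rng = random.Random(seed)
    return [((rng.gauss(1.0 * (i % 2), 1.0), rng.gauss(0.5 * (i % 2), 1.0)), i % 2)
            for i in range(n)]


cohort = make_synthetic_cohort(200)
aurocs = simulate_splits(cohort, n_sim=200)  # the study used 10,000 simulations
summary = summarize(aurocs)
```

Because every simulation reuses the same cohort, the spread in `summary` reflects only the choice of train and test sets, which is the source of variability the study set out to quantify.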

For model explanation, SHAP values were computed for each independent covariate and visualized in Figure 1. These visualizations were reviewed using clinician judgement to assess their concordance with understood relationships in cardiology, as a check on the predictions from the model.
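To make concrete what a SHAP value attributes to each covariate, the sketch below computes exact Shapley values for one patient by enumerating every subset of covariates. It is a brute-force illustration, not the tree-specific algorithm that SHAP software applies to models such as XGBoost. The toy risk score, the patient, and the reference values are hypothetical, and replacing "absent" covariates with reference values is one simplifying assumption among several possible ones.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance; absent features take baseline values."""
    n = len(x)

    def value(subset):
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy risk score with an interaction term (not the study's model).
def toy_model(features):
    angina, male, max_hr = features
    return 0.3 * angina + 0.1 * male - 0.002 * max_hr + 0.05 * angina * male


patient = [1, 1, 120]    # Angina present, male, maximum heart rate 120
reference = [0, 0, 140]  # hypothetical reference patient
phi = shapley_values(toy_model, patient, reference)
```

The three values in `phi` sum to the difference between the prediction for `patient` and the prediction for `reference`; this additivity is what allows each covariate's contribution to be read off separately. The interaction term is split equally between Angina and sex.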

Results

Of the 918 patients within the cohort, the mean age was 53.51 (SD = 9.43), with 193 females (21%) and 725 males (79%). The mean resting blood pressure was 132.4 (SD = 19.51) and the mean cholesterol was 198.8 (SD = 109.38); 214 (23%) patients had elevated blood sugar, 188 (20%) had Left Ventricular Hypertrophy (LVH), and 178 (19%) had ST-elevation. The mean maximum heart rate was 136.81 (SD = 25.46) and 371 (40%) patients had Angina. Full demographic information is listed in Table 1.

Compared to patients without heart disease, patients with heart disease were more often male (90% vs 65%, p<0.01) and had a higher resting blood pressure (134.2 vs 130.2, p<0.01), higher rates of elevated blood sugar (33% vs 11%, p<0.01), higher rates of ST elevation on ECG (23% vs 15%, p<0.01), and higher rates of Angina (62% vs 13%, p<0.01).

Overall Performance and Variability of the Models:

Full statistics for the model metrics are given in Table 2. The models have relatively strong performance, with median AUROC = 0.87, Balanced Accuracy = 0.79, sensitivity = 0.786, and specificity = 0.785. Across the 10,000 simulations, the AUROC ranged from 0.771 to 0.947 (a difference of 0.176), the balanced accuracy from 0.688 to 0.894 (0.205), the sensitivity from 0.632 to 0.939 (0.307), and the specificity from 0.595 to 0.944 (0.349).

Full statistics for the covariate gain statistics are given in Table 3. Angina, Cholesterol, Max Heart Rate (MaxHR), and Age are the most important predictors within the model by the gain metric. Across the 10,000 simulations, the gain for Angina ranged from 0.225 to 0.456 (a difference of 0.231), for Cholesterol from 0.148 to 0.326 (0.178), for MaxHR from 0.081 to 0.200 (0.119), and for Age from 0.059 to 0.157 (0.098).

SHAP analysis was completed and visualized for Angina, Sex, and Max Heart Rate in Figure 1. The SHAP values indicate that patients who have Angina, who are male, and who have lower maximum heart rates have a greater predicted risk of heart disease, which is concordant with the t-test and chi-squared comparisons in the Table 1 analysis. All covariates are visualized in the supplemental data (supplement figures 1-9).

The distributions of all model statistics and of the gain statistics for all covariates are shown in Figures 2 and 3, respectively. None of these distributions was significantly different from a normal distribution according to the Anderson-Darling test at a significance level of p<0.05 (Table 4).
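For readers unfamiliar with the Anderson-Darling test, the sketch below computes its statistic for normality when the mean and standard deviation are estimated from the data. It is a hedged illustration, not the study's analysis: the two samples are synthetic stand-ins for a simulated metric, and the 5% critical value of 0.752 is the commonly tabulated figure for this estimated-parameter case, quoted here as an assumption rather than derived.

```python
import math
import random
import statistics


def anderson_darling_normal(values):
    """Adjusted A-squared statistic for normality with mean and SD estimated from the data."""
    n = len(values)
    dist = statistics.NormalDist(statistics.fmean(values), statistics.stdev(values))
    cdf = [dist.cdf(v) for v in sorted(values)]
    s = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - s / n
    # Small-sample adjustment used when both parameters are estimated.
    return a2 * (1.0 + 0.75 / n + 2.25 / n**2)


CRITICAL_5_PERCENT = 0.752  # assumed tabulated value; larger statistics suggest non-normality

rng = random.Random(0)
normal_like = [rng.gauss(0.87, 0.03) for _ in range(1000)]     # synthetic, bell-shaped
uniform_like = [rng.uniform(0.77, 0.95) for _ in range(1000)]  # synthetic, flat

stat_normal = anderson_darling_normal(normal_like)
stat_uniform = anderson_darling_normal(uniform_like)
```

A statistic above the critical value, as expected for the flat sample, would lead to rejecting normality; a statistic below it is consistent with the "not significantly different from normal" finding reported above.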

Discussion

The use of bootstrap simulation generates 10,000 training and test-set combinations and thus also 10,000 sets of model accuracy statistics and covariate gain statistics^29. This method allows for empirical evaluation of the variability in model accuracy to increase the transparency of model efficacy^21,25,29.

Overall Variability In Model Accuracy:

From simulations, we observed that the AUROC ranged from 0.771 to 0.947, a difference of 0.176. These simulations highlight that for smaller datasets (< 10,000 patients) there may be considerable variation in the classification efficacy of the XGBoost model across different training-test set combinations. At the higher end, an AUROC of 0.947 implies near-perfect discrimination, whereas an AUROC of 0.771, although still substantially more predictive than random chance, provides a much lower level of confidence in the predictions of the model. This highlights a potential issue in replication of machine-learning methods on similar cohorts^21,25,30. Two studies may find vastly different results in the predictive accuracy of machine-learning methods even if they use near-identical models, covariates, and model summary statistics, simply because of the choice of the train-test sets (which are determined strictly by random number generation)^31. As a result, this study highlights the importance of using multiple different train and test sets when applying machine learning to the prediction of clinical outcomes, so that the variance that arises solely from the selection of train and test sets is represented. Doing so characterizes the accuracy of the model more faithfully and allows for better replication of the study. While the only metric discussed in this section is AUROC, the findings were similar for the other accuracy metrics provided in Table 2.

Overall Variability In Covariate Gain Statistics:

In addition to the variability in model efficacy, there is also significant variability within the gain statistics for each of the covariates. We observed that the gain for Angina ranged from 0.225 to 0.456, a difference of 0.231. Since the gain statistic is a measure of the percentage contribution of the variable to the model, a covariate can have vastly different contributions to the final predictions of the model depending on the train and test set. This variability in the contribution of each covariate to the final model highlights the potential dangers of training-set bias. Depending on which training set is drawn, a covariate can be twice as important to the final result of the model. This result highlights the need for multiple different “seeds” to be set when splitting the training and test sets prior to model training, in order to avoid potential training-set biases and to have the model at least be representative of the cohort it is being trained and tested on (if not representative of the population from which the cohort is sampled). As with the model accuracy statistics, this also highlights the difficulty of replicating machine-learning results from study to study. Even in our simulation studies with identical cohorts, identical model parameters, and identical covariates, we observed significant variation in which covariates were weighted highly in the final model output. This highlights the need to carefully evaluate the results of the model and not rely on a single seed to set the training and test sets, to avoid potential pitfalls that stem from training-test bias. While the only covariate discussed in this section is Angina, the findings were similar for the other covariates provided in Table 3.
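The "twice as important" observation follows directly from the reported extremes of the Angina gain. The short sketch below shows how raw gains can be rescaled into shares that sum to 1 and then checks that ratio; the raw gain values and the function name `gain_shares` are hypothetical, while the two Angina extremes are the figures reported above.

```python
def gain_shares(raw_gain):
    """Rescale raw per-covariate gains so that they sum to 1."""
    total = sum(raw_gain.values())
    return {name: value / total for name, value in raw_gain.items()}


# Hypothetical raw gains for one fitted model (illustrative values only).
shares = gain_shares({"Angina": 42.0, "Cholesterol": 28.0, "MaxHR": 18.0, "Age": 12.0})

# Reported extremes of the Angina gain across the 10,000 simulations.
angina_min, angina_max = 0.225, 0.456
angina_ratio = angina_max / angina_min  # roughly 2.03
```

Because the shares sum to 1, a larger share for one covariate in a given split necessarily means smaller shares for the others, so the ranges of different covariates are not independent of one another.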

Utility Of SHAP For Model Explanation And Allowing For Augmented Intelligence:

Given the high level of variability in model accuracy metrics and in covariate importance across different combinations of training and test sets, algorithms that explain the model are necessary to reduce the potential for algorithmic bias^28. After simulation of model accuracy and covariate gain metrics, a seed can be chosen that represents the center of the distribution of the model accuracy metrics and covariate gain statistics. SHAP may then be run on the model trained with that seed to allow for interpretation of model covariates.
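One way to pick such a representative seed is sketched below. It is an assumption about how "center of the distribution" could be made operational, not the procedure used in the study: each seed's metrics are compared with the per-metric medians, distances are scaled by each metric's standard deviation, and the seed with the smallest total distance is returned. The function name and the example values are hypothetical.

```python
import statistics


def most_central_seed(metrics_by_seed):
    """Return the seed whose metrics lie closest to the per-metric medians.

    Distances are measured in units of each metric's standard deviation,
    so every metric is assumed to vary across seeds.
    """
    rows = list(metrics_by_seed.values())
    names = list(rows[0])
    center = {m: statistics.median([r[m] for r in rows]) for m in names}
    spread = {m: statistics.stdev([r[m] for r in rows]) for m in names}

    def distance(row):
        return sum(abs(row[m] - center[m]) / spread[m] for m in names)

    return min(metrics_by_seed, key=lambda seed: distance(metrics_by_seed[seed]))


# Hypothetical per-seed results (illustrative values only).
results = {
    1: {"auroc": 0.80, "gain_angina": 0.25},
    2: {"auroc": 0.87, "gain_angina": 0.33},
    3: {"auroc": 0.93, "gain_angina": 0.44},
}
chosen = most_central_seed(results)
```

In this toy example the chosen seed is the one whose AUROC and Angina gain both equal the medians. With many metrics, no single seed will be central on all of them, so the scaling and the choice of distance become judgment calls.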

In traditional parametric methods such as linear regression, each covariate can be interpreted clearly (e.g., for each 1-unit increase in x, we observe a 2-unit increase in y)^18. However, due to the complexity of the non-parametric algorithms that are common in machine learning, it is impractical for a human to analyze each tree and explain how the model works. SHAP allows for a covariate interpretation similar to that of linear regression, even if the exact effect sizes of the covariates cannot be interpreted in the same way. Figure 1A highlights the relationship between increasing values of a covariate (purple) and increased odds of heart disease. Additionally, Figs. 1B, 1C, and 1D allow for observation of the effect sizes of individual covariates. Within these plots, Angina is associated with a significant increase in risk of heart disease, male sex with an increased chance of heart disease, and greater maximum heart rate with a decreased risk of heart disease. In evaluating these three covariates, a researcher or clinician can make judgment calls on whether they are concordant with the medical literature (prospective clinical trials, retrospective analyses, physiological mechanisms) to validate the results of the model. If the results of the model are not concordant with the medical literature, either a potentially new interpretation of the covariate should be investigated, or the model should be examined further for confounders that could explain the observed discrepancies.

Limitations

This study has several strengths and weaknesses. One weakness is that this study uses only one cohort, which may not have complete electronic health record data (charts, most labs, diagnoses, or procedural codes), to evaluate model variance. However, since the goal was to evaluate methods to increase transparency in machine learning rather than to develop models for heart disease, this is less of a concern. Furthermore, use of a publicly available dataset already built into an R package allows for increased replicability of this study, which is concordant with the general recommendations within this paper^32. Another weakness is that this methodology still needs to be replicated with other machine-learning methods (neural networks, random forest) and in other cohorts, both smaller and larger, to better understand how random chance in selecting training and test sets can affect the perceived accuracy of a model and the perceived importance of its covariates.

Conclusion

Machine learning algorithms are a powerful tool for medical prediction. Using simulations to empirically evaluate the variance of model metrics, together with explanatory algorithms to check whether covariates match the literature, is necessary for increased transparency of machine learning methods, helping to detect true signal in the data instead of perpetuating biases within the training datasets.

Data Availability Statement: The datasets generated and analyzed within this study are available through the National Health Services R community at https://nhsrcommunity.com/ and through the MLDataR package^24–25.

References

Heo J, Yoon JG, Park H, Kim YD, Nam HS, Heo JH. Machine Learning-Based Model for Prediction of Outcomes in Acute Stroke. Stroke. 2019 May;50(5):1263–1265. doi: 10.1161/STROKEAHA.118.024293. PMID: 30890116.
Kalafi EY, Nor NAM, Taib NA, Ganggayah MD, Town C, Dhillon SK. Machine Learning and Deep Learning Approaches in Breast Cancer Survival Prediction Using Clinical Data. Folia Biol (Praha). 2019;65(5–6):212–220. PMID: 32362304.
Dong J, Feng T, Thapa-Chhetry B, Cho BG, Shum T, Inwald DP, Newth CJL, Vaidya VU. Machine learning model for early prediction of acute kidney injury (AKI) in pediatric critical care. Crit Care. 2021 Aug 10;25(1):288. doi: 10.1186/s13054-021-03724-0. PMID: 34376222; PMCID: PMC8353807.
Wang Z, Li H, Carpenter C, Guan Y. Challenge-Enabled Machine Learning to Drug-Response Prediction. AAPS J. 2020 Aug 10;22(5):106. doi: 10.1208/s12248-020-00494-5. PMID: 32778984.
Sajjadian M, Lam RW, Milev R, Rotzinger S, Frey BN, Soares CN, Parikh SV, Foster JA, Turecki G, Müller DJ, Strother SC, Farzan F, Kennedy SH, Uher R. Machine learning in the prediction of depression treatment outcomes: a systematic review and meta-analysis. Psychol Med. 2021 Dec;51(16):2742–2751. doi: 10.1017/S0033291721003871. Epub 2021 Oct 12. PMID: 35575607.
Kamerzell TJ, Middaugh CR. Prediction Machines: Applied Machine Learning for Therapeutic Protein Design and Development. J Pharm Sci. 2021 Feb;110(2):665–681. doi: 10.1016/j.xphs.2020.11.034. Epub 2020 Dec 2. PMID: 33278409.
Li Y, Pu F, Wang J, Zhou Z, Zhang C, He F, Ma Z, Zhang J. Machine Learning Methods in Prediction of Protein Palmitoylation Sites: A Brief Review. Curr Pharm Des. 2021;27(18):2189–2198. doi: 10.2174/1381612826666201112142826. PMID: 33183190.
Kausch SL, Moorman JR, Lake DE, Keim-Malpass J. Physiological machine learning models for prediction of sepsis in hospitalized adults: An integrative review. Intensive Crit Care Nurs. 2021 Aug;65:103035. doi: 10.1016/j.iccn.2021.103035. Epub 2021 Apr 17. PMID: 33875337.
Patel L, Shukla T, Huang X, Ussery DW, Wang S. Machine Learning Methods in Drug Discovery. Molecules. 2020 Nov 12;25(22):5277. doi: 10.3390/molecules25225277. PMID: 33198233; PMCID: PMC7696134.
Chan HP, Samala RK, Hadjiiski LM, Zhou C. Deep Learning in Medical Image Analysis. Adv Exp Med Biol. 2020;1213:3–21. doi: 10.1007/978-3-030-33128-3_1. PMID: 32030660; PMCID: PMC7442218.
Liu J, Chen Y, Li S, Zhao Z, Wu Z. Machine learning in orthodontics: Challenges and perspectives. Adv Clin Exp Med. 2021 Oct;30(10):1065–1074. doi: 10.17219/acem/138702. PMID: 34610222.
Marill KA. Advanced statistics: linear regression, part II: multiple linear regression. Acad Emerg Med. 2004 Jan;11(1):94–102. doi: 10.1197/j.aem.2003.09.006. PMID: 14709437.
Nick TG, Campbell KM. Logistic regression. Methods Mol Biol. 2007;404:273–301. doi: 10.1007/978-1-59745-530-5_14. PMID: 18450055.
Bender R, Grouven U. Ordinal logistic regression in medical research. J R Coll Physicians Lond. 1997 Sep-Oct;31(5):546–51. PMID: 9429194; PMCID: PMC5420958.
Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann Intern Med. 2018 Dec 18;169(12):866–872. doi: 10.7326/M18-1990. Epub 2018 Dec 4. PMID: 30508424; PMCID: PMC6594166.
Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, Collins GS, Bajpai R, Riley RD, Moons KGM, Hooft L. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ. 2021 Oct 20;375:n2281. doi: 10.1136/bmj.n2281. PMID: 34670780; PMCID: PMC8527348.
Whalen S, Schreiber J, Noble WS, Pollard KS. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet. 2022 Mar;23(3):169–181. doi: 10.1038/s41576-021-00434-9. Epub 2021 Nov 26. PMID: 34837041.
Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. JAMA Intern Med. 2018 Nov 1;178(11):1544–1547. doi: 10.1001/jamainternmed.2018.3763. PMID: 30128552; PMCID: PMC6347576.
Fleuren LM, Klausch TLT, Zwager CL, Schoonmade LJ, Guo T, Roggeveen LF, Swart EL, Girbes ARJ, Thoral P, Ercole A, Hoogendoorn M, Elbers PWG. Machine learning for the prediction of sepsis: a systematic review and meta-analysis of diagnostic test accuracy. Intensive Care Med. 2020 Mar;46(3):383–400. doi: 10.1007/s00134-019-05872-y. Epub 2020 Jan 21. PMID: 31965266; PMCID: PMC7067741.
Endo H, Uchino S, Hashimoto S, Aoki Y, Hashiba E, Hatakeyama J, Hayakawa K, Ichihara N, Irie H, Kawasaki T, Kumasawa J, Kurosawa H, Nakamura T, Ohbe H, Okamoto H, Shigemitsu H, Tagami T, Takaki S, Takimoto K, Uchida M, Miyata H. Development and validation of the predictive risk of death model for adult patients admitted to intensive care units in Japan: an approach to improve the accuracy of healthcare quality measures. J Intensive Care. 2021 Feb 15;9(1):18. doi: 10.1186/s40560-021-00533-z. PMID: 33588956; PMCID: PMC7885245.
Gramegna A, Giudici P. SHAP and LIME: An Evaluation of Discriminative Power in Credit Risk. Front Artif Intell. 2021 Sep 17;4:752558. doi: 10.3389/frai.2021.752558. PMID: 34604738; PMCID: PMC8484963.
Wojtuch A, Jankowski R, Podlewska S. How can SHAP values help to shape metabolic stability of chemical compounds? J Cheminform. 2021 Sep 27;13(1):74. doi: 10.1186/s13321-021-00542-y. PMID: 34579792; PMCID: PMC8477573.
Tseng PY, Chen YT, Wang CH, Chiu KM, Peng YS, Hsu SP, Chen KL, Yang CY, Lee OK. Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Crit Care. 2020 Jul 31;24(1):478. doi: 10.1186/s13054-020-03179-9. PMID: 32736589; PMCID: PMC7395374.
Bardsley M, Steventon A, Fothergill G. Untapped potential: Investing in health and care data analytics. London: Health Foundation; 2019.
Masir N, Ghoddoosi M, Mansor S, Abdul-Rahman F, Florence CS, Mohamed-Ismail NA, Tamby MR, Md-Latar NH. RCL2, a potential formalin substitute for tissue fixation in routine pathological specimens. Histopathology. 2012 Apr;60(5):804–15. doi: 10.1111/j.1365-2559.2011.04127.x. Epub 2012 Feb 9. PMID: 22320393.
Hou N, Li M, He L, Xie B, Wang L, Zhang R, Yu Y, Sun X, Pan Z, Wang K. Predicting 30-days mortality for MIMIC-III patients with sepsis-3: a machine learning approach using XGboost. J Transl Med. 2020 Dec 7;18(1):462. doi: 10.1186/s12967-020-02620-5. PMID: 33287854; PMCID: PMC7720497.
Dinh A, Miertschin S, Young A, Mohanty SD. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med Inform Decis Mak. 2019 Nov 6;19(1):211. doi: 10.1186/s12911-019-0918-5. PMID: 31694707; PMCID: PMC6836338.
Nisbet ML, Pendleton IM, Nolis GM, Griffith KJ, Schrier J, Cabana J, Norquist AJ, Poeppelmeier KR. Machine-Learning-Assisted Synthesis of Polar Racemates. J Am Chem Soc. 2020 Apr 22;142(16):7555–7566. doi: 10.1021/jacs.0c01239. Epub 2020 Apr 13. PMID: 32233475.
Li Y, Wei Y, Li B, Alterovitz G. Modified Anderson-Darling test-based target detector in non-homogenous environments. Sensors (Basel). 2014 Aug 29;14(9):16046-61. doi: 10.3390/s140916046. PMID: 25177800; PMCID: PMC4208161.
de la Fuente-Anuncibay R, González-Barbadillo Á, Ortega-Sánchez D, Ordóñez-Camblor N, Pizarro-Ruiz JP. Anger Rumination and Mindfulness: Mediating Effects on Forgiveness. Int J Environ Res Public Health. 2021 Mar 6;18(5):2668. doi: 10.3390/ijerph18052668. PMID: 33800890; PMCID: PMC7967311.
Romero J, Chiang S, Goldenholz DM. Can machine learning improve randomized clinical trial analysis? Seizure. 2021 Oct;91:499–502. doi: 10.1016/j.seizure.2021.07.033. Epub 2021 Aug 2. PMID: 34365104; PMCID: PMC8435025.
Molnár C, Kaplan F, Roy P, et al. Classification of dog barks: a machine learning approach. Anim Cogn. 2008;11:389–400. doi: 10.1007/s10071-007-0129-9.
Bardsley M, Steventon A. Untapped Potential: Investing in health and care data analytics. 2019.

Tables 1-4 are available in the supplementary files section.

No competing interests reported.
