Assessing Calibration and Bias of a Deployed Machine Learning Malnutrition Prediction Model within a Large Healthcare System

doi:10.21203/rs.3.rs-3411582/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 06 Jun, 2024

Read the published version in npj Digital Medicine →

You are reading this latest preprint version

Introduction

Malnutrition is a frequently underdiagnosed condition leading to increased morbidity, mortality and healthcare costs. The Mount Sinai Health System (MSHS) deployed a machine learning model (MUST-Plus) to detect malnutrition upon hospital admission. However, in diverse patient groups a poorly calibrated model may lead to misdiagnosis, exacerbating health care disparities. We explored the model’s calibration across different variables and methods to improve calibration.

Methods

Data from adult (age ≥ 18) patients admitted to 5 MSHS hospitals from September 20, 2020 to December 31, 2022 were analyzed. We compared MUST-Plus predictions to the registered dietitian’s formal assessment. We assessed calibration following the hierarchy of weak, moderate, and strong calibration. We tested statistical differences in intercept and slope by bootstrapping with replacement.

Results

We included 49,282 patients (median age = 66.0). The overall calibration intercept was -1.25 (95% CI: -1.28, -1.22), and the slope was 1.55 (95% CI: 1.51, 1.59). Calibration was not significantly different between White and Black patients. The calibration intercept was significantly different between male and female patients. Both calibration intercepts and slopes were statistically different between 2021 and 2022. Recalibration improved calibration of the model across race, gender, and year.

Discussion

MUST-Plus underestimates malnutrition risk in females relative to males but shows a similar calibration slope in both groups, suggesting similar distributions of risk estimation. Recalibration is effective at reducing miscalibration across all patient subgroups. Continual monitoring and timely recalibration can improve model accuracy.

Widespread adoption of electronic health records (EHR) has facilitated the implementation of machine learning-based models to enable data-driven outcome prediction.^1,2 Historically, assessment of predictive model performance has focused on discrimination rather than calibration.^3–5 Calibration, the “Achilles’ Heel” of predictive analytics, is defined as the agreement between the predicted probability of an outcome for a particular person and the observed frequency of that outcome among all similar patients.⁶ If a model is poorly calibrated, even with high discrimination, it can produce biased individual predicted probabilities, which lead to poorer decision-making by patients and healthcare professionals.^7–10

Identifying cases of malnutrition has important clinical implications because accurate identification and clinical management of malnutrition have been found to reduce the risk of hospital-acquired conditions and readmission within 30 days.^12–14 Timsina et al. developed a malnutrition predictive model (MUST-Plus) derived from EHR data using a Random Forest approach to improve timely identification of high-risk patients for registered dieticians.¹¹ In this study, we assess the overall calibration of the MUST-Plus model and whether calibration changed for various patient subgroups as well as over time since it was first deployed. We hypothesized that the calibration of the MUST-Plus model would not be significantly different between male and female patients, between White and Black patients, or over time.

Study Population

We obtained Institutional Review Board Approval (IRB #18–00573) for this retrospective study. The study cohort consisted of adults (age ≥ 18 years) admitted to the Mount Sinai Health System between September 20, 2020 and December 31, 2022 who had an evaluation performed by a certified registered dietician (RD) as described below. The RD evaluation was considered the gold standard to which the predicted output was compared. Individual hospitals were anonymized in this manuscript.

MUST-Plus model and workflow

We sought to assess calibration, in addition to discrimination, of a real-time deployed malnutrition prediction model (MUST-Plus) derived using EHR data from a large tertiary care center. The details of the model development have been previously described.¹¹ Briefly, the MUST-Plus model uses a Random Forest approach with 53 predictor variables to predict the likelihood of moderate or severe malnutrition upon hospital admission. The MUST-Plus prediction is used by the RD team to prioritize which inpatients to evaluate during their daily rounds. After evaluation, a malnutrition diagnosis (yes/no) is documented by the RD if a minimum of two of the following diagnostic criteria are met: inadequate energy (kilocalorie) intake compared to estimated requirements; significant percentage of unintentional body weight loss within one year; and findings of muscle wasting, subcutaneous fat wasting, or fluid accumulation (edema) on physical examination.¹⁴ The model has been deployed at 5 Mount Sinai hospitals since 2018.

Calibration Analysis

We systematically evaluated calibration of the MUST-Plus model according to the calibration hierarchy proposed by Van Calster et al. and described briefly below.⁶ We also assessed calibration of the model across various subgroups: race, gender, year, and hospital facility. Discrimination was assessed using Harrell’s concordance index (c-index).¹⁵ Bootstrapping with replacement (n = 5000) was performed to generate confidence intervals for calibration metrics using the boot R package (version 1.3).¹⁶ Empirical p-values for differences in calibration intercepts and calibration slopes were calculated from bootstrapped distributions, and a Bonferroni correction was applied adjusting for 18 hypothesis tests.
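
As a rough illustration of the bootstrap comparison described above, the following Python sketch resamples two subgroups with replacement, collects the difference in a statistic, and derives a two-sided empirical p-value together with a Bonferroni threshold. The original analysis used the boot R package; the function names, the alpha of 0.05, and the particular p-value construction (twice the smaller tail fraction on either side of zero) are assumptions made for illustration, not the authors' code.

```python
import random


def mean(values):
    """Placeholder statistic; the study compared calibration intercepts and slopes."""
    return sum(values) / len(values)


def bootstrap_difference(group_a, group_b, statistic, n_boot=5000, seed=0):
    """Bootstrap distribution of statistic(group_a) - statistic(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistic(resample_a) - statistic(resample_b))
    return diffs


def empirical_p_value(diffs):
    """Two-sided empirical p-value (assumed construction, see lead-in)."""
    n = len(diffs)
    frac_at_or_below_zero = sum(d <= 0 for d in diffs) / n
    frac_at_or_above_zero = sum(d >= 0 for d in diffs) / n
    return min(1.0, 2 * min(frac_at_or_below_zero, frac_at_or_above_zero))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests
```

With an assumed alpha of 0.05 and the 18 tests stated above, `bonferroni_threshold(0.05, 18)` is about 0.0028, which rounds to the 0.003 threshold reported with Table 4. An empirical p-value of 0 from 5000 resamples means that no resampled difference crossed zero.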

Weak Calibration

Weak calibration assesses calibration more broadly by measuring the intercept (optimal = 0) and slope (optimal = 1) of a logistic calibration fit. An intercept greater than 0 indicates an overestimation of risk, and a logistic calibration slope greater than 1 indicates underfitting of the model to the data.
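
The logistic calibration fit can be written as logit(P(y = 1)) = a + b · logit(p), where p is the predicted risk, a is the calibration intercept, and b is the calibration slope. The sketch below estimates a and b with a small Newton-Raphson routine in plain Python. It is a minimal illustration under stated assumptions (function names, clipping constant, and iteration settings are hypothetical); the original analysis relied on the rms R package.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def calibration_intercept_slope(pred_risk, outcome, n_iter=25, eps=1e-6):
    """Fit logit(P(y=1)) = a + b * logit(pred_risk) by Newton-Raphson.

    pred_risk: predicted probabilities in (0, 1); outcome: 0/1 labels.
    Returns (a, b); perfect weak calibration corresponds to (0, 1).
    """
    # Clip predictions away from 0 and 1 so the logit stays finite.
    x = [logit(min(max(p, eps), 1.0 - eps)) for p in pred_risk]
    a, b = 0.0, 1.0  # start from the "perfectly calibrated" values
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcome):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            r = yi - mu
            g0 += r            # gradient with respect to a
            g1 += r * xi       # gradient with respect to b
            h00 += w           # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < 1e-10:
            break
    return a, b
```

For example, if predictions of 1/3, 1/2, and 2/3 correspond to observed event rates of 0.2, 0.5, and 0.8, the observed log-odds are exactly twice the predicted log-odds, so the fit returns an intercept of about 0 and a slope of about 2.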

Moderate Calibration

Moderate calibration assesses calibration more carefully by measuring the concordance of predicted risks with observed events using smoothed calibration curves. When moderate calibration is perfectly achieved, the predicted risk loess line falls exactly along the diagonal, meaning that the predicted risk is exactly equivalent to the observed incidence. If the loess line deviates from the diagonal, it indicates an under- or over-prediction of risk. We also calculated other measures of moderate calibration, such as the rescaled Brier score (higher is better; optimal = 1)^7,15, Eavg, and Emax. The integrated calibration index (ICI, denoted Eavg) calculates the average absolute difference between the loess predicted risk line and the ideal diagonal, providing a single number to summarize moderate calibration (optimal = 0).¹⁷ Harrell also proposed using the maximum absolute vertical deviation between the loess predicted risk line and the ideal diagonal (denoted Emax, optimal = 0) as another summary measure for moderate calibration assessment.¹⁵ According to Van Calster et al., moderate calibration is sufficient to provide clinical decision guidance.⁶
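
To make these summary measures concrete, the Python sketch below computes a rescaled Brier score and binned analogues of Eavg and Emax. Two assumptions are worth flagging: the rescaled Brier score is taken as 1 − Brier / Brier_max with Brier_max based on the outcome prevalence, and the loess smoother is swapped for equal-size bins of sorted predictions, so the binned values only approximate the loess-based Eavg and Emax used in the study. All function names are hypothetical.

```python
def rescaled_brier(pred_risk, outcome):
    """1 - Brier / Brier_max, with Brier_max = prevalence * (1 - prevalence).

    Equals 1 for perfect predictions and 0 when the model does no better
    than predicting the prevalence for everyone; it can be negative.
    """
    n = len(outcome)
    brier = sum((p - y) ** 2 for p, y in zip(pred_risk, outcome)) / n
    prevalence = sum(outcome) / n
    brier_max = prevalence * (1.0 - prevalence)
    return 1.0 - brier / brier_max


def binned_calibration_errors(pred_risk, outcome, n_bins=10):
    """Binned stand-ins for Eavg and Emax (bins replace the loess curve).

    Sorts patients by predicted risk, splits them into n_bins groups of
    near-equal size, and compares mean predicted risk with the observed
    event rate in each group. Returns (e_avg, e_max).
    """
    pairs = sorted(zip(pred_risk, outcome))
    n = len(pairs)
    gaps, sizes = [], []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        gaps.append(abs(observed - mean_pred))
        sizes.append(len(chunk))
    e_avg = sum(g * s for g, s in zip(gaps, sizes)) / sum(sizes)
    e_max = max(gaps)
    return e_avg, e_max
```

Under this definition a negative rescaled Brier score, such as the pre-recalibration value for female patients in Table 3, would indicate predictions that perform worse than assigning every patient the overall prevalence.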

Strong Calibration

Strong calibration assesses moderate calibration across levels of covariates and has been known to be challenging to achieve for many covariates in practice. To visualize some extent of strong calibration, we created calibration curves subset by race, gender, year, and hospital facility and assessed positive predictive values (assuming evaluation indicates prediction of malnutrition) and distribution of predicted risks within each subgroup.

Recalibration

Recalibration-in-the-large and logistic recalibration approaches were followed as outlined in Vergouwe et al. (2017).¹⁸ Briefly, recalibration-in-the-large was achieved by setting the slope to 1 and estimating the intercept, and logistic recalibration was achieved by allowing the slope and intercept to be freely estimated.
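
Both updating approaches map an original predicted risk p to a recalibrated risk expit(a + b · logit(p)); recalibration-in-the-large fixes b = 1 and estimates only a, whereas logistic recalibration uses a freely estimated a and b. The Python sketch below illustrates the idea with a one-parameter Newton-Raphson fit for the intercept and a helper that applies any given (a, b) pair. The function names and iteration settings are assumptions for illustration, and the slope and intercept for logistic recalibration are assumed to have been estimated separately with a logistic calibration fit.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def recalibration_in_the_large_intercept(pred_risk, outcome, n_iter=50):
    """Estimate a in logit(P(y=1)) = a + 1 * logit(pred_risk) (slope fixed at 1)."""
    x = [logit(p) for p in pred_risk]
    a = 0.0
    for _ in range(n_iter):
        mu = [expit(a + xi) for xi in x]
        gradient = sum(yi - mi for yi, mi in zip(outcome, mu))
        information = sum(mi * (1.0 - mi) for mi in mu)
        step = gradient / information
        a += step
        if abs(step) < 1e-12:
            break
    return a


def apply_recalibration(pred_risk, intercept, slope=1.0):
    """Map original risks to recalibrated risks.

    slope=1.0 gives recalibration-in-the-large; passing a freely
    estimated slope gives logistic recalibration.
    """
    return [expit(intercept + slope * logit(p)) for p in pred_risk]
```

A property that follows from the fixed-slope fit is that the mean recalibrated risk matches the observed event rate in the data used for updating, which is why this step corrects systematic over- or underestimation without changing the ranking of patients.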

All analyses were performed using R version 4.3.0¹⁹ and RStudio version 2023.03.1+446²⁰. The rms package version 6.7²¹ was used to calculate calibration statistics. Code for analysis can be found at https://github.com/latlio/malnutrition_calibration.

The study cohort included 49,282 patients (median [IQR] age = 66.0 [26.0]); 49.9% self-identified as female and 29.6% self-identified as Black or African American. Overall, 54.8% were on Medicare and 27.8% on Medicaid. Baseline characteristics are summarized in Table 1.

Table 1

Summary of Baseline Characteristics
	No Malnutrition (N = 37618)	Malnutrition (N = 11664)	Overall (N = 49282)	SMD
Age				0.34
Median [IQR]	64.0 (28.0)	70.0 (23.0)	66.0 (26.0)
Gender				0.16
Female	19465 (51.7%)	5129 (44.0%)	24594 (49.9%)
Male	18153 (48.3%)	6535 (56.0%)	24688 (50.1%)
BMI (kg/m2)				0.03
Median [IQR]	27.1 (8.52)	20.7 (4.93)	25.4 (8.79)
Missing	7164 (19.0%)	2163 (18.5%)	9327 (18.9%)
Race				0.05
Asian	2448 (6.5%)	827 (7.1%)	3275 (6.6%)
Black Or African-American	11100 (29.5%)	3498 (30.0%)	14598 (29.6%)
Other	10808 (28.7%)	3114 (26.7%)	13922 (28.2%)
White	13262 (35.3%)	4225 (36.2%)	17487 (35.5%)
Ethnicity				0.06
Hispanic	9285 (24.7%)	2592 (22.2%)	11877 (24.1%)
Not Hispanic/Latino	28333 (75.3%)	9072 (77.8%)	37405 (75.9%)
Medical/Surgical				0.16
Med	24345 (64.7%)	8407 (72.1%)	32752 (66.5%)
Surg	13273 (35.3%)	3257 (27.9%)	16530 (33.5%)
Payor Type				0.25
Commercial	6981 (18.6%)	1357 (11.6%)	8338 (16.9%)
Medicaid	10801 (28.7%)	2900 (24.9%)	13701 (27.8%)
Medicare	19626 (52.2%)	7368 (63.2%)	26994 (54.8%)
Other	25 (0.1%)	6 (0.1%)	31 (0.1%)
Uninsured	185 (0.5%)	33 (0.3%)	218 (0.4%)
Year				0.02
2021	20833 (55.4%)	6349 (54.4%)	27182 (55.2%)
2022	16785 (44.6%)	5315 (45.6%)	22100 (44.8%)
Facility				0.09
Facility 1	4072 (10.8%)	1069 (9.2%)	5141 (10.4%)
Facility 2	5666 (15.1%)	1559 (13.4%)	7225 (14.7%)
Facility 3	15227 (40.5%)	4796 (41.1%)	20023 (40.6%)
Facility 4	6970 (18.5%)	2208 (18.9%)	9178 (18.6%)
Facility 5	5683 (15.1%)	2032 (17.4%)	7715 (15.7%)
Type of Hospital				0.08
Community Hospital	9738 (25.9%)	2628 (22.5%)	12366 (25.1%)
Quaternary Academic Hospital	15227 (40.5%)	4796 (41.1%)	20023 (40.6%)
Tertiary Acute Care	12653 (33.6%)	4240 (36.4%)	16893 (34.3%)

Discrimination and Calibration

The model overall had a c-index of 0.83 (95% CI: 0.82, 0.84), with a pre-recalibration intercept and slope of -1.25 (95% CI: -1.28, -1.22) and 1.55 (95% CI: 1.51, 1.59), respectively (Table 2). The c-indices for Black and White patients were 0.843 and 0.829, respectively. The sensitivity, specificity, PPV, and NPV for Black patients were 0.76, 0.75, 0.49, and 0.91, respectively. The sensitivity, specificity, PPV, and NPV for White patients were 0.75, 0.73, 0.47, and 0.90, respectively. Prior to recalibration, the calibration of the model was similar for both Black and White patients (Tables 3–4, Supplementary Fig. 1A-B). Logistic recalibration improved calibration across different statistics: rescaled Brier score, calibration intercept and slope, and Emax and Eavg. Full calibration statistics for Black and White patients are shown in Table 3. Calibration curves for Black and White patients are displayed in Fig. 1A-C. We also found calibration statistics to be similar in Asian patients (Supplementary Table 1).

Table 2

Overall Calibration Statistics for MUST-Plus Model
	No Recalibration (95% CI)	Recalibration In the Large (95% CI)	Logistic Recalibration (95% CI)
Rescaled Brier Score	0.01 (0.01, 0.03)	0.25 (0.24, 0.26)	0.27 (0.26, 0.28)
Calibration Intercept	-1.25 (-1.27, -1.22)	0.49 (0.46, 0.52)	0
Calibration Slope	1.55 (1.51, 1.59)	1.55 (1.51, 1.59)	1
Emax	0.27 (0.26, 0.28)	0.26 (0.22, 0.28)	0.01 (0.008, 0.03)
Eavg	0.19 (0.19, 0.20)	0.05 (0.04, 0.05)	0.005 (0.003, 0.007)

Table 3

Calibration Statistics for MUST-Plus Model by Race, Gender, and Year
	No Recalibration					Recalibration In the Large					Logistic Recalibration
	Rescaled Brier Score (95% CI)	Calibration Intercept (95% CI)	Calibration Slope (95% CI)	Emax (95% CI)	Eavg (95% CI)	Rescaled Brier Score (95% CI)	Calibration Intercept (95% CI)	Calibration Slope (95% CI)	Emax (95% CI)	Eavg (95% CI)	Rescaled Brier Score (95% CI)	Calibration Intercept (95% CI)	Calibration Slope (95% CI)	Emax (95% CI)	Eavg (95% CI)
Black	0.07 (0.06, 0.09)	-1.16 (-1.20, -1.11)	1.56 (1.50, 1.62)	0.25 (0.24, 0.26)	0.18 (0.17, 0.19)	0.27 (0.25, 0.29)	0.48 (0.43, 0.53)	1.56 (1.50, 1.63)	0.23 (0.18, 0.27)	0.05 (0.04, 0.06)	0.29 (0.28, 0.31)	0	1	0.007 (0.005, 0.03)	0.002 (0.002, 0.006)
White	0.01 (0, 0.03)	-1.24 (-1.28, -1.20)	1.58 (1.51, 1.64)	0.27 (0.26, 0.28)	0.19 (0.19, 0.20)	0.24 (0.23, 0.26)	0.50 (0.44, 0.55)	1.58 (1.51, 1.64)	0.28 (0.24, 0.33)	0.05 (0.05, 0.06)	0.27 (0.25, 0.29)	0	1	0.04 (0.01, 0.06)	0.009 (0.005, 0.01)
Male	0.14 (0.13, 0.14)	-0.99 (-1.03, -0.96)	1.60 (1.55, 1.65)	0.23 (0.22, 0.24)	0.16 (0.15, 0.16)	0.28 (0.27, 0.29)	0.46 (0.42, 0.50)	1.60 (1.55, 1.65)	0.24 (0.20, 0.27)	0.06 (0.05, 0.06)	0.31 (0.29, 0.32)	0	1	0.008 (0.005, 0.02)	0.003 (0.002, 0.005)
Female	-0.13 (-0.15, -0.12)	-1.53 (-1.57, -1.48)	1.56 (1.51, 1.62)	0.32 (0.31, 0.33)	0.23 (0.22, 0.23)	0.22 (0.21, 0.24)	0.57 (0.51, 0.62)	1.56 (1.51, 1.62)	0.28 (0.22, 0.32)	0.05 (0.04, 0.05)	0.25 (0.24, 0.27)	0	1	0.02 (0.02, 0.04)	0.01 (0.007, 0.01)
2021	0.02 (0.02, 0.03)	-1.29 (-1.33, -1.26)	1.62 (1.57, 1.67)	0.28 (0.27, 0.28)	0.19 (0.19, 0.20)	0.26 (0.25, 0.27)	0.55 (0.51, 0.59)	1.62 (1.58, 1.67)	0.27 (0.23, 0.30)	0.06 (0.05, 0.06)	0.29 (0.28, 0.31)	0	1	0.01 (0.008, 0.03)	0.005 (0.003, 0.008)
2022	0 (-0.01, 0.01)	-1.21 (-1.25, -1.17)	1.47 (1.42, 1.52)	0.26 (0.25, 0.27)	0.19 (0.19, 0.20)	0.23 (0.22, 0.24)	0.41 (0.37, 0.46)	1.47 (1.42, 1.52)	0.24 (0.19, 0.28)	0.04 (0.04, 0.05)	0.25 (0.23, 0.27)	0	1	0.02 (0.006, 0.04)	0.004 (0.002, 0.008)

Table 4

Empirical Bootstrap Differences in Calibration Intercept and Slope before Recalibration
Comparison	Calibration Intercept Mean Difference (P-Value)	Calibration Slope Difference (P-value)
White - Black	-0.08 (0.006)	0.02 (0.35)
Female - Male	-0.53 (0)*	-0.04 (0.14)
2022 − 2021	0.08 (0)*	-0.15 (0)*
*Denotes statistical significance at Bonferroni adjusted threshold of 0.003

The c-indices for male and female patients were 0.84 and 0.83, respectively. The sensitivity, specificity, PPV, and NPV for female patients were 0.78, 0.71, 0.41, and 0.92, respectively. The sensitivity, specificity, PPV, and NPV for male patients were 0.73, 0.77, 0.53, and 0.89, respectively. The calibration for females was statistically different than that for males, with more negative calibration intercepts (P-value = 0), higher Emax, and higher Eavg (Tables 3–4; Supplementary Fig. 1C). Calibration curves for male and female patients are displayed in Fig. 1D-F.

Calibration by year was also examined to see if there was performance drift over time since the model’s inception in 2018. The year 2020 was removed due to the COVID-19 pandemic. The c-indices in 2021 and 2022 were 0.84 and 0.82, respectively. The sensitivity, specificity, PPV, and NPV for the model in 2021 were 0.76, 0.75, 0.48, and 0.91, respectively. The sensitivity, specificity, PPV, and NPV for the model in 2022 were 0.74, 0.72, 0.46, and 0.90, respectively. The calibration of the model changed significantly from 2021 to 2022 (Tables 3–4, Supplementary Fig. 1E-F). The average calibration intercept for assessments made in 2022 was higher than that for assessments made in 2021, while the slope was lower, indicating that there was greater underestimation of malnutrition risk. We also assessed calibration by payor type and hospital type as sensitivity analyses. In the payor type analysis, we found that malnutrition risk was overestimated in patients with commercial insurance, but underestimated in patients with Medicaid and Medicare (Supplementary Tables 2–3, Supplementary Fig. 2A-B, Supplementary Fig. 3). We did not observe substantial differences in calibration across hospital type (community, tertiary, quaternary) (Supplementary Tables 4–5, Supplementary Fig. 2C-D, Supplementary Fig. 4).

In this study, we evaluated the calibration and discrimination of the MUST-Plus model, a real-time deployed malnutrition prediction model, using electronic health record (EHR) data from the Mount Sinai Health System. The results of our study demonstrated that the discriminative capacity of the MUST-Plus model to identify malnutrition did not differ between Black and White patients, male and female patients, or the years 2021 and 2022. However, the model demonstrated miscalibration with exaggerated risks at the tails of the distribution across each subgroup. The negative calibration intercepts prior to recalibration suggest that risk is underestimated when the predicted risk is closer to zero, while the calibration slopes greater than 1 suggest that risk is overestimated when the predicted risk is closer to one. In females, the miscalibration was higher compared to males. The model's calibration also differed between 2021 and 2022.

MUST-Plus is a model that contains BMI, length of stay, and several laboratory biomarkers such as hemoglobin, serum albumin, serum creatinine, blood urea nitrogen, and serum alanine-aminotransferase as predictors. It has since been deployed as an automated EHR-based screening tool, enabling daily assessments for all hospitalized patients. Higher-risk patients are then referred to registered dieticians for assessment and treatment as necessary. We hypothesized that the MUST-Plus model would remain stable in its predictive capability throughout its deployment. However, the calibration of the MUST-Plus model shifted between 2021 and 2022, which could be due to a possible target population shift from the time of model development to the time of model assessment or to variability in the selected predictors from the original model development³. This underscores the importance of continual monitoring of predictive model performance so that we do not induce patient harm.

We found that the model was not differentially miscalibrated between White and Black patients. This surprising finding may be suggestive of shifting health care quality practices to reduce the health disparities in malnutrition management for hospitalized Black patients. Malnutrition is known to be associated with length of stay, with hypothetical mechanisms posited to be nutritional neglect.²² One study found that there were no meaningful differences in patient experience between Black and White hospitalized patients.²³ The calibration for female patients appeared to be worse compared to male patients. This aligns with other single-center studies that find that females are associated with higher risks of malnutrition compared to males, although the mechanisms behind this observation are still poorly understood.^24–26 This discrepancy highlights the need for deeper investigation into the source of gender miscalibration or consideration of retraining gender-specific malnutrition risk prediction models. Furthermore, in sensitivity analyses, we discovered that calibration intercepts were significantly different between patients on commercial insurance, Medicaid, or Medicare. However, this is difficult to interpret without understanding how these different health insurance categories affect care delivery at Mount Sinai. This surprising finding warrants a separate study to investigate these discrepancies further.

Our results indicate that logistic recalibration improved the model's calibration, as evidenced by the improvements in various calibration statistics and the graphical representation of calibration curves. Mishra et al. (2022) note that in settings where the miscalibration pattern at the risk threshold is similar to the pattern for the bulk of the data, standard logistic recalibration may adequately improve calibration at the risk threshold.²⁷ While retraining models is also possible and has been shown to provide improvements in performance,²⁸ recalibration may sometimes be preferred to maintain predictive context with the original model.²⁷ Other work has compared standard logistic recalibration to refitting methods.^18,29,30

It is essential to acknowledge some limitations of our study. First, our analysis was based on data from a single health system, which may limit the generalizability of the findings to other healthcare settings. Second, we did not benchmark retraining of the MUST-Plus model or other recalibration methods, as that was out of scope for the present study.

In conclusion, our study evaluated the calibration and discrimination of the MUST-Plus model, a real-time malnutrition prediction tool based on electronic health record data from the Mount Sinai Health System. We found the model to be differentially miscalibrated between genders; however, logistic recalibration improved the model's calibration.

ACKNOWLEDGMENTS

We would like to thank the Mount Sinai Clinical Data Science Team for their help with retrieving the data for this study. This study received no funding.

AUTHOR CONTRIBUTIONS

LL performed the analyses and drafted the manuscript. ES, YO, and NE contributed to the statistical analyses and provided edits to the manuscript. PP, PT, and AK collected the data and contributed analytical support. RF, ISH, GNN provided edits to the manuscript. ES and MAL conceptualized the study, provided supervision to the project, and provided edits to the manuscript.

COMPETING INTERESTS

The authors declare no competing interests.

DATA AVAILABILITY

Anonymized source data used to generate the results are available upon reasonable request.

Nevin L. Advancing the beneficial use of machine learning in health care and medicine: Toward a community understanding. PLoS Med. 2018 Nov 30;15(11):e1002708.
Parikh RB, Kakad M, Bates DW. Integrating Predictive Analytics Into High-Value Care: The Dawn of Precision Delivery. JAMA. 2016 Feb 16;315(7):651–2.
Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019 Dec 16;17(1):230.
Wessler BS, Paulus J, Lundquist CM, Ajlan M, Natto Z, Janes WA, et al. Tufts PACE Clinical Predictive Model Registry: update 1990 through 2015. Diagn Progn Res. 2017 Dec 21;1(1):20.
Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014 Mar 19;14(1):40.
Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016 Jun;74:167–76.
Steyerberg EW, Uno H, Ioannidis JPA, van Calster B, Collaborators. Poor performance of clinical prediction models: the harm of commonly applied methods. J Clin Epidemiol. 2018 Jun;98:133–43.
Davis SE, Lasko TA, Chen G, Siew ED, Matheny ME. Calibration drift in regression and machine learning models for acute kidney injury. J Am Med Inform Assoc JAMIA. 2017 Nov 1;24(6):1052–61.
Davis SE, Greevy RA, Lasko TA, Walsh CG, Matheny ME. Detection of calibration drift in clinical prediction models to inform model updating. J Biomed Inform. 2020 Dec;112:103611.
Minne L, Eslami S, de Keizer N, de Jonge E, de Rooij SE, Abu-Hanna A. Effect of changes over time in the performance of a customized SAPS-II model on the quality of care assessment. Intensive Care Med. 2012 Jan;38(1):40–6.
Schneider CR, Freeman ALJ, Spiegelhalter D, Linden S van der. The effects of quality of evidence communication on perception of public health information about COVID-19: Two randomised controlled trials. PLOS ONE. 2021 Nov 17;16(11):e0259048.
Stratton RJ, Ek AC, Engfer M, Moore Z, Rigby P, Wolfe R, et al. Enteral nutritional support in prevention and treatment of pressure ulcers: a systematic review and meta-analysis. Ageing Res Rev. 2005 Aug;4(3):422–50.
Rosen BS, Maddox PJ, Ray N. A position paper on how cost and quality reforms are changing healthcare in America: focus on nutrition. JPEN J Parenter Enteral Nutr. 2013 Nov;37(6):796–801.
Timsina P, Joshi HN, Cheng FY, Kersch I, Wilson S, Colgan C, et al. MUST-Plus: A Machine Learning Classifier That Improves Malnutrition Screening in Acute Care Facilities. J Am Coll Nutr. 2021 Jan;40(1):3–12.
Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis [Internet]. New York, NY: Springer; 2001 [cited 2023 May 14]. (Springer Series in Statistics). Available from: http://link.springer.com/10.1007/978-1-4757-3462-1
Canty AJ. Resampling methods in R: the boot package. Newsl R Proj Vol. 2002;2(3):2–7.
Austin PC, Steyerberg EW. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med. 2019;38(21):4051–65.
Vergouwe Y, Nieboer D, Oostenbrink R, Debray TPA, Murray GD, Kattan MW, et al. A closed testing procedure to select an appropriate method for updating prediction models. Stat Med. 2017;36(28):4529–39.
R Core Team. R: A language and environment for statistical computing. 2018; Available from: https://www.R-project.org/
Team Rs. RStudio: integrated development for R. RStudio, PBC, Boston, MA. 2020. 2021.
Harrell Jr FE, Harrell Jr MFE, Hmisc D. Package ‘rms.’ Vanderbilt Univ. 2017;229:Q8.
Barrett M, Bailey M, Owens P. Non-maternal and non-neonatal inpatient stays in the United States involving malnutrition, 2016. ONLINE August. 2018;30:2018.
Figueroa JF, Zheng J, Orav EJ, Jha AK. Across US Hospitals, Black Patients Report Comparable Or Better Experiences Than White Patients. Health Aff (Millwood). 2016 Aug;35(8):1391–8.
Castel H, Shahar D, Harman-Boehm I. Gender differences in factors associated with nutritional status of older medical patients. J Am Coll Nutr. 2006 Apr;25(2):128–34.
Larburu N, Artola G, Kerexeta J, Caballero M, Ollo B, Lando CM. Key Factors and AI-Based Risk Prediction of Malnutrition in Hospitalized Older Women. Geriatrics. 2022 Oct;7(5):105.
Gur Arieh N, Adler H, Khanimov I, Giryes S, Ditch M, Felner Burg N, et al. Sex difference in the association between malnutrition and hypoglycemia in hospitalized patients. Minerva Endocrinol. 2021 Sep;46(3):303–8.
Mishra A, McClelland RL, Inoue LYT, Kerr KF. Recalibration Methods for Improved Clinical Utility of Risk Scores. Med Decis Making. 2022 May;42(4):500–12.
de Hond AAH, Kant IMJ, Fornasa M, Cinà G, Elbers PWG, Thoral PJ, et al. Predicting Readmission or Death After Discharge From the ICU: External Validation and Retraining of a Machine Learning Model. Crit Care Med. 2023 Feb 1;51(2):291–300.
Steyerberg EW, Harrell FE, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001 Aug;54(8):774–81.
Steyerberg EW, Borsboom GJJM, van Houwelingen HC, Eijkemans MJC, Habbema JDF. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004 Aug 30;23(16):2567–86.

MalnutritionCalibrationManuscriptSupplement.docx

Editorial decision: revise
07 Feb, 2024
Review #2 received at journal
19 Jan, 2024
Reviewer #2 agreed at journal
13 Jan, 2024
Review #1 received at journal
06 Nov, 2023
Reviewer #1 agreed at journal
26 Oct, 2023
Reviewers invited by journal
24 Oct, 2023
Editor assigned by journal
05 Oct, 2023
Submission checks completed at journal
05 Oct, 2023
First submitted to journal
04 Oct, 2023
