Development and Optimization of Machine Learning Algorithms for Predicting In-hospital Patient Charges for Congestive Heart Failure Exacerbations, Chronic Obstructive Pulmonary Disease Exacerbations and Diabetic Ketoacidosis

Background Hospitalizations for exacerbations of congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD) and diabetic ketoacidosis (DKA) are costly in the United States. The purpose of this study was to predict in-hospital charges for each condition using machine learning (ML) models. Results We conducted a retrospective cohort study on national discharge records of hospitalized adult patients from January 1st, 2016, to December 31st, 2019. We used numerous ML techniques to predict in-hospital total cost. We found that linear regression (LM), gradient boosting (GBM) and extreme gradient boosting (XGB) models had good predictive performance and were statistically equivalent, with training R-square values ranging from 0.49–0.95 for CHF, 0.56–0.95 for COPD, and 0.32–0.99 for DKA. We identified key features driving costs, including patient age, length of stay, number of procedures, and elective/nonelective admission. Conclusions ML methods may be used to accurately predict costs and identify drivers of high cost for COPD exacerbations, CHF exacerbations and DKA. Overall, our findings may inform future studies that seek to decrease the underlying high patient costs for these conditions.


BACKGROUND
3][4] High-cost patients often have significant unmet critical healthcare needs despite the substantial healthcare costs they incur.5,6 Congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD) and diabetes mellitus are life-altering, high-cost, high-volume conditions that affect millions of people and result in many hospitalizations per year.7 According to Medical Expenditure Panel Survey data for 2017 to 2018 published by the American Heart Association (AHA), diabetes mellitus, heart disease, CHF and respiratory conditions, including COPD, were among the top 10 leading diagnoses for direct health expenditures.8][10] Similarly, COPD is a high-cost disease: as COPD progresses, patients often experience acute exacerbations, characterized by dyspnea, cough, sputum production and worsening lung function; COPD exacerbations cause frequent hospital admissions and readmissions, reportedly accounting for 90.3% of the total medical cost related to COPD and leading to US $32.1 billion in total medical cost.11,12 Finally, diabetic ketoacidosis (DKA) is one of the acute, life-threatening complications of diabetes mellitus, a disease affecting 37.3 million people as of 2019 according to the CDC.13 DKA is a common cause of hospitalization in patients with diabetes and is characterized by uncontrolled hyperglycemia, metabolic acidosis, and increased serum ketone concentrations.14,15

Prior Machine Learning Methods Studying Our Outcomes: CHF, COPD, DKA
Machine learning (ML) techniques have emerged as a mechanism for analyzing high-dimensional medical data to understand the factors underlying patient-, hospital- and health system-level outcomes.16 Specifically, for our three cohorts of patients, ML techniques have been utilized to identify at-risk patients, predict the risk of readmission and readmission rates, and predict the length of inpatient stay.11,12,[17][18][19][20][21] Work has been done to develop predictive models to identify major underlying drivers of high healthcare costs for patients in generalized cohorts as well as several other cohorts of patients, such as breast cancer patients and coronary artery bypass graft patients.[22][23][24][25][26] To date, however, robust machine learning algorithms for predicting in-hospital expenditures and the factors that influence them have not been evaluated in patients experiencing CHF exacerbations, COPD exacerbations or DKA.

METHODS
The purpose of our study was to build and evaluate ML models to predict in-hospital charges associated with hospitalizations for these conditions, which has not been done previously. Furthermore, based on the model output, we provide recommendations for the best-performing models of in-hospital expenditures in each cohort and identify factors that underlie high-cost in-hospital admissions for each of the three diseases.
An overview of the methodology employed is shown in Fig. 1. All data processing and statistical and machine learning analyses were conducted using R version 4.1.1 ("Kick Things", released August 10, 2021) and RStudio version 1.4.1717.

Dataset and Study Design
8][29] We used the HCUP-NIS Core, Severity, Hospital and Cost Charge datasets and queried the datasets for all hospitalizations between January 1, 2016, and December 31, 2019. Patients who were discharged from the hospital, aged < 18 years or who died were excluded.
We identified a total of 26,190 unique discharges across the three conditions, including 9,552 discharges for COPD, 14,688 for CHF and 1,950 discharges for DKA. This cohort was identified after excluding discharges with missing data for any of the predictor variables of interest (as described below). The primary outcome for this study was total hospital charges.

Predictor variables
We conducted a preliminary literature review to determine potential factors that may affect in-hospital charges and that could be used as predictors in our analysis. The initial predictors for analysis included 46 variables, including 29 unique ICD-10 diagnosis code groupings extracted from the HCUP-NIS dataset, which included demographic characteristics, hospital-related variables, health care utilization six months before index admission, and discharge-related variables. A brief description of each predictor variable is given in Supplemental Table 2. Further descriptions of the potential values of each variable can be found on the NIS Description of Data Elements page (https://www.hcup-us.ahrq.gov/db/nation/nis/nisdde.jsp).
The ICD-10 diagnosis codes were transformed into Agency for Healthcare Research and Quality (AHRQ) comorbidity categories using the icd R package.
If a patient had at least one ICD-10 code in one of the AHRQ comorbidity categories, then they were considered positive for that category. A list of AHRQ comorbidity categories is shown in Supplemental Table 3.
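The "positive if at least one code falls in the category" rule can be sketched as follows. This is an illustration in Python rather than the icd R package used in the study, and the prefix map is a small hypothetical stand-in, not the AHRQ comorbidity definitions.

```python
# Hypothetical prefix map for illustration only; these are NOT the AHRQ
# comorbidity definitions, which are considerably more detailed.
EXAMPLE_CATEGORY_PREFIXES = {
    "chf_example": ("I50",),
    "copd_example": ("J44",),
    "diabetes_example": ("E10", "E11"),
}


def comorbidity_flags(icd10_codes, category_prefixes=EXAMPLE_CATEGORY_PREFIXES):
    """Return {category: bool}; True if at least one code falls in the category."""
    # Normalize codes so that "I50.9" and "i509" are treated identically.
    normalized = [code.replace(".", "").upper() for code in icd10_codes]
    return {
        category: any(code.startswith(prefixes) for code in normalized)
        for category, prefixes in category_prefixes.items()
    }


# One discharge record with two diagnosis codes.
flags = comorbidity_flags(["I50.9", "E11.65"])
print(flags)
```

Each category then becomes one binary predictor column per discharge record.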

Univariate analysis of predictor variables
The relationships between each of the predictor variables and total charges were analyzed using two-sample t tests, with p < 0.05 indicating statistical significance. We also calculated the pairwise correlations between the predictor variables in the dataset using the Pearson method. To reduce the number of variables displayed without having to choose variables a priori, only variables with a Pearson correlation coefficient above 0.2 were visualized.
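As a minimal sketch of the correlation screen described above, the following Python computes pairwise Pearson coefficients and keeps pairs above a threshold. The column names and values are made up for illustration, and the filter uses the signed coefficient as the text states; filtering on the absolute value would be a small variation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.2):
    """Return {(name_a, name_b): r} for every pair with r above the threshold."""
    names = list(columns)
    pairs = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(columns[first], columns[second])
            if r > threshold:
                pairs[(first, second)] = r
    return pairs


# Toy data (illustrative values only, not study data).
toy_columns = {
    "length_of_stay": [1, 2, 3, 4, 5],
    "procedures": [0, 1, 1, 2, 3],
    "age": [70, 50, 65, 40, 60],
}
pairs = correlated_pairs(toy_columns)
print(pairs)
```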

Preprocessing of variables
Due to the asymmetric distribution of characteristics and predictor variables, cases with missing data for any of the dependent or independent variables were excluded from this analysis, a common, though controversial, approach for dealing with missing values.31 For the ML analysis, "one-hot encoding" was performed, in which each categorical variable was transformed into a numerical dummy variable.32 With one-hot encoding, a total of 79 predictor variables were used. Additionally, all continuous or numerical variables, including total charges, were standardized such that their mean was 0 and standard deviation was 1. This is a common preprocessing method used to decrease the likelihood of bias of the model due to very large or small numeric variables.33
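The two preprocessing steps above can be sketched as follows. This is an illustrative Python version, not the R code used in the study; the variable names and values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics


def one_hot(values):
    """Turn one categorical column into one 0/1 dummy column per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def standardize(values):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator); an assumption
    return [(v - mean) / sd for v in values]


# Hypothetical payer column and charges (illustrative values only).
payer = ["medicare", "private", "medicare", "medicaid"]
encoded = one_hot(payer)
charges = [10000.0, 20000.0, 30000.0]
z_scores = standardize(charges)

print(encoded)
print(z_scores)
```

Because total charges are standardized as well, error metrics such as the RMSE are expressed in standard deviation units rather than dollars.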

ML models
We used seven ML algorithms, namely, linear regression (LM), LASSO regression (LASSO), ridge regression (RIDGE), support vector machine (SVM), random forest (RF), gradient boosting (GBM) and extreme gradient boosting (XGB). These algorithms have previously been used to build healthcare classification and prediction models. Models were trained and tested using the caret package in R. The derivation sample was split into a training set (75%) used for model development and a testing set (25%) used for validation, and fivefold cross-validation was conducted within the training set. The ML models were developed on the training set and then validated on the testing set.
Fivefold cross-validation was used to obtain reliable estimates of model performance. Specifically, the original training set was split into five folds through stratified random sampling. For the ith iteration, fold i was treated as the validation set, and the remaining four folds were used to train the model. The procedure was repeated five times, once per fold. This process allows the model performance to be estimated over all the training data.
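The fold rotation described above can be sketched in a few lines of Python. This is an illustration, not the caret implementation: it uses plain shuffling and omits the stratified sampling mentioned in the text, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Return a list of (training_indices, validation_indices), one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        # The remaining k - 1 folds form the training data for this iteration.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(20, k=5)
for training, validation in splits:
    print(len(training), len(validation))
```

Every row appears in exactly one validation fold, so averaging the per-fold metrics uses all of the training data.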

Model Evaluation and Comparison
Models were evaluated based on the root-mean-square error (RMSE) and R-squared values of the models, which are common metrics used to measure the accuracy of prediction models.34,35 The RMSE measures the quality of predictions by determining how far predictions fall from measured true values using the Euclidean distance. It is a standard metric for measuring the error of a model, with smaller values indicating higher accuracy. R-squared is a measure of the goodness of fit of a model and has a maximum value of 1. Models with R-squared values closer to 1 fit the data better. We compared models using paired-samples t tests to determine whether the differences between them were statistically significant at the 95% confidence level.
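For reference, the two metrics can be written out as below. This is a sketch in Python with made-up values; it assumes R-squared is computed as the coefficient of determination, 1 - SS_res / SS_tot, whereas some software reports the squared correlation between observed and predicted values instead.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (an assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only, not study data.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```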

Feature Selection
We performed feature selection in two ways. First, the relative importance of predictor variables (i.e., feature importance) was determined from the ML models and reported as variable importance (VI) scores. VI scores demonstrate how much the prediction changes as the feature values vary.36 Higher feature importance indicates greater importance of the feature to the model prediction. Documentation for the caret package indicates that for linear models, the relative importance is determined by the absolute value of the t-statistic. For gradient boosting models, the relative importance is determined from the absolute value of the coefficients corresponding to the tuned model. All importance values were scaled from 0 to 100. Based on this relative feature importance, we visualized the top twenty most influential features.
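One simple way to place importance values on a 0 to 100 scale is min-max rescaling, sketched below in Python. The rescaling rule is an assumption about how the scaling was done, and the raw scores are invented for illustration; they are not results from this study.

```python
def scale_importance(raw_scores):
    """Min-max rescale raw importance scores to the range 0 to 100."""
    low = min(raw_scores.values())
    high = max(raw_scores.values())
    span = high - low
    if span == 0:
        # All features equally important; avoid dividing by zero.
        return {name: 0.0 for name in raw_scores}
    return {name: 100.0 * (score - low) / span for name, score in raw_scores.items()}


def top_features(scaled_scores, n=20):
    """Return the n feature names with the highest scaled importance."""
    return sorted(scaled_scores, key=scaled_scores.get, reverse=True)[:n]


# Invented raw scores (e.g. absolute t-statistics), for illustration only.
raw = {"length_of_stay": 42.0, "n_procedures": 30.0, "age": 6.0, "elective": 2.0}
scaled = scale_importance(raw)
print(scaled)
print(top_features(scaled, n=2))
```

Under this rule the least important feature always scores 0 and the most important scores 100, so scores are comparable within a model but not across models.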
Second, using the caret package, we performed recursive feature elimination (RFE), which employs backward selection algorithms to determine the most important features for prediction in each condition using linear functions (Supplementary Fig. 2).37 First, the algorithm fits the model to all predictors, and each predictor is ranked using its importance to the model. Then, for each subset size i from 1 to 50, the i top-ranked predictors are retained, the model is refit, and its performance is assessed. The value of i with the best performance is determined, and the top i predictors are then used to fit the final models.
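The backward selection loop described above can be sketched generically as follows. This Python version is an illustration of the procedure, not caret's implementation; `rank_features` and `score_subset` are hypothetical callbacks standing in for model fitting and evaluation, and the toy importance values and error function are invented.

```python
def recursive_feature_elimination(features, rank_features, score_subset, sizes):
    """Rank all features once, then refit on the top-i features for each size i.

    rank_features(features) returns the names ordered best first;
    score_subset(subset) returns an error to minimize (e.g. cross-validated RMSE).
    """
    ranking = rank_features(features)  # fit on all predictors and rank them
    results = {}
    for size in sizes:
        subset = ranking[:size]  # retain the top-ranked predictors
        results[size] = score_subset(subset)  # refit and assess performance
    best_size = min(results, key=results.get)
    return ranking[:best_size], results


# Invented importance values and error function, for illustration only.
toy_importance = {
    "length_of_stay": 0.9,
    "n_procedures": 0.6,
    "age": 0.2,
    "noise_a": 0.01,
    "noise_b": 0.005,
}


def toy_rank(features):
    return sorted(features, key=toy_importance.get, reverse=True)


def toy_error(subset):
    # Error grows with the importance left out, plus a small cost per feature.
    missing = sum(v for name, v in toy_importance.items() if name not in subset)
    return missing + 0.05 * len(subset)


selected, results = recursive_feature_elimination(
    list(toy_importance), toy_rank, toy_error, sizes=range(1, 6)
)
print(selected)
```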

RESULTS

Sample characteristics
In total, 26,190 unique hospital discharge records with complete data were available for the analysis from January 1, 2016, to December 31, 2019: 14,688 patients hospitalized for CHF exacerbation, 9,552 patients hospitalized for COPD exacerbation and 1,950 patients hospitalized for DKA without coma. The characteristics of the sample cohorts are summarized in Table 1. The average costs for hospitalizations were US$18,196 (± $29,248) for CHF exacerbations, US$13,572 (± $17,598) for COPD exacerbations and US$13,650 (± $16,778) for DKA episodes. The mean length of stay and number of inpatient procedures were highest in the CHF cohort at 6.36 days and 1.90 procedures, respectively; the mean length of stay was 5.32 days in the COPD exacerbation cohort and 5.08 days in the DKA cohort, and the mean number of procedures was 1.32 for both COPD patients and DKA patients. As shown in Fig. 2, the mean charges for each condition steadily increased over the four-year period from 2016 to 2019.

Univariate analyses
Tables 2 and 3 show the univariable results for the categorical and continuous variables, respectively. A longer inpatient stay and greater number of procedures were associated with greater in-hospital total charges. Older patients also incurred higher total charges. For several features, such as sex, payment method, hospital bedsize, hospital control, hospital location, All Patients Refined Diagnosis Related Groups (APRDRG) severity score and APRDRG risk mortality score, the differences in total charges between groups of patients within each cohort were often statistically significant (for example, patients in large hospitals incurred greater charges than those in smaller hospitals in each disease cohort, p < 0.05). Notably, black patients incurred more charges than white patients did (p < 0.01). The Pearson correlation coefficients of the most correlated variables are visualized in Fig. 3. The data show that collinearity exists between several variables. For each of the three conditions, the number of procedures and APRDRG risk mortality were the most strongly positively correlated pair among the nondiagnosis variables (with correlation coefficients of 0.80 for CHF, 0.79 for COPD and 0.77 for DKA), while age and payment method were the most negatively correlated pair (with correlation coefficients of -0.50 for CHF, -0.50 for COPD and -0.44 for DKA).
Figure 4 shows boxplots of the accuracy metrics for the out-of-sample performance within each "sample" for the LM, GBM and XGB models. Pairwise t tests showed that the differences between each of the three models for each condition were not statistically significant at the 95% confidence level, and as such, within each disease condition, the LM, GBM and XGB models were equivalent. The RMSEs for the training model ranged from 0.21 to 0.60, and the R-squared values ranged from 0.49 to 0.95 for CHF; the RMSEs ranged from 0.20 to 0.51, and the R-squared values ranged from 0.56 to 0.95 for COPD; and the RMSEs ranged from 0.08 to 0.64, and the R-squared values ranged from 0.32 to 0.99 for DKA. The RMSEs for the test model ranged from 0.50 to 0.60, and the R-squared values ranged from 0.56 to 0.60 for CHF; the RMSEs ranged from 0.67 to 0.73, and the R-squared values ranged from 0.17 to 0.37 for COPD; and the RMSEs ranged from 0.51 to 0.60, and the R-squared values ranged from 0.41 to 0.67 for DKA.

Feature Selection
The top 20 features from the training LM, GBM and XGB models were determined for each condition (Supplemental Fig. 1). Length of stay was the most important predictor in each of the models, followed by the number of procedures during hospitalization. Age and elective/nonelective admission were also important predictors in at least one model for each disease condition, but with much smaller VI scores than length of stay and number of procedures. This finding aligns with our univariable analyses (Tables 2 and 3).

DISCUSSION
Although many studies have employed ML techniques to predict at-risk patients, readmission risks, readmission rates and length of stay for CHF, COPD and DKA patients, the development of a predictive model of in-hospital cost charges in these disease cohorts is a novel contribution of this study.
We constructed seven ML models for each disease and found that the LM, GBM and XGB models performed the best: they had good predictive performance and were found to be statistically equivalent. Thus, traditional linear regression was not inferior to the tree-based models. The training metrics showed RMSEs ranging from 0.21-0.60 and R-squared values ranging from 0.49-0.95 for CHF; RMSEs ranging from 0.20-0.51 and R-squared values ranging from 0.56-0.95 for COPD; and RMSEs ranging from 0.08-0.64 and R-squared values ranging from 0.32-0.99 for DKA. Performance on the test sets was generally weaker than on the training sets (for example, test R-squared values of 0.17-0.37 versus training values of 0.56-0.95 for COPD), indicating that the models performed worse on the validation datasets.
Unsurprisingly, length of stay was the most important predictor in each of the models, disproportionately affecting hospital charges in each model. This was followed by the number of procedures performed during hospitalization. Age and elective/nonelective admission were also important predictors in at least one model for each disease condition. Feature selection indicates that although these variables are extremely influential in any model, many other patient-level and hospital-level features also have small but measurable impacts on hospital charges.

Strengths of Our Study
The strengths of our study include the large sample size of the HCUP NIS datasets. Furthermore, the availability of many demographic characteristics, diagnosis-related variables, and hospital characteristics for use as predictors allowed for the building of supervised prediction models. The use of advanced ML techniques represents the robust use of data science to characterize complex clinical issues. The ability to predict expenditures at the patient level with good accuracy can allow for targeted care by anticipating the health care needs of patients. This may provide insights into designing effective and tailored interventions to meet the needs of high-cost patients and reduce costs.

Limitations of Our Study
Despite its strengths, we recognize that this work has several limitations. Missing data are a well-known limitation of utilizing EMR data for research, to which the HCUP-NIS is susceptible. Additionally, we chose to use only complete data without missing values for all predictor variables, thereby eliminating a substantial number of possible discharge events. Future work can involve employing data imputation methods rather than data exclusion. This could help to address the potential selection bias that can result from categorically excluding cases with missing data.
Additionally, the discharge data used may include discharges from readmissions of the same patient. The NIS data contain discharge-level records, which, per the HCUP-NIS documentation, means that "individual patients who are hospitalized multiple times in one year may be present in the NIS multiple times… this will be especially important to remember for certain conditions for which patients may be hospitalized multiple times in a single year."29,38 As discussed, our target patients often experience numerous hospitalizations, and initial versus recurrent hospitalizations might differ in their character. As such, we considered limiting the analysis to initial discharge; however, "…there is no uniform patient identifier available that allows a patient-level analysis with the NIS." Therefore, for the purposes of this study, we included all the discharge data and performed the analysis at the discharge level.

CONCLUSION
We demonstrated the use of ML models to predict in-hospital charges for patients hospitalized for CHF exacerbation, COPD exacerbation and DKA. We found that length of stay, number of procedures during hospitalization, age and elective/nonelective admission were important predictors in these models for these diseases. This research can provide helpful information for medical management, which may decrease health insurance burdens in the future.

Declarations
Ethics approval and consent to participate: We obtained Institutional Review Board approval for this study from the University of Pennsylvania (protocol #851472).
Availability of data and materials: The data that support the findings of this study are available from the Agency for Healthcare Research and Quality, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of the Agency for Healthcare Research and Quality.

Consent for publication: Not applicable
Competing interests: The authors declare that they have no conflicts of interest.
Funding: The research reported in this manuscript is supported, in part, by the Institutional Clinical and Translational Science Award (CTSA) with Dr. Boland as a coinvestigator (UL1-TR-001878) with Dr. Garret Fitzgerald as the PI. Generous funding was also provided by the University of Pennsylvania.
Authors' contributions: MA conceptualized, analyzed and interpreted the data and drafted the manuscript. LL provided the initial de-identified data from the AHRQ. MB edited the manuscript and provided guidance on the project. All authors read and approved the final manuscript.

4. Among the seven ML algorithms, the LM, GBM and XGB models had the best performances across the three conditions.

Figures

Figure 1

Figure 2 Mean

Table 1
Overall Patient Cohort Demographics and Characteristics for Each Disease: CHF Exacerbations, COPD Exacerbations and DKA Episodes

Table 4
Comparison of the metrics of the ML models