Data Sources
Our retrospective study used electronic medical data from 240,000 patients admitted and hospitalized at Asan Medical Center in Seoul, South Korea. The datasets were collected from January 1, 2018 through October 30, 2020 (Fig. 1). Datasets covering demographics, physical measurements, diagnoses, visits, discharges, medications, laboratory tests, hospitalizations, and operations were extracted from the ABLE data warehouse at the Asan Medical Center. The International Classification of Diseases, Tenth Revision (ICD-10) was used to identify each patient's health condition at admission. All patients were de-identified according to the hospital's privacy rules and randomly assigned a unique patient ID, which served as the linkage key across the datasets.
Cohort Preparation and Outcome Definition
The patient population was initially defined by filtering for patients who received general anesthesia during the operation at the index datetime, denoted 'indexOPDT'. Preoperative severity was indicated by whether the patient required general anesthesia. The inclusion criteria were as follows: 1) the existence of 'indexOPDT' and 'LOS outcome' values, 2) being alive before the operation date, 3) receipt of general anesthesia, and 4) a length of stay greater than three but less than thirty days. Conversely, patients whose death date preceded the operation date and patients who received anesthesia other than general anesthesia were excluded, as these patients did not contribute an LOS outcome value. The flowchart in Fig. 2 displays the cohort preparation described above. The prediction outcome of our retrospective study was continuous, measured in days: the length of stay was defined as the period from the date of the first operation until the date of discharge.
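For illustration, the cohort filtering might be expressed as the following pandas sketch; the column names (index_op_dt, death_dt, anesthesia, los_days) are hypothetical placeholders rather than the study's actual schema.

```python
import pandas as pd

def build_cohort(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the inclusion/exclusion criteria described in the text."""
    mask = (
        df["index_op_dt"].notna()              # indexOPDT must exist
        & df["los_days"].notna()               # LOS outcome must exist
        & (df["anesthesia"] == "general")      # general anesthesia only
        # alive before the operation date
        & (df["death_dt"].isna() | (df["death_dt"] >= df["index_op_dt"]))
        & df["los_days"].between(3, 30, inclusive="neither")  # 3 < LOS < 30
    )
    return df.loc[mask]
```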
Data Collection
Each patient's electronic medical record was characterized by variables collected from the following datasets: demographics (age, gender, cancer, death date, and admission date), physical measurements (BMI, BSA, weight, and height), diagnoses (ICD-10 diagnosis code), visits (ward type, department, path, and visit date), discharges (discharge date), medications (medication name), laboratory (laboratory test date, laboratory result, and laboratory test name), hospitalizations (department transfer and diagnosed department), and operations (operation datetime, age at operation, operation department, operation procedure, operation type, ICD-9 surgery code, and anesthesia code). Patient diagnoses were coded with ICD-10 codes recorded at hospital admission. A total of 422 variables (excluding the patient ID and the outcome variables) were included as preoperative predictors for each patient. The predictive factors were selected mainly with the support of cardiologists, who advised on the importance and suitability of each variable.
Model
This section discusses the overall process of model construction. The prediction model was built in eight steps (Fig. 3).
Data Preprocessing and Feature Engineering
Data preprocessing is essential for improving model performance. During this preparation step, the raw data are examined for categorical values, outliers, missing values, and redundancy [25, 26].
To handle categorical values, one-hot encoding, label encoding, and binary encoding were used. We created dummy variables for the nominal categorical variables, and label encoding was applied to features containing ordinal categories. For the binary encodings, gender was encoded as 0 for male and 1 for female, and cancer and death dates were encoded by occurrence: 1 for occurrence and 0 for none. Furthermore, operation occurrence was encoded as 1 if a prior operation existed before indexOPDT and 0 otherwise, and department transfer was encoded as 1 for yes and 0 for no. High-cardinality categorical features such as diagnosis and medication names were restricted to their hundred most frequent values. Height and weight values outside the ranges of 30–200 and 120–220, respectively, were treated as outliers and removed.
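The encoding scheme might be sketched as follows; the column names, and the pooling of infrequent codes into an 'OTHER' bucket, are illustrative assumptions (the text does not specify whether infrequent values were pooled or dropped).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative encodings; column names are hypothetical.
df["gender"] = df["gender"].map({"M": 0, "F": 1})       # binary encoding
df["cancer"] = df["cancer_dt"].notna().astype(int)      # 1 for occurrence
df["death"] = df["death_dt"].notna().astype(int)
df["ward_type"] = LabelEncoder().fit_transform(df["ward_type"])  # ordinal

# Keep only the 100 most frequent diagnosis codes, then one-hot encode.
top100 = df["diagnosis"].value_counts().nlargest(100).index
df.loc[~df["diagnosis"].isin(top100), "diagnosis"] = "OTHER"  # assumption
df = pd.get_dummies(df, columns=["diagnosis"])          # dummy variables
```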
To enhance the prediction performance of the models, missing data were handled individually for each dataset. First, features with more than eighty percent missing values were eliminated. Second, null values for BMI and weight were filled with the median, whereas the mean was used for height. The remaining missing data were imputed with 0 for the demographic, diagnosis, visit, discharge, and medication datasets, whereas −1 was used for the physical and laboratory datasets. The guiding consideration was whether the imputed value would distort what the models learn: a negative value was imputed for the physical dataset instead of 0 because imputing 0 would assign meaning to the physical value. Lastly, before the datasets were fed to the models, min-max normalization was applied to standardize the data on a scale from zero to one.
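A minimal sketch of these imputation rules follows; `demo`, `phys`, and `lab` are hypothetical per-dataset DataFrames (the min-max scaling step is sketched later, in the experimental setup).

```python
# Drop features with more than 80% missing values.
demo = demo.loc[:, demo.isna().mean() <= 0.8]

# Median for BMI and weight, mean for height.
phys["bmi"] = phys["bmi"].fillna(phys["bmi"].median())
phys["weight"] = phys["weight"].fillna(phys["weight"].median())
phys["height"] = phys["height"].fillna(phys["height"].mean())

# Dataset-specific constants for the remaining missing values.
demo = demo.fillna(0)    # demographic, diagnosis, visit, discharge, medication
phys = phys.fillna(-1)   # physical and laboratory
lab = lab.fillna(-1)
```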
Feature Selection
Feature selection is the process of identifying a subset of features from the input data; it removes irrelevant features, shortens training time, and reduces dimensionality, thereby supporting better prediction performance in machine learning [27]. We identified and selected the most relevant subset of features as the k highest-scoring features based on ANOVA F-values. The number of top features, denoted 'k', was searched over a range of 10 to 200, and k = 20 was ultimately selected for the model. The method is fast yet has a great impact on defining the predictors. The model was augmented with modified features resulting from the feature-importance findings.
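In scikit-learn terms, this corresponds to SelectKBest; the text reports ANOVA F-values, and because the outcome is continuous we assume the f_regression score function in this sketch.

```python
from sklearn.feature_selection import SelectKBest, f_regression

# Select the k = 20 highest-scoring features (k searched over 10-200).
selector = SelectKBest(score_func=f_regression, k=20)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)   # reuse the fitted selector
```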
Model Construction and Experimental Setup
We followed a regression approach to predict the continuous LOS outcome and employed four types of learning algorithms to develop the predictive models: (1) ridge regression [28], (2) extreme gradient boosting (XGBoost) [29], (3) random forest [30], and (4) multilayer perceptron (MLP) [31].
Ridge regression extends the linear regression algorithm with L2 regularization: the squared magnitudes of the coefficients, scaled by a lambda factor, are added as a penalty to the loss function. Consequently, the ridge estimator reduces variance and shrinks the coefficients toward zero.
XGBoost is a highly scalable, gradient-boosted tree ensemble system that runs in parallel at a much faster rate than existing supervised machine learning techniques. XGBoost is effective because it adds regularization terms (L1 and L2) to the loss function and handles missing data with a sparsity-aware approach. Additionally, its iterative training process, in which each newly created tree is combined with the previously built trees, progressively minimizes the loss.
Another tree-based model, random forest (RF) is an ensemble algorithm that combines bagging with feature randomness, training each tree on a bootstrapped subset of the observations. Hence, it handles high-dimensional data well and quickly trains on the features to maximize information gain [32, 33].
The MLP is a fully connected feedforward artificial neural network for supervised learning; as the name suggests, one or more hidden layers of activation functions sit between the input and output layers. It uses backpropagation [34] with nonlinear activation functions to optimize the weights and reduce prediction error. A previous study found that the MLP provided the best performance among several machine learning models [35].
For data sampling, the dataset was randomly split into training and test sets at an 80:20 ratio. The training set was used to learn the mapping from inputs to the outcome, whereas the test set was used to evaluate the predictive ability of the models. Following the split, the input features were normalized to the range of zero to one using statistics from the training dataset. All four models were built with the same predictor variables, and model development was performed using open-source Python packages.
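A minimal sketch of the split and scaling, assuming X and y hold the predictors and the LOS outcome; the random seed is our own choice for reproducibility, not reported in the study.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 80:20 random split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42   # seed is an assumption
)

# Fit the scaler on the training set only, then apply to both sets.
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```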
Parameter and Hyperparameter Tuning
To control and improve the models' performance, we jointly used randomized and grid searches to find the optimal combination of hyperparameters for each model. The search iterates over predefined sets of hyperparameters to obtain the best-tuned values. Three-fold cross-validation is then used to evaluate each set of hyperparameters: the model is fitted across the folds, which ultimately provides a score for each combination of values.
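In scikit-learn this corresponds to GridSearchCV (and RandomizedSearchCV) with cv=3; the parameter grid below is illustrative, not the study's exact search space.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative grid over two XGBoost hyperparameters.
param_grid = {"n_estimators": [50, 100, 150], "max_depth": [3, 5, 7]}
search = GridSearchCV(
    XGBRegressor(), param_grid, cv=3,      # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)
```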
In terms of the detailed parameter tuning, the ridge model used a penalty of 1 as the constant that controls the regularization strength.
The random forest model fitted 100 trees, considering one feature at a time when splitting, with a minimum of two samples required to split an internal node and a minimum of one sample required at a leaf node.
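Expressed with scikit-learn estimators, these settings might look as follows; unspecified arguments are left at their defaults.

```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

ridge = Ridge(alpha=1.0)       # regularization strength of 1
rf = RandomForestRegressor(
    n_estimators=100,          # 100 trees
    max_features=1,            # one feature considered per split
    min_samples_split=2,       # minimum samples to split an internal node
    min_samples_leaf=1,        # minimum samples at a leaf node
)
```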
To train the multilayer perceptron model, we first specified an input vector of 530 features. Following the input layer, three hidden layers and an output layer were created with ReLU activation, with 20, 10, 5, and 1 output neurons, respectively. Between the layers, a dropout layer with a rate of 25 percent was added after the second hidden layer. The MLP model was compiled with the Adam optimizer at a learning rate of 0.0001, a mean-squared-error loss, and the root-mean-squared error as the evaluation metric. The model was trained for 30 epochs with a batch size of 200, with 15% of the training data used for validation. Consequently, six layers with a total of 8,731 parameters were trained during the fitting.
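A minimal Keras sketch of this architecture, assuming the layer sizes and training settings stated above:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(530,)),             # 530 input features
    layers.Dense(20, activation="relu"),   # hidden layer 1
    layers.Dense(10, activation="relu"),   # hidden layer 2
    layers.Dropout(0.25),                  # 25% dropout after hidden layer 2
    layers.Dense(5, activation="relu"),    # hidden layer 3
    layers.Dense(1, activation="relu"),    # output layer, as described
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss="mse",
    metrics=[keras.metrics.RootMeanSquaredError()],
)
history = model.fit(
    X_train, y_train, epochs=30, batch_size=200, validation_split=0.15
)
```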
The XGBoost model was created with 150 trees in the ensemble, a maximum tree depth of five, and a learning rate of 0.1. In addition, one feature was randomly subsampled to train each new tree.
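The corresponding XGBRegressor settings might read as follows; the feature-subsampling statement above is ambiguous, so colsample_bytree here is our interpretation and explicitly an assumption.

```python
from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    n_estimators=150,       # 150 trees
    max_depth=5,            # maximum tree depth
    learning_rate=0.1,
    colsample_bytree=1.0,   # assumption: reading of the text's feature subsampling
)
xgb_model.fit(X_train, y_train)
```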
All four models were created with the best settings, which were chosen according to each model's performance. The parameters were tuned with the GridSearchCV class from the open-source scikit-learn model_selection module, while the MLP model was implemented with the TensorFlow Keras library.
Evaluation
In this study, the root-mean-squared error (RMSE) metric was used to evaluate model performance by measuring each regression model's error. The error signifies, on average, the margin between the predictions and the true target values; the units are therefore the original units of the predicted value, which in our study are days. The RMSE was measured on both the training and test sets to verify the closeness of the evaluation results. After comparing the RMSE results of the four models, we further analyzed and interpreted the XGBoost model, which demonstrated the highest prediction performance. For the MLP model, the RMSE was estimated through the loss function for the LOS outcome in days.
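Computed with scikit-learn, the metric is straightforward; `model` stands for any of the fitted estimators above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE in days, on both splits, to check train/test closeness.
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
```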
Model Interpretation
SHAP (SHapley Additive exPlanations) is a novel Shapley-value-based approach grounded in game theory that aids understanding of a tree-based model's structure for each prediction [36, 37]. The explanation measures interactions between local features, which combine to deliver insights into both individual predictors and the model as a whole. SHAP also enables the characterization of high- and low-risk predictors and identifies predictors that degrade the model's performance. We employed SHAP values to interpret the learning results of the XGBoost model and to further analyze the associations between features, identifying the critical predictors and their impact on the LOS outcome. The SHAP tools we used include dependence plots, summary plots, and force plots.
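A sketch of this analysis with the shap package, assuming `xgb_model` is the fitted XGBoost model and X_test is a DataFrame; the feature named in the dependence plot is illustrative.

```python
import shap

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)            # global feature importance
shap.dependence_plot("age", shap_values, X_test)  # "age" is illustrative
shap.force_plot(                                  # one patient's prediction
    explainer.expected_value, shap_values[0], X_test.iloc[0]
)
```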