Study Population and Data Source
A total of 3858 patients with hand injuries who underwent emergency surgery at the Department of Hand Surgery, Wuhan Union Hospital between January 2016 and January 2020 were analyzed. Patient data and attributes were extracted from electronic medical records (EMRs) by experienced physicians and trained nurses. The inclusion criteria were: (1) patients older than 18 years and younger than 90 years, (2) patients who underwent emergency surgery for hand and wrist injuries, (3) patients with complete clinical records, and (4) patients discharged after standard treatment procedures.
Variable definition and collection
A prolonged LOS was defined as a hospital stay of nine days or longer; hospitalizations shorter than nine days were considered short LOS. Factors that may be relevant to LOS were collected in this study, including age, gender, history of alcohol abuse, history of smoking, diabetes mellitus, hypertension, insurance status, intraoperative findings, time from injury to surgery (time to surgery), and preoperative blood test results. To avoid causing additional pain, patients with hand injuries generally do not undergo further examination before surgery once the need for surgery has been determined. Therefore, intraoperative anatomic findings were selected as predictors, including bone injury, muscle injury, nerve injury, tendon injury, and vascular injury. Insurance status was categorized into three groups: commercial insurance, national medical insurance, and self-paying. Finally, white blood cell count (WBC), red blood cell count (RBC), neutrophil count (NE#), and hemoglobin (HGB) were selected from the preoperative blood tests to assess the degree of blood loss and the inflammatory response of the body.
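As a minimal illustration of this dichotomization, assuming a data frame `dat` with a numeric `los` column (both names are hypothetical), the binary outcome could be constructed as follows:

```r
# Dichotomize LOS at the 9-day cut-off used in this study.
# "dat" and "los" are placeholder names, not the authors' actual objects.
dat$prolonged_los <- factor(ifelse(dat$los >= 9, "prolonged", "short"),
                            levels = c("short", "prolonged"))
```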
Feature selection can reduce the dimension of the feature space and the complexity of modeling16. In this study, we first excluded significantly collinear variables based on the correlation coefficients between pairs of variables. Next, we applied an embedded feature selection scheme that ranks features by the importance values derived from a Random Forest. The lower-ranked features were then removed one by one, with 10-fold cross-validation used to evaluate the performance of the selected feature subset in each round. In addition, we performed the least absolute shrinkage and selection operator (LASSO) method with 10-fold cross-validation to further examine the relationship between the selected features and the performance of a linear model.
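The following R sketch illustrates one possible realization of this pipeline. The objects `dat`, `x`, and `y` are hypothetical, caret's recursive feature elimination is used as a stand-in for the step-wise removal described above, and the collinearity cutoff is an assumption rather than a value reported in this study:

```r
library(randomForest)
library(caret)
library(glmnet)

# Hypothetical predictor matrix and outcome, derived from the placeholder "dat".
x <- dat[, setdiff(names(dat), c("prolonged_los", "los"))]
y <- dat$prolonged_los

# (0) Drop strongly collinear variables (the 0.9 cutoff is an assumption).
high_cor <- findCorrelation(cor(data.matrix(x)), cutoff = 0.9)
if (length(high_cor) > 0) x <- x[, -high_cor]

# (1) Rank features by Random-Forest importance and remove the lowest-ranked ones
#     step by step, scoring each candidate subset with 10-fold cross-validation.
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_fit  <- rfe(x, y, sizes = 1:ncol(x), rfeControl = rfe_ctrl)
predictors(rfe_fit)                 # selected feature subset

# (2) LASSO with 10-fold cross-validation as a complementary linear check.
lasso_fit <- cv.glmnet(data.matrix(x), y, family = "binomial",
                       alpha = 1, nfolds = 10)
coef(lasso_fit, s = "lambda.min")   # non-zero coefficients = retained features
```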
Logistic Regression Model
Logistic Regression (LR) is a standard statistical generalized linear model widely used in data mining, automatic disease diagnosis, economic forecasting, and other applications17. The algorithm is essentially a conventional two-category classifier: the object's category is determined from its sequence of attribute values. To classify the data, the model assumes that the outcome follows a Bernoulli distribution and estimates the parameters by maximizing the likelihood function. In our study, a multivariable LR model was built using the glm function of the R package stats. Odds ratios were calculated for each risk factor by exponentiating the LR coefficients; odds ratios less than 1.0 indicate a decreased risk, odds ratios greater than 1.0 indicate an increased risk, and a p-value < 0.05 was considered significant.
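A minimal sketch of such a model, assuming the hypothetical data frame `dat` from above and an illustrative subset of predictors rather than the exact variable list used in the study:

```r
# Multivariable logistic regression with stats::glm; predictor names are placeholders.
fit <- glm(prolonged_los ~ age + gender + time_to_surgery + nerve_injury + hgb,
           data = dat, family = binomial)

# Odds ratios with Wald-type 95% confidence intervals (exponentiated coefficients).
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
summary(fit)$coefficients   # includes the p-value for each predictor
```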
Machine Learning Models
Four classical machine learning models with five-fold cross-validation were developed to predict LOS, namely Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and XGBoost. All machine learning models were built using the randomForest, caret, and xgboost packages in the R programming language (version 3.3.1). As an extra precaution, the models were trained and tested by two team members (L.Z.Y. and K.W.) to ensure that the models were not inadvertently trained on withheld test data.
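An illustrative resampling setup is sketched below, assuming a hypothetical training split `train_dat` containing the selected features and the binary outcome. caret delegates to the underlying model packages (e.g., kernlab for the SVM, klaR for NB), and the configuration shown is an assumption, not the authors' exact code:

```r
library(caret)

# Five-fold cross-validation with class probabilities for ROC-based tuning.
cv_ctrl <- trainControl(method = "cv", number = 5,
                        classProbs = TRUE, summaryFunction = twoClassSummary)

rf_cv  <- train(prolonged_los ~ ., data = train_dat, method = "rf",
                metric = "ROC", trControl = cv_ctrl)
svm_cv <- train(prolonged_los ~ ., data = train_dat, method = "svmRadial",
                metric = "ROC", trControl = cv_ctrl)
nb_cv  <- train(prolonged_los ~ ., data = train_dat, method = "nb",
                metric = "ROC", trControl = cv_ctrl)
xgb_cv <- train(prolonged_los ~ ., data = train_dat, method = "xgbTree",
                metric = "ROC", trControl = cv_ctrl)
```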
The random forest method is a machine learning technique that combines many decision trees into a single classification model. The approach grows a forest of decision trees by selecting different splitting features and training samples for each tree. When predicting an unknown sample, every tree in the forest casts a decision, which substantially increases prediction accuracy compared with a single decision tree. After tallying the individual decisions, the class receiving the most votes is taken as the final classification result.
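A minimal stand-alone sketch with the randomForest package, using the same hypothetical `train_dat`/`test_dat` splits; the number of trees is an assumption:

```r
library(randomForest)

rf_fit <- randomForest(prolonged_los ~ ., data = train_dat,
                       ntree = 500, importance = TRUE)

# Each tree casts a vote; the majority class becomes the prediction.
rf_pred  <- predict(rf_fit, newdata = test_dat)                  # class with most votes
rf_votes <- predict(rf_fit, newdata = test_dat, type = "vote")   # per-class vote fractions
varImpPlot(rf_fit)   # variable importance of the kind used for the feature ranking above
```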
The NB classifier is a highly scalable supervised learning technique. Bayesian reasoning uses probability theory to draw conclusions about the probability distribution of the optimal decision. NB has been applied effectively in various scientific areas and consistently performs well even when only a few variables are considered.
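A corresponding sketch with the naiveBayes function from the e1071 package (this package is not listed among those above, so its use here is an assumption):

```r
library(e1071)

nb_fit  <- naiveBayes(prolonged_los ~ ., data = train_dat)
nb_prob <- predict(nb_fit, newdata = test_dat, type = "raw")    # class probabilities
nb_pred <- predict(nb_fit, newdata = test_dat, type = "class")  # predicted class labels
```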
SVM is a machine learning approach based on statistical learning theory. A support vector machine aims to minimize the generalization error by constructing a hyperplane in a high-dimensional space and using a maximum margin to separate feature vectors belonging to distinct classes. When a support vector machine is used for linear classification, an (n-1)-dimensional hyperplane is used, where n is the dimension of the data.
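A corresponding SVM sketch with e1071; the radial kernel is an assumption, and probability estimates are enabled because they are needed for the later ROC analysis:

```r
library(e1071)

svm_fit  <- svm(prolonged_los ~ ., data = train_dat,
                kernel = "radial", probability = TRUE, scale = TRUE)
svm_pred <- predict(svm_fit, newdata = test_dat, probability = TRUE)
svm_prob <- attr(svm_pred, "probabilities")[, "prolonged"]   # P(prolonged LOS)
```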
XGBoost is one of the most extensively used machine learning classifiers in bioinformatics. It is based on a tree model and classifies using a boosting method. Regularization terms are added to the objective function to control the model's complexity and prevent overfitting. In addition, XGBoost supports parallel computation, which significantly accelerates training.
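A minimal xgboost sketch; the learning rate, tree depth, number of rounds, and regularization settings are illustrative assumptions rather than the tuned values from this study:

```r
library(xgboost)

# xgboost expects a numeric matrix and a 0/1 label.
x_train <- data.matrix(train_dat[, setdiff(names(train_dat), "prolonged_los")])
y_train <- as.integer(train_dat$prolonged_los == "prolonged")

xgb_fit <- xgboost(data = x_train, label = y_train,
                   objective = "binary:logistic",
                   nrounds = 100, max_depth = 4, eta = 0.1,
                   lambda = 1, alpha = 0,   # L2 / L1 regularization terms
                   nthread = 4,             # parallel tree construction
                   verbose = 0)

x_test   <- data.matrix(test_dat[, setdiff(names(test_dat), "prolonged_los")])
xgb_prob <- predict(xgb_fit, x_test)        # predicted probability of prolonged LOS
```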
Model evaluation and validation
We evaluated the predictive performance of the models in two ways: calibration plots, which represent the agreement between the predicted probabilities and the observed outcomes, and receiver operating characteristic (ROC) curves with the area under the curve (AUC), which evaluate the classification performance of each model.
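An illustrative evaluation sketch with the pROC package (assumed here for ROC/AUC) and a simple decile-based calibration plot, using the hypothetical test-set probabilities from the sketches above:

```r
library(pROC)

# ROC curve and AUC for one model's predicted probabilities.
roc_obj <- roc(response = test_dat$prolonged_los, predictor = xgb_prob,
               levels = c("short", "prolonged"), direction = "<")
auc(roc_obj)
plot(roc_obj)

# Simple calibration check: compare mean predicted probability with the observed
# event rate within deciles of predicted risk (one common way to build the plot).
dec  <- cut(xgb_prob, breaks = unique(quantile(xgb_prob, probs = seq(0, 1, 0.1))),
            include.lowest = TRUE)
obs  <- tapply(test_dat$prolonged_los == "prolonged", dec, mean)
pred <- tapply(xgb_prob, dec, mean)
plot(pred, obs, xlab = "Predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)   # reference line of perfect calibration
```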