We found that both a machine-learning algorithm including all 33 available variables and a parsimonious machine-learning algorithm encompassing only the 10 most important predictors improved prediction of patients at increased risk of a length of stay > 4 days or readmission due to medical complications, compared with traditional logistic regression models. Thus, despite similarities in the weighting of predictor variables, the full machine-learning model correctly identified approximately 5% more risk-patients than the full logistic regression model. This corresponded to an increase in AUROC of about 1.5 percentage points, which is about 3 times larger than that found in a study investigating potential benefits of machine learning for the NSQIP risk calculator [35].
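As an illustration of this kind of head-to-head comparison, a minimal sketch in Python, assuming a preoperative feature matrix X (the 33 variables) and a binary complication outcome y, and using a gradient-boosted classifier from scikit-learn as a stand-in for the machine-learning model (the study's actual algorithm and validation scheme are not reproduced here), could look as follows:

# Minimal sketch: compare AUROC of a logistic regression model and a
# machine-learning model on the same held-out patients.
# X (33 preoperative variables) and y (binary complication outcome) are assumed.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
ml_model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

auc_lr = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
auc_ml = roc_auc_score(y_test, ml_model.predict_proba(X_test)[:, 1])
print(f"Logistic regression AUROC: {auc_lr:.3f}")
print(f"Machine-learning AUROC:    {auc_ml:.3f} (difference {auc_ml - auc_lr:+.3f})")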
In contrast, when patients with a length of stay > 4 days but without a well-defined complication were also included in the outcome, the parsimonious machine-learning model performed slightly worse than a traditional logistic regression model including all variables. Wei et al. used an artificial neural network to predict same-day discharge after TKA, based on the NSQIP database from 2018, and found that six of the ten most important variables were the same as in logistic regression, similar to our findings [36].
However, patients with a one-day length of stay were intentionally excluded due to variations in in-patient vs. out-patient registration [36]. A previous systematic review found that machine-learning algorithms may provide better prediction of postoperative outcomes in THA and TKA [37]. However, the authors concluded that such models performed best at predicting postoperative complications, pain and patient-reported outcomes and were less accurate at predicting readmissions and reoperations [37]. That machine-learning algorithms may improve prediction of complications after THA and TKA compared to traditional logistic regression was also found by Shah et al., who used an automated machine-learning framework to predict selected major complications after THA [13]. However, theirs was a retrospective study based on diagnostic and administrative coding, and the selected complications occurred in only 0.61% of patients, potentially limiting clinical relevance. In contrast, we aimed at identifying a cohort comprising 20% of patients, in which we found about 60% of all medical complications. This, we believe, is a population for which the Danish socialized healthcare system could allocate additional resources for intensified perioperative care, with both patient-related and economic benefits from potentially avoided complications and costs. In this context, the models using 25% and 35% positive prediction thresholds demonstrated that the gain in sensitivity, identifying 14–24 more patients with complications, came at the cost of 196–391 more patients being “wrongly” classified as risk-patients.
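To make this trade-off explicit, a hedged sketch (continuing from the sketch above, and assuming, for illustration only, that the thresholds correspond to flagging the 20%, 25% or 35% of patients with the highest predicted risk; the study's exact threshold definition is not reproduced here) could be:

# Sketch: flag a fixed fraction of patients with the highest risk-scores and
# report sensitivity and false positives. Uses ml_model, X_test and y_test
# from the sketch above.
import numpy as np

def flag_top_fraction(risk_scores, y_true, fraction):
    """Flag the `fraction` of patients with the highest predicted risk."""
    threshold = np.quantile(risk_scores, 1 - fraction)
    flagged = risk_scores >= threshold
    sensitivity = (flagged & (y_true == 1)).sum() / max((y_true == 1).sum(), 1)
    false_positives = int((flagged & (y_true == 0)).sum())
    return sensitivity, false_positives

risk_scores = ml_model.predict_proba(X_test)[:, 1]
y_arr = np.asarray(y_test)
for fraction in (0.20, 0.25, 0.35):
    sens, fp = flag_top_fraction(risk_scores, y_arr, fraction)
    print(f"Top {fraction:.0%} flagged: sensitivity {sens:.2f}, false positives {fp}")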
Age has traditionally been a major factor when predicting surgical outcomes and remained the single most important predictor in our study. However, although elderly patients had an increased risk of postoperative complications, likely related to decline of physical reserves [38], chronological age alone was inferior to both the machine-learning and logistic regression models incorporating comorbidity and functional status. Thus, using age by itself to identify the high-risk population resulted in missing 18% of the “true risk-patients” (87 compared with 106 in the full machine-learning model).
We used SHAP values to estimate the impact of the included variables. SHAP values show which variables contribute most to the risk-score, thus providing a better understanding of the otherwise “black-box” machine-learning model. This approach was also used by Bonde and colleagues, who used deep neural networks to predict postoperative complications across several different surgical procedures [10].
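Continuing the sketch above, the corresponding SHAP computation with the Python shap package (assuming a tree-based model, which TreeExplainer supports; the study's actual explainer configuration is not shown here) could look like this:

# Sketch: per-patient, per-variable SHAP contributions to the predicted
# risk-score, and a summary plot ranking variables by mean |SHAP| value.
import shap

explainer = shap.TreeExplainer(ml_model)      # ml_model fitted in the sketch above
shap_values = explainer.shap_values(X_test)   # one contribution per patient and variable

# Beeswarm overview of which variables drive the risk-score up or down
shap.summary_plot(shap_values, X_test)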
In our study, the SHAP analysis of unique Danish registry data on reimbursed prescriptions unsurprisingly found a considerable increase in risk-score with an increasing number of prescriptions, especially in elderly patients. However, this is a complex relationship where some patients benefit from their treatments, while others may suffer from undesirable side-effects. Nevertheless, the information from SHAP analyses in machine-learning studies may provide inspiration for new hypothesis-generating studies on risk factors, e.g. on the potential differences in risk-profile between preoperatively prescribed VKAs and DOACs found in our study. Also, the age-related differences in risk from SSRIs could guide further studies on “deprescribing”.
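As an illustration of how such an interaction can be inspected, a minimal sketch using the SHAP values from the previous sketch (the column names n_prescriptions and age are hypothetical placeholders for the study's actual variable names, and X_test is assumed to be a pandas DataFrame with named columns) could be:

# Sketch: SHAP dependence plot of prescription count, coloured by age, to
# visualise an age-dependent effect on the predicted risk-score.
shap.dependence_plot(
    "n_prescriptions",        # hypothetical column name for number of prescriptions
    shap_values,              # SHAP values from the sketch above
    X_test,                   # feature matrix with named columns
    interaction_index="age",  # hypothetical column name; colours points by age
)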
Another important requirement for machine-learning algorithms to be clinically useful is user-friendliness, without depending on excessive additional data collection by the attending clinicians [9]. In this context, it was disappointing that the parsimonious machine-learning algorithm with only the ten most important variables was slightly worse at predicting the secondary outcome than the full logistic regression model. This could be because a length of stay > 4 days without a described medical complication is more often related to social and logistical factors not contained within the ten most important patient-related preoperative variables, e.g., having a supportive network, availability of home care, etc. Thus, the combination of all available information may be of further importance when merely using length of stay as an outcome in prediction studies. However, it also highlights the need for as much detailed, and preferably non-binary, data as possible to fulfill the true potential of machine-learning algorithms.
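For completeness, one way to derive such a ten-variable parsimonious model, keeping the variables with the highest mean absolute SHAP value and refitting (an illustrative selection strategy that may differ from the study's actual approach; X_train and X_test are assumed to be pandas DataFrames from the earlier sketches), could be sketched as:

# Sketch: rank variables by mean |SHAP| value, keep the top ten and refit.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mean_abs_shap = np.abs(shap_values).mean(axis=0)
importance = pd.Series(mean_abs_shap, index=X_test.columns).sort_values(ascending=False)
top10 = importance.head(10).index.tolist()

parsimonious = GradientBoostingClassifier(random_state=42).fit(X_train[top10], y_train)
auc_top10 = roc_auc_score(y_test, parsimonious.predict_proba(X_test[top10])[:, 1])
print(f"Parsimonious (10-variable) AUROC: {auc_top10:.3f}")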
In contrast to several other machine-learning studies, our dataset included only one paraclinical variable, preoperative hemoglobin. Although the inclusion of other laboratory tests such as albumin, sodium and alkaline phosphatase has been found to be of importance in some machine-learning algorithms [10, 39], they are not standard in fast-track protocols and are not easy to interpret from a pathophysiological point of view. Also, most decisions on intensified postoperative care in elective surgery will likely need to be made preoperatively, as there is an increasing need to prioritize limited healthcare resources. Thus, although postoperative information such as duration of surgery, perioperative blood loss or postoperative hemoglobin has been included in other studies [39], we decided against the use of peri- and postoperative data. The same approach was used by Ramkumar et al., who used U.S. National Inpatient Sample data including 15 preoperative variables to predict length of stay, patient charges and disposition after both TKA and THA [17, 40]. However, these studies were not conducted in a socialized healthcare system, and their main focus was on the need for differentiated payment bundles, without specific information on the reason for increased length of stay or non-home discharge [40].
Our study has some other limitations. First, one of the strengths of machine learning compared with logistic regression is the analysis of multilevel continuous data, whereas we included only a limited number of, often binary, preoperative variables. This could have limited the full potential of our machine-learning models. As previously mentioned, we excluded intraoperative information, including type of anesthesia, surgical approach etc., all of which may influence postoperative outcomes. The observational design of this study means that we cannot exclude unmeasured confounding or confounding by indication. Also, although the DNDRP has a near-complete registration of dispensed medicine in Denmark, some types of drugs, especially benzodiazepines, are exempt from general reimbursement and thus not sufficiently captured [21]. Furthermore, it is doubtful whether the patients used all types of drugs at the time of surgery (e.g. heparin, which is rarely used long-term). Finally, classification of a complication as “medical” depended on review of the discharge records, which can also introduce bias. However, we believe our approach to be superior to depending only on diagnostic codes, which are often inaccurate [41] and provide limited detail on whether the complication may be attributed to a medical or surgical adverse event. The strengths of our study include the use of national registries with a high degree of completeness (> 99% of all somatic admissions in the case of the Danish National Patient Registry) [42], prospective recording of comorbidity, extensive information on prescription patterns 6 months prior to surgery, and similar established enhanced recovery protocols in all departments.
In summary, our results suggest that machine-learning algorithms may provide slight, but clinically relevant, improvements in the prediction of patients at high risk of medical complications after fast-track THA and TKA compared with logistic regression models. Future studies could benefit from using such algorithms to identify a manageable population of patients who may benefit the most from intensified perioperative care.