Data source and study subjects
A total of 2,756 adult patients (≥ 18 years old) who started CRRT due to acute kidney injury at Seoul National University Hospital between June 2010 and February 2020 were retrospectively reviewed. Patients who had underlying end-stage renal disease (n = 344), stopped CRRT within 1 hour of initiation (n = 49), or had no information on comorbidities or laboratory data (n = 14) were excluded. Accordingly, 2,349 patients were analyzed in the present study. The patients were randomly divided into a training set (70%) to develop the models and a testing set (30%) to test and calibrate their performance. The study was approved by the institutional review board of Seoul National University Hospital (no. H-2003-024-1106) and complied with the Declaration of Helsinki. The requirement for informed consent was waived by the board.
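As an illustration only (the study's analyses were performed in R, and the data below are synthetic), a random 70%/30% split of the final cohort of this kind can be sketched as follows; the array shapes mirror the cohort size and feature count described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2349                        # patients remaining after exclusions
X = rng.normal(size=(n, 92))    # 92 features at CRRT start (synthetic values)
y = rng.integers(0, 2, size=n)  # ICU mortality label (synthetic values)

# Random 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```

The paper does not state whether the split was stratified on the outcome, so no stratification is applied here.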
Study variables and outcomes
Using an electronic medical record system, a total of 92 features at the time of starting CRRT were collected to develop the machine learning models. Clinical features included age, sex, weight, use of mechanical ventilation, and comorbidities such as diabetes mellitus, hypertension, ischemic heart disease, chronic heart failure, stroke, peripheral vascular disease, dementia, chronic kidney disease including diabetic nephropathy, chronic obstructive pulmonary disease, connective tissue disease, peptic ulcer disease, cancer, and arrhythmia including atrial fibrillation, atrioventricular block, ventricular tachycardia, tachycardia-bradycardia syndrome, and complete left bundle branch block. Vital signs such as systolic blood pressure (SBP), diastolic blood pressure (DBP), mean arterial pressure (MAP), heart rate, respiratory rate, and body temperature were measured. The blood pressure values were collected continuously, at intervals of 1 hour or less, after starting CRRT. The laboratory data included white blood cell count, hemoglobin, hematocrit, platelet count, total bilirubin, blood urea nitrogen, creatinine, total protein, albumin, pH, sodium, potassium, calcium, phosphate, uric acid, prothrombin time-international normalized ratio, activated partial thromboplastin time, partial pressures of arterial carbon dioxide and oxygen, ratio of partial pressure of arterial oxygen to fractional inspired oxygen, alveolar-arterial oxygen gradient, and the presence of bacteremia. As CRRT setting values, the target dose, blood flow rate, amounts of dialysate and replacement fluids (pre- and post-dilution), target amounts of input and output, number of bicarbonate ampules mixed in the dialysate and replacement fluids, and catheter type were collected. Information on the infused medications or fluids and their infusion rates was obtained, as shown in Table S1. The number of bicarbonate ampules mixed in these fluids was calculated. The Glasgow Coma Scale score was also calculated.
The SOFA, APACHE II, and MOSAIC scores were calculated based on the methods presented in the original studies [14–16]. Hypotension was defined as a reduction in MAP of ≥ 20 mmHg from the initial value within 6 hours. Alternative definitions were also applied, such as a reduction in MAP of ≥ 30 mmHg from the initial value or a timeframe of within 1 hour. Intensive care unit (ICU) mortality, defined as all-cause death during the ICU admission, was estimated.
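The hypotension definition above (and its alternative thresholds and timeframes) can be expressed as a small labeling function. This is an illustrative sketch, not the authors' code; the function name and the `(hours, MAP)` input format are hypothetical:

```python
def is_hypotensive(map_series, window_hr=6.0, drop_mmhg=20.0):
    """Hypotension per the study definition: a reduction in MAP of
    >= `drop_mmhg` from the initial value within `window_hr` hours
    after CRRT start. `map_series` is a list of (hours_since_start, MAP)
    tuples, with the initial value at time 0. (Hypothetical helper.)"""
    baseline = map_series[0][1]
    return any(
        baseline - value >= drop_mmhg
        for t, value in map_series
        if 0 < t <= window_hr
    )

# The alternative definitions correspond to drop_mmhg=30.0 or window_hr=1.0.
```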
Statistical analysis and development of machine learning models
Development of the machine learning models and statistical analyses were performed using R software (version 4.0.2; The Comprehensive R Archive Network: http://cran.r-project.org). Categorical and continuous features are expressed as proportions and means ± standard deviations, respectively. The chi-square test was used to compare categorical features (or Fisher's exact test, if not applicable), and Student's t test was used to compare continuous features between the training and testing sets. A restricted cubic spline was used to display the odds ratio of ICU mortality according to the change in MAP values during CRRT.
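The baseline comparisons between the training and testing sets can be sketched as below. This is an illustration in Python with synthetic values (the study used R); the example features and counts are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Continuous feature (e.g., age): Student's t test between the two sets
age_train = rng.normal(62, 14, size=1644)
age_test = rng.normal(62, 14, size=705)
t_stat, p_cont = stats.ttest_ind(age_train, age_test)

# Categorical feature (e.g., a comorbidity): chi-square test on a 2x2 table
table = np.array([[600, 1044],   # training set: with / without (synthetic counts)
                  [250, 455]])   # testing set: with / without (synthetic counts)
chi2, p_cat, dof, expected = stats.chi2_contingency(table)
if (expected < 5).any():         # chi-square not applicable: Fisher's exact test
    _, p_cat = stats.fisher_exact(table)
```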
Three machine learning algorithms were used: the support vector machine (SVM), deep neural network (DNN), and light gradient boosting machine (LGBM). The SVM models used four kernels: linear, polynomial, sigmoid, and radial basis function. For each kernel, 10-fold cross-validation with a grid search over hyperparameters (cost, gamma, degree, and coefficient) was performed to identify the best values. The kernel yielding the highest area under the receiver operating characteristic curve (AUROC) was selected for the final model. In the DNN model, the optimal hyperparameters, consisting of the size (number of hidden nodes) and decay (parameter for weight decay), were determined with 10-fold cross-validation and grid search. When developing the SVM and DNN models, the continuous features were normalized, and the categorical features were one-hot encoded. In the LGBM model, the hyperparameters (max_bin, learning rate, boosting method, and nrounds) were adjusted, and the model with the highest AUROC was selected for comparison.
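The SVM kernel and hyperparameter selection described above (grid search with 10-fold cross-validation, scored by AUROC, with normalized inputs) can be sketched as follows. This is a minimal Python illustration on synthetic data, not the authors' R code, and the reduced grid is an assumption for brevity:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                               # synthetic features
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Normalize continuous features, then search kernels and hyperparameters
# with 10-fold cross-validation, selecting by AUROC. "C" is the cost
# parameter; degree applies to the polynomial kernel (coef0, the
# coefficient term, is omitted here for brevity).
param_grid = {
    "svc__kernel": ["linear", "poly", "sigmoid", "rbf"],
    "svc__C": [0.1, 1],
    "svc__gamma": ["scale"],
    "svc__degree": [3],
}
pipe = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipe, param_grid, cv=10, scoring="roc_auc")
search.fit(X, y)
best_kernel = search.best_params_["svc__kernel"]
```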
For performance indices, the AUROC, accuracy, F1 score, Matthews correlation coefficient (MCC), and area under the precision-recall curve (AUPRC) were measured in the testing set. The AUROCs were compared between models using the DeLong test. The MCC is a more informative and reliable score for evaluating binary classification than accuracy or the F1 score [17]. MCC values of + 1, 0, and − 1 represent perfect prediction, average random prediction, and inverse prediction, respectively. The classification threshold was set at the point where the F1 score was highest. For calibration, Brier scores were calculated, with values closer to 0 indicating better calibration. We ranked the importance of features in developing the DNN and LGBM models. The performance of the machine learning models with varying numbers of features, in order of ranking, was also evaluated. P values less than 0.05 were considered significant.
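The performance and calibration indices above, including the choice of threshold at the maximum F1 score, can be sketched as follows (an illustration on synthetic predictions; the DeLong test is omitted, as it has no standard scipy/sklearn implementation):

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score, f1_score,
                             matthews_corrcoef, average_precision_score,
                             brier_score_loss, precision_recall_curve)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                       # synthetic labels
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.2, 500), 0, 1)

# Threshold-free indices and calibration
auroc = roc_auc_score(y_true, y_prob)
auprc = average_precision_score(y_true, y_prob)
brier = brier_score_loss(y_true, y_prob)     # closer to 0 = better calibrated

# Choose the probability threshold that maximizes the F1 score
prec, rec, thr = precision_recall_curve(y_true, y_prob)
f1_curve = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
threshold = thr[np.argmax(f1_curve[:-1])]    # last (prec, rec) point has no threshold

# Threshold-dependent indices at that operating point
y_pred = (y_prob >= threshold).astype(int)
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)      # +1 perfect, 0 random, -1 inverse
```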