Data were extracted from the Perioperative Data Warehouse (PDW), a custom-built data warehouse containing all patients who have undergone surgery at the University of California Los Angeles (UCLA) Health since the implementation of our EHR (EPIC Systems, Madison, WI) in March 2013. The PDW has been previously described15, 16. Briefly, in the first stage, data are extracted from EPIC’s Clarity database into 29 tables organized to facilitate usage. In the second stage, these data are used to populate a series of measures and metrics such as procedure duration, readmissions, admission International Classification of Diseases (ICD) codes, and postoperative outcomes17-19. All data used for this study were obtained from this data warehouse, and institutional review board approval (18-002053) was obtained from the UCLA Office of the Human Research Protection Program, including exemption from written informed consent, for this retrospective review. All methods were performed in accordance with relevant guidelines and regulations.
Determination of outcome variables – mortality and AKI – from EHR data
Postoperative mortality was defined as death during the same hospitalization as the surgery, identified by either 1) a death date noted in the EHR between hospital admission and discharge, or 2) a postoperative discharge status of expired, the presence of a "death note" by a treating provider, and the lack of future admissions or encounters for the patient in the EHR. Postoperative AKI was defined based upon the Acute Kidney Injury Network (AKIN) creatinine criteria20. The baseline creatinine was taken to be the most recent serum creatinine prior to surgery, and the postoperative creatinine was the highest creatinine value within 48 hours of surgery. If either the preoperative or postoperative creatinine was missing, the outcome was set to NULL. AKI was encoded as a binary variable: a value of 0 denoted no AKI, while any AKIN stage 1 or above was denoted as 1.
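For illustration, the binary label can be derived from the two creatinine values as in the following sketch (the function name and pandas usage are ours; the thresholds are the standard AKIN stage 1 cutoffs of an absolute rise of at least 0.3 mg/dL or a relative rise to at least 1.5 times baseline):

```python
import pandas as pd

def akin_binary_label(baseline_cr, postop_cr):
    """Return 1 for AKIN stage >= 1, 0 for no AKI, None if a value is missing.

    AKIN stage 1 creatinine criteria: an absolute rise >= 0.3 mg/dL or a
    relative rise to >= 1.5x baseline within 48 hours of surgery.
    """
    if pd.isna(baseline_cr) or pd.isna(postop_cr):
        return None  # missing creatinine -> outcome left as NULL
    if (postop_cr - baseline_cr) >= 0.3 or postop_cr >= 1.5 * baseline_cr:
        return 1
    return 0
```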
Inclusion and Exclusion Criteria
The inclusion set comprised surgeries performed under general anesthesia between April 1, 2013 and July 2021 at the Ronald Reagan UCLA Medical Center and the Santa Monica Medical Center. Patients were filtered by admission class, retaining only inpatient, same-day hospitalization, emergency care, and overnight recovery patients (i.e., those who spent at least one night in the hospital). Patients younger than 18 years or older than 89 years were excluded from the dataset due to institutional restrictions on data security. Cases were also excluded if they had an ASA physical status score of 6 (indicating organ donors).
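A minimal sketch of this cohort filter over a hypothetical pandas case table is below (the column names, class labels, and the assumption that the window closes at the end of July 2021 are ours, not the PDW schema):

```python
import pandas as pd

# Patient classes that imply at least one night in the hospital
# (labels are illustrative, not the EHR's exact values).
KEPT_CLASSES = {"Inpatient", "Same Day Hospitalization",
                "Emergency Care", "Overnight Recovery"}

def build_cohort(cases: pd.DataFrame) -> pd.DataFrame:
    """Apply the inclusion and exclusion criteria described above."""
    mask = (
        (cases["anesthesia_type"] == "General")
        & cases["surgery_date"].between("2013-04-01", "2021-07-31")
        & cases["patient_class"].isin(KEPT_CLASSES)
        & cases["age_years"].between(18, 89)  # institutional age window
        & (cases["asa_status"] != 6)          # exclude organ donors
    )
    return cases.loc[mask]
```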
Model Input Features
In this manuscript, four sets of input features were defined according to their characteristics: 1) baseline features, including basic patient information and surgery specifications; 2) the most recent laboratory tests obtained before surgery; 3) the procedure description; and 4) medications taken.
Baseline Feature Vector
The features used in the baseline model were based on previous work by our group predicting postoperative mortality both before and after surgery3-5. For the purposes of this analysis, we removed features that would be redundant with the additional feature vectors (see below). For example, lab results were removed from the feature sets used in those models because we created a separate, more comprehensive feature vector composed only of laboratory results. The list of features included in this group is in Supplementary Table 1A.
Laboratory Result Feature Vector
A set of 39 commonly used laboratory results was extracted from the EHR (see Supplementary Table 2A for a complete list). These results were chosen because they are common preoperative tests (i.e., components of the complete blood count, comprehensive metabolic panel, or coagulation panel). All test results were then binned into three windows: 6 months, 3 months, and 1 month prior to surgery. For each laboratory test bin, the following descriptive statistics were calculated: total number of tests, cumulative sum (cu-sum), median, variance, mean, and standard score (z-score). The goal of the cu-sum was to incorporate a measure of temporal change into the descriptive statistics. The standard score (or z-score) is the number of standard deviations by which the mean of a patient's results for a given laboratory test lies above or below the mean of that test across all patients. Thus, for each surgery we had a vector of 702 features (39 laboratory results × 3 bins × 6 descriptive statistics per bin).
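The binning and per-bin statistics can be sketched as follows, assuming a long-format pandas table of preoperative results (column names are illustrative; here the cu-sum is taken simply as the sum of results in the window and the z-score as the mean of per-result z-scores against the population for that lab):

```python
import pandas as pd

# labs: long-format preoperative results with columns
# ['surgery_id', 'lab_name', 'value', 'days_before_surgery']
def lab_feature_vector(labs: pd.DataFrame) -> pd.DataFrame:
    # z-score each result against the population mean/std for that lab
    grp = labs.groupby("lab_name")["value"]
    labs = labs.assign(zscore=(labs["value"] - grp.transform("mean"))
                              / grp.transform("std"))
    frames = {}
    for months in (6, 3, 1):
        window = labs[labs["days_before_surgery"] <= months * 30]
        keys = ["surgery_id", "lab_name"]
        stats = window.groupby(keys)["value"].agg(
            n="count", cusum="sum", median="median", var="var", mean="mean")
        stats["zscore"] = window.groupby(keys)["zscore"].mean()
        frames[f"{months}m"] = stats.unstack("lab_name")
    # 39 labs x 3 bins x 6 statistics = 702 columns per surgery
    return pd.concat(frames, axis=1)
```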
Procedure Name Embedding
As part of our research, we experimented with including clinical text data, in the form of procedure names, as model inputs. Administrative codes, such as CPT codes, are only available after surgery (at patient discharge); we therefore focused on representing the procedure name as a numerical vector available before surgery. The procedure name, as booked by the surgeon, is a string with a variable number of words; the training dataset contained 22,003 unique words across all procedure names. To include the procedure name in the prediction model, we applied word embedding, a common method for representing words as real-valued vectors that encode meaning such that words closer in the vector space are expected to be similar in meaning21. We trained word embeddings on the clinical text data, allowing the model to capture clinical context, using the Word2Vec class from the Gensim library22. To train the model, each procedure name was broken into words (tokens). The Word2Vec model takes the list of tokens for each sentence as input and, using a two-layer neural network, returns a numeric vector for each word. The size of the numeric vector is a parameter of the model; in this paper, it was set to 100. The trained model was applied to each procedure name, returning one numeric vector per word. Because procedure names contain variable numbers of words, we averaged the vectors of a name's words to obtain a single fixed-length representation, which is the input to the main predictive model.
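A sketch of this step using Gensim (4.x API; the vector size of 100 matches the text, while the window and min_count settings and the training_procedure_names variable are our assumptions):

```python
import numpy as np
from gensim.models import Word2Vec

# Tokenized procedure names from the training set, e.g.
# [["lap", "chole"], ["total", "knee", "arthroplasty"], ...]
sentences = [name.lower().split() for name in training_procedure_names]

# Two-layer neural network; vector_size=100 as in the text.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

def embed_procedure(name: str) -> np.ndarray:
    """Average the word vectors of a procedure name into one 100-d vector."""
    tokens = [t for t in name.lower().split() if t in w2v.wv]
    if not tokens:
        return np.zeros(w2v.wv.vector_size)
    return np.mean([w2v.wv[t] for t in tokens], axis=0)
```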
Medications
The last type of feature we analyzed was the medications taken by a patient before surgery. A set of high-use and likely clinically predictive medications was assembled based on clinical judgment. Medications taken within 24 hours before surgery were excluded. Medications were divided into two categories: given as an inpatient and taken at home. Combination medications (e.g., HYDROCODONE-ACETAMINOPHEN 5-325 MG PO TABS) were separated into single medications, and each prescription dose was calculated in normalized units (e.g., hydrocodone 5 mg, acetaminophen 325 mg). For each patient we created a binary vector in which each element indicated whether the specific medication was taken by that patient. Fisher's exact test was then used to assess the association between each medication and the target variable, and medications with a p-value of less than 0.05 were included in the model. The final vector contained a set of 126 medications (see Supplementary Table 3A for a complete list).
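The medication screen can be sketched with scipy (a minimal sketch; the array layout and function name are ours):

```python
import numpy as np
from scipy.stats import fisher_exact

def select_medications(med_matrix, outcome, alpha=0.05):
    """med_matrix: binary (n_patients, n_meds); outcome: binary (n_patients,).

    Returns column indices of medications associated with the outcome at
    p < alpha by Fisher's exact test on the 2x2 contingency table.
    """
    selected = []
    for j in range(med_matrix.shape[1]):
        taken = med_matrix[:, j].astype(bool)
        # Rows: medication taken / not taken; columns: outcome yes / no.
        table = [[np.sum(taken & (outcome == 1)), np.sum(taken & (outcome == 0))],
                 [np.sum(~taken & (outcome == 1)), np.sum(~taken & (outcome == 0))]]
        _, p_value = fisher_exact(table)
        if p_value < alpha:
            selected.append(j)
    return selected
```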
Data preprocessing
Data were split into training and testing datasets at a 75:25 ratio. To avoid information leakage, all patients who appeared in the test set were removed from the training set, as sketched below.
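One way to realize this patient-level split using scikit-learn (the exact mechanism used is not specified above; X, y, and patient_ids are placeholders):

```python
from sklearn.model_selection import GroupShuffleSplit

# Grouping by patient keeps every surgery of a given patient on one side
# of the 75:25 split, preventing leakage across the two sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```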
Categorical variables in the clinical data, such as ethnicity and gender, were converted to a numerical representation by applying one-hot encoding, which encodes each category as a binary vector. To optimize memory usage, a memory reduction function was implemented that checked whether each column could safely be converted to a smaller data type. On the training dataset, this technique decreased the memory used by the dataset by almost 68%, from 740 MB to 237 MB, which significantly accelerated model training.
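A sketch of the encoding and dtype downcasting in pandas (column names are illustrative; the original function's exact checks are not specified):

```python
import pandas as pd

def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast each numeric column to the smallest dtype that keeps its values."""
    for col in df.select_dtypes(include="number").columns:
        kind = "integer" if pd.api.types.is_integer_dtype(df[col]) else "float"
        df[col] = pd.to_numeric(df[col], downcast=kind)
    return df

# One-hot encode categorical variables, then shrink the numeric dtypes.
df = reduce_memory(pd.get_dummies(df, columns=["ethnicity", "gender"]))
```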
Model creation, training, and testing
Nine models were trained. Model 1 is the baseline model that includes only the basic features in Supplementary Table 1A. Models 2, 3, and 4 were trained only on laboratory results, procedure name embeddings, and medication features, respectively; the main goal of training these models was to measure each feature set's individual contribution to the prediction. Models 5, 6, and 7 extended Models 2, 3, and 4 by adding the baseline features to the features of the respective models (i.e., baseline + laboratory, baseline + medications, etc.). Finally, Model 8 was obtained by combining the features from all the previous models, 1,088 features in total. To simplify Model 8, we created Model 9, which included only the features selected by the Boruta feature selection algorithm[i], as sketched below.
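A sketch of the Boruta step using the BorutaPy package cited below (the base estimator and its settings are our assumptions; BorutaPy keeps a feature only if it consistently outperforms randomly permuted "shadow" copies of the features):

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# Base estimator settings are illustrative, not those used in the paper.
forest = RandomForestClassifier(n_jobs=-1, max_depth=5, class_weight="balanced")
selector = BorutaPy(forest, n_estimators="auto", random_state=0)
selector.fit(X_train.values, y_train.values)  # BorutaPy expects numpy arrays

X_train_model9 = X_train.loc[:, selector.support_]  # features kept for Model 9
```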
All models were gradient boosted trees, as previous work by our group has shown this technique to consistently perform well3, 5. The main advantage of gradient boosted trees, as opposed to other tree-based models, is that the trees are built sequentially, with each new tree fit to reduce the residual error of the trees before it.
The gradient boosted tree classifiers were implemented using the XGBoost package. Model hyperparameters were selected using five-fold cross-validation on the training dataset, where patients undergoing multiple surgeries appeared in either the training folds or the validation fold, but never in both. In five-fold cross-validation, the dataset is divided into five partitions; four-fifths of the data are used to train the model and the remaining one-fifth is used as the testing set. This process is repeated so that each partition is used as a testing set exactly once and as part of the training set four times. Cross-validation provides a better assessment of model performance by averaging metrics across the multiple test folds. The models were trained using 2,000 estimators, a maximum tree depth of 7, and a minimum child weight of 7.
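A sketch of the classifier with the stated hyperparameters and a patient-grouped five-fold cross-validation (GroupKFold is one way to keep a patient's surgeries within a single fold; the scoring metric and remaining XGBoost settings are our assumptions):

```python
from sklearn.model_selection import GroupKFold, cross_val_score
from xgboost import XGBClassifier

# Hyperparameters from the text; remaining settings are XGBoost defaults.
model = XGBClassifier(n_estimators=2000, max_depth=7, min_child_weight=7)

# Five folds in which all surgeries of a patient land in the same fold.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X_train, y_train, cv=cv,
                         groups=patient_ids_train, scoring="average_precision")
print(scores.mean())
```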
Model performance
Prediction of both mortality and AKI was treated as a binary classification problem with highly imbalanced classes. This class imbalance has significant implications for the choice of model performance metrics.
Receiver operating characteristic (ROC) curves are widely used for the estimation of predictive model performance with a binary outcome. ROC curves characterize the trade-off between true positive and false positive rates for the binary classification model by varying the discriminative threshold. However, the false positive rate is affected by the underlying rate of the event and can be deceptive for data with a large skew in the class distribution, thus making ROC curves overly optimistic.
Thus, in addition to ROC curves we considered precision-recall (PR) curves, which summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) as the prediction threshold is varied. Precision penalizes a model for a large number of false positives relative to the number of true positives, which keeps PR curves informative even under imbalanced data. Recall, in contrast, penalizes a model for a large number of false negatives. Because precision and recall penalize opposite types of error, the PR curve is a better measure of model performance on imbalanced data.
Lastly, the F-beta score is a useful metric that calculates the weighted harmonic mean of precision and recall, F_β = (1 + β²) · precision · recall / (β² · precision + recall), reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of recall in the combined score[ii] and allows data scientists to pick a threshold that balances the implementation trade-offs between precision and recall: beta values greater than 1 give increased weight to recall, while values less than 1 give increased weight to precision.
To examine the performance of the different models, we computed F-beta scores with beta values of 1, 2, and 3. The higher beta values gave more weight to recall, penalizing false negatives more heavily. The F-beta score is a threshold metric; thus, scores were calculated for each model at the discriminative threshold that maximized the F-score, as sketched below. One limitation of these metrics is that they assume the class distribution observed in the training dataset will match the distribution in the test set and in the real data on which the model makes predictions.
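The threshold search can be sketched with scikit-learn (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_fbeta_threshold(y_true, y_score, beta=2.0):
    """Return the discriminative threshold that maximizes the F-beta score."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # F-beta at every candidate threshold; clip avoids division by zero.
    fbeta = ((1 + beta**2) * precision * recall
             / np.clip(beta**2 * precision + recall, 1e-12, None))
    best = np.argmax(fbeta[:-1])  # last PR point has no associated threshold
    return thresholds[best], fbeta[best]
```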
[i] https://github.com/scikit-learn-contrib/boruta_py
[ii] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html