Predicting postoperative surgical site infection with administrative data: a Machine Learning algorithm

Background: Since primary data collection can be time-consuming and expensive, surgical site infections (SSIs) could ideally be monitored using routinely collected administrative data. We derived and internally validated efficient algorithms to identify SSIs within 30 days after surgery with health administrative data, using Machine Learning algorithms. Methods: All patients enrolled in the National Surgical Quality Improvement Program from the Ottawa Hospital were linked to administrative datasets in Ontario, Canada. Machine Learning approaches, including a Random Forests algorithm and high-performance logistic regression, were used to derive parsimonious models to predict SSI status. Finally, a risk score methodology was used to transform the final models into a risk score system. The SSI risk models were validated in the validation datasets. Results: Of 14,351 patients, 795 (5.5%) had an SSI. First, separate predictive models were built for three distinct administrative datasets. The final model, including hospitalization diagnostic, physician diagnostic and procedure codes, demonstrated excellent discrimination (C statistic, 0.91; 95% CI, 0.90-0.92) and calibration (Hosmer-Lemeshow χ² statistic, 4.531; p = 0.402). Conclusion: We demonstrated that health administrative data can be effectively used to identify SSIs. Machine learning algorithms have shown a high degree of accuracy in predicting postoperative SSIs and can integrate and utilize a large amount of administrative data. External validation of this model is required before it can be routinely used to identify SSIs.


Background
Surgical site infection (SSI) is one of the most common types of postoperative complications (1). SSIs are associated with substantial morbidity and mortality, prolonged hospital stay, increased hospital readmission rates, and financial burden to health care systems (1-5).
Previous research has shown the importance of effective prevention strategies targeting both short- and long-term consequences of SSI, which requires an ability to track SSIs (2). Since primary data collection can be time-consuming and expensive, routinely collected health administrative data offer ample opportunities to identify and monitor SSIs and to assess the impact of prevention strategies, given their wide population coverage and minimal cost and effort. Several studies have developed accurate administrative algorithms to identify SSIs (6-10), while other studies have found that SSI identification using administrative data is imprecise (11). However, previous studies were often based on small sample sizes and/or a limited set of pre-selected variables to predict SSIs.
Machine learning approaches have been successfully applied to create predictive models in several fields of study, including automatic medical diagnostics (12,13). Logistic regression, with its interpretable model parameters and ease of use, can generate excellent models and is a commonly accepted statistical tool. The Random Forests approach is used in situations where regression assumptions may be violated, such as when many predictors are associated with a small number of outcomes (16). Unlike standard regression approaches, it can cope with inter-correlation between multiple explanatory variables, since predictors are selected randomly at each stage of the learning process (17). Previous studies have indicated that the Random Forests approach may have better prediction accuracy than other machine learning methods (14,15). We hypothesized that the use of machine learning approaches and a large data set with many features would improve the accuracy of SSI prediction. This study aimed to develop efficient algorithms to identify SSIs within 30 days after surgery using health administrative data.

Material And Methods
This study was divided into three stages. In the first stage, a Random Forests algorithm was used to perform a preliminary screening of variables and to rank the importance of candidate variables. In the second stage, the 30 most important variables from the first stage were input into high-performance logistic regression to build interpretable and parsimonious models for all three administrative datasets used in this study. Finally, we used risk score modeling methodology to transform the final logistic models from the second stage into a risk score system.

Selection and Description of Participants
This study was performed at The Ottawa Hospital (TOH), Canada, a 1200-bed academic health sciences center providing approximately 90% of the major surgical operations in a catchment area of 1.2 million people. We identified all patients at TOH aged 18 years and older who underwent surgery and were included in the American College of Surgeons National Surgical Quality Improvement Program (NSQIP) data collection between April 1, 2010, and March 31, 2015. The NSQIP uses trained Surgical Clinical Reviewers to collect data using a combination of chart review and follow-up from the preoperative period through 30 days postoperatively. Patients were excluded if: 1) they were not eligible for the Ontario Health Insurance Plan (OHIP) or had an invalid OHIP number, because this was required for linkage to health administrative datasets; or 2) they had missing admission, discharge, or surgery dates.

Population-based Health Administrative Datasets
We linked the NSQIP dataset to three distinct population-based health administrative datasets housed at the Institute for Clinical Evaluative Sciences (ICES). ICES is an independent, non-profit research institute whose legal status under Ontario's health information privacy law allows it to collect and analyze health care and demographic data, without informed consent, for health system evaluation and improvement. The use of data in this project was authorized under Section 45 of Ontario's Personal Health Information Protection Act, which does not require review by a Research Ethics Board. The datasets included: 1) the Discharge Abstract Database and Same Day Surgery Database to identify the records of the hospitalization (ICD-10 codes), including admission and discharge dates and diagnoses; 2) the Physician Services Database to retrieve all claims for services provided by all eligible health care providers; and 3) the Ontario Health Insurance Plan (OHIP) database that contains physician diagnostic codes (ICD-9 codes) and diagnosis descriptions. All patients were followed for 30 days from the time of their surgery. All databases were linked using anonymized unique identifiers and analyzed at ICES at the University of Ottawa, Ontario. This study was approved by the Ottawa Health Science Network Research Ethics Board.

Study outcome
All individuals who had any type of SSI (i.e., superficial, deep, or organ space) (Additional file 1) within 30 days after surgery, according to the definition of the NSQIP protocol, were defined as having experienced an SSI.

Statistical analysis
This study utilized a 3-stage predictive modeling approach based on the hybrid modeling approaches developed in previous studies (16-18). All stages described below were applied to each administrative dataset used in this study to generate three sub-models that contributed to the omnibus SSI model.

Stage 1 - Model development using Random Forests algorithm
Details of the Random Forests method have been described elsewhere (19-21). In short, each of the classification trees is built using a bootstrap sample of the data, and a random subset of variables is selected at each split, thereby constructing a large collection of decision trees with controlled variation (22,23) (Additional file 2). The Random Forests trees are not pruned, so as to obtain low-bias trees. Every tree in the forest casts a "vote" for the best classification for a given observation, and the class receiving the most votes is the prediction for that observation. The study cohort was first divided randomly into derivation (70%) and validation (30%) samples (Additional file 3). Then, the derivation data were sampled to create an in-bag partition (2/3) used to construct the decision tree and a smaller out-of-bag partition (1/3) used to test the constructed tree, evaluating its performance by computing: 1) misclassification error, 2) C statistic, and 3) model performance (sensitivity, specificity, etc.). The optimal number of trees and the subset of variables at each node were selected using the "tuneRF" function in R to minimize the misclassification error. Random Forests calculates estimates of variable importance for classification using the permutation variable importance measure (VIM) (19), which is based on the decrease in classification accuracy when the values of a variable in a node of a tree are permuted randomly. Finally, K-fold cross-validation with 10 folds was used to evaluate the Random Forests model.
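The in-bag and out-of-bag partitions described above follow from bootstrap sampling. The sketch below is an illustration only, not the study's R code: it draws one bootstrap sample from a hypothetical set of 10,000 rows and reports the share of distinct in-bag rows and out-of-bag rows. In expectation these shares are about 63% and 37% (1 - 1/e and 1/e), which is commonly rounded to 2/3 and 1/3. The function name and sample size are assumptions made for this example.

```python
import random


def bootstrap_partition(n_rows, rng):
    """Draw one bootstrap sample of row indices; return (in_bag, out_of_bag)."""
    # Sample n_rows indices with replacement, as is done for each tree.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows that were never drawn form the out-of-bag set for that tree.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)   # fixed seed so the illustration is repeatable
n_rows = 10_000          # hypothetical derivation-sample size
in_bag, oob = bootstrap_partition(n_rows, rng)
unique_in_bag = len(set(in_bag))

print(f"distinct in-bag rows: {unique_in_bag / n_rows:.3f}")
print(f"out-of-bag rows:      {len(oob) / n_rows:.3f}")
```

Because every tree sees a different bootstrap sample, each observation is out-of-bag for roughly a third of the trees, which is what allows error to be estimated without a separate test set.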
We identified subsets of the top 30 important diagnostic or procedure codes to predict SSIs, using a mean decrease in accuracy of 0.02 as the cut-off point. The Random Forests analyses were performed in R statistical software (version 3.3.2) using the "randomForest" package (21).
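The idea behind the permutation importance measure can be sketched in a few lines. The example below is a simplified illustration in Python, not the "randomForest" implementation: it uses a hypothetical stand-in classifier (`toy_predict`) and toy data, and it permutes a whole column on a single evaluation set rather than working per tree on out-of-bag data. A variable that carries signal shows a large mean decrease in accuracy when shuffled; a noise variable shows none.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    hits = sum(predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, col, rng, n_repeats=20):
    """Mean decrease in accuracy when column `col` is shuffled across rows."""
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in position `col`.
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Toy data: column 0 determines the outcome, column 1 is pure noise.
rng = random.Random(42)
rows = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(2000)]
labels = [r[0] for r in rows]


def toy_predict(row):
    # Hypothetical stand-in for a fitted forest: it only uses column 0.
    return row[0]


imp_signal = permutation_importance(toy_predict, rows, labels, 0, rng)
imp_noise = permutation_importance(toy_predict, rows, labels, 1, rng)
print(f"mean decrease in accuracy, column 0: {imp_signal:.3f}")
print(f"mean decrease in accuracy, column 1: {imp_noise:.3f}")
```

Under a cut-off such as the 0.02 used in this study, column 0 would be retained and column 1 discarded.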
Stage 2 - Stepwise model selection using high-performance logistic regression approach
The 30 most important variables identified by Random Forests were input into the high-performance logistic model with stepwise variable selection to find the best parsimonious model to predict SSIs (24). The Schwarz Bayesian Criterion (SBC) was used as a penalized measure of fit for the logistic regression model to help avoid overfitting.
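For reference, the SBC (also known as the Bayesian Information Criterion) is usually written as -2 log L + k ln(n), where L is the maximized likelihood, k the number of estimated parameters and n the number of observations. The short sketch below uses made-up log-likelihood values, not results from this study, to show how the penalty can favor a smaller model even when a larger one fits slightly better.

```python
import math


def schwarz_bayesian_criterion(log_likelihood, n_params, n_obs):
    """SBC (BIC): -2 * logL + k * ln(n); lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical fits on the same 10,000 observations (values are illustrative).
small = schwarz_bayesian_criterion(log_likelihood=-1500.0, n_params=7, n_obs=10_000)
large = schwarz_bayesian_criterion(log_likelihood=-1495.0, n_params=31, n_obs=10_000)

print(f"SBC, 6 codes + intercept:  {small:.1f}")
print(f"SBC, 30 codes + intercept: {large:.1f}")
```

Here the larger model improves the log-likelihood by only 5 units, which does not offset the penalty for 24 extra parameters, so the smaller model has the lower SBC.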

Stage 3 - Risk score modeling approach
We used the methods suggested by Sullivan et al. (25) to summarize each logistic model from stage 2 as a point system. The point scores were developed for hospitalization (ICD-10) and physician (ICD-9) diagnostic codes, and physician procedure claims. All variables in the models were categorical, so the distance between a variable and its base category in regression coefficient units was equal to the size of the coefficient. For each variable, this distance was divided by a fixed constant (the number of regression coefficient units corresponding to one point) and rounded to the nearest integer to obtain its point value.
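A minimal sketch of this point assignment is shown below. The coefficient values and code names are hypothetical, and the choice of the smallest absolute coefficient as the one-point constant is an assumption of this example; the study does not state which constant was used.

```python
def points_from_coefficients(coefs, unit=None):
    """Map logistic regression coefficients to integer points.

    `unit` is the number of coefficient units worth one point. If omitted,
    the smallest absolute coefficient is used (an assumption of this sketch).
    """
    if unit is None:
        unit = min(abs(b) for b in coefs.values())
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    return {name: round(b / unit) for name, b in coefs.items()}


def total_score(points, present_codes):
    """Sum the points of all scored codes recorded for one patient."""
    return sum(points[c] for c in present_codes if c in points)


# Hypothetical coefficients for three binary code indicators.
coefs = {"code_A": 0.7, "code_B": 1.5, "code_C": 2.2}
points = points_from_coefficients(coefs)
patient_score = total_score(points, {"code_A", "code_C", "unscored_code"})

print(points)
print("patient score:", patient_score)
```

A patient's total score is then the sum of the points for the codes present in their records, which is the quantity carried forward into the full model.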
Then, the obtained point scores were input into a logistic regression model and adjusted for other potential confounding factors suggested by the existing literature, including age, sex, surgical procedure, emergency case, concurrent surgical procedures, patient's physical status (ASA-5), and duration of surgery. The full model discrimination (C statistic or AUC) and calibration (Hosmer-Lemeshow (H-L) statistic) were assessed in the validation dataset. All methods were performed in accordance with the guidelines for developing and reporting Machine Learning predictive models in biomedical research (26). The high-performance regression and point score assignment were performed in SAS 9.4 statistical software.
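The two validation measures named above can be illustrated with a small sketch. It is written in Python for illustration (the study used SAS) and runs on simulated data. The C statistic is computed as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, and the Hosmer-Lemeshow statistic compares observed and expected events across groups of ranked predicted risk. Ten groups is the conventional choice and is assumed here. The p-value, conventionally taken from a chi-square distribution with two fewer degrees of freedom than groups, is omitted because the standard library has no chi-square function.

```python
import random


def c_statistic(probs, outcomes):
    """Share of case/non-case pairs ranked correctly; ties count as 0.5."""
    cases = [p for p, y in zip(probs, outcomes) if y == 1]
    controls = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def hosmer_lemeshow(probs, outcomes, n_groups=10):
    """Hosmer-Lemeshow chi-square over groups of ranked predicted risk."""
    ranked = sorted(zip(probs, outcomes), key=lambda pair: pair[0])
    n = len(ranked)
    stat = 0.0
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(group)
        expected = sum(p for p, _ in group)
        observed = sum(y for _, y in group)
        # Combined contribution of events and non-events in this group.
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat


# Simulated, well-calibrated predictions (illustrative data only).
rng = random.Random(1)
probs = [rng.uniform(0.01, 0.30) for _ in range(2000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

print(f"C statistic:    {c_statistic(probs, outcomes):.3f}")
print(f"H-L chi-square: {hosmer_lemeshow(probs, outcomes):.2f}")
```

A small H-L statistic (large p-value) indicates no evidence of miscalibration, while a C statistic near 1 indicates strong discrimination.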

Results
We identified 14,351 patients who underwent surgery from April 1, 2010 to March 31, 2015 and were enrolled into NSQIP at our hospital. An SSI was identified in 795 (5.5%) of these patients. Of these, 540 (68%) had superficial SSIs and 255 (32%) had deep or organ space SSIs. Descriptive statistics for patients in the study sample are reported in Additional file 4. The derivation and validation datasets were similar in terms of baseline covariates (Additional file 5).
Predictive modeling for hospitalization diagnostic codes (ICD-10)
We identified 3,085 hospitalization diagnostic (ICD-10) codes recorded within 30 days following the surgery date. These codes were then clustered into 994 three-digit hospitalization diagnostic codes that were used for further analyses.
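One plausible way to cluster detailed codes into three-digit categories is to keep the first three characters of each code and record, per patient, which categories are present. The sketch below assumes this truncation rule; the study does not describe the exact procedure. The function names, patient identifiers and codes are illustrative and not taken from the study data.

```python
from collections import defaultdict


def three_char_category(code):
    """Return the first three alphanumeric characters, e.g. 'T81.4' -> 'T81'."""
    cleaned = "".join(ch for ch in code.upper() if ch.isalnum())
    return cleaned[:3]


def indicator_features(patient_codes):
    """Build {patient_id: set of three-character categories} from raw codes."""
    features = defaultdict(set)
    for patient_id, code in patient_codes:
        features[patient_id].add(three_char_category(code))
    return dict(features)


# Illustrative (patient_id, code) pairs; not records from the study.
records = [(1, "T81.4"), (1, "T81.3"), (1, "L03.1"), (2, "k65.0")]
features = indicator_features(records)

for patient_id, categories in sorted(features.items()):
    print(patient_id, sorted(categories))
```

Each category then becomes one binary candidate predictor, which is how several thousand detailed codes can be reduced to several hundred features before the Random Forests screening.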
Stage 1: Given the large number of diagnostic codes (possible predictors), the Random Forests approach was used to identify a subset of the 30 most important hospitalization diagnostic codes that best predict classification. We used 800 classification trees and 46 variables available for splitting at each tree node. The accuracy of the Random Forests model was 95.3%. The resulting SSI prediction model demonstrated a positive predictive value (PPV) of 98%, a negative predictive value (NPV) of 97%, and an AUC (area under the receiver operating characteristic curve) of 0.78 (95% CI 0.77-0.79). The accuracy of the Random Forests model after 10-fold cross-validation was 94.3%. Figure 1 presents the top 30 hospitalization diagnostic (ICD-10) codes for classification of SSIs that were identified using the permutation VIM.
Stage 3: Risk scores for the final model of hospitalization diagnostic codes are presented in Table 1, Model 1 (25). Among the entire cohort, 80.3% of patients had a score of 0, 11.8% had a score of 1, and 7.9% had a score equal to or greater than 2.

Predictive modeling for physician diagnostic (ICD-9) codes
We identified 442 physician diagnostic 3-digit codes (using ICD-9-CA) recorded within 30 days following the surgery date.
Stage 1: Given the large number of diagnostic codes (possible predictors), the Random Forests approach was used to identify a subset of 30 physician diagnostic codes that best predict SSIs. The best misclassification rate was achieved by using 800 classification trees and 31 variables available for splitting at each tree node.
Stage 3: Risk scores for the final model of physician diagnostic codes are presented in Table 1, Model 2 (25). Among the entire cohort, 77.8% of patients had a score of 0, 7.7% had a score of 1, and 14.5% had a score equal to or greater than 2.
Predictive modeling for physician procedure claims
We identified 2,543 physician procedure claims recorded within 30 days following the surgery date. These codes were then clustered into 610 three-digit codes that were used for further analyses.
Stage 1: Given the large number of physician procedure codes (possible predictors), the Random Forests approach was used to identify a subset of 30 physician procedure claims that best predict SSIs. The best misclassification rate was achieved by using 1,000 classification trees and 37 variables available for splitting at each tree node. The accuracy of the Random Forests model was 94.8%. The resulting SSI prediction model demonstrated a PPV of 99%, an NPV of 97%, and an AUC of 0.82 (95% CI 0.81-0.83). The accuracy of the model after 10-fold cross-validation was 94.4%. Figure 3 presents the top 30 physician procedure claims that were identified using the permutation VIM.

Figure 3. Description of the top 30 physician procedure claims to identify SSIs. Z59 - Digestive system surgical procedure; C46 - Infectious disease - non-emergency hospital in-patient services: assessment/consultation; Z10 - Integumentary system surgical procedures: incision of abscess/haematoma; K07 - Family practice/geriatrics acute and chronic home care supervision; K99 - Emergency department - special visit premium; C03 - General surgery, non-emergency hospital in-patient services - assessment, visits, consultations; A35 - Urology - consultations/assessment; S16 - Digestive system surgical procedures; H15 - Family practice & practice in general - weekend and holidays: assessment/care; C64 - General thoracic surgery - non-emergency hospital in-patient services: consultation assessment; H12 - Family practice & practice in general - nights assessment and care; C12 - Non-emergency hospital in-patient services: subsequent visits by the MRP; R11 - Integumentary system surgical procedures: operations of the breast; E08 - Hospital and institutional consultations/assessments by MRP; C20 - Obstetrics and gynecology - non-emergency hospital in-patient services; Z08 - Debridement of wound(s) and/or ulcer(s) extending into subcutaneous tissue, tendon, ligament, bursa and/or bone; G55 - Diagnostic and therapeutic procedures, critical care; S21 - Digestive system surgical procedures: rectum; S65 - Male genital surgical procedures; Z74 - Respiratory surgical procedures; R62 - Musculoskeletal system surgical procedures - amputation; A20 - Obstetrics and gynecology - assessment or consultation; Z22 - Musculoskeletal system surgical procedures; R06 - Myocutaneous, myogenous or fascia-cutaneous flaps, neurovascular island transfer, transplantation of free island skin and subcutaneous flap; A24 - Otolaryngology - assessment/consultation; C13 - Internal and occupational medicine: non-emergency hospital in-patient services; C01 - Non-emergency hospital in-patient services, subsequent visits by the MRP; H13 - Family practice & practice in general - weekdays, evenings: assessment/care; C21 - Consultations/visits anaesthesia - non-emergency hospital in-patient services.

Stage 2: The identified top 30 physician procedure claims were input into the high-performance logistic regression model to identify the best parsimonious model for prediction of SSIs. We used a stepwise variable selection approach. Table 1, Model 3 presents the final model of 14 physician procedure claims to identify SSIs (AUC 0.84, 95% CI 0.83-0.85).
Stage 3: Risk scores for the final model of physician procedure claims are presented in Table 1, Model 3 (25). Among the entire cohort, 55.4% of patients had a score of 0, 11.9% had a score of 1, and 44.6% had a score equal or greater than 2.

Full model with total risk score of diagnostic and procedure codes
In the derivation cohort, the total scores of hospitalization diagnostic (ICD-10) codes, physician diagnostic (ICD-9) codes and physician procedure claims were included in the logistic regression model and adjusted for potential confounding factors, including surgical specialties, age, sex, duration of surgery, emergency case, ASA class and concurrent surgical procedures (Table 2). The full model had excellent discrimination (AUC 0.91; 95% CI, 0.90-0.92) and calibration (H-L statistic, 4.53; p = 0.402). The predicted probability threshold with the optimal operating characteristics (27) (i.e., the point on the ROC curve with the smallest squared distance to the point (0, 1) in the upper left-hand corner of ROC space) was a predicted risk of 4% (sensitivity, 83.4%; specificity, 89.2%; PPV, 34.2%; and NPV, 99.1%). In the internal validation cohort, the full model remained strongly discriminative (AUC 0.89, 95% CI 0.88-0.90) and well calibrated (H-L statistic, 6.47; p = 0.487) (Fig. 4).
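The threshold selection rule can be sketched as follows. Each candidate threshold defines a point (1 - specificity, sensitivity) on the ROC curve, and the rule picks the threshold whose point has the smallest squared distance to (0, 1). The predicted risks and outcomes below are invented for illustration, and the function names are assumptions of this example.

```python
def operating_characteristics(probs, outcomes, threshold):
    """Sensitivity, specificity, PPV and NPV when flagging risk >= threshold."""
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }


def closest_to_corner_threshold(probs, outcomes):
    """Threshold whose ROC point has the smallest squared distance to (0, 1)."""
    best_d2, best_t = None, None
    for t in sorted(set(probs)):
        oc = operating_characteristics(probs, outcomes, t)
        d2 = (1 - oc["sensitivity"]) ** 2 + (1 - oc["specificity"]) ** 2
        if best_d2 is None or d2 < best_d2:
            best_d2, best_t = d2, t
    return best_t


# Invented predicted risks: five non-cases followed by four cases.
probs = [0.01, 0.02, 0.03, 0.04, 0.08, 0.05, 0.10, 0.20, 0.30]
outcomes = [0, 0, 0, 0, 0, 1, 1, 1, 1]

threshold = closest_to_corner_threshold(probs, outcomes)
chosen = operating_characteristics(probs, outcomes, threshold)
print("chosen threshold:", threshold)
print(chosen)
```

This rule weights sensitivity and specificity equally. With a rare outcome such as SSI, the resulting PPV can remain modest even when both are high, which matches the pattern of results reported above.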

Discussion
We used a 3-stage predictive modeling approach to derive and internally validate models to predict SSIs within 30 days after a surgical procedure. To the best of our knowledge, this is the first study to use Machine Learning approaches to develop efficient algorithms for identifying SSIs within 30 days after surgery using health administrative data. The key finding of our study is that the risk of SSIs can be reliably estimated using routinely collected administrative data, including physician procedure claims and hospital (ICD-10) and physician (ICD-9) diagnostic codes. Our results demonstrate high performance of the Random Forests algorithm for prediction of SSIs without pre-selection of possible predictors, given a small number of cases. We derived a relatively small set of variables to identify postoperative SSIs, including 6 hospital diagnostic codes, 9 physician diagnostic codes, and 14 physician procedure claims.
Several studies have examined the use of administrative data to identify postoperative SSIs (6-10). Our study findings are consistent with these studies (6, 10). van Walraven et al. (6), for example, found that administrative data, including hospital diagnostic codes, emergency department visit codes and physician procedure claims, can be effectively used to identify postoperative patients with a low risk of having SSIs within 30 days of their surgical procedure. In particular, the predicted probability threshold with the optimal characteristics was a predicted risk of 5% (sensitivity, 82.1%; specificity, 85.6%; PPV, 27.7%).
Additionally, Sands et al. (9) found that automated medical and claim records together can be used to screen for post-discharge SSIs, but the method they used identified only 10% of procedures as possible infections.
The approach used in our study makes a new contribution to the existing literature by incorporating a much larger set of features than previous studies. It was possible to include all available diagnostic or procedure codes to identify SSIs in this study because the Random Forests approach is generally unaffected by the addition of irrelevant features and is robust to collinearity due to the use of random subsets of variables for tree splits. All the features included in this study were obtained from routinely collected data, and given the complex etiology of SSIs, there might be variables that would be overlooked if we used a narrower search strategy guided by a priori clinical expectations. It would be inappropriate to interpret the identified diagnostic or procedure codes as either causes or consequences of SSIs. Random Forests allows us to select variables that influence prediction given small sample sizes and an extremely small ratio of samples to variables (large "p" and small "n"). If the identified important variables are consistent with clinical knowledge, there will be more confidence in the derived model as a decision support tool.

Conclusion
This study shows that health administrative data could be effectively used in identifying SSIs.