What do we Know about Contributing Factors for “Never Events” in Operating Rooms? A Machine Learning Analysis

Abstract

Background
A surgical "Never Event" (NE) is a preventable error. Various factors contribute to the occurrence of wrong site surgery and retained foreign items, but little is known about their quantified risk in relation to a surgery's characteristics. Our study uses machine learning to reveal these factors and quantify their risk in order to improve patient safety and quality of care.

Methods
We used data from 9,234 observations on safety standards and 101 root cause analyses (RCAs) of actual NEs, and trained three Random Forest supervised machine learning models. Using standard 10-fold cross-validation, we evaluated the models' metrics and, through Gini impurity, measured the impact of each factor on the occurrence of the two types of NEs.

Results
We identified 24 contributing factors in six surgical departments. Two had an impact of >900% in Urology, Orthopedics and General Surgery, six had an impact of 0–900% in Gynecology, Urology and Cardiology, and 17 had an impact of <0%. Combinations of factors revealed 15–20 pairs with an increased probability in five departments: Gynecology, 875–1,900%; Urology, 1,900–2,600%; Cardiology, 833–1,500%; Orthopedics, 1,825–4,225%; and General Surgery, 2,720–13,600%. Five factors affected the occurrence of wrong site surgery (-60.96% to 503.92%) and five affected the occurrence of retained foreign body (-74.65% to 151.43%), three of them overlapping: two nurses (66.26–87.92%), surgery length <1 hour (85.56–122.91%), and surgery length 1–2 hours (-60.96% to 85.56%).

Conclusions
The use of machine learning has enabled us to quantify the potential impact of risk factors for wrong site surgeries and retained foreign items in relation to a surgery's characteristics, which in turn suggests tailoring the safety standards accordingly.
Trial registration number: MOH 032-2019

Background
Adverse medical events can lead to significant morbidity and mortality and increase healthcare expenditures. [1] A Never Event (NE) is an unacceptable adverse event, both preventable and unjustified, and should be reduced to zero through quality improvement. [2] Major NEs in perioperative care include incorrect surgery sites and foreign items retained in patients following surgery. [3][4] The human factors approach recognizes that human error is often the result of a combination of both individual surgeon factors and work system factors, [5] which makes human error the main contributing factor to NEs. [6] Human error includes surgeon distraction, [7] lack of situational awareness of the surgical team to possible error, and miscommunication among team members. [8] Additionally, institutional factors and working conditions, such as increased workload and clinician pressure, create a work climate that is not conducive to meeting the standards required to maintain patient safety [9] and effective teamwork. [10]
Currently, there are two essential international standards aiming to reduce NE occurrence: 1) the WHO Surgical Safety Checklist; [11] and 2) surgical counts of all items used during the surgery. [12] Yet partial compliance, unstandardized implementation of these standards, [13] and other possible unknown factors keep the incidence of NEs unchanged. [14] In Israel, the incidence of retained foreign items during surgery is 3.2 in every 100,000 surgeries. [15] The incidence of wrong site procedures is unclear, but is generally estimated as 1 in every 100,000 surgeries.
This study adopts a machine learning (ML) approach [16] to identify currently unknown contributors to NE occurrence. Previous studies leveraging ML methods in healthcare have demonstrated the benefits of analyzing and revealing non-trivial insights from diverse data types when compared to traditional methods. [17] To the best of our knowledge, this is the first study to use ML methods to identify potential contributing factors to the occurrence of NEs in operating rooms (ORs).

Study Design
We utilized a supervised ML method called Random Forest (RF), [18][19] incorporating the popular Extra Trees classifier. [20] RF is an ensemble learning method that trains multiple "simple" decision tree models and merges them to achieve a more accurate and stable prediction.
The use of RF entails several desired elements needed for properly conducting the analysis for this study. First, RFs provide a natural way to rank the importance of features. Specifically, the importance of a feature can be determined by examining to what extent the tree nodes using that feature reduce the impurity (i.e., the uncertainty in classification) across all "trees in the forest." Second, RFs are known to cope well with imbalanced datasets (as is the case in this study) and to avoid overfitting the data. Finally, RFs compared favorably with several other supervised ML algorithms we tested using our data, including popular deep neural networks and support vector machines (SVMs). It is worthwhile mentioning that RFs have been used extensively in the medical field for clinical risk prediction, [21] among other applications.
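As a rough illustration of the impurity-based ranking described above, the following sketch fits scikit-learn's `ExtraTreesClassifier` on synthetic binary data and reads the normalized mean impurity decrease from `feature_importances_`. The data, the signal planted in feature 0, and all hyperparameters are assumptions made for illustration only; they are not the study's data or settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 2,000 surgeries, 5 binary features.
rng = np.random.default_rng(0)
n_samples, n_features = 2000, 5
X = rng.integers(0, 2, size=(n_samples, n_features))

# Planted signal (assumption): the label depends only on feature 0.
y = ((X[:, 0] == 1) & (rng.random(n_samples) < 0.9)).astype(int)

# Hyperparameters are illustrative, not those used in the study.
forest = ExtraTreesClassifier(n_estimators=200, criterion="gini", random_state=0)
forest.fit(X, y)

# Mean decrease in Gini impurity per feature, averaged over all trees
# and normalized so that the importances sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first

for idx in ranking:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

Because only feature 0 carries signal in this toy setup, it should rank first, while the noise features receive small but non-zero importances.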
The safety standards used in the OR comprise surgical safety checklists and surgical counts. Checklist-based safety verification was divided into three distinct time periods (preprocedure, sign in and time out) [11] and addressed incorrect surgery site errors, which we define as Type A errors. Surgical counts were divided into three separate counts throughout the surgery to address retained foreign body errors, which we define as Type B errors: prior to skin incision; at initiation of closure of the fascia/cavity; and following skin closure. [22] In addition, we added general features, such as the name of the hospital, length of surgery, patient's gender and age, surgeon's specialty, and number of physicians and nurses present during surgery.

Data Collection and Annotation
Data were collected from 29 Israeli hospitals and consisted of two types of entries: observations of 9,234 surgeries performed between January 2018 and February 2019, in none of which an NE occurred, and root cause analyses (RCAs) of 101 NEs that occurred between January 2016 and February 2020 in the examined hospitals.

Observations
Initiated by the supervisory arm of the Israeli Ministry of Health, passive observations by medical students, physicians, nursing students or RNs are routinely performed in ORs. Observers for this study underwent eight hours of training that included simulations. In each OR, at least two observers passively observed randomly selected surgeries, and recorded and annotated the surgery process using a pre-defined set of features. Observations were then transferred to a central database and routinely assessed for variability and reliability. Overall, 9,234 observations were conducted. Each observation was translated into a 93-feature-long vector representing characteristics of the surgery (Appendix 1). To maintain reliability, entries with greater than 5% discordance among annotators in one OR were discarded (<1% of entries).

Root Cause Analyses (RCA)
RCAs were performed in response to NEs that occurred between January 2016 and February 2020. Overall, we recorded 101 NEs: 49 of Type A and 52 of Type B. The obtained RCAs were manually annotated by the authors using the same 93-feature-long representation used to characterize the observations. However, unlike the observations, RCAs were performed retrospectively and, thus, a significant portion of the features was missing and could not be obtained. Specifically, up to 40% of the feature values were missing, a challenge we address further on.

Pre-Processing and Analysis Technique
As some features were non-binary (e.g., patient age, length of surgery), we first discretized them, resulting in 250 binary features. This and subsequent steps were performed using a dedicated Python 3 program implemented by the authors that uses the standard scikit-learn ML package (https://scikit-learn.org/stable).
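The discretization step can be pictured with the minimal sketch below, which maps one numeric feature to mutually exclusive binary indicators. The helper `discretize`, the bin edges, and the indicator names are hypothetical; the text does not specify the bins actually used, and the handling of boundary values here (a value equal to an edge falls into the upper bin) is an assumption.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a numeric value to one-hot binary indicators, one per bin."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than bin edges")
    # Index of the bin the value falls into; edges must be sorted ascending.
    bin_index = bisect_right(edges, value)
    return {label: i == bin_index for i, label in enumerate(labels)}


# Hypothetical bins for surgery length in hours: <1, 1 to <2, >=2.
LENGTH_EDGES = [1.0, 2.0]
LENGTH_LABELS = ["length_lt_1h", "length_1_to_2h", "length_ge_2h"]

row = discretize(1.5, LENGTH_EDGES, LENGTH_LABELS)
print(row)
# {'length_lt_1h': False, 'length_1_to_2h': True, 'length_ge_2h': False}
```

Applying the same idea to every non-binary feature expands the original feature vector into a longer, purely binary one.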
Examination of the 40% missing feature values revealed that most were strongly dependent on the NE type. Namely, for Type A NEs, features that were assumed to be more related to Type B NEs were not investigated, and vice versa. For example, for an NE in which the wrong hand was operated on, there was no indication as to whether the surgeon scanned the surgical cavity for retained surgical items before closure. To mitigate this artifact, we used the popular iterative data imputation approach, [23] in which each missing value is predicted from the features that are present and the available examples. Specifically, using the entire dataset, each missing value was estimated using a standard Decision-Tree Regressor.
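One way to set up such an imputation in scikit-learn is sketched below, using `IterativeImputer` with a `DecisionTreeRegressor` as the per-feature estimator. The synthetic data, the 20% missingness rate, the tree depth, and the final thresholding back to binary values are all assumptions for illustration; the authors' exact configuration is not given in the text.

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary feature matrix (assumption): 300 rows, 6 features.
rng = np.random.default_rng(0)
X_full = rng.integers(0, 2, size=(300, 6)).astype(float)
X_full[:, 5] = X_full[:, 0]  # planted dependency so imputation has signal

# Hide roughly 20% of the entries at random (illustrative rate).
X_missing = X_full.copy()
mask = rng.random(X_missing.shape) < 0.2
X_missing[mask] = np.nan

# Each feature with missing values is regressed on the other features,
# cycling over the features for up to max_iter rounds.
imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=4, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

# Assumption: map the regressor's continuous outputs back to binary values.
X_binary = (X_imputed >= 0.5).astype(int)
print("remaining NaNs:", int(np.isnan(X_imputed).sum()))
```

Observed entries are left untouched by the imputer; only the missing entries are filled in.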
In addition, balancing steps were taken to cope with the high imbalance of the dataset. Specifically, with over 9,000 observations and only 101 NEs, we adopted a cost-sensitive training approach [24] whereby prediction mistakes on the minority class (NEs) were penalized in proportion to how under-represented that class was (here, approximately 90 times under-represented).
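The scale of the reweighting can be checked with a short calculation. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes × class_count), which is what scikit-learn applies for `class_weight="balanced"`; whether the authors used exactly this formula is an assumption, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class sizes as reported in the text.
counts = {"observation": 9234, "never_event": 101}
weights = balanced_class_weights(counts)

# How much more a mistake on an NE costs than a mistake on an observation.
ratio = weights["never_event"] / weights["observation"]
print(f"NE errors weigh about {ratio:.1f} times more")  # about 91.4
```

The ratio reduces to 9,234 / 101, which is consistent with the "approximately 90 times" figure quoted above.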
We implemented three RF models using our data: Model 1 for distinguishing between observations and NEs; Model 2 for distinguishing between observations and Type A NEs; and Model 3 for distinguishing between observations and Type B NEs. We used standard 10-fold cross-validation to evaluate the models' metrics and adopted the standard Gini impurity measure [25] to estimate the importance of features and their combinations in our models. Intuitively, Gini impurity captures the "noise" in a set by measuring how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the set. Feature importance ranking was conducted using the trained RF models, and we reported the change in NE occurrence probability given the entire dataset. We considered each feature separately and calculated the probability of NE occurrence when that feature assumed the value True as compared to the value False.
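The two quantities described above can be made concrete with a small sketch. `gini_impurity` implements the standard definition, 1 minus the sum of squared class proportions. `probability_change_pct` is a hypothetical reading of the reported percentages as the relative change in NE rate between the True and False groups, computed here from empirical frequencies; the text does not state the exact formula, so this is an assumption, as are the toy counts.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum over classes of (class proportion)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def probability_change_pct(feature_values, is_ne):
    """Percent change in empirical NE rate when the feature is True vs False."""
    true_outcomes = [ne for f, ne in zip(feature_values, is_ne) if f]
    false_outcomes = [ne for f, ne in zip(feature_values, is_ne) if not f]
    if not true_outcomes or not false_outcomes:
        raise ValueError("feature must take both values in the data")
    p_true = sum(true_outcomes) / len(true_outcomes)
    p_false = sum(false_outcomes) / len(false_outcomes)
    if p_false == 0:
        raise ValueError("relative change undefined when P(NE | False) is 0")
    return 100.0 * (p_true - p_false) / p_false


# A perfectly mixed two-class set has the maximum two-class impurity.
print(gini_impurity([0, 0, 1, 1]))  # 0.5

# Toy data (assumption): 10 NEs in 100 cases with the feature present,
# 9 NEs in 900 cases with the feature absent.
feature = [True] * 100 + [False] * 900
outcome = [1] * 10 + [0] * 90 + [1] * 9 + [0] * 891
change = probability_change_pct(feature, outcome)
print(f"{change:.0f}%")  # 900%
```

Under this reading, a value above 0% means NEs are relatively more frequent when the feature is True, and a negative value means they are less frequent.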
The study was approved by the Ethics Committees of the University and the Ministry of Health (MOH 032-2019).

Results
The majority of NEs (62.32%) occurred in six main departments: General Surgery, 19 (18.81%); Gynecology, 17 (16.83%); Orthopedics, 16 (15.84%); Cardiac and Cardiothoracic, 15 (14.85%); Ophthalmology, 8 (7.92%); and Urology, 7 (6.93%) (Table 1). Therefore, our analysis focused on the occurrence of NEs in these six departments.
To evaluate our models, we adopted the Area Under the Curve (AUC) measure, which is especially suited for imbalanced data such as ours, since it is not biased toward models that perform well on either the minority or the majority class at the expense of the other. [26] Our three RF models demonstrated good performance, exhibiting an AUC between 0.81 and 0.85. Generally, AUC scores between 0.8 and 0.9 are considered excellent. [27] AUC is interpreted as the probability that our model will rank a randomly chosen positive instance higher than a randomly chosen negative one. [28] As such, our models can be considered relatively strong and accurate despite their limitations.
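The ranking interpretation of AUC can be verified directly on a handful of scores. The sketch below counts, over all positive-negative pairs, how often the positive instance receives the higher score (ties count as half). The scores are invented for illustration, and `pairwise_auc` is a hypothetical helper rather than part of the authors' program.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented model scores: higher means "more likely to be an NE".
ne_scores = [0.9, 0.8, 0.4]
observation_scores = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(ne_scores, observation_scores)
print(f"AUC = {auc:.3f}")  # 11 of 12 pairs ranked correctly -> 0.917
```

Because the measure depends only on how pairs are ordered, it does not change when the ratio of NEs to observations changes, which is why it suits imbalanced data.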
Feature Importance
Figure 1 presents the top contributing features to the occurrence of NEs (of both types combined) in the six departments along with the associated probability change.
The top 14 contributing features varied significantly across departments, and there was no single feature set which was consistently more informative across all operations in predicting NEs. For example, feature [C], Discrepancy in second count, varied significantly across departments (160% to 1,950%). Feature [B], Surgery is paused because of discrepancy in third count, appeared in four of the six departments, and the associated probability change varied dramatically as well, between 269% and 1,540%. There were 10 features that consistently decreased the chance of an NE, including [F], Surgeon scans the cavity/fascia before closure during the second count, which affected five out of six departments, which was consistent in its probability change between

Effects of Feature Combinations
In the following analysis (Figure 2), we examine the effects of paired features, i.e., features that occur together in the data. It is important to note that, when considering feature combinations, their occurrence is expected to be very low, especially in the NE class. As such, the estimated effects are likely to be very high, yet the confidence in them is low.
Interestingly, in General Surgery, there were 14 feature combinations associated with a probability change of 13,600% (Figure 2A). In comparison, the single feature analysis (Figure 1) revealed probability changes of 1,287% and 1,168%, surprisingly for two features that were not part of the 14 feature combinations identified here.
In Figure 2A (Gynecology), every feature combination is associated with a probability change of 1,000–2,000%. In the single feature analysis (Table 2), the effect of two of the features separately was <900%, and the rest lagged behind with <150%. In Urology (Figure 2B), there were dozens of pairs with an effect of 1,900–2,500%, while a single feature had an effect of <1,150% on error. In General Surgery (Figure 2E), the combined effect of two features showed a dozen pairs with an effect of 1,900–4,200%, while a single feature had an effect of <1,950% on error, and the rest had even lower percentages.

Features Affecting Types A and B
Turning to Models 2 and 3, there is an overlap in three of the top five contributing features to Type A and Type B errors (Figures 3 and 4): 1) the presence of two nurses during the surgery predicts a greater occurrence of Type A (66%) and Type B (88%); 2) an operation shorter than one hour had a greater occurrence of Type A (122%) and Type B (87%); and 3) when the operation lasted between one and two hours, both Types A and B were less frequent, decreasing by 60% and 74%, respectively. The surgical department most affected regarding the occurrence of Type A NEs was Ophthalmology, with an increase of 504%, while General Surgery was associated with a decrease of 63% in Type A (Figure 3). For Type B, the two remaining features were staff driven: the feature "more than three physicians" was associated with an increased occurrence of Type B (151%), while "two physicians" was associated with a 52% decrease in Type B (Figure 4).

Discussion
Surgical errors are a serious public health problem, and uncovering their causes is challenging. [29] In this study, we aimed to uncover contributing factors to NEs by using ML methods to identify heretofore unknown contributors, since ML automatically looks for patterns not seen by classic methods. [18,30] Despite the widespread use of the surgical safety checklist and strict surgical counts, the prevalence of NEs has not decreased significantly since their implementation. [31][32] The human factor, and not system error, has been identified as the main contributing factor to NEs. [31,33] For example, in one study using an analysis and classification system, 628 human factors were divided into four categories that influenced NEs: preconditions for action, unsafe actions, oversight and supervisory factors, and organizational influences. [6] Additional studies have identified lack of communication and lack of empirical evidence as barriers to the implementation of the universal safety standards. [29,34] Some studies have suggested that counting alone is insufficient and that, even when counts were declared correct, items have been left in the patient, [35][36] mostly in the abdomen and pelvis. [35,37] This may explain our higher probability of Type B error in General Surgery and Urology, which involve those regions.
We further analyzed paired contributing factors, which represent the relative risk in the OR's complex work environment, where the graded risk increased compared to the single feature analysis. For example, in Orthopedics, a discrepancy in the count in combination with a surgery length of 1–2 hours increased the chances of an NE, which may be explained by partial compliance with the standards. In shorter surgeries, the staff rushes and skips some phases of the checklists, [38] and the complex sets used complicate the counts. [31,39]
We found that the occurrence of wrong site surgery increases in Ophthalmology, during short surgeries, and when two nurses are present, whereas it decreases in General Surgery. The increased risk in Ophthalmology could be due to the difficulty of performing a time out because the surgeons have antiseptic hands and cannot review charts, or perhaps because doing so is not made a priority. [40] The decrease in General Surgery could be explained by better implementation of the time out process in that specialty. [41][42] One of the main factors contributing to the occurrence of NEs is lack of communication among the members participating in the surgery, [33] which may explain our findings that the number of staff had an increasing or decreasing effect on NE occurrence.
We recognize that the current study is limited by the amount, quality and diversity of the data used. In the context of this work, our samples come from two distinct sources: prospective observations and retrospective investigations of NEs, where the latter consist of a small number of NEs compared to the relatively high number of analyzed observations. We believe that these limitations are inherent to the problem at hand, as performing prospective analyses of NEs is virtually impossible due to their infrequency, and the number of NEs is nominally small. To mitigate some of these concerns, we have used grounded statistical techniques that allowed us to train an adequate model and estimate feature importance. Nevertheless, given the above, the feature impact should be considered carefully and validated in future studies.
In future work, we plan to further expand our data pool with newly obtained observations and NEs as they accumulate. In another avenue, we are exploring transfer learning from NEs in other countries, which could be used to better inform our model. This avenue could prove valuable in mitigating the imbalanced nature of our data, yet may introduce significant biases due to the variety of data sources.

Conclusion
Our results suggest that the existing "one size fits all" safety approach currently in place may significantly benefit from tailored adjustments that consider additional factors such as those identified in this work. These more specific guidelines may be used to adjust risk management programs to improve patient safety.

Declarations
Ethics approval and consent to participate:
Availability of data and materials: The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
To the best of our knowledge, the named authors have no competing interests, financial or otherwise, to disclose.

Figures
Figure 2 Effect of two features' combination on prediction by surgical departments
Figure 3 Features affecting the wrong site surgery (Type A)
Figure 4 Features affecting retained foreign item during surgery (Type B)

Supplementary Files
This is a list of supplementary files associated with this preprint.
appendix1.docx