A Machine Learning Study to Improve Surgical Case Duration Prediction

DOI: https://doi.org/10.21203/rs.3.rs-40927/v1

Abstract

Since the emergence of COVID-19, many hospitals have encountered challenges in performing efficient scheduling and good resource management to ensure the quality of healthcare provided to patients is not compromised. Operating room (OR) scheduling is one of the issues that has gained our attention because it is related to workflow efficiency and critical care of hospitals. Automatic scheduling and high predictive accuracy of surgical case duration have a critical role in improving OR utilization. To estimate surgical case duration, many hospitals rely on historic averages based on a specific surgeon or a specific procedure type obtained from electronic medical record (EMR) scheduling systems. However, the low predictive accuracy with EMR data leads to negative impacts on patients and hospitals, such as rescheduling of surgeries and cancellation. In this study, we aim to improve the prediction of surgical case duration with advanced machine learning (ML) algorithms. We obtained a large data set containing 170,748 surgical cases (from Jan 2017 to Dec 2019) from a hospital. The data covered a broad variety of details on patients, surgeries, specialties and surgical teams. In addition, a more recent data set with 8,672 cases (from Mar to Apr 2020) was available to be used for external evaluation. We computed historic averages from the EMR data for surgeon- or procedure-specific cases, and they were used as baseline models for comparison. Subsequently, we developed our models using linear regression, random forest and extreme gradient boosting (XGB) algorithms. All models were evaluated with R-square (R2), mean absolute error (MAE), and percentage overage (actual duration longer than prediction), underage (shorter than prediction) and within (within prediction). The XGB model was superior to the other models, achieving a higher R2 (85 %) and percentage within (48 %) as well as a lower MAE (30.2 min). 
The total prediction errors computed for all models showed that the XGB model had the lowest inaccurate percentage (23.7 %). Overall, this study applied ML techniques in the field of OR scheduling to reduce the medical and financial burden for healthcare management. The results revealed the importance of surgery and surgeon factors in surgical case duration prediction. This study also demonstrated the importance of performing an external evaluation to better validate the performance of ML models.

Introduction

It has become increasingly important for clinics and hospitals to manage resources for critical care during the COVID-19 pandemic. Statistics show that approximately 60 % of patients admitted to the hospital will need treatment in the operating room (OR) [1], and the average OR cost is up to 2,190 dollars per hour in the United States [2, 3]. Hence, the OR is considered one of the highest hospital revenue generators and accounts for as much as 42 % of a hospital’s revenue [4, 3]. Based on these statistics, a good OR scheduling and management strategy is critical not only for patients in need of elective, urgent and emergent surgeries but also for surgical teams to be prepared. Owing to the importance of the OR, improving OR efficiency has high priority so that the cost and time spent on the OR are minimized while OR utilization is maximized to increase the number of surgical cases and patient access [5].

In a healthcare system, numerous factors affect OR efficiency, for example, patient expectation and satisfaction, interactions between different professional specialties, unpredictability during surgeries, surgical case scheduling, etc. [6]. Although the OR process is complex and involves multiple parties, one way to enhance OR efficiency is to increase the accuracy of predicted surgical case duration. Over- or underutilization of OR time often leads to undesirable consequences such as idle time, overtime, and cancellation or rescheduling of surgeries, which may induce a negative impact on patients, staff and the hospital [7]. In contrast, high efficiency in OR scheduling not only contributes to a better arrangement of operating room usage and resources but can also lead to cost reduction and revenue increase, since more surgeries can be performed.

Currently, most hospitals schedule surgical case duration by employing estimations from the surgeon and/or averages of historical case durations, and studies show that both of these methods have limited accuracy [8, 9]. For case lengths estimated by surgeons, factors including patient conditions and anesthetic issues might not be taken into consideration. Moreover, underestimation of case duration often occurs because surgeons tend to estimate in favor of maximizing block scheduling, to account for potential cancellations and to reduce costs. Furthermore, operations with higher uncertainty and unexpected findings during surgery add difficulty to case length estimation [8]. Historic averages of case duration for a specific surgeon or a specific type of surgery obtained from electronic medical record (EMR) scheduling systems have also been used in hospitals. However, these methods have been shown to produce low accuracy due to the large variability and the lack of the same combination of factors in the preoperative data available on the case being performed [10].

To improve predictability, researchers have utilized linear statistical models, such as regression, or simulation for surgical duration prediction and evaluation of the importance of input variables [11, 12, 13]. However, a common shortcoming of these studies is that relatively few input variables or features were used in their models, owing to the limitations of statistical techniques in handling many input variables. Recently, machine learning (ML) has been shown to be powerful and effective in aiding health care management. Master et al. (2017) trained multiple ML models, including decision tree regression, random forest regression, gradient boosted regression trees and hybrid combinations, to predict surgical durations [14]. Ensemble classifiers implementing least-squares boosting and bagging models, developed by Shahabikargar et al. (2017), were shown to reduce the error by 55 % compared to the original error [7]. With the use of a boosted regression tree, Zhao et al. (2019) increased the percentage of accurately booked cases for robot-assisted surgery from 35 % to 52 %. Bartek et al. (2019) reported that they were able to improve predicted cases within a 10 % threshold tolerance from 32 % to 39 % using an extreme gradient boosting model [15]. Nonetheless, these ML studies included only 5-12 different types of procedures and specialties to train their ML models, which may limit the generalization of these models.

In this study, we obtained more than 170,000 cases from China Medical University Hospital (CMUH) containing 422 types of procedures across 25 different specialties. From the original data, we further analyzed the working time of primary surgeons and computed their total number of previous surgeries and the total minutes spent on previous surgeries within the same day as well as within the last 7 days. Since surgeons’ working performance might be affected by previous events, surgical cases performed by the same primary surgeon, especially within the same day, should not be considered as totally independent and unrelated. Hence, previous surgical counts and working time obtained from surgeons’ data were included as additional features in our ML model training to account for their influences on surgical case duration. In addition, the number of urgent and emergent operations prior to the case that was being performed by the same surgeon, which has not been previously considered in other studies, was taken into consideration. This factor could affect surgical case duration since urgent and emergent operations happen unexpectedly and delay the start of subsequent planned surgeries. Overall, we hypothesize that these features impose significant influences on surgical case duration and may aid in improving the performance of a trained ML model.

Results

Approximately 17 % of the cases were excluded from the original data from Jan 1, 2017, to Dec 31, 2019, based on the exclusion criteria described in Fig. S2 (Supplementary info). Therefore, 142,448 cases containing more than 420 procedural categories and 25 specialties were included for predictive model development and evaluation. Furthermore, a recent data set collected from Mar 1 to Apr 30, 2020 (7,231 cases after exclusion), was used in the external evaluation to verify the robustness of the model in making predictions.

The results of all the metrics used to evaluate the performance of all the models on the training, internal testing and external testing sets are shown in Table S1 of the Supplementary info. Based on the results of the model evaluation on the external testing set (Fig. 1), the average model for the surgeon-specific scenario was not a good estimate for surgical case duration. The average model for the procedure-specific scenario had a lower percentage underage (actual duration shorter than prediction) and overage (longer than prediction) than the surgeon-specific average model. These differences were due to the more extensive procedure classification in the procedure-specific model. However, the percentage underage was still quite high. Since neither average model takes any information into consideration except the durations of past surgical cases, these models usually exhibit prediction bias and low accuracy.
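As a minimal sketch of how such a historic-average baseline can be formed (shown in Python with pandas for illustration; the column names and values here are hypothetical, and the study's data differ), a procedure-specific baseline is simply the per-category mean of past durations:

```python
import pandas as pd

# Hypothetical training records: procedure type and observed duration (minutes)
train = pd.DataFrame({
    "procedure": ["appendectomy", "appendectomy",
                  "cholecystectomy", "cholecystectomy"],
    "duration_min": [55, 65, 90, 110],
})

# Procedure-specific baseline: mean historic duration per procedure type
baseline = train.groupby("procedure")["duration_min"].mean()

# A new case is predicted by looking up its procedure's historic average
predicted = baseline["appendectomy"]
print(predicted)  # 60.0
```

A surgeon-specific baseline would group by a surgeon-identifier column instead; no other case information enters either baseline, which is why both tend to show the bias described above.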

We first fitted the Reg model by including all the input variables shown in Table 1. The evaluation metrics reported a lower percentage underage and a higher percentage within than the average models on the training, internal testing and external testing sets (Fig. 1). There was a large improvement in the R2 value, indicating that the predictive performance of the model increases when other information is taken into consideration during model development. Since the percentages of underage, overage and within were similar across the training, internal testing and external testing sets, overfitting was unlikely in the Reg model; a model is considered overfitted when its performance is better on the training set but poorer on the testing set.

 

Patient: age; gender; ICD code; in-/outpatient; ASA status; hypertension; anemia; diabetes.

Surgical team: primary surgeon’s ID; surgeon team size; specialty; primary surgeon’s gender; primary surgeon’s age.

Operation: procedure type; subprocedure type; anesthesia type.

Facility: room No.; day of the week; time of day.

Primary surgeon’s prior events: No. of previous surgeries performed by the surgeon on the same day; total surgical minutes performed by the surgeon on the same day; No. of previous surgeries performed by the surgeon within the last 7 days; total surgical minutes performed by the surgeon within the last 7 days; No. of previous urgent and emergent surgeries performed by the same surgeon on the same day.

Table 1: Preoperative data with 24 predictor variables were used for model development. The predictor variables can be categorized by relationship to patient, surgical team, operation, facility and primary surgeon’s prior events. ICD: International Classification of Diseases; ID: identifier; ASA: American Society of Anesthesiologists.

 

When we log-transformed surgical case duration and reran a regression model (i.e., logReg), the performance improved, outperforming the Reg model. Since we were predicting surgical case duration, log transformation of the target prevented us from obtaining zero or negative values from the predicted output of the model; log transformation has been commonly used in other studies for the same reason [12, 16]. Again, in the logReg model, the results of all the evaluation metrics were close across the training, internal testing and external testing sets, so the model was not overfitted.

Although the performance of the logReg model was acceptable, both the Reg and logReg models assume a linear relationship between the target and input variables. In a real-world situation, the relationship between the target and input variables is usually nonlinear, and ML algorithms are helpful in making predictions in such more complicated scenarios. The RF model is the first ML model that we built in this study; it showed a slight improvement in MAE compared to the logReg model. An XGB model was subsequently developed because training the RF model was time-consuming and computationally inefficient. The performance of the XGB model was better than that of the RF model on the training set, but no obvious improvement was observed on the internal and external testing sets beyond a slight improvement in MAE. Since XGB was more computationally efficient than RF, the XGB model was chosen as the best model and used in subsequent analysis.

In addition to the three key metrics, we studied the accumulative inaccuracy of all the models by using the external testing set. The total prediction error (in minutes) and the corresponding inaccurate percentage were calculated (Table 2). The actual total minutes represent the sum of surgical case durations for 7,231 cases in the external testing set. The inaccurate percentage was derived from the percentage of total prediction error divided by the actual total minutes. The outcome shows that the inaccurate percentage of the XGB model was the lowest among all the models. The inaccurate percentage of the XGB model was also more than 50 % lower than that of the average model for the surgeon-specific scenario and approximately 25 % lower than that of the procedure-specific average model. This result confirms that the XGB model performed better than the other models. It also implies that prediction made by the XGB model might help to increase the efficiency of OR scheduling.
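The accumulated-inaccuracy calculation can be illustrated as follows (toy numbers, not the study's data; the total prediction error is taken here as the sum of per-case absolute errors, consistent with the definition above):

```python
# Accumulated inaccuracy over a testing set: total absolute prediction error
# and the corresponding inaccurate percentage.
actual = [120, 60, 200]      # actual case durations in minutes
predicted = [100, 75, 210]   # model predictions in minutes

total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
actual_total = sum(actual)
inaccurate_pct = 100 * total_error / actual_total

print(total_error, round(inaccurate_pct, 1))  # 45 11.8
```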

 

 

                                    Actual    Surgeon-   Procedure-   Reg       logReg    RF        XGB
                                              specific   specific
Total minutes                       920,374   899,510    918,061      934,333   885,784   874,528   888,908
Total prediction error (minutes)    -         467,548    294,137      238,862   224,700   223,686   218,415
Inaccurate percentage (%)           -         50.8       32           26        24.4      24.3      23.7

Table 2: The extreme gradient boosting (XGB) model produced the lowest percentage of cumulative inaccuracy among all the models. Cumulative differences between actual and predicted case durations for all the models are shown in this table.

 

Subsequently, we plotted scatter plots of actual versus predicted duration on the external testing set for the average models (surgeon- and procedure-specific) and the XGB model (Fig. 2). A straight line indicating the theoretical perfect relationship, i.e., where the predicted and actual procedure durations are identical, was added as a reference in each scatter plot. The data points of the XGB model were aligned closer to this straight line; therefore, the XGB model showed a higher correlation between predicted and actual durations than the average models. Fig. 3 shows the density plot of differences between actual and predicted case durations for the two average models and the XGB model. The figure clearly demonstrates that the error distribution of the XGB model was narrower and closer to 0. As a result, the XGB model is more accurate than the average models in predicting surgical case duration.

To identify variables that were important in the XGB model, we extracted weighted feature gain (WFG) from the model. WFG was computed based on the reduction in model accuracy when the variable was removed. This value serves as an indication of how important the variable is in improving the purity of a decision tree branch [17, 18]. A higher WFG percentage indicates that the variable is more important. The results of the top 15 important variables in the XGB model are shown in Fig. 4. Notably, 3 of the top 4 important variables were attributed to operative information. Moreover, three of the features that we computed from the surgeon data (i.e., total surgical minutes performed by the surgeon within the last 7 days and on the same day and the number of previous surgeries performed by the surgeon within the last 7 days) were included in this top 15 list.
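The normalization step behind a WFG percentage can be sketched as follows (the raw gain scores below are hypothetical illustration values, not numbers extracted from the study's model): each feature's gain is expressed as a share of the total gain across all features.

```python
# Hypothetical raw gain scores from a trained boosted-tree model;
# weighted feature gain (WFG) expresses each as a share of the total gain.
raw_gain = {"procedure_type": 420.0, "surgeon_id": 260.0,
            "anesthesia_type": 200.0, "age": 120.0}

total = sum(raw_gain.values())
wfg = {k: 100 * v / total for k, v in raw_gain.items()}

# Rank features by WFG percentage, highest first
ranked = sorted(wfg.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # ('procedure_type', 42.0)
```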

In addition to looking at the overall performance across all specialties, we compared the performance of the XGB model by further breaking it down to the specialty level on the external testing set. Fig. 5 shows the number of cases that were predicted as overage, underage and within for each specialty in the external testing set. Most specialties had more cases predicted as within than as overage or underage, except some specialties such as thoracic, oral and maxillofacial, anesthesiology, pediatric, and bariatric and metabolic. These exceptional specialties had fewer within cases because their total case numbers were low. Moreover, the mean case duration was long for bariatric and metabolic surgery, making the case duration of this specialty difficult to predict accurately.

Discussion

Accurate prediction of surgical case duration is vital in increasing OR efficiency and reducing costs. This study not only helps to improve the accuracy of OR case prediction but also has novelty in the following aspects. First, the data set used in this study contained more than 140,000 cases (after exclusion) and more than 400 different types of surgical procedures, establishing a new benchmark for a massive quantity of data with high diversity. The maximal number of cases that had been used in other studies was in the range of 40,000 to 60,000 [15, 7]. Second, OR events were modeled as dependent events instead of independent. To this end, we extracted some additional information from surgeon data, e.g., previous working time and number of previous surgeries of the primary surgeons within the last 7 days and on the same day, and this information was taken into consideration during model building. Third, we tested the model on real daily surgical cases from Mar to April 2020 as external testing data for model evaluation. Fourth, though urgent and emergent surgeries were excluded from the data, the number of urgent and emergent operations prior to the case that was being performed by the same surgeon on the same day was included as an input variable to account for its effect on surgical case duration.

Currently, surgical cases at CMUH are scheduled according to estimates made by primary surgeons. However, surgeon estimates rely heavily on the surgeons’ prior experience, and many unexpected factors are not taken into consideration. Since there is no formal record of surgeon estimates, we used averages calculated for a specific surgeon or procedure type on the testing set as our baseline models. The performance of these two average models, as reported in Fig. 1 and Table S1 (Supplementary info), clearly showed that they were poor at predicting surgical case duration. These models also tended to underpredict surgical case duration, according to their scatter plots of actual versus predicted duration and density plots of the differences between them (see Figs. 2 and 3). When 24 feature variables (Table 1) were included in our model development, the R2, MAE, and percentages of underage, overage and within improved substantially compared to the baseline models. We applied 15 minutes as a tolerance threshold for the percentages of underage, overage and within because ± 15 minutes is the window accepted at CMUH for a booking to be considered accurate. To avoid an excessively stringent standard and to better compare our outcomes with those of other studies [15, 19], a tolerance threshold of 10 % was also applied.
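The labeling implied by the ±15-minute tolerance can be sketched as below (an illustrative helper, not the study's code); for the 10 % relative tolerance, the fixed `tol_min` would be replaced by `0.1 * predicted_min`:

```python
def classify(actual_min, predicted_min, tol_min=15):
    """Label a case as 'overage' (actual duration longer than the
    prediction by more than the tolerance), 'underage' (shorter by
    more than the tolerance) or 'within'."""
    diff = actual_min - predicted_min
    if diff > tol_min:
        return "overage"
    if diff < -tol_min:
        return "underage"
    return "within"

print(classify(130, 100))  # overage
print(classify(100, 108))  # within
```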

By using regression and ML approaches, we were able to decrease the total prediction error (Table 2) of surgical case duration at CMUH. Among all the models, the XGB model was considered the best because it was more computationally efficient and had the lowest inaccuracy. Moreover, even though the evaluation metrics of the RF model were similar to those of the XGB model, the XGB model still reduced the total prediction error (in minutes) from 223,686 to 218,415. In other words, the XGB model was able to save more than 5,000 minutes of idle or delay time compared with the RF model. Since most ORs usually have multiple cases scheduled per day, the total prediction error represents the cumulative effect of total OR cases in the 2-month period of Mar to April 2020. This cumulative effect may eventually translate into a significant financial advantage by allowing additional operation cases to be scheduled [20]. This approach would also lead to a significant cost reduction and increase in revenue because ORs are utilized appropriately and efficiently. When we evaluated the case numbers of overage, underage and within predicted by the XGB model at the specialty level, more cases fell within the acceptable thresholds for most of the specialties in the external testing set (Fig. 5). This finding suggests that the performance of the XGB model can be generalized across specialties.

It has been reported in past studies that primary surgeons contributed the largest variability in surgical case duration prediction compared to other factors, such as those attributed to patients [15, 16, 14]. These studies provide evidence and rationale that more factors relating to primary surgeons should be added as input variables in the training of ML models. Moreover, extensive feature engineering usually improves the quality of ML models and can be independent of the modeling technique itself. As a result, in addition to the primary surgeon’s identifier, gender and age, we computed the previous working time and number of previous surgeries performed by the same primary surgeon within the last 7 days and on the same day. We also counted the number of urgent and emergent operations prior to the case being performed by the same primary surgeon. These variables extracted from primary surgeon data were significantly (p < 0.05) correlated with surgical case duration (see Table S2 in the Supplementary info). The correlation coefficients of these variables also revealed that a primary surgeon’s case duration may decrease as he or she becomes more familiar with the surgical procedure but may increase if his or her total surgical minutes are too long. Although performing a surgery multiple times on different patients may help a primary surgeon to be more efficient in the next operation, a long working time may also lead to lethargy and may affect the primary surgeon’s performance.

In the methodology of data processing, for predictor variables that contained many categories, we grouped categories that had fewer than 50 cases into a category named ‘Others’. In addition to reducing the data dimensionality for categorical features, this grouping may aid the generalization of our model, which implies that our model will still be able to predict case duration even for operations that are rare. Moreover, our model can be applied to new primary surgeons who are not included in the training set by setting their ID to ‘Others’ for case duration prediction. However, there is still a need to update our model periodically, for example, when the surgical cases performed by a new primary surgeon have increased beyond a certain number. In terms of timing, we recommend updating the model annually by using surgical cases performed in the most recent 3 years as training data.

One limitation of this study is that we selected predictor variables that could be extracted only from preoperative data. Our ML model still needs to be improved in order to predict surgical case duration dynamically. For example, blood loss during surgery may affect case duration, since an unexpected increase in blood loss may cause surgeons to take longer to complete the surgery. Therefore, it would be better if intraoperative data were incorporated during ML model development so that the prediction made by the ML model could be updated during surgery. One common issue in all ML studies of surgical case duration prediction, including ours, is that the models were developed using data from a single site. Such models have difficulty generalizing because surgical teams, facilities and patient populations differ across entities. A custom-made model has to be built for a given organization using training data containing its own patients, procedures, surgeons, medical staff and facility. As a result, the exact same ML model is not meant to, and will not, perform well when applied to another organization or hospital. Another interesting issue of applying ML or artificial intelligence to surgical duration estimation is that medical technologies evolve quickly. Hence, how frequently an ML or artificial intelligence model needs to be updated remains an open question.

Conclusion

The XGB model was superior in predictive performance when compared to the average, Reg and logReg models. The total inaccuracy of predicted outcomes of the XGB model was the lowest among the other models developed in this study. Although the performance of the RF model was close to that of the XGB model, the XGB model was more computationally efficient in that it took a shorter time to complete the training process. For the XGB model built in this study, the coefficient of determination (R2) was higher than that in other ML studies, while the percentages of under- and overprediction were lower [15, 19, 7]. Moreover, this model improves the current OR scheduling method at CMUH, which is based on estimates made by surgeons.

We propose extracting additional information from surgery and surgeon data to be used as predictor variables for ML algorithm training since their importance was high in the XGB model. Moreover, we validated the model types using an external testing set in addition to the internal testing set split from the original data used in model training. This approach helped us to validate and test the models in a more stringent and rigorous way. Therefore, we suggest that external evaluation should be used as a tool to better validate the predictive power of ML models in the future.

Methods

Data sources.

Data for this study were collected from the EMR scheduling system of CMUH located in Taichung, Taiwan. The data set covered a broad variety of details about patients, surgeries, specialties and surgical teams. A total of 170,748 cases performed between Jan 1, 2017, and Dec 31, 2019, were used for model development. Additionally, 8,672 cases performed between Mar 1 and April 30, 2020, were used as data for external model evaluation in this study. Over 400 different types of procedures across 25 surgical specialties were included in the data set. Institutional review board approval (CMUH109-REC1-091) was obtained from CMUH before carrying out this study.

Exclusion criteria, data processing and feature selection.

Emergent and urgent surgical cases were removed since these two types of surgeries cannot be scheduled until they happen. Cases with a surgeon younger than 28 years, or with a duration longer than 10 hours or shorter than 10 minutes, were also removed. Surgical records with missing values were excluded, as were patients who were pregnant, were younger than 20 years or underwent two or more surgical procedures at the same time. The exclusion criteria are shown in Fig. S2. This process resulted in a data set of 142,448 cases used for model training and testing. The same criteria were also applied to the data of Mar 1 to Apr 30, 2020, leaving 7,231 cases after exclusion.

Features were selected from available data sources based on literature review and discussion with surgeons and administrators of CMUH. Although the model performance could be enhanced by some postoperative information (e.g., total blood loss), these parameters cannot be used as features for model training because they were either missing or simply estimated by surgeons before surgery. Therefore, only variables that were available before surgery were selected for model development.

When visualizing all the categories of procedure types and the International Classification of Diseases (ICD) code, there were hundreds to thousands of categories in these two variables. To reduce the problem of having too many dimensions during one-hot encoding of categorical features, we combined categories that had fewer than 50 cases in the training set into a category named ‘Others’. Similarly, we combined categories for primary surgeon’s ID, specialty, anesthesia type and room number that had fewer than 50 cases into the category ‘Others’.
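This grouping step can be sketched with pandas as follows (toy column and counts for illustration; the 50-case threshold matches the one described above, and the study itself used R):

```python
import pandas as pd

# Collapse categories with fewer than 50 training cases into 'Others'.
df = pd.DataFrame({"procedure": ["a"] * 60 + ["b"] * 3 + ["c"] * 2})

counts = df["procedure"].value_counts()
rare = counts[counts < 50].index                       # categories 'b' and 'c'
df["procedure"] = df["procedure"].where(
    ~df["procedure"].isin(rare), "Others")             # keep common, relabel rare

print(df["procedure"].value_counts().to_dict())  # {'a': 60, 'Others': 5}
```

Applying the same relabeling at prediction time lets rare or previously unseen categories (including new surgeon IDs) fall into 'Others' rather than failing one-hot encoding.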

In addition, since surgical case duration can be related to the performance of surgeons and surgeons’ performance is affected by their working time, we analyzed primary surgeons’ previous surgical events. The number of previous surgeries and total surgical minutes performed by the same primary surgeons on the same day as well as within the last 7 days and the number of urgent and emergent operations prior to the case that was being performed by the same surgeon were included in the analysis. Together, 24 predictor variables were included for predictive model building in this study. These predictors can be categorized into 5 groups: patient, surgical team, operation, facility and primary surgeon’s prior events (see Table 1).
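The same-day prior-event features can be derived with grouped cumulative operations, sketched below on an invented schedule (column names and values are hypothetical; the 7-day features would use a date-windowed variant of the same idea):

```python
import pandas as pd

# Toy schedule: for each case, count the surgeon's earlier cases that day
# and sum the minutes already operated (the case itself is excluded).
df = pd.DataFrame({
    "surgeon": ["s1", "s1", "s2", "s1"],
    "date": ["2019-05-01"] * 4,
    "start": ["08:00", "10:00", "09:00", "13:00"],
    "duration_min": [90, 60, 120, 45],
}).sort_values(["surgeon", "date", "start"])

grp = df.groupby(["surgeon", "date"])
df["prev_cases_same_day"] = grp.cumcount()
df["prev_minutes_same_day"] = grp["duration_min"].cumsum() - df["duration_min"]

# Third s1 case of the day: 2 prior cases, 90 + 60 = 150 prior minutes
print(df.loc[3, ["prev_cases_same_day", "prev_minutes_same_day"]].tolist())  # [2, 150]
```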

Model development and training.

We applied multiple ML methods for surgical case duration prediction. Surgical case duration (in minutes) is the total period starting from the time the patient enters the OR to the time of exiting the OR. Historic averages of case durations based on surgeon-specific or procedure-specific data from EMR systems were used as baseline models for comparison in case duration prediction. At the beginning, we performed multivariate linear regression (Reg) to predict surgical case duration. However, when we evaluated the distribution of surgical case duration, it was observed to be skewed to the right (Fig. S1 in the Supplementary info). We performed a logarithmic transformation on the surgical case duration to reduce the skewness. The model built from log-transformed multivariate linear regression (logReg) outperformed Reg in all evaluation indexes. Subsequent ML algorithms were also trained by using the log-transformed case duration as the target.
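The log-transform-and-back-transform step can be sketched as follows (a single toy predictor and NumPy's least-squares fit for illustration, whereas the study used 24 predictors and R); exponentiating the fitted log-duration guarantees strictly positive predictions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # one toy predictor
y = np.array([30.0, 60.0, 120.0, 240.0])  # right-skewed durations in minutes

# Fit a least-squares line on log-duration, then back-transform
slope, intercept = np.polyfit(x, np.log(y), 1)
pred = np.exp(intercept + slope * 5.0)     # prediction for x = 5

print(round(pred))  # 480
```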

The first ML algorithm that we tested was random forest (RF), a tree-based supervised learning algorithm. RF uses bootstrap aggregation, or bagging, for regression by constructing a multitude of decision trees on the training data and outputting the mean predicted value from the individual trees [21]. The bagging technique is unlikely to result in overfitting; in other words, it reduces the variance without increasing the bias. Tree-based techniques were suitable for our data since they contain a large number of categorical variables, e.g., ICD code and procedure type, most of which were sparse. The number of trees set in this study was 50. The extreme gradient boosting (XGB) algorithm is the other supervised ML algorithm that was tested for comparison with RF. Recently, the XGB algorithm has gained popularity within the data science community due to its ability to overcome the curse of dimensionality as well as to capture interactions of variables [22].

XGB is also a decision tree-based algorithm but is more computationally efficient for real-time implementation than RF; the two algorithms differ in how the trees are built. It has been shown that XGB outperforms RF when its parameters are tuned carefully; otherwise, it is more likely to overfit noisy data [23, 24]. We adopted a 5-fold cross-validation strategy to select the optimal number of boosting iterations, using η = 0.5 (step-size shrinkage to prevent overfitting), a maximum tree depth of 3, γ = 0.3 (minimum loss reduction, where a larger γ yields a more conservative algorithm) and α = 1 (L1 regularization weight, where a larger value yields a more conservative model).

A data-splitting strategy was used in training all the models to prevent overfitting. We randomly separated the data into training and testing subsets at a ratio of 4:1. The training data were used to build the predictive models and to extract important predictor variables; the testing data were used for internal evaluation of the models. In addition to internal evaluation, all models were evaluated externally using data from Mar 1 to Apr 30, 2020, which were not included in the original data set used for ML model training. The external evaluation results therefore better verify the robustness of the trained models in making accurate predictions. Historic averages of case duration for surgeon- or procedure-specific data calculated from EMR data were also evaluated on the same internal and external testing sets to ensure a fair and uniform comparison across all models. Data processing and cleaning as well as model development in this study were performed using R software. The packages “xgboost” and “randomForest” were used to implement the XGB and RF algorithms in R, respectively [17, 25].
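The 4:1 random split could be expressed as follows (an illustrative Python version; the study used R, and the array shapes here are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the 24 predictors and log-duration target.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# A 4:1 train/test ratio corresponds to test_size=0.2; fixing the
# random state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)
```

The held-out 20 % plays the role of the internal test set; the later March–April 2020 cases serve as an entirely separate external test set.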

Model evaluation.

Multiple predictive models were built to predict surgical case duration, and several standards can be applied to evaluate their predictive performance. The three key metrics used in this study were (1) R-square (R2), (2) mean absolute error (MAE), and (3) percentage overage, underage and within.

R2 is the coefficient of determination; it represents the proportion of the variance for the actual case duration that is explained by predictor variables in our models.

MAE measures the average of errors between the actual case durations and the predictions.

Percentage overage is the percentage of cases whose actual duration exceeds both the prediction plus a 10 % tolerance threshold (i.e., 1.1 ∗ prediction) and the prediction plus 15 minutes. Likewise, percentage underage is the percentage of cases whose actual duration falls below both 0.9 ∗ prediction and the prediction minus 15 minutes. Percentage within therefore equals 100 % − (percentage overage + percentage underage).
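Under these definitions, the three percentages could be computed as in the following sketch (Python, with a hypothetical `duration_metrics` helper; the thresholds follow the dual 10 %/15-minute tolerance described above):

```python
import numpy as np

def duration_metrics(actual, pred):
    """Percentage overage/underage/within under a dual tolerance:
    a case counts as 'over' only if its actual duration exceeds BOTH
    1.1 * prediction and prediction + 15 minutes, and symmetrically
    as 'under' below both 0.9 * prediction and prediction - 15."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    upper = np.maximum(1.1 * pred, pred + 15.0)
    lower = np.minimum(0.9 * pred, pred - 15.0)
    over = np.mean(actual > upper) * 100.0
    under = np.mean(actual < lower) * 100.0
    within = 100.0 - over - under
    return over, under, within
```

For example, `duration_metrics([100, 200, 100], [100, 100, 130])` classifies one case each as within, over and under.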

Data availability

The minimum dataset (March to April 2020) used in external evaluation for this study is available from our website: https://cmuhopai.azurewebsites.net/. The dataset required to replicate model training and internal evaluation contains personal data and is not publicly available, in keeping with the Data Protection Policy of CMUH.

Code availability

The code used in this study is currently unavailable but may become available in the future from the corresponding author on reasonable request.

Declarations

Acknowledgments

The authors would like to thank Shu-Cheng Liu and Jhao-Yu Huang for their assistance in building the website for data access.

Author contributions

C.H. and J.L. contributed equally to this work. C.H. and J.L. performed the literature search, conducted the modelling and evaluation, contributed to the modelling design, generated the figures, and co-wrote the paper. D.C. conceived the study and contributed to data interpretation. J.Y. contributed to the study design, data interpretation, and paper editing.

Competing interests

The authors declare that they have no competing interests.

Materials & correspondence

Correspondence and requests for materials should be addressed to Jiaxin Yu (email: [email protected]).

References

  1. Gordon, T., Paul, S., Lyles, A. & Fountain, Surgical unit time utilization review: Resource utilization and management implications. Journal of Medical Systems 12, 169–179 (1988).
  2. Barbagallo, S. et al. Optimization and planning of operating theatre activities: An original definition of pathways and process modeling. BMC Medical Informatics and Decision Making 15 (2015).
  3. Childers, P. & Maggard-Gibbons, M. Understanding costs of care in the operating room. JAMA Surgery 153 (2018).
  4. Gillespie, B. M., Chaboyer, W. & Fairweather, N. Factors that influence the expected length of operation: Results of a prospective study. BMJ Quality and Safety 21, 3–12 (2012).
  5. Levine, C. & Dunn, P. F. Optimizing Operating Room Scheduling (2015).
  6. Rothstein, D. H. & Raval, V. Operating room efficiency. Seminars in Pediatric Surgery 27, 79–85 (2018).
  7. Shahabikargar, Z., Khanna, S., Sattar & Lind, J. Improved Prediction of Procedure Duration for Elective Surgery. Studies in Health Technology and Informatics 239, 133–138 (2017). URL http://www.ncbi.nlm.nih.gov/pubmed/28756448.
  8. Laskin, D. M., Abubaker, A. O. & Strauss, R. A. Accuracy of predicting the duration of a surgical operation. Journal of Oral and Maxillofacial Surgery 71, 446–447 (2013).
  9. May, H., Spangler, W. E., Strum, D. P. & Vargas, L. G. The Surgical Scheduling Problem: Current Research and Future Opportunities. Production and Operations Management 20, 392–405 (2011). URL http://doi.wiley.com/10.1111/j.1937-5956.2011.01221.x.
  10. Zhou, J., Dexter, F., Macario, A. & Lubarsky, D. A. Relying solely on historical surgical times to estimate accurately future surgical times is unlikely to reduce the average length of time cases finish late. Journal of Clinical Anesthesia 11, 601–605 (1999).
  11. Kougias, Tiwari, V. & Berger, D. H. Use of simulation to assess a statistically driven surgical scheduling system. Journal of Surgical Research 201, 306–312 (2016).
  12. Hosseini, N., Sir, M. Y., Jankowski, J. & Pasupathy, K. S. Surgical Duration Estimation via Data Mining and Predictive Modeling: A Case Study. Tech. Rep., Mayo Clinic.
  13. Eijkemans, M. J. et al. Predicting the unpredictable: A new prediction model for operating room times using individual characteristics and the surgeon’s estimate. Anesthesiology 112, 41–49 (2010).
  14. Master, N. et al. Improving predictions of pediatric surgical durations with supervised learning. International Journal of Data Science and Analytics 4, 35–52 (2017).
  15. Bartek, M. A. et al. Improving Operating Room Efficiency: Machine Learning Approach to Predict Case-Time Duration. Journal of the American College of Surgeons (2019).
  16. Strum, D. P., Sampson, A. R., May, J. H. & Vargas, L. G. Surgeon and type of anesthesia predict variability in surgical procedure times. Anesthesiology 92, 1454–1466 (2000).
  17. Chen, T., He, T., Benesty, M. & Khotilovich, V. xgboost: Extreme Gradient Boosting. R package (2020). URL https://github.com/dmlc/xgboost/issues.
  18. Song, Y. & Lu, Y. Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry 27, 130–135 (2015).
  19. Zhao, B., Waterman, S., Urman, R. D. & Gabriel, R. A. A Machine Learning Approach to Predicting Case Duration for Robot-Assisted Surgery. Journal of Medical Systems 43, 32 (2019). URL http://www.ncbi.nlm.nih.gov/pubmed/30612192.
  20. Dexter, F. & Macario, A. Decrease in Case Duration Required to Complete an Additional Case During Regularly Scheduled Hours in an Operating Room Suite. Anesthesia & Analgesia 88, 72–76 (1999). URL http://journals.lww.com/00000539-199901000-00014.
  21. Prasad, A. M., Iverson, L. R. & Liaw, A. Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems 9, 181–199 (2006).
  22. Nielsen, D. Tree Boosting With XGBoost - Why Does XGBoost Win ”Every” Machine Learning Competition? (2016).
  23. Bentéjac, C., Csörgő, A. & Martínez-Muñoz, G. A Comparative Analysis of XGBoost (2019). URL http://arxiv.org/abs/1911.01914.
  24. Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics (2001).
  25. Breiman, L., Cutler, A., Liaw, A. & Wiener, M. randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. R package (2018). URL https://www.stat.berkeley.edu/~breiman/RandomForests/.