Datasets
The Johns Hopkins University–Intuitive Surgical Gesture and Skill Assessment Working Set (JIGSAWS) consists of 103 videos of curated table-top surgical setups and includes kinematic measurements (i.e., joint articulations and velocities) from 8 surgeons performing 4 to 5 trials of each of 3 surgical tasks: knot tying, needle passing, and suturing. All participating surgeons provided written informed consent. The data were captured using the da Vinci Surgical System (Intuitive Surgical) and come with manually annotated labels corresponding to performance scores defined by a modified version of the Objective Structured Assessment of Technical Skill (OSATS), specifically the Global Rating Score (GRS). The GRS excludes certain categories, such as use of assistants, because each clip depicts a surgeon completing a short procedure in a controlled environment where assistance is not available. The GRS uses a Likert scale with values ranging from 1 to 5 for respect for tissue, suturing and needle handling, time and motion, flow of operation, overall performance, and quality of product. This dataset was collected as a collaboration between Johns Hopkins University and Intuitive Surgical within an institutional review board–approved study and has been released for public use [39].
The EndoVis19 dataset was released as part of the Endoscopic Vision Challenge 2019 [40]. It consists of 22 full-length videos of cholecystectomy procedures with annotated steps and a score for each significant step of the surgery. The Calot triangle and dissection phases were included in this analysis. The Global Operative Assessment of Laparoscopic Skills (GOALS) rubric was used for the surgical skill annotations [46].
Feature selection and data preprocessing
A deep learning instance segmentation model (Mask R-CNN) identifies the visible bounds, type, and quantity of surgical instruments in each processed frame [46, 47]. The resulting features are then tracked over time and recorded as temporal features for downstream processing. For this work, we track the instrument tip, which allows us to capture high-level characteristics that may correlate with the operator's level of surgical skill.
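As an illustration, a minimal per-frame inference sketch using torchvision's off-the-shelf Mask R-CNN; the exact backbone and weights used in this work are not specified here, so both are assumptions:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained Mask R-CNN from torchvision; the specific backbone and
# weights stand in for whatever was used in practice (an assumption).
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def segment_frame(frame):
    """frame: float tensor of shape (3, H, W), values scaled to [0, 1]."""
    pred = model([frame])[0]
    # Each prediction carries bounding boxes, class labels, confidence
    # scores, and per-instance soft masks for the detected instruments.
    return pred["boxes"], pred["labels"], pred["scores"], pred["masks"]
```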
In order to track surgical instruments, the Mask R-CNN model has to be explicitly trained on the various types of surgical instruments. This requires annotated data that identify not only the physical bounds of each instrument but also its type. Six trained raters annotated these data, which we then used to fine-tune the Mask R-CNN model. During fine-tuning, the model learns to distinguish between different instruments while simultaneously delineating their boundaries.
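Continuing the sketch above, fine-tuning amounts to swapping the prediction heads for our instrument classes before training on the annotated frames; the class count below is hypothetical:

```python
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 8  # hypothetical: background + 7 instrument types

# Replace the box-classification head so it predicts our instrument classes.
in_box = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_box, NUM_CLASSES)

# Replace the mask head likewise, keeping the standard 256 hidden channels.
in_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_mask, 256, NUM_CLASSES)

# The modified model is then trained on the rater-annotated frames.
```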
A tracking algorithm follows the identified instruments through time. It takes as input the frame-level results of the preceding model and conditions them on the characteristics of segments identified in previous frames. Because instance segmentation models can produce incorrect detections given the complexity of the task, the tracking algorithm also serves as a post-processing step that filters out misclassified instruments. The algorithm is tunable depending on the desired precision-recall trade-off; since we are interested in tracking instruments through time, we tune it to maximize precision.
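The tracker is not specified in detail, so the following is only a plausible sketch: a greedy matcher that keeps high-confidence detections and links them to same-class tracks by intersection-over-union. The `score_thresh` and `iou_thresh` parameters are assumptions and are the knobs for the precision-recall trade-off mentioned above:

```python
import numpy as np

def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def update_tracks(tracks, detections, score_thresh=0.9, iou_thresh=0.5):
    """Greedily match one frame's detections to existing tracks.

    Raising score_thresh discards low-confidence detections,
    trading recall for precision.
    """
    kept = [d for d in detections if d["score"] >= score_thresh]
    for det in kept:
        best, best_iou = None, iou_thresh
        for tr in tracks:
            if tr["label"] != det["label"]:
                continue  # never merge different instrument classes
            overlap = iou(tr["box"], det["box"])
            if overlap > best_iou:
                best, best_iou = tr, overlap
        if best is not None:
            best["box"] = det["box"]  # extend the existing track
            best["hits"] += 1
        else:
            tracks.append({"label": det["label"], "box": det["box"], "hits": 1})
    return tracks
```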
Depending on the type of surgery, various instruments are visible throughout the course of the procedure, and some commonly used instruments, such as the bowel grasper, appear numerous times. To handle such scenarios, we average the contributions of multiple detections of the same instrument (see the sketch following the list below). In each procedure, only one operator completes the entirety of the task (Calot, dissection, needle passing, suturing, or knot tying). The EndoVis19 dataset is a special case in which an assistant might hold a certain instrument or keep the camera in focus. We choose to consider the contributions of both the primary and (potentially) secondary operators in unison for two reasons:
- The GRS score does not differentiate between multiple operators and is a collective score.
- The datasets do not include hand-over points, and without these points it is not possible to delineate the contributions of a potential collaborator.
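A minimal sketch of this pooling step, assuming per-track motion metrics have already been computed; the column names and values here are hypothetical:

```python
import pandas as pd

# det_df: one row per tracked instrument instance, carrying the motion
# metrics computed for that track (hypothetical columns and values).
det_df = pd.DataFrame({
    "instrument": ["grasper", "grasper", "hook"],
    "economy_of_motion": [0.61, 0.58, 0.72],
    "tremor": [0.12, 0.15, 0.08],
})

# Average repeated detections of the same instrument, then pool all
# instruments (primary and secondary operators alike) into one vector.
per_instrument = det_df.groupby("instrument").mean(numeric_only=True)
case_vector = per_instrument.mean()
```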
Table 1
Prediction results on the JIGSAWS dataset, reported as mean absolute error (MAE) with standard deviations across folds in parentheses. The random forest model consistently outperforms the other models.

| Category | Linear regression | Naive Bayes | Lasso | Random forest |
| --- | --- | --- | --- | --- |
| Time and motion | 0.750 (0.050) | 0.802 (0.119) | 0.880 (0.017) | 0.670 (0.055) |
| Respect for tissue | 0.800 (0.064) | 0.922 (0.024) | 0.779 (0.000) | 0.760 (0.073) |
| Flow of operation | 0.736 (0.067) | 0.856 (0.297) | 0.793 (0.081) | 0.719 (0.066) |
| Suture needle handling | 0.738 (0.056) | 0.904 (0.053) | 0.838 (0.025) | 0.680 (0.074) |
Table 2
Prediction results on the EndoVis19 dataset, reported as mean absolute error (MAE) with standard deviations across folds in parentheses. Both the random forest and Naive Bayes models perform well on the EndoVis19 dataset.

| Category | Linear regression | Naive Bayes | Lasso | Random forest |
| --- | --- | --- | --- | --- |
| Tissue handling | 0.800 (0.099) | 0.686 (0.064) | 0.820 (0.040) | 0.766 (0.084) |
| Efficiency | 0.647 (0.033) | 0.627 (0.066) | 0.753 (0.013) | 0.641 (0.040) |
| Bi-manual dexterity | 0.453 (0.102) | 0.519 (0.063) | 0.653 (0.007) | 0.421 (0.094) |
| Depth perception | 0.240 (0.038) | 0.240 (0.000) | 0.273 (0.033) | 0.207 (0.072) |
Table 3
Consensus-based system to determine feature polarities. A '+' indicates a positive correlation between the feature and performance, whereas a '-' indicates a negative correlation.

| Feature | JIGSAWS | EndoVis19 | Literature | Final |
| --- | --- | --- | --- | --- |
| Fine-motor reactivity | - | + | + | + |
| Control of pace | + | + | + | + |
| Consistency of placement | - | + | + | + |
| Fine-motor precision | + | + | + | + |
| Economy of motion | - | - | - | - |
| Fluidity | + | + | + | + |
| Tremor | - | - | - | - |
| Disorder | - | - | - | - |
| Predictability | + | - | + | + |
| Inertia | - | - | - | - |
| Bi-manual dexterity | + | + | + | + |
Model development and validation
Each video is run through the framework presented in Fig. 1, the output of which is a set of carefully constructed metrics (Table 3). These metrics are a distilled representation of the input video and contain everything needed for downstream processing. This distilled representation, or feature vector, is used to train the machine learning models.
We trained a random forest classifier on the provided data, which helps us understand the contribution of each calculated metric to an individual's performance. We also hypothesized that the positive or negative correlation of each metric is only meaningful if the model results are acceptable. Tables 1 and 2 contain the performance of the model on the two datasets. We ran 10-fold cross-validation on both datasets and report the mean of the metrics across all folds. Gini importances were extracted from the random forest model, and the resulting polarities are presented in Table 3 [48].
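A sketch of this training and evaluation loop with scikit-learn, using stand-in data in place of the real feature vectors and GRS labels (the array shapes and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((103, 11))         # stand-in: one row of Table 3 metrics per video
y = rng.integers(1, 6, size=103)  # stand-in: GRS sub-scores on the 1-5 Likert scale

# 10-fold cross-validation; report mean MAE (and spread) across folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mae = []
for train_idx, test_idx in cv.split(X):
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], rf.predict(X[test_idx])))
print(f"MAE: {np.mean(fold_mae):.3f} ({np.std(fold_mae):.3f})")

# Gini importances: one value per metric, used to derive the Table 3 polarities.
gini = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y).feature_importances_
```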
We validate our approach using video data from Surgical Safety Technologies (SST) and the University of Texas Southwestern (UTSW). The SST data consist of 102 laparoscopic procedures, whereas the UTSW data consist of 133 robotic procedures. Each video is annotated with performance data (OSATS), which we use to separate the top 10% and bottom 10% of surgeons under two distinct evaluation criteria. This step illustrates the model's ability to generalize to different types of surgery, different annotation frameworks, and different cohorts.
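The decile split itself is straightforward; a sketch with stand-in scores:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(102)  # stand-in: one aggregate OSATS score per SST procedure

lo, hi = np.percentile(scores, [10, 90])
bottom_decile = scores <= lo  # boolean mask: bottom 10% of surgeons
top_decile = scores >= hi     # boolean mask: top 10% of surgeons
```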
Statistical analysis
Tables 1 and 2, and the corresponding polarities (Table 3), were created using the approach described above. The Kolmogorov-Smirnov test was used to determine the distribution to which a sample or a set of samples might belong. Metrics extracted from the time-series data of each unique surgical procedure were mapped to distributions such as the Gaussian, Gumbel, or Cauchy. The radar plots in this report were constructed by presenting the percentile of each metric in each case. The values are normalized across the population and are thus bounded between 0 and 1.
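A sketch of the distribution-fitting and normalization steps with SciPy, using stand-in data; the candidate families follow the three named above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=200)  # stand-in: one metric's values across all procedures

# Fit each candidate family, then keep the one the KS test rejects least.
candidates = ["norm", "gumbel_r", "cauchy"]  # Gaussian, Gumbel, Cauchy
best_name, best_p = None, -np.inf
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(sample)
    _, p = stats.kstest(sample, name, args=params)
    if p > best_p:
        best_name, best_p = name, p

# Percentile normalization for the radar plots: values bounded in [0, 1].
percentiles = stats.rankdata(sample) / len(sample)
```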