Datasets
The Johns Hopkins University–Intuitive Surgical Gesture and Skill Assessment Working Set (JIGSAWS) consists of 103 videos of curated table-top surgical setups and includes kinematic measurements (i.e., joint articulations and velocities) from 8 surgeons performing 4 to 5 trials of each of 3 surgical tasks: knot tying, needle passing, and suturing. All participating surgeons provided written informed consent. The data were captured using the da Vinci Surgical System (Intuitive Surgical) and come with manually annotated labels corresponding to performance scores defined by a modified version of the Objective Structured Assessment of Technical Skill (OSATS), specifically the Global Rating Score (GRS). The GRS excludes certain categories, such as use of assistants, because each clip depicts a surgeon completing a short procedure in a controlled environment where assistance is not available. The GRS uses a Likert scale with values ranging from 1 to 5 for respect for tissue, suturing and needle handling, time and motion, flow of operation, overall performance, and quality of product. This dataset was collected as a collaboration between Johns Hopkins University and Intuitive Surgical within an institutional review board–approved study and has been released for public use [39].
The EndoVis19 dataset was released as part of the Endoscopic Vision Challenge 2019 [40]. It consists of 22 full-length videos of cholecystectomy procedures with annotated steps and a score for each significant step of the surgery. The Calot triangle and dissection phases were included in this analysis. The Global Operative Assessment of Laparoscopic Skills (GOALS) rubric was used for the surgical skill annotations [46].
Feature selection and data preprocessing
A deep learning instance segmentation model (Mask R-CNN) identifies the visible bounds, type, and quantity of surgical instruments in each processed frame [46, 47]. The resulting features are then tracked over time and recorded as temporal features for downstream processing. For this work, we track the instrument tip, which allows us to capture high-level characteristics that may correlate with the operator's level of surgical skill.
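As an illustration, a minimal per-frame inference sketch using torchvision's off-the-shelf Mask R-CNN; the exact backbone and weights used in this work are not specified here, so both are assumptions:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained Mask R-CNN from torchvision; the specific backbone and
# weights stand in for whatever was used in practice (an assumption).
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def segment_frame(frame):
    """frame: float tensor of shape (3, H, W), values scaled to [0, 1]."""
    pred = model([frame])[0]
    # Each prediction carries bounding boxes, class labels, confidence
    # scores, and per-instance soft masks for the detected instruments.
    return pred["boxes"], pred["labels"], pred["scores"], pred["masks"]
```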
In order to track surgical instruments, the Mask R-CNN model has to be explicitly trained on the various types of surgical instruments. This requires annotated data that identify not only the physical bounds of each instrument but also its type. Six trained raters annotated these data, which we then used to fine-tune the Mask R-CNN model. During fine-tuning, the model learns to distinguish between different instruments while simultaneously delineating their boundaries.
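Continuing the sketch above, fine-tuning amounts to swapping the prediction heads for our instrument classes before training on the annotated frames; the class count below is hypothetical:

```python
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 8  # hypothetical: background + 7 instrument types

# Replace the box-classification head so it predicts our instrument classes.
in_box = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_box, NUM_CLASSES)

# Replace the mask head likewise, keeping the standard 256 hidden channels.
in_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_mask, 256, NUM_CLASSES)

# The modified model is then trained on the rater-annotated frames.
```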
A tracking algorithm follows the identified instruments through time. It takes as input the frame-level results of the preceding model and conditions them on the characteristics of segments identified in previous frames. Because instance segmentation models can produce incorrect detections given the complexity of the task, the tracking algorithm also serves as a post-processing step that filters out misclassified instruments. The algorithm is tunable depending on the desired precision-recall trade-off; since we are interested in tracking instruments through time, we tune it to maximize precision.
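The tracker is not specified in detail, so the following is only a plausible sketch: a greedy matcher that keeps high-confidence detections and links them to same-class tracks by intersection-over-union. The `score_thresh` and `iou_thresh` parameters are assumptions and are the knobs for the precision-recall trade-off mentioned above:

```python
import numpy as np

def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def update_tracks(tracks, detections, score_thresh=0.9, iou_thresh=0.5):
    """Greedily match one frame's detections to existing tracks.

    Raising score_thresh discards low-confidence detections,
    trading recall for precision.
    """
    kept = [d for d in detections if d["score"] >= score_thresh]
    for det in kept:
        best, best_iou = None, iou_thresh
        for tr in tracks:
            if tr["label"] != det["label"]:
                continue  # never merge different instrument classes
            overlap = iou(tr["box"], det["box"])
            if overlap > best_iou:
                best, best_iou = tr, overlap
        if best is not None:
            best["box"] = det["box"]  # extend the existing track
            best["hits"] += 1
        else:
            tracks.append({"label": det["label"], "box": det["box"], "hits": 1})
    return tracks
```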
Depending on the type of surgery, various instruments are visible throughout the course of the procedure, and some commonly used instruments, such as the bowel grasper, appear numerous times. To handle such scenarios, we average the contributions of multiple detections of the same instrument (see the sketch following the list below). In each procedure, only one operator completes the entirety of the task (Calot, dissection, needle passing, suturing, or knot tying). The EndoVis19 dataset is a special case in which an assistant might hold a certain instrument or keep the camera in focus. We choose to consider the contributions of both the primary and (potentially) secondary operators in unison for two reasons:
- The GRS score does not differentiate between multiple operators and is a collective score.
- The datasets do not include hand-over points, and without these points it is not possible to delineate the contributions of a potential collaborator.
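A minimal sketch of this pooling step, assuming per-track motion metrics have already been computed; the column names and values here are hypothetical:

```python
import pandas as pd

# det_df: one row per tracked instrument instance, carrying the motion
# metrics computed for that track (hypothetical columns and values).
det_df = pd.DataFrame({
    "instrument": ["grasper", "grasper", "hook"],
    "economy_of_motion": [0.61, 0.58, 0.72],
    "tremor": [0.12, 0.15, 0.08],
})

# Average repeated detections of the same instrument, then pool all
# instruments (primary and secondary operators alike) into one vector.
per_instrument = det_df.groupby("instrument").mean(numeric_only=True)
case_vector = per_instrument.mean()
```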
Table 1
Prediction results on the JIGSAWS dataset, reported as mean absolute error (MAE) with standard deviations across folds in parentheses. The random forest model consistently outperforms the other models.

| Category | Linear regression | Naive Bayes | Lasso | Random forest |
| --- | --- | --- | --- | --- |
| Time and motion | 0.750 (0.050) | 0.802 (0.119) | 0.880 (0.017) | 0.670 (0.055) |
| Respect for tissue | 0.800 (0.064) | 0.922 (0.024) | 0.779 (0.000) | 0.760 (0.073) |
| Flow of operation | 0.736 (0.067) | 0.856 (0.297) | 0.793 (0.081) | 0.719 (0.066) |
| Suture needle handling | 0.738 (0.056) | 0.904 (0.053) | 0.838 (0.025) | 0.680 (0.074) |
Table 2
Prediction results on the EndoVis19 dataset, reported as mean absolute error (MAE) with standard deviations across folds in parentheses. Both the random forest and Naive Bayes models perform well on the EndoVis19 dataset.

| Category | Linear regression | Naive Bayes | Lasso | Random forest |
| --- | --- | --- | --- | --- |
| Tissue handling | 0.800 (0.099) | 0.686 (0.064) | 0.820 (0.040) | 0.766 (0.084) |
| Efficiency | 0.647 (0.033) | 0.627 (0.066) | 0.753 (0.013) | 0.641 (0.040) |
| Bi-manual dexterity | 0.453 (0.102) | 0.519 (0.063) | 0.653 (0.007) | 0.421 (0.094) |
| Depth perception | 0.240 (0.038) | 0.240 (0.000) | 0.273 (0.033) | 0.207 (0.072) |
Table 3
Consensus-based system to determine feature polarities. A '+' indicates a positive correlation between the feature and performance, whereas a '-' indicates a negative correlation.

| Feature | JIGSAWS | EndoVis19 | Literature | Final |
| --- | --- | --- | --- | --- |
| Fine-motor reactivity | - | + | + | + |
| Control of pace | + | + | + | + |
| Consistency of placement | - | + | + | + |
| Fine-motor precision | + | + | + | + |
| Economy of motion | - | - | - | - |
| Fluidity | + | + | + | + |
| Tremor | - | - | - | - |
| Disorder | - | - | - | - |
| Predictability | + | - | + | + |
| Inertia | - | - | - | - |
| Bi-manual dexterity | + | + | + | + |
Model development and validation
Each video is run through the framework presented in Fig. 1, the output of which is a set of carefully constructed metrics (Table 3). These metrics are a distilled representation of the input video and contain everything needed for downstream processing. This distilled representation, or feature vector, is used to train the machine learning models.
We trained a random forest classifier on the provided data, which helps us understand the contribution of each calculated metric to an individual's performance. We also hypothesized that the positive or negative correlation of each metric is only meaningful if the model results are acceptable. Tables 1 and 2 contain the performance of the model on the two datasets. We ran 10-fold cross-validation on both datasets and report the mean of the metrics across all folds. Gini importances were extracted from the random forest model, and the resulting polarities are presented in Table 3 [48].
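A sketch of this training and evaluation loop with scikit-learn, using stand-in data in place of the real feature vectors and GRS labels (the array shapes and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((103, 11))         # stand-in: one row of Table 3 metrics per video
y = rng.integers(1, 6, size=103)  # stand-in: GRS sub-scores on the 1-5 Likert scale

# 10-fold cross-validation; report mean MAE (and spread) across folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mae = []
for train_idx, test_idx in cv.split(X):
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], rf.predict(X[test_idx])))
print(f"MAE: {np.mean(fold_mae):.3f} ({np.std(fold_mae):.3f})")

# Gini importances: one value per metric, used to derive the Table 3 polarities.
gini = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y).feature_importances_
```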
We validate our approach using video data from Surgical Safety Technologies (SST) and the University of Texas Southwestern (UTSW). The SST data consist of 102 laparoscopic procedures, whereas the UTSW data consist of 133 robotic procedures. Each video is annotated with performance data (OSATS), which we use to separate the top 10% and bottom 10% of surgeons under two distinct evaluation criteria. This step illustrates the model's ability to generalize to different types of surgery, different annotation frameworks, and different cohorts.
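The decile split itself is straightforward; a sketch with stand-in scores:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(102)  # stand-in: one aggregate OSATS score per SST procedure

lo, hi = np.percentile(scores, [10, 90])
bottom_decile = scores <= lo  # boolean mask: bottom 10% of surgeons
top_decile = scores >= hi     # boolean mask: top 10% of surgeons
```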
Statistical analysis
Tables 1 and 2, and the corresponding polarities (Table 3), were created using the approach described above. The Kolmogorov-Smirnov test was used to determine the distribution to which a sample or a set of samples might belong. Metrics extracted from the time-series data of each unique surgical procedure were mapped to distributions such as the Gaussian, Gumbel, or Cauchy. The radar plots in this report were constructed by presenting the percentile of each metric in each case. The values are normalized across the population and are thus bounded between 0 and 1.
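A sketch of the distribution-fitting and normalization steps with SciPy, using stand-in data; the candidate families follow the three named above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=200)  # stand-in: one metric's values across all procedures

# Fit each candidate family, then keep the one the KS test rejects least.
candidates = ["norm", "gumbel_r", "cauchy"]  # Gaussian, Gumbel, Cauchy
best_name, best_p = None, -np.inf
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(sample)
    _, p = stats.kstest(sample, name, args=params)
    if p > best_p:
        best_name, best_p = name, p

# Percentile normalization for the radar plots: values bounded in [0, 1].
percentiles = stats.rankdata(sample) / len(sample)
```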