The overall architecture of our approach for detecting hand hygiene actions in the OR is shown in Fig. 3. Given a video clip, each member of the anesthesia personnel appearing in it was detected using a per-frame 2D person detector, Faster R-CNN.[34] The regions of interest (ROIs) covering the upper body of each person were automatically cropped, and spatial extent fixing was applied to them. The ROIs were then fed into a 3D CNN, I3D,[27] and classified into three categories: hand hygiene, touching equipment, and other actions. When the task was evaluated as a binary classification of hand hygiene actions, the touching-equipment action was treated as part of the “other actions” class.
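As a rough sketch of this pipeline, the per-clip inference can be organized as follows; the helper functions (build_tubelets, crop_and_fix_extent) and the detector and classifier callables are hypothetical placeholders for the steps detailed in the subsections below, not the actual implementation.

```python
# Minimal sketch of the per-clip inference pipeline (hypothetical helpers;
# each step is described in the corresponding subsection of the Methods).
CLASSES = ["hand hygiene", "touching equipment", "other actions"]

def detect_hand_hygiene(frames, person_detector, i3d_classifier):
    """frames: list of RGB frames (H x W x 3) from one video clip."""
    # 1) Per-frame 2D person detection (Faster R-CNN).
    boxes_per_frame = [person_detector(frame) for frame in frames]
    # 2) Link the per-frame detections into per-person action tubelets.
    tubelets = build_tubelets(boxes_per_frame)
    predictions = []
    for tube in tubelets:
        # 3) Upper body cropping and spatial extent fixing of the ROI.
        roi_clip = crop_and_fix_extent(frames, tube)
        # 4) 3D CNN (I3D) classification into the three action classes.
        probabilities = i3d_classifier(roi_clip)
        predictions.append(CLASSES[int(probabilities.argmax())])
    return predictions
```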
Data Acquisition
The videos were collected over a period of four months (November 2018 to February 2019) in a single OR at Asan Medical Center, South Korea. Video recording was conducted throughout multiple consecutive surgeries in the aforementioned OR every day. An Intel® RealSense™ D415 camera was used during the recordings, with only the RGB stream retained. The videos were recorded at 15 frames per second (fps) at a resolution of 640 × 480. Because the collected dataset was insufficient, additional simulated data were recorded and added to it; the deficiencies can be attributed to missing annotations and the confined field of view of the camera. We recorded an hour-long video under a simulated situation closely approximating actual surgical conditions, in which eight non-medical personnel, accompanied by an anesthesiologist, repeatedly rubbed their hands with hand sanitizer from dispensers installed within the OR. The camera was installed on one side of the walls and was verified to capture most of the activities performed around the OR table. Figure 4 depicts the locations of the camera and the alcohol-based hand sanitizers in the OR. The Institutional Review Board for Human Investigations at Asan Medical Center approved this retrospective study with a waiver of informed consent.
Bounding Box Annotations on the Anesthesia Personnel
We used bounding boxes to represent the locations of the anesthesia personnel as ground truth, following a hybrid approach similar to the person annotation process in the AVA dataset.[30] To simplify the process, a baseline set of bounding boxes was generated using a deep learning-based object segmentation tool instead of drawing them from scratch. Mask R-CNN[35] was used to generate the initial set of bounding box predictions, and only predictions with a confidence score exceeding 0.9 were kept to reduce the number of false positives. Incorrectly defined bounding boxes were subsequently corrected by two human annotators using Labelme,[36] a graphical image annotation tool.
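The initial box generation can be sketched with the Mask R-CNN implementation available in torchvision (version 0.13 or later is assumed for the weights argument); the 0.9 score filter and the restriction to the COCO person class follow the procedure above, while the function name and file handling are illustrative.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained Mask R-CNN from torchvision; COCO label 1 corresponds to "person".
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def initial_person_boxes(image_path, score_threshold=0.9):
    """Return high-confidence person boxes [x1, y1, x2, y2] for one frame."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        prediction = model([image])[0]
    keep = (prediction["labels"] == 1) & (prediction["scores"] > score_threshold)
    return prediction["boxes"][keep].tolist()
```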
Action Annotation on Anesthesiologists
OR scenes before and after surgery, during which the anesthesiologists were most intensively involved, were selected from the untrimmed videos. The average duration of these scenes was 5.7 min. They were then divided into shorter video clips of 1 to 22 s (16 to 323 frames) depicting actions related to hand hygiene protocols (rubbing hands, wearing/removing gloves, and touching equipment). Actions not included in the aforementioned categories were annotated as other actions. Each video clip contained one to six actions, with a mean and standard deviation of 2.82 and 1.13, respectively. Detailed statistics of these video clips are presented in the Statistics of the Data Used section.
Detection of the Anesthesia Personnel
For this task, we utilized a 2D person detector trained on a large dataset, Microsoft COCO,[37] instead of training it on the OR dataset. A well-known object detection network, Faster R-CNN,[34] which exhibits high accuracy and low error rates on several benchmark datasets, was used in this study. Given a video clip comprising a sequence of L frames, the person detector output the bounding boxes of the individuals detected in each frame. The detection accuracy of the 2D person detector was estimated by comparing its outputs with ground truth bounding boxes annotated by human annotators. The frame-level performance of Faster R-CNN was also evaluated in terms of average precision (AP) and average recall (AR), measured following the PASCAL VOC[38] and COCO[37] evaluation protocols. In the PASCAL VOC evaluation, a detection is deemed correct if its intersection over union (IoU) with a ground truth box exceeds the threshold of 0.5. In the COCO evaluation, the metric is averaged over 10 IoU thresholds between 0.5 and 0.95 at intervals of 0.05. The maximum number of detections was set to 10 when calculating AR.
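The matching criterion underlying both protocols can be made concrete with a small sketch; the full AP/AR computation over the dataset follows the official PASCAL VOC and COCO toolkits rather than this simplified snippet.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# PASCAL VOC criterion: a detection is correct if its IoU with a ground truth
# box exceeds 0.5.
def is_correct_voc(detection, ground_truth, threshold=0.5):
    return iou(detection, ground_truth) > threshold

# COCO criterion: the metric is averaged over 10 IoU thresholds,
# 0.50 to 0.95 in steps of 0.05.
COCO_THRESHOLDS = np.arange(0.5, 1.0, 0.05)
```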
Action Linking
Given the bounding boxes across the L frames, action tubelets were created by linking the detected person instances at the frame level throughout each video clip. Links were drawn between detected instances of the same person across the L frames corresponding to a single continuous action. The IoUs between the bounding boxes of every pair of adjacent frames were computed, and each detected person instance was linked to the detection in the subsequent frame exhibiting the highest IoU. False assignments were manually corrected after the linking process was completed. In the experiments using the 2D person detector, false assignments were ignored, and missing detections among the L frames were filled in via linear interpolation.
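A minimal sketch of this greedy IoU-based linking and of the linear interpolation of missed detections is given below; the iou helper from the previous snippet is reused, and the manual correction step is omitted.

```python
import numpy as np

def link_tubelet(boxes_per_frame, start_box):
    """Greedily link one person instance across the L frames by highest IoU."""
    tube, current = [start_box], start_box
    for candidates in boxes_per_frame[1:]:
        if candidates:
            # Assign the detection in the next frame with the highest IoU.
            current = max(candidates, key=lambda box: iou(current, box))
            tube.append(current)
        else:
            tube.append(None)  # missed detection, filled in afterwards
    return fill_missing(tube)

def fill_missing(tube):
    """Linearly interpolate missing boxes from the nearest detected frames."""
    boxes = np.array([b if b is not None else [np.nan] * 4 for b in tube], float)
    for c in range(4):  # interpolate each coordinate independently
        column, missing = boxes[:, c], np.isnan(boxes[:, c])
        if missing.any() and (~missing).any():
            column[missing] = np.interp(np.flatnonzero(missing),
                                        np.flatnonzero(~missing),
                                        column[~missing])
    return boxes.tolist()
```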
Upper Body Cropping and Temporal Smoothing
Two techniques were used to reduce the spatial discrepancy across the L frames in the action tubelets. First, upper body cropping was performed on the bounding boxes: for boxes depicting a full-body region, that is, those with a height-to-width ratio greater than 1.5, the bottom portion was cropped out. Second, spatial extent fixing was conducted on each action tubelet by selecting the 3D bounding box that completely encloses it, obtained from the extrema of the per-frame box coordinates along the X and Y axes. This guaranteed that each of the L frames within a video clip shared the same spatial extent. The effect of each technique was evaluated and compared in terms of video mean AP (v-mAP); an action detection was deemed correct if its label matched the ground truth and the overlap of the detected tubelet with the ground-truth tubelet, averaged over the frames of the clip, exceeded 0.2.
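These two steps can be sketched as follows; the 1.5 height-to-width criterion and the extrema-based enclosing box follow the description above, whereas the exact amount of the lower body removed is not specified in the text, so retaining a height of 1.5 times the box width is shown purely as an assumption.

```python
def crop_upper_body(box, max_ratio=1.5):
    """Crop out the bottom of full-body boxes (height-to-width ratio > 1.5).
    The retained height (max_ratio * width) is an illustrative assumption."""
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width > 0 and height / width > max_ratio:
        y2 = y1 + max_ratio * width
    return [x1, y1, x2, y2]

def fix_spatial_extent(tube):
    """Enclosing box of a tubelet from the extrema of its per-frame boxes along
    the X and Y axes, so that every frame shares the same spatial extent."""
    xs1, ys1, xs2, ys2 = zip(*tube)
    return [min(xs1), min(ys1), max(xs2), max(ys2)]
```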
Statistics of the Data Used
A total of 45 untrimmed videos, with an average duration of eight hours, depicting the target action (rubbing of hands) were acquired. Each untrimmed video included one to eight occurrences of the target action. In aggregate, the videos comprised 309 clips with 873 action tubelets, of which 252, 109, and 512 were labeled as hand hygiene, touching equipment, and other actions, respectively. The mean (median) numbers of image frames in clips depicting actions from the three aforementioned categories were 30.16 (22.0), 51.15 (34.0), and 36.69 (25.0), respectively, as shown in Fig. 5. We divided the video clips in a 6:2:2 ratio to form the training, validation, and testing datasets, respectively. No video clips recorded on the same date were shared across the datasets. The class ratio among the action tubelets was maintained through temporal augmentation.
Action Detection
We trained a 3D CNN on the action tubelets in each video clip for hand hygiene action classification. To this end, I3D,[27] an Inception-v1 architecture[39] that inflates 2D convolutions into 3D ones, was selected. We took advantage of a network pretrained on a large-scale video dataset, Kinetics-400, which comprises 400 categories, including nine hand-related human actions: air drumming, applauding, clapping, nail trimming, doing nails, finger drumming, finger snapping, fist pumping, and hand washing. We adopted a two-stream design as suggested in [27]. The two networks for RGB and optical flow inputs were jointly trained on the training dataset. We then conducted an ablation study to explore the effect of the optical flow stream by comparing the performances of three models: RGB, optical flow (Flow), and the joint model (RGB + Flow). The RGB model served as the baseline against which the optical flow model and the combined model were compared. All methods were evaluated on each video clip in the testing dataset. The optical flow was computed using a TV-L1 algorithm implemented in the OpenCV library. The pretrained I3D was fine-tuned by appending an additional 1 × 1 × 1 convolution after the final convolutional layer, with a softmax output for the three-way classification of the action categories: hand hygiene, touching equipment, and other actions. All layers preceding the average pooling layer were frozen during training. The number of input frames was set to 16.
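The TV-L1 flow computation can be reproduced with the OpenCV contrib module as sketched below; the clipping and rescaling of the flow values is a common I3D preprocessing convention assumed here, not a detail stated in the text.

```python
import cv2
import numpy as np

# TV-L1 optical flow from the OpenCV contrib module (opencv-contrib-python).
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def tvl1_flow(prev_rgb, next_rgb, bound=20.0):
    """Compute TV-L1 flow between two consecutive RGB frames.
    Values are clipped to [-bound, bound] and rescaled to [-1, 1]
    (an assumed convention for the I3D flow stream)."""
    prev_gray = cv2.cvtColor(prev_rgb, cv2.COLOR_RGB2GRAY)
    next_gray = cv2.cvtColor(next_rgb, cv2.COLOR_RGB2GRAY)
    flow = tvl1.calc(prev_gray, next_gray, None)  # H x W x 2 (dx, dy)
    return np.clip(flow, -bound, bound) / bound
```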
Implementation Details
Training was performed using standard SGD with the momentum set to 0.9, with synchronous parallelization across two NVIDIA® Tesla® P40 GPUs. We trained the models for 100 epochs with a learning rate of 1e-3 and a weight decay of 1e-5. A class weight of 0.75 was set for the "other actions" class and 1.0 for the "hand hygiene" and "touching equipment" classes.
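Under these settings, the optimizer and the class-weighted loss can be set up as in the following PyTorch sketch; the I3D model object and the data loader are assumed to exist (the text does not name a specific framework), and the class order matches hand hygiene, touching equipment, and other actions.

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=100):
    """Training loop under the stated settings; `model` is a fine-tuned I3D with
    a 3-way output and `train_loader` yields (clip, label) batches (both assumed)."""
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-3, momentum=0.9, weight_decay=1e-5)
    # Class weights: 1.0 for hand hygiene and touching equipment,
    # 0.75 for other actions.
    criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0, 0.75]))
    for _ in range(num_epochs):
        for clips, labels in train_loader:  # clips: N x C x 16 x 224 x 224
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
```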
Data augmentation was performed both spatially and temporally. For spatial augmentation, multi-scale cropping was applied around the bounding box on the fly. The scale was randomly chosen between 0.75 and 1.10 times the original size so that the original bounding boxes were included. In addition, random horizontal flipping with a probability of 0.5 and brightness jittering in the range of −0.1 to 0.1 were applied. All inputs were rescaled to 224 pixels and center-cropped. For temporal augmentation, we adopted a sliding-window scheme with a class-dependent stride. Strides of 1, 1, and 3, inversely proportional to the average number of image frames per class, were used for hand hygiene, touching equipment, and other actions, respectively.
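A sketch of this augmentation scheme is given below; the exact cropping and jittering implementation and the handling of image borders are assumptions, while the scale range, flip probability, jitter range, and per-class strides follow the values stated above.

```python
import random
import numpy as np

STRIDE = {"hand hygiene": 1, "touching equipment": 1, "other actions": 3}

def spatial_augment(frame, box):
    """Multi-scale crop around the bounding box (0.75 to 1.10x), random
    horizontal flip (p = 0.5), and brightness jitter in [-0.1, 0.1]; resizing to
    224 pixels and center-cropping are assumed to be handled by the input
    transform."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    scale = random.uniform(0.75, 1.10)
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    crop = frame[int(max(cy - h / 2, 0)):int(cy + h / 2),
                 int(max(cx - w / 2, 0)):int(cx + w / 2)]
    if random.random() < 0.5:
        crop = crop[:, ::-1]                    # horizontal flip
    jitter = random.uniform(-0.1, 0.1)          # brightness jitter
    return np.clip(crop.astype(np.float32) / 255.0 + jitter, 0.0, 1.0)

def temporal_windows(num_frames, label, window=16):
    """Sliding 16-frame windows with a class-dependent stride."""
    if num_frames <= window:
        return [(0, num_frames)]
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, STRIDE[label])]
```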