Temporal-based Swin Transformer network for workflow recognition of surgical video

Surgical workflow recognition has emerged as an important part of computer-assisted intervention systems for the modern operating room, and it remains a very challenging problem. Although CNN-based approaches achieve excellent performance, they do not learn global, long-range semantic interactions well because of the inductive bias inherent in convolution. In this paper, we propose a temporal-based Swin Transformer network (TSTNet) for the surgical video workflow recognition task. TSTNet contains two main parts: the Swin Transformer and the LSTM. The Swin Transformer incorporates the attention mechanism to encode remote dependencies and learn highly expressive representations. The LSTM is capable of learning long-range dependencies and is used to extract temporal information. TSTNet combines the two components organically to extract spatiotemporal features that contain richer contextual information. In particular, based on a thorough understanding of the natural characteristics of surgical video, we propose a prior revision algorithm (PRA) that uses prior information about the order of surgical phases. This strategy optimizes the output of TSTNet and further improves recognition performance. We conduct extensive experiments on the Cholec80 dataset to validate the effectiveness of the TSTNet-PRA method. Our method achieves excellent performance on Cholec80, with an accuracy of up to 92.8%, clearly exceeding state-of-the-art methods. By modelling long-range temporal information and multi-scale visual information, the proposed TSTNet-PRA shows a recognition capability superior to other spatiotemporal networks on a large public dataset.


Introduction
Laparoscopic surgery (LS), a minimally invasive surgery, improves the quality of patient treatment and provides the opportunity to record surgical videos [1]. These videos can be further used for documentation in surgical reports, for skills assessment of surgeons and for training. LS usually takes several hours to complete, and it is very difficult and time-consuming for the surgeon to analyse and manually index the surgical procedures [2]. Computer-aided intervention (CAI) systems provide automatically retrievable, real-time objective information to help solve these problems [3]. Automatic recognition of surgical phases is an important task in the field of CAI, and the ability to objectively recognize surgical phases from recorded surgical videos is essential for improving surgeon efficiency and patient safety. However, purely vision-based recognition is quite difficult due to similar inter-class appearance and the blurring of the recorded video scenes [4].
Temporal information has proven to be an important cue for various surgical video analysis tasks [5,6]. Early methods of surgical workflow recognition used statistical models, such as hidden Markov models [7] and conditional random fields [8]. However, these methods rely on pre-defined linear models, and their limited representation capability makes it difficult to handle the complex temporal relationships between surgical frames [9]. Jin et al. [10] proposed an end-to-end recurrent convolutional model called SV-RCNet to optimize the spatiotemporal feature representation process; SV-RCNet uses an LSTM [11] network to extract temporal features. On this basis, Jin et al. [12] proposed a multi-task recurrent convolutional network (MTRCNet) that employs a correlation loss to enhance the synergy between tool and phase prediction. Yi and Jiang [13] proposed an online hard frame mapper (OHFM) based on ResNet [14] and LSTM to handle hard frames. Lea et al. [15] designed a piecewise spatiotemporal CNN that uses a variant of VGG [16] as the spatial component and a one-dimensional convolution filter as the temporal component. Twinanda et al. [17] proposed the EndoNet model for extracting visual features based on AlexNet [18]. Later, Twinanda [19] replaced the HMM of EndoNet with an LSTM to enhance the network's ability to model temporal information. Jin et al. [20] proposed a time memory relation network (TMRNet) that integrates multi-scale LSTM outputs via non-local operations. Although these surgical phase recognition methods have achieved excellent performance, most existing methods can only extract temporal and spatial features between successive frames and are unable to capture cross-frame dependencies. Moreover, due to the inductive biases inherent in the convolutional structure, they lack an understanding of global information. This leads to the loss of critical visual information.
The transformer [21] is a deep neural network based on a self-attention mechanism. The vision transformer (ViT) [22] was the first model to apply the transformer to image classification. The transformer can relate entries at different positions of a sequence concurrently, which helps to capture cross-frame dependencies and preserve essential features in very long sequences. The transformer has also shown outstanding capabilities in visual feature representation [23,24]. Recently, the transformer has demonstrated excellent results when used to fuse multi-view elements in point clouds [25], implying that it has the potential to facilitate the synergy of spatial and temporal features in surgical videos. The Swin Transformer [26] is a recent vision transformer model with a sliding-window operation and a hierarchical design. It combines the advantages of the transformer with the hierarchical merit of CNNs, which allows the model to flexibly handle images of different scales and significantly reduces computational complexity compared to ViT. Therefore, we propose a novel temporal-based Swin Transformer network (TSTNet). TSTNet is an end-to-end network that eliminates the limitation of positional distance and further takes into account multi-scale image features as well as long-term temporal information. Our main contributions are summarized as follows: (1) We develop a novel temporal-based Swin Transformer network (TSTNet) for surgical video analysis. It takes into consideration multi-scale image features as well as long-term temporal information and exhibits a strong feature representation capability. (2) TSTNet uses the Swin Transformer as its backbone network, introducing an attention mechanism to encode remote dependencies and extract multi-scale visual features, while the LSTM is used to learn long-range temporal dependencies. The whole network is trained in an end-to-end manner to extract spatiotemporal features that contain richer contextual information. (3) We design a simple and effective prior revision algorithm (PRA) that takes the standardization of surgery and the well-ordered structure of surgical videos into account. PRA uses prior knowledge to optimize the output of TSTNet, greatly improving the accuracy of surgical workflow recognition.

Method
In surgical videos, many image frames are difficult to identify accurately from visual features alone. Considering that surgical video is sequential data, extracting multi-scale image features and modelling long sequence information are crucial for accurate recognition of the surgical phase. Previous recognition methods use CNNs as the backbone network, but they lack an understanding of global information due to the inductive bias inherent in convolutional architectures, and they are unable to capture cross-frame dependencies. Therefore, we propose a temporal-based Swin Transformer network (TSTNet) for the surgical workflow recognition task. In particular, most surgical videos are well structured and ordered, and the consistency of prediction can be improved if prior information on the order in which surgical phases occur is introduced. Therefore, we propose PRA to further improve recognition performance. The overall framework of this paper is shown in Fig. 1. First, the successive image frames are pre-processed (the sequence is divided by a sliding window and the resulting sequences are randomly shuffled) and then fed into TSTNet to extract spatiotemporal features. Finally, the output of TSTNet is further optimized by PRA to obtain the final surgical workflow recognition results. The frames in a video clip are represented by $x = \{x_1, \ldots, x_{t-1}, x_t\}$, and $p_t \in \mathbb{R}^C$ is the prediction vector, with $C$ denoting the number of classes (the number of phases in our task).

Temporal-based Swin Transformer network
The TSTNet network structure is shown in Fig. 2. First, the Swin Transformer is initialized with pre-trained model weights from the ImageNet-22k [14] dataset. The Swin Transformer is then used to capture multi-scale visual features in image frames, and the LSTM is used to model the temporal information of sequential frames. TSTNet integrates the Swin Transformer and the LSTM, so we train them jointly in an end-to-end way to obtain high-level semantic features. We denote the output features of the Swin Transformer by $z = \{z_1, \ldots, z_{t-1}, z_t\}$, and $h = \{h_1, \ldots, h_{t-1}, h_t\}$ denotes the features output by the LSTM.
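A minimal PyTorch sketch of this combination is given below. It assumes a Swin Transformer backbone that outputs a 1024-dimensional feature per frame, a 512-unit unidirectional LSTM and a 7-way classifier, matching the dimensions reported later in this section; the class and argument names are illustrative and not taken from any released implementation.

```python
import torch
import torch.nn as nn

class TSTNet(nn.Module):
    """Sketch: Swin Transformer backbone followed by an LSTM temporal head."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 1024,
                 hidden_dim: int = 512, num_phases: int = 7):
        super().__init__()
        self.backbone = backbone                              # per-frame visual features
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_phases)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, seq_len, 3, 224, 224) consecutive frames of one sliding window
        b, t = clips.shape[:2]
        z = self.backbone(clips.flatten(0, 1))                # (b*t, feat_dim)
        h, _ = self.lstm(z.view(b, t, -1))                    # (b, t, hidden_dim)
        return self.fc(h)                                     # (b, t, num_phases) phase logits
```

Because both components sit in one module, a single backward pass updates the Swin Transformer and the LSTM jointly, which is what the end-to-end training described above refers to.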

Swin Transformer network
Considering the complex surgical environment, it is not easy to obtain spatiotemporal features with high recognition performance. Different from previous approaches that adopted CNNs as the backbone network for surgical workflow recognition tasks, we use the Swin Transformer with the attention mechanism to tackle this complex but meaningful task.
The Swin Transformer network structure is shown in Fig. 3; it consists of four repeated stages and produces feature maps at the same resolutions as typical convolutional networks (such as VGG and ResNet). The Swin Transformer combines the advantages of the transformer with the hierarchical merit of CNNs, which allows the model to flexibly handle images of different scales. Each stage consists of a linear embedding layer and two consecutive Swin Transformer blocks. As illustrated in Fig. 3b, the two consecutive blocks are computed as

$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1},$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l},$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1},$$

where $\hat{z}^{l}$ represents the output features of the (S)W-MSA module and $z^{l}$ the output features of the MLP for block $l$. W-MSA denotes window-based multi-head self-attention with a regular window partition, while SW-MSA uses a shifted window partition. Built on the shifted-window module, the Swin Transformer can establish global connections across a sequence and capture long-range dependent features. Its hierarchical construction, similar to that of CNNs, also enables adequate extraction of multi-scale image features. Both properties help to provide a more robust feature representation for surgical phase recognition tasks.
We initialized the weights of the Swin Transformer with the model pre-trained on ImageNet-22k [14]. Specifically, the last layer of the pre-trained model (the softmax layer) was removed. For the classification network, we connected a LayerNorm (LN) layer and a global average pooling (GAP) layer after the Swin Transformer to extract image features. The Swin Transformer was then further trained on the Cholec80 dataset. Finally, a 1024-dimensional feature vector is output.
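A sketch of this feature-extraction head is shown below, assuming the last Swin stage returns a sequence of 1024-dimensional tokens (as in the Swin-B variant); the wrapper and its names are illustrative only.

```python
import torch
import torch.nn as nn

class SwinFeatureHead(nn.Module):
    """LayerNorm followed by global average pooling over the final-stage tokens."""

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim) from the last Swin stage
        return self.norm(tokens).mean(dim=1)   # (batch, embed_dim) pooled image feature
```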

LSTM network
For complex surgical procedures, it is difficult to accurately distinguish between surgical phases by relying purely on visual information. Considering that surgical video is a type of sequential data, phase recognition would benefit greatly if long-range temporal features could be extracted efficiently. Therefore, we use the LSTM to address the long-term dependency problem and extract more features containing contextual information.
The LSTM introduces a new internal state $c_t \in \mathbb{R}^D$ for linear cyclic transfer, while nonlinearly outputting the external state $h_t \in \mathbb{R}^D$ of the hidden layer. The unit at the current time step receives the previous state $h_{t-1}$ and the feature information output by the Swin Transformer. The forget gate $f_t$ controls how much information about the internal state of the previous step should be forgotten. The input gate $i_t$ determines how much information of the candidate state at the current step needs to be stored. Combining the forget gate and the input gate yields the new internal state $c_t$, and the output gate $o_t$ controls how much of the internal state is passed to the external state $h_t$.

These experiments are conducted in online mode. The surgical sequences are input into TSTNet and trained in an end-to-end manner. Specifically, we take the 1024-dimensional features output by the Swin Transformer as input to a unidirectional LSTM placed before the fully connected layer. The LSTM has 512 hidden units and operates over sequences of 10 time steps. The parameters of the Swin Transformer and the LSTM are jointly optimized during backpropagation. As a result, spatiotemporal features with high recognition capability can be obtained. Finally, the predicted class of each image frame is output through a fully connected layer whose 7 neurons correspond to the 7 surgical phases.
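For reference, the gating described above follows the standard LSTM formulation, written here with the Swin Transformer feature $z_t$ as the input at step $t$ (this is the common textbook form, not an equation reproduced from the original paper):

$$
\begin{aligned}
f_t &= \sigma\left(W_f z_t + U_f h_{t-1} + b_f\right), \\
i_t &= \sigma\left(W_i z_t + U_i h_{t-1} + b_i\right), \\
o_t &= \sigma\left(W_o z_t + U_o h_{t-1} + b_o\right), \\
\tilde{c}_t &= \tanh\left(W_c z_t + U_c h_{t-1} + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh\left(c_t\right),
\end{aligned}
$$

where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.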

Revision method based on prior knowledge
We found that the need for surgeons to follow prescribed workflows and instructions leads to considerable regularity in most surgical videos [17]. Figure 4b summarizes the variation in surgical workflow in the Cholec80 dataset. Specifically, phases P1 to P4 occur in a fixed order. From P4 to P7, the order is not fixed, but P5 always occurs before P7. Based on these observations, this paper obtains valuable prior information by tracking the surgical workflow and instantly inferring the phase of the current frame from the predictions of previous frames. This helps to calibrate image frames that were predicted incorrectly. Therefore, this paper proposes a simple and effective prior revision algorithm (PRA) applied after TSTNet, which uses prior knowledge to improve the consistency of prediction.
In the PRA, we use $\lambda_t \in \{0, \ldots, L\}$, with $L = 6$, to represent the network's phase prediction for the current frame $x_t$, where the indices $0$ to $L$ correspond to the seven surgical phases. To store prior knowledge, we set up a prior prediction collection set $S$ that retains the phase predictions of all past frames.
The prior knowledge preserved in $S$ is used to derive the surgical phase that best matches the current frame $x_t$. We also set an accumulator $A$ for each surgical phase, recording the number of predictions assigned to that phase. In addition, to ensure the accuracy of the prior $P_t$, the accumulator $A$ accumulates only when successive frames of the sequence are predicted as the same phase; otherwise, $A$ is cleared. When the phase count reaches the set threshold $\delta$, the prior $P_t$ is determined. The acquired prior $P_t$ is then used to rectify the prediction result $\lambda_t$ of the current frame $x_t$. The three treatments of the current frame are as follows: (1) When the prediction result $\lambda_t$ of the current frame $x_t$ is consistent with $P_t$, the prediction is considered correct and is kept. (2) When the prediction result $\lambda_t$ of the current frame $x_t$ is one of the potential next phases, inference may have entered the next phase. To ensure the accuracy of the prior phase $P_t$, the accumulator is started. When the accumulator reaches the threshold $\delta$, $P_t$ is determined to have entered the next phase and the prediction result $\lambda_t$ is kept unchanged. If the threshold has not yet been reached, the prior phase $P_t$ is considered to be still in the current phase and $\lambda_t$ is modified to be consistent with $P_t$, while the accumulator is cleared whenever the run of consecutive predictions is broken. (3) If the prediction result $\lambda_t$ of the current frame is neither consistent with $P_t$ nor one of the potential next phases of $P_t$, the prediction of the current frame is directly corrected to be consistent with $P_t$.
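The following Python sketch illustrates one possible reading of this procedure. The transition graph `NEXT_PHASES` encodes the ordering constraints summarized above (indices 0-6 stand for P1-P7) and is an illustrative assumption, as is the exact bookkeeping of the accumulator; neither is taken from the authors' implementation.

```python
# Hypothetical phase-order graph: P1-P4 in sequence, then P5-P7 may interleave.
NEXT_PHASES = {0: {1}, 1: {2}, 2: {3}, 3: {4, 5}, 4: {5, 6}, 5: {4, 6}, 6: {5}}

def prior_revision(preds, delta=5):
    """Revise frame-wise phase predictions using the prior phase tracked over past frames."""
    prior = preds[0]              # prior phase P_t, initialised from the first prediction
    counter, candidate = 0, None  # accumulator A over consecutive next-phase predictions
    revised = []
    for lam in preds:                       # lam: network prediction for the current frame
        if lam == prior:                    # (1) consistent with the prior: keep it
            counter, candidate = 0, None
        elif lam in NEXT_PHASES[prior]:     # (2) a potential next phase
            counter = counter + 1 if lam == candidate else 1
            candidate = lam
            if counter >= delta:            # enough consecutive evidence: the prior advances
                prior, counter, candidate = lam, 0, None
            else:                           # not yet confident: fall back to the prior
                lam = prior
        else:                               # (3) inconsistent with the prior: correct it
            counter, candidate = 0, None
            lam = prior
        revised.append(lam)
    return revised
```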
Finally, it is important to note that the hyper-parameters of the PRA are determined using a grid search on a validation subset of the dataset, and the optimal threshold values obtained are applied to the test subset. As shown in Fig. 4, we take the transition from P1 to P2 as an example to illustrate how PRA works when the threshold is 5. The same principle applies at the other transition points.

Training details of TSTNet
The experimental environment is based on Python 3.6.7. Our framework is implemented in PyTorch 1.7.1, and training is performed on a 24 GB Nvidia Quadro RTX 6000 GPU. We initialize the parameters of the Swin Transformer from the weights trained on the ImageNet-22k dataset. The experimental parameters are as follows: the learning rate is set to 0.00003, the decay is fixed at 0.000001, the batch size is 80, the number of epochs is 5, and the sequence length is 10. The whole training takes about 20 h. During inference, our TSTNet model processes one frame within 0.01 s. The optimizer is adaptive moment estimation (Adam). The input image size is 224 × 224, obtained with a centre crop. Images are split into patches of size 4, and the window size is set to 7.
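A brief sketch of these settings in PyTorch is given below, assuming `model` stands in for the TSTNet instance and that the reported decay refers to Adam's weight decay; the initial resize before the centre crop is also an assumption, as it is not stated above.

```python
import torch
import torch.nn as nn
from torchvision import transforms

model = nn.Linear(1024, 7)   # placeholder standing in for the TSTNet instance in this sketch

# Optimizer settings as reported: Adam, learning rate 3e-5, decay 1e-6 (assumed weight decay).
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, weight_decay=1e-6)

# Frame preprocessing matching the reported 224 x 224 centre crop; the resize value is assumed.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```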

Dataset and evaluation metrics
We extensively validate our TSTNet-PRA approach on the publicly available Cholec80 dataset [17]. It contains 80 videos of cholecystectomies performed by 13 surgeons. The videos are recorded at 25 frames per second, with a resolution of 1920 × 1080 or 854 × 480. We follow the split of [8-10, 12] exactly, using the first 40 videos of the dataset as the training set and the remaining 40 videos as the test set; eight videos from the training set are used as the validation set. For data pre-processing, we build sequences with a sliding window of size 10 that moves by one frame at a time. That is, adjacent sequences overlap by 9 frames, and only the last frame is new. After obtaining the sets of sequences, their order is randomly shuffled.
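A minimal sketch of this sliding-window construction (the frame naming is illustrative):

```python
import random

def make_clips(frame_paths, seq_len=10):
    """Build overlapping clips with a stride of one frame (9-frame overlap between neighbours)."""
    clips = [frame_paths[i:i + seq_len]
             for i in range(len(frame_paths) - seq_len + 1)]
    random.shuffle(clips)   # the order of clips is shuffled; frames within a clip stay ordered
    return clips

# Example: 12 frames of one video yield 3 clips of length 10.
clips = make_clips([f"frame_{i:05d}.png" for i in range(12)])
```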
This paper focuses on the quantitative evaluation of the model using precision, recall, and accuracy, computed as

$$\text{Precision} = \frac{|GT \cap P|}{|P|}, \qquad \text{Recall} = \frac{|GT \cap P|}{|GT|},$$

where $GT$ and $P$ denote the ground-truth set and the prediction set of a phase, respectively. Since precision and recall are evaluated at the phase level, the values obtained over all phases are averaged to obtain the precision and recall of the entire surgical video. Accuracy is evaluated on the entire surgical video and is defined as the percentage of correctly classified frames in the whole video.
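As a sketch, these metrics can be computed for a single video as follows; how phases that never occur (or are never predicted) are handled is not specified here, so they are simply skipped in this illustration.

```python
import numpy as np

def phase_metrics(gt, pred, num_phases=7):
    """Per-phase precision/recall averaged over phases, plus video-level accuracy."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    precisions, recalls = [], []
    for c in range(num_phases):
        tp = np.sum((pred == c) & (gt == c))
        if np.sum(pred == c):
            precisions.append(tp / np.sum(pred == c))
        if np.sum(gt == c):
            recalls.append(tp / np.sum(gt == c))
    accuracy = float(np.mean(pred == gt))
    return float(np.mean(precisions)), float(np.mean(recalls)), accuracy
```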

Experimental analysis of ablation
To illustrate the effectiveness of each module of the proposed method, ablation experiments were carried out. As shown in Table 1, the Swin Transformer alone achieves an accuracy of 81.9%, a recall of 71.3%, and a precision of 73.3%. When combined with the LSTM network, all three evaluation metrics improve over the baseline network, with the most significant improvement of 9.1% in precision. This suggests that the transformer combined with the LSTM (TSTNet) has strong feature representation ability, thus improving recognition performance. On top of TSTNet, the proposed PRA is used to further optimize the results, giving an accuracy, recall, and precision of 92.8%, 90.7%, and 90.5%, respectively. All three indicators rise above 90%, which greatly improves recognition performance and demonstrates the effectiveness of PRA. To analyse the proposed framework more comprehensively, the confusion matrix is calculated to show detailed results at the phase level in Fig. 5. By optimizing the output of TSTNet, PRA is observed to alleviate the erroneous predictions of P1 to P2, P4 to P2, and P7 to P1. However, after processing by the PRA, some correctly identified frames were incorrectly revised to other phases. For example, part of P3 was incorrectly classified as P4, which may be due to inaccurate identification of the transition sequence between P3 and P4 by the TSTNet network.
The 7 phases of the surgical video are also evaluated individually, as shown in Table 2, where the support column gives the sample size of each category. It can be observed that the 7 surgical phases are imbalanced in sample size. Importantly, even though the sample sizes of the preparation, cleaning and coagulation, and gallbladder retraction phases are small, the precision of our proposed TSTNet-PRA can still reach a high level.

Comparative experimental analysis
We compare our proposed method with several state-of-the-art methods, namely PhaseNet and EndoNet proposed by Twinanda et al. [17], OHFM [13], MTRCNet-CL [12], and TMRNet [20]. The experimental results are shown in Table 3. Our TSTNet-PRA method has a clear advantage in all three metrics: accuracy, recall, and precision. Although EndoNet uses additional tool annotations for the phase identification task, our TSTNet-PRA still improves the accuracy from 81.7 to 92.8%, the recall from 79.6 to 90.7%, and the precision from 73.7 to 90.5%. The proposed method outperforms the OHFM network by 7.6% in accuracy. Although OHFM takes long-range information into account, its architecture is not trained end-to-end but encodes visual and temporal features separately. Furthermore, despite the fact that TMRNet can capture multi-scale temporal information, our method still outperforms TMRNet by 2.7%, 1.2%, and 0.2% in the three metrics, respectively. Note that only phase labels are used to train our network.

Conclusion
We propose a temporal-based Swin Transformer network (TSTNet) for the automatic recognition of surgical workflow from surgical videos. We exploit the Swin Transformer and LSTM networks to extract multi-scale visual features and temporal dependencies. By combining the advantages of the transformer with the hierarchical merit of CNNs, the Swin Transformer shows strong feature representation ability. TSTNet integrates the Swin Transformer and LSTM networks and is trained in an end-to-end manner, so that visual and temporal features are jointly and efficiently optimized during training. In particular, based on a full understanding of the natural characteristics of surgical videos, PRA is further proposed to optimize the output of TSTNet using prior knowledge. Our proposed TSTNet-PRA method achieves an accuracy of 92.8% on the Cholec80 dataset, which validates its effectiveness. Furthermore, our TSTNet-PRA method can also provide fundamental support for the assessment of surgical quality.