Research on laparoscopic surgical instrument detection technology based on multi-attention-enhanced feature pyramid network

Laparoscopic surgery is a very active area of research in clinical medicine. Detecting tools in surgical videos can help physicians operate surgical instruments, reduce complications, and ensure patient safety. However, the size of laparoscopic surgical instruments is highly variable, which leads to poor detection performance. Feature pyramid networks (FPNs) can effectively address multi-scale target detection, but they still have shortcomings that prevent the full utilization of multi-scale features. By analyzing the design problems of the FPN, we propose the Multi-Attention-Enhanced Feature Pyramid Network (MAFPN), which can fully exploit multi-scale features. First, we replace the convolutional block with a feature selection module (FSM) that combines channel attention and global attention, selectively maintaining important information and enhancing the expressiveness of features at each scale. Second, global contextual information is captured by the self-attention augmented fusion module (AAFM), which enriches the high-level feature information in the FPN and enhances the feature fusion effect. Finally, we use Dynamic Convolution Decomposition (DCD) to alleviate the impact of upsampling while enhancing the feature expression ability. Experimental results on the laparoscopic surgical instrument detection dataset m2cai16-tool-locations show that MAFPN achieves an average precision of 96.5% at an IOU of 0.5, which is 1.8% higher than the baseline method RetinaNet and more than 1.6% higher than the comparison networks. Compared with state-of-the-art methods, its laparoscopic surgical instrument detection performance is superior.


Introduction
The laparoscopic technique is a surgical operation performed by placing minimally invasive instruments in the abdominal cavity. Laparoscopic surgery therefore has the advantages of short operation time, less injury, less pain, less visceral interference, a low probability of abdominal infection, and a lower incidence of complications such as postoperative intestinal adhesions [1]. Today, laparoscopic surgery is widely used in surgical procedures. For most patients with gallbladder stones, laparoscopic cholecystectomy is a well-established procedure. However, there are still many complications after laparoscopic surgery, which are often caused by technical mistakes such as improper operation of surgical instruments.

Fig. 1 Overall pipeline of MAFPN. MAFPN filters features through the FSM module; the AAFM module then enhances and fuses the pyramid features; finally, the DCD module enhances the feature expression ability.

In earlier work on this dataset, the average precision of the experiment reached 63.1 AP. In 2020, Zhang B et al. [6] used an improved Faster R-CNN [4] network on this dataset for detection, with an average precision of 69.6%.
Object detection has achieved remarkable success in the medical field [7][8][9]. So far, CNN-based object detectors mainly include anchor-based one-stage detectors [10][11][12][13], anchor-based two-stage detectors [17], and anchor-free detectors [14][15][16]. Two-stage detectors first generate region proposals for possible objects through a Region Proposal Network (RPN) [4] and then perform classification and localization. One-stage detectors directly transform object localization into a regression problem. Anchor-based detectors obtain the final object bounding boxes directly from pre-defined anchor boxes, while anchor-free detectors solve object detection with pixel-by-pixel prediction. In object detection tasks, the backbone network extracts features and forms feature maps at different resolutions. High-resolution feature maps have small receptive fields and abundant detailed information but lack semantic information, which facilitates the detection of small targets; low-resolution feature maps have larger receptive fields and richer semantic information but lack detailed information, which facilitates the detection of large targets. The Feature Pyramid Network (FPN) [18] addresses the difficulty of handling multi-scale variation in object detection by fusing the features of each scale and making predictions on the feature map of each level. However, FPN [18] still has some shortcomings in fusing features of different scales. Before feature fusion, the channel information of multi-scale features is lost, and upsampling also leads to the loss of semantic information [19]. This greatly affects the performance of object detection. In this paper, we propose MAFPN to address these problems.
1. A feature selection module (FSM) replaces the 1 × 1 convolution before feature fusion, emphasizing important feature information and suppressing redundant feature information to alleviate the impact of channel information loss.
2. The self-attention augmented fusion module (AAFM) captures the long-distance dependencies of features, enhances the spatial information of features, and improves the fusion effect.
3. After feature fusion, Dynamic Convolution Decomposition (DCD) mitigates the aliasing effects caused by upsampling.

Feature fusion
The information contained in features differs across scales. Bottom-level features contain abundant detailed information that eases the localization of small objects, while top-level features carry more semantic information that facilitates the classification of large objects. Therefore, fusing multi-scale feature maps can effectively enhance network performance. FPN [18] fuses multi-scale features through a top-down path, but it still does not fully utilize features of different resolutions. PANet [20] adds a bottom-up path to FPN [18], which enhances the contextual information of features but greatly increases the computational complexity. ASFF [21] fuses feature maps at different levels by adaptively adjusting the spatial weight of features at each scale. BFP [19] first unifies the scales of features by upsampling or downsampling, then uses non-local [22] operations to capture global features and obtain balanced semantic features, and finally recovers the enhanced multi-scale feature maps through upsampling and downsampling. TPNet [23] effectively increases the detection precision of objects of different sizes through bidirectional conversion between features of different scales. AugFPN [24] improves feature fusion by compensating for the various kinds of information lost during the fusion process. NAS-FPN [25] searches for the optimal feature fusion topology to balance speed and accuracy. FPG [26] integrates multiple connection patterns to form a deep multi-path feature pyramid. These methods improve the feature pyramid from different perspectives; we improve it from the perspective of compensating for channel information and enhancing feature fusion.

Attention mechanism
Since the attention mechanism can pick out important information from many irrelevant backgrounds, it has been used extensively in deep learning. SENet [27] learns the correlations between input feature channels and then enhances effective features and suppresses invalid ones according to the importance of each channel. SKNet [28] improves network performance by combining the channel-weighting concept of SENet with the multi-branch structure of Inception [29]. CBAM [30] enhances information by combining channel attention with spatial attention. CA [31] incorporates location information into channel attention, thereby extending the attention range while reducing the computational cost. Non-local [22] directly obtains long-range information by computing the interaction between any two locations, no longer limited to neighboring points. CCNet [32] obtains contextual information along criss-cross paths through a criss-cross attention module and further captures long-range dependencies through recurrence.

Dynamic convolution decomposition
Static convolution applies the same convolution kernel to every input. Dynamic Convolution [33] instead combines an attention mechanism with multiple convolution kernels, adaptively adjusting the convolution parameters for each input. Compared with static convolution, dynamic convolution has a stronger feature expression capability. DCD [34] observes that standard dynamic convolution applies attention over high-dimensional channel groups, which yields small attention values and inhibits the learning of the corresponding channels. To solve this problem, DCD [34] adopts dynamic channel fusion instead of dynamic attention over channel groups, increasing the feature expression capability of dynamic convolution. In this paper, the standard convolution after feature fusion is replaced by dynamic convolution to mitigate the impact of upsampling and enhance the feature expression ability.
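As a concrete illustration, the following numpy sketch shows the general dynamic-convolution idea that DCD builds on: K candidate 1 × 1 kernels are mixed with input-dependent softmax weights before being applied. This is a simplified stand-in (the kernel count, shapes, and the plain softmax attention are our assumptions), not the dynamic channel fusion of DCD [34] itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dynamic_conv1x1(x, kernels, w_att):
    """Input-dependent mixture of K candidate 1x1 convolution kernels.

    x       : (C_in, H, W) feature map
    kernels : (K, C_out, C_in) candidate 1x1 kernels
    w_att   : (K, C_in) projection producing one attention logit per kernel
    """
    # attention logits computed from globally pooled features
    ctx = x.mean(axis=(1, 2))                   # (C_in,)
    pi = softmax(w_att @ ctx)                   # (K,) kernel mixing weights
    # aggregate kernels, then apply as a 1x1 conv (matrix product over channels)
    W = np.tensordot(pi, kernels, axes=1)       # (C_out, C_in)
    return np.tensordot(W, x, axes=([1], [0]))  # (C_out, H, W)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
kernels = rng.standard_normal((4, 16, 8))
w_att = rng.standard_normal((4, 8))
y = dynamic_conv1x1(x, kernels, w_att)
print(y.shape)  # (16, 4, 4)
```

Because the mixed kernel W depends on x, two different inputs are in general filtered by two different kernels, which is exactly what gives dynamic convolution its stronger expressive power over a static kernel.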

Approach
The structure of MAFPN is shown in Fig. 1; it includes the backbone network ResNet50 [35], the FSM, the AAFM, and the sub-networks. First, ResNet50 extracts image features, and the feature maps C3, C4, and C5 are fed into the FSM, which selects the valid channel information. Higher-level features C6 and C7 are then obtained using DCD [34] and standard convolutional downsampling. The AAFM captures the contextual information of feature map C7, and top-down feature fusion improves the semantic information of the bottom features. Finally, the candidate boxes are classified and regressed by the sub-networks.
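The shape flow of this pyramid can be sketched in numpy as follows, assuming a 512 × 512 input, 256-channel features, nearest-neighbour upsampling, and plain stride-2 subsampling as a stand-in for the learned DCD/convolutional downsampling:

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour 2x upsampling along both spatial axes."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2(x):
    """Stride-2 subsampling (stand-in for a stride-2 convolution)."""
    return x[:, ::2, ::2]

rng = np.random.default_rng(0)
# Backbone outputs for a 512x512 input, channels already unified to 256:
C3 = rng.standard_normal((256, 64, 64))   # stride 8
C4 = rng.standard_normal((256, 32, 32))   # stride 16
C5 = rng.standard_normal((256, 16, 16))   # stride 32
C6 = downsample2(C5)                      # stride 64
C7 = downsample2(C6)                      # stride 128

# Top-down fusion: each level receives the upsampled level above it.
P7 = C7
P6 = C6 + upsample2(P7)
P5 = C5 + upsample2(P6)
P4 = C4 + upsample2(P5)
P3 = C3 + upsample2(P4)
print([p.shape[1] for p in (P3, P4, P5, P6, P7)])  # [64, 32, 16, 8, 4]
```

In the full network, the FSM replaces the plain channel unification before fusion, the AAFM enriches P7 with context, and DCD follows the fusion at each level; this sketch only fixes the tensor shapes involved.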

Feature selection module
Since the number of channels must be unified before feature fusion, FPN [18] directly reduces the channels of each feature map and then fuses it with the upper-layer features, which are also channel-reduced. However, channel reduction causes a great loss of channel information. The FSM solves this problem. Figure 2 shows the structure of the FSM. Inspired by CBAM [30], the FSM extracts feature information by combining spatial information with channel information. First, average pooling preserves the overall characteristics of the object, while max pooling extracts the main object features. Then, the channel information is captured using a local cross-channel interaction strategy that does not reduce the channel dimensionality, resulting in vectors m ∈ R^{1×C} and a ∈ R^{1×C}. After adding m and a, the attention map K ∈ R^{C×1×1} is obtained through sigmoid activation. The features are recalibrated by multiplying K with the input feature X. Finally, important channel features are selectively maintained by a 1 × 1 convolution.
S_m ∈ R^{C×1×1} and S_a ∈ R^{C×1×1} denote the results of max pooling and average pooling of the input feature X, respectively. f_{1×1} denotes a 1 × 1 convolution, f_{3×3} a 3 × 3 convolution, and s the squeeze-and-transpose of dimensions. This operation learns the channel attention without dimensionality reduction.
u denotes the inverse operation of s (restoring the squeezed dimensions), σ denotes the sigmoid function, and f^c_{1×1} denotes a 1 × 1 convolution with 256 filters; the number of output channels C_o is set to 256.

FPN [18] fuses low-resolution features carrying weak semantic information with high-resolution features via a top-down path. However, the upsampling operation before fusion leads to the loss of semantic information in the high-level features [19]. Inspired by CCNet [32], we exploit self-attention to capture the contextual information of the high-level feature C7 and thus improve the detection of large objects. The semantic information is then passed down through the top-down path, improving the detection of small objects.
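A minimal numpy sketch of the FSM channel-attention step is given below: average and max pooling, a shared local cross-channel 1D convolution with no dimensionality reduction, sigmoid gating, and channel-wise recalibration. The kernel size and shared weights are illustrative assumptions, and the final 1 × 1 convolution is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_channels(v, w):
    """1D convolution across the channel axis ('same' padding): local
    cross-channel interaction with no dimensionality reduction."""
    k = len(w)
    v = np.pad(v, k // 2)
    return np.array([np.dot(v[i:i + k], w) for i in range(len(v) - k + 1)])

def feature_select(x, w):
    """FSM-style channel attention sketch.

    x : (C, H, W) input feature map
    w : (k,) shared 1D kernel for local cross-channel interaction
    """
    a = x.mean(axis=(1, 2))          # average pooling -> (C,)
    m = x.max(axis=(1, 2))           # max pooling     -> (C,)
    # add the two interaction results, then sigmoid -> attention map K
    K = sigmoid(conv1d_channels(a, w) + conv1d_channels(m, w))   # (C,)
    return x * K[:, None, None]      # recalibrate each channel of X

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
w = rng.standard_normal(3)
y = feature_select(x, w)
print(y.shape)  # (16, 8, 8)
```

Since K lies in (0, 1) per channel, the module can only attenuate channels, selectively keeping the informative ones before the channel count is reduced for fusion.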

Self-attention augmented fusion module
As shown in Fig. 3, feature maps Q and K are first generated through 1 × 1 convolutions, with {Q, K} ∈ R^{C/8×H×W}. The vector Q^h_u ∈ R^{C/8} is obtained by permuting the dimensions and reducing the number of channels, where u indexes each position along the vertical direction of Q, and the feature vector Q^w_u ∈ R^{C/8} of the row containing position u is extracted from Q. Similarly, the vectors K^h_u ∈ R^{C/8} and K^w_u ∈ R^{C/8} are obtained from K. The similarities of Q and K in the vertical and horizontal directions are then computed by matrix multiplication, yielding the attention maps E^h ∈ R^{H×W×H} and E^w ∈ R^{H×W×W}; applying softmax over their concatenation produces the attention map A ∈ R^{H×W×(W+H)}. The output O is obtained by aggregating A with a value map generated by another 1 × 1 convolution. Finally, a matrix addition between O and the input X models the global context information: Y = ρ · O + X, where ρ is a trainable parameter with an initial value of 0.
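The criss-cross attention pattern described above can be sketched in numpy as follows. For clarity, this loop-based version computes, for each position, softmax attention over its own row and column only, and uses the input X itself as the value map (an assumption made for brevity; the channel reduction to C/8 for Q and K is also only mimicked by the chosen shapes):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def criss_cross_attention(Q, K, V):
    """Criss-cross attention sketch (CCNet-style): each position attends
    only to the H + W positions in its own column and row.

    Q, K : (C', H, W) queries/keys with reduced channels
    V    : (C, H, W)  values
    """
    C, H, W = V.shape
    O = np.zeros_like(V)
    for i in range(H):
        for j in range(W):
            q = Q[:, i, j]                                        # (C',)
            k = np.concatenate([K[:, :, j], K[:, i, :]], axis=1)  # (C', H+W)
            v = np.concatenate([V[:, :, j], V[:, i, :]], axis=1)  # (C, H+W)
            att = softmax(q @ k)                                  # (H+W,)
            O[:, i, j] = v @ att                                  # weighted sum
    return O

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6, 6))
Q = rng.standard_normal((2, 6, 6))   # stands in for the C/8 reduction
K = rng.standard_normal((2, 6, 6))
O = criss_cross_attention(Q, K, X)
rho = 0.0                            # trainable scalar, initialised to 0
Y = rho * O + X                      # residual fusion as in the text
print(Y.shape)  # (8, 6, 6)
```

With ρ initialised to 0 the module starts as an identity mapping, and training gradually learns how much global context to inject.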

Dataset and evaluation metrics
At present, datasets with annotated bounding boxes for minimally invasive surgical instrument detection are limited.
Except for the Cholec80 dataset [2] and the m2cai16-tool dataset [1], all other datasets need to be manually labeled.
The m2cai16-tool-locations dataset [3] is extended from the laparoscopic surgery video dataset m2cai16-tool [1]. We validate the effectiveness of the network on the m2cai16-tool-locations dataset. Due to the large difference in the number of images per category, the dataset is augmented; Figure 4 shows the number of images in each category before and after data augmentation. We use average precision (AP), average precision under different Intersection over Union (IOU) thresholds (AP50, AP75), and the average precision of small, medium, and large objects (APs, APm, APl) to evaluate performance. AP is further calculated according to object scale: the area of small objects is less than 100 × 100 pixels, the area of medium objects ranges from 100 × 100 to 200 × 200 pixels, and the area of large objects is larger than 200 × 200 pixels. Table 1 shows the number of objects at each scale and the proportion of small objects in the augmented dataset. AP is computed from Precision = TP/(TP + FP) and Recall = TP/(TP + FN) as the area under the precision-recall curve, where TP denotes correctly classified positive samples, FP denotes misclassified negative samples, and FN denotes misclassified positive samples.
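Under these definitions, precision, recall, and AP can be computed as in the following numpy sketch; the sample counts and the precision-recall points are illustrative:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve, using the standard
    interpolated precision (made non-increasing from right to left)."""
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    for i in range(len(p) - 2, -1, -1):   # interpolate precision
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]    # recall steps
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

prec, rec = precision_recall(tp=8, fp=2, fn=2)
print(prec, rec)  # 0.8 0.8
ap = average_precision(np.array([0.2, 0.5, 0.8]),
                       np.array([1.0, 0.9, 0.8]))
print(round(ap, 3))  # 0.71
```

AP50 and AP75 correspond to this computation with detections counted as TP at IOU thresholds of 0.5 and 0.75, respectively, and APs/APm/APl restrict it to the object-area ranges defined above.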

Experimental details
All experiments in this paper are run on an NVIDIA RTX 3060 GPU with a batch size of 2. The baseline is RetinaNet [10] with ResNet50 [35]. We use learning rate warm-up: a smaller learning rate is used first, the learning rate is restored to the initial value of 0.0005 after 500 iterations, and it is then multiplied by 0.1 after epochs 16 and 22, respectively. All experiments are trained for 24 epochs using the SGD optimizer with momentum 0.9 and weight decay 0.0001. All other hyperparameters follow MMDetection [36].
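The schedule can be sketched as a small Python function; the linear warmup shape and its starting factor are assumptions, since the text only specifies "a smaller learning rate" before iteration 500:

```python
def learning_rate(iteration, epoch, base_lr=0.0005, warmup_iters=500,
                  warmup_factor=1.0 / 3, decay_epochs=(16, 22), gamma=0.1):
    """Warmup + step-decay schedule matching the settings in the text:
    warmup to base_lr over the first 500 iterations, then x0.1 after
    epochs 16 and 22. The linear warmup from base_lr * warmup_factor
    is an assumption, not stated in the text."""
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr *= gamma
    return lr

print(learning_rate(0, 0))       # reduced warmup starting rate
print(learning_rate(500, 5))     # 0.0005 after warmup
print(learning_rate(10000, 23))  # two x0.1 decays applied
```

MMDetection's default 1x/2x schedules follow the same warmup-then-step pattern, which is why only the deviations from its defaults are listed above.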

Experimental results
We evaluate MAFPN on the m2cai16-tool-locations dataset [3]. Table 2 summarizes all the results of this paper. By replacing the FPN [18] in the baseline RetinaNet with MAFPN, our method achieves an average precision of 96.5% at an IOU of 0.5. The overall average precision reaches 59.0%, 5.6% higher than the baseline method. Compared with the two-stage detectors Faster R-CNN [4], Grid R-CNN [16], and Sparse R-CNN [17], the anchor-based one-stage detectors SSD300 [12], Guided Anchoring [11], and YOLOF [13], and the anchor-free one-stage detectors FCOS [14] and ATSS [15], our method shows better performance. We also compare MAFPN with other FPN [18] improvements, such as BFP [19], NAS-FPN [25], and FPG [26]. At an IOU of 0.5, both NAS-FPN and MAFPN improve on the baseline by 1.8%, reaching an average precision of 96.5%. Moreover, MAFPN improves detection at all object scales; for large objects, its average precision is 12.3% higher than the baseline method.
For object detection tasks, the precision-recall curve provides a visual representation of the detector's performance. Figure 5 shows the precision-recall curves of several methods. Our method encloses the largest area under the curve, indicating that it is superior to the other models. Figure 6 displays the detection results of different models on the m2cai16-tool-locations dataset. In some cases, the FCOS, Sparse R-CNN, and SSD300 detectors miss one or two instances, while MAFPN accurately detects these surgical instruments with higher average precision.

Ablation study on the importance of the three components of FSM, AAFM, and DCD
As shown in Table 3, the FSM improves the overall average precision of the baseline method by 4.7%, with large gains for objects of all scales. This indicates that reducing channel information before feature fusion in FPN [18] greatly harms detection, and that the FSM effectively alleviates this problem. Applying the FSM to each feature level effectively enhances the representation ability of the feature maps at different scales. The AAFM improves the average precision for large objects by 1.1%, indicating that it effectively captures spatial information and improves the expression ability of high-level features. However, the detection of small objects worsens slightly, possibly because less semantic information is passed from high-level to bottom-level features after the AAFM captures spatial information. DCD improves the average precision of the baseline network by 3.8%, and small-object detection in particular by 4%. Combining any two of the three modules yields a larger improvement over the baseline; for instance, FSM combined with DCD brings a 5.1% AP improvement. With all three modules, the average precision reaches 59.0%, an improvement of 5.6%. These results illustrate that the three modules are complementary.

Ablation study in terms of the number of feature pyramid levels
We studied the effect of the number of feature pyramid levels on detection; the results are shown in Table 4. With 3 pyramid levels, AP50 reaches its highest value: with more shallow features, the detection rate of small objects is higher. With 5 pyramid levels, the overall AP is best, and the detection rates of medium and large objects are the highest. On balance, this paper sets the number of feature pyramid levels to 5. In addition, our method hardly increases the computational cost and only slightly reduces the inference speed.

Conclusion
To tackle the category imbalance in the laparoscopic surgical instrument dataset, we adopt a data augmentation strategy to reduce the difference in sample size across categories. To address the design flaws of the FPN, we propose the Multi-Attention-Enhanced Feature Pyramid Network. The network uses the feature selection module instead of a 1 × 1 convolution to focus on effective information, suppress redundant information, and alleviate the channel information lost when reducing the number of channels of the features at each FPN layer. The self-attention augmented fusion module captures the contextual information of the high-level pyramid features, compensates for the semantic information lost through upsampling in the FPN, and improves the feature fusion effect. Finally, dynamic convolution decomposition is adopted to alleviate the aliasing effect caused by upsampling and to enhance the feature expression ability. At an IOU of 0.5, the proposed method achieves 96.5% average precision on the laparoscopic surgical instrument detection dataset m2cai16-tool-locations, demonstrating its effectiveness.
Author contributions YL analyzed and interpreted the laparoscopic surgical instrument detection dataset m2cai16-tool-locations. YZ performed the laparoscopic surgical instrument detection experiments and completed the comparison and ablation experiments. XW was a major contributor to writing the manuscript. All authors read and approved the final manuscript. Data availability All data generated or analyzed during this study are included in the published article: Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. Dataset URL: https://ai.stanford.edu/~syyeung/tooldetection.html. For data augmentation, we use five methods: vertical flip, horizontal flip, brightness adjustment, affine transformation, and adding Gaussian noise to increase the quantity of the small-sample categories, and we randomly remove samples from the multi-sample categories.