Detection of surgical instruments based on Gaussian kernel

In minimally invasive laparoscopic surgery, it is of practical significance to quickly determine the location and category of each surgical instrument. Doing so can alert medical personnel to instruments left inside the patient after the operation, which can cause irreversible injury. In this paper, a Gaussian kernel is introduced into each ground truth, which makes full use of label information to allocate positive and negative samples and improves the accuracy of localization and classification. Then, we introduce the SIoU loss and the Harmonic loss into the total loss. The former uses relative coordinates to make the network converge more quickly, and the latter solves the problem of asynchronous optimization of the classification and regression branches. Our experiments show that the sample allocation strategy based on the Gaussian kernel is effective on the public data set m2cai16-tool-locations, where our method achieves higher classification and regression accuracy than other work.


Introduction
With the continuous progress of science and technology, minimally invasive surgery has become very popular, and minimally invasive laparoscopic surgery is gradually being accepted by more and more people. At the same time, medical accidents in which surgical instruments are negligently left in the abdominal cavity, leading to complications, are also on the rise. Medical staff should focus on the operation and on important decisions rather than on looking for surgical instruments. Checking for retained surgical instruments is repetitive and time-consuming work that staff must complete in a short time. This inspection is also affected by the psychological state of the inspectors, so it is sensible to use advanced object detection technology to avoid medical accidents.
Shengsheng Wang: wss@jlu.edu.cn; Hongren Zhang: hrzhang21@mails.jlu.edu.cn
In this paper, to address the lower accuracy of Anchor-free object detection algorithms on laparoscopic surgical tools, we propose to apply the Gaussian kernel to the ground truth. This enables the network to use the width and height of each object during training, making it easier for the network to find the pixels that need attention. The value of the two-dimensional Gaussian distribution is used to suppress ambiguous samples and samples with low confidence so that the network can pay more attention to samples with high confidence. The main contributions of our work are as follows:
1. We propose to apply a Gaussian kernel to each ground-truth box and make full use of the width and height of the ground-truth box to allocate positive and negative samples. This benefits the detection of objects at different scales and improves the accuracy of the network.
2. We analyze different sampling strategies and show that selecting positive and negative samples with the Gaussian kernel can be widely used in object detection, especially in multi-scale object detection.
3. SIoU is introduced to FCOS by using relative coordinates. Owing to the allocation strategy of the Gaussian kernel, SIoU becomes simpler in anchor box optimization. We introduce the Harmonic loss into the total loss to address the problem that classification and regression are optimized independently, which causes inconsistent predictions.

Related work
Real-time object detection has recently been widely used in artificial intelligence surgery, and surgical instrument detection is a promising AI sub-project [1]. Its value lies in helping surgeons monitor medical conditions and tools and in assisting them to make correct decisions by providing detection results with high accuracy. Moreover, it can avoid medical accidents caused by retained surgical instruments. Early work handled different data sets by using different hand-crafted features for detection [2,3]. However, these methods are limited: hand-crafted features are easily influenced by the designer's subjectivity and cannot capture high-level features efficiently, so they ignore the more important fine-grained characteristics [4]. In addition, convolutional neural networks have played a revolutionary role in the development of computer science and deep learning [5][6][7], making it possible to design an efficient positive and negative sample allocation strategy with higher accuracy than before. Following the M2CAI 2016 Tool Presence Detection Challenge benchmark [8], most tool detection approaches perform frame-level detection. Many object detection algorithms have been applied to the medical field, including convolutional neural networks that extract spatial features, but the accuracy of the early algorithms is insufficient for real-life use. Any suggestion made by automated detection without excellent accuracy can mislead staff or even cause fatal harm to patients.
The significant advantages of the anchor-based [9,10] allocation strategy are higher accuracy and richer extracted features. The advantage of anchor-free algorithms is detection speed. They do not need many preset anchor ratios [11][12][13], which greatly reduces the time-consuming hyper-parameter tuning for different data sets, although at the cost of lower accuracy [14]. FCOS makes good use of all points in the ground-truth bounding box for prediction. Moreover, FCOS introduces a new branch, called the 'center-ness' branch, to suppress low-quality anchors. As a result, its recall is comparable to that of anchor-based detectors. Corner-based object detection algorithms include CornerNet [15] and CenterNet [16]. CornerNet replaces the traditional prediction of a target box with the prediction of two corners and then uses Corner-Pooling to integrate the feature map information and deliver it to the two corners. Although it completes object detection with two key points, the corner matching process is time-consuming. CenterNet pays more attention to the center point of an object and also uses the Gaussian kernel to select the center point more flexibly and converge faster. CornerNet and CenterNet both use a Gaussian kernel distribution, but the former mainly penalizes negative samples around the corners to varying degrees, while the center of the Gaussian circle is a positive sample. The latter applies the Gaussian kernel to the heatmap and penalizes negative samples near the maximum value to varying degrees when calculating the heatmap loss. Although ATSS [17] uses statistics to allocate positive and negative samples dynamically, the allocation is only pseudo-dynamic. Nevertheless, these works convey a critical message: the allocation of positive and negative samples is a key issue for object detection. How to design an allocation strategy that maintains the speed advantage while improving accuracy has become the main consideration.

The proposed method
The allocation of positive and negative samples is the core difference between Anchor-based and Anchor-free detection. The above works do not consider the width and height of an object in their allocation strategies. For objects with different widths and heights [18], simply regarding all points in the target area as positive sample points burdens the network [12,19]. For targets with different width/height ratios, it is unreasonable to select only the samples in a fixed central area as positive samples. Therefore, we make full use of the width and height information and introduce it into the Gaussian distribution to better select positive and negative samples. The ground-truth bounding boxes for an input image are defined as $\{B_i\}$, where $B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \ldots, C\}$. Here $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$ indicate the coordinates of the left-top and right-bottom corners of the bounding box, $c^{(i)}$ is the class that the object belongs to, and $C$ is the number of classes.
As shown in Fig. 1, C3, C4, and C5 denote the feature maps of the backbone network, and P3 to P7 are the feature levels used by the detector head for the final prediction, on which the classification loss, center-ness loss, and regression loss are computed. First, we feed a frame from a laparoscopic surgery video into the ResNet-50 backbone to extract feature maps. Then, we build a feature pyramid to integrate feature information at different levels. For each feature map, we introduce the Gaussian kernel into the ground-truth box. We select the center of the ground-truth box as the center of the Gaussian circle to generate the Gaussian matrix. Then we select the pixels (x, y) whose values in the matrix are greater than a certain threshold as positive samples to participate in the subsequent loss calculation. Finally, for each positive sample pixel, the network outputs the category information and four distances (l, t, r, b) to the left, top, right, and bottom sides of the box. We apply the Gaussian kernel to each ground truth without considering its category, and allocate samples only according to its width and height.
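As a minimal sketch of the last step described above, the snippet below maps a feature-map pixel back to image coordinates and computes the four distances (l, t, r, b) to the sides of a ground-truth box. The stride-based mapping follows the usual FCOS convention and is an assumption here; the function names and the example box are hypothetical and are not taken from the paper.

```python
def location_to_image(x, y, stride):
    """Map a feature-map pixel (x, y) to image coordinates.

    Assumption: the FCOS-style mapping (stride // 2 + x * stride, ...),
    which places the point near the center of the pixel's receptive field.
    """
    return (stride // 2 + x * stride, stride // 2 + y * stride)


def ltrb_targets(px, py, box):
    """Distances from an image-space point to the left, top, right, bottom sides."""
    x0, y0, x1, y1 = box
    return (px - x0, py - y0, x1 - px, y1 - py)


def inside_box(px, py, box):
    """A point can only be a positive sample if all four distances are positive."""
    return min(ltrb_targets(px, py, box)) > 0


# Hypothetical ground-truth box (x0, y0, x1, y1) in image coordinates.
box = (100, 60, 260, 180)
px, py = location_to_image(20, 14, stride=8)
targets = ltrb_targets(px, py, box)
print(px, py, targets, inside_box(px, py, box))
```

In this sketch the regression branch would be trained to output `targets` for the pixel, while the Gaussian threshold decides whether the pixel takes part in the loss at all.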

Gaussian Kernel for allocation strategy
For a feature map $F_i$, every pixel is linearly mapped back to the original image. We make full use of the width and height of all ground-truth boxes in the feature layer. A 2D Gaussian function is then used to generate the Gaussian matrix. If there are m objects in an image, an all-zero matrix with m channels is created first. The size of each Gaussian matrix is the same as that of its ground truth, so the generated Gaussian matrix can replace the zeros at the position of the ground truth. This also lets the network pay more attention to higher values of the Gaussian distribution, for faster convergence and high recall. We use the central prior that the center point of a ground-truth box is more important than the points surrounding it. Here m is the number of objects in an image, and α and β are variable parameters of the Gaussian that differ between objects. We then give different Gaussian matrix weights to the center point and its surrounding pixels. Finally, we choose the points whose Gaussian matrix values are greater than a certain threshold as positive samples.
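The allocation step can be sketched as follows. The exact Gaussian expression is not reproduced in this text, so the form exp(-((x - cx)^2 / (2α^2) + (y - cy)^2 / (2β^2))) used below is an assumption, with α and β treated as standard deviations along the width and height. The defaults α = width / 3, β = height / 3, and threshold 0.75 come from the implementation details of the paper; the function names are hypothetical.

```python
import math


def gaussian_matrix(width, height, alpha=None, beta=None):
    """Return a height x width matrix of Gaussian weights for one ground-truth box.

    Assumed form: exp(-((x - cx)^2 / (2 alpha^2) + (y - cy)^2 / (2 beta^2))),
    centered on the box center, so the center weight is 1.0.
    """
    alpha = width / 3.0 if alpha is None else alpha
    beta = height / 3.0 if beta is None else beta
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    return [
        [
            math.exp(-((x - cx) ** 2 / (2 * alpha ** 2)
                       + (y - cy) ** 2 / (2 * beta ** 2)))
            for x in range(width)
        ]
        for y in range(height)
    ]


def positive_mask(matrix, threshold=0.75):
    """Mark pixels whose Gaussian weight exceeds the threshold as positive."""
    return [[value > threshold for value in row] for row in matrix]


# A 9 x 5 box on the feature map (width x height), chosen for illustration.
g = gaussian_matrix(9, 5)
mask = positive_mask(g)
num_positive = sum(sum(row) for row in mask)
for row in mask:
    print("".join("#" if m else "." for m in row))
```

Because α and β scale with the box, the positive region stretches along the longer side of the box instead of staying a fixed square around the center.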

Loss function
We introduce SIoU into our network for faster convergence, using relative position coordinates in place of absolute position coordinates. Because we use the Gaussian kernel sampling strategy, our positive and negative samples have higher confidence. Since our positive samples surround the centers of the ground-truth boxes, we can selectively reduce the contribution of the term that governs the movement of anchor boxes and pay more attention to shape optimization. The predicted anchor boxes already have an advantage in location, so we should focus on the IoU information. This is a unique advantage of our sampling strategy. Classification and regression are optimized independently, which causes inconsistent predictions. For example, the network may predict a high classification score for a target whose regression is poor, which is not a good prediction result. To address this problem, we introduce a Harmonic loss to harmonize the training of the classification and regression branches and strengthen the correlation between classification score and localization accuracy [20]. Because only positive samples participate in the loss calculation, the Harmonic loss is defined for a given positive sample $i$ with two mutually restrictive weighting factors, $\beta_r$ and $\beta_c$. Here $p_i$ and $y_i$ denote the predicted classification score and the corresponding ground-truth class, and $d_i$ and $\hat{d}_i$ denote the output regression distance and the target distance. CE(·) is the classification loss and $L_r(\cdot)$ is the regression loss. If the classification of a positive sample is already good, that is, its classification loss is relatively small, the network pays more attention to the regression of the anchor boxes; similarly, if the regression of a positive sample point is good, the network pays more attention to classification prediction. The two branches thus coordinate with each other to improve the accuracy of classification and regression.
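The mutual weighting described above can be illustrated with a short sketch. The defining equation is not reproduced in this text, so the form below is an assumption modeled on the harmonic loss of [20]: each branch is weighted by one plus the exponential of the negative loss of the other branch, that is, $\beta_r = e^{-L_r}$ weights the classification term and $\beta_c = e^{-\mathrm{CE}}$ weights the regression term. The function names are hypothetical.

```python
import math


def harmonic_weights(cls_loss, reg_loss):
    """Return (weight on classification, weight on regression).

    Assumed form: beta_r = exp(-reg_loss) and beta_c = exp(-cls_loss),
    so a branch that is already good raises the weight of the other branch.
    """
    beta_r = math.exp(-reg_loss)
    beta_c = math.exp(-cls_loss)
    return 1.0 + beta_r, 1.0 + beta_c


def harmonic_loss(cls_loss, reg_loss):
    """Harmonic loss of one positive sample under the assumed form."""
    w_cls, w_reg = harmonic_weights(cls_loss, reg_loss)
    return w_cls * cls_loss + w_reg * reg_loss


# Good classification (small CE) but poor regression: regression gets more weight.
w_cls, w_reg = harmonic_weights(cls_loss=0.1, reg_loss=2.0)
print(w_cls, w_reg, harmonic_loss(0.1, 2.0))
```

With a small classification loss and a large regression loss, the regression weight is close to 2 while the classification weight stays close to 1, which matches the behavior described in the text.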
The total loss $L$ consists of CE(·), $L_r(\cdot)$, and the center-ness loss $L_{ctr}(\cdot)$. In the definition of $L_{ctr}$, $l$, $t$, $r$, and $b$ are the distances from the location to the four sides of the bounding box, and $N$ is the number of positive and negative samples.
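For reference, the sketch below computes the standard FCOS center-ness target from (l, t, r, b) and a binary cross-entropy loss against it. Both the use of the unmodified FCOS target and the averaging over the listed samples (as a stand-in for the normalization by N) are assumptions, since the defining equations are not reproduced in this text; the function names are hypothetical.

```python
import math


def centerness_target(l, t, r, b):
    """Standard FCOS center-ness: 1.0 at the box center, decaying toward the sides."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def bce(pred, target, eps=1e-12):
    """Binary cross-entropy between a predicted and a target center-ness in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def centerness_loss(preds, targets):
    """Average BCE over the given samples (assumed normalization)."""
    return sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)


center = centerness_target(5, 5, 5, 5)
off_center = centerness_target(64, 56, 96, 64)
print(center, off_center, centerness_loss([0.9, 0.6], [center, off_center]))
```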

Data set
We combine the loss functions described above and the Gaussian kernel sampling strategy with the popular one-stage detector FCOS, and conduct experiments on the public m2cai16-tool-locations data set. This surgical instrument data set contains 2532 images in total, labeled with seven tool categories: grasper, bipolar, hook, scissors, clip applier, irrigator, and specimen bag. We divide the data set in a ratio of 5:3:2 for training, validation, and testing.
The bold indicates the best performance in each column.
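A 5:3:2 split of 2532 images can be sketched as below. The paper does not state whether the images were shuffled or how fractional counts were rounded, so the seeded shuffle, the floor rounding, and the function name are all assumptions made for illustration.

```python
import random


def split_indices(n_images, ratios=(5, 3, 2), seed=0):
    """Split image indices into train/val/test parts by integer ratios.

    Assumptions: a seeded shuffle, floor rounding for train and val,
    and the remainder assigned to the test part.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_images * ratios[0] // total
    n_val = n_images * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(2532)
print(len(train), len(val), len(test))
```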

Sample allocation strategy
We consider three allocation strategies: Rectangular Area, Rectangular Sub-Area, and our Gaussian Area, all of which take the pixels in a specific region (gray area) as positive samples, as shown in Fig. 2. If all pixels in the rectangular area of a ground-truth box are regarded as positive samples, many low-quality anchors are produced, which burdens the network. If we select only some points in a fixed center area as positive samples, the fixed area damages network performance, because the network cannot make good use of the ground-truth information and treats all pixels as equally important. In contrast, our method recognizes that not all points have the same importance, and it can choose more high-confidence positive samples for objects with an unusual width/height ratio. The results support this allocation strategy: FCOS, which adopts the central sampling strategy, reaches an mAP of 0.946, whereas we reach 0.951 merely by changing the allocation strategy (Table 3). As shown in Table 1, the effect is more obvious on the PASCAL VOC data set, because that data set is rich in targets of different kinds and proportions. However, the threshold should be fine-tuned for each data set.
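To make the difference between the three strategies concrete, the sketch below counts the positive pixels each one selects for an elongated 21 x 5 box on a feature map. The Gaussian expression and its parameters (α = width / 3, β = height / 3, threshold 0.75) follow the same assumptions as the paper's implementation details; the fixed radius of 1.5 pixels for the sub-area and all function names are illustrative assumptions.

```python
import math


def rectangular_area(w, h):
    """Every pixel inside the box is a positive sample."""
    return {(x, y) for x in range(w) for y in range(h)}


def rectangular_sub_area(w, h, radius=1.5):
    """Only a fixed square around the center is positive (assumed radius)."""
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    return {(x, y) for x in range(w) for y in range(h)
            if abs(x - cx) <= radius and abs(y - cy) <= radius}


def gaussian_area(w, h, threshold=0.75):
    """Pixels whose assumed Gaussian weight exceeds the threshold are positive."""
    alpha, beta = w / 3.0, h / 3.0
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    return {(x, y) for x in range(w) for y in range(h)
            if math.exp(-((x - cx) ** 2 / (2 * alpha ** 2)
                          + (y - cy) ** 2 / (2 * beta ** 2))) > threshold}


w, h = 21, 5  # an elongated instrument, wider than tall
rect = rectangular_area(w, h)
sub = rectangular_sub_area(w, h)
gauss = gaussian_area(w, h)
print(len(rect), len(sub), len(gauss))
```

For this box the Gaussian region is an ellipse that extends along the long axis, so it keeps more central pixels than the fixed square while still discarding the corners and ends of the rectangle.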

Implementation details
Our experiments were executed on an NVIDIA RTX 3090 GPU with a total batch size of 4 during the training and testing stages; only the evaluation stage was executed on the CPU. As shown in Table 2, the training time of one-stage object detection algorithms is shorter than that of two-stage algorithms. Unless specified, ResNet-50 is used as our backbone network and the same hyper-parameters as FCOS are used. The optimizer is stochastic gradient descent (SGD) with an initial learning rate of 0.002, which is reduced by a factor of 10 at iterations 70K and 100K. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We initialize our backbone network with weights pretrained on ImageNet. α is equal to one third of the width of the Gaussian matrix and β is equal to one third of its height. The threshold of the Gaussian matrix is 0.75, and input images are resized so that their shorter side is 800 and their longer side is 1333.
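The step schedule described above can be written as a small function. The values (initial rate 0.002, a factor of 10 at iterations 70K and 100K, momentum 0.9, weight decay 0.0001, batch size 4) are taken from the text; the dictionary layout and the function name are hypothetical and do not correspond to any particular framework's configuration format.

```python
# Hyper-parameters as stated in the text, collected in a hypothetical layout.
TRAIN_CONFIG = {
    "batch_size": 4,
    "base_lr": 0.002,
    "lr_milestones": (70_000, 100_000),
    "lr_gamma": 0.1,
    "momentum": 0.9,
    "weight_decay": 0.0001,
    "gaussian_threshold": 0.75,
}


def learning_rate(iteration, config=TRAIN_CONFIG):
    """Step decay: multiply the base rate by gamma once per milestone passed."""
    passed = sum(1 for m in config["lr_milestones"] if iteration >= m)
    return config["base_lr"] * config["lr_gamma"] ** passed


for it in (0, 69_999, 70_000, 100_000):
    print(it, learning_rate(it))
```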

Comparison with object detection methods
To further validate our method, we compare it with some popular object detection methods, including Anchor-based and Anchor-free algorithms. Two-stage detection algorithms [9,21] usually have high detection accuracy, but they are slow.
One-stage detection algorithms [22][23][24] have a speed advantage. As shown in Table 2, we improve the accuracy of Anchor-free algorithms while retaining this speed advantage. FCOS already has high accuracy in identifying surgical instruments, yet introducing the Gaussian kernel into FCOS improves the accuracy further. Applying our Gaussian kernel to the above Anchor-free algorithms yields improvements of varying size. Through a series of experiments, we found that although our FPS cannot match that of some speed-oriented one-stage algorithms, our accuracy is significantly higher than theirs. Compared with FCOS, our FPS is slightly higher. Our method still has room for improvement in inference speed. In Fig. 3a, the image on the top shows the detection results of FCOS. Under the same number of training rounds, FCOS produces a false detection, treating the background as a Specimen Bag, whereas our method avoids this error. As shown in Fig. 3b, our method accurately identifies the surgical instrument category and marks the target box accurately, which meets the expectations of practical use.

Ablation study
To prove the effectiveness of our sample allocation, we conduct several ablation experiments. As shown in Table 3, to show clearly how each part of the module affects detection accuracy, we add the components step by step and use the mAP evaluation metric. Our network structure is ResNet-50 without any attention mechanism. The positive and negative sample allocation strategy of the basic method is central sampling, in which only a small fixed part at the center of a ground-truth box is treated as positive, ignoring the width and height of the box. GK means that the Gaussian kernel is used to select positive and negative samples, with a Gaussian matrix threshold of 0.75. SIoU means that the regression loss uses SIoU with relative coordinates instead of IoU. HL means that the total loss applies the Harmonic loss to avoid the problem that classification and regression are not synchronized. The Basic+GK result shows that using the Gaussian kernel is effective for sample allocation. Basic+SIoU improves on the basic method, and the network converges faster. Basic+SIoU+HL shows remarkable classification and regression performance on the data set. As shown in Table 4, we first tuned the threshold experimentally until the number of positive samples approached that of the basic method, arriving at a threshold of 0.56, which avoids an extreme imbalance between positive and negative samples. However, the detection performance was not as expected. After analysis, we found that too many positive samples led to too many low-quality anchors. In subsequent experiments, we increased the threshold and found that the detection performance improved significantly, because the low-confidence sample points in the Gaussian matrix are filtered out.
As the threshold increases further, the detection performance stops changing once the threshold reaches about 0.8. Increasing the threshold beyond that point degrades detection because of the imbalance between positive and negative samples: the number of positive samples becomes too small, which makes learning poor and the network unstable.
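The trade-off discussed above can be seen by counting how many pixels of a single box survive each threshold. The sketch assumes the same Gaussian expression as before (α = width / 3, β = height / 3); the box size and the list of thresholds beyond 0.56, 0.75, and 0.8 are illustrative, and the counts say nothing about the mAP values reported in Table 4.

```python
import math


def count_positives(width, height, threshold):
    """Count pixels of one box whose assumed Gaussian weight exceeds the threshold."""
    alpha, beta = width / 3.0, height / 3.0
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    count = 0
    for y in range(height):
        for x in range(width):
            value = math.exp(-((x - cx) ** 2 / (2 * alpha ** 2)
                               + (y - cy) ** 2 / (2 * beta ** 2)))
            if value > threshold:
                count += 1
    return count


thresholds = [0.56, 0.75, 0.80, 0.90, 0.99]
counts = [count_positives(21, 9, t) for t in thresholds]
for t, c in zip(thresholds, counts):
    print(f"threshold={t:.2f} positives={c}")
```

The number of positives shrinks monotonically as the threshold rises, down to the single center pixel at very high thresholds, which is the regime where the text reports unstable training.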

Conclusion
In this paper, we point out that the sub-region for selecting positive samples in an object should be flexible rather than fixed [19]. Our analysis shows that different sampling strategies lead to different training accuracy. To address this, we propose to integrate the Gaussian kernel into the ground truth, which makes full use of the ground-truth information, facilitates rapid convergence of the network, and improves accuracy. At the same time, SIoU is used to accelerate the convergence of the model and improve its accuracy. We introduce a Harmonic loss to harmonize the training of the classification and regression branches and strengthen the correlation between classification score and localization accuracy. In summary, a series of comparative experiments demonstrates that our method achieves high accuracy in surgical tool detection and meets the expectations for improvement. We hope our work can help medical staff detect instruments more efficiently and accurately and thereby avoid medical accidents.
The data are available in the public domain: m2cai16-tool-locations, http://ai.stanford.edu/~syyeung/resources/m2cai16-tool-locations.zip.