A Small Target Detection Research Based on Dynamic Convolution Neural Network

Because the detection performance of the EfficientNet-YOLOv3 target detection algorithm is unsatisfactory, this paper proposes a small target detection method based on a dynamic convolutional neural network. Firstly, a dynamic convolutional neural network is introduced to replace the traditional convolutional neural network, which makes the algorithm model more robust. Secondly, the optimization parameters are continuously adjusted during training to further strengthen the model. Finally, to prevent overfitting, the learning rate and batch size are modified during training. On the RSOD remote sensing image data set, the proposed algorithm increases the mean Average Precision (mAP) by 1.93% and reduces the mean Log-Average Miss Rate (mLAMR) by 0.0500 compared with the original EfficientNet-YOLOv3 algorithm. On the TGRS-HRRSD remote sensing image data set, it increases the mAP by 0.07% and reduces the mLAMR by 0.0007 compared with the original EfficientNet-YOLOv3 algorithm.


1. Introduction
The emergence of deep learning has promoted the development of computer vision. Target detection in remote sensing images is an important direction in this field and can be applied to urban planning, traffic safety and environmental monitoring [1][2][3][4][5]. Remote sensing target detection algorithms are derived from general-purpose target detection algorithms and adapted to the characteristics of remote sensing images: multiple scales, dense targets and large differences in shape. Traditional remote sensing target detection algorithms mainly adopt a multi-step strategy, such as preprocessing, image segmentation, region of interest extraction and target detection [6][7]. The steps are independent of one another, and each uses an algorithm designed for a specific problem, so the pipeline is rigid and insufficiently automated and intelligent [8]. Thanks to the strong feature representation and end-to-end learning abilities of deep learning, the performance of target detection algorithms for remote sensing images has been greatly improved.
Remote sensing image data sets are characterized by small targets and complex backgrounds, so target detection in remote sensing images still faces major challenges. Although many target detection algorithms achieve relatively good results on ordinary data sets, their detection accuracy on remote sensing images is unsatisfactory. Chen et al. [9] proposed a detection algorithm based on multi-class learning, which detects ships accurately by finding the minimum circumscribed rectangle of the target. Ding et al. [10] proposed the region of interest transformer (RoI Transformer) to handle remote sensing targets with pronounced orientation. Xu et al. [11] proposed the gliding vertex method, which characterizes an oriented bounding box by first detecting the horizontal bounding box and then learning the offsets of its four corners. Yang et al. [12] proposed a detector for dense, rotated small targets: a sampling fusion network fuses multi-layer features with effective anchor sampling to improve sensitivity to small targets; a supervised pixel attention network and a channel attention network suppress noise and highlight object features; and an Intersection over Union (IoU) factor is added to the smooth L1 loss to address inaccurate boundaries of rotated bounding boxes. Aiming at the problems of large aspect ratios and category imbalance of target instances in remote sensing images, Lin et al. [13] proposed a refined single-stage detector, which achieves feature reconstruction and alignment through a feature refinement module and performs target detection by stepwise regression from coarse-grained to fine-grained. Yao et al. [14] proposed a target detection algorithm of remote

2. EfficientNet-YOLOv3 algorithm model

EfficientNet backbone feature extraction network
The EfficientNet backbone feature extraction network results from a study carried out under a fixed computing budget: for networks built from the same type of operation, the influence of network depth, width and input resolution is explored to find the optimal scaling ratios. The backbone processes an image as follows. Firstly, a 3*3 convolution layer transforms the image into the input dimensions required by the Mobile inverted Bottleneck Convolution (MBConv) modules, and a series of MBConv modules then extracts feature maps from the input remote sensing image. Secondly, the parameters of each MBConv module are fine-tuned to suit the current application; this combined scaling optimization gives the network a better receptive field. Thirdly, using an adaptive feature map connection based on a Fully Convolutional Neural Network (FCNN), a 1*1 convolution adapts feature maps of different sizes and unifies them into the dimensions required by the algorithm. Finally, image classification, recognition and detection are completed using the output feature maps. The structure of the EfficientNet backbone feature extraction network is shown in Figure 1.
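As a rough illustration of how depth, width and resolution can be scaled jointly under a compute budget, the sketch below applies the compound scaling rule described in the original EfficientNet work, with the base coefficients 1.2, 1.1 and 1.15 reported there. The function names and the choice of the scaling coefficient phi are illustrative assumptions and are not taken from this paper.

```python
# Compound scaling sketch: one coefficient phi scales depth, width and
# resolution together. ALPHA, BETA and GAMMA are the base coefficients
# reported for EfficientNet; the function names are hypothetical.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float) -> dict:
    """Return depth/width/resolution multipliers for scaling coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi: float) -> float:
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    s = compound_scale(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2


for phi in range(4):
    print(phi, compound_scale(phi), round(flops_factor(phi), 2))
```

With these coefficients, each unit increase of phi roughly doubles the computation, which is what keeps the comparison between depth, width and resolution fair under a fixed budget.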

YOLOv3 model structure
The YOLOv3 model has no pooling layers and no fully connected layers. Tensor sizes change during forward propagation only through the stride of the convolution kernels, and the backbone feature extraction network reduces the output feature map to 1/32 of the input size. YOLOv3 is a fully convolutional network that makes extensive use of residual skip connections. It avoids pooling layers to reduce the negative effect of pooling on gradients and instead downsamples with convolutions of stride 2. To improve accuracy on small targets, YOLOv3 adopts an upsampling and fusion scheme similar to Feature Pyramid Networks (FPN). The three prediction branches are fully convolutional. There are 9 anchor boxes in total across 3 outputs, each output uses 3 anchors, and so 3 boxes are predicted at each position. The model structure of YOLOv3 is shown in Figure 2.
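To make the output sizes concrete, here is a minimal sketch of the three prediction scales. It assumes the usual YOLOv3 branch strides of 32, 16 and 8 and a square input; only the 1/32 factor is stated in the text, and the function name is illustrative.

```python
STRIDES = (32, 16, 8)    # assumed downsampling factors of the three branches
ANCHORS_PER_SCALE = 3    # 9 anchor boxes in total, 3 per output


def yolo_grids(input_size: int) -> list:
    """Return (grid_size, predicted_boxes) for each prediction branch."""
    grids = []
    for stride in STRIDES:
        g = input_size // stride
        grids.append((g, g * g * ANCHORS_PER_SCALE))
    return grids


print(yolo_grids(416))  # [(13, 507), (26, 2028), (52, 8112)]
print(sum(boxes for _, boxes in yolo_grids(416)))  # 10647
```

The finest grid contributes most of the predicted boxes, which is why the upsampling and fusion branches matter for small targets.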

Dynamic convolutional neural network
As shown in Figure 3, the goal of the dynamic convolutional neural network is to provide a better trade-off between network performance and computational cost for efficient neural networks. Dynamic convolution does not increase the depth or width of the network; it improves model performance by using attention to aggregate multiple convolution kernels. The kernels share the same kernel size and the same input and output dimensions and are aggregated using attention weights; batch normalization and an activation function follow the aggregated convolution to form a dynamic convolution layer. Because the convolution kernels are small, the extra cost added to each layer stays within a controllable range. A traditional convolution layer can therefore easily be replaced by a dynamic one. In a traditional convolutional neural network the convolution parameters are fixed for every input, whereas dynamic convolution makes the convolution parameters of each layer change with the input during inference. For a feature map generated in the convolution process, the feature map is first processed to produce K attention weights π that sum to 1, and the K sets of convolution kernel parameters are then combined as a linear sum weighted by π. In this way the convolution changes as the input changes.
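The aggregation step can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: kernels are flattened to plain lists, the attention scores (logits) are taken as given because the text does not specify how they are computed from the feature map, and a softmax is assumed as the way to obtain weights that sum to 1. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_kernels(kernels, logits):
    """Combine K same-shaped kernels into one input-dependent kernel.

    kernels: list of K kernels, each flattened to a list of floats.
    logits:  K attention scores derived from the input feature map.
    """
    pi = softmax(logits)  # attention weights, sum to 1
    size = len(kernels[0])
    return [sum(p * k[i] for p, k in zip(pi, kernels)) for i in range(size)]


# Two toy kernels with equal attention give their element-wise average.
print(aggregate_kernels([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

Only the small aggregated kernel is applied to the feature map, so the convolution itself costs the same as a single static convolution; the overhead lies in computing the weights and the weighted sum.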

Improved YOLOv3 model structure
As shown in Figure 4, the input image size of the EfficientNet backbone feature extraction network is first changed from 416*416*3 to 800*800*3. In the residual network, a convolution with a 3*3 kernel and a stride of 2 compresses the width and height of the input feature layer to obtain a new feature layer. Secondly, a 1*1 convolution and a 3*3 convolution are applied to this feature layer, and the result is added back to the feature layer to form a residual structure. Finally, the network is greatly deepened by stacking 1*1 convolutions, 3*3 convolutions and residual edges. The two-dimensional convolution layers (Conv2D) are replaced by dynamic convolution layers (Dynamic_Conv2D).

Figure 4. Improved YOLOv3 algorithm model
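The effect of the larger input on feature map sizes can be checked with the standard convolution output-size formula. The sketch below assumes a padding of 1 for the 3*3, stride-2 convolution and five downsampling stages (matching an overall reduction to 1/32); neither value is stated explicitly in the text, and the function names are illustrative.

```python
def conv_out_size(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def downsample_chain(n: int, stages: int = 5) -> list:
    """Sizes after each of `stages` successive 3*3, stride-2 convolutions."""
    sizes = []
    for _ in range(stages):
        n = conv_out_size(n)
        sizes.append(n)
    return sizes


print(downsample_chain(416))  # [208, 104, 52, 26, 13]
print(downsample_chain(800))  # [400, 200, 100, 50, 25]
```

Under these assumptions, moving from 416 to 800 pixels nearly doubles the side length of every feature map, so a small target covers more grid cells at each scale.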

Experimental results and analysis
To verify the effectiveness of the improved EfficientNet-YOLOv3 algorithm, experiments are carried out on the RSOD remote sensing image data set labeled by Wuhan University and the TGRS-HRRSD remote sensing image data set labeled by the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences. Five algorithms are used for comparison: Faster-RCNN, RetinaNet, the single-stage detector SSD, YOLOv4 and the original EfficientNet-YOLOv3. To analyze the performance of the proposed algorithm objectively, Average Precision (AP) and Log-Average Miss Rate (LAMR) are used as evaluation indexes, together with their means over all classes, mAP and mLAMR. The larger the mAP value the better, and the smaller the mLAMR value the better.
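For reference, AP and mAP can be computed along the following lines. The paper does not state which AP variant it uses, so this sketch assumes the all-point interpolated area under the precision-recall curve, as in later PASCAL VOC evaluations; the function names and the toy numbers are illustrative and are not results from this paper.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve with all-point interpolation.

    recalls must be sorted in increasing order; precisions[i] is the
    precision measured at recalls[i].
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class: dict) -> float:
    """mAP is the arithmetic mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
print(average_precision([0.5, 1.0], [1.0, 0.5]))  # 0.75
```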

Experimental environment configuration
The experimental platform is configured as follows. The computer has an i5-8250 CPU, 8 GB of RAM and the 64-bit Windows 10 operating system. The server is BSCC-N22, whose GPU queue is configured so that each machine has 8 NVIDIA Tesla V100-SXM2 GPUs with 32 GB of video memory each; by default, each GPU card is allocated 8 CPU cores and 36 GB of memory.

4.2. Experimental results of RSOD remote sensing image data set
4.2.1. RSOD remote sensing image data set
As shown in Figure 5, the RSOD remote sensing image data set is an open data set used for target detection in remote sensing images. The sample images include: Figure 5(a) aircraft, Figure 5

Table 1 shows the per-class AP values obtained by all algorithms in this paper; the last row gives the mAP value of each method over all classes. The mAP values in the last row show that the proposed algorithm achieves the highest mAP among all compared algorithms. Its mAP is 1.93% higher than that of the original EfficientNet-YOLOv3 algorithm, 21.18% higher than that of Faster-RCNN, 3.50% higher than that of RetinaNet, 5.32% higher than that of SSD and 4.21% higher than that of YOLOv4, which shows that the proposed algorithm achieves good results in target detection on remote sensing images.

mLAMR value is used as evaluation index
Target detection algorithms for remote sensing images generally use the miss rate (MR). Its logarithmic average, abbreviated LAMR, is used as the evaluation standard.

Table 2. Experimental results of mLAMR value of RSOD remote sensing image data set

Table 2 shows the per-class LAMR values obtained by all algorithms in this paper. The last row gives the mLAMR value, obtained by averaging the LAMR values of all classes for each algorithm. The proposed algorithm achieves the lowest mLAMR among all compared algorithms: its mLAMR is 0.0500 lower than that of the original EfficientNet-YOLOv3 algorithm, 0.2550 lower than that of Faster-RCNN, 0.0225 lower than that of RetinaNet, 0.1150 lower than that of SSD and 0.0300 lower than that of YOLOv4. This again shows that the proposed algorithm achieves good results in target detection.
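For readers unfamiliar with LAMR, the sketch below shows one common way to compute it. It assumes the definition widely used in detection benchmarks: the miss rate is sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space between 0.01 and 1, and the geometric mean of those samples is taken. The paper does not give its exact formula, so this definition and the function name are assumptions.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI values.

    fppi:      false positives per image, sorted in non-decreasing order.
    miss_rate: miss rate (1 - recall) at each corresponding fppi value.
    """
    # Reference FPPI values evenly spaced in log space over [1e-2, 1e0].
    refs = [10 ** (-2.0 + 2.0 * i / (num_points - 1)) for i in range(num_points)]
    samples = []
    for ref in refs:
        # Miss rate at the largest fppi not exceeding ref; 1.0 if none exists.
        mr = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m
            else:
                break
        samples.append(max(mr, 1e-10))  # guard against log(0)
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


# A detector whose miss rate is 0.25 at every sampled FPPI has LAMR 0.25.
print(log_average_miss_rate([0.001], [0.25]))
```

The mLAMR reported in the tables is then the arithmetic mean of the per-class LAMR values.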

TGRS-HRRSD remote sensing image data set
As shown in Figure 6, the TGRS-HRRSD remote sensing image data set was produced by the optical image analysis and learning center of the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, for studying target detection in high-resolution remote sensing images. The data set is provided in PASCAL_VOC format. The sample images include: Figure 6(a) airplane, Figure 6(b) bridge, Figure 6(c) crossroad, Figure 6(d) ship, Figure 6(e) vehicle, Figure 6(f) harbor, Figure 6(g) ground track field, Figure 6(h) storage tank, Figure 6(i) basketball court, Figure 6(j) parking lot, Figure 6

Table 3. Experimental results of mAP value of TGRS-HRRSD remote sensing image data set

Table 3 shows the per-class AP values obtained by all algorithms in this paper. The last row gives the mAP value, obtained by averaging the AP values of all classes for each method. The mAP values in the last row show that the proposed algorithm achieves the highest mAP among all compared algorithms. Its mAP is 0.07% higher than that of the original EfficientNet-YOLOv3 algorithm, 31.95% higher than that of Faster-RCNN, 2.36% higher than that of RetinaNet, 8.54% higher than that of SSD and 10.05% higher than that of YOLOv4, which further shows that the proposed algorithm performs well in target detection on remote sensing images.

mLAMR value is used as evaluation index
Target detection algorithms for remote sensing images generally use the logarithmic average of the miss rate, abbreviated LAMR, as the evaluation standard.

Table 4. Experimental results of mLAMR value of TGRS-HRRSD remote sensing image data set

Table 4 shows the per-class LAMR values obtained by all algorithms in this paper. The last row gives the mLAMR value, obtained by averaging the LAMR values of all classes for each algorithm. The proposed algorithm achieves the lowest mLAMR among all compared algorithms: its mLAMR is 0.0007 lower than that of the original EfficientNet-YOLOv3 algorithm, 0.4108 lower than that of Faster-RCNN, 0.0300 lower than that of RetinaNet, 0.1608 lower than that of SSD and 0.1300 lower than that of YOLOv4. This again shows that the proposed algorithm achieves good results in target detection.

Conclusion and Prospect
This paper presents a small target detection method based on a dynamic convolutional neural network and compares it with the EfficientNet-YOLOv3 algorithm. Without increasing the depth or width of the network, the method improves the mAP and reduces the mLAMR of target detection in remote sensing images; that is, it improves detection accuracy, reduces the missed detection rate and enhances the stability of the algorithm. The proposed algorithm achieves good results on the RSOD and TGRS-HRRSD remote sensing image data sets, but it has two disadvantages. Firstly, it is built on the EfficientNet-YOLOv3 target detection algorithm and replaces the traditional convolutional neural network with a dynamic convolutional neural network, which increases the complexity of the network model. Secondly, it is designed only for target detection in remote sensing images, so it may not perform well on other data sets and its generality is weak. In future work, a lightweight backbone feature extraction network will be considered to reduce the complexity of the network model, minimize the number of model parameters and improve the detection performance of target detection algorithms for remote sensing images.