Small Object Detection Research Based on a Dynamic Convolutional Neural Network

Abstract: Because the EfficientNet-YOLOv3 object detection algorithm performs poorly on small objects, this paper proposes a small object detection method based on a dynamic convolutional neural network. Firstly, a dynamic convolutional neural network is introduced to replace the traditional convolutional neural network, which makes the model more robust; secondly, the optimization parameters are continuously adjusted during training to further strengthen the model; finally, the learning rate and batch size are modified during training to prevent overfitting. To verify the effectiveness of the proposed algorithm, the RSOD and TGRS-HRRSD remote sensing image data sets are used for evaluation. On the RSOD data set, compared with the original EfficientNet-YOLOv3 algorithm, the mean Average Precision (mAP) is increased by 1.93% and the mean Log Average Miss Rate (mLAMR) is reduced by 0.0500; on the TGRS-HRRSD data set, the mAP is increased by 0.07% and the mLAMR is reduced by 0.0007. In the improved model, the two-dimensional convolution Conv2D is replaced by the dynamic convolution Dynamic_Conv2D.


Introduction
The emergence of deep learning technology has promoted the development of computer vision. Object detection in remote sensing images is an important direction in this field, with applications in urban planning, traffic safety and environmental monitoring [1][2][3][4][5]. Object detection algorithms for remote sensing images derive from general-purpose object detection algorithms, adapted to the multi-scale, dense and shape-varying characteristics of remote sensing imagery. Traditional remote sensing image object detection algorithms mainly adopt a multi-step strategy, such as preprocessing, image segmentation, extraction of regions of interest and object detection [6][7]. Each step is an independent algorithm designed for a specific problem, which leads to rigid pipelines and insufficient automation and intelligence [8]. Deep learning, with its strong feature representation ability and end-to-end learning, therefore greatly improves the performance of object detection algorithms on remote sensing images.
Remote sensing images typically contain small objects against complex backgrounds, which makes object detection in such images challenging. Although many object detection algorithms perform well on ordinary data sets, their accuracy on remote sensing images is limited. Chen et al. [9] proposed a detection algorithm based on multi-classification learning, which accurately detects ships by finding the minimum circumscribed rectangle of the object; Ding et al. [10] proposed the Region of Interest Transformer (RoI Trans) to handle remote sensing objects with significant directionality; Xu et al. [11] proposed the gliding vertex method, which characterizes an oriented bounding box by first detecting a horizontal bounding box and then learning the offsets of its four corners; Yang et al. [12] proposed a dense rotating small object detector: a sampling fusion network fuses multi-layer features into effective anchor sampling to improve sensitivity to small objects; a supervised pixel attention network and a channel attention network are designed to suppress noise and highlight object characteristics; and an Intersection Over Union (IOU) factor is added to the smooth L1 loss to address the inaccurate boundaries of rotating bounding boxes; aiming at the large aspect ratios and category imbalance of object instances in remote sensing images, Lin et al. [13] proposed a fine-tuned single-stage detector, which performs feature reconstruction and alignment with a feature refinement module and detects objects by stepwise regression from coarse to fine granularity; Yao et al. [14] proposed a remote sensing image object detection algorithm based on multi-resolution feature fusion; Wang et al.
[15] proposed a region proposal method combining multi-level features; Yao et al. [16] proposed a multi-scale convolutional neural network framework for remote sensing object detection; Li et al. [17] proposed an object detection algorithm for visual perception of high-resolution remote sensing images; Chen et al. [18] proposed a fast ship detection method for large-scene remote sensing images based on a cascaded convolutional neural network; in view of the poor detection performance and slow inference of RoI Trans on remote sensing images with densely arranged, strongly oriented objects, Zhao et al. [19] progressively refined the localization accuracy of rotating candidate boxes and applied non-local feature enhancement to improve detection performance; Cai et al. [20] proposed progressive shrinking, a generalized pruning method that reduces model size in more dimensions (depth, width, kernel size and resolution) than conventional pruning; Chen et al. [21] proposed a dynamic convolutional neural network; Liu et al. [22] proposed assembling multiple identical backbones through composite connections between adjacent backbones to form a more powerful backbone, called the composite backbone network; Hua et al. [23] proposed a real-time cascade convolutional neural network detection framework combining visual perception with convolutional memory network reasoning; Li et al. [24] proposed an end-to-end deep network that detects salient objects in optical images in a purely data-driven manner; Zhang et al. [25] proposed a scale-adaptive network to improve multi-object detection accuracy; Zhu et al. [26] proposed an object detection method based on a multi-scale SELU-DenseNet and a Dynamic Anchor Assignment (DAA) strategy; Yao et al.
[27] proposed an online latent semantic scatter method; Pang et al. [28] proposed R²-CNN, a unified self-reinforced convolutional neural network for remote sensing regions, composed of a lightweight backbone, an intermediate global attention block, and a final classifier and detector; Chen et al. [29] proposed an object heat map network; Yao et al. [30] proposed a new computational model for airport detection in optical Remote Sensing Images (RSI); Yu et al. [31] proposed a Chan-Vese (CV) method based on image decomposition and a distance regularization model; Li et al. [32] surveyed ship detection and classification methods based on optical remote sensing images, together with their remaining problems and future trends; Youme et al. [33] proposed dividing the image into four regions, labeling it through regions of interest, and extracting image features with the single-stage detector SSD to improve small object detection accuracy.
This paper improves the EfficientNet-YOLOv3 algorithm [34], whose backbone feature extraction network is EfficientNet [35]. Firstly, following the dynamic convolutional neural network proposed by Chen et al. [21], the two-dimensional convolutional layers of the backbone feature extraction network are replaced by dynamic convolutional layers, in which a dynamic convolution kernel aggregates multiple parallel convolution kernels according to the input; this improves the backbone and strengthens its feature extraction ability on remote sensing images. Secondly, the parameters are adjusted and optimized to improve feature extraction for small and medium-sized objects in remote sensing images. Finally, the learning rate and batch size are modified to prevent overfitting during training. The validity of the proposed algorithm is verified on the RSOD and TGRS-HRRSD remote sensing image data sets.

EfficientNet backbone feature extraction network
The EfficientNet backbone feature extraction network was designed under fixed computing power constraints: the influence of network depth, width and resolution on networks of the same operation type is explored to find the optimal scaling proportions. The EfficientNet pipeline proceeds as follows: firstly, the image is transformed by the first Conv 3*3 convolution layer into the input dimensions required by the Mobile Inverted Bottleneck Convolution (MBConv) module, and feature maps are extracted from the input remote sensing image through a series of MBConv modules; secondly, the parameters of each MBConv module are finely adjusted to the current setting, so that the compound scaling method gives the network a better receptive field; finally, using a feature map adaptive connection method based on a Fully Convolutional Neural Network (FCNN), a Conv 1*1 convolution layer adapts feature maps of different sizes and unifies them into the dimensions required by the algorithm. Image classification, recognition and detection are then completed on the output feature map. The structure of the EfficientNet backbone feature network is shown in Fig.1:
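The compound scaling idea behind EfficientNet can be sketched numerically. The snippet below is a minimal illustration, not code from this paper; it uses the scaling coefficients reported for the EfficientNet-B0 baseline (α = 1.2, β = 1.1, γ = 1.15), and the function name is our own:

```python
# Sketch of EfficientNet's compound scaling rule: depth, width and
# resolution are scaled jointly by a single coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # coefficients from the EfficientNet-B0 search

def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The search constraint alpha * beta^2 * gamma^2 ~= 2 means each +1 in phi
# roughly doubles the FLOPs of the scaled network.
d, w, r = compound_scale(1.0)
```

Because all three dimensions grow together, the receptive field and channel capacity stay balanced as the network is enlarged, which is the property the paragraph above refers to.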

YOLOv3 model structure
The YOLOv3 model contains no pooling layers and no fully connected layers. Tensor size changes during forward propagation are realized by changing the stride of the convolution kernels: the backbone feature extraction network reduces the output feature map to 1/32 of the input size. YOLOv3 is a fully convolutional network that uses a large number of residual skip connections instead of pooling layers, avoiding the negative gradient effects caused by pooling, and it uses stride-2 convolutions for downsampling. To improve accuracy on small objects, YOLOv3 adopts an upsampling and fusion method similar to Feature Pyramid Networks (FPN): the three prediction branches are fully convolutional, with 9 anchor boxes in total across 3 outputs; each output uses 3 anchors, so 3 boxes are predicted at each output position. The model structure of YOLOv3 is shown in Fig.2:
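The relationship between the input size, the 1/32 reduction and the three anchor-based outputs can be checked with a small calculation. This is an illustrative sketch, not code from the paper; the function name is our own:

```python
def yolov3_prediction_count(input_size, strides=(32, 16, 8), anchors_per_scale=3):
    """Count the boxes predicted across YOLOv3's three detection scales.

    Each scale s produces a grid of (input_size // s) cells per side, and
    every cell predicts `anchors_per_scale` boxes (9 anchors in total).
    """
    grids = [input_size // s for s in strides]
    total = sum(anchors_per_scale * g * g for g in grids)
    return grids, total

# For the standard 416*416 input the grids are 13, 26 and 52 cells per side.
grids, total = yolov3_prediction_count(416)
```

The finest 52*52 grid is the one responsible for small objects, which is why the FPN-style upsampling branch matters for the remote sensing setting discussed here.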

Research objective
Because remote sensing images differ from natural scene images in their imaging methods, they have unique characteristics such as high resolution, small and densely arranged targets, and complex backgrounds. It is therefore necessary to further improve transfer methods to address the difficulties of remote sensing image detection, and overcoming these key difficulties has become an important research direction for many scholars in this field. How to apply deep learning to remote sensing technology and improve detection performance has significant research value and practical significance.

Optimization
Firstly, a dynamic convolutional neural network is introduced to replace the traditional convolutional neural network, which makes the model more robust; secondly, the optimization parameters are continuously adjusted during training to further strengthen the model; finally, the learning rate and batch size are modified during training to prevent overfitting.

Dynamic convolutional neural network
As shown in Fig.3, the goal of the dynamic convolutional neural network is to provide a better balance between network performance and computational burden within the range of efficient neural networks. A dynamic convolutional neural network does not increase the depth or width of the network; instead, it improves model performance by aggregating multiple convolution kernels with attention. The parallel kernels share the same kernel size and input and output dimensions, and they are aggregated using attention weights, after which batch normalization and an activation function complete the dynamic convolution layer. Because the convolution kernels are small, the increase in parameters per layer remains within a controllable range, so a traditional convolutional layer can easily be replaced by a dynamic one. In a traditional convolutional neural network, the parameters are fixed for any input, whereas in dynamic convolution the parameters of each layer change with the input during inference. For a feature map generated during convolution, the feature map is first processed to generate K attention weights that sum to 1; the K convolution kernels are then linearly combined with these weights. In this way, the convolution changes with the input at inference time.
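The kernel aggregation described above can be sketched as follows. This is a simplified NumPy illustration of the idea from Chen et al. [21], not the paper's implementation: the attention branch (global average pooling, a linear projection with a hypothetical weight matrix, and a softmax) produces K weights summing to 1, which linearly combine K parallel kernels into one input-dependent kernel.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_kernel(x, kernels, attn_proj):
    """Aggregate K parallel kernels with input-dependent attention.

    x         : input feature map, shape (C_in, H, W)
    kernels   : K parallel kernels, shape (K, C_out, C_in, k, k)
    attn_proj : hypothetical projection from pooled features to K logits, (K, C_in)
    """
    pooled = x.mean(axis=(1, 2))          # global average pooling -> (C_in,)
    pi = softmax(attn_proj @ pooled)      # K attention weights, summing to 1
    agg = np.tensordot(pi, kernels, axes=1)  # weighted sum over the K axis
    return pi, agg                        # agg has shape (C_out, C_in, k, k)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))          # toy input feature map
kernels = rng.standard_normal((4, 8, 8, 3, 3))  # K = 4 parallel 3x3 kernels
attn_proj = rng.standard_normal((4, 8))
pi, agg = dynamic_kernel(x, kernels, attn_proj)
```

The aggregated kernel is then used in an ordinary convolution, so, as the paragraph notes, a standard convolutional layer can be swapped for a dynamic one without changing the surrounding architecture.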

Improved YOLOv3 model structure
As shown in Fig.4, firstly, the input image size of the backbone feature extraction network EfficientNet is changed from 416*416*3 to 800*800*3; the residual network applies one convolution with a 3*3 kernel and stride 2, compressing the width and height of the input to obtain a feature layer; secondly, a 1*1 convolution and a 3*3 convolution are applied to the feature layer, and the result is added back to the feature layer to form a residual structure; finally, the network is greatly deepened by stacking 1*1 convolutions, 3*3 convolutions and residual connections. The two-dimensional convolution Conv2D is replaced by the dynamic convolution Dynamic_Conv2D.
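Assuming standard stride-2 3*3 convolutions with padding 1 (an assumption, since the exact layer settings are not listed here), the effect of enlarging the input from 416*416 to 800*800 on the deepest feature map can be traced with a short helper:

```python
def downsample_sizes(input_size, num_stride2_convs=5, kernel=3, pad=1, stride=2):
    """Trace the spatial size through successive stride-2 convolutions.

    Uses the usual conv output formula: out = (in + 2*pad - kernel) // stride + 1.
    Five such stages give the 1/32 reduction of the backbone.
    """
    sizes = [input_size]
    for _ in range(num_stride2_convs):
        sizes.append((sizes[-1] + 2 * pad - kernel) // stride + 1)
    return sizes

# 416 -> 208 -> 104 -> 52 -> 26 -> 13, while 800 -> ... -> 25
small = downsample_sizes(416)
large = downsample_sizes(800)
```

Five stride-2 stages take 416 down to a 13*13 deepest grid but 800 down to 25*25, so the enlarged input gives every detection branch a finer grid, which benefits the small, densely arranged objects in remote sensing images.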

Experimental results and discussion
To verify the effectiveness of the improved EfficientNet-YOLOv3 algorithm, the RSOD remote sensing image data set labeled by Wuhan University and the TGRS-HRRSD remote sensing image data set labeled by the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, are used. Five comparison algorithms are evaluated against the proposed algorithm: Faster-RCNN, RetinaNet, the single-stage SSD, YOLOv4 and the original EfficientNet-YOLOv3. The mAP and mLAMR serve as evaluation indexes: the larger the mAP, the better the performance; the smaller the mLAMR, the better the performance.
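Both evaluation indexes are per-class metrics averaged over all classes. The sketch below shows only that averaging step, with made-up AP values for the four RSOD classes (the numbers are illustrative, not results from this paper):

```python
def mean_metric(per_class):
    """Average a per-class metric (AP or LAMR) into mAP or mLAMR."""
    return sum(per_class.values()) / len(per_class)

# Hypothetical per-class AP values for the four RSOD categories.
ap = {"aircraft": 0.95, "oiltank": 0.90, "overpass": 0.85, "playground": 0.98}
m_ap = mean_metric(ap)
```

Because every class contributes equally, a gain on a hard, small-object class (e.g. aircraft) moves the mAP as much as a gain on an easy one, which is why the tables below report both per-class values and the mean in the last row.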

Experimental environment configuration
The experiments are run on the following platform: the local computer has an i5-8250 CPU, 8 GB RAM and a 64-bit Windows 10 operating system; the server uses a BSCC-N22 configuration, where each machine in the GPU queue has 8 NVIDIA Tesla V100-SXM2 GPUs with 32 GB video memory each, and each GPU card is allocated 8 CPU cores and 36 GB memory by default [36].

RSOD remote sensing image data set
As shown in Fig. 5, the RSOD remote sensing image data set is an open data set for object detection in remote sensing images. Sample images from the data set are shown in Fig. 5 [36].

TGRS-HRRSD remote sensing image data set
As shown in Fig. 6, the TGRS-HRRSD remote sensing image data set was produced by the Optical Image Analysis and Learning Center of the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, for studying object detection in high-resolution remote sensing images. The data set is provided in PASCAL_VOC format. Its samples include: Fig. 6 (a) airplane, Fig. 6 (b) bridge, Fig. 6 (c) crossroad, Fig. 6 (d) ship, Fig. 6 (e) vehicle, Fig. 6 (f) harbor, Fig. 6 (g) ground track field, Fig. 6 (h) storage tank, Fig. 6 (i) basketball court, Fig. 6 (j) parking lot, Fig. 6 (k) tennis court, Fig. 6 (l) baseball diamond and Fig. 6 (m) T junction: 13 categories, 21,000 images and 40,000 target objects in total [36].

Discussion
Table 1 lists the per-class AP values obtained by all algorithms on the RSOD data set; the last row gives the mAP over all classes for each method. The mAP obtained by the proposed algorithm is the highest among the six algorithms: it is 1.93% higher than that of the original EfficientNet-YOLOv3, 21.18% higher than Faster-RCNN, 3.50% higher than RetinaNet, 5.32% higher than SSD and 4.21% higher than YOLOv4, which shows that the proposed algorithm achieves good results for object detection in remote sensing images. Table 2 lists the per-class LAMR values; the last row gives the mLAMR, obtained by averaging the LAMR values of all classes. The mLAMR of the proposed algorithm is the lowest among the six algorithms: it is 0.0500 lower than that of the original EfficientNet-YOLOv3, 0.2550 lower than Faster-RCNN, 0.0225 lower than RetinaNet, 0.1150 lower than SSD and 0.0300 lower than YOLOv4, which again shows that the proposed algorithm performs well.
Table 3 lists the per-class AP values obtained by all algorithms on the TGRS-HRRSD data set; the last row gives the mAP, obtained by averaging the AP values of all classes. The mAP obtained by the proposed algorithm is again the highest: it is 0.07% higher than that of the original EfficientNet-YOLOv3, 31.95% higher than Faster-RCNN, 2.36% higher than RetinaNet, 8.54% higher than SSD and 10.05% higher than YOLOv4, which further shows that the proposed algorithm performs well for object detection in remote sensing images. Table 4 lists the per-class LAMR values; the last row gives the mLAMR, obtained by averaging the LAMR values of all classes. The mLAMR of the proposed algorithm is the lowest: it is 0.0007 lower than that of the original EfficientNet-YOLOv3, 0.4108 lower than Faster-RCNN, 0.0300 lower than RetinaNet, 0.1608 lower than SSD and 0.1300 lower than YOLOv4, which again shows that the proposed algorithm achieves good results.

Conclusion and Prospect
This paper has proposed a small object detection method based on a dynamic convolutional neural network and compared it with the EfficientNet-YOLOv3 algorithm. Without increasing the depth or width of the network, it improves the mAP and reduces the mLAMR of object detection in remote sensing images; it also improves detection accuracy, reduces the missed detection rate and enhances the stability of the algorithm. Although the proposed algorithm achieves good results on the RSOD and TGRS-HRRSD remote sensing image data sets, it has two shortcomings. Firstly, it is built on the EfficientNet-YOLOv3 detector and replaces traditional convolutional layers with dynamic convolutional layers, which increases the complexity of the network model; secondly, it is designed specifically for object detection in remote sensing images and may not perform as well on other data sets, that is, its generality is limited. In further research, a lightweight backbone feature extraction network will be considered to reduce the complexity of the model, minimize its parameters, and improve the detection performance of the algorithm on remote sensing images.