3.1 Shallow Feature Network:
To address the problem that small objects such as vehicles, ships, and airplanes lose most of their semantic information in deep convolutional networks [31], we design a shallow feature network (SFN) that supplements the backbone with shallow semantic features.
As shown in Fig. 1, the shallow feature network (SFN) is composed of four shallow feature extraction blocks (SFBlocks), which adjust the size of the feature maps and produce shallow features. The SFN fuses low-level semantic information into the backbone to improve network performance. This can be expressed as:
$$P_{i}=\left(S_{i}\oplus C_{i}\right)\oplus f_{i}\left(C_{i+1}\right) \quad (1)$$
Here, \(i\) indexes the levels of the backbone; \(S_{i}\) and \(C_{i}\) denote the output features of the SFBlock and of CSPDarknet at level \(i\), respectively; \(\oplus\) is element-wise addition; and \(f_{i}\) is the feature extraction operation of the corresponding layer in CSPDarknet.
The shallow feature extraction block (SFBlock) is shown in Fig. 2; it is composed of a down-sampling stage and a residual block. First, the input is down-sampled by a convolution operation, and it is then passed through a residual block consisting of convolution, batch-norm, and ReLU layers.
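The following is a minimal PyTorch sketch of one SFBlock as described above, together with the element-wise fusion of Eq. (1). The channel and kernel sizes are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class SFBlock(nn.Module):
    """Shallow feature extraction block: strided-conv down-sampling
    followed by a residual block of conv, batch-norm, and ReLU layers."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Down-sampling by a stride-2 convolution.
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Residual block: conv -> BN -> ReLU -> conv -> BN, plus skip.
        self.residual = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.down(x)                         # adjust feature-map size
        return self.relu(x + self.residual(x))   # shallow feature S_i

# Element-wise fusion as in Eq. (1): all terms must share the same shape.
s_i = torch.randn(1, 128, 52, 52)   # SFBlock output S_i
c_i = torch.randn(1, 128, 52, 52)   # CSPDarknet output C_i
f_c = torch.randn(1, 128, 52, 52)   # stand-in for f_i(C_{i+1})
p_i = (s_i + c_i) + f_c             # fused feature P_i
```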
3.2 Multi-scale context feature pyramid
Large scale variation in the detected objects may cause inconsistency between the deep-level feature maps and the real objects [32]. As the network deepens, each feature map comes to focus on only a small part of the image, while the complex backgrounds of remote sensing images pose a further challenge to object detection.
To address this, we adopt a multi-scale context feature pyramid (MSC-FPN) (shown in Fig. 3), which comprises a bidirectional top-down and bottom-up branch similar to PANet, channel concatenation that integrates contextual information [33], and a spatial attention module applied to feature maps of different scales.
3.2.1 Context Fusion Module:
For the fusion between the low-resolution and high-resolution features of the feature pyramid, we design a context fusion module (CFM) with three parallel branches (shown in Fig. 4); fusing a feature with local and global context enriches its expressive ability [34].
Specifically, the high-level semantic feature is passed through three dilated convolutions with different dilation rates to enlarge the receptive field [35], with a residual connection added to each parallel dilated branch to supplement semantic information; the result is then concatenated with the low-level feature along the channel dimension. This can be expressed as:
$$F_{i}=\phi \left\{\mathrm{Concat}\left(D_{1}\left(F_{i+1}\right)\oplus D_{2}\left(F_{i+1}\right)\oplus D_{3}\left(F_{i+1}\right)\oplus F_{i+1},\; F_{i}\right)\right\} \quad (2)$$
\({F}_{i+1}, i\in \left(1,2\right)\) represents a feature from a deeper layer of the pyramid network, \({F}_{i}\) is the feature of the current layer, and \({D}_{j}, j\in \left(1,2,3\right)\) represents the dilated convolution with a different dilation rate in each parallel branch. The deep feature is convolved with the different dilation rates, added back to itself through the residual connections, and concatenated with the low-level feature. Finally, \(\phi \left\{\cdot \right\}\) denotes a 1 × 1 convolution that merges the channels, and \(\oplus\) denotes element-wise addition.
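To make Eq. (2) concrete, the following is a minimal PyTorch sketch of the CFM. The dilation rates (1, 3, 5), channel sizes, and nearest-neighbor upsampling used to match the two resolutions are illustrative assumptions, as the paper's exact settings are not restated here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFM(nn.Module):
    """Context fusion module: three parallel dilated convolutions with a
    residual connection to the deep feature, channel concatenation with
    the low-level feature, and a 1x1 merging convolution (phi)."""
    def __init__(self, deep_ch: int, low_ch: int, dilations=(1, 3, 5)):
        super().__init__()
        # Parallel dilated convolutions D_1, D_2, D_3 (padding=d keeps size).
        self.branches = nn.ModuleList(
            nn.Conv2d(deep_ch, deep_ch, 3, padding=d, dilation=d, bias=False)
            for d in dilations
        )
        # 1x1 convolution phi merging the concatenated channels.
        self.phi = nn.Conv2d(deep_ch + low_ch, low_ch, kernel_size=1)

    def forward(self, f_deep: torch.Tensor, f_low: torch.Tensor) -> torch.Tensor:
        # Sum the dilated branches and add the residual deep feature.
        ctx = sum(branch(f_deep) for branch in self.branches) + f_deep
        # Match the low-level spatial resolution before concatenation.
        ctx = F.interpolate(ctx, size=f_low.shape[-2:], mode="nearest")
        return self.phi(torch.cat([ctx, f_low], dim=1))  # Eq. (2)

# Example: fuse a 13x13 deep feature into a 26x26 low-level feature.
cfm = CFM(deep_ch=512, low_ch=256)
out = cfm(torch.randn(1, 512, 13, 13), torch.randn(1, 256, 26, 26))
print(out.shape)  # torch.Size([1, 256, 26, 26])
```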
3.2.2 Attention mechanism:
The attention mechanism guides the network to focus on the more salient information in remote sensing images [36], enhancing object features while suppressing the background. In particular, the spatial attention module [37] (shown in Fig. 5) helps to enhance object features with sparse texture against cluttered backgrounds, and the corresponding semantic information helps the network handle objects of different scales.
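The following is a minimal sketch of a spatial attention module in the style of [37]: channel-wise average and max pooling, a 7 × 7 convolution, and a sigmoid produce a spatial weight map that rescales the input feature. The kernel size is a common choice, assumed here rather than quoted from the paper.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: weight each spatial location by a map derived
    from channel-wise average and max statistics."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = torch.mean(x, dim=1, keepdim=True)    # channel-wise average
        max_map, _ = torch.max(x, dim=1, keepdim=True)  # channel-wise max
        attn = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn  # enhance salient regions, suppress background
```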
3.3 SF-YOLOv4 algorithm steps
The steps of the SF-YOLOv4 algorithm are shown in Fig. 6. The first stage is data preprocessing: images are preprocessed by operations such as cropping and flipping. In the training stage, the model is loaded, the data are used for training, and the weights are saved in turn. The testing stage is similar: the network is built from the trained weights, images are read in, and predictions are made. Finally, the prediction boxes are drawn on the images for display, and related metrics such as mAP are computed.
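The workflow of Fig. 6 can be summarized by the following schematic PyTorch sketch. The model, loss, and data are stand-ins (a single convolution on random tensors), not the paper's implementation; the file name and hyperparameters are likewise hypothetical.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3, padding=1)          # placeholder for SF-YOLOv4
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 1. Preprocessing: e.g. horizontal flip of an image batch.
images = torch.rand(4, 3, 416, 416)
images = torch.flip(images, dims=[-1])

# 2. Training: forward, backward, then save the learned weights.
for epoch in range(2):
    pred = model(images)
    loss = pred.abs().mean()                    # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
torch.save(model.state_dict(), "sf_yolov4.pt")

# 3. Testing: rebuild the network from the trained weights and predict;
# drawing the boxes and computing mAP would follow here.
model.load_state_dict(torch.load("sf_yolov4.pt"))
model.eval()
with torch.no_grad():
    predictions = model(images)
```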