FFR-SSD: feature fusion and reconstruction single shot detector for multi-scale object detection

Object detection is one of the most fundamental tasks in image content understanding. Although numerous algorithms have been proposed, effective and efficient object detection remains challenging. To address the difficulties posed by small and multi-scale objects, we propose a hierarchical feature fusion and reconstruction method called the feature fusion and reconstruction single shot detector (FFR-SSD). We first present a multi-scale visual attention model, which incorporates channel-wise and space-wise information into a multi-branch feature enhancement module to improve feature representation capacity. Second, a hierarchical feature map weighting mechanism is developed to fuse multi-layer feature maps, which helps describe intact objects for the subsequent module. Third, we present an effective feature map reconstruction module that encourages the model to focus on pivotal information: the overall contour information is preserved in the shallow enhanced response map, and the semantic information is enriched in the deep enhanced feature map. Extensive experiments on two public benchmark datasets show that the proposed method achieves significant improvements over the state of the art.


Introduction
Object detection is one of the most fundamental tasks in computer vision, encompassing both object classification and localization in images. With the rapid development of deep learning, object detection is now widely used in intelligent video surveillance, robot vision, automated driving, etc. [1].
Recently, deep-learning-based object detection algorithms have mainly followed two paradigms: region-based detectors and single-stage detectors. While the former, such as Fast R-CNN [2], Faster R-CNN [3], R-FCN [4] and ION [5], exhibit higher accuracy, they cannot meet the requirements of real-time object detection in the real world. In contrast, the latter, such as YOLO [6], SSD [7], FCOS [8], CenterNet [9] and YOLOv4 [10], merge the region proposal and feature classification stages, showing faster inference speed.
The SSD model [7], with VGG-16 as the backbone, is a single-stage baseline detector that balances speed and precision. However, the model fails to fully represent the multi-scale features of objects. Moreover, it relies on a single convolution-pooling down-sampling paradigm to obtain feature maps, making it difficult to describe the overall object information during inference.
Although possible solutions have been suggested in the literature, such as DSSD [11], FSSD [12] and EDF-SSD [13], these methods not only make the network more complicated, which seriously affects detection speed, but also lose semantic information that is extremely significant for multi-scale object detection. It is therefore essential to adopt an appropriate feature enhancement and fusion method to enrich the semantic information of the object.
To address the above-mentioned issues, we present a hierarchical feature fusion and reconstruction single shot detection method for multi-scale object detection, which constructs local fine-grained information for the shallow feature maps and contextual semantic features of the object for the deep feature maps. The main contributions of this work are summarized as follows:
• We propose a hierarchical feature fusion and reconstruction single shot detection method for multi-scale object detection. The multi-branch feature enhancement module constructs a multi-scale visual attention feature map, which combines channel-wise and space-wise feature information to strengthen feature representation. Furthermore, to overcome the limited semantic information of shallow layers, we develop an adaptive multi-branch feature map weighting method to fuse hierarchical features, which benefits the detection of multi-scale objects.
• To remove redundant feature information from the fused feature map, we present an effective feature map reconstruction module that refocuses on the pivotal information of the resulting feature map. This module considers the correlation between the semantic information and the fine-grained features of objects in the response map, which contributes to depicting intact objects for the subsequent module.
• We carry out extensive experiments against state-of-the-art detection methods on the PASCAL VOC and COCO benchmark datasets. Compared with other detection models, our method achieves a significant improvement over the baseline with only a slight speed drop under the same experimental setup.
The remainder of this article is organized as follows. Section 2 reviews related work on object detection. We detail the proposed hierarchical feature fusion and reconstruction method in Sect. 3. The extensive experiments and analyses are presented in Sect. 4. Section 5 concludes the article.

Related work
Current single-stage object detection pipelines such as YOLO [6], SSD [7], CenterNet [9] and YOLOv4 [10] have shown great potential for robust and fast object detection. Among these approaches, SSD, proposed by Liu et al. [7], is a classic baseline that achieves a balance between accuracy and speed. However, it performs poorly on small and clustered objects due to the absence of semantic information for multi-scale objects.
To solve these problems, existing models have adopted feature pyramid, feature map fusion and receptive field augmentation methods to enhance the semantic information of shallow layers. Earlier models [14] adopted image pyramid approaches, which strengthen information by cropping images at different resolutions. Later, feature pyramid methods such as FPN [15], STDN [16] and PFPNet [17] made predictions from the shallow, medium and deep feature maps produced by a CNN. However, the drawback of these feature pyramid methods is evident: constructing the feature pyramid is time-consuming.
Feature map fusion methods exploit the intrinsic relationship between feature maps of adjacent layers to enhance semantic information. Lu et al. [18] designed a multi-layer feature fusion module (FFM) consisting of 1 × 1 convolution and bilinear interpolation upsampling to enrich the semantic information of shallow features. Recent methods such as DF-SSD [19] and Mask-SSD [20] utilized atrous convolution or deconvolution to improve the fusion of shallow and deep features and take full advantage of the relationship between different feature layers. However, because features differ in importance, such simple fusion operators play only a trivial role in object location correction and cannot exploit the collective knowledge of the previous layers.
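For reference, this general fusion style can be sketched in a few lines of PyTorch (the framework used in this work). The class name and channel handling below are our own illustration, not the exact FFM of [18]:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusion(nn.Module):
    """Deep-to-shallow fusion in the style described above: a 1x1
    convolution aligns channels, bilinear upsampling aligns resolution,
    and element-wise summation merges the two maps."""
    def __init__(self, deep_channels, shallow_channels):
        super().__init__()
        self.align = nn.Conv2d(deep_channels, shallow_channels, kernel_size=1)

    def forward(self, shallow, deep):
        deep = self.align(deep)
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        return shallow + deep  # the simple summation criticized above
```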
With the capability of real-time detection under complex circumstances, single-stage detectors have been widely used in various practical scenarios. Yavariabdi et al. [22] employed a detector modified from YOLOv3 to detect unmanned aerial vehicles (UAVs) on edge platforms. Sun et al. [23] proposed a mobile-end-based detection model, MEAN-SSD, for real-time detection of apple leaf diseases. Chen et al. [24] developed an improved SSD algorithm with a MobileNetv2 backbone to achieve fast detection of vehicles in traffic scenes. Nevertheless, improving accuracy while maintaining efficient computation remains a concern, especially for small targets and multi-scale problems.

Motivation
The original SSD model [7] performs poorly because its shallow layers lack semantic information, and it fails to fully represent multi-scale object features. Later, improved versions of SSD were proposed to enhance the feature representation of the object, such as DSSD [11], FSSD [12], M2Det [25], etc. However, these existing feature enhancement and fusion models focus on the intrinsic information of the current layer and do not consider the influence of prior knowledge from previous layers on subsequent feature maps. Additionally, redundant object semantic information remains in the resulting feature maps, which leads to inaccurate object position prediction in the subsequent regression modules.

Multi-scale attention module
In this paper, we propose a multi-scale attention module that highlights the intrinsic connection between feature maps through an adaptive weighting mechanism. The multi-scale attention architecture is shown in Fig. 2.
For channel-wise attention, given a feature map $x \in \mathbb{R}^{H \times W \times C}$, two different convolution transformations are conducted to obtain the split feature maps $x_{s1} \in \mathbb{R}^{H \times W \times C}$ and $x_{s2} \in \mathbb{R}^{H \times W \times C}$. Then, element-wise summation is employed to obtain the fused feature map $x_c \in \mathbb{R}^{H \times W \times C}$. Global information is embedded by global average pooling, which generates channel-wise statistics:

$$x_s = F_{gp}(x_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j) \tag{1}$$

where $F_{gp}$ denotes the global average pooling operator; $H$ and $W$ are the height and width of the feature map, respectively; $x_c$ denotes the fused feature map; and $x_s$ is the enhanced feature map with global information. Further, the response maps $x_{t1}$, $x_{t2}$ and $x_{t3}$ are obtained based on different receptive fields, and the enhanced resulting feature map $x_{r1}$ is obtained through element-wise summation.
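A minimal PyTorch sketch of how we read this channel-wise branch follows. The kernel and dilation settings of the two split convolutions and the sigmoid gating used to build the response maps $x_{t1}$–$x_{t3}$ are assumptions, since the text does not spell them out:

```python
import torch
import torch.nn as nn

class ChannelAttentionBranch(nn.Module):
    """Sketch of the channel-wise branch: two convolutions split the
    input, the splits are summed into x_c, and global average pooling
    (Eq. 1) embeds global context into a per-channel statistic x_s.
    Three 1x1 excitations stand in for the response maps x_t1..x_t3."""
    def __init__(self, channels):
        super().__init__()
        self.split1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.split2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.gap = nn.AdaptiveAvgPool2d(1)  # F_gp in Eq. (1)
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 1) for _ in range(3))

    def forward(self, x):
        x_c = self.split1(x) + self.split2(x)              # fused map x_c
        x_s = self.gap(x_c)                                # Eq. (1)
        x_t = [torch.sigmoid(b(x_s)) for b in self.branches]
        return sum(t * x_c for t in x_t)                   # x_r1
```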
For space-wise attention, convolutions with 3 × 3 and 1 × 1 kernels are exploited to represent the feature map. The space-wise attention feature map is defined as Eq. (2):

$$x_{s4} = \delta(B(C_{3 \times 3}(x))), \qquad x_{r2} = \delta(B(C_{1 \times 1}(x_{s4}))) \tag{2}$$

where $x_{s4}$ and $x_{r2}$ denote the intermediate feature map and the resulting response map, respectively; $C_{1 \times 1}$ and $C_{3 \times 3}$ represent convolutions with 1 × 1 and 3 × 3 kernels, respectively; $\delta$ is the ReLU function [26]; and $B$ denotes Batch Normalization [27]. For simplicity and effectiveness, the feature maps with different receptive fields are combined by element-wise summation to obtain the resulting feature map, as in Eq. (3):

$$x_{out} = \alpha_1 \left(x_{s1} + x_{s2} + x_{s3}\right) + \alpha_2 x_{r2}, \qquad \alpha_1 + \alpha_2 = 1 \tag{3}$$

where $\alpha_1$ and $\alpha_2$ denote the predicted weights of the channel-wise and space-wise attention feature maps, respectively; $x_{s1}$, $x_{s2}$ and $x_{s3}$ are the feature maps obtained with different receptive fields; and $x_{r2}$ is the space-wise feature map.
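The space-wise branch of Eq. (2) and the weighted fusion of Eq. (3) can be sketched as follows; the softmax head that produces $\alpha_1$ and $\alpha_2$ is our assumption, chosen simply to enforce $\alpha_1 + \alpha_2 = 1$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceWiseBranch(nn.Module):
    """Space-wise branch per Eq. (2): a 3x3 and then a 1x1 convolution,
    each followed by batch normalization (B) and ReLU (delta)."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        x_s4 = self.conv3(x)        # intermediate map x_s4
        return self.conv1(x_s4)     # resulting response map x_r2

class AttentionFusion(nn.Module):
    """Weighted fusion of Eq. (3). A softmax over two learned logits
    keeps alpha_1 + alpha_2 = 1; how the paper actually predicts these
    weights is not specified, so this head is an assumption."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))

    def forward(self, x_r1, x_r2):
        a = F.softmax(self.logits, dim=0)
        return a[0] * x_r1 + a[1] * x_r2
```

In a full model, the two inputs would be $x_{r1}$ from the channel-wise branch sketched earlier and $x_{r2}$ from the space-wise branch.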

Multi-branch feature enhancement module
To better enhance the semantic features and local fine-grained information of the object, we further propose an efficient multi-branch feature enhancement module to fuse the feature information flow, which consists of the original feature map and the enhanced response map generated by the multi-scale attention model. In addition to the original feature map, a second branch generated by the multi-scale attention model extracts representative object features along the channel and spatial dimensions, and atrous convolution is adopted to capture feature maps with large receptive fields.
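A hedged sketch of this module is shown below; the dilation rate and the summation-based fusion of the three branches are our assumptions:

```python
import torch.nn as nn

class MultiBranchEnhancement(nn.Module):
    """Sketch of the multi-branch enhancement: an identity branch keeps
    the original feature map, an attention branch carries the output of
    the multi-scale attention model, and an atrous (dilated) convolution
    supplies a large-receptive-field view."""
    def __init__(self, channels, attention_module, dilation=3):
        super().__init__()
        self.attention = attention_module   # e.g. the sketches above
        self.atrous = nn.Conv2d(channels, channels, 3,
                                padding=dilation, dilation=dilation)

    def forward(self, x):
        return x + self.attention(x) + self.atrous(x)
```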

Adaptive hierarchical feature weighting mechanism
The original feature map and the enhanced feature map exhibit a gap between semantic information and fine-grained features, and simply summing pixel values fails to fuse the hierarchical features effectively. Therefore, we present an adaptive hierarchical feature weighting mechanism to incorporate multi-layer features into the resulting feature map. First, the feature map $x_{11} \in \mathbb{R}^{H \times W \times C}$ is obtained from the input image through the backbone network [28], and the enhanced feature map produced by the multi-scale attention model is denoted $x_{12} \in \mathbb{R}^{H \times W \times C}$. They are then fused into the enhanced output feature map using an adaptive multi-branch feature map weighting scheme, defined as Eq. (4):

$$x_1 = \gamma \, \beta_1 x_{11} + (1 - \gamma) \, \beta_2 x_{12} \tag{4}$$

where $\gamma$ is a predefined weighting threshold, set to 0.5 empirically; $\beta_1$ and $\beta_2$ denote weight parameters predicted through convolution operators; and $x_1$ denotes the enhanced output response map of layer-1.
Next, $x_1$ is fed into the feature map reconstruction module for object position prediction. Furthermore, to enhance the feature information flow and form collective knowledge, down-sampled maps $x_{21} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times C}$, $x_{31} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$ and $x_{41} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times C}$ are obtained from the enhanced feature map $x_1$ by convolution operators with stride 2, 4 and 8, respectively. Similarly, the enhanced output response map of the $i$th layer ($i = 2, 3, 4, 5$) is generated by Eq. (5):

$$x_i = \gamma_j \sum_{j=1}^{3} \beta_j x_{ij} + \gamma_a \sum_{b \in \{i-1,\, i-2\}} \beta_b x_b \tag{5}$$

where $\gamma_j$ and $\gamma_a$ are the predefined weighting thresholds of the current layer and the previous layers, respectively; $x_{ij}$ is the feature map obtained with the $j$th kernel ($k_j \in \{1 \times 1, 3 \times 3, 5 \times 5\}$, $j = 1, 2, 3$) at layer-$i$; $x_{i-1}$ and $x_{i-2}$ denote the collective knowledge from the enhanced output feature maps of the previous layers; $\beta_j$ and $\beta_b$ are the predicted weight parameters; and $x_i$ denotes the enhanced output feature map of the $i$th layer. The output response map $x_i$ is fed into the feature map reconstruction module for the final object position prediction.
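The following sketch illustrates one plausible realization of Eqs. (4)-(5); the weight-prediction head (1 × 1 convolution, global pooling, sigmoid) is an assumption based on the phrase 'predicted through convolution operators':

```python
import torch.nn as nn

class HierarchicalWeighting(nn.Module):
    """Sketch of Eq. (5): multi-kernel branch maps of the current layer
    (1x1, 3x3, 5x5, per the text) and projections of earlier enhanced
    maps are each scaled by a predicted weight (beta) and by a fixed
    threshold (gamma, 0.5 by default, per Eq. (4))."""
    def __init__(self, channels, gamma_j=0.5, gamma_a=0.5):
        super().__init__()
        self.gamma_j, self.gamma_a = gamma_j, gamma_a
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in (1, 3, 5))
        # one plausible reading of the predicted beta weights
        self.weight_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1),
            nn.AdaptiveAvgPool2d(1),
            nn.Sigmoid())

    def forward(self, x, previous):
        # previous: earlier enhanced maps x_{i-1}, x_{i-2}, already
        # brought to x's resolution by strided convolution
        out = self.gamma_j * sum(self.weight_head(b(x)) * b(x)
                                 for b in self.branches)
        out = out + self.gamma_a * sum(self.weight_head(p) * p
                                       for p in previous)
        return out
```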

Feature map reconstruction module
The feature weighting mechanism enriches both the global context and the local fine-grained feature information. However, the feature map obtained by the down-sampling scheme contains feature information that overlaps with the underlying feature map. To solve this problem, we further develop a feature map reconstruction strategy to refocus on the prominent information of the resulting feature map.
The enhanced output response map from the hierarchical feature weighting mechanism passes through the feature map reconstruction scheme to obtain the corrected feature map, which successfully focuses on the intrinsic feature relations of the given feature map.

Loss function
To solve the imbalance problem between positive and negative anchors, focal loss [29] is utilized for the classification task, written as Eq. (6):

$$FL(p_t) = -\left(1 - p_t\right)^{\gamma} \log(p_t) \tag{6}$$

where $p_t$ is the probability predicted by the model, which is utilized to adjust the learning weights of the hard samples, and $\gamma$ is the focusing parameter. In this paper, we set $\gamma$ to 2 by default.
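For concreteness, a direct PyTorch implementation of Eq. (6) follows; the α-balancing term of the original focal loss [29] is omitted because the text does not mention it:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss of Eq. (6) for binary anchor classification:
    FL(p_t) = -(1 - p_t)^gamma * log(p_t), with gamma=2 per the paper."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # prob. of the true class
    return ((1 - p_t) ** gamma * ce).mean()
```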
To describe the object location regression loss accurately, we adopt the DIoU loss to measure the distance between the adjusted anchor and the ground truth, as in Eq. (7):

$$L_{DIoU} = 1 - IoU + \frac{Distance\_2^{\,2}}{Distance\_c^{\,2}} \tag{7}$$
where $IoU$ is the intersection over union between the predicted box and the ground truth; $Distance\_2$ is the Euclidean distance between the centroids of the predicted box and the ground truth; and $Distance\_c$ denotes the diagonal length of the minimum enclosing rectangle of the predicted box and the ground truth.
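A self-contained PyTorch implementation of Eq. (7), assuming boxes in corner format, is:

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    """DIoU loss of Eq. (7): 1 - IoU + d^2 / c^2, where d is the center
    distance (Distance_2) and c the diagonal of the minimum enclosing
    box (Distance_c). Boxes are (x1, y1, x2, y2) tensors, shape (N, 4)."""
    # intersection and union
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared distance between box centers (Distance_2^2)
    c_pred = (pred[:, :2] + pred[:, 2:]) / 2
    c_targ = (target[:, :2] + target[:, 2:]) / 2
    d2 = ((c_pred - c_targ) ** 2).sum(dim=1)
    # squared diagonal of the minimum enclosing box (Distance_c^2)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps
    return (1 - iou + d2 / c2).mean()
```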

Experimental results
We carry out experiments with the VGG-16 [39] backbone on the PASCAL VOC2007/VOC2012 [40] and MS COCO [41] datasets. Mean average precision (mAP) and frames per second (FPS) are used to evaluate detection performance.

Experimental setup
We implement our method in the PyTorch framework. During training, we follow the SSD [7] strategies. Additionally, we slightly adjust the learning rate schedule to better accommodate our method.

PASCAL VOC2007
We train our detector on the PASCAL VOC2007 and VOC2012 trainval sets (16,551 images in total) and test on the VOC2007 test set (4,952 images). We set the learning rate to $10^{-3}$ for the first 80k iterations, then decrease it to $10^{-4}$ for the next 20k iterations and to $10^{-5}$ for the final 20k iterations. Additionally, we adopt a "warm-up" strategy that gradually ramps up the learning rate in the early iterations, which helps stabilize training; a sketch of the full schedule is given after this paragraph. The momentum and weight decay are set to 0.9 and 0.0005, respectively. Tables 1 and 2 show the experimental results on the VOC2007 test set. Without bells and whistles, our model with 300 × 300 input achieves 79.8% mAP, which exceeds the latest SSD300* by 2.7 points and SSD512* by 0.3 points. It is worth noting that our method, equipped with the VGG-16 [39] backbone, also exceeds DSSD321 with the ResNet-101 [28] backbone by 1.2 points. Increasing the input size to 512 × 512 further improves mAP from 79.8% to 81.9%. Moreover, our model with the DIoU loss further improves accuracy to 81.2% and 82.8% mAP with 300 × 300 and 512 × 512 inputs, respectively.
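The step schedule with warm-up can be written compactly as follows; the warm-up length (500 iterations) is our assumption, since the paper only states that the rate is ramped up gradually:

```python
def learning_rate(iteration, base_lr=1e-3, warmup_iters=500):
    """VOC training schedule: linear warm-up, then 1e-3 for the first
    80k iterations, 1e-4 until 100k, and 1e-5 until 120k."""
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    if iteration < 80_000:
        return base_lr
    if iteration < 100_000:
        return base_lr * 0.1
    return base_lr * 0.01
```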
For some categories occupying a large area in images, such as aero, car and dog, the AP of our method improves on SSD300* by a remarkable 3-7 points, outperforming most other object detectors. Figure 3 shows the results of our model on the VOC2007 test set. Objects of various classes and scales are all detected accurately, validating the robustness of the model to class and background variations.

MS COCO
We further train our model on the MS COCO dataset, using trainval35k (118,287 images) for training and evaluating on minival. The batch size is set to 32 for 300 × 300 input and 16 for 512 × 512 input. We train the model with a learning rate of $10^{-3}$ for the first 280k iterations, then $10^{-4}$ and $10^{-5}$ for another 120k and 40k iterations, for a total of 320k iterations.
(Table notes: SSD300* and SSD512* indicate the latest versions updated by the authors; bold and italic fonts denote the best and second-best performance, respectively; S, M and L denote small, medium and large objects.)
As shown in Table 3, our method with 300 × 300 input achieves 28.2% AP@[0.5:0.95], 48.2% AP@0.5 and 29.7% AP@0.75, improving on the baseline SSD300 by 3.1, 5.1 and 3.9 points, respectively. Our model with 512 × 512 input also outperforms the baseline SSD512 by 4.1, 4.0 and 5.3 points, respectively. Notably, our models with 300 × 300 and 512 × 512 inputs achieve 9.2% and 15.6% AP for small objects ($s < 32^2$), improving on SSD by a large margin (2.6 and 4.7 points, respectively), which shows that the proposed method is more powerful for small object detection. For medium ($32^2 < s < 96^2$) and large objects ($s > 96^2$), our method with 512 × 512 input achieves 37.8% and 51.2% AP, improving on SSD512 by a large margin (6.0 and 6.2 points), which further validates the effectiveness of the proposed method.
Although FFR-SSD300 performs slightly worse than YOLOv4-tiny-3l [32] and YOLOv7-tiny [42] in accuracy, this gap is acceptable given that our model uses a smaller input image and a weaker VGG backbone, and does not employ extensive design tricks.

Conclusion
In this paper, we propose a hierarchical feature map fusion and reconstruction model called the feature fusion and reconstruction SSD (FFR-SSD) for multi-scale object detection, which utilizes multi-scale attention and hierarchical feature weighting mechanisms to enhance the semantic information of shallow layers. To remove the redundant information from the fused feature map, a feature map reconstruction module is designed to refocus on its pivotal information. Experimental results demonstrate the effectiveness of the proposed approach.
Author Contributions: X.C. was involved in conceptualization; X.C. and C.S. were involved in methodology, writing and original draft preparation; X.C., C.S. and Z.W. were involved in validation, software, and writing, review and editing; X.C. and Z.W. performed the formal analysis. Xu Cheng and Zhixiang Wang contributed equally to this work as first authors. All authors have read and agreed to the published version of the manuscript.
Data availability: The data are publicly available.