A magnifier helps an observer quickly locate a camouflaged object in an image, because its magnifying effect makes it easier to spot the center, key points, and fine details of the object. Inspired by this observation effect, we apply it to the COD problem and design the Ergodic Magnify Module and the Attention Focus Module. The Ergodic Magnify Module mimics the process of sweeping a magnifier across an image, while the Attention Focus Module models the observation process in which human attention is highly concentrated on a single region.
3.1 Network Overview
The network structure of MAGNet is shown in Figure 2. Given an input image containing a camouflaged object, MAGNet first extracts multi-scale feature maps with a Res2Net-50 backbone [42] and then feeds the last three feature maps into both the Ergodic Magnify Module and the Attention Focus Module. Finally, the output feature maps of the two modules are fused to simulate the effect of observing the object through a magnifier.
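For concreteness, a minimal PyTorch sketch of this pipeline follows. The backbone, EMM, and AFM are assumed to be callables defined elsewhere (as in the following subsections); all names are illustrative, not the reference implementation.

import torch.nn as nn
import torch.nn.functional as F

class MAGNet(nn.Module):
    """Skeleton: backbone -> {EMM, AFM} -> fused single-channel prediction."""
    def __init__(self, backbone, emm, afm):
        super().__init__()
        self.backbone = backbone  # returns multi-scale feature maps f1..f4
        self.emm = emm            # Ergodic Magnify Module (Section 3.2)
        self.afm = afm            # Attention Focus Module (Section 3.3)

    def forward(self, x):
        f1, f2, f3, f4 = self.backbone(x)   # Res2Net-50 stage outputs
        feats = (f2, f3, f4)                # the last three maps feed both modules
        p_emm = self.emm(feats)             # camouflaged object map from the EMM
        p_afm = self.afm(feats)             # camouflaged object map from the AFM
        # upsample both maps to the input resolution and fuse pixel by pixel
        p_emm = F.interpolate(p_emm, size=x.shape[2:], mode='bilinear', align_corners=False)
        p_afm = F.interpolate(p_afm, size=x.shape[2:], mode='bilinear', align_corners=False)
        return p_emm, p_afm, p_emm + p_afm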
3.2 Ergodic Magnify Module (EMM)
As shown in Figure 2, the Ergodic Magnify Module consists of two parts: the Central Excitation Module (CEM) and the Multi-scale Feature Fusion Module (MFFM).
The Central Excitation Module traverses the feature maps of different scales output by the last three layers of the backbone, expanding the receptive field and exciting the center point and key points.
The Multi-scale Feature Fusion Module fully integrates the multi-scale feature maps produced by the Central Excitation Module, enabling efficient use of both high-level and low-level features.
3.2.1 Central Excitation Module (CEM)
When using a magnifier to observe an object, people examine the central area of the lens more carefully than the edge area, because the human visual receptive-field mechanism draws the observer's attention toward the center of the object [43]. The observer then sweeps the magnifier across the whole picture until its center coincides with the center of the object.
To simulate the magnification and traversal functions of the magnifier, we design a simple and efficient Central Excitation Module, as shown in Figure 3. Its core building block is dilated convolution (DConv) with convolution kernels of different sizes [44].
Specifically, the Central Excitation Module comprises four branches that receive the input feature map in parallel. Each branch first applies a 1×1 convolution to adjust the number of output channels; three of the branches then apply 3×3, 5×5, and 7×7 dilated convolutions, each with a dilation rate of 2. The outputs of these three branches are concatenated and fused across channels by a 3×3 convolutional layer. Finally, a residual connection with the fourth branch yields the centrally excited feature map.
Concatenating the three sets of dilated-convolution outputs raises the importance of central features while enlarging the receptive field; as shown in Figure 4, this realizes the central excitation function. The multi-scale feature maps after central excitation all have 128 channels, ensuring balanced use of the information at each scale.
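A minimal PyTorch sketch of this design follows, assuming each branch outputs the 128 channels stated above; the padding values are chosen so the dilated convolutions preserve spatial size, and any detail not given in the text is an assumption.

import torch
import torch.nn as nn

class CEM(nn.Module):
    """Central Excitation Module: a sketch of the four-branch design."""
    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        # each branch first adjusts channels with a 1x1 convolution
        self.reduce = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1) for _ in range(4)])
        # three dilated convolutions (dilation 2) with growing kernels;
        # padding = dilation * (kernel - 1) // 2 keeps the spatial size
        self.dconv = nn.ModuleList([
            nn.Conv2d(out_ch, out_ch, k, padding=2 * (k - 1) // 2, dilation=2)
            for k in (3, 5, 7)
        ])
        # a 3x3 convolution fuses the concatenated branch outputs across channels
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        branches = [r(x) for r in self.reduce]
        dilated = [conv(b) for conv, b in zip(self.dconv, branches[:3])]
        fused = self.fuse(torch.cat(dilated, dim=1))
        return fused + branches[3]   # residual connection with the fourth branch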
3.2.2 Multi-scale Feature Fusion Module (MFFM)
The Multi-scale Feature Fusion Module fully integrates the excited feature maps of different scales and outputs a camouflaged object map that contains both high-level and low-level features. Its structure is shown in Figure 5. The small-scale excited feature maps pass their information to the larger-scale maps through successive upsampling and fusion, finally producing an output feature map of size 44×44×1.
The front end of the module fuses features with the Hadamard product (⊙), i.e., pixel-wise multiplication. This operation promotes feature crossover, reducing the discrepancy between the two groups of features and improving fusion quality.
The back end of the module fuses features by channel concatenation, which combines the features of each layer and increases the feature dimension without altering the internal information of the features, thereby making full use of the semantic information in both high-level and low-level features.
Let the module output be $F_{out}$, the large-scale feature map be $F_i$, and the small-scale feature map be $F_{i-1}$. In Figure 5, the feature map output by the Hadamard module (blue) is $F_h$, and the feature map output by the Concat module (green) is $F_c$. With $\mathrm{UP}(\cdot)$ denoting upsampling and $\mathrm{CBR}(\cdot)$ a convolution block, we have:
$$F_h = F_i \odot \mathrm{CBR}(\mathrm{UP}(F_{i-1})) \tag{1}$$

$$F_c = \mathrm{Concat}(F_i,\ \mathrm{CBR}(\mathrm{UP}(F_{i-1}))) \tag{2}$$

$$F_{out} = \mathrm{CBR}(F_c) \tag{3}$$
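A sketch of one fusion step is given below. We read CBR as convolution + batch normalization + ReLU, which is an assumption, as is the channel width; since Eqs. (1)-(3) do not state how $F_h$ feeds the cascade, the sketch simply returns both maps as Figure 5 routes them onward.

import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(in_ch, out_ch):
    # assumed CBR operator: convolution, batch normalization, ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class MFFMStep(nn.Module):
    """One fusion step of the MFFM (Eqs. 1-3)."""
    def __init__(self, ch=128):
        super().__init__()
        self.cbr_up = cbr(ch, ch)        # CBR applied to the upsampled F_{i-1}
        self.cbr_out = cbr(2 * ch, ch)   # CBR after channel concatenation

    def forward(self, f_i, f_im1):
        # UP: upsample the small-scale map F_{i-1} to the size of F_i
        up = self.cbr_up(F.interpolate(f_im1, size=f_i.shape[2:],
                                       mode='bilinear', align_corners=False))
        f_h = f_i * up                      # Eq. (1): Hadamard product
        f_c = torch.cat([f_i, up], dim=1)   # Eq. (2): channel concatenation
        f_out = self.cbr_out(f_c)           # Eq. (3)
        return f_h, f_out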
3.3 Attention Focus Module (AFM)
The Attention Focus Module has two steps. First, through upsampling and convolution operations, the three sets of feature maps output by the backbone are processed into feature maps of the same size and the same number of channels. These are then fed into the Channel-Spatial Attention Module to simulate the effect of human attention focusing on objects within the magnifier's field of view.
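The first step might look as follows; the Res2Net-50 stage widths and the 128-channel, 44×44 targets are assumptions consistent with the sizes quoted in Section 3.2.

import torch.nn as nn
import torch.nn.functional as F

class AlignFeatures(nn.Module):
    """Step 1 of the AFM: unify size and channel count of the backbone maps."""
    def __init__(self, in_chs=(512, 1024, 2048), ch=128, size=(44, 44)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv2d(c, ch, 1) for c in in_chs])
        self.size = size

    def forward(self, feats):
        # one 1x1 convolution per scale, then bilinear resampling to a common size
        return [F.interpolate(conv(f), size=self.size, mode='bilinear',
                              align_corners=False)
                for conv, f in zip(self.convs, feats)]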
3.3.1 Channel-Spatial Attention Module (CSAM)
The attention mechanism in deep learning simulates the human visual attention mechanism, with the goal of extracting the most important information [43]. It falls mainly into two types: spatial attention and channel attention. Spatial attention finds the most important regions in space and retains important local information through spatial transformations. Channel attention assigns different weights according to the importance of each channel, so that the model attends to the channels carrying more important information [45]. Each approach has its own strengths and weaknesses; our Channel-Spatial Attention Module fuses spatial and channel attention in parallel, as shown in Figure 6.
As illustrated in Figure 6, the Channel-Spatial Attention Module is implemented in four steps. Its pseudocode is as follows:
Algorithm 1: CSAM Algorithm
Input: L2, L3, L4.
# 1. Feature maps concat
X_original = Concat(L2, L3, L4)
for i = 2, 3, 4:
    # 2. Spatial attention
    xsa_i = GN(Li)
    xsa_i = weight * xsa_i + bias
    xsa_i = Li * Sigmoid(xsa_i)
    # 3. Channel attention
    xca_i = CAmodule(Li)
Xsa = Concat(xsa_2, xsa_3, xsa_4)
Xsa = Softmax(Xsa)
Xca = Concat(xca_2, xca_3, xca_4)
# 4. Fusion of attention maps
Xout = X_original * Xca * Xsa
Output: Xout.
Feature maps concat: The three groups of input feature maps, which share the same size and number of channels, are stacked along the channel dimension so that each scale contributes equally and the semantic information of high-level and low-level features is fully integrated. The feature maps of the three layers are also fed separately into the channel attention branch and the spatial attention branch to generate a channel attention map and a spatial attention map.
Channel attention: The Squeeze-and-Excitation (SE) module is the most common form of channel attention [46]. It extracts important features by assigning a weight to each channel, but it does not learn the importance of positional information. We therefore embed the Coordinate Attention (CA) module [47], which fully perceives positional information, into the CSAM. The CA module first performs coordinate information embedding, using 2D average pooling to aggregate the input features into a pair of direction-aware feature maps. It then performs coordinate attention generation: the direction-aware feature maps are processed by a convolutional layer, then split and encoded into two attention maps that store positional information. Finally, these maps are multiplied with the features via the Hadamard product to generate a channel attention map embedded with position and direction information.
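A compact sketch of the CA module as described in [47] follows; we use ReLU where the original uses a hard-swish nonlinearity, and the reduction ratio is an assumption.

import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate Attention [47], sketched from the original formulation."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        mid = max(8, ch // reduction)
        self.conv1 = nn.Conv2d(ch, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, ch, 1)
        self.conv_w = nn.Conv2d(mid, ch, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # coordinate information embedding: pool along each spatial direction
        x_h = x.mean(dim=3, keepdim=True)                       # N x C x H x 1
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # N x C x W x 1
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        # coordinate attention generation: split and encode per direction
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # N x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # N x C x 1 x W
        return x * a_h * a_w   # attention map carrying position and direction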
Spatial attention: The spatial attention mechanism is particularly important for finding specific targets, as it retains important local information. We first apply GroupNorm (GN), whose group normalization removes the hardware constraints that BatchNorm imposes. Second, a pair of trainable parameters, weight (w) and bias (b), assigns spatial weights to enhance the representational power of the feature map. Third, a sigmoid activation is applied and multiplied pixel by pixel with the original feature map to obtain the spatial attention map, which is finally normalized again with Softmax. A sketch of this branch follows.
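In this minimal sketch, the number of GroupNorm groups is an assumption; per the pseudocode above, the Softmax is applied only after the three branch outputs are concatenated.

import torch
import torch.nn as nn

class SpatialAttentionBranch(nn.Module):
    """Spatial attention branch of the CSAM."""
    def __init__(self, ch, groups=16):
        super().__init__()
        self.gn = nn.GroupNorm(groups, ch)                    # batch-size independent
        self.weight = nn.Parameter(torch.ones(1, ch, 1, 1))   # trainable w
        self.bias = nn.Parameter(torch.zeros(1, ch, 1, 1))    # trainable b

    def forward(self, x):
        a = self.weight * self.gn(x) + self.bias   # learned spatial re-weighting
        return x * torch.sigmoid(a)                # pixel-wise gate; Softmax follows
                                                   # the concat of the three branches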
Fusion of channel and spatial attention maps: The attention maps are fused with the Hadamard product, i.e., pixel-by-pixel multiplication, which yields a more accurate feature map.
3.4 Output Prediction
Finally, the feature maps output by the EMM and AFM are transformed into single-channel camouflaged object maps through upsampling, and the two maps are fused by pixel-wise addition. We adopt weighted BCE loss and weighted IoU loss [48] as the loss functions. The overall loss is:
$$L_{overall} = L(P_{EMM},\ GT) + L(P_{AFM},\ GT) \tag{4}$$

$$L(P, GT) = L_{wbce}(P, GT) + L_{wiou}(P, GT) \tag{5}$$
where $P_{EMM}$ and $P_{AFM}$ are the camouflaged object maps produced by the Ergodic Magnify Module and the Attention Focus Module after upsampling, and $GT$ is the ground-truth map.
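The weighted terms follow [48]; the sketch below reproduces the commonly used boundary-aware formulation from that work, with the pooling window and weighting factor taken as that paper's defaults (assumptions here).

import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss in the style of [48]."""
    # pixels near object boundaries receive larger weights
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

# overall loss per Eq. (4): the same loss is applied to both module outputs
# loss = structure_loss(p_emm, gt) + structure_loss(p_afm, gt)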