## 3.1. Feature Pyramid based on MobileNetV3

MobileNetV3 [20] follows the design principles of reduced width, increased input resolution, and stronger non-linearities. Reducing the width of the network, i.e., the number of channels in each convolution operation, lowers the computational complexity, while increasing the input resolution improves the model's ability to perceive fine details and small objects. MobileNetV3 also introduces stronger non-linear activation functions such as Hard-Swish [21] together with linear bottlenecks [22] to enhance feature representation. ReLU6 is a variant of the ReLU function that clamps negative inputs to zero and caps positive outputs at six, giving an output range of [0, 6], as shown in Eq. 1.

$$ReLU6\left( x \right)=\min \left( \max \left( 0,x \right),6 \right) \tag{1}$$

This makes ReLU6 well suited to certain application scenarios; in object detection, for example, it can bound the coordinate range of predicted boxes so that their positions remain within a reasonable range. Compared with ReLU6, the Swish function provides a smoother activation profile, which can help mitigate the vanishing-gradient problem during training and potentially improve model performance, as shown in Eq. 2.

$$Swish\left( x \right)=x \cdot sigmoid\left( x \right) \tag{2}$$

However, the Swish function is computationally expensive: it requires evaluating the sigmoid function, which involves an exponential operation. This imposes a noticeable overhead, especially in large-scale neural networks.

$$sigmoid\left( x \right)=\frac{1}{1+\exp \left( -x \right)} \tag{3}$$

The "h-sigmoid" activation function is obtained by applying the ReLU6 function to x + 3 and then dividing the result by 6. It has a shape similar to the sigmoid function, as shown in Fig. 2(a), but its forward and derivative computations are much simpler.

$$h\_sigmoid\left( x \right)=ReLU6\left( x+3 \right)/6 \tag{4}$$

Replacing the sigmoid in the Swish activation with h-sigmoid yields a function with a shape similar to Swish, as shown in Fig. 2(b), while improving inference speed and facilitating quantization.

$$h\_swish\left( x \right)=x \cdot h\_sigmoid\left( x \right) \tag{5}$$
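Equations 1–5 translate directly into code. A minimal scalar sketch in plain Python (no framework assumed):

```python
import math

def relu6(x):
    # Eq. 1: clamp negatives to 0 and cap positives at 6
    return min(max(0.0, x), 6.0)

def sigmoid(x):
    # Eq. 3: requires an exponential, hence the cost of Swish
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    # Eq. 2
    return x * sigmoid(x)

def h_sigmoid(x):
    # Eq. 4: piecewise-linear stand-in for the sigmoid
    return relu6(x + 3.0) / 6.0

def h_swish(x):
    # Eq. 5: similar shape to Swish, using only a clamp, an add, and a divide
    return x * h_sigmoid(x)
```

For |x| ≥ 3, h_sigmoid saturates exactly at 0 or 1, so h_swish becomes exactly zero or the identity there; this bounded, piecewise-linear behavior is what makes it friendly to quantization.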

In terms of network architecture, MobileNetV3 utilizes grouped convolution [23] and depth-wise separable convolution [24]. Grouped convolution divides the input feature map into multiple groups, performs a convolution on each group, and concatenates the results, which effectively reduces both the parameter count and the computational complexity. This is particularly useful when deploying larger convolutional neural network models under limited computational resources, such as real-time image processing on mobile devices.

Furthermore, grouped convolution can also improve the parallelism of the model to some extent, allowing for parallel computation on multiple GPUs or processors, thereby accelerating the speed of training and inference. Depth-wise separable convolution decomposes the standard convolution into two steps: depth-wise convolution and point-wise convolution, further reducing the computational complexity.
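The savings from grouping and depth-wise separation can be made concrete by counting weights (bias terms ignored; the layer sizes below are illustrative, not taken from MobileNetV3):

```python
def conv_params(c_in, c_out, k, groups=1):
    # a k x k convolution: each output channel sees only c_in/groups input channels
    return (c_in // groups) * k * k * c_out

def depthwise_separable_params(c_in, c_out, k):
    # depth-wise k x k conv (one filter per input channel) + 1 x 1 point-wise conv
    return conv_params(c_in, c_in, k, groups=c_in) + conv_params(c_in, c_out, 1)

standard = conv_params(64, 128, 3)                  # full 3 x 3 convolution
grouped = conv_params(64, 128, 3, groups=4)         # 4 groups: a quarter of the weights
separable = depthwise_separable_params(64, 128, 3)  # far fewer weights than standard
```

With these sizes, grouping by 4 divides the weight count by 4, and the depth-wise separable decomposition keeps only about 12% of the standard convolution's weights.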

The core idea of the inverted residuals structure is to map low-dimensional feature maps to a high-dimensional space and then compress them back to a low dimension through linear projection, as shown in Fig. 3. Its design goal is to address the excessive parameter count and computational complexity that traditional residual structures incur in lightweight networks. It consists of an expansion layer and a linear bottleneck layer.

The expansion layer increases the number of channels, while the linear bottleneck layer reduces the dimensionality; the non-linear activation functions are applied in the expanded space. This structure enables more effective utilization of the model's parameters and enhances the representation capability of features. Through the inverted residuals structure, the network performs feature extraction and non-linear transformations at a low dimension, and then increases the feature representation capability through high-dimensional mapping and linear projection. This design reduces the number of parameters and the computational complexity of the network while improving its representation capability, making lightweight networks more efficient while maintaining high performance.
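To see why the expand–depthwise–project pattern is economical, its weight count can be compared with a classic residual block of two full convolutions at the same width (the channel count and expansion factor below are illustrative assumptions):

```python
def inverted_residual_params(c, k=3, expansion=6):
    # 1x1 expansion -> k x k depth-wise conv -> 1x1 linear projection
    c_exp = c * expansion
    return c * c_exp + k * k * c_exp + c_exp * c

def classic_residual_params(c, k=3):
    # two full k x k convolutions at width c
    return 2 * c * k * k * c
```

Even with a 6x channel expansion, the depth-wise middle stage keeps the inverted block cheaper than two full 3 x 3 convolutions at the same width.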

In addition, the Squeeze-and-Excitation (SE) module is used for channel-wise attention weighting of feature maps. The SE module learns the importance weights of each channel in the feature map through global average pooling and two fully connected layers, and applies them to each spatial position in the feature map. This enhances the network's sensitivity to different channels and further improves the representation capability of features, as shown in Fig. 4.
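A pure-Python sketch of the SE computation, with nested lists standing in for tensors; the two small weight matrices play the role of the fully connected layers and are illustrative assumptions:

```python
import math

def se_module(feature_map, w1, w2):
    # squeeze: global average pooling per channel
    s = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel
    hidden = [max(0.0, sum(w * v for w, v in zip(row, s))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # scale: reweight every spatial position of each channel
    return [[[v * weights[c] for v in row] for row in feature_map[c]]
            for c in range(len(feature_map))]

# two 2x2 channels; identity matrices stand in for the learned FC weights
fm = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
reweighted = se_module(fm, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Each channel is multiplied by a single learned weight in (0, 1), so the output keeps the input's shape while emphasizing informative channels.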

Neural Architecture Search (NAS) methods use an automated search process to discover the optimal network structure and parameter configuration. Employing reinforcement learning algorithms and search strategies, MobileNetV3 explores different combinations of network structures, including the number of layers, layer widths, and branching structures, to optimize performance. The automated search alleviates the burden of manual network design and improves both search efficiency and final performance, enabling MobileNetV3 to find the best network structure and parameter configuration automatically.

## 3.2. Combining with the adaptive neck of SA-net

SA-NET [26] (Shuffle Attention for Deep Convolutional Neural Networks) is a deep convolutional neural network based on attention mechanisms. It aims to improve the network's representational power and feature selection ability so as to better capture important features in images. The core ideas of SA-NET include grouped convolution and channel shuffling to enhance interactions between features, together with attention modules that adaptively adjust the weights of feature maps.

In traditional deep convolutional neural networks, convolution operations are usually performed simultaneously across all channels, resulting in independence between different channels in the feature maps. To enhance interactions between features, SA-NET introduces the concept of grouped convolution. It divides the input feature maps into multiple groups and performs independent convolution operations on each group. Specifically, when using grouped convolution, the input feature maps are divided into g groups, each containing c/g channels. Assuming the input feature map is X, the parameters of grouped convolution are denoted as W, and the output feature map is Y. The calculation formula for grouped convolution can be represented as follows:

$$Y=Concatenate\left( \left[ Conv\left( X_1,W_1 \right),Conv\left( X_2,W_2 \right),\ldots ,Conv\left( X_G,W_G \right) \right] \right) \tag{6}$$

In this context, \(Conv\) denotes the standard convolution operation, \({X_i}\) the input feature map of the i-th group, and \({W_i}\) the convolutional kernel of that group. The purpose of this approach is to gradually capture specific semantic responses in each sub-feature map during training. Individual convolution operations are performed on each subset, and their outputs are concatenated to form the final output feature map, enabling better fusion and transmission of feature information among the groups. Additionally, grouped convolution significantly reduces the computational complexity and parameter count of the model. Compared to traditional convolution, it decomposes the operation into smaller ones, each handling only a subset of the channels, which reduces the cost of each convolution and fully leverages parallel computing. The grouped processing also decreases the level of parameter sharing, thereby reducing the model's parameter count.
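Equation 6 can be illustrated with 1 × 1 kernels, where each group's convolution collapses to dot products over its channel subset (a toy sketch; real kernels also slide spatially):

```python
def grouped_conv_1x1(group_inputs, group_kernels):
    # Eq. 6: Conv(X_i, W_i) per group, then Concatenate over the outputs
    out = []
    for x_g, w_g in zip(group_inputs, group_kernels):
        out.extend(sum(w * c for w, c in zip(row, x_g)) for row in w_g)
    return out

# G = 2 groups of 2 channels; group 1 emits two channels, group 2 emits one
y = grouped_conv_1x1([[1.0, 2.0], [3.0, 4.0]],
                     [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]])
```

Because each group only ever sees its own channels, the per-group kernels are smaller than a full kernel over all channels, which is where the parameter savings come from.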

Furthermore, SA-NET introduces channel shuffling to enhance the interaction between features. Channel shuffling rearranges the results of grouped convolution, allowing feature maps from different groups to interleave with each other. Through channel shuffling, information between different channels can interact and fuse more effectively. Adjacent channels originate from different groups, enriching the correlation between features and aiding in capturing more feature information and patterns. This operation increases the expressive power of the model while reducing computational and parameter requirements to a certain extent, resulting in a more lightweight network. Assuming the input feature map is \(X\) and the output feature map after grouped convolution is \(Y\), the channel shuffling operation can be represented as follows:

$$Y=Concatenate\left( \left[ Shuffle\left( X_1 \right),Shuffle\left( X_2 \right),\ldots ,Shuffle\left( X_G \right) \right] \right) \tag{7}$$

In this context, \(Shuffle\) represents the channel shuffling operation, and \({X_i}\) the input feature map of the i-th group. Channel shuffling has been widely applied in lightweight network architectures, particularly on mobile devices and embedded systems, to provide high-performance computation and recognition capabilities. It is an effective design strategy that reduces computational and storage requirements while maintaining model accuracy, making it suitable for various computer vision tasks.
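The shuffle operation itself is only a reshape, transpose, and flatten over the channel dimension; a minimal sketch on channel indices:

```python
def channel_shuffle(channels, groups):
    # view the channel list as [groups][c/groups], transpose, and flatten,
    # so adjacent output channels come from different groups
    per_group = len(channels) // groups
    grouped = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    return [grouped[g][i] for i in range(per_group) for g in range(groups)]

shuffled = channel_shuffle(list(range(6)), 2)  # [0, 3, 1, 4, 2, 5]
```

Shuffling the result again with the transposed group count (3 instead of 2 here) restores the original order, confirming the operation is a pure permutation with no parameters or arithmetic of its own.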

Channel attention aims to capture the dependencies between different channels to better control the representational capacity of feature maps. In the SA network, for each sub-feature map, the global average pooling (GAP) operation is applied to compute the average along the spatial dimensions, obtaining the global statistics of the channels:

$$s=\frac{1}{H \times W}\sum\limits_{i=1}^{H} \sum\limits_{j=1}^{W} X_{k1}(i,j) \tag{8}$$

Next, a simple gating mechanism is used to generate channel weights. Specifically, the parameters \({W_1}\) and \({b_1}\) are used for linear transformation, followed by a sigmoid activation function to obtain the weight parameter:

$$weights=\sigma \left( W_1 \cdot s+b_1 \right) \tag{9}$$

Finally, the weight parameter is applied to \(X_{k1}\) to rescale the sub-feature map, resulting in the final channel-attention output \(X'_{k1}\). This step can be represented as:

$$X'_{k1}=weights \cdot X_{k1} \tag{10}$$

Through this computation process, channel attention can model the importance of different channels and adjust the representation of sub-feature maps based on the weight parameters. This helps to enhance the model's representational capacity and better capture the correlations between features.
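Equations 8–10 for a single sub-feature map can be sketched in a few lines (a single-channel map is used for brevity, so the gating parameters w1 and b1 reduce to scalars; both values below are illustrative assumptions):

```python
import math

def channel_attention(x_k1, w1, b1):
    H, W = len(x_k1), len(x_k1[0])
    s = sum(sum(row) for row in x_k1) / (H * W)         # Eq. 8: global average pooling
    weight = 1.0 / (1.0 + math.exp(-(w1 * s + b1)))     # Eq. 9: sigmoid gate
    return [[weight * v for v in row] for row in x_k1]  # Eq. 10: rescale the map

attended = channel_attention([[2.0, 2.0], [2.0, 2.0]], w1=1.0, b1=0.0)
```

The whole map is scaled by one gate value in (0, 1) derived from its own global statistic, which is how the module suppresses or emphasizes entire channels.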

Spatial attention is used to determine which positions in the feature map contain informative content. In the SA network, the computation of spatial attention proceeds as follows. First, for each sub-feature map \(X_{k2}\), spatial statistics are obtained by applying a group normalization (GN) operation, which enhances the representation capability of the feature map.

$$GN(X_{k2})=\frac{X_{k2}-\mu }{\sqrt{\sigma ^2+\varepsilon }} \tag{11}$$

Where \(X_{k2}\) represents the sub-feature map, \(\mu\) and \(\sigma\) the mean and standard deviation computed over \(X_{k2}\), and \(\varepsilon\) a small constant for numerical stability. The adjusted feature map \(X'_{k2}\) is obtained by further applying a function \(F_c(\cdot)\):

$$X'_{k2}=F_c\left( GN\left( X_{k2} \right) \right) \tag{12}$$

Where \(F_c(\cdot)\) is a non-linear transformation function, such as the ReLU activation function. The adjusted sub-feature maps \(X'_{k1}\) and \(X'_{k2}\) are aggregated by concatenating them:

$$X'_{k}=\left[ X'_{k1};X'_{k2} \right] \tag{13}$$

where \(X'_{k1}\) is the sub-feature map adjusted by channel attention, and \(X'_{k2}\) the sub-feature map adjusted by spatial attention. Finally, the output of the SA module has the same size as the original feature map, which makes the SA module easy to integrate into modern architectures, as shown in Fig. 5.
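Equations 11 and 12 likewise reduce to a per-map normalization followed by a non-linearity; a minimal sketch (ReLU is used as one possible choice for the transformation):

```python
import math

def spatial_attention(x_k2, eps=1e-5):
    vals = [v for row in x_k2 for v in row]
    mu = sum(vals) / len(vals)                          # group mean
    var = sum((v - mu) ** 2 for v in vals) / len(vals)  # group variance
    gn = [[(v - mu) / math.sqrt(var + eps) for v in row]
          for row in x_k2]                              # Eq. 11: group normalization
    return [[max(0.0, v) for v in row] for row in gn]   # Eq. 12: Fc as ReLU

adjusted = spatial_attention([[1.0, 2.0], [3.0, 4.0]])
```

Positions below the map's own mean are normalized to negative values and zeroed out, so only the above-average spatial locations pass through.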

## 3.3. C3_DS_Conv

Deep neural networks have achieved tremendous success in computer vision, natural language processing, and other fields. However, their complexity and large number of parameters lead to high computational and storage requirements. The main goal of quantization techniques is to reduce the precision of parameters and activation values, thereby reducing these demands. Traditional deep neural networks use 32-bit floating-point numbers to represent parameters and activation values, while quantization techniques represent them as lower-precision integers or binary forms. For example, parameters and activation values can be represented using 8-bit integers or binary codes, significantly reducing the computational and storage overhead.

Quantization serves multiple purposes. First, it reduces computational cost: lower-precision representations simplify multiplication and addition operations and speed up inference, which is crucial for real-time applications and resource-constrained devices such as mobile devices and embedded systems. Second, it reduces storage cost: deep neural networks typically have a large number of parameters, and representing them in compact integer or binary form significantly reduces memory requirements, which matters for deployment in resource-constrained environments. Finally, quantization improves energy efficiency: by reducing computational demand, it lowers energy consumption, prolongs battery life, and enhances device efficiency, which benefits power-sensitive applications such as mobile devices and wireless sensor networks.

During the inference process of neural networks, adopting quantization methods with low-precision representations can significantly reduce computational and storage overhead. However, how to quantize weights and activations without sacrificing accuracy remains a challenge. DS_Conv [27] employs a strategy called block-wise quantization, where weights and activations are divided into different blocks and quantized separately. By converting floating-point values into fixed-bit integer values and using floating-point scaling factors to preserve the quantized precision, DS_Conv achieves low-precision representation without compromising the accuracy of the network. The core principle of DS_Conv is based on the relative distribution invariance of quantized weights and activations. It utilizes the block-wise strategy in quantization operations, representing weights and activations using integer values and employing floating-point scaling factors to maintain the quantized precision. By minimizing the Kullback-Leibler (KL) divergence or L2 norm, DS_Conv computes the scaling factor for each block to re-map the quantized values back to the original range.
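The block-wise idea can be sketched as mapping each block of floats to low-bit integers plus one floating-point scale per block. This is only an illustration of the VQK-plus-scaling-factor scheme: the scale here is chosen by max-abs, whereas DS_Conv fits it by minimizing the KL divergence or L2 norm.

```python
def quantize_block(values, bits=8):
    # VQK analogue: signed `bits`-bit integers (2's-complement range)
    qmax = 2 ** (bits - 1) - 1
    # KDS analogue: one floating-point scaling factor for the whole block
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize_block(quantized, scale):
    # re-map the integers back to the original value range
    return [q * scale for q in quantized]

q, s = quantize_block([0.5, -1.0, 0.25])
restored = dequantize_block(q, s)
```

The round-trip error of each value is bounded by half the scale, so larger blocks (whose scale must cover a wider range) trade storage savings for precision, matching the trade-off described below.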

The specific structure of DS_Conv consists of two key components: the Variable Quantization Kernel (VQK) and Kernel Distribution Shifting (KDS), as shown in Fig. 6. VQK is an integer tensor of the same size as the original weight tensor, represented in 2's-complement format, with its range determined by a preselected number of bits. KDS is a floating-point tensor that stores the scaling factor for each block. By multiplying the integer values of weights and activations with their respective scaling factors, the distribution of each block is readjusted to the correct range. DS_Conv also employs an activation quantization method called Block Floating Point (BFP), which partitions the activation tensor into blocks and clips and shifts the activations in each block according to the block's maximum exponent. This allows the activation tensor to use fewer bits and enables low-precision integer operations between weights and activations. By appropriately selecting the block size B and the number of bits b, DS_Conv trades off computational and storage efficiency against inference accuracy: a larger block size B reduces storage overhead and the depth of KDS but may increase clipping errors, while a smaller number of bits b lowers storage costs but may introduce larger quantization error.

DS_Conv is a quantization method for neural network inference that achieves low-precision representation without sacrificing network accuracy through block-wise quantization and floating-point scaling factors. It employs the two key components, VQK and KDS, to store quantized weights and activations, and utilizes the BFP method for activation quantization. By appropriately selecting the block size and number of bits, DS_Conv strikes a balance between computational and storage efficiency and inference accuracy. This approach demonstrates good performance even without training data and has broad prospects for applications.