The proposed approach for indoor object detection is based on a combination of two powerful neural networks. The cross-stage partial network (CSPNet) [11] is used as the detection framework. In this work, we propose to replace the CSPNet backbone with a lighter one, EfficientNet v2 [12], in order to obtain a reliable implementation on an FPGA device.
The main contribution of introducing CSPNet is that it allows the architecture to achieve a richer gradient combination while reducing the computational complexity. This is obtained by dividing the feature map into two parts and then merging them through a cross-stage hierarchy. The CSPNet architecture is robust enough to let the gradient flow propagate through different paths within the network. In this way, CSPNet reduces the amount of computation and improves the network results.
Generally, the CSPNet architecture addresses the following problems:
- Increasing the learning ability of the CNN model: the performance of DCNN models is often degraded after light-weighting the network. CSPNet maintains a good learning ability while using a lightweight architecture.
- Reducing the computational complexity: the CSPNet architecture reduces the computational complexity by distributing the amount of computation of each layer, which raises the utilization rate of the computation units.
- Reducing the memory storage: the CSPNet architecture adopts the cross-channel pooling technique [21] to compress the feature maps during the feature pyramid generation process, thereby lowering the memory cost of the neural network.
The main aim of using the partial transition layer in the CSP architecture is to maximize the difference of the gradient combinations. The partial transition layer truncates the gradient flow and prevents different layers from learning duplicate gradient information. Two strategies can be used: fusion first and fusion last. In the fusion first strategy, the two parts of the generated feature maps are concatenated first and the transition operation is performed afterwards; with this strategy, a large amount of gradient information is reused. In the fusion last strategy, the transition is applied before the concatenation, so the gradient flow is truncated and the gradient information is not reused. As the proposed work aims to build a lightweight indoor object detection system, we used the fusion last method, as it significantly reduces the network computation cost. Figure 1 shows the fusion last method adopted in the proposed work.
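To make the fusion last strategy concrete, the following minimal PyTorch sketch (our illustration, not the authors' released code) splits the input feature map in two, applies a hypothetical computational block and a 1x1 transition to one half only, and then concatenates the two halves without any further transition after the merge.

```python
# Minimal sketch of a CSP block with the "fusion last" strategy (illustrative only).
import torch
import torch.nn as nn

class CSPFusionLastBlock(nn.Module):
    def __init__(self, channels: int, block: nn.Module):
        super().__init__()
        half = channels // 2
        self.block = block                                               # computational block on half the channels
        self.transition = nn.Conv2d(half, half, kernel_size=1, bias=False)  # partial transition layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        part1, part2 = torch.chunk(x, 2, dim=1)        # split the feature map into two parts
        part2 = self.transition(self.block(part2))     # transition applied before the merge (fusion last)
        return torch.cat([part1, part2], dim=1)        # merge through the cross-stage concatenation

# Example usage with a hypothetical 3x3 convolutional block on half of 64 channels.
blk = CSPFusionLastBlock(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU()))
out = blk(torch.randn(1, 64, 56, 56))  # -> shape (1, 64, 56, 56)
```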
In order to build the proposed indoor object detection system, we used CSPNet as the detection model. It relies on the exact fusion model (EFM), which consists of three main parts:
- Perfect prediction: the EFM architecture captures the appropriate field of view (FoV) of each object for every applied anchor, which enhances the network accuracy.
- Aggregating the feature pyramid: the EFM architecture is based on the YOLO v3 [22] architecture; it aggregates the initial feature pyramid by assigning one prior bounding box to every ground-truth object. Every ground truth corresponds to one anchor that surpasses the IoU threshold. If the size of the anchor is equivalent to the FoV of the grid cell at the s-th scale, then its corresponding bounding box is lower bounded by the (s-1)-th scale and upper bounded by the (s+1)-th scale. In this way, the EFM concatenates features from the three scales.
- Balancing the computation: as the EFM module accumulates feature maps from different scales, it requires a large amount of computational resources. To address this problem, the CSPNet architecture introduces the maxout technique to resize the feature maps, as illustrated in the sketch after this list. Figure 2 provides a detailed view of the EFM architecture.
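The following minimal sketch (an illustration under our assumptions, not the exact EFM code) shows how a maxout-style reduction can fuse feature maps from several scales without increasing the channel count: the maps are first resized to a common spatial size and the fusion is an element-wise maximum rather than a concatenation.

```python
# Maxout-style fusion of multi-scale feature maps (illustrative sketch).
import torch
import torch.nn.functional as F

def maxout_fuse(features):
    """features: list of tensors with the same channel count but different spatial sizes."""
    target_size = features[0].shape[-2:]                               # use the first scale as reference
    resized = [F.interpolate(f, size=target_size, mode="nearest") for f in features]
    return torch.stack(resized, dim=0).max(dim=0).values               # element-wise maxout across scales

# Hypothetical three-scale pyramid with 256 channels per scale.
p3, p4, p5 = (torch.randn(1, 256, s, s) for s in (52, 26, 13))
fused = maxout_fuse([p3, p4, p5])   # -> shape (1, 256, 52, 52), same channel count as each input
```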
EfficientNet [23] presents a family of networks that are specifically optimized for FLOPs and parameter reduction. It is a rethought version of convolutional neural networks that balances model scaling between depth, width and resolution, which contributes to better network performance. The EfficientNet network proposes a new scaling technique named "compound coefficient scaling".
Generally, deep learning-based CNN models are scaled up by adding more layers, but in the EfficientNet architecture the scaling is applied across multiple dimensions. After various and extensive experiments, the compound scaling is defined by the following coefficients (a small numerical illustration follows the list below):
- Depth = 1.20
- Width = 1.10
- Resolution = 1.15
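As a short numerical illustration (the values of the compound coefficient phi and the printed multipliers are ours, not taken from the paper), the three dimensions are scaled jointly from the coefficients above, so that the cost grows in a controlled way as phi increases:

```python
# Compound scaling illustration: depth, width and resolution are scaled together.
alpha, beta, gamma = 1.20, 1.10, 1.15   # depth, width, resolution coefficients listed above

def compound_scale(phi: int):
    depth_mult = alpha ** phi           # multiplier on the number of layers
    width_mult = beta ** phi            # multiplier on the number of channels
    res_mult = gamma ** phi             # multiplier on the input resolution
    return depth_mult, width_mult, res_mult

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```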
The EfficientNet family presents another powerful component, which is the depthwise convolution. This type of convolution has fewer parameters and a lower computational complexity than the regular convolution, but it cannot be exploited efficiently by modern accelerators. To address this problem, Fused-MBConv is proposed: it replaces the depthwise 3x3 convolution and the 1x1 expansion convolution with a single regular 3x3 convolution, as presented in Figure 3.
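The following simplified PyTorch sketch (squeeze-and-excitation, residual connections and drop-connect are omitted, and the channel counts are hypothetical) contrasts the two blocks: MBConv expands with a 1x1 convolution and applies a depthwise 3x3 convolution, whereas Fused-MBConv uses a single regular 3x3 convolution in their place.

```python
# Simplified MBConv vs. Fused-MBConv blocks (illustrative, SE and residuals omitted).
import torch
import torch.nn as nn

def mbconv(channels: int, expansion: int = 4) -> nn.Sequential:
    hidden = channels * expansion
    return nn.Sequential(
        nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),   # expansion 1x1
        nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),              # depthwise 3x3
        nn.BatchNorm2d(hidden), nn.SiLU(),
        nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels),            # projection 1x1
    )

def fused_mbconv(channels: int, expansion: int = 4) -> nn.Sequential:
    hidden = channels * expansion
    return nn.Sequential(
        nn.Conv2d(channels, hidden, 3, padding=1, bias=False),                           # single regular 3x3
        nn.BatchNorm2d(hidden), nn.SiLU(),
        nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels),            # projection 1x1
    )

x = torch.randn(1, 24, 56, 56)
print(mbconv(24)(x).shape, fused_mbconv(24)(x).shape)   # both -> (1, 24, 56, 56)
```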
Applying Fused-MBConv in the early stages of the network (stages 1-3) improves the training speed with a small overhead in FLOPs and parameters, whereas applying it up to the 7th network stage significantly increases the number of FLOPs and parameters. It is therefore important to find the best combination of the two blocks (MBConv and Fused-MBConv).
To address this problem, the best combination is searched for in the NAS search space [24] using a "training-aware" NAS framework, which jointly optimizes the network accuracy and efficiency. EfficientNet searches the NAS space for the best choice of convolution blocks (MBConv and Fused-MBConv), number of layers, kernel size (3x3 or 5x5) and expansion ratio (1, 4, or 6). The training-aware NAS search space can be reduced by removing unnecessary search options such as pooling and skip connections. The EfficientNet v2 architecture is presented in Table 1.
Table 1: EfficientNet v2 architecture

Stage | Operator                 | Stride | Number of layers
0     | Conv 3x3                 | 2      | 1
1     | Fused-MBConv1, k3x3      | 1      | 2
2     | Fused-MBConv4, k3x3      | 2      | 4
3     | Fused-MBConv4, k3x3      | 2      | 4
4     | MBConv4, k3x3, SE 0.25   | 2      | 6
5     | MBConv6, k3x3, SE 0.25   | 1      | 9
6     | MBConv6, k3x3, SE 0.25   | 2      | 15
7     | Conv 1x1 & Pooling & FC  | -      | 1
The EfficientNet v2 architecture presents several distinctions compared to the original EfficientNet: it uses both MBConv and Fused-MBConv in the early layers; it uses smaller expansion ratios for the MBConv layers, as they require fewer memory accesses; and 3x3 kernel sizes are used to reduce the computational complexity. By using the training-aware NAS search, EfficientNet v2 achieves a much faster training process. The image size plays an important role in the training speed, so when training the network with different image sizes it is important to adapt the regularization parameters accordingly.
During EfficientNet v2 training, we used progressive learning with adaptive regularization. In the early training epochs, the network is trained with smaller image sizes and weak regularization; the image size is then gradually increased and the learning process is made harder by adding stronger regularization.
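A minimal sketch of such a schedule is given below; the epoch counts, image sizes and dropout values are hypothetical placeholders, not the exact schedule used in our experiments.

```python
# Progressive learning with adaptive regularization (illustrative schedule).
def progressive_schedule(epoch, total_epochs, min_size=128, max_size=300,
                         min_dropout=0.1, max_dropout=0.3):
    t = epoch / max(total_epochs - 1, 1)                      # training progress in [0, 1]
    image_size = int(min_size + t * (max_size - min_size))    # gradually grow the input resolution
    dropout = min_dropout + t * (max_dropout - min_dropout)   # gradually strengthen regularization
    return image_size, dropout

for epoch in (0, 100, 199):
    size, drop = progressive_schedule(epoch, total_epochs=200)
    print(f"epoch {epoch}: image size {size}, dropout {drop:.2f}")
```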
In order to ensure a lightweight implementation of the proposed indoor object detection system on an FPGA device while keeping a high detection rate, we adopted a weight pruning technique. This technique is applied during the training process so that the network can adapt to the changes. Its main aim is to remove the unnecessary weights; in other words, weight pruning increases the sparsity of the neural network. Weight pruning requires a criterion for choosing which parameters are kept and which ones are removed. In the proposed work, the pruning criterion is the absolute value (L1 norm): we fixed a threshold value, and if the L1 norm of a weight falls below this threshold, the weight is set to zero and removed; otherwise it is maintained. Figure 4 provides the workflow adopted to perform the weight pruning technique.
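The sketch below illustrates this magnitude-based criterion on a PyTorch model; the threshold value and the toy model are hypothetical, and in the proposed work the pruning is applied during training rather than once after it.

```python
# Magnitude (L1-norm) pruning: zero out weights whose absolute value is below a threshold.
import torch
import torch.nn as nn

@torch.no_grad()
def prune_by_threshold(model: nn.Module, threshold: float) -> float:
    total, zeroed = 0, 0
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            mask = module.weight.abs() >= threshold           # keep weights with |w| above the threshold
            module.weight.mul_(mask.to(module.weight.dtype))  # set the pruned weights to zero in place
            total += mask.numel()
            zeroed += (~mask).sum().item()
    return zeroed / max(total, 1)                             # achieved sparsity ratio

# Hypothetical usage on a small model.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Linear(16, 10))
sparsity = prune_by_threshold(model, threshold=0.05)
print(f"sparsity: {sparsity:.2%}")
```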
In order to reduce and compress the model size, we used post-training quantization, which reduces the representation of the weights and activations from 32-bit floating point to a 2-bit fixed-point representation. It also contributes to reducing the computational complexity of the neural network as well as its memory storage.
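The following sketch illustrates the principle on a single weight tensor using symmetric uniform quantization; the helper names and the per-tensor scaling choice are our assumptions, not the exact quantization scheme of the deployed FPGA design.

```python
# Post-training quantization of a weight tensor to a 2-bit fixed-point grid (illustrative).
import torch

def quantize_2bit(weights: torch.Tensor):
    n_bits = 2
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1          # integer levels -2 .. 1
    scale = weights.abs().max() / qmax if weights.abs().max() > 0 else 1.0
    q = torch.clamp(torch.round(weights / scale), qmin, qmax)         # map to 2-bit integer levels
    return q.to(torch.int8), scale                                    # compact storage + per-tensor scale

def dequantize(q: torch.Tensor, scale) -> torch.Tensor:
    return q.float() * scale                                          # approximate reconstruction

w = torch.randn(64, 32)
q, s = quantize_2bit(w)
print(q.unique(), (dequantize(q, s) - w).abs().mean())                # levels used and mean abs error
```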