Earf-YOLO: A Model for Recognizing Zhuang Minority Patterns Based on YOLOv3

With reference to the limitations of YOLOv3 in recognizing symbols on Zhuang patterns, such as slow detection speed, inability to detect small objects, and inaccurate positioning of bounding boxes, we propose a new model in this paper: Earf-YOLO (Efficient Attention Receptive Field You Only Look Once). In Earf-YOLO, we first present an attention module, CBEAM (Convolution Block Efficient Attention Module), which refines feature maps along the channel and spatial dimensions. In the CBEAM module, a local cross-channel interaction strategy without dimensionality reduction is used to improve the performance of the convolutional neural network. Besides, we put forward the SRFB (Strength Receptive Field Block) structure. During training, additional branch structures are generated to enrich the feature space of the convolutional block; during prediction, the multi-branch structures are re-parameterized and fused into one main branch to improve the performance of the model. Finally, we adopt some advanced training techniques to improve detection performance. Experiments on the dataset of Zhuang patterns and on the COCO dataset show that the Earf-YOLO model can effectively reduce the error between the prediction box and the ground-truth box and decrease the computation time. The mAP value of this model reaches 82.1 (IoU=0.5) on the dataset of Zhuang patterns and 62.14 (IoU=0.5) on the COCO dataset.

The one-stage model skips generating detection proposals and adopts an end-to-end approach for training and recognition. Although YOLOv3 and SSD show good real-time processing capability on multiple Graphics Processing Units (GPUs), they lose accuracy and cannot run on a single GPU in real time. In the recognition stage, they need many GPUs to ensure real-time performance. In the training stage, they also need many, or expensive, GPUs to increase the batch size so as to better fit the model to the data.

With regard to the above-mentioned problems, it is necessary to build a fast and efficient model for object detection. A sensible solution is not to increase the depth of the neural network but to strengthen the feature extraction ability of lightweight networks by introducing new structures and modules into the models.

For recognizing symbols on Zhuang patterns, we propose the Efficient Attention Receptive Field Earf-YOLO model based on YOLOv3 in this paper. Earf-YOLO evolves YOLOv3 into a quick, accurate, and easily trained model.

The main contributions of our work are summarized as follows:

(1) We propose a new attention module, CBEAM, which refines feature maps along the channel and spatial dimensions. It contains few parameters but can improve recognition performance.

(2) We propose the SRFB structure, based on the RFB structure [7], to replace the redundant convolutional layer.

The two-stage object recognition model classifies candidate regions with a CNN to achieve good results. However, it requires a large computational overhead to ensure its recognition effect. The one-stage object recognition model accelerates detection, discards the step of generating detection proposals, and predicts confidence and multiple objects from the features of the whole image.
Although its prediction time is shorter, its accuracy is significantly lower than that of the two-stage object recognition model. Recently, some advanced one-stage object recognition models (such as RetinaNet [12], EfficientDet [13], YOLOv4 [14], etc.) have updated their original networks with new techniques, making their accuracy comparable to that of two-stage models. In this paper, guided by the one-stage object recognition paradigm, we optimize the network structure to achieve higher efficiency and accuracy.

In recent years, the attention module has been introduced into convolutional neural networks, significantly improving their feature extraction ability with little computational overhead.

Meanwhile, the attention module has shown great potential for further improvement. In some studies, for example, the FC layer also captures unimportant pseudo-attention feature maps, which negatively affect the important channels worthy of attention. In addition, using a large number of two-dimensional convolutions increases the cost of memory access and loses the dependencies between different groups. Considering the imperfection of current attention modules, we are committed to developing a more efficient one.

Some modules that expand the network receptive field can improve recognition accuracy, though they add a little computational cost. He et al. [18] proposed the Spatial Pyramid Pooling (SPP) structure, which uses max pooling with multiple parallel k × k kernels to increase the receptive field of the model and gather feature information.

Although the SPP structure can increase the receptive field of the model and obtain multi-scale information, it fails to consider the influence of the eccentricity of the receptive field: every pixel in the receptive field has the same influence, and the important information in the receptive field is not emphasized, so some important information may be omitted. Chen et al. [19] proposed the ASPP structure in DeepLabv3+ in order to capture contextual information at multiple scales. The main difference between the ASPP module and the SPP module lies in how the receptive field is extended. However, during training, the RFB structure fails to expand the receptive field to the maximum extent and give full play to its multi-scale prediction performance, and it takes a longer inference time to carry out prediction.

In the study of the receptive field, we endeavor to re-parameterize the RFB structure, to develop a new module that replaces redundant convolutions, and to improve the recognition performance of the model. We creatively combine the multi-branch, multi-scale, and over-parameterization properties of RFB to enrich the feature space during training; SRFB can be equivalently converted into a single convolution at inference time.

Usually, researchers adopt advanced training strategies, modules, and post-processing methods to make the model more accurate. A strategy that only increases the training cost, not the inference cost, is called a "bag of freebies".

Insertion modules and post-processing methods, which add only a small amount of prediction overhead but can significantly improve the prediction accuracy of the model, are called a "bag of specials". Some "bag of freebies" methods optimize the loss function so that the model better fits the data, for example by addressing the imbalance between positive and negative samples.

The Hard-NMS algorithm sorts prediction boxes from high scores to low scores, selects the prediction box with the highest score, sets a threshold, deletes every prediction box whose overlap rate with the highest-scored box exceeds the threshold, and repeats these steps on the remaining prediction boxes until none are left. When the overlap rate of two objects in the image is larger than the fixed threshold, Hard-NMS sets the score of the overlapping prediction box to 0, so the box is deleted. This may leave low-scored objects undetected and lose accuracy. The Soft-NMS algorithm takes object occlusion into account: instead of deleting an overlapping prediction box, it decays its score. Therefore, Soft-NMS addresses the problem that Hard-NMS mistakenly deletes the prediction box when two objects overlap.

The channel attention in CBEAM is shown in Figure 2. The weight matrix w_k has k × C parameters. Every group of vectors considers the interactive feature information of its k neighbors, which avoids complete independence of feature information between groups. The channel features in CBEAM channel attention only exchange information between each channel and its k neighboring channels, and they share learning parameters. The weight of channel attention can be calculated by formula 2.
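To make the contrast between the two algorithms concrete, the sketch below implements the procedure described above using the Gaussian decay variant of Soft-NMS. The decay parameter sigma and the score threshold are illustrative defaults, not values taken from this paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of
    zeroing them out, so occluded low-scored objects can still survive."""
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while len(scores) > 0:
        m = np.argmax(scores)                 # M: highest-scored box
        keep.append((boxes[m], scores[m]))
        ious = iou(boxes[m], boxes)
        # Hard-NMS would zero scores where ious > N_t; Soft-NMS decays them
        scores = scores * np.exp(-(ious ** 2) / sigma)
        mask = (np.arange(len(scores)) != m) & (scores > score_thresh)
        boxes, scores = boxes[mask], scores[mask]
    return keep
```

With Hard-NMS, a heavily overlapped second object would be deleted outright; here its score is merely reduced, so it can still be reported if the decayed score stays above the threshold.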
The calculation of the channel attention weight ω can be achieved by a 1D convolution with kernel size k. The weight of the channel attention can be calculated by formula 3.
In the formula, C1 represents 1D convolution. The kernel size k of the 1D convolution determines the correlation range of the feature information. If the channel attention of CBEAM is divided into fixed groups, the performance of the CNN can be improved, but the same feature information will be communicated between the high-dimensional channels and between the low-dimensional channels for a long time. Therefore, in CBEAM channel attention, we define a nonlinear proportional relationship between the communication range of features and the channel dimension C, so as to solve the problem of long-time information communication between features. There is a mapping φ between k and C, as shown in equation 4.
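Equation 4 itself is not reproduced in this excerpt. A common choice for such a nonlinear C-to-k mapping (used by ECA-Net, which this cross-channel strategy resembles) is k = |log2(C)/γ + b/γ| rounded up to the nearest odd integer; the sketch below assumes that form with γ = 2 and b = 1.

```python
import math

def adaptive_kernel_size(channels, gamma=2, b=1):
    """Map the channel dimension C to a 1D-conv kernel size k that grows
    with log2(C) and is forced to be odd (ECA-style assumption)."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1
```

Under this mapping, a 256-channel layer would interact over k = 5 neighbors while a 64-channel layer uses k = 3, so high-dimensional channels exchange information over a wider neighborhood.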
Given the channel dimension C of the channel attention model, the convolution kernel size k can automatically adapt to C. The channel attention module is expressed as formula (5).

The spatial attention module in CBEAM is shown in Figure 2; it produces the final spatial attention map and is expressed as formula (6). In the formula, F refers to the feature map and σ indicates the Sigmoid operation; f¹_{3×3} represents the first convolution layer, with a 3 × 3 kernel, and f²_{7×7} represents the second convolution layer, with a 7 × 7 kernel.

Strength Receptive Field Block

We attempt to improve the detection performance of YOLOv3 with less computational overhead. In neural networks, depth plays an important role in improving the recognition rate. However, under the influence of gradient divergence, merely increasing the depth of the network may decrease its recognition rate. Therefore, we do not increase the depth of the RFB network (the RFB structure is shown in Figure 3 (a)), but use some architecture-independent structures to enhance the performance of the RFB structure. The SRFB structure, based on the RFB structure, is presented to optimize the network structure of YOLOv3. Compared with the RFB structure, SRFB has more informative "microstructures" and increases the receptive field of feature extraction, so that each feature extracted by convolution contains a larger range of feature information. When SRFB conducts prediction, it can transform the various matrices into a single convolution and reduce the inference cost. The SRFB structure is shown in Figure 3 (b). We use parallel layers with kernel sizes of 3×3, 1×3, 1×1, and 3×1, each followed by batch normalization. During training, SRFB uses the parallel 3×3, 1×3, 1×1, and 3×1 convolution kernels to enrich the convolutional feature information, as shown in Figure 4.

In formula 7, M_{:,k} represents the input feature map of the k-th channel and F the convolution kernel.
Here I signifies the feature matrix, K⁽¹⁾ and K⁽²⁾ represent two-dimensional convolution kernels with compatible sizes, ⊕ represents the element-wise sum at corresponding positions, ∗ represents the two-dimensional convolution operator, and compatibility means that the smaller kernel can be padded to the size of the larger kernel.
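This additivity can be checked numerically: convolving with two compatible kernels and summing the outputs matches convolving once with the centre-padded sum of the kernels. A self-contained sketch using a plain "same"-padded, stride-1 cross-correlation:

```python
import numpy as np

def conv2d(x, k):
    """Plain 'same'-padded 2D cross-correlation (stride 1) - enough to
    illustrate additivity; frameworks rely on the same linearity."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def pad_to(k, shape):
    """Zero-pad a smaller (compatible) kernel to the centre of a larger one."""
    out = np.zeros(shape)
    r0 = (shape[0] - k.shape[0]) // 2
    c0 = (shape[1] - k.shape[1]) // 2
    out[r0:r0 + k.shape[0], c0:c0 + k.shape[1]] = k
    return out
```

For a 3×3 kernel K⁽¹⁾ and a 1×3 kernel K⁽²⁾, `conv2d(I, K1) + conv2d(I, K2)` equals `conv2d(I, K1 + pad_to(K2, (3, 3)))` exactly, which is what allows the SRFB branches to be fused.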

The homogeneity of convolution implies that the batch normalization applied to the feature space of a neural network can be equivalently folded into the convolution at prediction time. According to the homogeneity of convolution, a new kernel (γ_j/σ_j) F⁽ʲ⁾ with bias β_j − μ_j γ_j/σ_j can be constructed for each branch, as shown in formulas (9) and (10).
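A sketch of this branch-wise folding, assuming standard BatchNorm statistics (per-channel mean μ, standard deviation σ, scale γ, shift β) and a kernel layout of (out_channels, in_channels, kh, kw):

```python
import numpy as np

def fuse_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding conv: the per-channel
    factor gamma/sigma scales the kernel, and the bias becomes
    beta - mean * gamma / sigma, as in formulas (9) and (10)."""
    sigma = np.sqrt(var + eps)
    scale = gamma / sigma                               # one factor per output channel
    fused_kernel = kernel * scale[:, None, None, None]  # (O, I, kh, kw)
    fused_bias = beta - mean * scale
    return fused_kernel, fused_bias
```

After fusing, applying the new kernel and bias to an input reproduces BN(conv(x)) exactly, so the BN layers of the SRFB branches disappear at inference time.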
By adding the parallel convolution kernels to the asymmetric convolution kernels, the 3×3, 1×1, 1×3, and 3×1 convolution branches are normalized and merged into a single standard convolution layer. Compared with RFB, SRFB can obtain rich feature information without increasing the computational overhead after the merging. The result of the merging is shown in equation 11.
Here O_{:,i,j}, Ō_{:,i,j}, Ô_{:,i,j}, and Õ_{:,i,j} represent the output results of the 3×3, 1×3, 3×1, and 1×1 convolutional layers, respectively. It is worth noting that the SRFB structure can be equivalently converted only at inference time, as shown in Figure 3 (c). Because the kernel weights are randomly initialized during training and obtain their gradients through different calculations, the branches cannot be converted equivalently during training.
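Once every branch's BN has been folded in, equation 11 amounts to centre-padding the 1×3, 3×1, and 1×1 kernels to 3×3 and summing kernels and biases. A minimal sketch (the (out_channels, in_channels, kh, kw) tensor layout is an assumption):

```python
import numpy as np

def merge_branches(kernels, biases):
    """Merge parallel 3x3 / 1x3 / 3x1 / 1x1 conv branches (BN already
    fused) into one 3x3 conv by centre-padding each kernel to 3x3 and
    summing kernels and biases, as in equation 11."""
    out_k = np.zeros_like(kernels[0])   # first branch is the 3x3 one
    out_b = np.zeros_like(biases[0])
    for k, b in zip(kernels, biases):
        kh, kw = k.shape[2], k.shape[3]
        r0, c0 = (3 - kh) // 2, (3 - kw) // 2
        padded = np.zeros_like(out_k)
        padded[:, :, r0:r0 + kh, c0:c0 + kw] = k
        out_k += padded
        out_b += b
    return out_k, out_b
```

The merged layer computes the same function as the sum of the branch outputs, which is why inference needs only a single 3×3 convolution.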

In the loss function L(p, y), p ∈ [0, 1] represents the predicted category probability of a sample, and y is the category label, which takes the value 0 or 1.

At present, many object detection algorithms use the L1 and L2 norms to calculate the loss value.

In the formula, A and B refer to the prediction box and the ground-truth box respectively, and C is the smallest closed box that contains both boxes.

When GIoU becomes larger, the GIoU loss becomes smaller, and the network is optimized to bring the prediction box closer to the ground-truth box.
S is the grid size, so S² corresponds to the 13×13, 26×26, and 52×52 grids; B is the number of prediction boxes; I^obj_ij equals 1 if the prediction box at (i, j) contains a target and 0 otherwise; and ŵ^j_i and ĥ^j_i are the width and height of the prediction box at (i, j).
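The GIoU definition above can be sketched directly; the [x1, y1, x2, y2] box format is an assumption.

```python
def giou_loss(a, b):
    """GIoU loss for axis-aligned boxes [x1, y1, x2, y2]:
    GIoU = IoU - (area of C minus the union) / area of C, where C is the
    smallest closed box containing both A and B; the loss is 1 - GIoU."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou
```

Unlike a plain IoU loss, this still yields a useful gradient signal when the boxes do not overlap at all, since the enclosing-box term keeps growing as the boxes drift apart.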
s_i is the score of the current prediction box; N_t is the threshold value; M is the prediction box with the highest score.

mAP, FPS, and Param are often used as evaluation indicators. The larger the mAP value, the better the detection effect; the larger the FPS value, the higher the detection efficiency; and the smaller the Param value, the lower the network memory consumption. In this section, we first conduct experiments on the dataset of Zhuang patterns, which proves that Earf-YOLO can recognize symbols on Zhuang patterns successfully and efficiently. We selected 10,000 images from the COCO 2014 dataset as the validation set to verify our model.

We studied the impact of different prediction methods on accuracy on the dataset of Zhuang patterns. We used K-means to cluster the bounding boxes of the dataset of Zhuang patterns to obtain 9 anchor boxes of different, optimal sizes. Earf-YOLO with and without the clustering scheme were compared; the results are listed in Table 4. From Table 4, the mAP value of Earf-YOLO with the clustering scheme was 2.8% higher than that of Earf-YOLO without clustering. Besides, we explored the impact of the image input size on the model's ability to extract features. As shown in Figure 8, as the size of the input image increases, the mAP value continues to increase.

The COCO dataset is much more difficult to recognize than that of Zhuang patterns, and its evaluation criteria are stricter, so it can better evaluate the performance of different methods. We compared Earf-YOLO with YOLOv4 and EfficientDet-D0. As shown in Table 6, the mAP value of our model was 9.82% higher than that of EfficientDet-D0 and slightly lower than that of YOLOv4. The FPS value of our model was the highest of all, 10 higher than that of EfficientDet-D0 and 9 higher than that of YOLOv4. All of this illustrates that Earf-YOLO can ensure detection accuracy at a relatively fast speed.

There is still room for further improvement in computational cost and recognition rate. We will continue to optimize this model to make it available on mobile phones, and to detect more decorative patterns of nationalities such as the Yao.