Safety helmet detection method based on semantic guidance and feature selection fusion

Safety helmet detection is an active research topic at the intersection of object detection and industrial safety. Existing object detection methods still face great challenges in detecting small-scale safety helmet objects. In this paper, we propose a safety helmet detection method based on the fusion of semantic guidance and feature selection, which balances detection performance and efficiency. First, a multi-scale non-local module is proposed to establish internal correlations between different scales of deep image features and to aggregate semantic context information that guides the information recovery of decoder features. Then, a feature selection fusion structure is proposed to adaptively select deep features and low-level key features for fusion, compensating for the semantic and spatial detail information missing from the decoder and improving its spatial localization capability. Experiments show that the proposed method achieves good detection performance on an expanded safety helmet wearing dataset, with a 5.12% improvement in mAP over the baseline CenterNet and a 6.11% improvement in AP for the safety helmet object.


Introduction
In the production process, casualty accidents caused by workers failing to wear safety helmets, owing to a lack of safety awareness and other reasons, are common. Given the inefficiency of manual supervision, research on automatic safety helmet detection in working environments based on object detection technology has important theoretical significance and practical application value for ensuring workers' personal safety [1] as well as achieving safe production [2, 3].

Traditional safety helmet detection methods are based on hand-crafted features. Wu et al. [4] used hierarchical support vector machines for safety helmet classification. Yue et al. [5] used HOG to extract object features and constructed random fern classes in the feature space for safety helmet detection using random binary tests. With the rapid development of deep learning, traditional methods no longer meet current technological needs, and numerous scholars now use deep learning-based detection techniques for safety helmet detection. Li et al. [6] proposed extracting safety helmet features using lightweight networks in SSD. Zhou et al. [7] proposed adding a channel attention module to the backbone network to enhance its feature extraction capability. Cheng et al. [8] proposed the SAS-YOLOv3-Tiny safety helmet detection method for embedded devices and practical application scenarios. Gu et al. [9] proposed a helmet wearing detection method based on pose estimation, which combines the human pose to detect the helmet wearing condition. Sun et al. [10] embedded an attention mechanism in the YOLOv5 backbone while compressing the model to enable real-time safety helmet detection on mobile devices. Zhang et al. [11] designed anchors using K-means based on YOLOv5s, while adding a prediction layer to improve the network.

The feature selection fusion structure proposed in this paper adaptively weights the features obtained from the backbone network and selects key feature information to be fused, further helping to recover image spatial detail information.

Multi-scale non-local module

Deep features have only a fixed receptive field, resulting in poor long-range dependence and the loss of important context information. Inspired by SPP [14] and non-local networks [15, 16], this paper proposes a multi-scale non-local module, as shown in Fig. 2. The module uses pooling layers of different kernel sizes to extract semantic and spatial detail features from the input image features at different scales, uses non-local modules to establish correlations of internal features between different scales, and further uses the correlation information of the multi-scale features to generate rich semantic context information that guides the information recovery of subsequent image features in the decoding process.

Specifically, for input features $x \in \mathbb{R}^{C \times H \times W}$, feature mapping is first performed using pooling kernels of different sizes $k \in \{5, 9, 13\}$; owing to the padding operation, this yields features $m_i \in \mathbb{R}^{C \times H \times W}$ that preserve the feature size while containing semantic and spatial information at different scales. Convolutional transformations are then performed using $1 \times 1$ convolutional layers $W_\theta$, $W_\phi$ and $W_g$ to obtain $\theta(m) = W_\theta m$, $\phi(m) = W_\phi m$ and $g(m) = W_g m$, respectively. Next, $\theta$ and $\phi$ are matrix-multiplied to obtain the similarity attention matrix $A \in \mathbb{R}^{N \times N}$ with $N = H \times W$, which is normalized with softmax to obtain $\tilde{A} \in \mathbb{R}^{N \times N}$. Finally, $\tilde{A}$ and $g$ are matrix-multiplied to obtain $V \in \mathbb{R}^{N \times C}$, which is transformed by a $1 \times 1$ convolution $W_z$ and summed element-wise with the initial input features $x \in \mathbb{R}^{C \times H \times W}$ to establish the correlations of internal features between different scales.
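The pipeline above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the channel reduction factor, the use of max pooling for the multi-scale mapping, and the choice to take attention queries from $x$ while keys and values come from the concatenated multi-scale features $m_i$ are all assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleNonLocal(nn.Module):
    """Sketch of a multi-scale non-local module.

    Pooling kernels k in {5, 9, 13} with size-preserving padding produce
    multi-scale features m_i; a non-local block then relates every spatial
    position of x to every position of every scale.
    """

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inter = channels // reduction  # reduced width C' (an assumption)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)
        )
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # W_theta
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # W_phi
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # W_g
        self.w_z = nn.Conv2d(inter, channels, kernel_size=1)    # W_z

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # m_i: one B x C x H x W map per kernel size (padding keeps H, W).
        ms = [pool(x) for pool in self.pools]
        # Concatenate the scales along the spatial axis (B x C x 3N x 1),
        # so the 1x1 convolutions can still be applied before flattening.
        m = torch.cat([f.flatten(2) for f in ms], dim=2).unsqueeze(-1)
        q = self.theta(x).flatten(2).transpose(1, 2)  # B x N  x C'
        k = self.phi(m).flatten(2)                    # B x C' x 3N
        v = self.g(m).flatten(2).transpose(1, 2)      # B x 3N x C'
        attn = F.softmax(q @ k, dim=-1)               # A~ : B x N x 3N
        out = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)  # B x C' x H x W
        return x + self.w_z(out)  # element-wise sum with the input features
```

For example, `MultiScaleNonLocal(2048)` could be applied to the last ResNet-50 stage before decoding begins.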

Feature selection fusion structure

To enhance the semantic and spatial detail information of image features in the decoding process, and inspired by the attention mechanism [17-19], this paper proposes a feature selection fusion structure, which consists of two modules, as shown in Fig. 3.

For the first branch, across C and H, the input feature $F_{se} \in \mathbb{R}^{C \times H \times W}$ is rotated 90° counterclockwise along the H-axis to obtain $\tilde{F}_1 \in \mathbb{R}^{W \times H \times C}$, from which pooled features in $\mathbb{R}^{1 \times H \times C}$ are obtained using max pooling and average pooling. The output feature $X_1 \in \mathbb{R}^{C \times H \times W}$ is then obtained by applying a $7 \times 7$ convolutional layer and a BN layer, using a Sigmoid function to obtain the attention weights, weighting $\tilde{F}_1$ by them, and rotating the result 90° clockwise along the H-axis:

$$X_1 = \mathrm{Rot}_H^{+90°}\left(\sigma\left(\mathrm{BN}\left(f^{7 \times 7}\left(\left[\mathrm{MaxPool}(\tilde{F}_1); \mathrm{AvgPool}(\tilde{F}_1)\right]\right)\right)\right) \odot \tilde{F}_1\right)$$

For the second branch, across C and W, the input feature $F_{se} \in \mathbb{R}^{C \times H \times W}$ is rotated 90° counterclockwise along the W-axis to obtain $\tilde{F}_2 \in \mathbb{R}^{H \times C \times W}$. The same operations as above are then applied, and the weighted feature is rotated 90° clockwise along the W-axis to obtain the output feature $X_2 \in \mathbb{R}^{C \times H \times W}$:

$$X_2 = \mathrm{Rot}_W^{+90°}\left(\sigma\left(\mathrm{BN}\left(f^{7 \times 7}\left(\left[\mathrm{MaxPool}(\tilde{F}_2); \mathrm{AvgPool}(\tilde{F}_2)\right]\right)\right)\right) \odot \tilde{F}_2\right)$$

The features are then processed by a $1 \times 1$ convolution layer, as shown in Fig. 3(b).

Dataset

The experiments are conducted on an expanded safety helmet wearing dataset [21]. Some images of the dataset are shown in Fig. 4. The dataset is divided into training and test sets at a ratio of 8:2, with 6464 images in the training set and 1617 images in the test set.
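The two branches can be sketched in PyTorch as follows. This is an illustrative reading of the description above, not the authors' code: the tensor permutations used to emulate the 90° rotations and the averaging of the two branch outputs (whose fusion details are lost in the extracted text) are assumptions.

```python
import torch
import torch.nn as nn


class RotatedAttentionBranch(nn.Module):
    """One branch of the feature selection fusion structure (sketch).

    Permuting the tensor emulates the 90-degree rotation, so the shared
    max-pool / avg-pool -> 7x7 conv -> BN -> Sigmoid pipeline attends
    across a different pair of axes in each branch.
    """

    def __init__(self, dims):
        super().__init__()
        self.dims = dims  # permutation emulating the rotation (assumption)
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = x.permute(*self.dims)  # "rotate": e.g. B x C x H x W -> B x W x H x C
        pooled = torch.cat(
            [f.max(dim=1, keepdim=True).values,  # max pooling -> B x 1 x H x C
             f.mean(dim=1, keepdim=True)],       # avg pooling -> B x 1 x H x C
            dim=1)
        weights = torch.sigmoid(self.bn(self.conv(pooled)))  # attention weights
        return (weights * f).permute(*self.dims)  # weight, then rotate back


class FeatureSelectionAttention(nn.Module):
    """Runs both branches and averages them (the combination rule is a guess)."""

    def __init__(self):
        super().__init__()
        self.branch_h = RotatedAttentionBranch((0, 3, 2, 1))  # H-axis: swap C, W
        self.branch_w = RotatedAttentionBranch((0, 2, 1, 3))  # W-axis: swap C, H

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.5 * (self.branch_h(x) + self.branch_w(x))
```

Note that for the first branch the pooled maps have shape B x 1 x H x C, matching the $\mathbb{R}^{1 \times H \times C}$ pooled features described in the text.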

Ablation experiments
To verify the effectiveness of the modules proposed in this paper, the input image size is set to 512 × 512 and a pre-trained ResNet-50 is used as the backbone for the ablation experiments; the results are shown in Table 1, where FSF denotes the feature selection fusion structure, NLM denotes the non-local module, and MSNM denotes the multi-scale non-local module proposed in this paper. mAP is calculated using an IoU threshold of 0.5.
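For reference, the matching rule behind the reported mAP values can be written down in a few lines. The helper below is purely illustrative and not part of the paper's evaluation code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive for AP at IoU 0.5 when it overlaps
# an unmatched ground-truth box of the same class with IoU >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```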

As seen in Table 1, adding the feature selection fusion structure to the baseline method already improves detection performance.

Fig. 6 Comparison of the detection results of the proposed method and the baseline.

The comparison between the heat maps of the proposed method and the baseline CenterNet is shown in Fig. 5.

Comparison experiments

The comparison with other detection methods is shown in Table 2. As can be seen from Table 2, the method in this paper achieves the best performance when the input image is 512 × 512, reaching an mAP of 87.21% and a safety helmet AP of 85.55%. At the same input size, the proposed method improves mAP by 16.89%, 6.16% and 6.24% over SSD, RFBNet and RefineDet, and the safety helmet AP by 11.75%, 8.31% and 4.52%, respectively. Compared with YOLOv3 and YOLOv4 at an input size of 416 × 416, its mAP is higher by 1.55% and 0.66%, and its safety helmet AP by 2.7% and 2.46%, respectively. Compared with FCOS, YOLOv5-m and YOLOX-s at an input size of 640 × 640, its mAP is higher by 3.63%, 1.22% and 0.04%, and its safety helmet AP by 7.82%, 1.99% and 0.97%, respectively, further verifying the effectiveness of the proposed method. In addition, the FPS of the proposed method is 49.2 with a 512 × 512 input; this is slightly lower than the other methods, but the method has a clear advantage in detection accuracy for safety helmet objects.

The detection results of the proposed method on the test set are shown in Fig. 7. It can be seen that the method achieves good detection results even for small-scale safety helmet objects in the distant parts of the image.