3.1 Dataset
The dataset was mainly derived from the publicly available safety helmet wearing dataset and expanded with 500 surveillance images. The expanded dataset contains 8081 images with objects labeled as hat or person, where the person data come from the SCUT-HEAD dataset [18]. Some images from the dataset are shown in Fig. 4. In this paper, the dataset is split into training and test sets at a ratio of 8:2, giving 6464 training images and 1617 test images.
3.2 Implementation details
The experiments use PyTorch 1.10.0, Python 3.6, and an Nvidia RTX 3090 GPU. Training uses the Adam optimizer with an initial learning rate of 5e-4, a batch size of 8, and 200 epochs.
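The training configuration above can be sketched as a minimal PyTorch loop. The model, dataset, and loss below are hypothetical stand-ins for illustration only; only the optimizer, learning rate, batch size, and epoch count follow the reported settings.

```python
# Minimal training-loop sketch matching the reported settings
# (Adam, lr 5e-4, batch size 8, 200 epochs). The linear model, random
# data, and MSE loss are placeholders, not the paper's detector.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 2)  # stand-in for the detection network
dataset = TensorDataset(torch.randn(32, 10), torch.randn(32, 2))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = torch.nn.MSELoss()  # stand-in for the detection loss

for epoch in range(200):
    for images, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()
```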
3.3 Evaluation indicators
This paper uses AP (Average Precision), mAP (mean Average Precision), and FPS (Frames Per Second) to evaluate our method. The formulas are expressed as:
$$\left\{ \begin{aligned} AP &= \int_{0}^{1} P\left( R \right)\,dR \\ mAP &= \frac{1}{n}\sum\limits_{i=1}^{n} AP_i \\ P &= \frac{TP}{TP+FP} \\ R &= \frac{TP}{TP+FN} \end{aligned} \right. \tag{7}$$
where P denotes the ratio of correctly predicted positive samples to all samples predicted as positive, and R is the ratio of correctly predicted positive samples to all actual positive samples. TP denotes the number of samples predicted positive and actually positive, FP denotes the number of samples predicted positive but actually negative, and FN denotes the number of samples predicted negative but actually positive. n is the number of object classes in the dataset.
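The metrics in Eq. (7) can be sketched directly from these definitions. The counts and precision-recall points below are illustrative examples, not results from the paper; the AP integral is approximated by a simple rectangle sum over sorted recall points.

```python
# Sketch of the metrics in Eq. (7): precision P, recall R, AP as the
# area under the P(R) curve, and mAP as the mean AP over n classes.
# All numbers below are illustrative, not taken from the experiments.

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp)  # P = TP / (TP + FP)
    r = tp / (tp + fn)  # R = TP / (TP + FN)
    return p, r

def average_precision(recalls, precisions):
    # Numerical integral of P(R) dR over recall points sorted by recall.
    ap, prev_r = 0.0, 0.0
    for r, p in sorted(zip(recalls, precisions)):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_ap(aps):
    # mAP = (1/n) * sum of per-class APs
    return sum(aps) / len(aps)

p, r = precision_recall(tp=80, fp=20, fn=10)            # p = 0.8
ap_hat = average_precision([0.5, 1.0], [0.9, 0.7])      # 0.9*0.5 + 0.7*0.5 = 0.8
```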
3.4 Ablation experiments
To verify the effectiveness of the proposed method, the input image size is set to 512×512 and a pre-trained ResNet-50 is used as the backbone for the ablation experiments; the results are shown in Table 1, where SE and FS denote the feature selection fusion structure built with the SE module and with the proposed feature selection module, respectively, and MSNM denotes the multiscale non-local module. mAP is calculated at an IoU threshold of 0.5.
Table 1
Comparison of test results of ablation experiments
Method | SE | FS | MSNM | hat | person | mAP(%) |
CenterNet | | | | 79.44 | 84.74 | 82.09 |
CenterNet | √ | | | 83.12 | 88.92 | 86.02 |
CenterNet | | √ | | 83.33 | 89.22 | 86.27 |
CenterNet | | | √ | 80.99 | 86.95 | 83.97 |
CenterNet | √ | | √ | 82.74 | 88.72 | 85.73 |
CenterNet | | √ | √ | 85.55 | 88.87 | 87.21 |
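The mAP values in Table 1 use an IoU threshold of 0.5: a prediction counts as a true positive only if its box overlaps a ground-truth box with IoU ≥ 0.5. A minimal IoU sketch for boxes in (x1, y1, x2, y2) format, with illustrative coordinates:

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
# Used here only to illustrate the IoU = 0.5 matching threshold.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# Two boxes sharing half their area: IoU = 2 / (4 + 4 - 2) ≈ 0.333,
# below the 0.5 threshold, so this match would not count as a TP.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
```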
As seen in Table 1, adding the fusion structure driven by the SE module increases the mAP by 3.93% over the baseline CenterNet, while adding the feature selection structure proposed in this paper improves the mAP by 4.18% over the baseline. Adding MSNM alone improves the mAP by 1.88%, which indicates that the global semantic context generated by MSNM can guide the recovery of low-level image features. Adding MSNM on top of the feature selection fusion structure improves the mAP by 5.12% and the AP for the safety helmet object by 6.11%. This suggests that using the feature selection fusion structure to provide refined semantic and spatially detailed features for decoding, guided by the semantic context generated by the multiscale non-local module, enhances the localization and recognition of small-scale safety helmet objects. In the heat maps shown in Fig. 5, darker red indicates higher attention to a region; the proposed method clearly attends to the target regions more strongly than the baseline.
Figure 6 shows the detection results of the proposed method and the baseline on surveillance images. The baseline exhibits missed detections and severely offset prediction boxes for small-scale safety helmet objects in the distant part of the image, whereas the proposed method identifies and localizes these distant safety helmet objects accurately.
3.5 Comparison experiments
To further verify the detection performance of the proposed method, it was compared with RefineDet [19], YOLOv3 [20], and FCOS [21]; the experimental results are shown in Table 2.
As can be seen from Table 2, the proposed method achieves the highest detection accuracy with a 512×512 input, reaching an mAP of 87.21%, which is 6.24% and 1.55% higher than that of RefineDet and YOLOv3, respectively. For the safety helmet object, the AP of the proposed method reaches 85.55%, which is 4.52% and 2.7% higher than that of RefineDet and YOLOv3, respectively. Compared with FCOS at a 640×640 input, the mAP is improved by 3.63% and the safety helmet AP by 7.82%, further verifying the effectiveness of the proposed method. In addition, the proposed method runs at 49.2 FPS with a 512×512 input, slightly lower than YOLOv3 and FCOS, but with significantly improved detection accuracy for the safety helmet object.
Table 2
Comparison of test results of different methods
Method | Backbone | Input | hat | person | mAP(%) | FPS |
RefineDet | VGG-16 | 512×512 | 81.03 | 80.92 | 80.97 | 29.1 |
YOLOv3 | DarkNet-53 | 416×416 | 82.85 | 88.48 | 85.66 | 69.7 |
FCOS | ResNet-50 | 640×640 | 77.73 | 89.44 | 83.58 | 52.3 |
Ours | ResNet-50 | 512×512 | 85.55 | 88.87 | 87.21 | 49.2 |
Figure 7 compares the detection results of the proposed method with the comparison methods. Although RefineDet, YOLOv3, and FCOS can detect the safety helmet objects in the near part of both conventional and surveillance images, they all miss the small-scale safety helmet objects in the distant part of the image to varying degrees. In contrast, the proposed method locates and identifies the distant small-scale safety helmets accurately.