3.1 Dataset
The dataset was mainly derived from the publicly available safety helmet wearing dataset and expanded with 500 surveillance images. The expanded dataset contains 8081 images with objects labeled as hat or person, where the person data come from the SCUT-HEAD dataset [18]. Some images from the dataset are shown in Fig. 4. In this paper, the dataset is split into training and test sets at a ratio of 8:2, giving 6464 training images and 1617 test images.
3.2 Implementation details
The experiments use PyTorch 1.10.0, Python 3.6, and an Nvidia RTX 3090 GPU. Training uses the Adam optimizer with an initial learning rate of 5e-4, a batch size of 8, and 200 epochs.
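The training configuration above can be sketched as a minimal PyTorch loop. The model, dataset, and loss below are hypothetical stand-ins for illustration only; only the optimizer, learning rate, batch size, and epoch count follow the reported settings.

```python
# Minimal training-loop sketch matching the reported settings
# (Adam, lr 5e-4, batch size 8, 200 epochs). The linear model, random
# data, and MSE loss are placeholders, not the paper's detector.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 2)  # stand-in for the detection network
dataset = TensorDataset(torch.randn(32, 10), torch.randn(32, 2))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = torch.nn.MSELoss()  # stand-in for the detection loss

for epoch in range(200):
    for images, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()
```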
3.3 Evaluation indicators
This paper uses AP (Average Precision), mAP (mean Average Precision), and FPS (Frames Per Second) to evaluate our method. The formulas are expressed as:
$$\left\{ \begin{aligned} AP &= \int_{0}^{1} P\left( R \right)\,dR \\ mAP &= \frac{1}{n}\sum\limits_{i=1}^{n} AP_i \\ P &= \frac{TP}{TP+FP} \\ R &= \frac{TP}{TP+FN} \end{aligned} \right. \tag{7}$$
where P denotes the ratio of correctly predicted positive samples to all samples predicted as positive, and R is the ratio of correctly predicted positive samples to all actual positive samples. TP denotes the number of samples predicted positive and actually positive, FP denotes the number of samples predicted positive but actually negative, and FN denotes the number of samples predicted negative but actually positive. n is the number of object classes in the dataset.
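The metrics in Eq. (7) can be sketched directly from these definitions. The counts and precision-recall points below are illustrative examples, not results from the paper; the AP integral is approximated by a simple rectangle sum over sorted recall points.

```python
# Sketch of the metrics in Eq. (7): precision P, recall R, AP as the
# area under the P(R) curve, and mAP as the mean AP over n classes.
# All numbers below are illustrative, not taken from the experiments.

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp)  # P = TP / (TP + FP)
    r = tp / (tp + fn)  # R = TP / (TP + FN)
    return p, r

def average_precision(recalls, precisions):
    # Numerical integral of P(R) dR over recall points sorted by recall.
    ap, prev_r = 0.0, 0.0
    for r, p in sorted(zip(recalls, precisions)):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_ap(aps):
    # mAP = (1/n) * sum of per-class APs
    return sum(aps) / len(aps)

p, r = precision_recall(tp=80, fp=20, fn=10)            # p = 0.8
ap_hat = average_precision([0.5, 1.0], [0.9, 0.7])      # 0.9*0.5 + 0.7*0.5 = 0.8
```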
3.4 Ablation experiments
To verify the effectiveness of the proposed method, the input image size is set to 512×512 and a pre-trained ResNet-50 is used as the backbone for the ablation experiments; the results are shown in Table 1, where SE and FS denote the feature selection fusion structure built with the SE module and with the proposed feature selection module, respectively, and MSNM denotes the multiscale non-local module. mAP is calculated at an IoU threshold of 0.5.
Table 1
Comparison of test results of ablation experiments
Method | SE | FS | MSNM | hat | person | mAP(%) |
CenterNet | | | | 79.44 | 84.74 | 82.09 |
CenterNet | √ | | | 83.12 | 88.92 | 86.02 |
CenterNet | | √ | | 83.33 | 89.22 | 86.27 |
CenterNet | | | √ | 80.99 | 86.95 | 83.97 |
CenterNet | √ | | √ | 82.74 | 88.72 | 85.73 |
CenterNet | | √ | √ | 85.55 | 88.87 | 87.21 |
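The mAP values in Table 1 use an IoU threshold of 0.5: a prediction counts as a true positive only if its box overlaps a ground-truth box with IoU ≥ 0.5. A minimal IoU sketch for boxes in (x1, y1, x2, y2) format, with illustrative coordinates:

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
# Used here only to illustrate the IoU = 0.5 matching threshold.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# Two boxes sharing half their area: IoU = 2 / (4 + 4 - 2) ≈ 0.333,
# below the 0.5 threshold, so this match would not count as a TP.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
```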
As seen in Table 1, adding the fusion structure driven by the SE module increases the mAP by 3.93% over the baseline CenterNet, while adding the feature selection structure proposed in this paper improves the mAP by 4.18% over the baseline. Adding MSNM alone improves the mAP by 1.88%, which indicates that the global semantic context generated by MSNM can guide the recovery of low-level image features. Adding MSNM on top of the feature selection fusion structure improves the mAP by 5.12% and the AP for the safety helmet object by 6.11%. This suggests that using the feature selection fusion structure to provide refined semantic and spatially detailed features for decoding, guided by the semantic context generated by the multiscale non-local module, enhances the localization and recognition of small-scale safety helmet objects. In the heat maps shown in Fig. 5, darker red indicates higher attention to a region; the proposed method clearly attends to the target regions more strongly than the baseline.
Figure 6 shows the detection results of the proposed method and the baseline on surveillance images. The baseline exhibits missed detections and severely offset prediction boxes for small-scale safety helmet objects in the distant part of the image, whereas the proposed method identifies and localizes these distant safety helmet objects accurately.
3.5 Comparison experiments
To further verify the detection performance of the proposed method, it was compared with RefineDet [19], YOLOv3 [20], and FCOS [21]; the experimental results are shown in Table 2.
As can be seen from Table 2, the proposed method achieves the highest detection accuracy with a 512×512 input, reaching an mAP of 87.21%, which is 6.24% and 1.55% higher than that of RefineDet and YOLOv3, respectively. For the safety helmet object, the AP of the proposed method reaches 85.55%, which is 4.52% and 2.7% higher than that of RefineDet and YOLOv3, respectively. Compared with FCOS at a 640×640 input, the mAP is improved by 3.63% and the safety helmet AP by 7.82%, further verifying the effectiveness of the proposed method. In addition, the proposed method runs at 49.2 FPS with a 512×512 input, slightly lower than YOLOv3 and FCOS, but with significantly improved detection accuracy for the safety helmet object.
Table 2
Comparison of test results of different methods
Method | Backbone | Input | hat | person | mAP(%) | FPS |
RefineDet | VGG-16 | 512×512 | 81.03 | 80.92 | 80.97 | 29.1 |
YOLOv3 | DarkNet-53 | 416×416 | 82.85 | 88.48 | 85.66 | 69.7 |
FCOS | ResNet-50 | 640×640 | 77.73 | 89.44 | 83.58 | 52.3 |
Ours | ResNet-50 | 512×512 | 85.55 | 88.87 | 87.21 | 49.2 |
Figure 7 compares the detection results of the proposed method with the comparison methods. Although RefineDet, YOLOv3, and FCOS can detect the safety helmet objects in the near part of both conventional and surveillance images, they all miss the small-scale safety helmet objects in the distant part of the image to varying degrees. In contrast, the proposed method locates and identifies the distant small-scale safety helmets accurately.