An object detection method for the work of an unmanned sweeper in a noisy environment on an improved YOLO algorithm

Efficient and accurate object detection is crucial for the widespread use of low-cost unmanned sweepers. This paper focuses on the low-cost sweeper in practical working scenarios and proposes a traffic participant detection method based on an enhanced YOLO-v5 model. To train the model on noise knowledge, three types of noise are added to the data set in the offline phase, according to the vibration response of the mathematical model and the impact of the low-cost camera. The loss function is optimized to balance detection accuracy and real-time performance while focusing on traffic participant detection using YOLO-v5. CTDS and BFSA modules based on the attention mechanism are proposed to enhance the YOLO-v5 model. Comparative experiments demonstrate the effectiveness of the proposed method: the enhanced YOLO-v5 model achieves a mean average precision 4.5 percentage points higher than the traditional YOLO-v5 network. Moreover, the proposed method processes images at 89 frames per second, ensuring real-time performance and meeting the object detection requirements of an actual sweeper.


Baijun Shi (bjshi@scut.edu.cn)
1 School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510641, China
2 Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511458, China

Introduction
Unmanned sweepers offer high work efficiency, fast cleaning speed, all-weather operation, and reduced labor costs, and they have therefore been applied in industrial parks. The market for unmanned sweepers is very broad, but current unmanned sweepers have a high unit cost, which increases the operating burden of industrial parks. As a result, some industrial parks cannot purchase unmanned sweepers and can only hire sanitation workers at a higher price to clean up. To further promote and popularize unmanned sweepers, it is necessary to compress and control their cost. The most important measure for low-cost unmanned sweepers is to use low-cost suspension chassis devices and cameras without anti-shake functions. To ensure that low-cost unmanned sweepers can still complete their work tasks, the traffic participant detection algorithm in the perception task first needs to be optimized.
The motivations for the enhanced YOLO-v5 traffic participant detection method designed in this paper are as follows: (1) Reduce costs. It is necessary to use low-cost unmanned sweepers. Low-cost unmanned sweepers can utilize affordable suspension chassis devices and cameras lacking anti-shake functionality, but this may introduce various types of noise into the images captured by the cameras, thereby negatively impacting the performance of the object detection task. Using a higher-performance processor for additional filtering would indirectly increase the cost of the sweeper, which runs counter to the goal of manufacturing low-cost sweepers. (2) Reduce potential safety hazards. An excessively long cleaning time can lead to various accidents.
This method performs data cleaning and noise processing on the BDD100K [1] image data set.
The structure of this paper is outlined as follows: Sect. 2 provides an overview of related research. Section 3 presents a YOLO-based method for detecting traffic participants, as well as a noise-adding processing technique for the data set. In Sect. 4, we present various experiments on noisy datasets, as well as ablation, comparison, and actual detection experiments. Finally, Sect. 5 concludes the paper.

Related work
Traditional object detection algorithms include Haar [2] features combined with Adaptive Boosting (Adaboost), Histograms of Oriented Gradients (HOG) combined with Support Vector Machine (SVM), and the Deformable Part Model (DPM). Matsumoto [3] proposed combining the Self-Quotient Epsilon-Filter (SQEF) with HOG to extract features, but this can cause the HOG extraction performance to deteriorate. Moubtahij et al. [4] describe a detection strategy that relies on the Adaboost algorithm and a Polynomial Image Decomposition (PID) technique. However, the approach may be challenging to generalize to other types of scenes. Ali et al. [5] propose a rapid category-independent object detector that utilizes integrated modules to accelerate DPM steps. However, the real-time performance of this method is highly dependent on computing power. In general, the manual design of features is very difficult, and the designed features can usually only be based on prior knowledge, which is not conducive to current industrial development with its increasing degree of automation [6].
Ruxin et al. [11] propose a model, but it requires high computing memory, which further exacerbates the drawbacks of the two-stage algorithm. Wang and Zhou [12] present an approach that effectively addresses the problem. Lai et al. [13] propose a module that enables the model to detect objects even in low lighting conditions. If real-time performance is not considered, the mean average precision (MAP) of the two-stage algorithm can theoretically be very high.

Overall research approach
The framework of the object detection method for unmanned sweepers in noisy environments is depicted in Fig. 1.
It mainly includes mathematical modeling of the sweeper and camera and solution of the vibration response, quantitative expansion and modification of the data set according to noise type, model training, and sweeper object detection based on the enhanced YOLO-v5 model. First, the Simulink simulator is used to build a four-degree-of-freedom mathematical model of the sweeper and the camera and to solve for the noise corresponding to the vibration response. Then, according to the light and dark conditions of the actual working environment and common Gaussian noise, the three types of noise are added to the data set in order. Finally, the YOLO-v5 model is improved using the modules proposed in this paper, and the expanded data set is used to train the enhanced YOLO-v5 model.

Traffic participant detection based on YOLO
Currently, the YOLO-v5 algorithm is widely used due to its good performance, as well as its versatility for industrial applications. The algorithm supports various industrial interfaces and can be easily maintained and updated.

YOLO-v5
The YOLO-v5 model consists of three modules: Backbone, Neck, and Head. Compared with YOLO-v4, feature fusion only performs fusion from small-scale feature maps to large-scale feature maps. The author also introduced the PAN structure [24], which makes the fused information richer and improves the feature fusion effect.
The Backbone is composed of 5 ConvBlock_1 layers, 4 ConvBlock_4 layers, and 1 SPPF layer. The original YOLO-v5 is depicted in Fig. 2.

Improvements proposed for YOLO-v5
The improvements are as follows: (1) The presence of motion blur noise, Gaussian noise, and pepper noise resulting from the use of low-cost suspension systems and cameras can cause blurring of image edges and details, as well as blockage of certain target details, leading to an offset in the pixel gray values. Figure 3 compares the feature maps before and after improvement.
The image collected under the three types of noise interference is shown in (a), while (b) displays the feature map of the enhanced YOLO-v5 after the first CBAM processing. The network at this stage focuses on small-scale feature texture information and has successfully extracted the outer contour of the target car in the image. In (c), the feature map of the enhanced YOLO-v5 after the second CBAM processing shows a higher level of abstraction than (b). (d) represents the final layer of feature extraction, which is the feature map after Transformer attention processing, with the highest degree of abstraction. In contrast, (e) represents the feature map output by the unenhanced YOLO-v5 network corresponding to (b). The feature map output by the unenhanced YOLO-v5 network corresponding to (c) is shown in (f), where the network has not successfully extracted the outer contour. Finally, (g) displays the feature map output by the unenhanced YOLO-v5 network corresponding to (d), which cannot accurately identify the target. Based on these results, we propose the CTDS module and embed it in the feature extraction backbone network. It combines the Convolutional Block Attention Module (CBAM) [25] and the Transformer attention mechanism [26]. To address the large number of parameters introduced by the attention mechanism, standard convolution is replaced with Depthwise Separable Convolution [27].
(2) Excessive noise interference can result in poor feature fusion in the network. We propose a BFSA module that can be integrated into the feature fusion network. In the BiFusion Neck [28], we embed Stand-Alone Self-Attention [29] and change the up-sampling method to a transposed convolution with learnable parameters [30]. Increasing the number of learnable parameters in the feature layer can improve the generalization of the network. Figure 4 compares the feature maps before and after improvement. The image sample affected by the three types of noise is depicted in (a), while (b) shows the result of the enhanced YOLO-v5 model after fusing the medium-scale features, which already outlines the overall shape of the car distinctly. In contrast, (c) displays the fusion result of the medium-scale feature maps obtained by the original model, which fails to provide a clear description of the object.
(3) A network decoupling module is added and embedded in the classification head to solve the coupling of the classification and localization problems [21].
The structure of the enhanced YOLO-v5 model is depicted in Fig. 5.
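The parameter saving that motivates Depthwise Separable Convolution in item (1) can be illustrated with a small count. This is a minimal sketch of the standard weight-count formulas only; the function names and the channel sizes are illustrative assumptions and are not taken from the paper's network.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer size (assumption, not from the paper).
    c_in, c_out, k = 256, 256, 3
    standard = conv_params(c_in, c_out, k)
    separable = depthwise_separable_params(c_in, c_out, k)
    print(f"standard: {standard}, separable: {separable}")
    print(f"reduction factor: {standard / separable:.2f}")
```

For a 3 x 3 layer with 256 input and 256 output channels, the separable form needs roughly one ninth of the weights, which is why it is a common way to offset the parameters added by attention blocks.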

Image noise addition
The captured images mainly contain three types of noise: motion blur noise, pepper noise, and Gaussian noise. The computing power of the unmanned sweeper is very limited. Therefore, the network must learn about noise during training: by increasing the training cost, the computing burden on the unmanned sweeper can be reduced.
First, a four-degree-of-freedom mathematical model is established for the sweeper and the camera, as illustrated in Fig. 6. The camera is installed on the vertical line through the center of gravity of the sweeper and is rigidly fixed to the body. According to the analysis of vibration mechanics, the vibration response x_0 can be obtained from Eq. (3), where z is the partial impedance of each degree of freedom. Motion blur noise is then added based on the vibration response. The low-cost camera of the unmanned sweeper may generate additional random electrical signals during actual operation due to the low accuracy of its amplifier and analog-to-digital converter. This noise can be modeled as Gaussian noise. The sensor's temperature can also affect the signal processing circuit and the pixel's grayscale value, and the impact of temperature fluctuations can likewise be modeled as Gaussian noise [32].
Gaussian noise follows the Gaussian distribution, whose formula is given in Eq. (4).
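For reference, the standard Gaussian probability density is written out here, since Eq. (4) is not reproduced in this text. The symbols \mu (mean) and \sigma (standard deviation) are assumed names and may differ from the paper's notation.

```latex
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}
       \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)
```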
In Eq. (4), x represents the gray value of the feature point. Low-cost cameras with poor internal sensors are susceptible to damage from vibrations, leading to dead or stuck pixels. Individual pixels can become blurred or blocked, which appears as salt and pepper noise in the image [33], and this can be simulated with pepper noise. Therefore, in the offline training phase, adding pepper noise to the training data can effectively simulate the negative effects of component damage and the environment.
Table 1 shows the experimental environment set up to ensure efficient training of the enhanced YOLO-v5 model. The relevant training parameters are as follows: the initial learning rate is 0.001, the batch size is 8, and the number of epochs is 100.
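The three noise types described in this section can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the function names, the noise strengths, the uniform blur kernel, and the blur axis are all hypothetical, whereas the paper derives the motion blur from the vibration response of Eq. (3).

```python
import numpy as np


def add_motion_blur(img, length=9, axis=0):
    """Blur along one image axis with a uniform kernel of `length` pixels.

    Assumes the image is at least `length` pixels long along `axis`.
    """
    kernel = np.ones(length) / length
    blurred = np.apply_along_axis(
        lambda line: np.convolve(line, kernel, mode="same"),
        axis,
        img.astype(np.float64),
    )
    return np.clip(np.rint(blurred), 0, 255).astype(np.uint8)


def add_pepper_noise(img, amount=0.02, rng=None):
    """Set a random fraction `amount` of pixel locations to 0 (dead pixels)."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    mask = rng.random(img.shape[:2]) < amount
    out[mask] = 0
    return out


def add_gaussian_noise(img, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise with standard deviation `sigma` (gray levels)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic gray image standing in for a data set frame (assumption).
    image = np.full((64, 64, 3), 128, dtype=np.uint8)
    # Application order is an assumption: blur, then pepper, then Gaussian.
    noisy = add_gaussian_noise(add_pepper_noise(add_motion_blur(image), rng=rng), rng=rng)
    print(noisy.shape, noisy.dtype)
```

Applying such functions offline to the training images shifts the denoising burden to training time, which matches the motivation of keeping on-board computation low.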

Performance evaluation index
Precision is usually understood as the query accuracy and can be calculated using Eq. (5).
Recall is usually understood as the query completion rate and can be calculated using Eq. (6).
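Eqs. (5) and (6) are not reproduced in this text. The standard definitions of the two metrics are given below as a reference; the symbols TP, FP, and FN (true positives, false positives, and false negatives) are assumed names.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```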
AP is the area under the curve drawn with precision as the ordinate and recall as the abscissa. MAP provides an overall evaluation of the model's detection performance. The AP value for a specific object category i is denoted as AP_i, where i is the index of the category to be detected among k total categories. The formula to calculate AP_i is given in Eq. (7).
Figure 7 illustrates the actual effect of adding noise to images: (a) is the original image without any added noise, (b) is the image after motion blur processing, (c) is the image after pepper noise processing, and (d) is the image after Gaussian noise processing. By applying the three types of noise to the original image, the resulting image can approximate a realistically collected image. Table 2 displays the partitioned data set.
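The relationship between AP_i and MAP described above can be sketched as follows. This is a simplified illustration: it integrates the precision-recall curve with the trapezoidal rule, whereas detection benchmarks often use interpolated precision, and the function names and sample values are assumptions rather than results from the paper.

```python
import numpy as np


def average_precision(recall, precision):
    """Area under a precision-recall curve (trapezoidal rule).

    `recall` must be sorted in increasing order and aligned with `precision`.
    """
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    widths = recall[1:] - recall[:-1]
    heights = (precision[1:] + precision[:-1]) / 2.0
    return float(np.sum(widths * heights))


def mean_average_precision(ap_per_class):
    """MAP as the mean of AP_i over the k categories."""
    return float(np.mean(ap_per_class))


if __name__ == "__main__":
    # Made-up curve points for one category (illustrative only).
    ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
    print(f"AP = {ap:.2f}")
    print(f"MAP = {mean_average_precision([ap, 0.5, 1.0]):.2f}")
```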

Verify the necessity of adding noise
We randomly added the three types of noise discussed earlier to the test set to obtain a noisy test set and evaluated the model's performance on it. As shown in Table 3, the results demonstrate a significant drop in the MAP and recall metrics, indicating that noise in the data can reduce the model's generalization ability. To assess the effectiveness of adding noise to the training set, we added all three types of noise to the training set, trained the unenhanced YOLO-v5s on it, and evaluated it on the noisy test set. The results show that the MAP increased by 1.4% and recall increased by 1.2% compared with the unenhanced YOLO-v5s trained on the noise-free data set. Deliberately introducing noise information into the training set, so that the model learns from noisy data, is therefore a feasible way to improve its generalization ability. We use a noisy data set that includes the three types of noise, added in varying proportions, to enhance the model's robustness when confronted with these types of noise simultaneously.
Here, A denotes training an unenhanced YOLO-v5s on a training set without noise and testing it on a test set without noise; B denotes training an unenhanced YOLO-v5s on a training set without noise and testing it on a test set with noise; C denotes training an unenhanced YOLO-v5s on a training set with noise and testing it on a test set with noise.

Ablation experiment
To verify the effectiveness of the improvement modules proposed in our paper for the YOLO-v5s network, we conducted ablation experiments on the noisy BDD100K data set to test the impact of each individual improvement, as well as of their combinations, on the performance of the model. Table 4 presents the comparison results obtained from the ablation experiments. The results indicate that the CTDS module, BFSA module, and decoupling head module proposed in our paper each improve the performance of the model to varying degrees.
The effectiveness of WIOU in improving the performance of object detection models has been demonstrated in the literature. To evaluate the impact of WIOU on the enhanced YOLO-v5 model, we conducted an ablation experiment, and the results are presented in Table 5. The use of WIOU not only improves the performance metrics of the model but also improves its real-time detection capability.
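Both CIOU and WIOU are bounding-box regression losses built on the intersection over union (IoU) of the predicted and ground-truth boxes. The sketch below shows only that shared IoU term and the plain 1 - IoU loss; it deliberately omits the additional penalty and weighting terms that distinguish CIOU and WIOU. The function names and the (x1, y1, x2, y2) box format are assumptions.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred_box, true_box):
    """Plain IoU loss; CIOU and WIOU add further terms on top of this."""
    return 1.0 - iou(pred_box, true_box)


if __name__ == "__main__":
    # Two 2 x 2 boxes overlapping in a 1 x 1 square (illustrative values).
    print(f"IoU = {iou((0, 0, 2, 2), (1, 1, 3, 3)):.4f}")
    print(f"loss = {iou_loss((0, 0, 2, 2), (1, 1, 3, 3)):.4f}")
```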

Overall data set detection results
The comparison experiment is shown in Table 6. The enhanced YOLO-v5 model achieves a MAP of 58.2%, which is 4.5 percentage points higher than the unenhanced YOLO-v5s model. Additionally, the model's FPS is the highest among the tested models, satisfying the real-time monitoring requirements. Although YOLO-P has the highest MAP, its Precision is too low, which easily causes false detections and would result in serious errors in the subsequent decision-making and control process, so it cannot be used. These results provide an experimental basis for applying the model in practice.
The experimental results demonstrate that the enhanced YOLO-v5 model can efficiently perform object detection of traffic participants under real-world conditions, including noise interference caused by vibration and insufficient lighting. The test results are presented in Fig. 8.
In this comparison study, five different models are evaluated in terms of their target detection performance, as depicted in (a) through (e). Specifically, the YOLO-v3 model exhibits a significant drawback of low confidence in detecting targets, resulting in a high rate of missed detection. Furthermore, it fails to effectively detect targets in situations

Conclusion
The main contributions of our paper can be summarized as follows. Firstly, we use the enhanced YOLO-v5 model proposed in this paper to solve a practical engineering problem: detecting objects in a noisy environment with an unmanned sweeper. Secondly, we propose three types of noise processing methods for the BDD100K data set. These methods include adding motion blur noise based on a mathematical model, and configuring pepper and Gaussian noise based on field environment parameters. This approach introduces a significant amount of noise information into the data set, which facilitates subsequent model training. Thirdly, we propose two modules, the CTDS module and the BFSA module, which are embedded in the traditional YOLO-v5s model, and we use WIOU instead of CIOU to realize the enhanced YOLO-v5 model. Subsequent research can focus on two points. Firstly, it is essential to augment the original data set by incorporating real-world noise and expanding its scope to match the intended application scenario. Secondly, the improved model presented in this paper may not be applicable to smaller-scale processors. Therefore, the next challenge is to further compress the model without affecting its performance.

Author contributions
Material preparation, data collection, analysis and modification, and experiments were performed by JH. The first draft of the manuscript was written by JH. All authors read and approved the final manuscript.

Funding
The authors did not receive support from any organization for the submitted work.

Data availability
Due to the nature of this study and in order to protect the privacy of study participants, participants of this study did not agree for their data to be shared publicly, so supporting data are not available.