Enhanced safety implementation in 5S + 1 via object detection algorithms

Scholarly work points to 5S + 1, a simple yet powerful method of initiating quality in manufacturing, as one of the foundations of Lean manufacturing and the Toyota Production System. The sixth S, safety, aims to prevent occupational hazards, thereby reducing the loss of time, money, and human resources. This paper shows how Industry 4.0 technologies such as computer vision and object detection algorithms can help implement the sixth S in 5S + 1 by monitoring and detecting workers who fail to adhere to standard safety practices such as wearing personal protective equipment (PPE). The paper evaluated and analyzed three different detection approaches and compared their performance metrics; in total, seven models were proposed to perform this task. All proposed models utilize the You-Only-Look-Once (YOLO v7) architecture to verify workers' PPE compliance. In approach I, three models were used: each detects workers, safety helmets, and safety vests, and a machine learning algorithm then verifies whether each detected worker is in PPE compliance. In approach II, a single model simultaneously detects individual workers and verifies PPE compliance. In approach III, three models were used: each first detects workers in the input feed, and a deep learning algorithm then verifies safety. All models were trained on the Pictor-v3 dataset. The third approach, utilizing the VGG-16 algorithm, achieved the best performance, i.e., an 80% F1 score, and can process 11.79 frames per second (FPS), making it suitable for real-time detection.


Introduction
The Lean manufacturing concept was introduced in Japan, where the Toyota company first implemented it as the Toyota Production System; nowadays, the Lean concept is widely applied across industries. Lean manufacturing is defined as providing quality products while maintaining low manufacturing costs. Its primary purpose is to minimize enterprise waste, as waste is an extra resource burden that never adds value to the company. Lean manufacturing tools minimize operational costs by reducing waste, optimizing product quality, and increasing efficiency [1]. Furthermore, Lean manufacturing is one of the most popular and widely used methods in industry for achieving maximum productivity, high quality, and cost reduction. One study concluded that the Lean concept boosts organizational productivity, reduces manufacturing costs, eliminates unnecessary downtime, improves resource utilization, and maximizes profitability, thereby enhancing an organization's competitiveness in the market [2]. Enterprises therefore strive to adopt the Lean manufacturing concept to become economically sound and practical, and continuous improvement has been observed in organizations after its application [3]. The introduction of Lean manufacturing tools such as Total Productive Maintenance (TPM), overall equipment effectiveness (OEE), and Jidoka directly impacts an enterprise's environmental, social, and economic sustainability. A safe working environment, in turn, improves employee commitment, morale, safety, and delivery time [4].
The 5S method was first introduced back in 1970 by Takashi Osada [5] to sustain the implementation of the Lean concept. 5S, a Lean tool, comprises Sort, Set, Shine, Standardize, and Sustain; Safety was later added to make 5S + 1. These tools eliminate unnecessary items that do not add value to production by fixing an unhealthy, untidy work environment. One study examined ten different manufacturing units to determine the effect of 5S and found a positive result in all of them. Furthermore, after applying 5S, there was continuous improvement in the workplace and clear evidence of improved employee relations and motivation [6]. The 5S system is the first step into Lean thinking to minimize waste and maximize productivity; through it, organizations maintain discipline and order in workstations, yielding efficient and effective operational results [7]. The 5S can therefore be considered a cyclic method that drives continuous improvement [8]. Even when applied in a limited-resources environment, 5S has proven to bring positive changes to the work environment of health centers by reducing the number of unwanted items, improving directional indicators, and labeling the units of service. Increasing the quality of services in a more efficient, safe, and patient-centered style also helped improve the behavior of staff and patients towards workplace resources [9]. Applying 5S in a medical laboratory brought a higher level of order, helped eliminate unnecessary objects, and increased productivity [10]. Applying 5S in hospitals eliminated waste in motion, helping to reduce cycle time [11]. Consequently, a Leaner and more organized work environment was created while preventing work accidents and errors [12]. The application of 5S in the fast food industry contributed to optimizing the work process.
It also contributed to a decrease in production time and a reduction in energy consumption [13]. Implementing 5S has been shown to improve safety, minimize defect rates, increase equipment availability [14], and reduce costs, which can result in higher agility and flexibility of the manufacturing enterprise and positively impact employee morale [15]. Applying the 5S methodology in packaging improved quality and food safety [16]. Furthermore, applying 5S in restaurant management reduced the number of steps employees need to take to fulfill a specific task. It also helped reduce the time spent searching for materials by 95%, thereby reducing order time, serving more customers, and making more profit [17]. The use of 5S positively impacts performance by enhancing manufacturing production quality and effectiveness [18]. 5S + 1 has also been used to organize the workplace at a scientific instruments manufacturing company [19] and achieved results in process improvement, continuous improvement, and waste reduction [20].

Sort in 5S + 1
This is the first step when implementing the 5S + 1 strategy. It improves the quality of work through better care in keeping things in order, thus improving the management of the workspace. Sort helps separate waste from the manufacturing process [21]; the first S eliminates steps that do not add value [22]. In this process, a red tag is placed on unnecessary items, or items that are not in the proper place or quantity. The red-tagged items are then moved, recycled, disposed of, or reassigned. Sorting thus helps generate floor space and remove items that are broken, scrap, or excess raw material [23].

Shine in 5S + 1
Once sorting is done, keeping order in the workspace is vital. Routine cleaning is essential for the production system's proper functioning. In 5S + 1, work is divided among employees in terms of cleaning time and cleaning area [24]. Also, faulty equipment affected by excessive vibrations, leakages, misalignments, and other causes can be easily noticed in a clean working environment. If these malfunctions are not fixed immediately, they can lead to loss of production or equipment failure. Therefore, having a clean and organized workplace helps prevent sources of potential failure or downtime from going unnoticed. Shine is taken from Seiso, which means sweep, scrub, or shine. The primary purpose of Seiso is to clean up the workspace by removing dust, dirt, chips, and other contamination. It emphasizes this by guiding operators on the shop floor to maintain the cleanliness of the machines and the shop floor.

Set in 5S + 1
Set or Seiton is another element of the 5S + 1 method that means systemization, set in order, or organization. It helps in facility layout when planning where to place different resources such as raw materials, machine tools, and semifinished products. This strategy ensures that it should not take more than 30 s to find a necessary object [21]. Once Sort has eliminated all the unnecessary items, Set in order can be implemented. Set in order means arranging items based on size and frequency of use. Furthermore, the items are labeled for ease of use and stored or placed in a location accordingly. Tools are located according to the frequency of their use; as a result, this procedure helps minimize movements and search time.

Table 1 Comparison between sensor-based and vision-based monitoring

Sensor-based: utilizes tags that contain a sensor attached to an antenna, enabling the transmission of data to the reader, a transmitted-data receiver, and a computer database for storing captured data [48][49][50]. Vision-based: requires high-resolution cameras, software embedded with image processing algorithms, computing power, and a computer database for storing captured data [51].
Sensor-based: high maintenance cost; RFID tags are expensive and likely to be damaged during loading and unloading [52], and batteries can run out in active RFID tags [53]. Vision-based: lower maintenance cost since it has a longer product life, though outdated systems may need to be upgraded every few generations [54].
Sensor-based: more labor intensive. Vision-based: minimal human effort.
Sensor-based: implementation can be complex and time-consuming. Vision-based: easier to implement.
Sensor-based: electromagnetic data transmission [55]. Vision-based: visual data transmission.
Sensor-based: different tag types are affected differently by environmental damage, but overall they handle exposure to sun and rain better [56]. Vision-based: algorithm performance can be affected by environmental factors such as rain.
Sensor-based: can read through objects [57]. Vision-based: objects must be in sight.
Sensor-based: reading range depends on the frequency; materials like metal and liquid can impact the signal. Vision-based: reach depends on the specifications of the camera; light and motion can affect camera performance, putting more pressure on the computer processing stage [58].
Sensor-based: utilizes RFID tags or short-range transponders [59]. Vision-based: utilizes Machine Learning (ML) methods.

Standardize in 5S + 1
Seiketsu is another element of 5S that means standardization, tidiness, and sanitization. It helps to carry out repeated activities without hindrance or extra usage of time [21]. The primary objective of standardization is to make the first 3S (Sort, Set in order, and Shine) a habit, so workers do not return to old, unorganized practices [25].

Sustain in 5S + 1
The Shitsuke element of the 5S means sustain. The main objective of Shitsuke is to keep or maintain all the new and effective processes as standards of the organization. Workplace inspections should be carried out as planned routine activities at specified intervals [21]. 5S + 1 helps eliminate obstacles that reduce the potential to reach an efficient production process [26]. Sustain is viewed as an essential step to maintain the first 4S when applied, while allowing for better implementation of the sixth S, safety [27].

Safety in 5S + 1
Safety focuses on preventive measures to protect workers from hazardous conditions and provide them with a safe environment. Studies note that safety plays a significant role in maintaining an environment that is stress-free, safe, and secure, hence improving the work environment [28]. Operations such as welding, gas cutting, and casting need extra precautionary measures to reduce the number of incidents that can happen [29]. The 5S + 1 method cannot be separated from industrial, manufacturing, or construction operations. Implementing 5S + 1 tools creates safer work conditions [30,31]. Deploying 5S + 1 in the healthcare industry reduced waste and mistakes and increased productivity [32]. Applying 5S + 1 ensured that operators used personal protective equipment (PPE) and gear to avoid accidents [33]. The 5S + 1 method is a powerful engine that enhances the quality of the work environment by improving safety [34].

Background
With the purpose of showing the effectiveness of object detection algorithms in monitoring and detecting workers who fail to adhere to standard safety practices, a dataset was utilized that contains numerous instances of construction workers who can be classified as wearing a safety helmet, safety vest, both, or neither. Construction workers comprise 5% (more than 7 million employees) of the total workforce in the USA, and construction accounts for almost 6.3% (more than $1.3 trillion) of its Gross Domestic Product (GDP) [35,36]. According to the Bureau of Labor Statistics (BLS), almost 19% of fatal occupational accidents and about 9% of non-fatal occupational accidents are recorded in the construction industry [37]. Most of these accidents could have been prevented if workers adhered to appropriate safety measures such as wearing PPE: a safety helmet, safety vest, gloves, safety goggles, and steel-toe shoes [38]. While governing laws and safety regulations hold employers responsible for enforcing, monitoring, and maintaining appropriate PPE on the job site [39], a recent study revealed that almost 40% of workers do not wear any PPE. Inadequate risk management measures, including failure to use or incorrect use of PPE, may significantly increase the risk of accidents [40]. Employers can be fined up to $13,260 for each employee who is out of compliance with PPE [41]. Applying 5S + 1 can help employers avoid workplace accidents and hefty non-compliance fines, reducing the time and resources spent dealing with fines, lawsuits, and settlements resulting from non-compliance.
Integrating automation and big data in Industry 4.0 [42] has helped enhance the monitoring of PPE compliance. So far, there are two types of monitoring techniques. The first is sensor-based, which utilizes Radio Frequency Identification (RFID) tags installed on each PPE component and monitors the tags' signals to verify that workers are adhering to PPE compliance [43][44][45]. The second is vision-based; in the past, this was done by the human eye of a foreman on site, whereas nowadays it utilizes camera systems to record images or videos of the job site, which are then analyzed to verify PPE compliance [46,47]. Table 1 shows a comparison between the vision-based and sensor-based models, and Table 2 shows a summary of surveyed literature in which machine vision has been used to detect PPE.

Dataset and methodology
The Pictor-v3 dataset [69] contains 774 crowdsourced and 698 web-mined images, containing 2,496 and 2,230 instances of workers, respectively. The crowdsourced images come already annotated via LabelMe [70], as seen in Fig. 2. Annotation is a very time-consuming and costly process; therefore, for the purpose of this paper, we chose the crowdsourced part of the dataset to conduct our analysis. Crowdsourced images were obtained from three different construction projects, while web-mined images were retrieved from publicly available images on the web. The dataset has four classes in total: workers (W) can wear a safety helmet (H), a safety vest (V), both (WHV), or none at all. Brief statistics of the dataset are shown in Table 3. Data augmentation was done through the YOLO v7 built-in feature. The data was trained for up to 50 epochs to help prevent overfitting [71,72]. The dataset had a random 64%/16%/20% split for training, validation, and testing.
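A random 64%/16%/20% split of this kind can be sketched as follows; the file names and seed are illustrative assumptions, not the paper's actual setup:

```python
# Hypothetical sketch of the random 64%/16%/20% train/val/test split;
# the image list and seed are illustrative, not the paper's exact code.
import random

def split_dataset(items, train=0.64, val=0.16, seed=42):
    items = list(items)
    random.Random(seed).shuffle(items)          # reproducible shuffle
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])            # remainder is the test set

images = [f"img_{i:04d}.jpg" for i in range(774)]  # 774 crowdsourced images
train_set, val_set, test_set = split_dataset(images)
print(len(train_set), len(val_set), len(test_set))  # 495 123 156
```

Truncating rather than rounding keeps the three subsets disjoint and exhaustive, so every image lands in exactly one split.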
In computer vision, the object detection problem consists of two stages: identifying an object (classification) in an image and precisely estimating its location (localization) within the image [42]. For example, a region-based detection algorithm such as R-CNN [73] first identifies Regions of Interest (ROI), then uses a CNN to classify the identified ROIs to detect objects in them [74]. Faster R-CNN is an improved version of R-CNN that performs classification and detection tasks faster than R-CNN [75]. Mask R-CNN [76] has also been proposed as a faster variant of R-CNN. However, these algorithms were still not fast or accurate enough. Therefore, single-stage detectors were introduced; these algorithms include SSD [77], YOLO [78], R-FCN [79], and Mask R-FCN [80], among others that eliminated the need for designing a set of anchor boxes [81], such as CenterNet [82], RetinaNet [83], CornerNet [84], and their different variants. While fast single-stage detectors often significantly compromise accuracy to achieve real-time detection, to date, only YOLO is both faster and more accurate than the alternatives [85]. In our paper, three different detection approaches were proposed to perform compliance inspections by detecting workers wearing PPE (safety helmet and safety vest) and workers who do not.
The International Journal of Advanced Manufacturing Technology (2023) 125:3701-3721

Approach I
The YOLO-v7 model individually detects the worker, helmet, and vest (three object classes). Next, an ML classifier is used to combine them into WH (worker wearing only a helmet), WV (worker wearing only a vest), WHV (worker wearing both helmet and vest), or W (worker wearing neither). In this approach, YOLO v7 first detects three classes (W, H, and V); a classifier then sorts workers into four classes (W, WH, WV, and WHV). For example, if a worker is classified as wearing both (WHV), their practice is labeled "SAFE" and recorded as in compliance. Otherwise, the worker is labeled "NOT SAFE" and recorded as out of compliance. The ML classifiers used in this approach were Decision Tree (DT), K-Nearest Neighbors (KNN), and Multilayer Perceptron (MLP). Figure 3 shows an illustration of this approach.
In summary, Approach I detects the worker, helmet, and vest with YOLO v7. From the YOLO v7 output, the area of intersection of the helmet and vest with each worker's bounding box is calculated. The results are then passed to the DT, KNN, or MLP algorithm, which finalizes the detection and classification tasks by determining whether the helmet and vest belong to a specific worker. A summary of the approach used for DT, KNN, and MLP follows.
We detected the helmet, worker, and vest through YOLO v7, which returns a bounding box for each object in the image along with its numerical values: an x-coordinate, a y-coordinate, a height, and a width. The ML classifiers (MLP, KNN, and DT) were trained on training data in which we know which worker is wearing a vest or a helmet. We then predicted from the trained model whether a specific bounding box of a helmet or a vest is attached to any bounding box of a worker; if the model returns TRUE, we say the worker is wearing a helmet.
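This pairing step can be sketched as follows. This is a minimal illustration, not the paper's actual code: the overlap feature, toy training data, and MLP hyperparameters are all assumptions.

```python
# Sketch of the Approach I classifier stage (hypothetical reconstruction):
# YOLO v7 supplies bounding boxes (x, y, w, h); for each worker box we
# compute the fraction of each PPE box that overlaps it, then a classifier
# (an MLP here) decides whether the PPE belongs to that worker.
import numpy as np
from sklearn.neural_network import MLPClassifier

def overlap_ratio(worker, ppe):
    """Fraction of the PPE box area that lies inside the worker box.
    Boxes are (x_min, y_min, width, height)."""
    wx, wy, ww, wh = worker
    px, py, pw, ph = ppe
    ix = max(0.0, min(wx + ww, px + pw) - max(wx, px))
    iy = max(0.0, min(wy + wh, py + ph) - max(wy, py))
    return (ix * iy) / (pw * ph) if pw * ph > 0 else 0.0

# Toy training data: [helmet overlap, vest overlap] -> helmet belongs (1) or not (0)
X = np.array([[0.95, 0.90], [0.90, 0.05], [0.02, 0.85], [0.0, 0.0]])
y = np.array([1, 1, 0, 0])

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, y)

worker_box = (100, 50, 80, 200)
helmet_box = (110, 40, 40, 30)   # sits on top of the worker box
vest_box = (500, 300, 60, 80)    # elsewhere in the frame
features = [[overlap_ratio(worker_box, helmet_box),
             overlap_ratio(worker_box, vest_box)]]
print("helmet belongs to worker:", bool(clf.predict(features)[0]))
```

Swapping `MLPClassifier` for `DecisionTreeClassifier` or `KNeighborsClassifier` gives the DT and KNN variants of the same pipeline.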

Approach II
The YOLO-v7 model localizes workers in the input image and directly classifies each detected worker as W, WH, WV, or WHV; in this approach, YOLO v7 detects all the classes in a single pass. For example, if a worker is classified as wearing both (WHV), their practice is labeled "SAFE" and recorded as in compliance. Otherwise, the worker is labeled "NOT SAFE" and recorded as out of compliance. Figure 4 shows an illustration of this approach.

Approach III
The YOLO-v7 model first detects all workers in the input image; then, a CNN-based classifier is applied to the cropped worker images to classify each detected worker as W, WH, WV, or WHV. In this approach, YOLO v7 detects the W class first, and a classifier then sorts workers into four classes (W, WH, WV, and WHV). For example, if a worker is classified as wearing both (WHV), their practice is labeled "SAFE" and recorded as in compliance. Otherwise, the worker is labeled "NOT SAFE" and recorded as out of compliance. The DL-based classifiers are VGG-16, ResNet-50, and Xception. Figure 5 shows an illustration of this approach.
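The crop-then-classify step can be sketched as below. The frame, detection boxes, and placeholder prediction are illustrative assumptions; a real pipeline would resize each crop and feed it to the fine-tuned CNN.

```python
# Sketch of Approach III (hypothetical reconstruction): YOLO v7 finds the
# workers, each detection is cropped out of the frame, and a CNN classifier
# assigns W / WH / WV / WHV to the crop.
import numpy as np

CLASSES = ["W", "WH", "WV", "WHV"]

def crop_worker(frame, box):
    """Cut a detected worker (x_min, y_min, width, height) out of the frame."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

def compliance_label(class_name):
    """Only a worker wearing both helmet and vest (WHV) is in compliance."""
    return "SAFE" if class_name == "WHV" else "NOT SAFE"

frame = np.zeros((480, 640, 3), dtype=np.uint8)        # stand-in camera frame
detections = [(100, 50, 80, 200), (300, 60, 70, 190)]  # hypothetical YOLO boxes

for box in detections:
    patch = crop_worker(frame, box)
    # A real pipeline would resize `patch` (e.g., to 224x224) and feed it to
    # the fine-tuned VGG-16 / ResNet-50 / Xception; a placeholder is used here.
    pred = "WHV"
    print(patch.shape, compliance_label(pred))
```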

You Only Look Once (YOLO)
Utilized in all three proposed approaches, the YOLO v7 model takes 640 × 640 images as input; therefore, all images were resized to 640 × 640. YOLO has been used to extract information from tables [86][87][88], for license plate recognition [89,90], automated invoice parsing [91], and automated meter reading [92]. Instead of learning regions as in Faster R-CNN, YOLO (currently in its seventh version) looks at the complete image, splits it into n × n grids, then uses a single CNN to predict the bounding boxes and the class probabilities for those boxes [93]. Finally, the bounding boxes are used to locate the class within the tested image [94]. Figure 6 shows an illustration of the algorithm utilized in approach II.
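The resize-to-640 × 640 step can be sketched as follows. One common convention (assumed here, not confirmed by the paper) is "letterboxing": scale the image, then pad to square, so objects are not distorted; a dependency-free nearest-neighbor version:

```python
# Hedged sketch: YOLO v7 expects 640x640 inputs. This letterbox keeps the
# aspect ratio by scaling the longest side to 640 and padding the rest.
# Nearest-neighbour index sampling is used to stay numpy-only.
import numpy as np

def letterbox(img, size=640, pad_value=114):
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]                     # nearest-neighbour resize
    out = np.full((size, size) + img.shape[2:], pad_value, dtype=img.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    out[top:top + nh, left:left + nw] = resized      # centre the image in the pad
    return out

frame = np.zeros((480, 854, 3), dtype=np.uint8)      # e.g., a 480p video frame
print(letterbox(frame).shape)  # (640, 640, 3)
```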

Decision Tree (DT)
DT is a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular output feature. DT is one of the most common Data Mining (DM) techniques, widely used for classification and regression analysis. DT comes in many decision algorithms, some of which are binary trees that always produce two categories (binary-split) at any level of the tree, like CART and QUEST. Others, like CHAID and C5.0, are non-binary trees that often grow more than two categories at any level of the tree. Other minor differences exist between these four main DT algorithms, such as how they deal with missing values, variable selection, the capacity to handle a vast number of classes in variables, and pruning methods [95,96]. Figure 7 shows an illustration of the algorithm utilized in approach I with DT.
Fig. 10 Illustration of the algorithm utilized in approach III with VGG-16. The YOLO v7 simple illustration [141] and photo of worker from the dataset [69]
Fig. 11 Illustration of the algorithm utilized in approach III with Xception. The YOLO v7 simple illustration [141] and photo of worker from the dataset [69]

K-Nearest Neighbors (KNN)
KNN is a supervised machine learning algorithm that can be used to solve both classification and regression problems. KNN assumes that similar data points exist near each other. KNN searches the entire data set, calculating the distance from the query point to every stored point, sorts the calculated distances in ascending order, and picks the k points with the smallest distances. KNN uses a large amount of training data, plotting data points in a high-dimensional space where each axis corresponds to an individual variable that characterizes the data point [97]. KNN has been used in intelligent mechanical systems to detect online fraud [98] and has been successfully implemented in a large number of business problems [99,100]. Figure 8 shows an illustration of the algorithm utilized in approach I with KNN.
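The distance-sort-vote procedure just described can be written out directly; the training points below are toy overlap features assumed for illustration:

```python
# Sketch of the KNN steps above (illustrative, not the paper's code):
# compute distances to all stored points, sort ascending, vote among the
# k closest labels.
import numpy as np

train_X = np.array([[0.9, 0.8], [0.85, 0.1], [0.05, 0.9], [0.0, 0.0]])
train_y = np.array(["WHV", "WH", "WV", "W"])

def knn_predict(x, k=1):
    d = np.linalg.norm(train_X - x, axis=1)   # distance to every stored point
    nearest = np.argsort(d)[:k]               # ascending order, take first k
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]          # majority vote

print(knn_predict(np.array([0.8, 0.75])))     # closest to [0.9, 0.8] -> WHV
```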

Multilayer Perceptron (MLP)
MLP is a class of feedforward artificial neural network (ANN) that has been widely used in machine learning applications across all fields of science [101]. The MLP gives an AI system the ability to solve problems from data, effectively helping computers program themselves based on input data. MLP can be used in both supervised and unsupervised learning methods. Its initial structure consists of a network of nodes (neurons, or perceptrons) arranged in three layers: input, hidden, and output. Its learning process (inner workings) resembles how a newborn's brain develops without prior knowledge. The MLP model learns how to transform (linearly or nonlinearly) input variables into output variables by creating layers upon layers of neurons with random initial weights [102]. Figure 9 shows an illustration of the algorithm utilized in approach I with MLP.

VGG-16
VGG-16 is a sixteen-layer deep CNN algorithm that is used in many computer vision tasks. VGG-16 can classify images into 1000 object categories and has about 138 million parameters [103]. It has a unique, consistent architecture of 3 × 3 convolutional layers and 2 × 2 max-pooling layers [104,105], ending with three fully connected layers [106][107][108]. VGG-16 was used to classify and identify different varieties of peanuts and achieved an average accuracy of 96.7% [109]. In addition, it was utilized in the computer vision process of Unmanned Aerial Vehicles (UAV) [110] to detect flower heads with prominent stamens (tassels). Furthermore, VGG-16 was used for real-time detection in surveillance camera feeds [111]. It has also been utilized in hand-gesture recognition tasks [112], detecting defects in wafer structures [113], recognizing oil rigs in aerial images [114], and corn leaf disease diagnosis [115]. Figure 10 shows an illustration of the algorithm utilized in approach III with VGG-16.
Fig. 12 Illustration of the algorithm utilized in approach III with ResNet-50. The YOLO v7 simple illustration [141] and photo of worker from the dataset [69]
Fig. 13 Prediction area in red vs. ground-truth area in blue. Photo of worker is from the dataset [69]

Xception
Xception by Google is a CNN that consists of 71 deep convolutional layers [116]. This efficient deep architecture was achieved by maintaining fewer connections between the convolutional layers of the model, thus making it less dense. Xception has about 22.8 million parameters and one fully connected layer at the end. Xception has been used in a wide variety of applications, including face detection [117], medicinal leaf classification [118], automated semantic segmentation of tree branches [119], garbage image classification [120], detection of brain tumors in MR images [121], and urban scene analysis [122,123]. Figure 11 shows an illustration of the algorithm utilized in approach III with Xception.

ResNet-50
VGG-16 and Xception were limited in their number of deep layers due to the vanishing gradient problem that comes with added depth. The vanishing gradient problem happens when the product of derivatives keeps decreasing until, at some point, the partial derivative of the loss function approaches a value close to zero before the gradient propagates to the full depth. ResNet-50 is immune to this problem, which could allow us in some cases to get better performance. ResNet (residual network) is a type of artificial neural network (ANN) built from residual blocks and used as a backbone for many computer vision tasks. ResNet-50 consists of 50 deep layers, has about 25.5 million parameters, and has one fully connected layer at the end [124]. ResNet-50 has been used in a wide variety of applications, including food recognition tasks [125], flower detection [126], breast cancer diagnosis in histopathological images [127], pneumonia prediction from medical images [128], malaria cell image classification [129], and urban planning [130]. Figure 12 shows an illustration of the algorithm utilized in approach III with ResNet-50.
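A toy numerical illustration of why the residual (skip) connection sidesteps the vanishing gradient, using a single saturated tanh unit (an assumption for illustration, not ResNet's actual block structure):

```python
# Why residual connections help: the skip path adds the block input back to
# its output, so the gradient keeps a constant identity term even when the
# learned transform saturates.
import numpy as np

x, w = 3.0, 2.0

# Plain unit: out = tanh(w*x). Its gradient w.r.t. x is w * (1 - tanh(w*x)^2),
# which collapses toward zero once tanh saturates.
grad_plain = w * (1 - np.tanh(w * x) ** 2)

# Residual unit: out = x + tanh(w*x). The identity path contributes a
# constant 1 to the gradient, so it never vanishes.
grad_residual = 1 + grad_plain

print(f"plain: {grad_plain:.2e}, residual: {grad_residual:.6f}")
```

Chained over many layers, the plain gradient is a product of near-zero terms, while the residual gradient stays bounded away from zero.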

Results and discussion
Object detectors such as YOLO v7 predict the location of objects of the given four classes in an image with a particular confidence score. The confidence score reflects how likely the predicted bounding box contains the targeted class and how confident the classifier is about it. The object position is defined by placing bounding boxes around the objects to locate them. Therefore, our detection models were represented by a set of attributes: an object class with a corresponding bounding box, coordinates for each box, a certain height and width for the box, and a confidence score. For example, consider the object of interest (WHV) represented by a ground-truth bounding box (blue) and the detected area represented by a predicted bounding box (red) in Fig. 13. A perfect match occurs when the area and location of the predicted and ground-truth boxes are the same [131,132]. The threshold value for what is known as Intersection over Union (IoU) is used to evaluate these two bounding boxes. IoU is equal to the area of the overlap (intersection) between the predicted bounding box (red) and the ground-truth bounding box (blue) divided by the area of their union. Even a small IoU value still constitutes a valid prediction; however, a threshold close to one is more restrictive than a threshold close to zero [131,132]. In our work, we chose an IoU threshold of 0.5, which is neither loose nor restrictive. Many object detection performance measurements are decided based on the elements of the confusion matrix: true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN). Figure 14 shows an illustration of these elements, and Table 4 summarizes the conditions under which each of them takes place, while Table 5 shows the values of TP, FP, FN, and TN of each class (W, WH, WV, WHV) for each approach and its variations. Moreover, Fig. 15 illustrates the graphical representation of performance measurements for each approach.
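The IoU definition above translates directly into code; the boxes below are toy values for illustration:

```python
# IoU = intersection area / union area, with boxes as (x_min, y_min, x_max, y_max).
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

pred = (50, 50, 150, 150)      # predicted box (red)
truth = (100, 100, 200, 200)   # ground-truth box (blue)
print(iou(pred, truth))        # 2500 / 17500 ~ 0.143, below the 0.5 threshold
```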
Since the utilized part of the dataset does not contain workers with vests only, as noted earlier (refer to Table 3), we expect to see zero TP, FP, and FN values for the WV class. However, the models were trained to detect workers with safety vests and helmets; so in some cases, as seen in Table 5, the models incorrectly detect a safety vest, as seen in the FP values for the WV class. Furthermore, to evaluate our detection approaches' performance in detecting the ground-truth bounding boxes for each class, we need to use performance metrics such as accuracy, precision, recall, and F1 score. In object detection, accuracy is not a reliable measurement due to the nature of the class distribution, which is considerably non-uniform.

Table 4 Conditions for TP, FP, TN, and FN in Fig. 7

TP: correctly classified to the class, IoU ≥ 0.5, meaning the object is there and the model correctly detects it; there is an overlap between the ground-truth box and the prediction box (A, B).
FP: incorrectly classified to the class, IoU < 0.5, meaning the object is there but the predicted box has a low IoU against the ground-truth box (C), or the object is not there but the model still detects one (D).
FN: incorrectly classified to another class, IoU = 0, meaning the object is there and the model does not detect it, or the ground-truth box has no prediction box against it (E, F).
TN: all the other unrelated classes, or background regions correctly detected as non-objects. Thus, TN includes all possible negative classes that were not detected. In addition, when calculating object detection metrics, which we address in further detail below, TN values are not essential and are usually assigned a null value [133] (G, H).

The performance of our models is usually evaluated using precision, recall, and F1 score. Precision (= TP / (TP + FP)) is the ability of the model to detect the relevant class.
Precision scores range from 0 to 1; high precision implies that most detected objects match ground-truth objects. In comparison, recall (= TP / (TP + FN)) measures the probability of correctly detecting ground-truth objects. Recall also ranges from 0 to 1, where a high recall score means that most ground-truth objects were detected [134]. Table 6 shows a summary of precision, recall, and F1 scores. High recall but low precision implies that all ground-truth objects have been detected, but most detections are incorrect (many false positives). On the other hand, low recall but high precision implies that all predicted boxes are correct, but most ground-truth objects have been missed (many false negatives). The ideal detector has both high precision and high recall, meaning the model detects most ground-truth objects correctly. The F1 score is the harmonic mean of precision and recall, F1 = (2 × Precision × Recall) / (Precision + Recall); a high F1 score indicates high model performance [135]. Therefore, the models with the highest F1 scores perform best. It was also found that models generally performed best when detecting the WH class. Finally, all models in approach III showed the most promising results in detecting the W class, since their YOLO v7 stage was dedicated to learning one object class. Table 7 shows the performance rank of each model, where a rank of 1 indicates the highest performance and a rank of 7 indicates the lowest.
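A small worked example of these formulas, with hypothetical counts (TP = 8, FP = 2, FN = 4) that are not taken from the paper's results:

```python
# Worked example of the precision/recall/F1 formulas above.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)                             # 8 / 10 = 0.8
recall = tp / (tp + fn)                                # 8 / 12 ~ 0.667
f1 = 2 * precision * recall / (precision + recall)     # harmonic mean ~ 0.727

print(round(precision, 3), round(recall, 3), round(f1, 3))
```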
According to Table 7, approach II has the lowest performance among all the proposed approaches and their variations in detecting any class. This is because, in AII, the model tried to detect all four classes in a single pass. A possible way to improve this approach is to train it with more images. Meanwhile, AIII with VGG-16 performed best when detecting workers with safety vests and helmets. One reason is that VGG-16 ends in three fully connected layers, compared to one for ResNet-50 and Xception. Another is that VGG-16 has more than 138 million parameters, compared to 25.5 million and 22.2 million for ResNet-50 and Xception, respectively. Following AIII with VGG-16 is AI with all its ML variations, which performed very well compared to AIII with ResNet-50 or Xception. A possible reason is that DT and KNN are known for their excellent performance in classification tasks and that their classes were trained for more epochs than all the other models. An epoch is one complete pass of the training data through the algorithm; each model was trained for as many epochs as needed to achieve the lowest loss value possible [136]. Table 8 shows the number of epochs each class was trained for in each model to achieve the lowest possible final loss value, while Table 9 shows the processing speed of each model in frames per second (FPS).
Based on Table 9, the approach with the highest speed (AII) was also the approach with the lowest performance ranking (refer to Table 7), which could indicate a trade-off between processing speed and model performance. All models were run on a Google Colab cloud-based GPU [137]. Figures 16 and 17 summarize the performance measurement curves discussed earlier in Sect. 4 of this paper.
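FPS figures such as those in Table 9 can be obtained with a simple timing loop. The sketch below uses a hypothetical stand-in for the model's per-frame inference call, since the actual YOLO v7 invocation depends on the deployment framework:

```python
import time

def detect(frame):
    """Hypothetical stand-in for a model's per-frame inference."""
    time.sleep(0.001)  # simulate inference latency
    return []          # would return detected boxes

def measure_fps(frames):
    """Average frames per second over a batch of frames."""
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

fps = measure_fps([None] * 50)
print(f"{fps:.1f} FPS")
```

Using `time.perf_counter()` rather than `time.time()` avoids clock adjustments skewing short measurements; averaging over many frames smooths out per-frame jitter.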

Limitations of the study
One of the main limitations of vision-based detection methods is that they are susceptible to occlusion, poor illumination, and blurriness [138]. Figure 17 illustrates a summary of the YOLO v7 architecture used in this paper. In addition, the lack of PPE datasets in general, and of datasets that relate to manufacturing in particular, can affect the performance of any developed AI model, since such models rely solely on data for training, validation, and testing. Too little data can lead to overfitting in the case of DL techniques, while more data can lead to longer processing times in ML techniques. Table 10 summarizes the associated data-driven challenges [139] that might be faced when implementing our work at an enterprise level. For the proposed methodology to work, YOLO v7 has to detect workers in all three approaches. The only difference is that AII detects workers with or without PPE as a whole, while AI and AIII detect workers first and then their PPE. If an image contains only a PPE item without the worker, it is counted as TN, as discussed earlier in Table 4 and illustrated in Fig. 14.
A single-factor ANOVA test was conducted on all models per class. In this test, the null hypothesis states that the means of the confusion-matrix elements are equal across all seven models, and the alternative hypothesis states that they are not all equal. With a confidence level of 99.9%, i.e., a significance level α = 0.001, we applied the ANOVA analysis to all seven models for all classes. Table 11 shows a summary of the results.
Since the observed p-value per class for all models is larger than the significance level α = 0.001, we can safely state that there were no significant differences in the elements of the confusion matrix per class across the models.
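The single-factor ANOVA above reduces to computing an F statistic from the between-group and within-group variation. A minimal pure-Python sketch follows, with illustrative group values rather than our actual confusion-matrix entries; in practice, the p-value is then read from the F distribution (e.g., via `scipy.stats.f_oneway`):

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of sample groups."""
    k = len(groups)                        # number of groups (models)
    n = sum(len(g) for g in groups)        # total observations
    grand = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of samples around their group mean
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Groups with similar means yield a small F statistic, consistent with
# failing to reject the null hypothesis at a significance level of 0.001.
groups = [[10, 12, 11], [11, 13, 10], [12, 11, 12]]
print(round(anova_f(groups), 3))
```

A large F statistic (group means far apart relative to within-group noise) is what would drive the p-value below α and lead to rejecting the null hypothesis.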

Conclusions
Real-time monitoring of proper PPE use is essential, and several ML and DL methods have been explored to make it possible. In this study, computer-vision-based PPE compliance detection techniques were proposed. Experiments comparing seven different algorithms revealed that the proposed YOLO v7 with the VGG-16 algorithm performs very well in terms of the F1 score and FPS performance measures. The models presented in this paper utilized the Pictor-v3 dataset, whose images were taken on different devices and across different locations, times, perspectives, PPE styles, and industry projects. The YOLO v7 with VGG-16 model performed best, making it a strong candidate for real-time PPE detection. The proposed methods were tested only on the safety vest and helmet classes; therefore, future work can focus on data with more classes, such as safety shoes, glasses, and gloves, to broaden the applications of the proposed models. Furthermore, future work can combine Natural Language Processing (NLP) to generate safety reports that could be used in root cause analysis to prevent accidents from recurring.
Author contribution Mohammad Shahin took care of conceptualization, methodology, data collection, investigation, original draft, review, and final revisions. Ali Hosseinzadeh took care of reviewing the investigation and the proposed methodology. Hamed Bouzary took care of

Table 10 Data-driven challenges (example faced by management)

Data size and nature: At an initial stage, an organization has to rely on whatever data it can source off the internet. Later, it has to incorporate its own data as its system collects images of workers daily for processing. Then, as the AI is trained on more and more images from the same environment, it becomes more and more accurate.

Uncertainty: Uncertainty creeps in wherever there is a faulty device or an improperly adjusted system. The proposed method relies mainly on camera systems. Staying up to date with the latest camera design upgrades can be costly but yields higher performance measurements. Further work could quantify how much performance is gained when the hardware is updated.

Models: ML and DL models are constantly being updated. At some point, it can be more economically viable for the organization to update its software to take advantage of the latest features of these models. For example, new features can include better light and color filters, the ability to handle low-resolution images more precisely, faster processing times, and lower utilization of computer resources. Furthermore, DL optimizers [140] could be investigated in future work, deployed to constantly learn and update the system with the best hyperparameters, leading to higher performance based on experience.