3.1 YOLOv5
YOLOv5 [18] is the deep learning architecture used to conduct this experiment. It is lightweight and fast, requiring less computational power than other current state-of-the-art architectures while keeping its accuracy close to that of current state-of-the-art detection models. YOLOv5, authored by Glenn Jocher, a researcher at Ultralytics LLC, was the first YOLO model to be published without a supporting paper, and it was marked as under "ongoing development" in its repository. The GitHub repository for YOLOv5 is available at Releases · ultralytics/yolov5 (github.com); it was published in June 2020. Unlike earlier versions, which were written in C, YOLOv5 was built in Python [26], which simplifies installation and integration on IoT devices. In addition, the PyTorch community is larger than the Darknet community, giving YOLOv5 more room for growth and expansion in the future [19]. It is also much faster than the other YOLO models. YOLOv5 uses CSPNet [29] as its backbone to extract feature maps from the image, and it uses the Path Aggregation Network (PANet) [31] to boost information flow. Figure 3 shows the architecture of YOLOv5. We use YOLOv5 for the following reasons:
1) YOLOv5 offers state-of-the-art features such as modern activation functions, tuned hyperparameters, data augmentation techniques, and an easy-to-use manual.
2) The model has a simple architecture, which makes it computationally cheap to train even with limited resources.
3) The small and lightweight nature of YOLOv5 makes it useful for mobile devices and embedded applications.
3.2 YOLOv8
Also authored by Glenn Jocher and released on January 23rd, 2023, YOLOv8 is the latest algorithm in the family and is still under development; it adds many new features such as anchor-free detection and mosaic augmentation. YOLOv8 includes a CLI that makes training a model easier to understand, as well as a Python package [27] that offers a smoother development experience than the previous model. The GitHub repository for YOLOv8 is available at GitHub - ultralytics/ultralytics (github.com). Predictions for both the bounding boxes and the classes are produced after the input image has been evaluated once, since both predictions, bounding box and classification, are performed simultaneously. The provided image is first divided into a grid of equal-sized cells (S x S). Next, confidence scores are computed for each of the "b" bounding boxes in every grid cell, as shown in equation (i) [22]. Confidence is the probability that an object exists in a given bounding box.
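To make the grid step concrete, the following minimal sketch (the function name and the pixel coordinates are illustrative assumptions, not part of either library) shows how a target box center is assigned to one of the S x S grid cells:

```python
def grid_cell(cx, cy, img_w, img_h, S):
    """Map a box center (cx, cy) in pixels to its (row, col) cell in an S x S grid."""
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # clamp so a center lying exactly on the right/bottom edge stays inside the grid
    return min(row, S - 1), min(col, S - 1)

# example: a 640 x 640 image divided into a 7 x 7 grid; a center at (320, 100)
# falls into row 1, column 3
print(grid_cell(320, 100, 640, 640, 7))  # (1, 3)
```

The cell that contains the object's center is the one made responsible for detecting that object.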
Confidence (C) = P(object) * \({IOU}_{pred}^{target}\) —(i)
Where IOU = Intersection over Union.
IOU [23] is a fractional value in the range of 0 to 1. The intersection is the overlap between the predicted bounding box and the target region, whereas the union is the total area covered by the predicted and target regions together. The ideal value is close to 1, which indicates that the predicted bounding box closely matches the target region. In addition, every grid cell also predicts the conditional class probability, as shown in equations (ii) and (iii).
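A minimal IOU computation for two axis-aligned boxes, assuming the common (x1, y1, x2, y2) corner format, could look like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285
```

Identical boxes give an IOU of 1.0, and disjoint boxes give 0.0, matching the ideal and worst cases described above.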
C = P(Class i | object) * P(object) * \({IOU}_{pred}^{target}\) —(ii)
C = P(Class i) * \({IOU}_{pred}^{target}\) —(iii)
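Plugging toy numbers (assumed purely for illustration) into equations (i)-(iii):

```python
# assumed toy values: the cell predicts an object with probability 0.8, its box
# overlaps the target with IOU 0.7, and the conditional probability of the
# class "A" given an object is 0.9
p_object = 0.8
iou_pred = 0.7
p_class_given_object = 0.9

# equation (i): box confidence
confidence = p_object * iou_pred                                # ~0.56

# equations (ii)/(iii): class-specific confidence
class_confidence = p_class_given_object * p_object * iou_pred   # ~0.504
```

The class-specific confidence therefore combines how likely the box is to contain an object, how well it localizes it, and how likely that object is to belong to the class.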
The loss function is calculated by summing the loss terms over all bounding-box parameters, as shown in equation (iv):
$${\lambda }_{coord} {\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{obj}\left[{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}+{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\right]$$
$$+ {\lambda }_{coord}{\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{obj}\left[{\left(\sqrt{{w}_{i}}-\sqrt{{\widehat{w}}_{i}}\right)}^{2}+{\left(\sqrt{{h}_{i}}-\sqrt{{\widehat{h}}_{i}}\right)}^{2}\right]$$
$$+ {\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{obj}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2} + {\lambda }_{noobj}{\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{noobj}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2}$$
$$+ {\sum }_{i=0}^{{S}^{2}}{1}_{i}^{obj} {\sum }_{c \in classes}^{}{\left({p}_{i}\left(c\right)-{\widehat{p}}_{i}\left(c\right)\right)}^{2}$$ —(iv)
The equation involves the following terms, defined as follows [23]:
- \(({x}_{i}, {y}_{i})\) – coordinates of the target (ground-truth) bounding-box center
- \((\widehat{x}_{i}, \widehat{y}_{i})\) – coordinates of the predicted bounding-box center
- \(({w}_{i}, {h}_{i})\) – width and height of the target bounding box
- \((\widehat{w}_{i}, \widehat{h}_{i})\) – width and height of the predicted bounding box
The first term of the equation computes the loss associated with the bounding-box center coordinates (xi, yi). \({1}_{ij}^{obj}\) is defined as 1 if an object is present inside the jth predicted bounding box of the ith cell, and 0 otherwise, as shown in equation (v). The predicted bounding box with the highest current IOU with the target region is considered "responsible" for predicting the object [19].
\({\lambda }_{coord} {\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{obj}\left[{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}+{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\right]\) —(v)
The next term calculates the error in the predicted dimensions of the bounding box. However, an inaccuracy of a given absolute size matters less for large boxes than for small ones. Because width and height are both normalized to the range 0 to 1, taking square roots makes the differences between smaller values larger than those between larger ones. As a result, the square roots of the bounding-box dimensions are used rather than the dimension values themselves, as shown in equation (vi).
\({\lambda }_{coord}{\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{obj}\left[{\left(\sqrt{{w}_{i}}-\sqrt{{\widehat{w}}_{i}}\right)}^{2}+{\left(\sqrt{{h}_{i}}-\sqrt{{\widehat{h}}_{i}}\right)}^{2}\right]\) —(vi)
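A quick numeric check of this effect, using assumed normalized widths: the same absolute error of 0.05 produces a larger squared-root difference on a small box than on a large one.

```python
import math

# same absolute width error (0.05, normalized) on a small and a large box
small_w, small_w_hat = 0.10, 0.15
large_w, large_w_hat = 0.80, 0.85

small_term = (math.sqrt(small_w) - math.sqrt(small_w_hat)) ** 2  # ~0.0051
large_term = (math.sqrt(large_w) - math.sqrt(large_w_hat)) ** 2  # ~0.0008

# the square-root form penalizes the small box more for the same error
assert small_term > large_term
```

Without the square roots both terms would be identical (0.05 squared), so small boxes would be under-penalized relative to their size.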
The next term calculates the confidence loss for both circumstances, whether or not an object is present inside the bounding box. However, the loss function only penalizes the object-confidence error if that predictor is responsible for the target box, as shown in equation (vii). \({1}_{ij}^{obj}\) equals 1 if there is an object in the cell, and 0 otherwise.
\({\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{obj}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2} + {\lambda }_{noobj}{\sum }_{i=0}^{{S}^{2}}{\sum }_{j=0}^{b}{1}_{ij}^{noobj}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2}\) —(vii)
The last term computes the loss of class probability [19], as shown in equation (viii). The indicator \({1}_{i}^{obj}\) is needed because the algorithm does not penalize classification errors when no object is present in the cell.
\({\sum }_{i=0}^{{S}^{2}}{1}_{i}^{obj} {\sum }_{c \in classes}^{}{\left({p}_{i}\left(c\right)-{\widehat{p}}_{i}\left(c\right)\right)}^{2}\) —(viii)
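As a concrete sketch, the summed objective can be evaluated over toy NumPy arrays. The array shapes and the `obj_mask` argument are illustrative assumptions rather than the real network output, and the class-probability term of equation (viii) is omitted for brevity:

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """
    Sum-squared YOLO-style loss over an S x S grid with b boxes per cell.
      pred, target : arrays of shape (S, S, b, 5) holding (x, y, w, h, C) per box
      obj_mask     : array of shape (S, S, b); 1 where box j of cell i is
                     responsible for an object, 0 elsewhere
    """
    noobj_mask = 1.0 - obj_mask
    # localization loss for box centers
    xy = ((pred[..., 0] - target[..., 0]) ** 2 +
          (pred[..., 1] - target[..., 1]) ** 2)
    # square roots of width/height so small boxes weigh relatively more
    wh = ((np.sqrt(pred[..., 2]) - np.sqrt(target[..., 2])) ** 2 +
          (np.sqrt(pred[..., 3]) - np.sqrt(target[..., 3])) ** 2)
    # confidence error, weighted differently with and without an object
    conf = (pred[..., 4] - target[..., 4]) ** 2
    return (lambda_coord * np.sum(obj_mask * (xy + wh))
            + np.sum(obj_mask * conf)
            + lambda_noobj * np.sum(noobj_mask * conf))
```

A perfect prediction yields a loss of exactly zero, and any deviation in a responsible box's center, dimensions, or confidence increases the loss, mirroring the term-by-term structure above.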
The first step is to install and initialize both algorithms, YOLOv5 and YOLOv8. Both algorithms run simultaneously on different devices with similar specifications, which makes comparing the results easier and less time consuming. The dataset chosen for this work is "American Sign Language Letters" [23], one of the publicly available datasets on Roboflow. We also use PyTorch, a Python library based on the well-known Torch library that is widely used for computer vision and natural language processing. When downloading a dataset from Roboflow, several methods are offered for importing it into a model; one of them is a PyTorch code snippet that installs the roboflow package and downloads the dataset directly into the program's directory. Another major package used is Ultralytics, which provides all versions of the YOLO algorithm, making it well suited for this program.
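The Roboflow download step described above typically looks like the following sketch. The API key, workspace name, project slug, and version number are placeholders (assumptions) that must be replaced with the values shown on the dataset's Roboflow page, so this fragment is configuration rather than runnable as-is:

```python
# sketch of downloading the dataset in YOLOv5 format via the roboflow package;
# "YOUR_API_KEY", "workspace-name", and the version number are placeholders
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("workspace-name").project("american-sign-language-letters")
dataset = project.version(1).download("yolov5")  # writes the dataset into the working directory
```

The same call with `download("yolov8")` retrieves the annotations in the layout expected by the Ultralytics YOLOv8 trainer.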