ShuffleNetv2-YOLOv3: a real-time recognition method of static sign language based on a lightweight network

To better meet the communication needs between hearing-impaired people and the public, it is of great significance to recognize sign language quickly and accurately on embedded platforms and mobile terminals. YOLOv3, proposed by Joseph Redmon and Ali Farhadi in 2018, greatly improved detection speed with considerable accuracy by optimizing YOLO. However, YOLOv3 is still too bloated to use on mobile terminals. A static sign language recognition method based on the lightweight ShuffleNetv2-YOLOv3 model is proposed. The model is made lightweight by using ShuffleNetv2 as the backbone network of YOLOv3, which sharply improves the recognition speed. Combined with the CIoU loss function, ShuffleNetv2-YOLOv3 maintains recognition accuracy while improving recognition speed. The recognition effectiveness of the lightweight model on self-made sign language images and a public database was evaluated by F1 score and mAP value, and its performance was compared with that of the YOLOv3-tiny, SSD, Faster-RCNN, and YOLOv4-tiny models. The experimental results show that the proposed ShuffleNetv2-YOLOv3 model achieves a good balance between the accuracy and speed of gesture detection under the premise of a lightweight model. The F1 score and mAP value of the ShuffleNetv2-YOLOv3 model were 99.1% and 98.4%, respectively. The gesture detection speed on the GPU reaches 54 frames per second, which is better than the other models. The mobile-terminal application of the proposed lightweight model was also evaluated. The minimal inference time for single-frame images on the CPU and GPU is 0.14 and 0.025 s per image, respectively, only 1/6.5 and 1/8.5 of the running time of the original YOLOv3 model. The ShuffleNetv2-YOLOv3 lightweight model is conducive to quick, real-time recognition of similar static sign language gestures, laying a good foundation for real-time gesture recognition on embedded platforms and mobile terminals.


Introduction
by stacking multi-level ensembles together to extend the features which are fed into the fully connected layer. This method achieved good results on the publicly available American Sign Language datasets. In addition, skeleton-based gesture recognition, such as OpenPose [12] and MediaPipe Hands [13], also achieved good detection results for gesture recognition. However, an increase in the number of gestures in the picture directly leads to an increase in complexity and computing effort. The amount of computation is positively related to the number of gestures, so more gestures make real-time detection of sign language more difficult.

With the development of target detection technology, scholars converted the classification problem of target recognition into a target detection problem. RCNN (Region-based Convolutional Neural Network) [14], Fast-RCNN [15], Faster-RCNN [16], and other algorithms were proposed to obtain higher recognition accuracy. Jiang [17] proposed a method for diver gesture recognition and segmentation. The method first uses a progressive growing training strategy to optimize a generative adversarial network, and then uses the generative adversarial network model as a data augmentation method for the Mask R-CNN algorithm, which performs gesture recognition and segmentation. This method detects diver gestures well.

Although the previously mentioned methods recognize static gestures well, their complex networks, large models, and slow operation prevent high-speed target recognition on mobile terminals. Subsequently, to address the shortcomings of the above target detection models, single-stage target detection algorithms were proposed. These methods predict the object class and location while generating candidate regions, without splitting the detection task into two stages. Single-stage models greatly reduce the model size; the iconic example is the YOLO detection algorithm [18]. The YOLOv2 [19] and YOLOv3 [20] algorithms were then proposed to further improve recognition accuracy and the recognition rate of small targets. Fan et al. [21] proposed a lightweight gesture recognition algorithm based on the YOLOv4-tiny network structure, addressing the insufficient ability of lightweight target detection networks to extract static gesture features and their high false and missed detection rates. Their method achieves accurate classification, real-time detection, and better recognition of small-scale gestures. Liu et al. proposed the SSD (Single Shot MultiBox Detector) [22]; however, its detection speed is slow on mobile terminals, because the GPU computing power of mobile terminals is much lower than that of PCs. To meet the needs of mobile devices, lightweight CNNs such as MobileNet [23] and ShuffleNet [24] have been proposed, which strike a good balance between speed and accuracy. ShuffleNetV2 [25] uses channel shuffling to improve the exchange of information between channels and further takes the actual speed on hardware into account. These lightweight methods offer the potential for mobile-terminal application of gesture recognition.
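To make the channel shuffle idea concrete, the following minimal PyTorch-style sketch reorders channels so that features produced by one convolution group can reach the other groups. It is an illustrative sketch written for this article, not the authors' implementation; the function name and the example group count are assumptions.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Shuffle channels so information flows between convolution groups.

    x: feature map of shape (N, C, H, W); C must be divisible by `groups`.
    """
    n, c, h, w = x.size()
    channels_per_group = c // groups
    # (N, C, H, W) -> (N, groups, C/groups, H, W)
    x = x.view(n, groups, channels_per_group, h, w)
    # Swap the group and per-group channel axes, then flatten back.
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Example: a 4-channel feature map split into 2 groups.
feat = torch.arange(4.0).view(1, 4, 1, 1)         # channels [0, 1, 2, 3]
print(channel_shuffle(feat, groups=2).flatten())  # tensor([0., 2., 1., 3.])
```

Because the shuffle is a pure reshape-transpose-reshape, it adds almost no computation, which is one reason it suits lightweight mobile networks.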
In target detection, the IoU value measures the overlap between the predicted bounding box and the real frame. DIoU (Distance-IoU) and CIoU (Complete-IoU) [28] were proposed to additionally take into account the distance between the centroids of the two frames.

Here, the CIoU bounding box loss is used, as in Eq. (3):

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{3}$$

where $\frac{\rho^2(b, b^{gt})}{c^2}$ is defined as the penalty term between the prediction frame $B$ and the target frame $B^{gt}$; $b$ and $b^{gt}$ represent the center points of $B$ and $B^{gt}$, respectively, $\rho$ represents the Euclidean distance, and $c$ represents the diagonal distance of the smallest enclosing rectangle.

Here, $\alpha v$ is the influence factor, where $\alpha$ is the parameter used to balance the ratio, calculated as

$$\alpha = \frac{v}{(1 - IoU) + v}$$

and $v$ is a parameter that measures the consistency of the aspect ratio, calculated as

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$

where $w$, $h$ and $w^{gt}$, $h^{gt}$ denote the widths and heights of the prediction frame and the target frame, respectively.
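For illustration, Eq. (3) together with the definitions of $\alpha$ and $v$ can be turned into code directly. The following plain-Python sketch is written from these published formulas, not from the authors' implementation; the function name and the corner-format box convention are assumptions made here.

```python
import math

def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for axis-aligned boxes in (x1, y1, x2, y2) corner format.

    Implements L_CIoU = 1 - IoU + rho^2(b, b_gt)/c^2 + alpha*v from Eq. (3).
    """
    x1, y1, x2, y2 = box
    X1, Y1, X2, Y2 = box_gt

    # IoU of the two boxes.
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / (union + eps)

    # rho^2: squared Euclidean distance between the two center points.
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    # c^2: squared diagonal of the smallest enclosing rectangle.
    c2 = (max(x2, X2) - min(x1, X1)) ** 2 + (max(y2, Y2) - min(y1, Y1)) ** 2 + eps

    # v measures aspect-ratio consistency; alpha balances its contribution.
    v = (4.0 / math.pi ** 2) * (
        math.atan((X2 - X1) / (Y2 - Y1)) - math.atan((x2 - x1) / (y2 - y1))
    ) ** 2
    alpha = v / ((1.0 - iou) + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v

# Example: two partially overlapping boxes with identical aspect ratios.
print(round(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # ≈ 0.9683
```

For two boxes with the same aspect ratio the term $v$ vanishes, so the loss reduces to $1 - IoU$ plus the normalized center-distance penalty, as the example shows.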
The training set is composed of a random 80% of the data selected from the datasets by a Python script; the rest of the data was used for testing.

In this experiment, mAP (mean average precision), F1 score, and model size were used as criteria to evaluate the model. The F1 score is the harmonic mean of precision $P = TP/(TP + FP)$ and recall $R = TP/(TP + FN)$:

$$F1 = \frac{2PR}{P + R}$$

and mAP is the mean of the average precision (AP) over all gesture classes. Megabytes (MB) were used to measure the model size. A comparative experiment method was adopted to determine the parameters.

Results and discussion
With the CIoU loss, the predicted frame overlaps with the real frame and the center points also overlap; the optimization is more inclined toward increasing the overlapping region. From the data in the table, it can be verified that the mAP of ShuffleNetv2-YOLOv3 with the CIoU loss function reaches 99.1%, the highest among the four loss functions, and the F1 score reaches 98.5%. The model sizes of the two versions are the same. Therefore, the CIoU version of ShuffleNetv2-YOLOv3 was adopted as the final network structure.

Table 4 shows that the mAP and F1 scores of the above three models are the same on the Creative Senz3D dataset with an image size of 1280 × 960, indicating that the gesture recognition accuracy of the three models is comparable. On the other hand, the ShuffleNetv2-YOLOv3 model has the smallest weight size of the three, 8.9 MB, which is about 1/2.5 of that of the MobileNetV3-s-CenterNet model and 1/14 of that of the YOLOv3 model. The FPS of the ShuffleNetv2-YOLOv3-1.5X-CIOU model is 65 frames per second, comparable to that of the MobileNetV3-s-CenterNet model and about 1.38 times faster than the Darknet53-YOLOv3 model.

The proposed method is also compared with earlier techniques on the self-made datasets. The comparison results are shown in Table 5 to verify the feasibility and superiority of the proposed method. The mAP of the proposed ShuffleNetv2-YOLOv3 model reaches 99.1%, and the F1 score
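The FPS and per-image inference times compared throughout this section are throughput measurements. A hedged sketch of how such figures are typically obtained is given below (PyTorch-style; the warm-up count, CUDA synchronization points, and function name are assumptions, not the authors' exact protocol).

```python
import time
import torch

def measure_fps(model, images, device="cuda", warmup=10):
    """Average frames per second for single-image inference."""
    model = model.to(device).eval()
    with torch.no_grad():
        for img in images[:warmup]:       # warm-up to stabilize caches/clocks
            model(img.to(device))
        if device == "cuda":
            torch.cuda.synchronize()      # finish queued kernels before timing
        start = time.perf_counter()
        for img in images:
            model(img.to(device))
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return len(images) / elapsed          # e.g. ~54 FPS reported on the GPU
```

Synchronizing before and after the timed loop matters on the GPU, since CUDA kernels are launched asynchronously; without it, the measured time can understate the real per-frame cost.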