4.1 Data set
The original data came from two sources: static images captured by the camera, and frames taken from video recorded by the camera while the shelter-transporting AGV was working. The images were named uniformly, and 1000 of them were selected as the total dataset for target-detection training, split into 800 training images and 200 test images.
The LabelImg tool was used to annotate each image in the dataset, as shown in Fig. 3. The correspondence between the shelter, the red cross, and their labels is shown in Table 1. During annotation the anchor box must completely cover the target; the annotated objects are the peripheral features of the shelter and the red cross in its middle. The YOLO format was selected for annotation. LabelImg draws a bounding box around the object in the image and, once the manual annotation is saved, automatically generates a txt file with the same name as the annotated image.
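To make the annotation format concrete, the following is a minimal sketch of reading back one YOLO-format label file of the kind LabelImg produces; the class-name mapping follows Table 1, while the function name and sample values are illustrative.

```python
# Sketch: parse one YOLO-format label file as produced by LabelImg.
# Each line is: <class_id> <x_center> <y_center> <width> <height>,
# with coordinates normalized to [0, 1] relative to the image size.

NAMES = {0: "shelter", 1: "red cross"}  # tag numbers from Table 1

def parse_yolo_label(text):
    """Return a list of (class_name, x, y, w, h) tuples from label-file text."""
    boxes = []
    for line in text.strip().splitlines():
        cls, x, y, w, h = line.split()
        boxes.append((NAMES[int(cls)], float(x), float(y), float(w), float(h)))
    return boxes

# Illustrative file content: one shelter box plus the red cross at its centre.
sample = "0 0.50 0.50 0.80 0.60\n1 0.50 0.50 0.10 0.10"
print(parse_yolo_label(sample))
```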
Table 1
Correspondence between categories and labels
Category | The shelter | The red cross |
Tag name | shelter | red cross |
Tag number | 0 | 1 |
4.2 Model selection
Considering factors such as training efficiency and detection accuracy, the shelter-transporting AGV needs a detection model that combines good detection performance with fast speed. Therefore, the ten pre-trained YOLOv5 models were trained, and their index parameters are shown in Table 2.
Table 2
Index parameters of ten different models
Model | Depth | Width | Layers | Parameters |
yolov5n | 0.33 | 0.25 | 270 | 1766623 |
yolov5n6 | 0.33 | 0.25 | 355 | 3096244 |
yolov5s | 0.33 | 0.50 | 270 | 7025023 |
yolov5s6 | 0.33 | 0.50 | 355 | 12326164 |
yolov5m | 0.67 | 0.75 | 391 | 21060447 |
yolov5m6 | 0.67 | 0.75 | 481 | 35281716 |
yolov5l | 1.00 | 1.00 | 499 | 46636735 |
yolov5l6 | 1.00 | 1.00 | 607 | 76170196 |
yolov5x | 1.33 | 1.25 | 567 | 86224543 |
yolov5x6 | 1.33 | 1.25 | 733 | 140045044 |
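The Depth and Width columns in Table 2 are scaling multipliers: each model variant repeats modules and widens channels from a common base architecture. A sketch of how such multipliers are applied (the formulas follow the convention used in the public YOLOv5 code; the base values below are illustrative):

```python
import math

def scale_depth(n, depth_multiple):
    """Number of repeats of a module after depth scaling (at least 1)."""
    return max(round(n * depth_multiple), 1) if n > 1 else n

def scale_width(channels, width_multiple, divisor=8):
    """Channel count after width scaling, rounded up to a multiple of 8."""
    return math.ceil(channels * width_multiple / divisor) * divisor

# e.g. a 3-repeat block and 64 base channels under yolov5n (0.33 / 0.25):
print(scale_depth(3, 0.33), scale_width(64, 0.25))
```

This is why the layer and parameter counts in Table 2 grow steadily from yolov5n to yolov5x6 while the underlying architecture stays the same.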
The models in Table 2 are listed in ascending order of network complexity. The experimental environment was the Ubuntu 18.04 operating system with the PyTorch framework, an Intel Core i9-10900K CPU, and an NVIDIA RTX 3090 GPU with 24 GB of memory. The training parameter settings for the ten pre-trained YOLOv5 models are shown in Table 3.
Table 3
Experimental training parameters
Parameter | Value |
Epochs (preset number of training epochs) | 500 |
Batch size | 16 |
Learning rate | 0.01 |
Momentum term | 0.937 |
Decay (weight-decay regularisation term) | 0.0005 |
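The Table 3 settings map naturally onto YOLOv5's hyperparameter conventions; a sketch expressing them as a plain configuration dict (the key names mirror the public YOLOv5 hyperparameter file, and are an assumption here):

```python
# Training configuration from Table 3, expressed as a plain dict whose key
# names follow YOLOv5's hyperparameter file (lr0, momentum, weight_decay).
train_cfg = {
    "epochs": 500,           # preset number of training epochs
    "batch_size": 16,
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,       # SGD momentum term
    "weight_decay": 0.0005,  # decay regularisation term
}
print(train_cfg)
```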
The ten models in Table 2 were trained with the parameters in Table 3, taking 16.059 hours in total, and the resulting training curves are compared in Fig. 4.
During training, each metric changes as the number of training steps increases. The metrics in Fig. 4 are defined as follows:
In Fig. 4(a), precision is the number of correctly detected targets divided by the total number of detections made; the closer it is to 1, the higher the accuracy. In Fig. 4(b), recall is the number of correctly detected targets divided by the total number of targets that should be detected; again, the closer to 1, the better. In Fig. 4(c), mAP_0.5 (mean Average Precision) is obtained by setting the IOU threshold to 0.5, computing the AP over all images for each category, and then averaging over the categories. In Fig. 4(d), mAP_0.5:0.95 is the mAP averaged over IOU thresholds from 0.5 to 0.95 in steps of 0.05.
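The precision and recall definitions above follow directly from true-positive, false-positive, and false-negative counts, and the mAP_0.5:0.95 threshold grid is easy to enumerate; a sketch with illustrative counts:

```python
def precision(tp, fp):
    """Correct detections divided by all detections made."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correct detections divided by all targets that should be detected."""
    return tp / (tp + fn)

# IOU thresholds over which mAP_0.5:0.95 is averaged: 0.5, 0.55, ..., 0.95
iou_thresholds = [0.5 + 0.05 * i for i in range(10)]

# Illustrative counts: 90 true positives, 5 false positives, 10 missed targets.
print(round(precision(90, 5), 3), round(recall(90, 10), 3), len(iou_thresholds))
```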
As can be seen from Fig. 4, each metric had stabilized by around 200 training steps, and by 500 steps all curves showed a good fit. Comparing the training results, both precision and recall approach 1 as training proceeds, indicating that all ten models trained well, and mAP_0.5 likewise stabilized around 1. mAP_0.5:0.95 rose slowly over the first 100 steps and then stabilized, slowly approaching 1; there is a clear gap between the final values of the different models, but all exceed 0.9 and the trends are stable.
To further analyze the training effect, the ten trained models were tested on the same test set and compared. The detection results are shown in Table 4.
Table 4
Comparison of detection results
Model | Layers | Parameters | Model size /M | Training time /h | Accuracy /% | Detection time /ms |
yolov5n | 213 | 1761871 | 3.8 | 0.719 | 94.24 | 5.5 |
yolov5n6 | 280 | 3089188 | 6.6 | 0.857 | 96.28 | 6.9 |
yolov5s | 213 | 7015519 | 14.4 | 0.813 | 95.04 | 5.8 |
yolov5s6 | 280 | 12312052 | 25.1 | 0.955 | 96.91 | 7.6 |
yolov5m | 308 | 21041679 | 42.6 | 1.260 | 95.22 | 8.3 |
yolov5m6 | 378 | 35254692 | 71.1 | 1.371 | 95.88 | 10.3 |
yolov5l | 392 | 46605951 | 93.8 | 1.847 | 96.46 | 10.6 |
yolov5l6 | 476 | 76126356 | 153.0 | 2.002 | 97.24 | 12.5 |
yolov5x | 444 | 86180143 | 173.1 | 2.929 | 96.96 | 14.7 |
yolov5x6 | 574 | 139980484 | 280.9 | 3.306 | 96.65 | 17.6 |
As Table 4 shows, the more complex the structure and the more parameters a model has, the longer its training time and the larger its weight file. Comparing the performance indices and detection results of the trained models, the detection accuracy of the relatively complex models (i.e., yolov5m and yolov5m6) was not as good as that of the relatively simple ones (i.e., yolov5s6 and yolov5n6), and the most complex model, yolov5x6, was outperformed by yolov5s6. It can therefore be concluded that a more complex trained model does not necessarily detect better in practical applications. yolov5n has the smallest depth and width among the pre-trained models, the fewest layers and parameters after training, the shortest detection time, and the smallest weight file, which makes it well suited to deployment on the shelter-transporting AGV; however, its detection accuracy of 94.24% is the lowest of all the models. yolov5n6 is only slightly more complex than yolov5n but nearly 2% more accurate. Compared with the more complex yolov5m, yolov5l, and yolov5x, the accuracy of yolov5n6 is not low, and its detection time of only 6.9 ms is far below the more than 10 ms required by those complex models. The training results of yolov5n6 are shown in Fig. 5.
According to the yolov5n6 training results, the model loss decreased and stabilized as the number of training steps increased. The curves fit well, and precision, recall, mAP_0.5, and mAP_0.5:0.95 all stabilized near 1. Weighing detection accuracy, detection time, and model weight, yolov5n6 was selected as the best detection model and applied to shelter detection, achieving both high detection accuracy and high speed.
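The accuracy-versus-speed trade-off behind this choice can be checked mechanically against Table 4. A sketch that filters the models by a detection-time budget and keeps the most accurate one (the figures are copied from Table 4; the 7 ms budget is an illustrative assumption, not a requirement stated above):

```python
# (model, accuracy %, detection time ms), copied from Table 4.
results = [
    ("yolov5n", 94.24, 5.5), ("yolov5n6", 96.28, 6.9),
    ("yolov5s", 95.04, 5.8), ("yolov5s6", 96.91, 7.6),
    ("yolov5m", 95.22, 8.3), ("yolov5m6", 95.88, 10.3),
    ("yolov5l", 96.46, 10.6), ("yolov5l6", 97.24, 12.5),
    ("yolov5x", 96.96, 14.7), ("yolov5x6", 96.65, 17.6),
]

def best_under_budget(results, max_ms):
    """Most accurate model whose detection time stays within the budget."""
    fast = [r for r in results if r[2] <= max_ms]
    return max(fast, key=lambda r: r[1])[0]

# With an assumed 7 ms budget, the selection is yolov5n6.
print(best_under_budget(results, 7.0))
```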
4.3 Discussion
To solve the problems of low accuracy and slow speed in shelter identification by the shelter-transporting AGV and to further improve detection performance, the improved detection model yolov5n6* was developed by introducing CBAM into the model's main structure, changing the loss function from GIOU_Loss to CIOU_Loss, and adopting the more reasonable DIOU_nms. The box_loss and mAP_0.5:0.95 curves of the yolov5n6* and yolov5n6 models are shown in Fig. 6.
Comparing the training results of the two models in Fig. 6 shows that the box_loss of both the improved yolov5n6* and the original yolov5n6 decreased as training progressed and gradually stabilized. The box_loss of the improved yolov5n6* is 1.2% lower than that of yolov5n6, which meets the requirements of the proposed strategy and shows that the improvements give yolov5n6* higher localization accuracy. As shown in Fig. 6(b), the mAP_0.5:0.95 of both models gradually approached 1 during training and stabilized after 400 epochs, at which point the models had reached a good fit. Compared with the original yolov5n6, the mAP_0.5:0.95 of yolov5n6* increased by 2%, indicating that the improved model obtained good training results.
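The switch from GIOU_Loss to CIOU_Loss can be illustrated for two axis-aligned boxes. A minimal sketch following the published CIoU formula (IoU minus a centre-distance penalty and an aspect-ratio consistency term); the box coordinates are illustrative:

```python
import math

def ciou(b1, b2, eps=1e-9):
    """CIoU for two boxes given as (x1, y1, x2, y2); CIoU loss is 1 - CIoU."""
    # Intersection and union areas
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (a1 + a2 - inter + eps)
    # Squared distance between box centres, over the squared diagonal of the
    # smallest box enclosing both
    cx1, cy1 = (b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2
    cx2, cy2 = (b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    cw = max(b1[2], b2[2]) - min(b1[0], b2[0])
    ch = max(b1[3], b2[3]) - min(b1[1], b2[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (
        math.atan((b1[2] - b1[0]) / (b1[3] - b1[1]))
        - math.atan((b2[2] - b2[0]) / (b2[3] - b2[1]))
    ) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

# Identical boxes score ~1, so the corresponding CIoU loss would be ~0.
print(round(ciou((0, 0, 2, 2), (0, 0, 2, 2)), 6))
```

Unlike plain IoU, this score still carries a gradient signal when the predicted and ground-truth boxes do not overlap, which is what makes the bounding-box regression converge to tighter localization.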
To further evaluate the performance of the improved model, both yolov5n6* and the original yolov5n6 were run on the test set. The detection results of the two models are compared in Table 5 and Fig. 7.
Table 5
Comparison of detection results
Model | Model size /M | Accuracy /% | Detection time /ms |
yolov5n6 | 6.6 | 96.28 | 6.9 |
yolov5n6* | 7.2 | 97.15 | 7.1 |
As can be seen from Table 5, the detection accuracy of yolov5n6* is 0.87% higher than that of yolov5n6, while its detection time increases by only 0.2 ms; its small size therefore keeps yolov5n6* suitable for deployment on a shelter-transporting AGV. Figure 7 shows the original image together with the detection results of yolov5n6 and yolov5n6*. The first row of images shows that the occluded object is not filtered out of the detection results, while the second row shows that introducing CBAM and switching to CIOU_Loss increases the confidence of the detections. This demonstrates that the attention mechanism and the improved loss function effectively improve the shelter-detection ability and make the yolov5n6* model more robust.
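The DIOU_nms step adopted in yolov5n6* replaces the plain IoU criterion of standard NMS with IoU minus a normalised centre-distance term, so heavily overlapping boxes whose centres are well separated (e.g., partially occluded shelters) are less likely to be suppressed. A minimal sketch for axis-aligned boxes; the score threshold and sample detections are illustrative:

```python
def diou(b1, b2, eps=1e-9):
    """DIoU = IoU - (centre distance)^2 / (enclosing-box diagonal)^2."""
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (a1 + a2 - inter + eps)
    rho2 = ((b1[0] + b1[2] - b2[0] - b2[2]) ** 2
            + (b1[1] + b1[3] - b2[1] - b2[3]) ** 2) / 4
    cw = max(b1[2], b2[2]) - min(b1[0], b2[0])
    ch = max(b1[3], b2[3]) - min(b1[1], b2[1])
    return iou - rho2 / (cw ** 2 + ch ** 2 + eps)

def diou_nms(dets, thr=0.5):
    """dets: list of (score, box). Keep a box unless its DIoU with an
    already-kept, higher-scoring box exceeds the threshold."""
    keep = []
    for score, box in sorted(dets, reverse=True):
        if all(diou(box, kept) <= thr for _, kept in keep):
            keep.append((score, box))
    return keep

# Two heavily overlapping boxes collapse to one; a well-separated box survives.
dets = [(0.9, (0, 0, 2, 2)), (0.8, (0.1, 0.1, 2.1, 2.1)), (0.7, (3, 0, 5, 2))]
print(len(diou_nms(dets)))
```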