In this paper, the DIOR dataset [39] is used to verify the detection performance of the improved YOLOv5s network; precision, recall, mAP@0.5 and mAP@0.5:0.95 are used as performance indicators to evaluate the model. Finally, the improved YOLOv5s network is compared with the original YOLOv5s network, other YOLO-series networks and the EfficientDet model on the DIOR dataset.
4.1. Dataset introduction
The DIOR dataset is an open, large-scale dataset proposed by Northwestern Polytechnical University for object detection in optical remote sensing images. The images are 800×800 pixels in size, and the dataset contains 23,463 images and 190,288 object instances covering 20 object classes: airplane, airport, basketball court, baseball field, bridge, chimney, dam, expressway toll station, expressway service area, golf field, ground track field, harbor, overpass, ship, stadium, storage tank, tennis court, train station, vehicle and windmill. Example images from the dataset are shown in Figure 5.
4.2. Evaluation index
Precision represents the proportion of predicted objects that are actually correct, as defined in formula (3); Recall represents the proportion of ground-truth objects that are correctly detected, as defined in formula (4). The two indicators trade off against each other: as one rises, the other tends to fall. Taking Precision as the vertical axis and Recall as the horizontal axis yields the P-R curve, and the area enclosed by this curve and the coordinate axes is the average precision (AP), defined in formula (5). The P-R curve of the improved YOLOv5s on the DIOR dataset is shown in Figure 6. mAP is the mean of the AP values over all object classes, defined in formula (6). mAP@0.5 denotes the mAP over all categories at an IoU threshold of 0.5, and mAP@0.5:0.95 denotes the mAP averaged over IoU thresholds from 0.5 to 0.95.
$$Precision=\frac{TP}{TP+FP}\tag{3}$$
$$Recall=\frac{TP}{TP+FN}\tag{4}$$
$$AP={\int }_{0}^{1}P\left(R\right)\,dR\tag{5}$$
$$mAP=\frac{1}{N}\sum _{i=1}^{N}{AP}_{i}\tag{6}$$
Where \(TP\) is the number of instances in which a correct object is identified as correct; \(FP\) is the number of instances in which a wrong object is identified as correct; \(FN\) is the number of instances in which a correct object is identified as wrong; \(N\) is the total number of object classes, which is 20 in this paper; and \({AP}_{i}\) is the AP of the i-th object class.
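As a concrete illustration of these indicators, the following minimal Python sketch computes precision, recall, AP and mAP from per-class detection results; the function names and the all-point interpolation used to approximate the integral in formula (5) are illustrative assumptions rather than the exact implementation used by YOLOv5.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from per-class counts (formulas (3) and (4))."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the P-R curve (formula (5)), all-point interpolation."""
    # Pad the curve so it starts at recall 0 and ends at recall 1.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing along the recall axis.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Integrate P(R) over the recall axis.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values (formula (6)); N = 20 classes here."""
    return float(np.mean(ap_per_class))
```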
4.3. Training and result analysis
4.3.1. Network training
The experimental environment for model training in this paper is as follows: the graphics card is a GeForce GTX 1080 Ti with its matching GPU driver, the CUDA version is 11.2, the cuDNN version is 8.1, the programming language is Python 3.8, the batch size is 16 and the number of training epochs is 200.
Firstly, following a transfer-learning approach, the overpass images in the RSOD dataset [40] and the NWPU VHR-10 dataset [41] are combined into a new dataset, which is used to pre-train the improved YOLOv5s; the resulting pre-training weights serve as the initial training weights of the improved YOLOv5s. Migrating the parameters of the pre-trained model to the new model in this way helps the new model train, improving the training speed of the improved YOLOv5s and accelerating network convergence.
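A minimal sketch of this weight-transfer step, assuming a pre-training checkpoint saved by the pre-training run (the checkpoint layout and function name below are illustrative assumptions, not the exact YOLOv5 pipeline), is given below.

```python
import torch

def load_pretrained_weights(model: torch.nn.Module, ckpt_path: str) -> torch.nn.Module:
    """Copy matching parameters from a pre-training checkpoint into the model.

    ckpt_path points to the checkpoint produced by pre-training on the merged
    RSOD / NWPU VHR-10 overpass data (the file name is a placeholder).
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # YOLOv5-style checkpoints usually store the model under the "model" key;
    # fall back to a plain state_dict otherwise (an assumption, not guaranteed).
    if isinstance(ckpt, dict) and "model" in ckpt:
        state = ckpt["model"].float().state_dict()
    else:
        state = ckpt
    model_sd = model.state_dict()
    # Transfer only parameters whose names and shapes match, so layers that were
    # modified (e.g. the added output feature layer) keep their fresh initialization.
    matched = {k: v for k, v in state.items()
               if k in model_sd and v.shape == model_sd[k].shape}
    model_sd.update(matched)
    model.load_state_dict(model_sd)
    return model
```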
The YOLOv5 network can automatically calibrate anchor boxes. Because an additional output feature layer is added to the network model in this paper, the k-means algorithm is used to recalculate the anchor boxes, and the original 9 anchor boxes are replaced by 12: (20,14), (19,38), (54,25), (38,76), (97,68), (84,189), (151,138), (208,230), (435,115), (207,491), (474,317) and (476,590).
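As an illustration of this step, the sketch below clusters ground-truth box widths and heights into 12 anchors with plain Euclidean k-means; YOLOv5's built-in autoanchor additionally uses an IoU-style fitness measure, so this simplified routine is an assumption rather than the exact procedure used here.

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 12, iters: int = 300, seed: int = 0) -> np.ndarray:
    """Cluster ground-truth box (width, height) pairs into k anchor boxes.

    wh: array of shape (n, 2) with box widths and heights in pixels.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to the nearest anchor (Euclidean distance on w, h).
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each anchor as the mean of its assigned boxes.
        new_centers = np.array([wh[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sort anchors by area so they can be assigned to the detection layers
    # from small objects to large objects.
    return centers[np.argsort(centers.prod(axis=1))]
```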
4.3.2. Result analysis
In this paper, the original YOLOv5s is compared with the improved YOLOv5s on the DIOR dataset, with precision, recall, mAP@0.5 and mAP@0.5:0.95 as the performance indicators for evaluating the networks, as shown in Table 1. According to the table, the improved YOLOv5s surpasses the original YOLOv5s in precision, mAP@0.5 and mAP@0.5:0.95, while its recall does not match that of the original YOLOv5s.
The test results are shown in Figure 7, with the detection images of the original YOLOv5s on the left and those of the improved YOLOv5s on the right. The images in the first row show that the ship objects are small, densely distributed and overlap the larger harbor objects; the improved YOLOv5s identifies the harbors more accurately than the original YOLOv5s. The second row shows that, under a complex background, the accuracy of the improved YOLOv5s for the large golf field object is nearly 50% higher than that of the original YOLOv5s. The images in the third row show that the overpass is particularly similar to the expressway: the original YOLOv5s classifies the expressway as an overpass, while the improved YOLOv5s identifies the overpass more accurately.
To further verify the effectiveness of the improved YOLOv5s, it is compared with the YOLO-series networks and the EfficientDet model on the DIOR dataset, using the AP of the 20 ground-object classes and the mAP as evaluation indicators. The experimental results are shown in Table 2.
It can be seen from the comparison in Table 2 that the mAP of the improved YOLOv5s is significantly higher than that of YOLOv3, YOLOv4, the original YOLOv5s and EfficientDet. In particular, the detection accuracy for large ground objects in complex environments, such as airport, dam, golf field, harbor and train station, is greatly improved, and the detection accuracy of the other ground objects is also improved to varying degrees. It is concluded that, among the YOLO-series networks, the improved YOLOv5s has a good detection effect on large ground objects.