Improved YOLOv5 network method for ground object recognition in remote sensing images

High-resolution remote sensing images are characterized by complex backgrounds and densely clustered objects. The complex background introduces many irrelevant ground objects with high similarity to, or overlap with, the targets, leaving object edges and textures unclear; as a result, the recognition accuracy of ground objects such as airports, dams, and golf fields remains low even though these objects are large. To address this problem, this paper proposes a remote sensing image object detection method based on the YOLOv5 network. The backbone extraction network is improved by deepening the network structure to obtain more information about large objects, and detection is further improved by adding an attention mechanism and an additional output layer to enhance feature extraction and feature fusion. Pre-training weights obtained by transfer learning are used as the initial training weights of the improved YOLOv5 to speed up network convergence. Experiments on the DIOR dataset show that the improved YOLOv5 network significantly improves the recognition accuracy of large objects compared with the YOLO series networks and the EfficientDet model; its mAP is 80.5%, 2% higher than that of the original YOLOv5 network.


Introduction
With the development of remote sensing technology, remote sensing has been widely applied in modern industry and daily life (Wei and Liu 2021), and object recognition in remote sensing images has attracted increasing attention. At present, it is widely used in many fields, such as military reconnaissance, environmental monitoring, urban construction (Guo et al. 2021), climatic data analysis (Zhou 2021), and building damage detection (Wu et al. 2021). Traditional object detection algorithms can be divided into three stages: first, candidate regions are obtained by image segmentation or the sliding-window method; then, features are extracted from each region; finally, the extracted features are fed into a classifier for classification and recognition. Common feature extraction methods include the scale-invariant feature transform (SIFT) (Lowe 2004), speeded-up robust features (SURF) (Herbert et al. 2008), and the histogram of oriented gradients (HOG) (Dalal and Triggs 2005); common classifiers include support vector machines (SVM) (Melgani and Bruzzone 2004), Haar (Viola and Jones 2004), and AdaBoost (Soui et al. 2021). Because remote sensing images feature dense ground objects and complex environments, traditional detection algorithms require heavy computation and are inefficient. In recent years, with the continuous development of deep learning, object detection based on convolutional neural networks (CNNs) (LeCun et al. 2015) has been widely applied in various fields and has gradually replaced traditional detection methods. Deep learning-based object detection algorithms fall into two kinds: those based on candidate regions and those based on regression.
The former, also known as two-stage models, divide object detection into two phases, generating region proposals and then classifying them and correcting their positions; examples include R-CNN (Girshick et al. 2014), SPP-Net (He et al. 2014), Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2017), and Mask R-CNN (Kaiming et al. 2017). The latter directly regress the predicted objects to generate bounding boxes; examples include You Only Look Once (YOLO) (Redmon et al. 2016; Redmon and Farhadi 2017; Redmon 2018; Bochkovskiy et al. 2020; Ultralytics 2020; Liu et al. 2022), the Single Shot MultiBox Detector (SSD) (Wei et al. 2016), CenterNet (Zhou et al. 2019), and EfficientDet (Tan et al. 2020).
To address the problems of remote sensing images, researchers have made unremitting efforts based on CNNs. Chen et al. (2021) proposed a Domain Adaptation Faster R-CNN algorithm to improve the robustness of the model and widen its scope of application, and proved its effectiveness. Han et al. (2021) proposed a remote sensing image building detection algorithm combining Mask R-CNN with a traditional object detection algorithm; this method improves detection accuracy and reduces computation time. Li et al. (2021) proposed a lightweight keypoints-based oriented object detector for remote sensing images in view of their complex backgrounds. Xu et al. (2020) proposed a YOLOv3-based algorithm for detecting remote sensing targets at different scales, so that detection dominated by small targets maintains speed while improving average accuracy. Zhou et al. (2021) proposed the multiscale detection network (MSDN) to address the small size of aircraft and the deeper and wider module (DAWM) to resist background noise; the DAWM is then introduced into the MSDN, and the resulting structure is named the multiscale refined detection network (MSRDN). Lu (2021) proposed a detection algorithm based on an improved SSD to address small and dense objects in remote sensing images, a new loss function to speed up network convergence, and the Laplace-NMS method, which has a good post-processing effect on dense objects. Much research has focused on small-object detection in remote sensing images, but experiments show that the recognition of some large objects in complex environments is still unsatisfactory. Therefore, the purpose of this paper is to propose an algorithm for large-object detection.
In relatively complex environments, some large objects in remote sensing images have colors similar to the background and unclear textures, which makes their features difficult to extract and leads to low recognition accuracy. This paper therefore presents a ground-object recognition method for remote sensing images based on YOLOv5 (Ultralytics 2020). The YOLOv5s network structure is improved by adding a group of C3 structures and an attention mechanism to the backbone extraction network and by extending the output to four feature layers. Experiments show that the improved YOLOv5s network achieves higher accuracy in large ground-object detection than the original YOLOv5s.

YOLOv5 network
The YOLOv5 network is the latest achievement of the YOLO series. Its network structure is similar to that of YOLOv4; however, it is smaller, converges and runs faster, and is a lightweight algorithm, while also improving accuracy. Therefore, this paper adopts the YOLOv5 algorithm for object detection in remote sensing images.
The YOLOv5 network structure consists of four parts: input, backbone, neck, and prediction; it is shown in Fig. 1. There are four variants, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which share a similar structure but differ in depth and width. YOLOv5s has the smallest structure and shallowest depth, the fastest running speed, and the lowest accuracy; the other three become progressively deeper and wider, with steadily improved accuracy but slower running speed.
The input uses mosaic data enhancement, adaptive anchor box calculation, and image scaling to process the input dataset. The backbone adopts the Focus structure and the CSP structure. Focus improves network speed and reduces floating-point operations (FLOPs) by slicing the input image; the Focus structure is shown in Fig. 1.1. YOLOv5 uses two CSP structures, CSP1_X and CSP2_X: CSP1_X is used for down-sampling in the backbone, and CSP2_X is used in the neck. CSP improves the learning ability of the network and maintains accuracy while reducing computation; the structure diagrams of the two CSPs are shown in Fig. 1.1. The neck adopts the SPP-net and FPN + PAN structures to enhance the feature fusion of the network. The prediction part adopts GIOU_Loss (Rezatofighi et al. 2019); GIOU pays attention not only to the overlapping area between the prediction box and the ground truth but also to the non-overlapping area, solving the problems of IOU (Yu et al. 2016) while keeping its advantages. The calculations in Eqs. (1) and (2) are as follows:

IOU = |A ∩ B| / |A ∪ B|   (1)

GIOU = IOU − |C − (A ∪ B)| / |C|   (2)

where A is the prediction box, B is the ground-truth box, and C is the smallest box enclosing both.

In addition to the four different networks, the version of YOLOv5 is also constantly updated. This article uses v5.0 of YOLOv5s. Compared with v4.0, v5.0 changes all activation functions in the network to SiLU (Elfwing et al. 2017) and deletes the conv in the CSP, renaming the structure C3, as shown in Fig. 2. Therefore, the network structure of v5.0 is smaller and faster than that of v4.0.
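As a concrete illustration of Eqs. (1) and (2), the following is a minimal sketch of the IOU and GIOU computations for axis-aligned boxes; the function name and the (x1, y1, x2, y2) box format are illustrative, not taken from the YOLOv5 source.

```python
def iou_giou(box_a, box_b):
    """Return (IOU, GIOU) for two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle A ∩ B
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter          # |A ∪ B|
    iou = inter / union                      # Eq. (1)

    # Smallest enclosing box C; its area minus the union is the
    # non-overlapping region that GIOU additionally penalizes
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c   # Eq. (2)
    return iou, giou
```

For identical boxes GIOU equals IOU equals 1; for disjoint boxes GIOU goes negative, which is what gives the loss a useful gradient even when boxes do not overlap.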

Backbone improvement
Fusing features at different scales often yields more useful object information. Low-level features have higher resolution, smaller receptive fields, more texture information, and more noise, making them suitable for detecting small objects; high-level features have lower resolution and poorer perception of object details, but larger receptive fields, making them suitable for detecting large objects. In the DIOR dataset, the complex background leads to unsatisfactory detection of some large ground objects. In this paper, a group of C3 structures is added to the backbone of YOLOv5s, changing the original three groups of C3 to four. This deepens the overall network structure, which effectively improves the expressive power of the network and its ability to learn larger ground objects, and in turn the detection accuracy of the model.
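A back-of-the-envelope sketch of why an extra down-sampling stage suits large objects: each detection level divides the input into grid cells whose pixel coverage equals the level's stride. The assumption that the added group contributes a stride-64 level (on top of YOLOv5's usual 8/16/32) is illustrative; the paper does not state the exact stride, and 640 is YOLOv5's default training resolution, not DIOR's 800.

```python
# Illustrative only: grid geometry per detection level, assuming the added
# C3 group yields a stride-64 level (an assumption, not stated in the paper).

def grid_cells(input_size, strides):
    """For each stride, return (cells per side, pixels covered per cell)."""
    return [(input_size // s, s) for s in strides]

original = grid_cells(640, [8, 16, 32])      # three output levels
improved = grid_cells(640, [8, 16, 32, 64])  # with one coarser level added
# A stride-64 cell spans 64 x 64 input pixels, so a few large cells with
# large receptive fields are responsible for the big ground objects.
```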

Attentional mechanism
The attention mechanism is inspired by the human visual attention mechanism: it focuses on local information and suppresses redundant information; in other words, it enables the network to find the significant information among a multitude of inputs. In this paper, an attention mechanism (Jie et al. 2020) is added to the backbone of YOLOv5s to capture the information carried by different feature channels and their relative importance, and then to suppress channels that are useless for the detected object. In this way, network performance is enhanced at the cost of only a small amount of extra computation. The improved backbone structure is shown in Fig. 3.
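A minimal numpy sketch of channel attention in the squeeze-and-excitation style (assuming that is what the Jie et al. citation refers to): global pooling summarizes each channel, a small bottleneck produces per-channel weights, and the feature map is re-scaled. The weight matrices here are random placeholders; in the network they are learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feat, w1, w2):
    """feat: (C, H, W) feature map; w1: (C//r, C); w2: (C, C//r)."""
    # Squeeze: global average pooling collapses each channel to one scalar
    z = feat.mean(axis=(1, 2))                  # shape (C,)
    # Excitation: bottleneck MLP + sigmoid yields per-channel weights in (0, 1)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # shape (C,)
    # Re-scale: suppress uninformative channels, emphasize useful ones
    return feat * s[:, None, None]
```

Because every scaling factor lies in (0, 1), the block can only attenuate channels, which is exactly the "suppress useless channels" behavior described above, and it adds only two small matrix multiplications per feature map.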

Neck improvement
The neck adopts the FPN (Lin et al. 2017) + PAN structure, which adds a bottom-up feature pyramid network after the FPN to enhance semantic expression and location information at multiple scales. The neck of YOLOv5s integrates the CSP2_X structure to enhance the feature fusion effect of the network. Since a group of C3 structures is added to the backbone in this paper, an output layer is added to the neck to further improve feature extraction. The improved FPN + PAN structure is shown in Fig. 4.

Remote sensing image ground objects recognition method based on improved YOLOv5 network
The remote sensing image ground-object recognition method based on the improved YOLOv5 network first recalculates the anchor boxes and modifies the data configuration file and the network configuration file; the pre-training weights are then obtained by transfer learning. Training then begins: the data configuration file is read, the network model is parsed and loaded, and the training weights are obtained and saved. Finally, detection is carried out: the trained network model is used for prediction, and images annotated with the predicted class and bounding box are generated and output. The process of the method is shown in Algorithm 1.
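A simplified sketch of the "recalculate the anchor boxes" step: clustering the (width, height) pairs of the training-set boxes. YOLOv5's own autoanchor uses a 1 − IOU distance plus a genetic refinement; plain k-means with Euclidean distance stands in for it here, and all names are illustrative.

```python
import numpy as np

def kmeans_anchors(wh, k, iters=50, seed=0):
    """wh: (N, 2) array of box widths/heights; returns k anchors (w, h),
    sorted from smallest to largest by area."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct training boxes
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to its nearest anchor (Euclidean distance here;
        # 1 - IOU distance is the more common choice in YOLO pipelines)
        d = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):        # guard against empty clusters
                centers[j] = wh[labels == j].mean(0)
    return centers[np.argsort(centers.prod(1))]
```

The resulting anchors replace the COCO defaults in the network configuration file, so each output layer starts from box shapes that actually occur in the remote sensing data.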

Experimental results and analysis
In this paper, the DIOR dataset is used to verify the detection effect of the improved YOLOv5s network; precision, recall, mAP@0.5, and mAP@0.5:0.95 are used as performance indicators of the evaluated model. Finally, the improved YOLOv5s network is compared with the original YOLOv5s network, the YOLO series networks, and the EfficientDet model on the DIOR dataset.

DIOR dataset

The DIOR dataset is an open large-scale dataset proposed by Northwestern Polytechnical University for object detection in optical remote sensing images. The images are 800 × 800 pixels; the dataset comprises 23,463 images and 190,288 object instances across 20 object classes: airplane, airport, basketball court, baseball field, bridge, chimney, dam, expressway toll station, expressway service area, golf field, ground track field, harbor, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. Dataset examples are shown in Fig. 5.

Evaluation index
Precision represents the proportion of objects predicted as correct that are actually correct, Eq. (3); recall represents the proportion of actually correct objects that are predicted as correct, Eq. (4). The two typically trade off: as one rises, the other tends to fall. The P-R curve is obtained by taking precision as the vertical axis and recall as the horizontal axis, and the area enclosed by the curve and the coordinate axes is the average precision (AP), Eq. (5). The P-R curve of the improved YOLOv5s on the DIOR dataset is shown in Fig. 6. mAP is the mean of the AP over all classes, Eq. (6). mAP@0.5 denotes the mAP of all categories when the IOU threshold is 0.5; mAP@0.5:0.95 denotes the average mAP over IOU thresholds from 0.5 to 0.95.

Precision = TP / (TP + FP)   (3)

Recall = TP / (TP + FN)   (4)

AP = ∫₀¹ P(R) dR   (5)

mAP = (1/N) Σᵢ APᵢ   (6)

where TP is the number of instances in which a correct object is identified as correct; FP is the number of instances in which a wrong object is identified as correct; FN is the number of instances in which a correct object is identified as wrong; N is the total number of object classes (N = 20 in this paper); and APᵢ is the AP of the ith class.
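Eqs. (3)-(6) can be sketched directly in code, assuming the per-class TP/FP/FN counts and the sampled P-R curve are already available (the box-matching step that produces them is omitted); the AP integral is approximated by the trapezoidal rule.

```python
def precision(tp, fp):
    return tp / (tp + fp)                    # Eq. (3)

def recall(tp, fn):
    return tp / (tp + fn)                    # Eq. (4)

def average_precision(recalls, precisions):
    """Eq. (5): area under the P-R curve, via the trapezoidal rule.
    recalls must be sorted in ascending order."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * \
                (precisions[i] + precisions[i - 1]) / 2.0
    return area

def mean_average_precision(aps):
    return sum(aps) / len(aps)               # Eq. (6), N = len(aps)
```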

Network training
The experimental environment for model training is as follows: the graphics card is a GeForce GTX 1080 Ti; the CUDA version is 11.2 and the cuDNN version is 8.1; the language is Python 3.8; the batch size is 16; and the number of epochs is 200. Firstly, transfer learning is adopted: the overpass class of the RSOD dataset (Long et al. 2017) and the NWPU VHR-10 dataset are used to obtain the pre-training weights.
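With the Ultralytics YOLOv5 repository, a training run with these settings might be launched as follows. The flags are the standard train.py options; the dataset configuration (dior.yaml), the modified network configuration (yolov5s_improved.yaml), and the weights file name are hypothetical placeholders, not the authors' actual files.

```shell
# Hypothetical invocation: config and weight file names are placeholders.
python train.py --img 800 --batch 16 --epochs 200 \
    --data dior.yaml --cfg yolov5s_improved.yaml \
    --weights pretrained.pt
```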

Result analysis
In this paper, the improved YOLOv5s is compared with the original YOLOv5s on the DIOR dataset, with precision, recall, mAP@0.5, and mAP@0.5:0.95 as performance indicators, as shown in Table 1. According to the table, the improved YOLOv5s surpasses the original YOLOv5s in precision, mAP@0.5, and mAP@0.5:0.95, while its recall is slightly lower than that of the original.
The test results are shown in Fig. 7. In each group of images, the upper row shows the improved YOLOv5s results and the lower row the original YOLOv5s results. In group (a), the ship objects are small and densely distributed and cover the larger harbor objects, which interferes with harbor recognition; the improved YOLOv5s identifies the harbors more accurately than the original YOLOv5s. In group (b), the complex background makes it difficult to extract the features of the train station, yet the improved YOLOv5s identifies it successfully. The bridge and overpass classes are highly similar, and the original YOLOv5s mislabels the category, while the improved YOLOv5s correctly identifies the bridge with higher accuracy. The ground track field differs in size from the other objects, and the improved YOLOv5s effectively improves its detection accuracy. In group (c), the overpass, golf field, and airport blend into the background and their edges are unclear, which increases the difficulty of recognition; the accuracy of the improved YOLOv5s is significantly higher.
To further verify the effectiveness of the improved YOLOv5s, it is compared with the YOLO series networks and the EfficientDet model on the DIOR dataset, using the AP of the 20 ground-object classes and the mAP as evaluation indicators. The experimental results are shown in Table 2. The mAP of the improved YOLOv5s is significantly higher than that of YOLOv3, YOLOv4, the original YOLOv5s, and EfficientDet. In particular, the detection accuracy of large ground objects in complex environments, such as airports, dams, golf fields, harbors, and train stations, is greatly improved, and the accuracy of the other ground objects also improves to varying degrees. It can be concluded that the improved YOLOv5s network has a good detection effect on large ground objects among the YOLO series networks.

Conclusion
In the recognition of large ground objects in remote sensing images, complex backgrounds leave object edges unclear and make feature extraction difficult, resulting in poor detection performance. Aiming at these problems, this paper proposes a remote sensing image ground-object recognition method based on the YOLOv5 network. Using the DIOR dataset, the anchor boxes are first recalculated; then a C3 group and an attention mechanism are added to the backbone to extract more features of large ground objects by deepening the network structure; finally, an output layer is added to the neck to enhance feature fusion, which improves the ground-object recognition of the YOLOv5s network. The experimental results show that, compared with the original YOLOv5s network, the YOLO series networks, and the EfficientDet model, the improved YOLOv5s network improves the recognition accuracy of large ground objects in complex environments, with an mAP of 80.5%, 2% higher than that of the original YOLOv5s network. In future work, while maintaining the existing advantages, we will further study improving the recognition accuracy of small ground objects.
Author contributions All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by JX. The first draft of the manuscript was written by JX, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data availability The RSOD datasets generated and/or analyzed during the current study are available in GitHub-RSIA-LIESMARS-WHU/RSOD-Dataset-an open dataset for object detection in remote sensing images. The DIOR and NWPU VHR-10 datasets analyzed during the current study are not publicly available due to link failure but are available from the corresponding author on reasonable request.

Declarations
Conflict of interest The authors have no relevant financial or nonfinancial interests to disclose.