Yolov5-Pest. The network structure of the YOLOv5 algorithm is divided into three modules: backbone, neck, and head. The backbone performs feature extraction, the neck performs feature fusion, and the head performs target detection[11]. The backbone module uses a Cross Stage Partial Network (CSPNet) and Spatial Pyramid Pooling - Fast (SPPF) to extract features from the input image and pass them to the neck module. The neck module uses a Path Aggregation Network (PANet) to build feature pyramids that bidirectionally fuse low-level spatial features with high-level semantic features, enhancing the detection of objects at different scales. The head module predicts on feature maps at three different scales; based on these multi-scale features, it generates prediction boxes for the target image and determines the category, coordinates, and confidence of each detected object.
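For intuition, the following minimal Python snippet (an illustration, not the authors' code) prints the grid sizes of the three prediction scales, assuming a 640×640 input, YOLOv5's default strides of 8, 16, and 32, and 3 anchors per scale:

```python
# Illustrative only: grid sizes and channel counts at YOLOv5's three head scales,
# assuming a 640x640 input, default strides (8, 16, 32), and 3 anchors per scale.
num_classes = 7                   # e.g. the pest categories used later in this study
per_anchor = num_classes + 5      # (x, y, w, h, objectness) + per-class scores
for stride in (8, 16, 32):
    g = 640 // stride             # grid resolution at this scale
    print(f"stride {stride:2d}: {g}x{g} grid, {3 * per_anchor} output channels")
# stride  8 -> 80x80 grid (small objects)
# stride 16 -> 40x40 grid (medium objects)
# stride 32 -> 20x20 grid (large objects)
```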
Depending on network width and depth, YOLOv5 comes in four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Among them, YOLOv5s offers the smallest model size while maintaining both detection speed and detection accuracy. Therefore, this article chooses YOLOv5s as the basic framework.
This article proposes a small-target detection algorithm for agricultural pests based on an improved YOLOv5 architecture. YOLOv5 is improved mainly in two aspects: the backbone network and feature fusion. The goal is to enrich spatial and semantic information and improve detection accuracy while maintaining running speed.
Firstly, the C3 module of the YOLOv5 backbone and the PANet structure of the neck are replaced by the C3CBAM module and the BiFPN structure, respectively. The C3CBAM module extracts pest image features and increases the weight of pest target regions in the feature map along both the channel and spatial dimensions; the BiFPN structure adds a path from high resolution to low resolution, improving the efficiency of the feature fusion process. In addition, a C3CA module is added to the neck to strengthen image feature extraction and residual feature learning, further improving detection performance. The overall structure is shown in Fig. 1.
The loss function of the improved model consists of the classes loss, the objectness loss, and the location loss. During training, the full objective function can be written as follows:
$$Loss = \lambda_{1}L_{cls} + \lambda_{2}L_{obj} + \lambda_{3}L_{loc} \quad (1)$$
where \({L}_{cls}\) is the classes loss, \({L}_{obj}\) is the objectness loss, \({L}_{loc}\) is the location loss, and \({\lambda }_{1}\), \({\lambda }_{2}\), and \({\lambda }_{3}\) are their weighting coefficients. Both the classification loss and the confidence loss use the BCE loss; the difference is that the classes loss is computed only over positive samples, whereas the objectness loss is computed over all samples. Here, objectness refers to the CIoU between the bounding box predicted by the network and the ground-truth (GT) box. The binary cross-entropy loss is defined as:
$$L = -y\log p - \left(1-y\right)\log\left(1-p\right) \quad (2)$$
When the input sample is positive, \(y\) is 1; when it is negative, \(y\) is 0. \(p\) is the probability predicted by the model that the input sample is positive. The location loss is the CIoU loss and is computed only for positive samples. In this study, we replace the CIoU loss with the DIoU loss, calculated as shown in Eq. (3):
$${L}_{DIoU} = 1 - \left(IoU - \frac{{\rho }^{2}\left(b,{b}^{gt}\right)}{{c}^{2}}\right) \quad (3)$$
where \(IoU\) is the intersection over union between the predicted and ground-truth boxes, \(b\) is the center point of the predicted box, \({b}^{gt}\) is the center point of the ground-truth box, \(\rho \left(\bullet \right)\) is the Euclidean distance, and \(c\) is the diagonal length of the smallest enclosing box covering the predicted and ground-truth boxes.
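As a concrete reference, the following PyTorch sketch computes the DIoU loss of Eq. (3) for axis-aligned boxes in (x1, y1, x2, y2) format. It is a minimal illustration under these assumptions, not the authors' implementation (YOLOv5 computes the same quantity inside its bbox_iou utility):

```python
import torch

def diou_loss(box1, box2, eps=1e-7):
    """DIoU loss of Eq. (3) for (x1, y1, x2, y2) boxes of shape (N, 4)."""
    # Intersection
    ix1 = torch.max(box1[:, 0], box2[:, 0])
    iy1 = torch.max(box1[:, 1], box2[:, 1])
    ix2 = torch.min(box1[:, 2], box2[:, 2])
    iy2 = torch.min(box1[:, 3], box2[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)

    # Union and IoU
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    iou = inter / (area1 + area2 - inter + eps)

    # rho^2: squared distance between the two box centers
    rho2 = ((box1[:, 0] + box1[:, 2] - box2[:, 0] - box2[:, 2]) ** 2 +
            (box1[:, 1] + box1[:, 3] - box2[:, 1] - box2[:, 3]) ** 2) / 4

    # c^2: squared diagonal of the smallest box enclosing both
    cw = torch.max(box1[:, 2], box2[:, 2]) - torch.min(box1[:, 0], box2[:, 0])
    ch = torch.max(box1[:, 3], box2[:, 3]) - torch.min(box1[:, 1], box2[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    return 1 - (iou - rho2 / c2)   # Eq. (3), one loss value per box pair
```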
C3CBAM. In object detection tasks, the importance of a target's features varies across channels, and the importance of pixels at different positions within each channel also varies. Small targets such as pests, or occluded targets, occupy few pixels in the feature map, so their feature information is easily lost in deep networks. Only by considering both levels of importance simultaneously can the model identify target objects more accurately. Attention mechanisms in neural networks focus on information of interest and ignore irrelevant information, enhancing important features and suppressing general ones. Among them, the CBAM[12] attention module combines spatial and channel attention: it effectively increases the weight of occluded or small targets in the feature map and can be easily embedded into any existing framework. It is a lightweight, simple, and effective convolutional neural network attention module. CBAM computes attention weights along the channel and spatial dimensions of the input feature map and multiplies them with the input feature map to obtain a new feature map, which facilitates extracting key information. The structure of the CBAM module is shown in Fig. 2.
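A minimal PyTorch sketch of the CBAM block described above follows; the layer sizes (reduction ratio 16, 7×7 spatial kernel) are the CBAM paper's defaults and are assumptions here, not values taken from this study:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Channel attention: global average- and max-pooled descriptors pass through
    # a shared MLP, are summed, and squashed by a sigmoid into channel weights.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean((2, 3), keepdim=True))
        mx = self.mlp(x.amax((2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    # Spatial attention: channel-wise mean and max maps are concatenated and
    # convolved into a single-channel sigmoid mask over spatial positions.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(s))

class CBAM(nn.Module):
    # Applies channel attention, then spatial attention, to the input feature map.
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```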
The backbone feature extraction network of the original YOLOv5 adopts the C3 structure shown in Fig. 3(a). To strengthen image feature extraction and residual feature learning, the CBAM attention module is used in place of the bottleneck module inside C3, improving the detection of small and occluded targets. The overall structure of the C3CBAM module is shown in Fig. 3(b); it consists of multiple CBAM attention modules (the actual number N is the product of the n and depth_multiple parameters in the model's .yaml configuration file) and three standard convolution layers.
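Based on that description, one possible wiring of C3CBAM replaces the bottleneck stack of C3 with N CBAM blocks. The sketch below reuses the CBAM class from the previous listing and a simplified YOLOv5-style Conv; it is an assumption about the structure, not the authors' released code:

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    # Simplified YOLOv5-style standard convolution: Conv2d -> BatchNorm -> SiLU.
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C3CBAM(nn.Module):
    # C3 wiring with the bottleneck stack replaced by N CBAM blocks: two parallel
    # 1x1 branches, CBAM applied on one branch, concatenation, then a 1x1 fuse.
    # Assumes the CBAM class from the previous sketch is in scope.
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2                                   # hidden channels
        self.cv1 = Conv(c1, c_, 1)
        self.cv2 = Conv(c1, c_, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*(CBAM(c_) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```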
C3CA. The CA mechanism performs average pooling along the horizontal and vertical directions separately and fuses the weighted spatial information through positional encoding. The detailed procedure of the coordinate attention block is shown in Fig. 4(a), and its principle is described in reference [13]; it helps the network attend to informative coordinates and suppress uninformative ones, improving the efficiency of information flow. The overall structure of the C3CA module is shown in Fig. 4(b); it consists of multiple CA attention modules (the actual number N is the product of the n and depth_multiple parameters in the model's .yaml configuration file) and three standard convolution layers.
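For reference, a compact PyTorch sketch of the coordinate attention block of [13] is given below; C3CA then follows the same wiring as the C3CBAM sketch above with CoordAtt in place of CBAM. The reduction ratio of 32 is the CA paper's default, and SiLU stands in for the paper's activation; both are assumptions:

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    # Coordinate attention: pool along H and W separately, encode the two
    # directional descriptors jointly, then split them into per-direction gates.
    def __init__(self, channels, reduction=32):
        super().__init__()
        c_ = max(8, channels // reduction)             # hidden width
        self.conv1 = nn.Conv2d(channels, c_, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(c_, channels, 1)
        self.conv_w = nn.Conv2d(c_, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(3, keepdim=True)                          # (n, c, h, 1)
        x_w = x.mean(2, keepdim=True).permute(0, 1, 3, 2)      # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # height gate
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # width gate
        return x * a_h * a_w
```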
BiFPN. Feature extraction networks help the model understand the context and content of images. To enhance the fusion of feature information at different scales, this study replaces the PANet structure of the original YOLOv5 with the BiFPN structure, as shown in Fig. 5. BiFPN has two core ideas [14]. First, compared with the original FPN structure, it adds a path from high resolution to low resolution, improving the efficiency of feature fusion. Second, it removes nodes that receive input from only a single node, making BiFPN more efficient and lightweight than PANet. Backbone features at different scales are fused through upsampling and downsampling to unify their resolutions, and horizontal connections between the original input and output nodes at the same scale reduce the feature information lost across many network layers. Therefore, as the model deepens, feature fusion becomes increasingly comprehensive.
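The weighted (fast normalized) fusion that [14] applies at each BiFPN node can be sketched as follows; this mirrors the BiFPN paper's formulation and is illustrative rather than the exact fusion layer used in this study:

```python
import torch
import torch.nn as nn

class BiFPNFusion(nn.Module):
    # Fast normalized fusion from the BiFPN paper [14]: each incoming feature map
    # gets a learnable non-negative weight, normalized so the weights sum to ~1.
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, xs):
        # xs: feature maps already resized (up/downsampled) to a common resolution
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * xi for wi, xi in zip(w, xs))

# e.g. a node fusing a top-down feature with the lateral backbone feature:
# fuse = BiFPNFusion(2); p4 = fuse([p4_td, p4_in])
```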
Datasets. Wu et al.[15] proposed IP102, a large-scale benchmark dataset for pest detection and identification. It contains over 75,000 images across 102 categories, exhibiting a natural long-tailed distribution, and provides 19,000 images annotated with bounding boxes for object detection. It covers pests at different life stages, including eggs, larvae, pupae, and adults, with labels for categories such as rice leaf caterpillar, rice stem maggot, and cicada. Its main characteristics are: (1) a hierarchical classification system; (2) a natural long-tailed distribution; (3) an unbalanced data distribution; (4) a rich variety of pests; (5) small inter-class differences but large intra-class differences.
In this study, we selected seven specific categories from the IP102 dataset: brown planthopper, Asian rice borer, corn borer, rice leaf roller, rice leafhopper, wheat thrips, and beet armyworm, to form a new dataset of 7526 images. We chose these categories because these pests mainly occur in rice, wheat, corn, and sugar beet, and thus represent typical pests of common crops. Some images from the dataset are shown in Fig. 6.
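A hypothetical sketch of how such a subset can be extracted from IP102's detection annotations is shown below; the directory layout, VOC-style XML format, and category spellings are placeholders for illustration, not the authors' actual pipeline:

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical category spellings; match them to the names in your IP102 copy.
SELECTED = {"brown planthopper", "asian rice borer", "corn borer",
            "rice leaf roller", "rice leafhopper", "wheat thrips",
            "beet armyworm"}

src_ann = Path("IP102/Annotations")    # assumed VOC-style XML annotation folder
src_img = Path("IP102/JPEGImages")     # assumed image folder
dst = Path("pest7")
(dst / "images").mkdir(parents=True, exist_ok=True)
(dst / "annotations").mkdir(parents=True, exist_ok=True)

kept = 0
for xml_file in src_ann.glob("*.xml"):
    root = ET.parse(xml_file).getroot()
    names = {obj.findtext("name", "").lower() for obj in root.iter("object")}
    if names & SELECTED:               # image contains at least one selected pest
        shutil.copy(src_img / (xml_file.stem + ".jpg"), dst / "images")
        shutil.copy(xml_file, dst / "annotations")
        kept += 1
print(f"kept {kept} images")
```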