YOLO-ERF: lightweight object detector for UAV aerial images

The application of object detection techniques in the field of unmanned aerial vehicles (UAVs) is an important research direction in computer vision. Because object detection in UAV aerial images needs to meet real-time requirements, a challenging problem in this technology is the trade-off between network parameters and detection accuracy. To solve this problem, this paper proposes a lightweight object detector family named YOLO-ERF. First, we propose the effective receptive field (ERF) module, which increases the convolutional kernel receptive field while preserving local details. The ERF module is then used to design a lightweight backbone that expands the network receptive field without attaching additional context modules after the backbone. In addition, the proposed detectors use the ERF module to optimize the path aggregation network structure, improving accuracy with fewer network parameters. Finally, a lightweight detection head is proposed to improve small object recognition in complex backgrounds. With these optimizations, the YOLO-ERF models achieve a better trade-off between accuracy and parameters than other mainstream models on the VisDrone and COCO datasets. Compared with YOLOv7-Tiny, YOLO-ERF-T reduced the number of network parameters by 40.3% while increasing the average precision by 2.4% and 1.9% on the VisDrone and COCO datasets, respectively.


Introduction
With their powerful mobility, drones are widely used in the military, agriculture, environmental monitoring, and other fields. Military drones can be used in scenarios such as battlefield reconnaissance and border patrol. In agriculture, unmanned aerial vehicles (UAVs) can spray pesticides accurately, efficiently, and in an environmentally friendly manner. In environmental monitoring, they are used to observe air conditions, collect environmental data from surrounding areas, and monitor factory emissions in real time. Computer vision techniques provide technical support for the development of UAVs. Object detection, a basic task in computer vision, identifies and locates all the objects in UAV aerial images. In recent years, deep learning based on convolutional neural networks has become the mainstream method for object detection. Two-stage methods, such as the R-CNN family [1][2][3], usually have high detection accuracy but slow inference speed. One-stage algorithms such as the SSD series [4][5][6] and YOLO series [7][8][9][10][11][12][13][14][15] are fast but slightly less accurate. Although these methods have achieved advanced performance on natural scene images (MS COCO [16]), they cannot achieve satisfactory results when detecting objects in aerial images or videos from drones.
The following challenges exist for object detection in UAV aerial photography scenarios: (1) Unlike natural scene images (Fig. 1, left), UAVs take images at low altitude, which results in images that contain many dense small objects (Fig. 1, right). Distant objects are very small, and dense scenes produce occlusion, making object detection more difficult. (2) Limited by hardware resources, general object detection algorithms reduce the size of high-resolution UAV aerial images (e.g., to 320 × 320 or 640 × 640) before inputting them into the model, which can cause small objects to be missed during detection. (3) UAV aerial image object detection must operate in real time, and hence the number of network parameters and the accuracy need to be balanced, even though there is an inherent tension between the two.
In recent years, many researchers have worked on efficient detection algorithms such as the lightweight models YOLOv5-N [11], PP-PicoDet [17], YOLOX-Nano [12], and YOLOv7-Tiny [13]. Despite achieving advanced detection results on the natural scene images of COCO, these lightweight models struggle to meet the accuracy requirements of UAV aerial images because of the challenges in aerial scenes mentioned above. Inspired by these algorithms, this paper presents YOLO-ERF, an anchor-free detector for UAV aerial images that is both lightweight and highly accurate. In summary, the main contributions of this paper are as follows:
• We propose a new effective receptive field (ERF) module that enhances the nonlinearity of the network by stacking multiple X blocks, preserving local details while increasing the receptive field of the network and allowing the model to exploit more contextual information.
• We redesigned the backbone using the ERF module. The new backbone network, ERFNet, expands the receptive field of the network while reducing network parameters, providing more meaningful feature representations for subsequent tasks. This approach removes the need to add additional contextual modules to extend the backbone receptive field.
• With the help of the ERF module, the structure of PAN [18] was improved and named ERF-PAN. It improves the feature extraction capability of the network, effectively utilizes multi-scale information, provides better object perception and aggregation capabilities, and reduces parameters. Additionally, a new lightweight detection head, ERF-Head, suitable for UAV aerial photography scenarios was designed; it makes the model pay more attention to regions of interest and provides more discriminative features for the network, effectively improving detection performance.
• We conducted extensive experiments to demonstrate the effectiveness of the proposed method. YOLO-ERF-T obtained 17.4% AP on the UAV dataset VisDrone [19], which is 2.4% higher than the AP of YOLOv7-Tiny, and 39.3% AP on the MS COCO validation set, which is 1.9% higher than that of YOLOv7-Tiny.

Receptive field
The receptive field is highly significant in convolutional neural networks. Usually, as the receptive field increases, the information available to the network also increases, thus improving the learning ability of the model. Forming multi-scale information through repeated up- and down-sampling leads to a partial loss of spatial information and the inability to reconstruct small object information, for which Yu et al. [20] proposed dilated convolution. Using dilated convolution to avoid downsampling, the receptive field of the network is effectively expanded without loss of resolution. Chen et al. [21] proposed the atrous spatial pyramid pooling module to address the problem of multiple scales. The module acquires multi-scale object information using multiple dilated convolutions with different dilation rates in parallel and finally fuses the branches to generate results. Wang et al. [22] proposed hybrid dilated convolution to solve the "gridding issue," which occurs when some pixels are never sampled because of improper selection of the dilation rates in successive dilated convolutions. The module alleviates the gridding issue by stacking several dilated convolutions with different dilation rates into one group, effectively exploiting global information while expanding the network's receptive field. To trade off computational overhead against detection performance, Liu et al. [23] proposed the receptive field block. The module consists of two parts: a multi-branch layer and a dilated convolution layer. The multi-branch layer reuses features by using different convolution kernels, and the dilated convolution layer increases the receptive field. Liu et al. applied receptive field blocks to lightweight SSD networks to improve detection accuracy while achieving real-time performance.
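The receptive-field arithmetic behind these designs can be sketched with the standard formulas; the kernel sizes and dilation rates below are illustrative examples, not values taken from any of the cited models:

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return d * (k - 1) + 1

def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions, given a list of
    (kernel, dilation) pairs, one per layer."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Three plain 3x3 convolutions vs. a hybrid-dilated stack with rates [1, 2, 5]
# (the HDC-style rates avoid the gridding issue of repeating a single rate):
plain = receptive_field([(3, 1)] * 3)                # -> 7
hybrid = receptive_field([(3, 1), (3, 2), (3, 5)])   # -> 17
```

The same parameter budget thus covers a receptive field more than twice as wide, which is the motivation shared by dilated convolution, HDC, and the receptive field block.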

General object detectors
Currently, object detection is widely used in natural scene images. Existing object detectors can be broadly classified as region-based detectors and region-free detectors. The most representative region-based detectors include R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN [24], Cascade R-CNN [25], and R-FCN [26]. Representative region-free detectors include the YOLO series, the SSD series, and RetinaNet [27]. YOLOv4, YOLOv5, YOLOX, YOLOv6, and YOLOv7 added the spatial pyramid pooling context module to expand the receptive field of the network [28]. Although this effectively improves detection accuracy, the parameters and computational complexity increase. Despite the advanced performance achieved by these models, most of them have a high number of parameters and require high-performance hardware resources.
To trade off the parameters and accuracy of the network, many scholars are working on efficient detection algorithms. Wong et al. [29] proposed the highly compact network YOLO Nano, which is based on the YOLO model and uses a human-machine collaborative design strategy; YOLO Nano is capable of real-time detection on edge and mobile devices. Hu et al. [30] used depth-wise separable convolution and mobile inverted bottleneck convolution to improve YOLOv3-Tiny, significantly reducing parameters and computational complexity at the slight cost of a 0.7% reduction in detection performance. YOLOX-Nano [12], based on an anchor-free model, has only 0.91M parameters and achieved advanced detection performance on MS COCO using the SimOTA label assignment strategy and a decoupled head. Han et al. [31] used 1 × 1 convolution to compress the number of channels and then used depth-wise separable convolution to obtain similar feature maps, thus reducing the parameters of the model and increasing its speed. Cai et al. [32] proposed YOLObile from both compression and compilation perspectives. The block-punched pruning technique is used to reduce the computational cost of the network while ensuring the accuracy of the model, and a GPU-CPU collaborative scheme helps optimize the running speed on mobile devices. Yu et al. [17] achieved superior performance with PP-PicoDet on mobile devices by improving the ShuffleNetV2 architecture and optimizing the CSP-PAN and SimOTA label assignment strategies.

UAV aerial image object detectors
In UAV aerial photography datasets, most objects are characterized by small scales, diverse sizes, and dense distributions, which make them challenging to detect. Chen et al. [33] designed the hybrid detector RRNet with an adaptive resampling strategy for data augmentation and a regression module to generate accurate bounding boxes. The detector effectively solves the problem of small objects in dense scenes. Zhang et al. [34] performed channel pruning on the convolutional layers of YOLOv3, which significantly reduced the computation and number of parameters while maintaining detection performance, effectively improving the running speed. Zhang et al. [35] used a cascade architecture to refine the prediction boxes, which addressed the dense object distribution and occlusion problems to some extent. Both models were trained and tested with data augmentation to improve their ability to detect small objects, but without considering the number of network parameters and operational efficiency. Wang et al. [36] proposed a receptive field expansion module to increase the receptive field of the network and a spatial refinement module to restore the spatial information of objects; the combination of the two effectively addresses the multi-scale problem. GDF-Net [37] uses dilated convolution to refine density features and provide a larger receptive field for the network, which improves detection performance and addresses small objects in dense scenes. Jadhav et al. [38] handled multi-scale objects in dense scenes by resizing the anchors and using the SE module [39] to enhance the sensitivity of the network to channel information. To handle the long-tailed distribution of UAV aerial images, Yu et al. [40] proposed the DSHNet algorithm. Class-biased samplers bias-sample the objects of the head and tail categories, separate classifiers process the head and tail, and finally, the losses of the head and tail categories are weighted. This effectively resolved the problem of imbalanced category distributions but required additional computational resources. To address the missed detections of one-stage detectors, Tian et al. [41] proposed a dual neural network that performs secondary recognition of missed object areas and effectively enhances the detection accuracy of small objects. Zhang et al. [42] proposed an adaptive dense pyramid network for the multi-scale problem of UAV aerial images, using a pyramid density module and an object detection module to align the features between density information and instance recognition; this improved detection performance but occupied more hardware resources. Because the acceleration performance of depth-wise separable convolution does not reach its theoretical potential on GPUs, Li et al. [43] proposed a field-programmable gate array (FPGA) accelerator, achieving advanced results in UAV object detection. First, an adjustable-parallelism method in the computational unit accelerates pointwise convolution, effectively improving the utilization of computational resources and bandwidth. Then, a space-to-channel approach improves the utilization and bandwidth of depth-wise convolution. Finally, the preloading workflow of the system is optimized by reducing the waiting time between two images.

Proposed method
In this section, the design ideas of this paper are first introduced. The backbone, neck, and detection head were then redesigned, and these enhancement strategies were used to improve accuracy and reduce parameters. Finally, this section describes the anchor-free and label assignment strategies used to further improve performance.

ERF module
Residual connectivity structures [44,45] and dense connectivity structures [46,47] are widely used in convolutional neural networks. The residual connection structure alleviates the vanishing gradient problem brought about by the increased depth of the neural network. The dense connection structure enhances feature transfer and reuses features more effectively, which leads to good performance in object detection tasks. VoVNet [48] addresses the problem of high memory access costs caused by the dense connectivity of DenseNet. CSPVoVNet [49] enables different layer weights to learn more diverse features by analyzing gradient paths. Efficient layer aggregation networks [50] employ a layer aggregation architecture with effective gradient propagation paths by controlling the shortest and longest gradient paths of each layer, which effectively solves the problem of difficult convergence when scaling model depth and enables the network to learn more features. With limited computational resources, expanding the receptive field of the convolution kernel can effectively improve the detection accuracy of the model. The dilated block (D block) [51] employs group convolution and dilation rates. Group convolution reduces the computational effort and number of trained parameters of the model, and the different dilation rates expand the convolution kernel receptive field without increasing computational effort. The D block applies different dilation rates to different groups to extract multi-scale features.
Inspired by these methods, this paper proposes a new module, the ERF module, which uses both residual connectivity and dense connectivity; it not only maintains local details but also increases the receptive field of the convolution kernel.
Figure 2a illustrates the X block used in the proposed method. First, a 1 × 1 convolution is used to increase the nonlinear characteristics and cross-channel information interaction. When group convolution is then performed, the X block uses one dilation rate (d1) for half of the groups and another dilation rate (d2) for the other half. This saves parameters while effectively expanding the receptive field of the convolution kernel. In addition, the SE module focuses on channel information, which weights the network channels well and needs only a small amount of additional computation to improve performance. Therefore, the X block uses the SE module to strengthen the features of important channels and weaken the features of unimportant channels. Finally, a 1 × 1 convolution adds nonlinearity without loss of resolution. The group number of the grouped convolution, g, was set to 4, d1 was set to 1 to retain local details, and the reduction ratio of the SE module was set to 4. The ERF module proposed in this paper is shown in Fig. 2. When the stride is 1 (Fig. 2b), the input channels are equally divided into two parts: the left branch is used as a residual connection to prevent gradient vanishing and reduce overfitting, and the right branch expands the network receptive field by stacking n X blocks without losing local details. To balance accuracy and parameters, three X blocks are stacked in the ERF module of YOLO-ERF-T, which has a small number of parameters, whereas four X blocks are used in YOLO-ERF-S and YOLO-ERF-L. The structure in Fig. 2c is used for downsampling, with the left branch learning features through the X block and the right branch through a max-pooling operation, after which the features learned by the two branches are concatenated. Finally, the number of channels is recovered using a 1 × 1 convolution.
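The parameter saving from the grouped convolution inside the X block can be illustrated with a quick count. The 128-channel width below is an arbitrary example, and note that dilation changes the receptive field but not the parameter count:

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Weight count of a k x k convolution (bias omitted). Each group sees
    only c_in/groups input channels, so parameters shrink by the group count.
    Dilation affects the receptive field only, not this count."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

# Standard 3x3 convolution on 128 channels vs. the grouped layout of the
# X block (g = 4 groups, half at dilation d1 and half at d2):
standard = conv_params(128, 128, k=3, groups=1)  # 147,456 weights
grouped = conv_params(128, 128, k=3, groups=4)   # 36,864 weights (4x fewer)
```

With g = 4, the grouped layer carries a quarter of the weights of the standard one while the two dilation rates spread its groups over two receptive-field scales.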

Backbone and neck
In the method proposed in this paper, we use the proposed ERF module to construct the backbone. The backbone is called ERFNet, and its structure is presented in Table 1. For YOLO-ERF-T, a 32-channel focus module [11] with a stride of 1 is first used, which reduces the number of parameters without loss of information. Then, an ERF module with a stride of 2 is used to reduce the size of the feature map, as well as the amount of computation and parameters, while preserving important feature information and extracting more abstract features. Next, an ERF module with a stride of 1 is used to expand the receptive field, allowing the network to gain a broader perspective and capture more contextual information, thereby improving the network's feature extraction capability. Because of the four stacked ERF modules, the resolution of the input image is downsampled to 1/32, and the final number of channels is 512. Choosing the right dilation rate for the ERF module is not easy, and several dilation rates are evaluated in Sect. 4.3.
The PAN [18] structure is widely used in the necks of YOLOv4, YOLOv5, YOLOX, and YOLOv7, and therefore the PAN structure is used in the proposed model. In this paper, the neck constructed with the ERF module is called ERF-PAN, and multilayer feature maps are obtained by top-down and bottom-up feature fusion. The specific structure is shown in Fig. 3.
In this paper, we used a width factor to scale the number of channels in the backbone, neck, and detection head of the model. Thus, a series of detection networks with different parameters and different computational costs can be obtained. The basic numbers of channels for the backbone, neck, and detection head were set to [64, 128, 256, 512, 1024], [256, 512, 1024], and [256, 256, 256], respectively. Table 2 lists the width factor specified for each model.
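The channel scaling can be sketched as below; rounding to a multiple of 8 is an assumption for illustration, since the paper does not state its rounding rule:

```python
def scale_channels(base_channels, width_factor, divisor=8):
    """Scale base channel counts by a width factor, rounding each result to
    a multiple of `divisor` (a common convention in YOLO-style models; the
    paper's exact rounding rule is not specified)."""
    return [max(divisor, int(round(c * width_factor / divisor)) * divisor)
            for c in base_channels]

backbone = [64, 128, 256, 512, 1024]
tiny = scale_channels(backbone, 0.50)   # YOLO-ERF-T: [32, 64, 128, 256, 512]
large = scale_channels(backbone, 1.25)  # YOLO-ERF-L: [80, 160, 320, 640, 1280]
```

With width factor 0.50, the final backbone width becomes 512, matching the ERFNet description for YOLO-ERF-T.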

ERF-Head
The detection head handles the position regression and classification tasks for each object. Many papers have proposed methods to improve the detection head for these tasks [52][53][54]. YOLOX adopted the approach of these papers and proposed the decoupled head structure. The background of dense small objects in UAV aerial images is complicated and diverse. To improve the feature extraction capability of the detection head, the proposed X block is used in this paper to focus on the region of interest. The X block is a lightweight module that can easily be embedded into the detection head; it increases the receptive field of the convolution kernel while retaining local details, thus improving the feature representation of the detection head. The structure of the detection head is shown in Fig. 4.

Anchor-free approach
Anchor-based object detection algorithms require manually designed hyperparameters, such as the scale, aspect ratio, and intersection-over-union threshold of the anchors, based on the distribution of the training data. However, this approach does not generalize well to UAV datasets. In contrast to the natural scene dataset COCO, the UAV dataset VisDrone suffers from a high proportion of small objects and a long-tailed distribution, as illustrated in Fig. 5. For these reasons, the anchor-free strategy is used in this paper. This approach reduces the number of predictions per grid cell from three to one, predicting four values: the two offsets x and y from the upper-left corner of the grid cell, and the height h and width w of the prediction box.
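A minimal sketch of this anchor-free decoding is given below; the exact parameterisation in the actual implementation (e.g. any transform applied to w and h) may differ:

```python
def decode_prediction(gx, gy, pred, stride):
    """Decode one anchor-free prediction at grid cell (gx, gy).
    `pred` = (dx, dy, w, h): offsets from the cell's upper-left corner and
    the predicted box size in pixels. Returns (x1, y1, x2, y2)."""
    dx, dy, w, h = pred
    cx = (gx + dx) * stride  # box center in image coordinates
    cy = (gy + dy) * stride
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Cell (10, 4) on the stride-8 feature map, offsets (0.5, 0.5), 16 x 16 box:
box = decode_prediction(10, 4, (0.5, 0.5, 16.0, 16.0), stride=8)
# -> (76.0, 28.0, 92.0, 44.0)
```

Because each cell makes a single prediction, the head output shrinks to one third of its anchor-based counterpart at the same resolution.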

Label assignment strategy
Label assignment has an important role in the field of object detection, where loss weights of positive and negative samples are assigned to training samples for supervised learning. The assignment strategy used by RetinaNet, Faster R-CNN, and SSD is the max-IoU assigner, which discriminates positive and negative samples by calculating the IoU value between the ground truth (GT) and the anchors. YOLOv5 uses the width-to-height ratio between the GT and the anchors to increase the number of positive samples. FCOS [55] takes the anchors in the central region of the ground truth as positive samples. The strategy used by TOOD [56], PP-YOLOE [57], and YOLOv6 is task alignment learning, which characterizes the interaction between the classification and localization tasks while maintaining their features. To obtain globally optimal assignment results, YOLOX uses the SimOTA dynamic label assignment strategy and achieves advanced results. Therefore, the SimOTA dynamic label assignment strategy was chosen to optimize the training process in this paper.

Fig. 3 YOLO-ERF structure. The backbone is ERFNet, which feeds the extracted C3, C4, C5 feature maps to the neck. The neck is ERF-PAN, which fuses the three input feature maps and outputs three feature maps P3, P4, P5. For YOLO-ERF-T, the number of input channels is [128, 256, 512] and the number of output channels is [128, 256, 512]

Table 2 Value of width factor for the YOLO-ERF series of networks

Model        Width factor
YOLO-ERF-T   0.50
YOLO-ERF-S   1.00
YOLO-ERF-L   1.25

Fig. 4 YOLO-ERF-Head structure. The numbers of channels of P3, P4, P5 are compressed using 1 × 1 convolution to keep the number of channels consistent. After that, two X blocks are stacked twice to improve the feature extraction capability of the detection head. Finally, the prediction results are output using 1 × 1 convolution

This section first described the ERF module, which increases the convolutional kernel receptive field while preserving local details. The optimization of the backbone, neck, and detection head using the ERF module was then described. These enhancement strategies are suitable for the UAV aerial photography scenario, reducing parameters and improving detection accuracy. Lastly, the anchor-free and SimOTA dynamic label assignment strategies were introduced to further improve detection performance.
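For contrast with SimOTA's dynamic matching, the static max-IoU assignment used by RetinaNet, Faster R-CNN, and SSD can be sketched as follows; the thresholds are typical values, not taken from this paper:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def max_iou_assign(anchors, gts, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor by its best IoU with any ground truth:
    1 = positive, 0 = negative, -1 = ignored (between the thresholds)."""
    labels = []
    for anc in anchors:
        best = max((iou(anc, gt) for gt in gts), default=0.0)
        labels.append(1 if best >= pos_thr else 0 if best < neg_thr else -1)
    return labels
```

SimOTA replaces these fixed thresholds with a per-ground-truth, cost-based matching computed dynamically during training, which is why it adapts better to the dense small objects of VisDrone.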

Dataset
VisDrone2019 is a UAV aerial photography dataset with image resolutions up to 2000 × 1500. The training set contains 6,471 images with 343,205 labels and an average of 53 object instances per image, which is a high object density; most of the objects are very small (< 32 pixels), as shown in Fig. 5. The validation and test sets contain 548 and 1,610 images, respectively. The dataset includes 10 categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. All models in this paper were trained on the training set, validated on the validation set, and finally evaluated on the test set.
MS COCO2017 includes 80 categories and is the most widely used dataset for object detection assessment.
The training set has 118,287 images containing 860,001 instances, with an average of 7.3 instances of objects per image, of which 41.4% are small objects.In all experiments, the models were trained on the training set and evaluated on the 5,000 images of the validation set.

Evaluation indexes
The evaluation criteria of the MS COCO dataset were used. The main indicators were AP0.5 and AP0.5:0.95. AP0.5 (AP50) indicates the average precision over all object categories at an IoU threshold of 0.5. AP0.5:0.95 (AP) indicates the average precision over all object categories at IoU thresholds from 0.5 to 0.95 with a step size of 0.05. APL, APM, and APS are the AP values at IoU thresholds of 0.5 to 0.95 for large, medium, and small objects, respectively.
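These criteria are easy to make concrete; the sketch below lists the ten IoU thresholds behind AP0.5:0.95 and the standard COCO area buckets behind APS, APM, and APL:

```python
def coco_iou_thresholds():
    """The ten IoU thresholds averaged in AP0.5:0.95 (0.50 to 0.95, step 0.05)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

def size_bucket(area):
    """COCO object-scale buckets (area in pixels^2) used for AP_S / AP_M / AP_L."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

Under this definition, most VisDrone instances (< 32 pixels on a side) fall into the "small" bucket, which is why APS is the most informative scale-specific metric for UAV imagery.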

Ablation study
In this paper, we used stochastic gradient descent with a momentum of 0.937 for training and a weight decay of 5e-4. A cosine annealing decay strategy was used to adjust the learning rate, with an initial learning rate of 0.01. On a single NVIDIA RTX 3090 GPU, the batch size was the default value of 16. 150 epochs were used for training on the VisDrone dataset and 300 epochs on the MS COCO dataset.
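A minimal sketch of the cosine annealing schedule with the stated initial learning rate is shown below; the final-LR ratio and the absence of warmup are assumptions made for illustration, as the paper does not give these details:

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.01, lr_final_ratio=0.01):
    """Cosine-annealed learning rate: decays lr0 smoothly to
    lr0 * lr_final_ratio over total_epochs."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr0 * (lr_final_ratio + (1 - lr_final_ratio) * cos)

start = cosine_lr(0, 150)    # 0.01 at the first epoch
end = cosine_lr(150, 150)    # 0.0001 at the last epoch
```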
In this paper, ablation experiments were performed on the proposed method. All experiments were performed on the VisDrone2019 dataset, and the results were obtained on the test set. First, the backbone employed the proposed ERFNet, no neck was used, and a decoupled head was used for detection. The IoU loss function [58] was used to train the regression branch, and the binary cross-entropy loss function was used for both the object loss and the confidence loss. The label assignment strategy was SimOTA, and all activation functions were SiLU [59]. The obtained test AP was 12.6%. Next, the ERF-PAN structure was added to the network, increasing the number of parameters by 1.7M; the test AP further increased to 15.5%. At this point, the receptive field of the network perceives the detailed features of small and medium-sized objects well, but the receptive field for large objects is insufficient, so the improvements in APS and APM exceed that of APL. Finally, the decoupled head was replaced by the proposed ERF-Head. The parameters were reduced by 24.5% and the test AP improved by 2.1%. Moreover, the detection accuracy of both small and large objects improved, demonstrating the effectiveness of the ERF-Head. ERF-Head further expands the receptive field of the model, allowing it to perceive more large objects and fully utilize more contextual information. However, it may also capture some irrelevant noise and redundant information, resulting in less improvement in detection accuracy for small and medium objects compared with large objects. The results are shown in Table 3.
As shown in Table 4, the value of the dilation rate of the ERF module was investigated in this paper. In this study, we

Comparison with other detectors on VisDrone
To evaluate the performance of YOLO-ERF, a series of experiments was conducted against other state-of-the-art one-stage detectors (e.g., YOLOv5 [11], YOLOX [12], and YOLOv7 [13]) on the VisDrone2019 dataset. The experimental results are listed in Table 5. Note that the results of the comparative methods are taken from the corresponding original papers.
As can be seen in Table 5, compared with YOLOX-S, YOLO-ERF-T reduced the number of parameters by 58.8% and the GFLOPs by 53.4%, while achieving an AP 2.7% higher and an AP50 5.3% higher. Compared with YOLOv5-S, it reduced the number of parameters by 48.6% and the GFLOPs by 24.2%, obtaining AP and AP50 results that were 1.9% and 4.2% higher, respectively. YOLO-ERF-L was also compared with YOLOX-X: it reduced the number of parameters by 49.3% and the GFLOPs by 32.6% while improving the AP by 3.3%.
Because of the high resolution of the VisDrone dataset, some methods divide the images into blocks when evaluating on the test set. Therefore, in this paper, we used the SAHI tool [60] to divide the test set images into blocks with edge lengths of 960 and an overlap between blocks of 25%. The prediction results evaluated on the image blocks are shown in Table 5, revealing a significant improvement in detection accuracy.
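The slicing geometry can be sketched independently of the SAHI tool itself: square tiles of edge 960 with 25% overlap, with extra tiles appended so the right and bottom image edges are covered (the edge-handling details here are an assumption; SAHI's internals may differ):

```python
def slice_boxes(width, height, tile=960, overlap=0.25):
    """Compute tile coordinates (x1, y1, x2, y2) for sliced inference:
    square tiles of the given edge length with the given overlap ratio."""
    step = int(tile * (1 - overlap))  # 720 for tile=960, overlap=0.25
    xs = list(range(0, max(width - tile, 0) + 1, step)) or [0]
    ys = list(range(0, max(height - tile, 0) + 1, step)) or [0]
    # Ensure the right and bottom edges are covered by a final tile.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in ys for x in xs]

tiles = slice_boxes(2000, 1500)  # a VisDrone-sized image -> 6 tiles
```

Each tile is then run through the detector at full resolution, and the per-tile detections are merged back into image coordinates, which is what recovers the small objects lost when the whole image is downscaled.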

Comparison with other detectors on MS COCO
To further evaluate the generality of YOLO-ERF, this study conducted experiments on the MS COCO2017 dataset, which has a larger number of object instances than the VisDrone2019 dataset. Because of the limitations of our hardware resources, only YOLO-ERF-T and YOLO-ERF-S were evaluated on MS COCO. We compared them with YOLOv5 [11], YOLOX [12], YOLOv3-Tiny [9], YOLOv4-Tiny [49], YOLOv6 [14], YOLOv7 [13], and YOLOv8 [15]; the experimental results are presented in Table 6. Compared with YOLOX-Nano, YOLOX-Tiny, YOLOv3-Tiny, and YOLOv4-Tiny (input size = 416), YOLO-ERF-T improved the AP by 10.3%, 3.3%, 19.5%, and 14.4%, respectively. In addition, it improved the AP by 11.3%, 3.4%, 1.9%, and 2.0% compared with YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, and YOLOv8-N (input size = 640), respectively.

Conclusion
In this paper, we provided a series of lightweight object detectors called YOLO-ERF for UAV aerial images. First, we proposed the ERF module, which increases the receptive field of the convolution kernel while retaining local details. A lightweight backbone was designed using the ERF module, which effectively expands the receptive field of the backbone and avoids adding additional contextual modules to expand the network's receptive field. In addition, the ERF module was used to improve the PAN structure, which improves the ability of the network to extract features while reducing the number of parameters. Finally, a new lightweight detection head was designed for small objects in complex backgrounds. We performed experiments on the VisDrone and MS COCO datasets to evaluate the effectiveness of YOLO-ERF. The experimental results showed that our method achieves a better trade-off between accuracy and parameters than the comparison methods.
Considering the high cost of memory access for networks with multi-branch structures, excessive memory accesses may become a bottleneck in network performance. Therefore, in future work, we will design object detectors that are even more lightweight and can be deployed on edge and mobile devices in real time.

Fig. 2 a X block, and ERF module b when stride = 1 and c when stride = 2

Table 1 ERFNet
a Channels is the number of output channels; the number of input channels is inferred from the previous block
b Repeat indicates the number of times each module is used

Table 3
Ablation experiments of YOLO-ERF-T

Table 5
Comparison of the accuracy of different object detectors on VisDrone2019
Bold indicates the detection results of the model in this paper
The models in the table were trained using the original images
"*" represents the evaluation prediction on the divided image blocks