2.1 Underwater Object Detection
Mainstream object detection models are generally divided into single-stage and two-stage algorithms, the difference being whether they include a candidate-box generation stage. Two-stage detectors such as Faster R-CNN and Mask R-CNN split detection into two stages: the first uses a region proposal network to generate region proposals of interest, and the second maps each proposal onto the feature map through pooling for classification and location regression. Two-stage algorithms achieve high detection accuracy but poor real-time performance. Single-stage algorithms such as SSD and the YOLO series do not generate candidate boxes, but directly output classification and localization results; they are therefore fast, but their detection accuracy is slightly inferior to that of two-stage algorithms.

In underwater scenes, image quality is greatly affected by lighting, and the images suffer from low visibility, low contrast, and color distortion. Xu et al.[20] showed that underwater image enhancement is not strictly positively correlated with improved underwater detection accuracy, and that enhanced underwater images may even reduce it. Some scholars therefore train enhancement and detection jointly to achieve higher detection accuracy. Yeh et al.[5] added a color conversion network before the object detection network, which converts images from the RGB to the HSI color space for fine-tuning and outputs grayscale images to the detection network. To address underwater image blur, Chen et al.[21] proposed a sample-weighted network (SWIPENet) and a new training paradigm, Curriculum Multi-Class Adaboost (CMA), which uses a sample reweighting algorithm to down-weight possibly missed objects and thereby reduce the interference of noisy samples.
Hu et al.[22] proposed an underwater object detection algorithm based on SSD and feature enhancement, which adopts cross-level feature fusion to improve feature representation ability.
2.2 Receptive Field Enhancement
Research on receptive fields has a long history; its main purpose is to improve object detection performance without adding much computational cost. Inspired by a neuroscience model of the primate visual cortex, Szegedy et al.[23] proposed Inception, which improved the network by approximating the expected optimal sparse structure with existing dense building blocks, enhancing the model's feature representation. Szegedy et al. subsequently proposed improved Inception variants[24, 25], which use multiple branches with different kernel sizes to capture multi-scale information. However, these kernels all sample at the same center, so key feature details are easily lost. Chen et al.[26] proposed ASPP, which uses dilated convolution to change the distance between sampling centers. But ASPP samples features at a uniform resolution with the same kernel size as preceding convolution layers, which easily causes confusion between objects and context. Dai et al.[27] proposed Deformable CNN to learn individual sampling patterns for individual objects, but it shares the same problem as ASPP. Liu et al.[28] proposed the RFB module, which consists of multi-branch convolution layers with different kernels followed by trailing dilated convolution layers. The first part is similar to Inception and is responsible for simulating kernels of various sizes; the second part reproduces the relationship between population receptive field (pRF) size and eccentricity in the human visual system. The RFB module effectively improves the performance of single-stage detection networks. Fan et al.[13] proposed the RFAM and RFAM-PRO modules, which build on the RFB design; RFAM-PRO further refines the kernel sizes to make the module more conducive to small-object detection.
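The effect of the dilated branches discussed above can be illustrated with standard receptive-field arithmetic. The sketch below is a generic calculation, not code from any of the cited works; it shows how replacing plain 3×3 convolutions with dilated ones (here with illustrative dilation rates 1, 3, 5, as in RFB-style branches) enlarges the receptive field without adding parameters:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    Each layer is (kernel_size, stride, dilation). Uses the standard
    recurrence: rf += (effective_kernel - 1) * jump; jump *= stride,
    where effective_kernel = dilation * (kernel_size - 1) + 1.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        eff_k = dilation * (kernel - 1) + 1   # kernel span after dilation
        rf += (eff_k - 1) * jump
        jump *= stride
    return rf

# Three stride-1 3x3 convs: plain vs. dilated (rates 1, 3, 5)
plain = receptive_field([(3, 1, 1)] * 3)                       # -> 7
dilated = receptive_field([(3, 1, 1), (3, 1, 3), (3, 1, 5)])   # -> 19
```

Both stacks have the same parameter count, but the dilated stack covers a 19×19 input region instead of 7×7, which is the property the multi-branch receptive-field modules exploit.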
2.3 Loss Function in Object Detection
A loss function measures the difference between predicted values and true values. In object detection, to improve accuracy we need prediction boxes to be as close as possible to ground-truth boxes, and loss functions are introduced for this purpose. Yu et al.[29] proposed IoU Loss, which takes the negative logarithm of the ratio of the intersection to the union of the prediction box and the ground-truth box. It solves two major problems of the Smooth L1 family: the box coordinates are optimized independently, and the loss lacks scale invariance. However, IoU Loss cannot optimize the case where the two boxes do not intersect, nor can it reflect how they intersect. Rezatofighi et al.[30] proposed GIoU Loss, which introduces the minimum enclosing rectangle of the prediction box and the ground-truth box on top of IoU. But when the two boxes are in a containment relationship, or aligned in the horizontal or vertical direction, GIoU Loss degenerates into IoU Loss, i.e., |C − A∪B| → 0, causing the model to converge slowly. Zheng et al.[19] proposed DIoU, which modifies GIoU's enclosing-box penalty term to minimize the normalized distance between the two box centers while maximizing the overlapping area, thus accelerating loss convergence. They also proposed CIoU, which additionally incorporates the aspect ratio of the bounding boxes into the loss based on DIoU, further improving regression accuracy. Gevorgyan[18] observed that directional mismatch between the ground-truth box and the prediction box makes the model converge more slowly and less effectively, and proposed a new loss function, SIoU, which redefines the penalty metric to take the angle between the prediction box and the ground-truth box into account, effectively improving detection accuracy.
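The IoU-family penalties above can be sketched in a few lines of pure Python. This is a minimal illustration (function names are ours, and production implementations are vectorized over batches of boxes); boxes are axis-aligned `(x1, y1, x2, y2)` tuples:

```python
def iou_terms(a, b):
    """Intersection, union, and enclosing-box width/height of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    cw = max(a[2], b[2]) - min(a[0], b[0])  # enclosing box width
    ch = max(a[3], b[3]) - min(a[1], b[1])  # enclosing box height
    return inter, union, cw, ch

def giou_loss(a, b):
    """GIoU loss: 1 - IoU + |C - A∪B| / |C| (Rezatofighi et al.)."""
    inter, union, cw, ch = iou_terms(a, b)
    enclose = cw * ch
    return 1.0 - inter / union + (enclose - union) / enclose

def diou_loss(a, b):
    """DIoU loss: 1 - IoU + rho^2(centers) / c^2 (Zheng et al.),
    where c is the diagonal of the smallest enclosing box."""
    inter, union, cw, ch = iou_terms(a, b)
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + \
           ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    return 1.0 - inter / union + rho2 / (cw ** 2 + ch ** 2)
```

For two disjoint boxes, plain IoU loss saturates (IoU = 0 regardless of their distance), while the GIoU and DIoU terms still grow with separation, which is exactly the gradient signal that motivated these variants.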