2.1 Underwater Object Detection
Mainstream object detection models are generally divided into single-stage and two-stage algorithms, the difference being whether they include a candidate-box generation stage. Two-stage detectors such as Faster R-CNN and Mask R-CNN split detection into two stages: the first uses a region proposal network to generate region proposals of interest, and the second maps each proposal onto the feature map through pooling for classification and location regression. Two-stage algorithms achieve high detection accuracy but poor real-time performance. Single-stage algorithms such as SSD and the YOLO series do not generate candidate boxes, but directly output classification and localization results; they are therefore fast, but their detection accuracy is slightly inferior to that of two-stage algorithms.

In underwater scenes, image quality is greatly affected by lighting, and the images suffer from low visibility, low contrast, and color distortion. Xu et al.[20] showed that underwater image enhancement is not strictly positively correlated with improved underwater detection accuracy, and that enhanced underwater images may even reduce it. Some scholars therefore train enhancement and detection jointly to achieve higher detection accuracy. Yeh et al.[5] added a color conversion network before the object detection network, which converts images from the RGB to the HSI color space for fine-tuning and outputs grayscale images to the detection network. To address underwater image blur, Chen et al.[21] proposed a sample-weighted network (SWIPENet) and a new training paradigm, Curriculum Multi-Class Adaboost (CMA), which uses a sample reweighting algorithm to down-weight possibly missed objects and thereby reduce the interference of noisy samples.
Hu et al.[22] proposed an underwater object detection algorithm based on SSD and feature enhancement, which adopts cross-level feature fusion to improve feature representation ability.
2.2 Receptive Field Enhancement
Research on receptive fields has a long history; its main purpose is to improve object detection performance without adding much computational cost. Inspired by a neuroscience model of the primate visual cortex, Szegedy et al.[23] proposed Inception, which improved the network by approximating the expected optimal sparse structure with existing dense building blocks, enhancing the model's feature representation. Szegedy et al. subsequently proposed improved Inception variants[24, 25], which use multiple branches with different kernel sizes to capture multi-scale information. However, these kernels all sample at the same center, so key feature details are easily lost. Chen et al.[26] proposed ASPP, which uses dilated convolution to change the distance between sampling centers. But ASPP samples features at a uniform resolution with the same kernel size as preceding convolution layers, which easily causes confusion between objects and context. Dai et al.[27] proposed Deformable CNN to learn individual sampling patterns for individual objects, but it shares the same problem as ASPP. Liu et al.[28] proposed the RFB module, which consists of multi-branch convolution layers with different kernels followed by trailing dilated convolution layers. The first part is similar to Inception and is responsible for simulating kernels of various sizes; the second part reproduces the relationship between population receptive field (pRF) size and eccentricity in the human visual system. The RFB module effectively improves the performance of single-stage detection networks. Fan et al.[13] proposed the RFAM and RFAM-PRO modules, which build on the RFB design; RFAM-PRO further refines the kernel sizes to make the module more conducive to small-object detection.
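The effect of the dilated branches discussed above can be illustrated with standard receptive-field arithmetic. The sketch below is a generic calculation, not code from any of the cited works; it shows how replacing plain 3×3 convolutions with dilated ones (here with illustrative dilation rates 1, 3, 5, as in RFB-style branches) enlarges the receptive field without adding parameters:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    Each layer is (kernel_size, stride, dilation). Uses the standard
    recurrence: rf += (effective_kernel - 1) * jump; jump *= stride,
    where effective_kernel = dilation * (kernel_size - 1) + 1.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        eff_k = dilation * (kernel - 1) + 1   # kernel span after dilation
        rf += (eff_k - 1) * jump
        jump *= stride
    return rf

# Three stride-1 3x3 convs: plain vs. dilated (rates 1, 3, 5)
plain = receptive_field([(3, 1, 1)] * 3)                       # -> 7
dilated = receptive_field([(3, 1, 1), (3, 1, 3), (3, 1, 5)])   # -> 19
```

Both stacks have the same parameter count, but the dilated stack covers a 19×19 input region instead of 7×7, which is the property the multi-branch receptive-field modules exploit.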
2.3 Loss Function in Object Detection
A loss function measures the difference between predicted values and true values. In object detection, to improve accuracy we need prediction boxes to be as close as possible to ground-truth boxes, and loss functions are introduced for this purpose. Yu et al.[29] proposed IoU Loss, which takes the negative logarithm of the ratio of the intersection to the union of the prediction box and the ground-truth box. It solves two major problems of the Smooth L1 family: the box coordinates are optimized independently, and the loss lacks scale invariance. However, IoU Loss cannot optimize the case where the two boxes do not intersect, nor can it reflect how they intersect. Rezatofighi et al.[30] proposed GIoU Loss, which introduces the minimum enclosing rectangle of the prediction box and the ground-truth box on top of IoU. But when the two boxes are in a containment relationship, or aligned in the horizontal or vertical direction, GIoU Loss degenerates into IoU Loss, i.e., |C − A∪B| → 0, causing the model to converge slowly. Zheng et al.[19] proposed DIoU, which modifies GIoU's enclosing-box penalty term to minimize the normalized distance between the two box centers while maximizing the overlapping area, thus accelerating loss convergence. They also proposed CIoU, which additionally incorporates the aspect ratio of the bounding boxes into the loss based on DIoU, further improving regression accuracy. Gevorgyan[18] observed that directional mismatch between the ground-truth box and the prediction box makes the model converge more slowly and less effectively, and proposed a new loss function, SIoU, which redefines the penalty metric to take the angle between the prediction box and the ground-truth box into account, effectively improving detection accuracy.
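The IoU-family penalties above can be sketched in a few lines of pure Python. This is a minimal illustration (function names are ours, and production implementations are vectorized over batches of boxes); boxes are axis-aligned `(x1, y1, x2, y2)` tuples:

```python
def iou_terms(a, b):
    """Intersection, union, and enclosing-box width/height of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    cw = max(a[2], b[2]) - min(a[0], b[0])  # enclosing box width
    ch = max(a[3], b[3]) - min(a[1], b[1])  # enclosing box height
    return inter, union, cw, ch

def giou_loss(a, b):
    """GIoU loss: 1 - IoU + |C - A∪B| / |C| (Rezatofighi et al.)."""
    inter, union, cw, ch = iou_terms(a, b)
    enclose = cw * ch
    return 1.0 - inter / union + (enclose - union) / enclose

def diou_loss(a, b):
    """DIoU loss: 1 - IoU + rho^2(centers) / c^2 (Zheng et al.),
    where c is the diagonal of the smallest enclosing box."""
    inter, union, cw, ch = iou_terms(a, b)
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + \
           ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    return 1.0 - inter / union + rho2 / (cw ** 2 + ch ** 2)
```

For two disjoint boxes, plain IoU loss saturates (IoU = 0 regardless of their distance), while the GIoU and DIoU terms still grow with separation, which is exactly the gradient signal that motivated these variants.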