A Deep Learning-Based Embedding Framework for Detection and Recognition of Underwater Marine Organisms

The detection of marine organisms is an important part of the intelligent strategy of marine ranching, which requires an underwater robot to detect marine organisms quickly and accurately in the complex ocean environment. Based on recent deep learning algorithms, this paper constructs a real-time object detection system that finds marine organisms in images or video. The YOLOv4 neural network was employed to extract deep features of marine organisms, enabling accurate detection and size estimation of different fish, which can be used for evaluation in fisheries. Furthermore, the backbone and the neck connection were improved; the resulting architecture is called YOLOv4-embedding. Compared with other one-stage object detection algorithms, such as EfficientDet-D3, the YOLOv4-embedding algorithm achieves better detection accuracy, with higher detection confidence and a higher detection ratio. The results demonstrate that the proposed system can rapidly detect different varieties of marine organisms. Compared with YOLOv4, the mAP75 of YOLOv4-embedding improves by 2.92% on the marine organism dataset at a rapid rate of ~51 FPS on an RTX 3090, and it reaches 60.8% AP50 on the MS COCO dataset.

In 2017, Lin et al. [8] proposed RetinaNet to address the imbalance between positive and negative samples in the dataset. In 2018, YOLOv3 was released with higher APS (small object average precision) on small objects, which greatly improved on the problems of the previous two versions while offering higher detection speed and accuracy.
Recently, Bochkovskiy et al. [9] introduced the latest version, YOLOv4, which is more powerful than previous versions in AP (Average Precision) and FPS (Frames Per Second); its backbone network, neck network, and activation functions are improved with several optimizations. In this paper, YOLOv4 is employed.
This paper aims to develop a recognition system for Autonomous Underwater Vehicles that makes it possible to recognize the surrounding environment and marine organisms. Specifically, a normal RGB camera is used to obtain marine organism images in aquaculture ponds, and object detection is carried out under different conditions of illumination and occlusion. The main contributions of this work are: (1) based on improvements to the latest detection algorithm, YOLOv4, rapid and precise detection of marine organisms is realized under various environmental conditions; (2) different neural network algorithms are compared and their fish detection results discussed to verify applicability, and a method for marine organism detection is recommended.

Related work on fish and marine organism detection
In recent years, deep learning-based CV techniques and object detection algorithms have been widely exploited in aquaculture for tasks such as fish size measurement, body analysis, quality estimation, and illness diagnosis. As a contactless method, high-precision CV techniques can monitor the size, texture, and physical condition of farmed organisms and have become an important monitoring tool in aquaculture. As mentioned before, CNNs have been widely used in CV and have made particular breakthroughs in abstract cognitive problems. For example, based on the Fish4Knowledge dataset, Rathi designed a 3-level CNN to classify 21 species of tropical fish.
By combining a feature selection framework with image segmentation, Marini implemented classification for fish species and assessed fish abundance on the collected data. Mandal combined Faster R-CNN with three classification networks (ZFNet, CNN-M, and VGG16) to realize regional prediction for fish and crustaceans collected from Queensland beaches. Konovalov designed an Xception CNN-based fish detector for fish groups and realized underwater fish detection in multiple water areas. A recent study based on YOLOv3 shows that detection performance for marine organisms can be improved by exploiting the color and texture features of target objects.
However, early work on marine organism detection used CPUs to train on images, so training was slow. Moreover, if scattered background noise is not eliminated from the image, the detection ratio for marine organisms may also be reduced. GPU processing of image datasets makes training more effective and faster.
In this work, the YOLOv4-embedding algorithm is presented to realize fast detection of marine organisms.
Compared with other algorithms, YOLOv4-embedding is both fast and more accurate. Thus, YOLOv4-embedding can balance high accuracy and fast speed in marine organism recognition tasks, especially for underwater robots performing fishing tasks. Another advantage of our work is that using a normal RGB camera instead of complex sensors greatly decreases the cost of collecting marine organism images under shallow sea conditions.
The rest of this paper is organized as follows. In Section 3, the data and methods are defined, and the properties of the YOLOv4-embedding algorithm are briefly reviewed. Different object detection algorithms are then compared in Section 4. Finally, experimental analysis is discussed and conclusions are drawn in Section 5 and Section 6.

Data and relevant methods
To validate the proposed method, we use data provided by the Institute of Oceanology, Chinese Academy of Sciences (IOCAS), collected in a real aquaculture environment. Specifically, 1557 valid marine organism images cover 4 types of marine organisms, i.e., Abalone, Echinoidea, Holothuroidea, and Thamnaconus modestus. A digital color camera (GoPro CHDHX) with a resolution of 1280x720 pixels was used to capture the data, and the shooting angle was set to the front and side of the aquaculture ponds. Furthermore, some photos with an elevation angle of 45° were adopted for comparison. The annotation file contains the labels and the coordinates of the boxes that bound the target marine organisms in each image.

Detection procedure
In this subsection, the internal architecture of YOLOv4-embedding is presented in detail. YOLOv4 mainly includes 4 basic components: the input layer, the backbone network (Backbone), the neck network (Neck), and the output layer. The input layer accepts fixed-size images, whose features are extracted through the backbone network and sent to the neck network for feature fusion. The output layer outputs prediction anchor boxes at 3 different scales, called the YOLO Head. Fig. 1 shows the architecture of marine organism detection based on the YOLOv4-embedding algorithm. The detection process is summarized as follows: Step 1: Feed the marine organism image into the network.
Step 2: The backbone is a CSPDarknet53 model, which keeps the Darknet53 framework but uses the CSP mechanism. The Mish and Leaky ReLU activation functions are employed to abstract information from the image.
Step 3: PAN uses feature pyramid and path aggregation technology to propagate low-level information to the top level more easily. With these components, multi-scale prediction for three target sizes, i.e., large, medium, and small, can be performed.
Step 4: Embed a convolution layer and a linear activation function at the end of the YOLOv4 Neck. CBLR is employed in the network; the architecture is shown in Fig. 2. The Concat operation joins tensors along the channel dimension, merging the features of the two CBLR modules.
Step 5: The YOLOv4 head performs prediction and outputs the final detection results. We now clarify the concrete building blocks. The backbone is the CSPDarknet53 architecture, which comprises 5 CSP (Cross Stage Partial connections) modules, 11 CBM (Convolutional + Batch normalization + Mish) modules, and 10 CBL (Convolutional + Batch normalization + Leaky ReLU) modules. The CBM module performs convolution followed by Batch Normalization and the Mish activation function. In contrast, the CBL module performs convolution followed by Batch Normalization and the Leaky ReLU activation function.
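The CBM and CBL modules above differ only in their activation function and can be sketched in a few lines. The following is a minimal PyTorch sketch; PyTorch itself, the kernel sizes, and the 0.1 Leaky slope are assumptions for illustration, not details taken from the paper:

```python
import torch
import torch.nn as nn

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(nn.functional.softplus(x))

def conv_bn_act(in_ch, out_ch, k, stride, act):
    """Convolution + BatchNorm + activation, the pattern shared by CBM and CBL."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        act,
    )

def CBM(in_ch, out_ch, k=3, stride=1):
    """Convolutional + Batch normalization + Mish."""
    return conv_bn_act(in_ch, out_ch, k, stride, Mish())

def CBL(in_ch, out_ch, k=3, stride=1):
    """Convolutional + Batch normalization + Leaky ReLU."""
    return conv_bn_act(in_ch, out_ch, k, stride, nn.LeakyReLU(0.1))
```

Stacking such blocks together with CSP splits yields the CSPDarknet53 backbone; only the activation distinguishes CBM from CBL.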
The Leaky ReLU function is Leaky(x) = x for x > 0 and Leaky(x) = λx otherwise, with a small slope λ; the Mish function is Mish(x) = x · tanh(ln(1 + e^x)). The diagrams of the Mish and Leaky ReLU functions are shown in Fig. 3. Then, bounding boxes whose confidence is lower than the threshold are deleted, and the best of all candidate boxes is chosen by the DIoU-NMS algorithm. Finally, the CIoU loss is introduced, which enforces consistency of the aspect ratio. The loss function can be defined as

L_CIoU = 1 - IoU + ρ²(b, b_gt)/c² + αv,  (3)

where ρ is the distance between the centers of the predicted box b and the ground-truth box b_gt, c is the diagonal length of the smallest box enclosing both, α = v/((1 - IoU) + v) is the weight function, and v = (4/π²)(arctan(w_gt/h_gt) - arctan(w/h))² measures the similarity of the aspect ratios. After filtering by CIoU, the detection results are output, and the detection task is complete. Since CIoU takes the distance, overlap rate, scale, and a penalty factor into account, the predicted box can quickly approach the aspect ratio of the ground-truth box during training, avoiding the divergence seen when training YOLOv3.
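As a concrete illustration of Eq. (3), the CIoU loss for a single pair of boxes can be sketched with NumPy; the corner (x1, y1, x2, y2) box format and the epsilon guard are implementation choices for this sketch, not from the paper:

```python
import numpy as np

def ciou_loss(box_pred, box_gt):
    """CIoU loss between two boxes in (x1, y1, x2, y2) format.

    Follows Eq. (3): L = 1 - IoU + rho^2(b, b_gt)/c^2 + alpha*v, where rho is
    the center distance, c the diagonal of the smallest enclosing box, v the
    aspect-ratio term, and alpha its weight.
    """
    x1, y1, x2, y2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt

    # Intersection over union
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union

    # Squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term v and its weight alpha
    v = (4.0 / np.pi ** 2) * (np.arctan((gx2 - gx1) / (gy2 - gy1))
                              - np.arctan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / ((1.0 - iou) + v + 1e-9)  # epsilon avoids 0/0 for perfect boxes

    return 1.0 - iou + rho2 / c2 + alpha * v
```

Identical boxes give a loss of 0, while distant non-overlapping boxes are penalized through the center-distance term even though their IoU is zero.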

Experimental Results
This section illustrates the results of marine organism detection in the training and detection stages. In addition, the evaluation metrics, training parameters, and detection performance on different occasions are described.

Experimental setup
In the marine organism image classification tests, the following hyper-parameters are used: the number of training steps is 30938; the batch size and mini-batch size are 8 and 1, respectively; a polynomial-decay learning rate schedule is adopted with an initial learning rate of 0.0013; the burn-in period is 1000 iterations; and the momentum and weight decay are set to 0.949 and 0.0005, respectively.
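For reference, the polynomial-decay schedule above can be sketched as follows. The initial rate and step count come from the text, while the decay exponent (4, as in common Darknet YOLOv4 configurations) is an assumption:

```python
def poly_lr(step, base_lr=0.0013, max_steps=30938, power=4):
    """Polynomial-decay learning rate, Darknet's 'poly' policy:
    lr = base_lr * (1 - step/max_steps) ** power.

    base_lr and max_steps follow the hyper-parameters in the text;
    `power` is an assumed value, not stated in the paper.
    """
    return base_lr * (1.0 - step / max_steps) ** power
```

The schedule starts at the initial rate and decays smoothly to zero at the final training step.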

Assessment of the training models
In the training stage, precision, recall, and the mAP score were employed as evaluation metrics to assess the generalization capability and progressively optimize the model.
Precision (the positive predictive value) and recall are computed as Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where TP represents true positives, FP false positives, and FN false negatives.
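These metrics can be sketched directly from the definitions. The all-point interpolation used for AP below is one common convention for the area under the precision-recall curve, not necessarily the exact variant used in the paper:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation);
    averaging this per-class value over classes gives mAP."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Accumulate area where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP50 and mAP75 correspond to computing this AP with IoU thresholds of 0.5 and 0.75 when matching predictions to ground truth.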
In training, the batch size was set to 8. The weights corresponding to the maximum mAP in each training run were selected, and precision and recall were computed at thresholds of 0.5 and 0.75. Comparing the capability of the three training runs, the performance of YOLOv4 is shown in Table 1 and Table 3, and the performance of YOLOv4-embedding in Table 2 and Table 4. The object detection confidence in Fig. 6(a) is 60%, in Fig. 6(b) it is 98%, and in Fig. 6(c) it is 100%. When the light source is insufficient, the detection confidence for marine organisms is lower; when the light source is close to the detected object, the confidence is higher. The trained detection model was therefore examined under each degree of occlusion in turn: Fig. 7(a) shows large-area occlusion (orange box), which did affect the detection result because too little information about the marine organism remained; Fig. 7(b) shows accurate detection once the occluded area decreased, with the model detecting the marine organism at 65% confidence; in Fig. 7(c) the information about the marine organism was almost complete, and the confidence is 91%. Occlusion as in Fig. 7(a) and Fig. 7(b) often occurs in continuous detection. Accurate detection of all marine organisms in successive frames is of great significance for avoiding duplicated detection, and deep learning algorithms are more robust to such environmental conditions.

Discussion
Other deep learning algorithms are compared in this section to examine the capability of the marine organism detection model in aquaculture ponds. At the same time, the marine organism detection results of the different deep learning algorithms are compared and analyzed.

Comparison of YOLOv3- and YOLOv4-based networks in marine organism detection
The marine organism dataset was trained and evaluated on the YOLOv4-embedding neural network. The number of epochs was set to 300, and the optimal training model was selected for validation, achieving an mAP50 of 97.19%. Fig. 8 shows the P-R curves of the three YOLOv4-based neural networks on the validation set. As mentioned above, YOLOv4 uses the FPN+PAN architecture to repeatedly aggregate the features of the backbone and detection layers across multiple scales, which is important for improving small-object detection; YOLOv4-embedding changes the CBM architecture to DEPCU (deep convolution), which consists of 3 building blocks (CBL+CBM+CBL). Finally, marine organism images at different angles were captured. Although most marine organisms can be captured horizontally, some may not swim horizontally, so experiments were executed on marine organism images at an angle of 45° to see whether the organisms could still be detected. Fig. 10 shows the detection results of YOLOv3 and YOLOv4-embedding at the 45° angle: YOLOv4-embedding detected the marine organisms exactly, whereas YOLOv3 detected fewer of them. This is because YOLOv4-embedding generalizes better across different viewing angles of marine organisms. Fig. 11(a) shows the YOLOv4 detectors on the MS COCO dataset.
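The DEPCU stack described above (CBL+CBM+CBL in place of a single CBM) might be sketched as follows in PyTorch; the channel widths, kernel sizes, and 0.1 Leaky slope are assumptions for illustration, since the paper does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def _conv(in_ch, out_ch, act):
    """Conv + BatchNorm + activation with an assumed 3x3 kernel."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        act,
    )

def depcu(in_ch, out_ch):
    """DEPCU = CBL + CBM + CBL, replacing a single CBM in the neck."""
    return nn.Sequential(
        _conv(in_ch, out_ch, nn.LeakyReLU(0.1)),   # CBL
        _conv(out_ch, out_ch, Mish()),             # CBM
        _conv(out_ch, out_ch, nn.LeakyReLU(0.1)),  # CBL
    )
```

The deeper three-block stack trades a small amount of extra computation for richer feature extraction in the neck, consistent with the slightly lower detection speed reported for the embedding variants.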

Conclusion
Accurate marine organism detection is of major significance for the intelligent management of aquaculture ponds. This paper proposed a detection method based on the latest YOLOv4 neural network for marine organism detection under natural conditions. In addition, the performance of the EfficientDet-D3 algorithm and the YOLOv4-based algorithms in marine organism detection was analyzed. According to the experimental results, the following conclusions can be drawn: (1) Deep learning algorithms are suitable for marine organism detection in aquaculture ponds. The structural features of the YOLOv4-embedding neural network and the key issues of detection in aquaculture ponds were analyzed. In the network, CSPDarknet53 deepens the network so that it can extract deeper marine organism features and reduce background interference, and the SPP architecture increases the receptive field of the network features at little computational expense.
In addition, the FPN+PAN architecture merges multi-scale features to extract deeper semantic and positional information about marine organisms, which makes detection more accurate. Accurate detection can still be achieved even when the sizes of marine organisms in the same image differ greatly; the CIoU algorithm improves the confidence of marine organism detection results, and the YOLOv4-embedding algorithm achieves 60.8% AP50 on the MS COCO dataset in Fig. 11.
(2) The marine organism detection algorithm for aquaculture ponds based on the YOLOv4-embedding neural network can detect precisely under different occlusion and illumination conditions, for different breeds and degrees of maturity, providing information for intelligent marine organism management and underwater machines. As shown by the experiments, different marine organisms can be recognized and classified effectively by introducing the improved CSPNet [10] architecture into the neck of YOLOv4. Furthermore, optimizing the gradient backpropagation path improves the network's learning ability and greatly reduces the amount of computation while maintaining accuracy.
The detection performance of the YOLOv4-based algorithms is better than that of EfficientDet-D3 for marine organism detection in aquaculture ponds. For YOLOv4, YOLOv4-embedding, and YOLOv4-DEPCU, the average detection times were 19.31 ms, 19.46 ms, and 20.20 ms, and the mAP75 values for marine organisms were 82.68%, 85.6%, and 84.75%, respectively. In the training stage, YOLOv4-embedding obtained the optimal weight model after 300 epochs. Moreover, high FPS remains a characteristic of YOLOv4-embedding.
In the detection stage, YOLOv4-embedding was superior to the other algorithms with its high confidence and high mAP. In conclusion, the proposed architecture is suitable for marine organism detection in aquaculture ponds. Future work will mainly focus on obtaining the real-world coordinates of marine organisms, achieving their localization, calculating the location of the picking point, deploying the marine organism detection model on small terminals, and developing a mobile mechanism. It will also aim to increase the ability to identify the characteristics of marine species in aquaculture ponds and to reduce background noise interference. In addition, this algorithm can cope with the reduced recognition ability caused by different depths and light intensities, and increasing the training data and training iterations can effectively improve recognition accuracy. Thus, this research provides a new solution to the problem of species identification in aquaculture ponds, making it possible to grasp the status of individual fish and species more accurately, to discover and solve problems in the ponds quickly, and to contribute to reducing aquaculture risks.