Aquaculture has grown at an average annual rate of 5.3% since the turn of the century and has been one of the fastest-growing food industries worldwide for nearly five decades (Olsen, R.L., & Hasan, M.R., 2012). As aquaculture has expanded in scale, the automatic counting of farmed fish has become a critical task for efficient production management. By accurately counting the fish in the culture environment, producers can better monitor stocking density and help improve fish survival and feed conversion rates, thus boosting economic efficiency (Li, D., 2020). However, manual and statistical approaches are currently used in aquaculture to count fish (Li, D., Hao, Y., & Duan, Y., 2019). Manual counting is time-consuming and labor-intensive, and invasive counting methods affect the welfare and health of the fish. Therefore, a non-invasive and accurate method for counting farmed fish is needed to guide production scientifically.
Computer-vision-based automatic fish counting minimizes disturbance to the aquaculture environment and therefore leaves normal fish behavior largely unaffected. The aquaculture field is paying increasing attention to this approach. Morais et al. (2005) obtained an overall counting accuracy of 81% by using log-likelihood ratios to segment foreground from background and Bayesian filtering to track changes in the number of fish fry. The method works reliably under extreme environmental changes and can handle problems such as occlusion and large inter-frame movements. Toh et al. (2009) labeled the blob positions of the fish, filtered out noise and background objects through image processing, and calculated the blob area to obtain the number of fish. This was effective for counting a small fish population, but errors accumulated as the number of fish increased because the result was sensitive to the area threshold. Labuguen et al. (2012) photographed fish fry in a specially designed container, binarized and edge-detected the images, and then used image processing to measure the pixel area enclosed by each fish contour and the total area within the shapes. This experiment achieved high counting accuracy by restricting the experimental conditions to reduce overlap and occlusion of the fry in the water. Hernández-Ontiveros et al. (2018) used a low-cost, high-performance embedded system to count two species of marine ornamental fish through digital image processing. As early as 1995, the backpropagation algorithm was used to count fish in images (Newbury et al., 1995). More recently, Fan et al. (2013) used the Otsu algorithm for threshold segmentation of the connected regions of fish and built a counting model using backpropagation networks and least squares support vector machines, obtaining an average counting accuracy of 98.73%.
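The area-based counting idea described above can be illustrated with a minimal sketch. It is not the implementation of any cited study: the function names, the 4-connectivity choice, the `min_area` noise filter, and the toy mask are all assumptions made for illustration. Each connected foreground blob contributes a count proportional to its area divided by an assumed single-fish area.

```python
from collections import deque


def label_blobs(mask):
    """Return the pixel areas of 4-connected foreground blobs in a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first flood fill over one blob.
                queue = deque([(r, c)])
                seen[r][c] = True
                area = 0
                while queue:
                    y, x = queue.popleft()
                    area += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                areas.append(area)
    return areas


def count_by_area(mask, single_fish_area, min_area=2):
    """Estimate the count from blob areas (hypothetical rule, for illustration)."""
    total = 0
    for area in label_blobs(mask):
        if area < min_area:
            continue  # treat very small blobs as noise
        # A blob counts as at least one fish; larger blobs count as several.
        total += max(1, round(area / single_fish_area))
    return total


# Toy binary mask: one isolated fish, one blob of two touching fish, one noise pixel.
mask = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
]

print(count_by_area(mask, single_fish_area=4))  # area assumption matches the data
print(count_by_area(mask, single_fish_area=3))  # slightly wrong area assumption
```

With the assumed single-fish area set to 4 pixels the sketch returns 3, whereas setting it to 3 pixels returns 4, which illustrates why such methods are sensitive to the area threshold as overlapping blobs become more common.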
All of the above methods rely on expert experience for foreground segmentation and overlook how segmentation accuracy is degraded by artifacts from water fluctuations and by the overlap and shape changes caused by continuous motion. Counting accuracy suffers as a result. In addition, the fish images used in these studies were collected in ideal environments, which differ considerably from real culture environments.
Fish counting methods based on deep learning extract features automatically by exploiting the learning ability of convolutional neural networks (CNNs), so no human experience is needed to segment the foreground. Previous studies on deep-learning-based fish counting can be divided into two categories: counting based on target detection and counting based on density map regression. Detection-based methods first detect the fish targets and then count them from the detection results. Lainez et al. (2019) examined the effectiveness of CNNs for fish counting by detecting and then counting the fish in images. Tseng et al. (2020) used Mask R-CNN to identify and count fish in an electronic monitoring system, obtaining a counting accuracy of 77.31%. Connolly et al. (2021) used Faster R-CNN as the network framework to detect fish in data collected by a baited remote underwater video system (BRUVS). With the advancement of single-stage target detection algorithms, Kandimalla et al. (2021) used the YOLOv3 and Mask R-CNN networks to count fish targets in the high-resolution acoustic imaging dataset DIDSON, and the YOLOv3 network reached a detection speed of 24 fps. Allken et al. (2021) used the Keras framework to recognize and count sparse targets in trawl nets. These studies show that the detection-based counting approach is effective for fish counting in non-production environments such as surveillance systems.
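The detect-then-count principle can be sketched as follows, assuming a detector has already produced bounding boxes with confidence scores. This is a generic illustration rather than the pipeline of any study cited above; the box format, both thresholds, and the sample detections are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_detections(detections, score_thr=0.5, iou_thr=0.5):
    """Count fish as the boxes that survive score filtering and greedy NMS.

    detections: list of (box, score) pairs; thresholds are illustrative defaults.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_thr),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, _score in candidates:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(box, k) <= iou_thr for k in kept):
            kept.append(box)
    return len(kept)


# Hypothetical detector output for one frame.
detections = [
    ((10, 10, 50, 30), 0.92),
    ((12, 11, 52, 31), 0.80),       # near-duplicate of the first box
    ((60, 40, 100, 60), 0.75),
    ((200, 200, 220, 210), 0.30),   # low-confidence detection
]

print(count_detections(detections))  # 2
```

Because the count equals the number of boxes that survive suppression, heavily overlapping fish tend to be merged into a single detection, which is one reason this family of methods suits sparse scenes better than dense ones.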
Counting methods based on density map regression are more suitable for dense fish targets in a factory farming environment. Zhang et al. (2020) used a multi-column convolutional neural network (MCNN) as the front-end network and a dilated convolutional neural network (DCNN) as the back-end network to build a hybrid neural network model for counting fish fry, obtaining a counting accuracy of 95.06%. Yu et al. (2021) improved the MCNN with an attention mechanism to generate the density map of the fish school image and obtained the count by integrating the map, with a counting accuracy of 94.69%. Yu et al. (2022) proposed a multiscale dense residual connectivity network based on MCNN that generated density maps accurately reflecting the distribution of fish populations, reducing the mean absolute error by 21.8% and the mean square error by 22.8% under high-density conditions. Density map regression is well suited to counting fish in industrial farming because it can estimate the number of cultured fish in a dense state more precisely. However, owing to the structural characteristics of convolutional neural networks, the density maps generated from the original images after multi-layer feature extraction and downsampling are usually of low resolution or poor quality. The low-quality density map is likely to be one of the factors limiting counting accuracy.
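The relationship between a density map and the count can be made concrete with a short sketch. It assumes the common convention of placing one normalized Gaussian kernel at each annotated fish position, so that the map integrates to the number of fish; the kernel width, map size, point coordinates, and the `sum_pool` helper are illustrative assumptions and are not taken from the cited studies.

```python
import math


def density_map(points, height, width, sigma=1.5):
    """Sum of per-fish Gaussian kernels, each normalized to unit mass in the image."""
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        weights = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / total
    return dmap


def integrate(dmap):
    """The estimated count is the sum over all density values."""
    return sum(map(sum, dmap))


def sum_pool(dmap, factor):
    """Downsample by summing factor x factor blocks, which preserves total mass."""
    h, w = len(dmap) // factor, len(dmap[0]) // factor
    return [
        [sum(dmap[y * factor + dy][x * factor + dx]
             for dy in range(factor) for dx in range(factor))
         for x in range(w)]
        for y in range(h)
    ]


# Three hypothetical fish annotations (row, column) in a 16 x 16 image.
points = [(4, 4), (5, 6), (12, 10)]
full = density_map(points, 16, 16)
coarse = sum_pool(full, 4)  # 4 x 4 map, similar to a heavily downsampled output

print(round(integrate(full), 6), round(integrate(coarse), 6))
```

Both maps integrate to the same count, but in the coarse map the two neighboring fish fall into a single cell, so their individual positions can no longer be distinguished. This loss of spatial detail is the kind of resolution limitation that motivates higher-quality density maps.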
To enhance the quality of the density map and produce precise counting results, this study proposes a farmed fish counting model (MAT) based on multi-column dilated convolution, an attention mechanism, and the Swin Transformer. The MAT model consists of a feature extraction module and an attention module. The feature extraction module is a pyramid-like network structure built on the Swin Transformer block; it integrates multi-layer information and improves the resolution of the output feature map. The attention module is a back-end network made up of a multi-column dilated convolution and a residual attention mechanism. This module further processes the multi-scale information and treats the fish aggregation region as a critical region when generating the final density map. The ultimate goal of this study is to offer an effective and reliable counting scheme for cultured fish, which can provide theoretical support for subsequent tasks such as cultured fish biomass estimation.
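As background for the multi-column dilated convolution mentioned above, the following sketch shows how parallel columns with different dilation rates observe the same input at different scales. It is a generic illustration and not the MAT architecture: the fixed all-ones kernel, the dilation rates, and the averaging of columns are assumptions made for clarity, whereas a trained network would learn its kernels and might fuse the columns differently.

```python
def dilated_conv2d(image, kernel, dilation):
    """Same-size 2D cross-correlation with zero padding and a dilated square kernel."""
    h, w = len(image), len(image[0])
    k = len(kernel)  # odd kernel size assumed
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_column(image, kernel, dilations=(1, 2, 3)):
    """Average parallel columns that differ only in dilation rate (assumed fusion)."""
    columns = [dilated_conv2d(image, kernel, d) for d in dilations]
    h, w = len(image), len(image[0])
    return [
        [sum(col[y][x] for col in columns) / len(columns) for x in range(w)]
        for y in range(h)
    ]


def receptive_field(kernel_size, dilation):
    """Side length covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# A single bright pixel in a 9 x 9 image shows where each column responds.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
ones = [[1.0] * 3 for _ in range(3)]

fused = multi_column(image, ones)
print([receptive_field(3, d) for d in (1, 2, 3)])  # [3, 5, 7]
```

With a 3 x 3 kernel, dilation rates 1, 2, and 3 cover windows of 3, 5, and 7 pixels without adding parameters, which is why such columns are a common way to handle fish that appear at different sizes.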
The remainder of this paper is organized as follows: Section 2 presents the data acquisition and labeling and the MAT network design; Section 3 gives the experimental results; Section 4 presents the ablation experiments and a discussion of the model; Section 5 provides the conclusion and outlook.