Convolutional Neural Network
With the application of convolutional neural networks to image segmentation, the shortcomings of earlier segmentation methods (such as threshold[44], edge detection[45], and region-based[46] methods) have become increasingly obvious. Threshold-based segmentation sets one or more grayscale thresholds, partitions the grayscale histogram into ranges, and treats pixels falling within the same range as belonging to the same class[47]. This method is simple and computationally efficient, but it performs poorly on images that contain rich information.

Edge-based segmentation is a classic technique whose basic principle is to analyze brightness differences between pixels to locate possible boundaries[49]. If the brightness difference between a pixel and its neighbors is significant, that pixel is assumed to lie on a boundary[50]. By detecting and connecting such boundary pixels, edges are formed and the image is divided into regions[51]. This method can effectively distinguish different regions, but image noise may interfere with boundary detection, so appropriate preprocessing and filtering are required[52].

Region-based segmentation connects pixels with similar features and thereby separates different regions[53]. Compared with the other methods, it can effectively reduce the impact of missing spatial continuity on the segmentation result[54]. However, it is prone to improper segmentation, in which pixels that belong to the same region are split into different regions, degrading the result[55].
Therefore, when using this method, the parameters must be selected and tuned carefully. With the development of convolutional neural networks, they have become able to exploit image information effectively to solve a variety of segmentation problems and achieve accurate image segmentation[56], and the field has since developed greatly. Current research on image segmentation methods based on convolutional neural networks mainly covers fully convolutional networks (FCN), the DeepLab family, the Mask Region-based Convolutional Neural Network (Mask R-CNN), and the Pyramid Scene Parsing Network (PSPNet).
2.1 Image segmentation model based on FCN
FCN first applied convolutional neural networks to image segmentation, laying the foundation for the basic network framework of segmentation models[57]. Earlier convolutional neural networks usually stack several convolutional layers and perform feature mapping in a final fully connected layer. In contrast, FCN is fully convolutional: the fully connected layers are replaced by convolutions, and a deconvolution (upsampling) operation restores the output of the final layer to the resolution of the input.
FCN uses a convolutional neural network to extract image features and then uses deconvolution to upsample the feature maps, restoring their resolution. To obtain more accurate segmentation results, FCN uses a skip connection mechanism to fuse feature maps of different resolutions[58]. As shown in Figure 1, the high-resolution and low-resolution feature maps are fused by element-wise addition. A 1×1 convolutional layer then serves as the classification layer, reducing the dimensionality of the feature map and outputting a classification result for each pixel. The output of the classification layer is a probability map giving the probability that each pixel belongs to each category; in segmentation, the category with the highest probability is taken as that pixel's label. Depending on which pooling layers' features are fused, the results are denoted FCN-32s, FCN-16s, and FCN-8s. The segmentation results in Figure 2 show that, owing to the fusion of multi-level features, FCN-8s is significantly better than FCN-32s and FCN-16s.
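The skip-connection fusion and 1×1 classification layer described above can be sketched in NumPy. Everything here is a hypothetical stand-in for learned quantities: the channel count, the random feature maps, and the 3-class weight matrix, with nearest-neighbour repetition standing in for the learned deconvolution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps (channels, height, width): a coarse map from a
# deep layer and a finer map from a shallower layer.
deep = rng.normal(size=(16, 4, 4))      # low resolution
shallow = rng.normal(size=(16, 8, 8))   # high resolution

# Upsample the deep map 2x (nearest neighbour stands in for the learned
# deconvolution) and fuse it with the shallow map by element-wise addition.
up = deep.repeat(2, axis=1).repeat(2, axis=2)
fused = up + shallow

# A 1x1 convolution is a per-pixel linear map over channels; W projects
# the 16 feature channels down to 3 hypothetical classes.
W = rng.normal(size=(3, 16))
scores = np.einsum('kc,chw->khw', W, fused)

# Each pixel's label is the class with the highest score.
pred = scores.argmax(axis=0)            # shape (8, 8)
```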
2.2 Image segmentation model based on DeepLab
The DeepLab series of network models is designed for semantic segmentation. To obtain context information, it uses atrous (dilated) convolution to enlarge the receptive field, overcoming the limitation that a traditional convolution operation only sees information in a local receptive field[62]. Atrous convolution enhances feature extraction by inserting holes between the taps of the convolutional kernel, so it can cover a larger image region without adding parameters or computational cost. This method can efficiently handle a large receptive field and improve model performance[63].
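The parameter-free growth of the receptive field can be checked with a small calculation: a k×k kernel with dilation rate d spans d·(k−1)+1 pixels per side while still holding only k·k weights.

```python
def effective_kernel(k, d):
    """Spatial extent (per side) of a k x k kernel with dilation rate d.
    The taps spread apart, but the parameter count stays k * k."""
    return d * (k - 1) + 1

# A 3x3 kernel (9 parameters in every case) at increasing dilation rates:
spans = {d: effective_kernel(3, d) for d in (1, 2, 4, 8)}
# spans == {1: 3, 2: 5, 4: 9, 8: 17}
```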
In the early DeepLab models[64], as shown in Figure 4(a), the input image first passes through a convolutional feature extraction module. The feature map is then processed by the atrous convolution module, in which kernels with different dilation rates enlarge the receptive field and gather more context information. Next, a global pooling module compresses the feature map into a vector that captures the contextual information of the whole image. Finally, the pooled output is expanded back to the size of the original input image by bilinear interpolation, yielding the final segmentation result[65].
In the DeepLab-V2 model, atrous convolution is used more flexibly through Atrous Spatial Pyramid Pooling (ASPP): exploiting the advantages of atrous convolution, features are extracted at several scales and fused, and a fully connected CRF module is retained for post-processing[66]. On this basis, an encoder-decoder structure built on atrous convolution was proposed. The encoder applies a series of convolutional operations to extract features from the input image while maintaining the integrity of the image information[67]. The decoder upsamples the encoder output and uses skip connections to fuse shallow and deep features, recovering richer image detail[68]. In addition, atrous convolution enlarges the receptive field over the target and preserves its edge and detail information, enabling effective extraction of the target.
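A minimal single-channel sketch of the ASPP idea follows. It is illustrative, not the DeepLab implementation: one shared 3×3 kernel is applied at several dilation rates with "same" padding so the resulting context maps keep the input resolution and can be stacked.

```python
import numpy as np

def dilated_conv2d(x, w, d):
    """Valid 2-D correlation of x with kernel w at dilation rate d."""
    k = w.shape[0]
    span = d * (k - 1) + 1                    # effective kernel extent
    H, W = x.shape
    out = np.zeros((H - span + 1, W - span + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + span:d, j:j + span:d]   # dilated sampling
            out[i, j] = (patch * w).sum()
    return out

def aspp(x, w, rates=(1, 2, 4)):
    """Apply the same kernel at several dilation rates ('same' padding)
    and stack the resulting context maps along a new axis."""
    k = w.shape[0]
    maps = []
    for d in rates:
        pad = d * (k - 1) // 2                # keeps the output size fixed
        maps.append(dilated_conv2d(np.pad(x, pad), w, d))
    return np.stack(maps)                     # (len(rates), H, W)

x = np.random.default_rng(1).normal(size=(12, 12))
w = np.ones((3, 3)) / 9.0                     # a simple averaging kernel
pyramid = aspp(x, w)
```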
In the DeepLab-V3 model, the image is fed into the feature extraction module, and the resulting feature map is processed by a multi-scale atrous convolution module, in which kernels with different dilation rates obtain broader context and richer semantic information[69]. The ASPP module further enlarges the receptive field by convolving the feature map with atrous kernels of different rates, obtaining contextual information over a wide range of scales. A feature fusion module then improves segmentation accuracy by fusing the features extracted by the different modules. Finally, bilinear interpolation restores the output to the size of the input image[66].
In the DeepLab-V3+ model, as shown in Figure 4(b), an encoder-decoder structure is used, where the encoder follows DeepLab-V3[70]. After the image passes through the backbone network, two feature layers are obtained: the high-level feature layer enters the ASPP module in the encoder, while the low-level feature layer goes directly to the decoder, where a 1×1 convolution compresses its channels, effectively reducing the weight of the low-level features. Figure 5 highlights the segmentation results of DeepLab-V3+ on different objects (dogs, people): the first column is the input image and the second the segmentation result. The figure shows that the segmented image clearly separates foreground from background and also preserves the boundary information of the objects. Experimental results show that this algorithm can effectively segment fine-grained images.
2.3 Image segmentation model based on Mask R-CNN
Mask R-CNN is an instance segmentation model based on Faster R-CNN, which adds pixel-level segmentation for each object on top of object detection[71]. The framework of Mask R-CNN is shown in Figure 6. First, the input image passes through a convolutional neural network to extract feature maps, and candidate target regions are generated by the Region Proposal Network (RPN). The Region of Interest (ROI) pooling layer then converts the features inside each candidate region into a feature vector of fixed size.
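The ROI pooling step can be sketched as follows. The feature map, box coordinates, and output grid size are illustrative; the fixed-size output is formed by max-pooling the cropped region over a grid of roughly equal cells.

```python
import numpy as np

def roi_pool(feat, box, out=2):
    """Max-pool the region of a (C, H, W) feature map under `box`
    (x1, y1, x2, y2 in integer feature-map coordinates) into a fixed
    out x out grid, taking one max per grid cell."""
    x1, y1, x2, y2 = box
    crop = feat[:, y1:y2, x1:x2]
    C, h, w = crop.shape
    ys = np.linspace(0, h, out + 1).astype(int)   # row cell boundaries
    xs = np.linspace(0, w, out + 1).astype(int)   # column cell boundaries
    pooled = np.empty((C, out, out))
    for i in range(out):
        for j in range(out):
            cell = crop[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled

feat = np.arange(64, dtype=float).reshape(1, 8, 8)
fixed = roi_pool(feat, (0, 0, 8, 8), out=2)
# fixed[0] == [[27, 31], [59, 63]] -- the max of each quadrant
```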
Next, the extracted feature vectors pass through two fully connected layers that predict the category and bounding-box coordinates of each candidate object. In addition, a branch called the mask head predicts the category of every pixel inside each candidate box, yielding pixel-level segmentation information. Finally, the target boxes are filtered by Non-Maximum Suppression (NMS) and the corresponding masks are output as the model's final result.
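The final NMS step can be sketched as a greedy loop; the box format and the 0.5 IoU threshold here are illustrative defaults.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box and
    drop every other box that overlaps it by more than `thresh` IoU."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        mask = [iou(boxes[best], boxes[j]) < thresh for j in rest]
        order = rest[mask]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
# Box 1 overlaps box 0 heavily (IoU ~0.68) and is suppressed; box 2 survives.
kept = nms(boxes, scores, thresh=0.5)   # -> [0, 2]
```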
Unlike semantic segmentation models such as FCN and DeepLab, Mask R-CNN performs instance segmentation on top of semantic segmentation. Compared with earlier instance segmentation models such as FCIS[72] and MNC[73], Mask R-CNN is more flexible and more accurate, and can serve a wider range of image processing tasks, including instance segmentation[74] and object detection[75].
Figure 7 illustrates the segmentation effect of the Mask R-CNN model on objects in different information environments. The model not only enables accurate detection and localization of objects in the image but also performs precise segmentation, thereby allowing differentiation of individual instances within the same class of objects.
2.4 Image segmentation model based on PSPNet
PSPNet is a model for semantic segmentation of scene objects that makes full use of contextual information to parse complex environments. It was the first to append a pyramid pooling module to the convolutional feature map produced by the backbone network: features are pooled over sub-regions of different sizes and then upsampled[76]. In this way the model combines the features of each sub-region into a representation containing both local and global information. Finally, a softmax layer classifies the fused features and a convolution produces the final prediction for every pixel. The model offers multi-scale analysis, feature sharing, gradual refinement, and robustness, which improve the accuracy and stability of image recognition, making PSPNet a commonly used deep learning network in computer vision tasks, with remarkable results in semantic segmentation of scene objects[77].
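The pyramid pooling module described above can be sketched in NumPy, assuming the bin sizes divide the feature-map resolution evenly and using nearest-neighbour repetition in place of the bilinear upsampling of the real model.

```python
import numpy as np

def pool_to_bins(x, n):
    """Average-pool a (C, H, W) map into an n x n grid (H, W divisible by n)."""
    C, H, W = x.shape
    return x.reshape(C, n, H // n, n, W // n).mean(axis=(2, 4))

def pyramid_pool(x, bins=(1, 2, 3, 6)):
    """PSPNet-style pyramid pooling sketch: pool at several grid sizes,
    upsample each grid back to the input resolution, and concatenate
    with the original map along the channel axis."""
    C, H, W = x.shape
    levels = [x]
    for n in bins:
        p = pool_to_bins(x, n)
        levels.append(p.repeat(H // n, axis=1).repeat(W // n, axis=2))
    return np.concatenate(levels, axis=0)

feat = np.random.default_rng(0).normal(size=(8, 6, 6))
out = pyramid_pool(feat)   # 8 * (1 + 4 levels) = 40 channels at 6x6
```

The n=1 level is the global pooling branch: every spatial position of those channels carries the channel-wide mean, i.e. image-level context.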
PSPNet is an advanced semantic segmentation model that adapts to a variety of complex scenes and tasks. Through the pyramid pooling module it combines contextual information from different regions and strengthens the global feature representation[78]. The model also adopts a deeply supervised optimization strategy to train the deep network, and it achieves excellent segmentation results on multiple datasets, surpassing models such as FCN and DeepLab-V2[79]. However, the model also has some drawbacks, such as limited precision under occlusion and inaccurate boundary segmentation for some targets (humans, airplanes, cows), as shown in Figure 8.
2.5 Comparison and analysis of experimental results
Qualitative analysis was conducted on the above methods using three datasets commonly used in image segmentation, PASCAL VOC[81], Microsoft COCO[82], and Cityscapes[83], to compare the performance of convolutional neural network-based segmentation methods and obtain objective, fair test results, as shown in Figure 9. The comparative experiments on the Microsoft COCO dataset show that, relative to the semantic segmentation ground truth (Figure 9(b)), the FCN-8s method (Figure 9(d)) can effectively distinguish the classes of most objects. PSPNet (Figure 9(e)) can classify most targets and achieves good results even in traffic scenes with complex image content. DeepLab-V3+ (Figure 9(f)) can effectively segment most objects and handles boundary details well, giving a very clear overall result. Mask R-CNN (Figure 9(g)) performs instance segmentation: compared with another instance segmentation method (Figure 9(c)), it achieves high classification accuracy while separating different individuals of the same class on top of the semantic segmentation. In summary, FCN, PSPNet, and DeepLab-V3+ can perform semantic segmentation effectively, while Mask R-CNN is suited to instance segmentation and classifies objects very accurately.
The performance of the image segmentation methods introduced in this paper is compared quantitatively with other convolutional neural network-based methods under the existing experimental conditions. The results are shown in Tables 1-3: the mean intersection-over-union (MIoU) is used as the accuracy measure in Tables 1 and 2, and pixel accuracy (PA) in Table 3. Tables 1 and 2 show that, compared with the other segmentation methods, DeepLab-V3+ obtains the highest accuracy on the PASCAL VOC and Cityscapes test datasets, with 89.1% and 82.0% respectively. Table 3 shows that, compared with the other two methods, Mask R-CNN achieves the highest segmentation accuracy on the Microsoft COCO dataset, with pixel accuracy reaching 37.00%.
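The two accuracy measures used in Tables 1-3 can both be computed from a per-class confusion matrix; a minimal sketch on a toy 2-class prediction:

```python
import numpy as np

def confusion(pred, gt, n_cls):
    """n_cls x n_cls confusion matrix; rows = ground truth, cols = prediction."""
    m = np.zeros((n_cls, n_cls), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        m[g, p] += 1
    return m

def mean_iou(pred, gt, n_cls):
    """Mean intersection-over-union across classes (the MIoU of Tables 1-2)."""
    m = confusion(pred, gt, n_cls)
    tp = np.diag(m)                                   # per-class intersection
    union = m.sum(axis=0) + m.sum(axis=1) - tp        # per-class union
    return (tp / union).mean()

def pixel_accuracy(pred, gt):
    """Fraction of pixels labelled correctly (the PA of Table 3)."""
    return (pred == gt).mean()

gt   = np.array([[0, 0, 1], [1, 1, 1]])
pred = np.array([[0, 1, 1], [1, 1, 0]])
# Class 0: IoU 1/3; class 1: IoU 3/5 -> MIoU = (1/3 + 3/5) / 2; PA = 4/6
```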
Table 1 Performance of various segmentation methods on the PASCAL VOC dataset

| No. | Segmentation method | MIoU/% |
| 1 | DeepLab-v3+ | 89.1 |
| 2 | DeepLab-v3 | 85.6 |
| 3 | DeepLab-v2 | 79.8 |
| 4 | PSPNet | 85.5 |
| 5 | FCN-8s | 67.1 |
| 6 | CRF-RNN | 74.8 |
| 7 | DPN | 77.4 |
Table 2 Performance of various segmentation methods on the Cityscapes dataset

| No. | Segmentation method | MIoU/% |
| 1 | DeepLab-v3+ | 82.0 |
| 2 | DeepLab-v3 | 81.2 |
| 3 | DeepLab-v2 | 70.3 |
| 4 | PSPNet | 81.1 |
| 5 | FCN-8s | 65.2 |
| 6 | CRF-RNN | 62.4 |
| 7 | DPN | 66.7 |
Table 3 Performance of various segmentation methods on the Microsoft COCO dataset

| No. | Segmentation method | PA/% |
| 1 | Mask R-CNN | 37.00 |
| 2 | FCIS | 33.50 |
| 3 | MNC | 24.50 |