Image segmentation with boundary-to-pixel direction and magnitude based on watershed and attention mechanism

An improved image segmentation algorithm with boundary-to-pixel direction and magnitude (IS-BPDM) is proposed to handle small-region segmentation while preserving the accuracy of edge segmentation. First, we develop a BPDM network embedded with watershed and attention modules and use an adaptive loss function to learn a robust and accurate BPDM for each pixel; the BPDM is a two-dimensional vector, with direction and magnitude, pointing from the pixel's nearest boundary pixel to the pixel itself. Then, we use the learned BPDMs to obtain refined initial segmented regions, exploiting the priors that pixels near a boundary have shorter magnitudes while pixels near a root pixel have longer magnitudes, that adjacent pixels in different regions, or nearby pixels on both sides of a root pixel within one region, have opposite directions, and that nearby pixels in the same region have similar directions. Last, we apply a fast grouping method based on direction similarity to merge these initial regions into the final segmentation. Experimental results on public datasets show that, compared with state-of-the-art segmentation methods, the proposed IS-BPDM achieves better segmentation accuracy and high computational efficiency, and performs particularly well on small regions.


Introduction
Image segmentation is the basis of target detection and image classification and has become a key step in artificial intelligence applications. It aims to divide an image into nonoverlapping regions such that the pixels in each region share a distinctive perceptual appearance, e.g., color, texture or intensity. Typically, image segmentation techniques can be divided into traditional methods and the later emerging deep learning methods [1]. Many traditional image segmentation methods are unsupervised, relying on regions [2], thresholds [3][4][5], boundaries [6], graph theory [7], energy functionals [8] and so on. Although these methods have been widely used to segment images with simple structure, insufficient prior knowledge easily leads to unsatisfactory performance on the weak boundaries of natural images [9][10][11]. In addition, the large cost of converting contours into segmentations makes these methods difficult to implement.
With the development of neural networks, state-of-the-art image segmentation techniques [12][13][14][15][16][17] based on deep learning are mainly end-to-end approaches and have witnessed significant progress in both accuracy and computational efficiency. The milestone approach is the fully convolutional network (FCN) [16], which adapts contemporary classification networks with a skip-connection architecture, followed by an upsampling deconvolution network, to accomplish semantic segmentation. Subsequently, novel approaches such as DeepLab [17], based on atrous convolutions, were proposed to handle the problem of segmenting objects at multiple scales. To make adequate use of the semantic context of the image scene, [18,19] proposed conditional generative adversarial networks (cGAN) to solve the general pixel-to-pixel mapping problem and automatically learn how to segment the image accurately. [20] trained a semantic segmentation network end-to-end and pixel-to-pixel to reduce parameter redundancy and time cost. However, most of these methods fail to identify weak boundaries and some small objects, so later works integrated traditional methods into deep networks to overcome these limitations.
One promising way of addressing the limitations of both traditional methods and the aforementioned deep learning approaches is to use the boundary-to-pixel direction (BPD) [21][22][23] of each pixel to improve segmentation performance. The BPD is learned and represents the relative position between each pixel and its nearest boundary pixel. The best-performing method [21] merges pixels with similar BPDs to form super-BPDs, ensuring that all pixels in the same super-BPD share robustly similar directional characteristics.
Although super-BPD [21] usually achieves a good trade-off between accuracy and efficiency in image segmentation, some challenges remain. On the one hand, the learned BPDs are not accurate enough for weak edges and small regions because of dramatic changes of direction. On the other hand, the BPD considers only the direction of a pixel and ignores the magnitude, which represents the distance from the pixel to the boundary. These two drawbacks lead to poor performance on small regions and overlapping targets. In fact, the watershed handles overlapping regions and weak boundaries well, and the attention module can pay more attention to small and weak edge characteristics. Therefore, we propose a BPDM network embedded with watershed and attention modules and use an adaptive loss function to learn a robust and accurate BPDM for each pixel. On this basis, we use the direction similarity and magnitude of the learned BPDMs to achieve the final segmentation.
To conclude, our contributions are as follows:
• Proposing a novel BPDM network and loss function to obtain robust and accurate BPDMs, which effectively improves the accuracy of BPDMs on small regions and weak edges.
• Improving the segmentation algorithm by using the prior properties of BPDMs to refine boundary pixels and root pixels, which yields a pleasing image segmentation result.
• Demonstrating, through experiments on three datasets, that the proposed IS-BPDM achieves competitive performance against some state-of-the-art methods.
The rest of the paper is organized as follows: The related techniques are briefly described in Sect. 2. Details of our BPDM learning approach are proposed in Sect. 3. Image segmentation with BPDMs is introduced in Sect. 4. Datasets, implementation details and experimental results are displayed in Sect. 5. Then, the conclusion is indicated in Sect. 6.

Related work
We briefly review image segmentation works leveraging the watershed algorithm, attention modules and direction information.

Watershed algorithm
Watershed algorithms [24][25][26] are boundary-based segmentation methods that use extracted object contour features. In many scenes, overlap between multiple objects in an image leads to the wrong merging of smaller objects into larger regions, which is a challenge for image segmentation. The watershed algorithm performs satisfactorily on weak edges, overlapping objects and small regions. However, using the watershed alone often leads to over-segmentation, and it cannot merge the resulting pieces into one component to produce a correct semantic segmentation when segmenting overlapping objects. [24] uses a two-phase superpixel segmentation method based on the watershed transformation with global and local boundary marching, achieving superior accuracy and efficiency. The mutex watershed algorithm [25] learns local attractive and repulsive edges, followed by an improved maximum spanning tree, to achieve good image segmentation. The marker watershed algorithm [26] combines the watershed with end-to-end CNNs to avoid the complex processing steps of most pipelines and improves segmentation performance. Recently, [27] utilized the marker watershed to separate adherent cells and avoid over-segmentation, and [28] proposed a watershed seed-point marking method to form an adaptive watershed segmentation algorithm; both handle adhesion and edges well. In this paper, BPDMs represent the boundary-to-pixel direction and magnitude, and learning them depends on accurate location information of boundary pixels. Therefore, the accuracy of boundary contour extraction determines the accuracy of BPDM learning. We use the watershed module to preprocess the original image so that the weak edge contours of objects are emphasized and accurate BPDMs can be obtained.

Attention module
The attention mechanism originates from imitating human visual perception and plays a vital role in the sensory system [29]. The sensory system can focus on local scenes, transfer limited visual attention to local areas of interest, and selectively capture the more important visual structure information [29]. With the rise of CNNs, many works have shown that adding an attention mechanism to a CNN improves the feature expression ability of the network [30,31]. For example, the SE module proposed in [30] introduces attention only on the channel dimension, while CBAM, proposed in [31], considers attention along both the channel and spatial dimensions. CBAM and SE modules can be embedded in mainstream networks, improving the feature extraction ability of the model while keeping the amount of computation under control. Based on these, this paper adds CBAM to the BPDM network to learn robust BPDMs.

BPD Learning
Inspired by algorithms for computing component trees [22,23], the BPD [21] provides direction information for each pixel and effective cues for superpixels [5]. Because nearby pixels from different regions have opposite directions and adjacent pixels in the same region have similar directions, it is convenient to group and merge pixels according to direction similarity. In [21], BPDs are learned with an FCN structure that adds an ASPP layer [17] to enlarge the receptive field during downsampling, and are then partitioned into super-BPDs using robust direction similarity. Although super-BPD can separate nearby regions with weak boundaries, its segmentation of small regions is unsatisfactory because the learned BPDs around small regions are not very accurate. To solve this issue, a novel BPDM network is proposed, in which a watershed module and a CBAM module are added to the FCN to learn a robust and accurate BPDM for each pixel. Then, besides direction similarity, the magnitude of the learned BPDMs is also used to effectively produce root pixels and initial segmentations, followed by a region adjacency graph (RAG) partition algorithm to accomplish the final image segmentation. The proposed IS-BPDM effectively improves the segmentation accuracy of edges and small regions.

BPDM definition
For each pixel p in the image, we search for its nearest boundary pixel B_p, and the BPDM of p is given by

DM_p = p − B_p,

where DM_p is a two-dimensional direction vector pointing from B_p to p.
DM_p provides cues about both direction and magnitude. The direction is used to calculate the similarity between p and other pixels, and the magnitude is used to determine whether p is a boundary pixel or a root pixel.
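For concreteness, the ground-truth BPDM field of a label image can be computed with a Euclidean distance transform, which returns, for each pixel, the coordinates of its nearest boundary pixel. The sketch below (the function name and the 4-neighbour boundary convention are ours) follows the definition DM_p = p − B_p:

```python
import numpy as np
from scipy import ndimage

def compute_bpdm(labels):
    """Ground-truth BPDM field of a label image: for every pixel p,
    find its nearest boundary pixel B_p and return DM_p = p - B_p."""
    # A pixel is a boundary pixel if any 4-neighbour has a different label.
    boundary = np.zeros(labels.shape, dtype=bool)
    boundary[:-1, :] |= labels[:-1, :] != labels[1:, :]
    boundary[1:, :] |= labels[1:, :] != labels[:-1, :]
    boundary[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    boundary[:, 1:] |= labels[:, 1:] != labels[:, :-1]
    # The EDT of the non-boundary mask also yields, per pixel, the
    # coordinates (iy, ix) of the nearest boundary pixel.
    _, (iy, ix) = ndimage.distance_transform_edt(~boundary, return_indices=True)
    yy, xx = np.mgrid[0:labels.shape[0], 0:labels.shape[1]]
    # Stack the two components; the vector's norm is the boundary distance.
    return np.stack([yy - iy, xx - ix], axis=-1).astype(np.float32)

# Toy example: two regions split between columns 3 and 4.
labels = np.zeros((8, 8), dtype=np.int32)
labels[:, 4:] = 1
dm = compute_bpdm(labels)
```

On this toy image, boundary pixels (columns 3 and 4) point to themselves and get zero vectors, while every other pixel's magnitude equals its distance to the region boundary.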

Architecture of BPDM network
The quality of the BPDMs directly affects the performance of subsequent image segmentation. As shown in Fig. 1, the proposed BPDM network includes a WA feature extraction module and a multiscale feature fusion module and learns an accurate BPDM for each pixel.

WA feature extraction
Considering that many image segmentation methods are not accurate enough on small regions, a watershed module and an attention module are embedded in the WA feature extraction module to remedy this issue. In the watershed module, mathematical morphological transformations are used to mark the foreground and background of the image, producing the marked input image shown in Fig. 2a. Then, the watershed algorithm extracts the contour features of objects to realize a rough segmentation, as shown in Fig. 2b. Owing to the complexity of image information, an attention mechanism is also added to the down-sampling network to assign different weights to pixel features, so that weak edges and the boundaries of small regions are attended to.
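As a sketch of this preprocessing step, a marker-based watershed with morphological marking can be written with SciPy alone (the thresholding rule, erosion depths and marker layout here are illustrative choices, not the paper's exact pipeline):

```python
import numpy as np
from scipy import ndimage

def marker_watershed_contours(gray):
    """Mark sure foreground/background morphologically, flood the
    gradient image from the markers, and return the watershed labels
    plus the emphasized region contours."""
    fg = gray > gray.mean()                                  # coarse foreground
    sure_fg = ndimage.binary_erosion(fg, iterations=2)       # sure object cores
    # border_value=1 keeps background pixels at the image border.
    sure_bg = ndimage.binary_erosion(~fg, iterations=2, border_value=1)
    markers = np.zeros(gray.shape, dtype=np.int16)
    markers[sure_bg] = 1                                     # background basin
    lbl, _ = ndimage.label(sure_fg)
    markers[sure_fg] = lbl[sure_fg] + 1                      # one basin per object
    # Flood the morphological gradient from the markers.
    grad = ndimage.morphological_gradient(gray, size=3)
    ws = ndimage.watershed_ift(grad, markers)
    # Contours = pixels where adjacent watershed labels differ.
    contours = np.zeros(gray.shape, dtype=bool)
    contours[:-1, :] |= ws[:-1, :] != ws[1:, :]
    contours[:, :-1] |= ws[:, :-1] != ws[:, 1:]
    return ws, contours

# Toy image: one bright square object on a dark background.
gray = np.zeros((20, 20), dtype=np.uint8)
gray[5:15, 5:15] = 200
ws, contours = marker_watershed_contours(gray)
```

The contour mask is what the WA module feeds forward: even a weak object edge becomes an explicit watershed ridge.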
In forward propagation, the BPDM network uses five groups of convolution layers and four max pooling layers and embeds the attention module behind the fourth pooling layer to extract an attention feature map that serves as the input of the fifth convolution group. The feature map output by the last convolution layer is dilated with rates 2, 4, 8 and 16, respectively, in the ASPP layer, and the results are concatenated as the output of the WA feature extraction module.

Figure 3 illustrates the details of the CBAM module, which extracts features along both the channel and spatial dimensions. This module can be integrated into any CNN architecture seamlessly with negligible overhead and trained end-to-end along with the base CNN. The intermediate feature map F_pool4 output by the fourth pooling layer is the input of the channel attention module, the feature map F_1 output by the channel attention module is the input of the spatial attention module, and the final feature map is F_2. The whole process of CBAM is as follows:

F_1 = A_c(F_pool4) ⊗ F_pool4,
F_2 = A_s(F_1) ⊗ F_1,

where ⊗ denotes element-wise multiplication, and A_c(·) and A_s(·) are the channel attention and spatial attention operators, respectively. In short, the channel attention and the spatial attention are computed as

A_c(F) = σ(MLP(AP(F)) + MLP(MP(F))),
A_s(F) = σ(f_7×7([AP(F); MP(F)])),

where MLP denotes a multilayer perceptron with one hidden layer, AP is average pooling, MP is max pooling, σ denotes the sigmoid function, and f_7×7 is a convolution operator using a 7 × 7 kernel for feature fusion.
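The CBAM computation described above can be sketched in PyTorch as follows (the reduction ratio and tensor sizes are illustrative; in the paper this block sits after the fourth pooling layer):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (a sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP with one hidden layer for the channel attention.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 conv fusing the [avg; max] maps for the spatial attention.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                        # x: (B, C, H, W) = F_pool4
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))       # MLP(AP(F))
        mx = self.mlp(x.amax(dim=(2, 3)))        # MLP(MP(F))
        a_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        f1 = a_c * x                             # F_1 = A_c(F) (x) F
        avg_s = f1.mean(dim=1, keepdim=True)     # AP over channels
        max_s = f1.amax(dim=1, keepdim=True)     # MP over channels
        a_s = torch.sigmoid(self.conv(torch.cat([avg_s, max_s], dim=1)))
        return a_s * f1                          # F_2 = A_s(F_1) (x) F_1

x = torch.randn(2, 64, 14, 14)
out = CBAM(64)(x)
```

The attention maps only rescale features, so the output keeps the input shape and the module drops into the downsampling path without changing surrounding layer sizes.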

Multiscale feature fusion
As shown in Fig. 1, in the multiscale feature fusion module, a 1 × 1 convolution and a deconvolution are applied to conv3, conv4, conv5 and the output of the WA feature extraction module, followed by a skip connection of these output features. Finally, three consecutive 1 × 1 deconvolutions are applied to the fused feature maps to predict the BPDMs. The whole process of multiscale feature fusion is as follows:

M_b = [ReLU(S_d^conv3(f_1×1(F_conv3))), ReLU(S_d^conv3(f_1×1(F_conv4))), ReLU(S_d^conv3(f_1×1(F_conv5))), ReLU(S_d^conv3(f_1×1(F_WA)))]_b,

where M_b denotes the series of the four feature maps, each resized to the size of conv3 by bilinear upsampling; [·]_b is the skip connection operator; ReLU is the activation function; S_d^conv3 represents the deconvolution operation that resizes a feature map to the size of the third convolution layer; f_1×1 is a 1 × 1 convolution operator; and F_conv3, F_conv4, F_conv5 and F_WA represent the output feature maps of conv3, conv4, conv5 and the WA feature extraction module, respectively.
Then, the following operation is performed to obtain the BPDMs of all pixels in the image:

BPDM = f_3×1×1(M_b),

where f_3×1×1 denotes three consecutive 1 × 1 deconvolution operations.
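A minimal PyTorch sketch of such a fusion head is given below; the channel width, the bilinear `interpolate` resize and the single 2-channel prediction convolution are our illustrative stand-ins for the paper's exact layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_multiscale(feats, out_ch=64):
    """Project each feature map with a 1x1 conv, resize to the conv3
    resolution, concatenate (the skip connection), and predict a
    2-channel BPDM map. A stateless sketch, not a trained module."""
    target = feats[0].shape[2:]                        # conv3 spatial size
    resized = []
    for f in feats:
        g = nn.Conv2d(f.shape[1], out_ch, 1)(f)        # f_1x1
        g = F.interpolate(g, size=target, mode="bilinear",
                          align_corners=False)          # resize to conv3
        resized.append(torch.relu(g))
    fused = torch.cat(resized, dim=1)                  # [.]_b concatenation
    head = nn.Conv2d(fused.shape[1], 2, 1)             # stand-in for the
    return head(fused)                                 # three 1x1 deconvs

# Illustrative map sizes for conv3, conv4, conv5 and the WA output.
feats = [torch.randn(1, 256, 56, 56), torch.randn(1, 512, 28, 28),
         torch.randn(1, 512, 14, 14), torch.randn(1, 2048, 14, 14)]
bpdm = fuse_multiscale(feats)
```

The prediction has two channels per pixel, one for each component of DM_p.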

Adaptive loss function
Both the magnitude loss and the direction loss are considered for BPDM learning. The loss function L for BPDM learning is defined as

L = Σ_p w(p) (L_m(p) + α L_d(p)),
L_m(p) = β(p) ‖ |DM_p| − |D̂M_p| ‖_2,
L_d(p) = ‖ DM_p / |DM_p| − D̂M_p / |D̂M_p| ‖_2,

where w(p) = 1/|GT_p|^n, n > 0, is the adaptive weight of pixel p and |GT_p| is the size of the ground-truth segment containing p; the larger n is, the more importance is attached to small regions. L_m and L_d are the magnitude loss and the direction loss, respectively. DM_p and D̂M_p represent the ground-truth and the learned BPDM of p, respectively, and ‖·‖_2 is the L_2 norm. β(p) = 1/|DM_p| is used to normalize the magnitude loss. α is the hyper-parameter that trades off the direction loss against the magnitude loss and is generally set to 1.

Figure 4 shows heatmap visualizations of the learned L_2 magnitude and direction of each pixel. Super-BPD [21] mainly focuses on the boundary-to-pixel direction, and its learned magnitude for each pixel is around 1; super-BPD cannot recognize the small objects with coordinates near 100 on the x-axis. After adding the watershed module and the attention module to the super-BPD network, respectively, the results on small regions improve. Building on this, the proposed BPDM network combines the watershed and attention modules, and the adaptive loss function is used to train it to learn accurate BPDMs. The learned BPDMs retain the smoothness of each pixel's direction and magnitude and improve the prediction of fine objects to a certain extent. It can be observed that the L_2 magnitude and direction predictions of IS-BPDM are more refined on small regions.
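The adaptive loss can be sketched in PyTorch as follows (the tensor layouts, the `eps` stabilizer and the per-pixel segment-size map `seg_sizes` are our assumptions about the implementation, not the paper's code):

```python
import torch

def bpdm_loss(pred, gt, seg_sizes, n=1.0, alpha=1.0, eps=1e-6):
    """Adaptive BPDM loss sketch.
    pred, gt: (B, 2, H, W) predicted and ground-truth BPDM fields.
    seg_sizes: (B, 1, H, W), |GT_p| = area of the segment containing p."""
    w = 1.0 / seg_sizes.clamp(min=1).pow(n)        # w(p) = 1 / |GT_p|^n
    gt_mag = gt.norm(dim=1, keepdim=True)          # |DM_p|
    pred_mag = pred.norm(dim=1, keepdim=True)
    # Magnitude loss, normalized by beta(p) = 1 / |DM_p|.
    l_m = (gt_mag - pred_mag).abs() / (gt_mag + eps)
    # Direction loss between the two unit vectors.
    l_d = (gt / (gt_mag + eps) - pred / (pred_mag + eps)).norm(dim=1, keepdim=True)
    return (w * (l_m + alpha * l_d)).mean()

pred = torch.randn(1, 2, 8, 8)
gt = torch.randn(1, 2, 8, 8)
sizes = torch.full((1, 1, 8, 8), 40.0)             # all pixels in a 40-px segment
loss = bpdm_loss(pred, gt, sizes)
```

With n = 1 a pixel in a 10-pixel segment contributes 100 times the weight of a pixel in a 1000-pixel segment, which is exactly the small-region emphasis the weight is designed for.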

Initial segmentation
Inspired by the algorithm of [21], the parent image P and the root pixel set R are built according to the directions and magnitudes of the learned BPDMs, as depicted in Algorithm 1. Initially, the parent of each pixel p is set to itself and the root pixel set R is empty. Then, we calculate the included angle cos^{-1}⟨DM_p, DM_{n_p}⟩ between the directions of p and its neighbor n_p and compare it with the threshold θ_α. If the included angle is larger than θ_α, the BPDMs of the two pixels are dissimilar; in addition, if the magnitude of p lies in the interval (d_e1, d_e2), p is a root pixel and is inserted into R. Otherwise, the parent of p is updated to n_p. Because the root pixels of a region lie close to each other near the region's symmetry axis, the parent P(r) is updated to the last root pixel within the bottom half of the 3 × 3 window N_3^b centered at r. The final parent image P, which represents the initial segmentation, is obtained via the above operations.
The pixels in the image are thus combined into a forest of trees, in which each tree represents a disjoint region. As shown in Fig. 5b, c, the forest composed of these trees is the initial parent image, and each tree has its corresponding root pixel.
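A simplified sketch of this parent/root construction is given below; it keeps the direction-agreement test and the magnitude interval for roots but, for brevity, omits the 3 × 3 root clustering of Algorithm 1 (the neighbor n_p is approximated as the pixel one step along the BPDM direction):

```python
import numpy as np

def initial_segmentation(bpdm, theta=45.0, d1=2.0, d2=23.0):
    """Each pixel looks one step along its BPDM direction; if the two
    directions agree within theta degrees, that neighbor becomes the
    parent, otherwise the pixel stays its own parent. Self-parents with
    magnitude in (d1, d2) are the root pixels."""
    h, w, _ = bpdm.shape
    mag = np.linalg.norm(bpdm, axis=-1)
    unit = bpdm / np.maximum(mag, 1e-6)[..., None]
    parent = np.arange(h * w).reshape(h, w)          # P(p) = p initially
    cos_t = np.cos(np.deg2rad(theta))
    for y in range(h):
        for x in range(w):
            ny = int(np.clip(np.rint(y + unit[y, x, 0]), 0, h - 1))
            nx = int(np.clip(np.rint(x + unit[y, x, 1]), 0, w - 1))
            if unit[y, x] @ unit[ny, nx] >= cos_t:   # similar directions
                parent[y, x] = ny * w + nx           # P(p) = n_p

    def find(i):                                     # follow pointers to a root
        for _ in range(h * w):                       # guard against cycles
            j = parent.flat[i]
            if j == i:
                return i
            i = j
        return i

    labels = np.array([find(i) for i in range(h * w)]).reshape(h, w)
    roots = (parent == np.arange(h * w).reshape(h, w)) & (mag > d1) & (mag < d2)
    return labels, roots

# Toy field: two regions whose directions flip between columns 3 and 4.
bpdm = np.zeros((8, 8, 2))
for x in range(8):
    bpdm[:, x, 1] = x - 3 if x <= 3 else x - 4
labels, roots = initial_segmentation(bpdm)
```

Each tree of parent pointers collapses to one label, so `labels` is the initial parent-image segmentation of the sketch.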

Final segmentation
Similar to [21], for each pixel r ∈ R, A_r represents the area of the initial segment rooted at r. Given the thresholds α_s and α_t, the initial segments are divided into large, small and tiny regions, and a region adjacency graph G(R, E) is constructed from the initial segmentation.
The direction similarity S(e) of each edge e = (r_1, r_2) ∈ E, linking two regions R_1 and R_2, is computed as

S(e) = (1/|B(e)|) Σ_{(p_i, q_i) ∈ B(e)} ⟨DM_{P_s(p_i)}, DM_{P_s(q_i)}⟩,

where B(e) = {(p_i, q_i)}, p_i ∈ R_1, q_i ∈ R_2, is the set of pairs of boundary points between regions R_1 and R_2, |B(e)| is the number of pairs, and P_s(p) denotes the pixel reached after the sth step starting from pixel p. If S(e) is larger than the given threshold h_θ, whose value is assigned according to the areas of the adjacent regions, the two regions are merged. Through this merge operation, small crumb regions in the initial segmentation are cleaned up to obtain the final segmentation result, as shown in Fig. 5d.
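A toy version of this merge step might look as follows (only horizontal boundary pairs are collected, the step-s tracing P_s is skipped, and the fixed threshold stands in for the area-dependent h_θ):

```python
import numpy as np

def merge_regions(labels, bpdm, h_theta=0.9):
    """For each pair of adjacent regions, average the cosine similarity
    of BPDM directions over their boundary pixel pairs and union the
    pair when the mean similarity exceeds the threshold."""
    mag = np.linalg.norm(bpdm, axis=-1, keepdims=True)
    unit = bpdm / np.maximum(mag, 1e-6)
    pairs = {}                       # edge (a, b) -> list of similarities
    h, w = labels.shape
    for y in range(h):
        for x in range(w - 1):       # horizontal adjacency only (sketch)
            a, b = labels[y, x], labels[y, x + 1]
            if a != b:
                e = (min(a, b), max(a, b))
                pairs.setdefault(e, []).append(float(unit[y, x] @ unit[y, x + 1]))

    root = {l: l for l in np.unique(labels)}      # union-find forest
    def find(l):
        while root[l] != l:
            l = root[l]
        return l
    for (a, b), sims in pairs.items():
        if np.mean(sims) > h_theta:               # S(e) > h_theta: merge
            root[find(a)] = find(b)
    return np.vectorize(lambda l: find(l))(labels)

# Toy case: regions 0 and 1 share a direction, region 2 opposes them.
labels = np.zeros((4, 8), dtype=int)
labels[:, 2:4] = 1
labels[:, 4:] = 2
bpdm = np.zeros((4, 8, 2))
bpdm[:, :4, 1] = -2
bpdm[:, 4:, 1] = 2
merged = merge_regions(labels, bpdm)
```

Regions 0 and 1 merge because their boundary directions agree, while the direction flip across the 1/2 boundary keeps region 2 separate.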

Datasets
The performance of the proposed IS-BPDM is evaluated on three datasets: PASCAL Context [32], BSDS500 [33] and Cityscapes [34]. PASCAL Context provides pixel-level semantic annotations of whole images; we additionally segment some obvious objects that are labeled as background in the dataset as novel classes. A total of 7072 images are used for training and 3031 images for testing.
BSDS500 includes 200 training images, 100 validation images and 200 test images. Each image has about 5-10 ground-truth segmentations; we select the finest one for training and testing and augment the training images by rotation and flipping.
Cityscapes is a dataset of high-resolution urban street scenes, which includes 2975 training images and 500 test images, each with both a coarse label and a fine label. In our experiments, the fine labels are used for supervised learning.

Training and hyper-parameters
In the BPDM network, the FCN adopts VGG16 pretrained on ImageNet to extract basic feature maps. During training, the learning rate of the network is the same as in [21], and the ADAM optimizer [35] is used. The model is trained for 10000 epochs on each dataset.
During initial segmentation, the hyper-parameters d_e1 and d_e2 are set to 2 and 23, respectively, and the other hyper-parameters are the same as in [21].
All algorithms are trained and tested on 2× Intel Xeon Gold 6226R 16-core CPUs (2.9 GHz), 256 GB RAM and 4× NVIDIA Tesla V100S-PCIE-32GB GPUs. The BPDM network is trained in a PyTorch environment, and the final merging and segmentation are implemented with CUDA and C++.

Qualitative and quantitative evaluations
To evaluate the performance of our method, mean Intersection over Union (mIoU) [16], F-measure for boundaries (F_b) [33] and computational expense are considered. mIoU assesses the agreement between ground truth and prediction; the higher the value, the better the segmentation. Similarly, a higher F_b means better edge segmentation. For computational consumption, running times are reported in seconds.
The proposed IS-BPDM is tested and compared with state-of-the-art segmentation methods such as CascadePSP [14], MagNet [15] and super-BPD [21]; the segmentation results of IS-BPDM and super-BPD are colored by referring to the ground truth. Some qualitative comparisons are shown in Fig. 6. On the PASCAL Context and BSDS500 datasets, our IS-BPDM achieves better segmentation than MagNet, CascadePSP and super-BPD: it can segment overlapping objects (wall and painting) and small objects (books and bed) on PASCAL Context and clearly segments the dog and the big stone on BSDS500. On Cityscapes, although IS-BPDM does not segment both sides of the road as well as MagNet, it segments overlapping people and cars and delineates small objects more finely than the three other methods. Meanwhile, our IS-BPDM produces smoother edges and higher segmentation accuracy, as can be seen from the enlarged boxes (black and red) in Fig. 6. Moreover, Table 1 compares our IS-BPDM with some widely used image segmentation methods on the three datasets. On PASCAL Context and BSDS500, IS-BPDM learns accurate BPDMs and makes full use of their prior properties, achieving the highest mIoU and F_b with a good trade-off between accuracy and efficiency. On Cityscapes, whose high-resolution images have complex content, IS-BPDM ranks in the top two in both mIoU and F_b, effectively segments houses, people, cars and so on, and remains highly efficient.
It can be concluded that traditional segmentation methods rely on human intervention, while neural network approaches outperform them owing to large numbers of training samples and the strong fitting ability of the networks. Our IS-BPDM uses a neural network to learn accurate and robust BPDMs and then adopts a traditional segmentation method that exploits the prior properties of the BPDMs to finish the final segmentation. This combined model achieves reasonable segmentation with high efficiency.

Module ablation
We study the effect of adding the watershed and attention modules to the network on PASCAL Context. As stated in Table 2, adding either the watershed or the attention module alone already improves segmentation performance, and a better result is obtained by adding both.

Setting of the n value of the adaptive weight
We study the effect of the value of n in the adaptive weight w(p) of Eq. (8) during model training. As shown in Fig. 7, as n increases, training becomes more sensitive to small regions. However, when n is greater than 1, too much attention is paid to small regions, resulting in over-segmentation. Setting n to 1 achieves the best segmentation performance.

Direction and magnitude characteristics
The effects of the direction and magnitude characteristics on the BPDM-based initial segmentation are reported in Table 3. It can be seen that considering both direction and magnitude achieves better segmentation results.

Conclusion and future work
Image segmentation requires high accuracy on edges and small regions with low time consumption. The proposed IS-BPDM exploits the priors that pixels near boundary pixels in different regions have opposite directions and shorter magnitudes, while nearby pixels on both sides of a root pixel in the same region have opposite directions and longer magnitudes. Constructing a BPDM network embedded with watershed and attention modules, and training it with an adaptive loss function, effectively improves the accuracy and robustness of BPDMs on small areas and weak edges. The initial segmented regions are then produced according to the direction similarity and magnitude of these BPDMs and finally merged into the final segmentation based on the RAG. Experiments show that the proposed IS-BPDM achieves reasonable and accurate performance for small object segmentation. Although IS-BPDM is validated to deliver pleasing segmentation accuracy and efficiency, it does not yet realize semantic segmentation. In future work, we would like to consider end-to-end semantic segmentation guided by BPDMs.