MSAA-Net: a multi-scale attention-aware U-Net for liver segmentation

Automatic segmentation of the liver from CT images is a challenging task because the shape of the liver in the abdominal cavity varies from person to person and the organ often fits closely against its neighbors. In recent years, with the continuous development of deep learning and the introduction of CNNs, neural network-based segmentation models have shown good performance in the field of image segmentation. Among the many network models, U-Net stands out in medical image segmentation tasks. In this paper, we propose MSAA-Net, a segmentation network combining multi-scale features with an improved attention-aware U-Net. We extract features of different scales on a single feature layer and perform attention perception in the channel dimension. We demonstrate that this architecture improves the performance of U-Net while significantly reducing computational cost. To address the difficulty of optimizing U-Net's skip connections for merging objects of different sizes, we design a multi-scale attention gate structure (MAG), which allows the model to automatically learn to focus on targets of different sizes. In addition, MAG can be extended to all structures that contain skip connections, such as U-Net and FCN variants. Our structure was extensively evaluated on the 3Dircadb dataset; the Dice similarity coefficient of the method on the liver segmentation task was 94.42%, with far fewer model parameters than other attention models. The experimental results show that MSAA-Net achieves very competitive performance in liver segmentation.

Introduction
The incidence of liver cancer is increasing worldwide [1]. Computed tomography (CT) is one of the main tools currently used to detect liver cancer. Segmenting the liver from high-resolution CT images of the abdomen is the first step in treating liver cancer, but this task depends heavily on the experience of the physician and is labor-intensive. With the continuous development of computer technology, automatic detection and segmentation of the liver has become an important research direction in medical image segmentation. Different structures in CT images exhibit uneven and mutually similar gray levels, which makes accurately segmenting the liver from images in which multiple organs adjoin one another all the more challenging.
In recent years, with the development and application of deep learning, more and more researchers have focused on fully automatic segmentation methods based on deep learning. In 2015, Ronneberger et al. [3] proposed U-Net, one of the most classical innovations in medical image segmentation: a U-shaped network with a unique connection structure, built on the FCN, that allows training on smaller datasets. Over time, U-Net architectures have been popularized and improved, and U-Net variants for different medical tasks have been proposed, but these architectures process single 2D medical images and cannot exploit the 3D spatial information of the segmented parts. To segment 3D liver images, Cicek et al. [4] proposed the 3D U-Net, extending the U-Net architecture by replacing all 2D operations with their 3D counterparts. They also labeled only some of the slices in the volume to be segmented to train the model and used it for segmentation of the whole volume. In some segmentation tasks, researchers want network models to learn to focus on the regions of interest and ignore irrelevant regions. In 2018, Oktay et al. [7] added an attention mechanism to U-Net, successfully improving the accuracy of a pancreas segmentation task. In the same year, Hu et al. [8] proposed SE-Net, another attention mechanism. SE-Net assigns different weights to each channel through "Squeeze-and-Excitation," and this channel attention mechanism further improves segmentation accuracy. It is worth mentioning that Vaswani et al. [9] proposed a network based entirely on attention mechanisms in 2017 and successfully applied it to NLP. After this, Dosovitskiy et al. [10] brought the Transformer to the CV domain by designing the Vision Transformer (ViT) and applying it to image classification tasks.
ViT discards the convolution operation, feeding image patches directly to the encoder, and ultimately achieves accuracy similar to that of CNNs. In 2021, Chen et al. [11] proposed TransUnet, which combines the Transformer and U-Net into a powerful encoder for medical image segmentation. They added a Transformer to the last layer of the U-Net encoder, improving its encoding capability, and TransUnet ultimately achieved great success on multi-organ and cardiac segmentation tasks. In 2021, Liu et al. [12] proposed the Swin Transformer, a backbone architecture based entirely on the Transformer, which successfully extended ViT to tasks such as semantic segmentation and object detection with good results. Recently, Swin-Unet [13] was designed as the first purely Transformer-based U-shaped encoder-decoder, showing that a U-Net detached from CNNs can still achieve good segmentation performance. Nevertheless, the self-attention mechanism still has much room for development, and traditional U-Net optimization schemes should not be neglected.
Skip connections allow the combination of low-level semantic information from the encoding path with high-level semantics from the decoding path, and many models achieve multi-scale feature fusion in this way. Researchers often use multi-level CNNs to cascade the extracted features when the task region varies widely in shape and size. Although these encoder-decoder variants deliver state-of-the-art performance, this approach increases computation and introduces redundant model parameters. For example, most U-Net variants based on a single feature layer repeatedly extract semantic information at similar scales, and as the network deepens, the model becomes harder to train. To address these problems, we introduce the Res2Net bottleneck [14] to enhance the multi-scale capability of U-Net and reduce the number of parameters. The SE module [8] is used to enhance the network's learning capability. In addition, we design a multi-scale attention gate structure (MAG) to increase the weight of target regions while suppressing background regions.
The contributions of this study are as follows:
• We propose MSAA-Net, a U-Net-based network that combines the attention mechanism with finer-grained multi-scale feature fusion.
• We design an attention gate (MAG) combining multi-scale feature fusion, a spatial attention mechanism, and a channel attention mechanism to optimize the skip connections. MAG increases the weight of target regions while suppressing irrelevant background regions, and can in principle be used in other structures with skip connections.
• We apply the SE-block to all modules, which improves the learning capability of the network.
• We conducted extensive comparison experiments on the 3Dircadb dataset; the results show that MSAA-Net achieves very competitive performance on liver parenchyma segmentation with far fewer parameters than comparable attention networks.

Related work
With the continuous updating and development of hardware technology, deep learning algorithms have flooded into the medical imaging field, and many excellent deep learning-based algorithms have been proposed and continuously improved. In the past few years, deep learning algorithms based on convolutional neural networks (CNNs) have been widely used in image processing due to their powerful nonlinear feature extraction and data processing capabilities. Simonyan and Zisserman [16] explored the relationship between the depth and performance of convolutional neural networks and constructed 16- to 19-layer convolutional networks (VGGNet) by repeatedly stacking small convolutional kernels and max-pooling operations. VGGNet is a landmark innovation in image processing and is still used by many advanced models to extract image features. Long et al. [18] defined a skip-structured network (FCN) that combines deep and shallow semantic information to produce accurate and detailed segmentations. This structure implements end-to-end segmentation: an input image of a given size yields a segmentation map of the same size. Ronneberger et al. [3] proposed the FCN-based U-Net architecture, which maps encoder to decoder via skip connections between feature maps of the same size. This network structure is the most classical in the field of biomedical imaging, and many excellent models use it as a backbone. Milletari et al. [19] proposed V-Net by extending the U-Net structure to 3D volumes, using random nonlinear deformations and histogram matching to augment the data. Their experiments demonstrate the good performance of V-Net in medical image processing while greatly reducing computing time. Zhou et al. [21] proposed UNet++, a model that ensembles U-Nets of different depths.
UNet++ can learn the best depth for the current task through U-Net networks of different depths. They improved the skip connections by interconnecting multiple feature layers at the same scale, which coincides with the dense connectivity structure. Kushnure et al. [23] proposed a multi-scale approach to enlarge the receptive field of convolutional neural networks, combining channel attention and the Res2 block to design MS-UNet, which at the time achieved the best liver segmentation accuracy on the 3Dircadb dataset. Schlemper et al. [24] proposed R50 Att-Unet, which automatically learns to focus on targets of different shapes and sizes, replacing the traditional encoder of Att-Unet with the encoder of ResNet50. This approach greatly deepens the network to extract more semantic information, while the unique residual structure does not increase the number of parameters. Compared with spatial and channel attention mechanisms, self-attention possesses powerful encoding capabilities. Chen et al. [11] proposed TransUnet by applying the Transformer to the encoding stage of U-Net; the powerful encoder enables TransUnet to achieve good performance in multi-organ segmentation tasks. Swin-Unet [13], with the same U-shaped structure, discards the CNN in favor of a purely Transformer-based structure and achieves better performance in multi-organ segmentation than fully convolutional methods or hybrids of Transformer and convolution. However, fully Transformer-based models give up the ability to capture local features and also require a computation-intensive pre-training process. Although attention mechanisms have been widely used in medical image processing, few models combine different kinds of attention mechanisms. U-Net uses skip connections to fuse multi-scale information, but this structure has the disadvantage of generating semantic conflicts.
In addition, researchers have designed models with progressively more complex structures and larger numbers of parameters in order to pursue segmentation accuracy, while work on optimizing and streamlining network models is less common.

Methodology
This section describes the architecture of MSAA-Net in detail.

MSAA-Net
The U-shaped architecture has been widely used in medical image processing tasks, and its skip-connected structure, which incorporates shallow information and high-level features, has good stability. The structure of MSAA-Net is shown in Fig. 1. MSAA-Net can be divided into an encoding stage and a decoding stage. In the encoder, MSAA-Net uses five feature layers to extract image feature information, but unlike U-Net, where two 3×3 convolutions are repeatedly applied in each layer, MSAA-Net uses the bottleneck structure of Res2Net and the SE module to extract the semantic information of each layer, as shown in Fig. 2. When one layer of feature extraction is completed, MSAA-Net uses max-pooling with a stride of 2 to compress the feature map and a 1 × 1 convolution to increase the number of channels, and the next layer continues to extract semantic information in the same way. Similar to the encoder, the decoder uses the same Res2Net+SE feature extraction approach, but MSAA-Net optimizes the skip connections with multi-scale attention gates. MAG can effectively mitigate semantic conflicts in the skip connections, allowing the network to focus on regions of interest and suppress irrelevant background regions. MSAA-Net takes a 512×512 three-channel image as input and outputs a segmented image of the same size. With the SE module and MAG, MSAA-Net achieves more accurate segmentation.
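The transition between encoder layers described above (stride-2 max-pooling that halves the spatial size, then a 1×1 convolution that increases the channel count) can be sketched in NumPy. This is an illustrative simplification, not the authors' implementation; the weights are random stand-ins for learned parameters:

```python
import numpy as np

def max_pool2x2(x):
    # x: (C, H, W); 2x2 max-pooling with stride 2 halves H and W
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def conv1x1(x, w):
    # a 1x1 convolution is a per-pixel linear mix of channels; w: (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

x = np.random.rand(64, 512, 512)      # layer-1 feature map (64 channels)
x = max_pool2x2(x)                    # spatial size 512 -> 256
w = np.random.randn(128, 64) * 0.01   # random stand-in for learned weights
x = conv1x1(x, w)                     # channels 64 -> 128
print(x.shape)
```

Repeating this transition four times yields the five encoder feature layers of the network.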

Res2Net
The bottleneck structure underlies many advanced network models [25]. Res2Net improves on the bottleneck module by replacing one set of 3×3 filters with several smaller groups of 3×3 filters, connected in a manner similar to the residual learning framework. This allows Res2Net to retain functionality similar to the bottleneck module while gaining enhanced multi-scale feature fusion capabilities.
After the 1×1 convolution operation, the Res2Net module splits the obtained feature map evenly along the channel dimension into subsets x_i, i ∈ {1, 2, ..., s}, where s denotes the number of groups. Each x_i has the same size and the same number of channels. Every x_i except x_1 passes through its own 3×3 convolution, where K_i denotes the i-th group of convolution operations and y_i the output corresponding to x_i; y_i (i > 2) is calculated as K_i(x_i + y_{i−1}). The outputs y_i and the final output out can be expressed as:

$$
y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}
\qquad
out = \mathrm{Concat}(y_1, y_2, \ldots, y_s)
$$

Here Concat represents the cumulative concatenation of all groups along the channel dimension.

Fig. 1 The proposed MSAA-Net for fine segmentation of the liver. The input is a 512 × 512 three-channel image; the number of channels is converted to 64 by a 3 × 3 convolution and passed to the Res2Net+SE module, and the final segmentation map is produced by the encoder and decoder. Note that all feature extraction is done by the Res2Net+SE module, and the skip connections are optimized using MAG.
Note that the feature map processed by K_{i+1}(·) combines K_i(x_i) (through y_i) with x_{i+1}, so it has a larger receptive field, and the output out contains scale information of different sizes. This split-then-fuse multi-scale process helps extract both global and local information.
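The split-then-fuse computation can be sketched in NumPy as follows. This is an illustrative simplification: identity functions stand in for the learned 3×3 convolutions K_i, and s = 4 groups are assumed as in the Res2Net paper's default:

```python
import numpy as np

def res2net_fuse(x, kernels, s=4):
    # x: (C, H, W) feature map after the 1x1 conv; split into s channel groups
    xs = np.split(x, s, axis=0)
    ys = [xs[0]]                      # y1 = x1 (passes through unchanged)
    ys.append(kernels[0](xs[1]))      # y2 = K2(x2)
    for i in range(2, s):             # yi = Ki(xi + y_{i-1}) for i > 2
        ys.append(kernels[i - 1](xs[i] + ys[-1]))
    # out = Concat(y1, ..., ys) along the channel dimension
    return np.concatenate(ys, axis=0)

# identity mappings stand in for the learned 3x3 convolutions K2..K4
identity = lambda t: t
out = res2net_fuse(np.random.rand(64, 32, 32), [identity] * 3, s=4)
print(out.shape)
```

Because each y_i feeds into the next group's convolution, later groups effectively see progressively larger receptive fields, which is the source of the multi-scale behavior.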

Squeeze-and-excitation blocks
Squeeze-and-excitation (SE) blocks were proposed by Hu et al. [8] and are referred to as channel attention mechanisms in some papers [26-28]. The SE module recalibrates channel responses by explicitly modeling the interdependencies between channels. In brief, the SE module learns to update the importance of each channel, strengthening the weights of channels that matter to the task and suppressing channels that are less relevant. We introduce SE blocks into all modules, which costs a small amount of computation to improve segmentation accuracy. In addition, we integrate the SE block with the spatial attention mechanism to produce a more efficient gating device for optimizing skip connections. Figure 2 shows the structure of the Res2Net and SE block, which is inspired by MS-UNet [23].

Fig. 2 Details of layer 3 of the MSAA-Net encoder. After two feature extractions, the size of the feature map is 128×128 and the number of channels is 256. The Res2Net module obtains multi-scale information, and the SE module then recalibrates the channels, outputting a feature map of the same dimensions.
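A minimal NumPy sketch of the squeeze-and-excitation recalibration follows; random matrices stand in for the learned fully connected layers, and the reduction ratio r = 16 is an assumption (the default in the SE-Net paper), not a value stated here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    # x: (C, H, W). Squeeze: global average pool to one value per channel.
    z = x.mean(axis=(1, 2))                  # (C,)
    # Excitation: two FC layers (bottleneck of ratio r) + sigmoid gate.
    a = sigmoid(w2 @ np.maximum(w1 @ z, 0))  # (C,) channel weights in (0, 1)
    # Recalibration: scale each channel by its learned importance weight.
    return x * a[:, None, None]

C, r = 64, 16
w1 = np.random.randn(C // r, C) * 0.1       # squeeze FC (C -> C/r)
w2 = np.random.randn(C, C // r) * 0.1       # excitation FC (C/r -> C)
y = se_block(np.random.rand(C, 32, 32), w1, w2)
print(y.shape)
```

The output keeps the input's shape; only the relative magnitudes of the channels change.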

MAG
We designed the MAG by combining the channel attention mechanism and the spatial attention mechanism. The proposed structure is shown in Fig. 3. MAG processes the feature map x_1 used for the skip connection to obtain the optimized feature map x_output2; x_output2 is concatenated along the channel dimension with the decoder feature map and then reduced by a 1×1 convolution into the Res2Net+SE module to complete multi-scale information fusion [29-31]. Since only the features are rescaled, MAG spends a small amount of computation to significantly optimize the skip connections.

Fig. 3 Schematic diagram of the proposed MAG. MAG receives input from the encoder part and the decoder part. x_1 is the low-level feature map used to complete the connection, and x_2 carries higher-level semantic information. The attention coefficients α_1 are generated from the contextual information of x_2, the signal collected from the larger receptive field, and α_1 performs the first rescaling of x_1 in the spatial dimension to obtain x_output1. x_1 performs a squeeze-and-excitation operation on its own channels to obtain the attention factor α_2, and α_2 performs a second rescaling of x_output1 in the channel dimension to obtain x_output2, the output of MAG.

Fig. 4 The mathematical process of MAG, which omits the feature map sizes and BN, etc. W_i denotes the convolution operation and W_fci the fully connected operation. F_pl, F_up, and F_sum represent pooling, upsampling, and pixel summation operations, respectively.
A feature map x_1 ∈ ℝ^{H×W×C} from an intermediate layer and a feature map x_2 containing higher-level semantic information are given as inputs. The resolution of the high-level features x_2 is smaller, so x_2 must be up-sampled before subsequent operations. After up-sampling, x_2 ∈ ℝ^{H×W×2C} has the same spatial size as x_1; the channel counts of x_1 and x_2 are unified by 1 × 1 convolutions, and the results are summed and batch normalized (BN) to give g_1 ∈ ℝ^{H×W×F_int}. Thus g_1 enriches x_1 with semantic information collected at a coarser scale. After g_1 is activated by the nonlinear activation function (ReLU) and its channels are compressed by a 1×1 convolution, we obtain σ_1 ∈ ℝ^{H×W×1}. σ_1 is activated by Sigmoid to produce the 2D spatial attention coefficient α_1. In essence, α_1 is a weight matrix with the same spatial size as x_1; therefore, x_1 ⊙ α_1 realizes spatial attention and outputs x_output1.
Unlike the structure of Su et al. [28], the channel attention in MAG is generated by squeezing and exciting x_1 itself, which is a more concise structure. Since x_output1 already carries the optimized spatial weights of x_1, optimizing the channels on top of x_output1 only requires the squeeze and excitation of x_1. As shown in Fig. 3, global average pooling is applied to x_1 to obtain x̄_1 ∈ ℝ^{1×1×C}, which passes through two fully connected layers for auto-calibration to obtain σ_2 ∈ ℝ^{1×1×C}. σ_2 is activated by Sigmoid to obtain the 1D channel attention factor α_2. In essence, α_2 is a weight vector whose length equals the number of channels of x_1; x_output1 ⊙ α_2 realizes channel attention and outputs x_output2. Figure 4 shows the mathematical process of MAG, where W denotes the convolution kernel weights.
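The two-stage rescaling of MAG can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes x_2 has already been up-sampled and channel-matched to x_1 (the 1×1 convolutions and BN are folded away), and all weights are random stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mag(x1, x2, wg, wfc1, wfc2):
    # x1: low-level encoder map (C, H, W); x2: high-level map, same shape here.
    # Spatial gate: fuse x1 and x2, ReLU, collapse channels, Sigmoid -> alpha1.
    g = np.maximum(x1 + x2, 0)
    alpha1 = sigmoid(np.einsum('c,chw->hw', wg, g))   # (H, W) spatial weights
    x_out1 = x1 * alpha1[None]                        # first (spatial) rescaling
    # Channel gate: squeeze-and-excitation on x1 itself -> alpha2.
    z = x1.mean(axis=(1, 2))                          # global average pool
    alpha2 = sigmoid(wfc2 @ np.maximum(wfc1 @ z, 0))  # (C,) channel weights
    return x_out1 * alpha2[:, None, None]             # second (channel) rescaling

C = 32
out = mag(np.random.rand(C, 16, 16), np.random.rand(C, 16, 16),
          np.random.randn(C) * 0.1,
          np.random.randn(C // 4, C) * 0.1,
          np.random.randn(C, C // 4) * 0.1)
print(out.shape)
```

Both gates only scale the features elementwise, which is why MAG adds little computation relative to the convolutions it sits between.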

Dataset and data preprocessing
We use the publicly available 3D Image Reconstruction for Comparison of Algorithm Database (3Dircadb) [33] for training and testing of the model. The 3Dircadb database consists of CT scans of ten female and ten male patients with liver tumors. The number of CT slices per patient ranges from 74 to 260, stored in DICOM format. The database is divided into 20 folders, each containing manually annotated labels for the structures of interest. We preprocessed all images of the entire dataset. First, the DICOM images were converted to 512×512 PNG images to be used as network input. Second, we windowed all images, using a window of width 400 HU and level 50 HU to highlight the task area and keep the segmented region clean. Finally, we applied histogram equalization to all 2803 original and labeled images, which alleviates the overall darkness of the dataset.
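The HU windowing step can be illustrated as follows. This is a sketch under the assumption that the (400, 50) pair means window width 400 and level 50, and that the clipped range is rescaled to 8-bit for PNG export:

```python
import numpy as np

def window_hu(img_hu, width=400, level=50):
    # Clip CT intensities to the window [level - width/2, level + width/2]
    lo, hi = level - width / 2, level + width / 2
    img = np.clip(img_hu, lo, hi)
    # Rescale the window to 0-255 so the slice can be saved as 8-bit PNG
    return ((img - lo) / (hi - lo) * 255).astype(np.uint8)

slice_hu = np.array([[-1000, 0], [50, 300]], dtype=float)  # toy HU values
print(window_hu(slice_hu))
```

Everything below -150 HU (air, lung) saturates to black and everything above 250 HU (bone, contrast) to white, concentrating the 8-bit dynamic range on soft tissue such as the liver.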

Loss function
We use a loss function consisting of a weighted cross-entropy loss and a Dice loss [19,34] to calculate the gradient. The cross-entropy loss function has been widely used in classification tasks and has its unique advantages. The liver parenchyma segmentation task is a high-precision binary classification problem; therefore, the cross-entropy loss function is the natural first choice. It is defined as follows:

$$L_{CE} = -\big(y \log p + (1 - y)\log(1 - p)\big)$$

where y denotes the ground-truth value of an image pixel and p denotes the predicted label value.
In the liver parenchyma segmentation task, the background is often much larger than the segmented region, so we add a Dice loss and weight the cross-entropy loss. The Dice loss function is defined as follows:

$$L_{Dice} = 1 - \frac{2\sum_i p_i g_i + \phi}{\sum_i p_i + \sum_i g_i + \phi}$$

where p and g denote the predicted binary segmentation volume and the binary volume of the ground-truth label, respectively, and ϕ takes the value 1e−5. The final loss function is defined as follows:

$$L = w_1 L_{CE} + w_2 L_{Dice}$$

where w_1, w_2 (0 < w_i ≤ 1) are weight coefficients used to balance the cross-entropy and Dice terms. In our segmentation task, the best performance is obtained with w_1 = 0.5, w_2 = 0.5.
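The combined loss can be sketched on a toy prediction as follows; the clipping constant eps is an implementation detail added here for numerical stability, not a value from the paper:

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    # pixel-wise binary cross-entropy, averaged over pixels
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice_loss(p, g, phi=1e-5):
    # soft Dice loss with smoothing term phi
    return 1 - (2 * np.sum(p * g) + phi) / (np.sum(p) + np.sum(g) + phi)

def total_loss(p, y, w1=0.5, w2=0.5):
    return w1 * bce_loss(p, y) + w2 * dice_loss(p, y)

pred = np.array([0.9, 0.8, 0.1, 0.2])  # toy sigmoid outputs
gt   = np.array([1.0, 1.0, 0.0, 0.0])  # toy ground truth
print(total_loss(pred, gt))
```

Because the Dice term is computed over the whole foreground overlap rather than per pixel, it counteracts the class imbalance that would dominate a plain cross-entropy loss.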

Evaluation measures
Mean pixel accuracy (MPA), mean intersection over union (MIoU), and the Dice coefficient (DC) are used as evaluation metrics. These metrics are calculated as follows:

$$MPA = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}$$

$$IoU = \frac{TP}{TP + FP + FN} \quad (8) \qquad MIoU = \frac{1}{N}\sum_{i=1}^{N} IoU_i$$

$$DC = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP (true positive) indicates the number of foreground pixels correctly classified as foreground (liver), TN (true negative) the number of background pixels correctly classified as background (non-liver region), FP (false positive) the number of background pixels incorrectly identified as foreground, and FN (false negative) the number of foreground pixels incorrectly classified as background. N is the total number of classes; in this article N = 2, and the sums run over the per-class scores of all classes. MPA is the average, over classes, of the ratio of correctly classified pixels to the total number of pixels of that class, and indicates the segmentation accuracy of the task. MIoU reflects the average overlap between the prediction and the ground truth at matching positions, and any inaccurate segmentation reduces the MIoU score. Dice measures the similarity between the predicted and true results and is used to evaluate model performance.
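These metrics can be computed from a binary prediction and ground-truth mask as follows, a sketch for the two-class (liver vs. background) case used in this paper:

```python
import numpy as np

def confusion(pred, gt):
    # binary masks -> TP, FP, FN, TN pixel counts
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    return tp, fp, fn, tn

def dice(pred, gt):
    tp, fp, fn, _ = confusion(pred, gt)
    return 2 * tp / (2 * tp + fp + fn)

def miou(pred, gt):
    tp, fp, fn, tn = confusion(pred, gt)
    iou_fg = tp / (tp + fp + fn)          # IoU of the liver class
    iou_bg = tn / (tn + fp + fn)          # IoU of the background class
    return (iou_fg + iou_bg) / 2          # mean over N = 2 classes

def mpa(pred, gt):
    tp, fp, fn, tn = confusion(pred, gt)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
print(dice(pred, gt), miou(pred, gt), mpa(pred, gt))
```

In a real evaluation these counts would be accumulated over all test slices before the ratios are taken, rather than averaged per slice.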

Experimental results and analysis
In this section, we calculate the average values of Dice, IoU, and PA, respectively. These metrics are commonly used for medical image segmentation. In addition, we record the number of parameters of each model and the convergence speed of the network, which helps us evaluate the models more comprehensively. The backbone architecture of our model is based on U-Net, which is therefore used as the benchmark for evaluation. We gradually add different modules to U-Net, optimize them, and select the most efficient structure for comparative analysis; the segmentation results are shown in Table 1. The training environment is the same for all models. As can be seen from Table 1, the Res2 block greatly reduces the number of model parameters. After adding the Attention Gate and SE-block, Res2-block and U-Net obtain significant performance improvements. Combined with the experimental results of Attention U-Net, we attribute this improvement to the SE-block. Finally, we compare the segmentation results of MSAA-Net with other networks, and MSAA-Net achieves very good segmentation accuracy. Clearly, these performance improvements are due to the improved multi-scale attention gate structure (MAG). The Res2-block structure in MSAA-Net helps greatly reduce the number of parameters, which is only 15.46% of that of TransUnet and 51.09% of that of R50 Att-Unet.
To evaluate the training speed of MSAA-Net, we record the loss value every 10 epochs for the models mentioned above. The loss curves are shown in Fig. 5. Compared with the other models, the loss of MSAA-Net still decreases after 60 epochs. This suggests that the Res2-block alleviates gradient vanishing, so MSAA-Net can be trained for more iterations.

Comparison with other methods
In this section, we compare MSAA-Net with other classical semantic segmentation architectures, including U-Net [3], UNet++ [21], ATT-UNet [7], Ms-UNet [23], TransUnet [11], and R50 Att-Unet [24]. The source code of Ms-UNet is not available, but its structure is similar to that of MSAA-Net and can serve as a reference. Since our input is a 512 × 512 CT image from 3Dircadb, the models have more parameters and better segmentation than they would at lower resolution, but this does not affect the comparison between models. As shown in Table 2, MSAA-Net obtains segmentation accuracy closest to that of TransUnet, with far fewer parameters than the other models. Figure 6 compares the liver segmentation results of MSAA-Net with other models; it can be seen visually that the results of MSAA-Net are better than those of the previous models.

Fig. 6 Comparison of segmentation results of MSAA-Net with other networks. We overlay the segmentation results of MSAA-Net on the original image to make the segmentation effect more intuitive.

Discussion
Liver image segmentation is a typical medical image segmentation task, and this work can serve as preprocessing for treating liver diseases. Adjacent organs, blood vessels, and tumors can affect the segmentation accuracy of the liver. In this paper, we introduce some advanced structures into the U-Net [3] network to improve the segmentation quality and reduce the model parameters. In addition, we design a multi-scale attention gate structure to optimize the skip connections and alleviate the semantic gaps between encoder and decoder. With the new attention gate (MAG) structure and some optimization strategies, we improve the Dice scores of liver image segmentation. MAG combines the spatial attention mechanism and the channel attention mechanism and weights the encoder feature maps separately in different dimensions, so that task-relevant regions and channels receive increased weights and are more easily learned and updated. To validate our work, we performed ablation experiments on the 3Dircadb dataset, continuously adding to and modifying the structure of U-Net and analyzing the role of different structures for the task. For clearer evaluation results, we used 1/3 of the total dataset as the test set, which results in lower scores on the evaluation metrics but does not affect the outcome of the comparison experiments. Ultimately, we conclude that the Res2-block can replace the traditional convolutional block while greatly reducing the model parameters, that the SE-block optimizes all encoder and decoder feature maps, and that MAG optimizes the skip connections (and may be even more effective in segmentation tasks with small target regions). The evaluation results are shown in Tables 1 and 2: MSAA-Net achieves a Dice score of 94.4%, close to that of TransUnet, with only about half the parameters of U-Net.

Conclusion
In this paper, we propose MSAA-Net, an improved U-Net model for automatic segmentation of the liver. We replace the original feature extraction module with a combination of Res2-block and SE-block and improve the skip connections with the proposed MAG. MSAA-Net obtains very competitive performance on the 3Dircadb dataset. We also demonstrate the extensibility of MAG: the combination of spatial attention and channel attention is applicable to other models as well.