3D U-Net With Attention and Focal Loss for Coronary Tree Segmentation

Abstract. The semantic segmentation of coronary arteries is important in the clinical diagnosis and treatment of coronary artery disease (CAD). The problems of intra-class inconsistency and inter-class indistinction between coronary arteries and veins are very prominent. In this paper, we propose an improvement of the 3D U-Net model using an attention mechanism and focal loss. U-Net is combined with channel attention and spatial attention to distinguish confusing categories and targets with similar appearance features. We use focal loss to optimize the loss function and address the category imbalance between the coronary artery and background classes.

The skip connection operation transfers features extracted by the down-sampling layers directly to the up-sampling layers, which makes pixel localization more accurate and segmentation more precise.
In the task of coronary artery tree segmentation, we focus on the characteristics of the coronary tree: its fine lumen and rich topology. However, in actual coronary tree segmentation work, it is difficult for the 3D U-Net [3] model to distinguish between confusing categories and targets with similar appearance features. For example, coronary arteries and veins have very similar visual features in topology, shape and brightness. In response to this problem, the dual attention model [4] captures rich contextual dependencies based on a self-attention mechanism, which can effectively address the problems of intra-class inconsistency and inter-class indistinction in segmentation tasks.
In this paper, we propose an improvement plan for the 3D U-Net model. First, we incorporate channel attention, spatial attention, and a combination of the two to exploit the relationships between coronary tree targets. Second, we optimize the loss calculation during training.
According to the main remarks highlighted above, we propose an improved 3D U-Net model to optimize the segmentation of the coronary artery tree. In particular, the main contributions of our work are:
• The improved attention models enrich the details of coronary tree segmentation.
• The focal loss applied to the 3D U-Net model pays more attention to the foreground category so that it contributes more to the gradient calculation.
• Proper isotropic spacing parameters improve the results of coronary artery tree segmentation.
• Better segmentation performance has been achieved by the proposed model.
The rest of this paper is organized as follows. The second section briefly introduces related work on image segmentation. The third section details the architecture of our proposed model. The fourth section presents the experimental settings and results. Finally, the conclusion and future work are given in Section 5.
Besides, shape priors can also be incorporated into convolutional neural networks [9][10][11]. For instance, Lee et al. [9] explicitly enforced a roughly tubular shape prior for the coronary segments by introducing a template transformer network, through which a shape template can be deformed via network-based registration to produce an accurate segmentation of the input image, as well as to guarantee topological constraints. Lee et al. [9] showed that this method significantly outperformed a baseline network that used only fully-connected layers on healthy subjects (mean Dice score: 0.75 vs. 0.67).

U-Net with attention
Oktay et al. [12] proposed an attention gating (AG) model for medical imaging, which learns to focus on target structures of varying shapes and sizes while suppressing irrelevant regions. Xu et al. [13] proposed the first visual attention model in image captioning. Usually, such models use "hard" pooling to select the most likely attentive region, or "soft" pooling to average spatial features with attentive weights. For VQA, Zhu et al. [14] used "soft" attention to merge image region features. To further improve spatial attention, Yang et al. [15] applied a stacked spatial attention model, where the second attention is based on the attentive feature map modulated by the first one. Different from these, in Chen et al. [16] multi-layer attention is applied on multiple layers of a CNN. A common defect of the above spatial models is that they generally resort to weighted pooling on the attentive feature map.
Thus, spatial information will inevitably be lost [16]. More seriously, their attention is only applied in the last conv-layer, where the receptive field is quite large and the differences between receptive field regions are quite limited, resulting in insignificant spatial attention [16].
Tolooshams et al. [17] proposed an end-to-end neural architecture for multichannel speech enhancement, called Channel-Attention Dense U-Net. The distinguishing feature of the proposed framework is a channel attention (CA) mechanism inspired by beamforming. CA is motivated by the self-attention mechanism, which captures global dependencies within the data. Their work incorporates CA into a U-Net to guide the network to decide, at every layer, which feature maps to pay the most attention to.

Vessel segmentation based on U-Net
Chen et al. [18] proposed to incorporate the vesselness map into the input of the 3D U-Net, which serves as reinforced information to highlight the tubular structure of coronary arteries. Livne et al. [19] proposed the half U-Net, where the number of channels in each layer was consistently reduced to half. The half U-Net was fed with cerebrovascular 2D image patches and returned a 2D segmentation probability map for each given patch.

CNN optimization with loss function
Lin et al. [21] proposed focal loss for the problem of sample imbalance, mainly to address the severe imbalance between positive and negative samples in one-stage object detection.
This loss function reduces the weight of the large number of easy negative samples during training, and can also be understood as a form of hard example mining. State-of-the-art accuracy and running time were achieved on the challenging COCO dataset.

Method
The 3D U-Net [3] model is one of the most widely used approaches in medical image segmentation. It has a typical encoder-decoder structure, as shown in Figure 1.
Compared with ASPP [22], PSPNet [23], LargeKernel [24] and other models that integrate multi-scale contextual information, the encoder-decoder structure of 3D U-Net better integrates the semantic information of the low, middle and high levels. Our 3D U-Net focuses on the coronary artery tree segmentation task and aims to capture the characteristics of the coronary tree: its fine lumen and rich topology. However, in actual coronary artery tree segmentation, it is challenging for the 3D U-Net model to distinguish confusing categories and targets with similar appearance features. For example, coronary arteries and veins are similar in topology, shape and brightness. It is necessary to strengthen the feature representation of the 3D U-Net model against intra-class inconsistency and inter-class indistinction.
Therefore, we propose a U-Net combined with an attention mechanism to exploit the relationships between coronary tree targets.

U-Net with channel attention
The channel attention module aims to capture the interdependence between channels: it adaptively recalibrates channel-wise feature responses by explicitly modeling the interdependencies between channels.

Fig.2. Channel attention module architecture
The structure of channel attention module is illustrated in Figure 2. To make the description simpler, we use 2D CNN to illustrate the process, and 3D cases can be extended accordingly. The input A is a collection of 2D feature maps with height H, width W and channel number C. Each 2D feature map of A is flattened to a vector of length N=H×W, thus a matrix of size C×N is formed.
Multiplying this matrix by its transpose and normalizing the result with a softmax, one obtains the map X (C×C), which represents the pair-wise interdependency of all C channels.
After multiplying with X, the output is further scaled by a factor β and then added to the original feature map A to obtain the final output E. Multiplying the feature maps by this attention implies selectively integrating highly dependent channels, and hence improves semantic feature expression. Thus, the semantic dependence between channels is modeled and the feature maps are re-calibrated with this dependence.
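To make the computation above concrete, the following is a minimal PyTorch-style sketch of the channel attention module for the 2D case (the 3D case flattens D×H×W instead of H×W); initializing the scale factor β to zero, so the module starts as an identity mapping, is an assumption not stated in the text.

```python
import torch
import torch.nn as nn


class ChannelAttention2D(nn.Module):
    """Channel attention: re-weight channels by their pairwise interdependency."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))   # learnable scale beta (assumed init)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                           # a: (B, C, H, W)
        b, c, h, w = a.shape
        flat = a.view(b, c, h * w)                  # flatten each map to length N = H*W
        x = self.softmax(torch.bmm(flat, flat.transpose(1, 2)))   # X: (B, C, C)
        out = torch.bmm(x, flat).view(b, c, h, w)   # re-weight the channels with X
        return self.beta * out + a                  # scale by beta, add the input A
```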
An improved U-Net model (U-Net with CAM), combining U-Net with the channel attention module, is proposed. We add the channel attention module before the last convolutional layer of the decoder to help U-Net with CAM pay more attention to the contributions between channels. The specific network structure of U-Net with CAM is shown in Figure 3.

U-Net with spatial attention
Context is important for medical image segmentation, and spatial attention aims to capture global dependencies regardless of spatial position. In order to model richer local context dependence, our method employs a SAM structure (as shown in Figure 4), encoding a wider context dependence into local features. Each element of the spatial attention map S is

S_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{k=1}^{N} \exp(A_k \cdot A_j)}

Here are some remarks on the spatial attention map S. The element S_{ji} represents the influence of pixel i on pixel j; larger values indicate stronger relative dependence. Note that when calculating S, the value at a pixel is the inner product over the channel dimension, not a single channel.
Each element of the final output E is

E_j = \alpha \sum_{i=1}^{N} S_{ji} D_i + A_j

where α is the scale factor, D_i is the i-th element of D, and A_j is the j-th element of A. The attention is multiplied with the original map, i.e., the feature map is updated using the weighted sum over all positions, and the features are selectively strengthened according to the similarities between pixels. This is equivalent to using the learned long-distance dependencies.
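For illustration, a minimal 2D sketch of this pairwise spatial attention is given below (as discussed next, the full N×N formulation is too memory-hungry in 3D). Using the same feature map A on both sides of the inner product, taking D to be A itself, and initializing α to zero are assumptions made for this sketch.

```python
import torch
import torch.nn as nn


class PairwiseSpatialAttention2D(nn.Module):
    """Spatial attention with an explicit N x N map, N = H*W (2D illustration)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))   # learnable scale alpha (assumed init)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                            # a: (B, C, H, W)
        b, c, h, w = a.shape
        flat = a.view(b, c, h * w)                   # (B, C, N)
        # S[j, i]: inner product over channels of pixels i and j, normalized over i
        s = self.softmax(torch.bmm(flat.transpose(1, 2), flat))   # (B, N, N)
        d = flat                                     # here D is taken to be A itself
        out = torch.bmm(d, s.transpose(1, 2)).view(b, c, h, w)    # E_j = sum_i S_ji D_i
        return self.alpha * out + a                  # add the input A
```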
Based on the above analysis of the spatial attention module, an improved U-Net model (U-Net with SAM) is proposed. We added the spatial attention module before the last convolutional layer of the decoder to help U-Net with SAM pay better attention to contributions between pixels. The specific network structure of U-Net with SAM is shown in Figure 5.
However, the spatial attention map described above is very large in 3D and causes memory overflow. So the original design of the spatial attention mechanism is not feasible in 3D, especially when the image size is large. Therefore, we simplify it as shown in Figure 5. This spatial attention is obtained by a 1×1 convolution that collapses the C=32 channels into one map of size D×H×W, and each channel of the feature maps shares the same spatial attention. In this way, we effectively reduce memory usage. In this setting, the spatial attention map no longer represents the pixel-wise dependency, but it still guides the attention of the network to where it is needed.
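The following is a minimal sketch of this simplified 3D spatial attention under stated assumptions: a 1×1×1 convolution collapses the channels into a single attention map shared by all channels; bounding the map with a sigmoid and keeping a residual connection with a learnable scale α are assumptions not spelled out in the text.

```python
import torch
import torch.nn as nn


class SimplifiedSpatialAttention3D(nn.Module):
    """Memory-efficient spatial attention: one D x H x W map shared by all channels."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.collapse = nn.Conv3d(channels, 1, kernel_size=1)   # 1x1x1 conv -> single map
        self.alpha = nn.Parameter(torch.zeros(1))               # learnable scale (assumed)

    def forward(self, a):                           # a: (B, C, D, H, W)
        attn = torch.sigmoid(self.collapse(a))      # (B, 1, D, H, W), shared attention map
        return self.alpha * (attn * a) + a          # broadcast over channels, add input A
```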

U-Net with dual attention
Building on the CAM and SAM described above, we also propose a segmentation architecture that combines U-Net with both CAM and SAM, namely U-Net with DAM, to enhance the discriminative ability of the feature representation for coronary tree segmentation. As shown in Figure 6, the feature map output by the penultimate convolutional layer passes through the CAM and the SAM, and the CAM features and SAM features are then summed ("Sum fusion" in Figure 6) to obtain the final segmentation result of the coronary artery tree.
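As a rough sketch of the sum fusion, assuming the CAM and SAM branches are implemented as modules like those sketched above, the fused head could look as follows; the final 1×1×1 classification convolution and the channel/class counts are illustrative assumptions rather than the exact architecture.

```python
import torch.nn as nn


class DualAttentionHead3D(nn.Module):
    """Fuse a channel-attention branch and a spatial-attention branch by summation."""

    def __init__(self, cam: nn.Module, sam: nn.Module,
                 channels: int = 32, num_classes: int = 2):
        super().__init__()
        self.cam = cam                                  # channel attention branch
        self.sam = sam                                  # spatial attention branch
        self.classifier = nn.Conv3d(channels, num_classes, kernel_size=1)

    def forward(self, feat):                            # feat: penultimate-layer feature map
        fused = self.cam(feat) + self.sam(feat)         # "Sum fusion"
        return self.classifier(fused)
```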

U-Net with loss optimization
In the coronary artery tree segmentation task, there is an extreme imbalance between the foreground (coronary) and background (non-coronary) categories. If we use the plain cross-entropy loss, the background dominates the loss, which leads to low sensitivity. Therefore, following the previous work of [21], we use the focal loss to balance the foreground and background classes in semantic segmentation. The focal loss is defined as

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)

where p_t is the predicted probability of the ground-truth class. According to the conclusion of [21], γ = 2 gives the best performance. Therefore, in the coronary artery tree segmentation task, we mainly optimize α with γ fixed to 2.
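Below is a minimal sketch of a voxel-wise binary focal loss following the above definition. Treating α as a weight applied only to the foreground class (consistent with the values α = 1 and α = 2 compared in the experiments) is our interpretation, and the shapes of `logits` and `targets` are assumptions.

```python
import torch


def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      alpha: float = 2.0, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for 0/1 voxel labels."""
    prob = torch.sigmoid(logits)                          # foreground probability
    p_t = torch.where(targets > 0.5, prob, 1.0 - prob)    # probability of the true class
    alpha_t = torch.ones_like(p_t)
    alpha_t[targets > 0.5] = alpha                        # up-weight the foreground class
    loss = -alpha_t * (1.0 - p_t).pow(gamma) * torch.log(p_t.clamp_min(1e-8))
    return loss.mean()
```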

Experiments
In this part, we first introduce the training and testing setup in detail in Section 4.1. In Section 4.2, we evaluate the segmentation performance of the coronary artery tree with the U-Net with improved attention models. Then, we evaluate the impact of loss optimization on coronary segmentation performance in Section 4.3. Finally, we test the effect of isotropic spacing on the performance of coronary artery tree segmentation in Section 4.4.

Data and Setting
In the experiments, we collected 300 cases of coronary CTA, all from the retrospective data of Liaoning Provincial People's Hospital. The 300 cases include patients with suspected coronary heart disease (coronary artery stenosis > 50%) and patients without significant stenosis (coronary artery stenosis ≤ 50%). During training, all hyper-parameters follow those in 3D U-Net [3]. Specifically, the initial learning rate is 0.001, the weight decay is 0.0001 and the momentum is 0.98. In our experiments, with the number of training epochs set to 80 (a total of 40k iterations), all the deep learning models have converged.
We adopt standard data augmentation methods and train the networks using SGD with a mini-batch size of 2 per GPU. It takes 36 hours to train a model on an NVIDIA GTX 1080Ti.
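For reference, the optimizer settings quoted above can be written down as the following sketch; the model definition, data augmentation pipeline and learning-rate schedule are not reproduced here.

```python
import torch
import torch.nn as nn


def build_optimizer(model: nn.Module) -> torch.optim.SGD:
    """SGD with the hyper-parameters stated in the text."""
    return torch.optim.SGD(
        model.parameters(),
        lr=0.001,            # initial learning rate
        momentum=0.98,
        weight_decay=0.0001,
    )
```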
In the performance evaluation, we use the mean class-wise intersection over union (mean IoU), false positives, sensitivity and specificity to quantify the overall statistical performance of coronary artery tree segmentation. Among them, mean IoU is measured by voxel overlap, which may bias it towards thicker vessels because they contain more voxels. To balance this effect, sensitivity and specificity are measured by the length of the coronary vessel centerlines, which are extracted by a skeleton operation.
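The sketch below illustrates the two kinds of criteria: voxel-based mean IoU and a centerline-length-based sensitivity. Using scikit-image's 3D skeletonization as a stand-in for the paper's skeleton operation, and approximating centerline length by skeleton voxel counts, are assumptions of this sketch.

```python
import numpy as np
from skimage.morphology import skeletonize_3d


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 2) -> float:
    """Class-wise IoU averaged over classes, measured by voxel overlap."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))


def centerline_sensitivity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of the ground-truth centerline length covered by the prediction."""
    gt_skel = skeletonize_3d(gt.astype(bool)) > 0          # skeleton of labeled vessels
    covered = np.logical_and(gt_skel, pred.astype(bool)).sum()
    return covered / max(int(gt_skel.sum()), 1)
```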

Attention module for coronary segmentation
In this section, we conduct an in-depth discussion of the performance of the 3D U-Net with CAM, SAM, and DAM models on coronary tree segmentation. For the coronary tree segmentation task, the segmentation performance of the 3D U-Net model is used as the benchmark. As seen from Table 1, for the mean IoU the difference between 3D U-Net with CAM and U-Net with SAM is very small, and only U-Net with DAM drops by nearly 2%. For sensitivity, 3D U-Net performs similarly to the U-Net with SAM and U-Net with DAM models, while the 3D U-Net with CAM model shows a significant increase of more than 2%. Compared with the manually labeled ground truth, the improvement in sensitivity is reflected by more branches or a longer coronary tree. Similarly, for specificity, 3D U-Net produces results close to the U-Net with SAM and U-Net with DAM models, while the 3D U-Net with CAM model drops significantly by more than 4%. The main reason is that the 3D U-Net with CAM model captures more and longer branches. Compared with the ground truth, the additional mismatched branch sections fall into two categories: (1) correct arteries that were not segmented by hand in the ground truth, and (2) wrong vessels.
Both cases cause the specificity to decrease due to the significant increase in the denominator during calculation. However, we found that more cases belong to the first category than the second.
One can see that U-Net with CAM has the best coverage of branches among the four methods, and also yields clean segmentation at the distal parts. These observations are consistent with the statistical criteria summarized in Table 1.
In summary, mean IoU, false positives, sensitivity and specificity reflect the segmentation performance of the methods, and are consistent with the visual quality of the segmentations for most examples.

Loss for coronary segmentation
The results with focal loss and cross-entropy loss (baseline) are shown in Table 2. Compared with the U-Net baseline, one can find that the segmentation performance with focal loss has a significant decrease in mean IoU, which is mainly due to a significant increase in false positives.
Although these false positives contain wrong segmentations, more of them are correct coronary arteries that were not labeled by hand. Regarding the focal loss itself, we find that α = 2 gives better results than α = 1.

Isotropic spacing
We also investigate the performance of the four methods with different isotropic spacings. Specifically, the inputs to the networks are image patches of size 128 × 128 × 128, but we can choose different spacing values for the isotropic resampling. The results are shown in Table 3. The resampling spacing used during training changes the field-of-view (FOV) of the model: given the same input patch size, a larger resampling spacing corresponds to a larger FOV. When the FOV is large, the model better captures higher-level information, like the overall shape and anatomy of the coronary tree, but may lose some details. On the contrary, when the FOV is small, the details of the object can be learned well, but the model tends to confuse true coronary arteries with other types of vessels.
This tendency is shown in the table: the models with smaller spacing have higher sensitivity (indicating that more vessels are captured), but the false positives are also higher (indicating that more wrong vessels are captured).
From Table 3, we observe that the overall performance is best when the spacing is close to the original spacing (0.6 mm), and a spacing of 0.3 mm gives better results than 0.9 mm.
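A minimal sketch of the isotropic resampling discussed in this section is shown below, assuming a SciPy-based preprocessing step; the interpolation order and axis convention are assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from scipy.ndimage import zoom


def resample_isotropic(volume: np.ndarray, spacing, new_spacing: float = 0.6,
                       order: int = 1) -> np.ndarray:
    """Resample a CTA volume to isotropic voxel spacing (e.g. 0.3, 0.6 or 0.9 mm).

    `spacing` is the original (z, y, x) voxel size in mm; a larger `new_spacing`
    yields a coarser volume, i.e. a larger field-of-view per 128^3 patch.
    """
    factors = [s / new_spacing for s in spacing]    # scale factor per axis
    return zoom(volume, zoom=factors, order=order)  # order=1: trilinear interpolation
```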

Conclusion
In this paper, we combined an attention mechanism with the 3D U-Net model to improve it. We also investigated the impact of different losses and isotropic spacings on the performance of coronary artery tree segmentation. Among the improved models, 3D U-Net with CAM has the best segmentation performance; in particular, it yields the highest sensitivity while still keeping the segmentation clean.
From the experiments on focal loss, we should pay more attention to the problem of class imbalance during training; specifically, in coronary artery tree segmentation, the loss of the foreground class should contribute more to the gradient calculation. Regarding resampling spacing, training with an isotropic spacing close to the original spacing gives the best results.