Dual Attention Multiscale Network for Vessel Segmentation in Fundus Photography

Background: Automatic segmentation of vessel structures is an essential step towards an automatic disease diagnosis system. The task is challenging due to the varying shapes and sizes of vessels across populations. Methods: A multiscale network with dual attention is proposed to segment vessels of different sizes. The network applies a spatial attention module and a channel attention module to the feature map whose size is 1/8 of the input size. The network also uses multiscale inputs to receive multi-level information, and multiscale outputs to gain more supervision. Results: The proposed method is tested on two publicly available datasets: DRIVE and CHASEDB1. The accuracy, AUC, sensitivity, and specificity on the DRIVE dataset are 0.9615, 0.9866, 0.7693, and 0.9851, respectively. On the CHASEDB1 dataset, the metrics are 0.9797, 0.9895, 0.8432, and 0.9863, respectively. The ablation study further demonstrates the effectiveness of each part of the network. Conclusions: Both the multiscale design and the dual attention mechanism improve performance. The proposed architecture is simple and effective. The inference time is 12 ms on a GPU, which gives the method potential for real-world applications. The code will be made publicly available.


Background
The segmentation of vasculature in retinal images is important in aiding the management of many diseases, such as diabetes and hypertension. Diabetic Retinopathy (DR) is caused by high blood sugar levels and results in swelling of the retinal vessels [1]. Hypertensive Retinopathy (HR) is caused by high blood pressure and results in narrowing of the vessels or increased vascular tortuosity [2]. Early diagnosis of these pathologies often helps patients receive timely treatment. However, manually labeling vessel structures is time-consuming, tedious, and subject to human error. Automated segmentation of retinal vessels is therefore in high demand and can relieve the heavy workload of skilled staff.
The retinal blood vessel structure is extremely complicated, with high tortuosity and varied geometry, such as branching angles, branching patterns, length, and width [3]. The high anatomical variability and varying vessel scales across populations make the task challenging. Furthermore, noise and poor contrast, accompanied by low resolution, further increase the difficulty. Traditional vessel segmentation methods often cannot robustly segment all vessels of interest.
Deep learning methods show impressive performance on image segmentation. The most widely used architecture is U-Net [4]. The coarse-to-fine feature representation learned by U-Net is well suited to achieving satisfactory performance on small datasets. Attention U-Net can further improve performance [5]; its attention module automatically learns to focus on vessels of interest with varying shapes while preserving computational efficiency. DA-Net [6] proposes a spatial attention module and a channel attention module for natural scene parsing. These two modules use a self-attention mechanism to capture feature dependencies in the spatial and channel dimensions, respectively: the spatial attention module aggregates features at all positions by weighted summation, and the channel attention module captures the dependencies between any two channel maps. Motivated by DA-Net, this paper designs a dual attention multiscale network for vessel segmentation.

Related Works
In this section, we review the most commonly used attention mechanisms and multiscale networks for vessel segmentation.

Attention Network
Attention U-Net [5] captures a sufficiently large receptive field to collect semantic contextual information and integrates attention gates to reduce false-positive predictions for small objects that show large shape variability. Ni et al. [7] propose a global channel attention module for vessel segmentation that emphasizes the inheritance relationship of features across the entire network. CS-Net [8] integrates channel attention and spatial attention into U-Net for 2D and 3D vessel segmentation. Hao et al. [9] exploit contextual frames of sequential images in a sliding window centered at the current frame and equip the decoder stage with a channel attention mechanism. Li et al. [10] propose an attention gate to highlight salient features that are passed through the skip connections. HANet [11] automatically focuses the network's attention on regions that are "hard" to segment, where "hard" and "easy" vessel regions are determined from a coarse segmentation probability map.

Multiscale Network
Yue et al. [12] utilize image patches at different scales as inputs to learn richer multiscale information. Roberto et al. [13] propose a multiple-scale Hessian approach to enhance the vessels, followed by thresholding. Wu et al. [14] generate multiscale feature maps with max-pooling and up-sampling layers; their first multiscale network converts an image patch into a probabilistic retinal vessel map, and a following multiscale network further refines the map. Yin et al. [15] propose to utilize multiscale inputs to fuse multi-level information.
Different from all the above methods, the proposed method takes advantage of both multiscale and attention mechanisms.

Data Preparation
We conduct experiments on two datasets: DRIVE and CHASEDB1.
DRIVE: The Digital Retinal Images for Vessel Extraction (DRIVE) dataset for retinal vessel segmentation consists of 40 color fundus images of size 768 × 584 pixels, including 7 abnormal pathology cases. It is equally divided into 20 training and 20 testing images, along with two manual segmentations of the vessels. The first segmentation is accepted as the ground truth for performance evaluation, while the second is accepted as a human observer reference for performance comparison. The images were captured in digital form with a Canon CR5 non-mydriatic 3CCD camera at a 45° field of view (FOV).
CHASEDB1: The CHASEDB1 dataset [16] for retinal vessel segmentation consists of 28 color retinal images of size 960 × 999 pixels, collected from both the left and right eyes of 14 school children. These images were captured by a handheld Nidek NM-200-D fundus camera at a 30° field of view, and each image is annotated by two independent human experts. We select the first 20 images for training and the remaining 8 images for testing [17].

Evaluation Criteria
The vessel segmentation process is a pixel-based classification, with each pixel classified as vessel or surrounding tissue. We employ 4 indicators (Specificity (Spe), Sensitivity (Sen), Accuracy (Acc), and Area Under the ROC Curve (AUC)) to measure model performance. Sensitivity (Sen) is the ability to detect vessel pixels, while Specificity (Spe) is the ability to detect non-vessel pixels. Accuracy (Acc) is measured by the ratio of the total number of correctly classified pixels (the sum of true positives and true negatives) to the number of pixels in the image field of view (FOV):

Acc = (TP + TN) / (TP + TN + FP + FN).   (1)

Here, TP (true positive) counts pixels identified as vessel in both the segmented image and the ground truth, TN (true negative) counts non-vessel pixels of the ground truth that are correctly classified in the segmented image, and FP (false positive) and FN (false negative) count the correspondingly misclassified pixels.
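For completeness, the standard per-pixel definitions of the two rates, consistent with the TP/TN/FP/FN counts above (the paper describes them only in words), are:

Sen = TP / (TP + FN),   Spe = TN / (TN + FP).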

Implementation Details
We set the learning rate to 0.001, decayed by a factor of 10 every 50 epochs. The network was trained for 300 epochs from scratch on an NVIDIA GeForce RTX 3090 Ti GPU. The input images of the neural network are resized to 512 × 512. To improve the generalization ability of the network, we also use several data augmentation techniques, including random horizontal flips with a probability of 0.5, random rotations in [−20°, 20°], and gamma contrast enhancement in [0.5, 2].
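As an illustration, this schedule and these augmentations could be set up as follows in PyTorch/torchvision; this is a minimal sketch, the optimizer choice and the placeholder model are our assumptions, not details from the paper.

import random
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

# Augmentations described above; in a real pipeline the geometric transforms
# must also be applied identically to the ground-truth mask.
augment = T.Compose([
    T.Resize((512, 512)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=20),
    T.Lambda(lambda img: TF.adjust_gamma(img, gamma=random.uniform(0.5, 2.0))),
    T.ToTensor(),
])

model = torch.nn.Conv2d(3, 1, kernel_size=1)  # placeholder for the network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer assumed
# Decay the learning rate by a factor of 10 every 50 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)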

Performance Evaluation
In this section, we compare our method with other state-of-the-art methods on the DRIVE and CHASEDB1 datasets. The methods include U-Net [4], Zhang et al. [18], Liskowski et al. [19], DRIU [20], Yan et al. [21], CE-Net [22], LadderNet [23], DU-Net [24], Bo Liu et al. [25], VesselNet [26], DA-Net [6], Yin et al. [15], and CS-Net [8]. Table 1 shows the performance on the DRIVE dataset, and Figure 1 shows the predictions of the proposed method on DRIVE. The proposed method achieves the highest AUC among the compared methods. CS-Net inserts its attention module into the branch whose feature map size equals 1/16 of the original size, whereas our attention module gathers information from the feature map at 1/8 of the original image size. Although the input image size of CS-Net is 384 × 384 while ours is 512 × 512, our method still achieves a higher AUC than CS-Net. Our method also outperforms DA-Net, since DA-Net directly upsamples the attention map as the output. The attention module used in our method further differs from CS-Net: we concatenate the spatial attention map, the channel attention map, and the sum of these two maps to form a more discriminative feature representation. Table 2 shows the performance evaluation on the CHASEDB1 dataset, and Figure 2 shows the corresponding predictions. Our method surpasses all the other methods. The experimental results verify the effectiveness of the proposed method.

Ablative Studies
This section evaluates the contribution of each part of the network. Table 3 and Table 4 show the performance of the proposed network with different modules on the DRIVE and CHASEDB1 datasets, respectively. The evaluation metrics include mean IoU, commonly used in semantic segmentation. The multiscale architecture significantly improves performance compared to U-Net, and the attention mechanism improves it further.

Discussion
The dual attention network combines spatial attention and channel attention with a multiscale network. In this section, we analyze the time efficiency and the parameter count of the proposed model.
Time Efficiency: The inference time of AM-Net is 12 ms on an NVIDIA RTX 3090 Ti GPU for one image of size 512 × 512. The simple and effective architecture can be easily applied to smart AI applications.
Model Parameters: The proposed AM-Net has 9.95M parameters and 75.438 GFLOPs. The multiscale backbone requires 9.415M parameters and 73.163 GFLOPs; the dual attention module adds only 0.54M parameters.
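As a rough illustration, figures like these could be measured as follows in PyTorch; the placeholder model and warm-up count are our assumptions, and FLOPs would require an external profiler.

import time
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=1).cuda().eval()  # placeholder
x = torch.randn(1, 3, 512, 512, device='cuda')

n_params = sum(p.numel() for p in model.parameters())

with torch.no_grad():
    for _ in range(10):           # warm-up iterations
        model(x)
    torch.cuda.synchronize()      # wait for pending kernels before timing
    t0 = time.time()
    model(x)
    torch.cuda.synchronize()
    print(f"inference: {(time.time() - t0) * 1e3:.1f} ms, "
          f"params: {n_params / 1e6:.2f}M")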
The proposed network performs fast inference and has few parameters. The simple and effective multiscale dual attention network has the potential to be deployed in real-world applications.

Conclusion
In this paper, we propose a dual attention multiscale network. The network contains multiscale inputs, multiscale outputs, and a dual attention module. The dual attention module consists of spatial attention and channel attention: the spatial attention module gathers information from different positions, and the channel attention module models the relationships between different channels. The experiments verify the effectiveness of the proposed method. The proposed framework enables fast inference and can be deployed in real-world applications.

Methods
The network structure consists of multiscale inputs, a dual attention module, and multiscale outputs.

U-shape Architecture
The architecture of the proposed method is shown in Figure 3. Our network is constructed based on U-Net, and the input of the encoder path is an image pyramid. Each encoder stage applies two 3 × 3 convolution layers, each followed by an element-wise rectified linear unit (ReLU) activation, and a 2 × 2 max-pooling operation to generate the encoder feature maps. The down-sampled feature map is concatenated with the feature map of the correspondingly down-scaled input image. The number of feature maps is doubled after each down-sampling, enabling the architecture to learn complex structures efficiently. Mirroring the encoder path, each decoder stage produces a decoder feature map using two 3 × 3 convolution layers and a 2 × 2 up-sampling layer, with the number of feature maps halved to preserve symmetry. The feature map of the encoder path is appended to the input of the corresponding decoder stage through skip connections. Finally, the high-dimensional feature representation output by the last decoder layer is fed to the dual attention module to learn the relationships between positions and channels, so that a more discriminative feature representation can be sent to the multiscale output layer for the final prediction.
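The following is a minimal PyTorch sketch of one encoder stage with the multiscale input injection; the channel sizes and fusion-by-concatenation detail are illustrative choices, not the authors' exact configuration.

import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """Two 3x3 conv + ReLU layers; optionally fuses a down-scaled input image."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x, scaled_img=None):
        if scaled_img is not None:
            # Concatenate the pyramid image at this scale with the incoming
            # features; in_ch must then include the image channels.
            x = torch.cat([x, scaled_img], dim=1)
        feat = self.block(x)   # kept for the decoder skip connection
        return feat, self.pool(feat)

The image pyramid itself can be built by repeatedly down-scaling the input with torch.nn.functional.interpolate.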

Dual Attention Mechanism
The dual attention mechanism contains a spatial attention module and a channel attention module as shown in Figure 4.
Spatial Attention Module: The spatial attention module models rich contextual dependencies over the feature maps by learning a spatial attention matrix, which represents the spatial relationship between the features of any two pixels. Different from DA-Net [6], we place the attention module in the branch whose feature map size equals 1/8 of the original image rather than directly upsampling the attention map as the output. This design retains more detailed information without adding many parameters. Furthermore, vessel segmentation requires skip-connection operations to fuse low-level information and recover the spatial information lost through down-sampling. The input feature representation S ∈ R^{C×H×W} is fed into three convolution layers to generate three feature maps A, B, and C, where A, B, C ∈ R^{C×H×W}. The three feature maps are reshaped to C × N, where N = H × W is the total number of pixels. After that, the transpose of A is multiplied with B and followed by a softmax layer to form the spatial attention map SA ∈ R^{N×N}:

SA_{ji} = exp(A_i · B_j) / Σ_{i=1}^{N} exp(A_i · B_j).

The entry SA_{ji} represents the impact of position i on position j; similar feature representations produce greater correlation. The transpose of SA is multiplied with C, and the result is reshaped to R^{C×H×W}, multiplied by a scale parameter α, and summed element-wise with the input feature S to generate the final spatial attention feature map SAO:

SAO_j = α Σ_{i=1}^{N} (SA_{ji} C_i) + S_j.

Here, α is a learnable parameter initialized to 0. The spatial attention module computes a weighted sum across all positions of the feature map, so the relationships between vessel pixels at different locations are fully learned: similar vessel pixels promote each other, and the spatial attention module improves semantic consistency.
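A minimal PyTorch sketch of this module, following the formulas above; the class and variable names are ours, and the 1 × 1 kernels for the three convolutions are an assumption.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Three convolutions produce A, B, and C from the input S.
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scale, init 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, s):
        batch, c, h, w = s.shape
        n = h * w
        a = self.conv_a(s).view(batch, c, n)   # A, reshaped to C x N
        b = self.conv_b(s).view(batch, c, n)   # B
        v = self.conv_c(s).view(batch, c, n)   # C
        # SA[j, i]: impact of position i on position j.
        sa = self.softmax(torch.bmm(a.permute(0, 2, 1), b))       # N x N
        out = torch.bmm(v, sa.permute(0, 2, 1)).view(batch, c, h, w)
        return self.alpha * out + s            # SAO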
Channel Attention Module: The channel attention module learns the relationships between the different feature map channels of high-level features. Long-range contextual information in the channel dimension helps improve vessel segmentation performance, since different vessel responses are associated with each other. The original feature representation S ∈ R^{C×H×W} is reshaped to A′ ∈ R^{C×N}, where N = H × W is the total number of pixels. A′ and the transpose of A′ are multiplied and followed by a softmax layer to form the channel-wise attention map CA ∈ R^{C×C}:

CA_{ji} = exp(A′_i · A′_j) / Σ_{i=1}^{C} exp(A′_i · A′_j).
The entry CA_{ji} measures the impact of channel i on channel j. The transpose of CA is multiplied with the input feature map S, and the result is reshaped to R^{C×H×W}, scaled by a parameter β, and summed element-wise with S to form the final channel attention feature map CAO ∈ R^{C×H×W}:

CAO_j = β Σ_{i=1}^{C} (CA_{ji} S_i) + S_j.

Like α, β is a learnable scale parameter. Different from other attention modules, we concatenate the spatial attention feature map, the channel attention feature map, and the element-wise sum of the two maps to form a more capable feature representation.
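A matching PyTorch sketch of the channel attention module and the concatenation-based fusion; the DualAttention wrapper name is ours, and it reuses the SpatialAttention sketch above.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learnable scale
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, s):
        batch, c, h, w = s.shape
        a = s.view(batch, c, -1)                              # A': C x N
        # CA[j, i]: impact of channel i on channel j.
        ca = self.softmax(torch.bmm(a, a.permute(0, 2, 1)))   # C x C
        out = torch.bmm(ca, a).view(batch, c, h, w)
        return self.beta * out + s                            # CAO

class DualAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.sam = SpatialAttention(channels)  # from the sketch above
        self.cam = ChannelAttention()

    def forward(self, s):
        sao, cao = self.sam(s), self.cam(s)
        # Concatenate the two attention maps and their element-wise sum.
        return torch.cat([sao, cao, sao + cao], dim=1)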

Multiscale Output
Multiscale outputs provide more supervision during network training. There are M side-output layers in the network, and each side-output layer can be considered a classifier that generates a matching local output map for the earlier layers. Each side-output layer is trained with the cross-entropy loss

L_cross-entropy(y, y′) = −Σ_i y′_i log(y_i),

where y_i is the predicted probability for class i and y′_i is the true probability for that class. We compute 4 side-output maps plus an average layer that combines them all, and the final optimization objective is the sum of these 5 side-output losses. The side-output layers alleviate the gradient vanishing problem by back-propagating the side-output losses to the early layers of the decoder path, which helps the training of those layers. We use multiscale fusion because it has been proven to achieve high performance, and the side-output layers add supervision at each scale so that each scale produces a better result. The final layer, regarded as a classifier, treats vessel segmentation as a pixel-wise classification and produces a probability map for each pixel.
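A minimal sketch of this deeply supervised loss in PyTorch; the upsampling step and the function name are our assumptions about how the side outputs reach a common resolution.

import torch
import torch.nn.functional as F

def multiscale_loss(side_outputs, target):
    """Sum of cross-entropy losses over 4 side outputs plus their average.

    side_outputs: list of 4 logit maps, each of shape B x num_classes x h x w
    target: B x H x W ground-truth label map at full resolution
    """
    # Bring every side output to the target resolution before scoring.
    outs = [F.interpolate(o, size=target.shape[-2:], mode="bilinear",
                          align_corners=False) for o in side_outputs]
    fused = torch.stack(outs).mean(dim=0)  # the averaging layer
    losses = [F.cross_entropy(o, target) for o in outs + [fused]]
    return sum(losses)  # final objective: sum of the 5 losses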