MF2ResU-Net: A Multi-Feature Fusion Deep Learning Architecture for Retinal Blood Vessel Segmentation

Abstract: Segmentation of blood vessels is an essential step in computer-aided diagnosis systems for diseases in ophthalmology, neurosurgery, oncology, cardiology, and laryngology. Aiming at the problem of insufficient segmentation of small blood vessels by existing methods, a novel method based on a multi-module fusion residual neural network model (MF2ResU-Net) was proposed. In the proposed network, to obtain refined vessel features, three cascade-connected U-Net networks were employed as the main networks. To deal with the problem of over-fitting, residual paths were used in the main networks. In the U-Net blocks of MF2ResU-Net, in order to remove the semantic difference between low-level and high-level features, shortcut connections were used to combine the encoder and decoder layers of the blocks. Furthermore, atrous spatial pyramid pooling (ASPP) was embedded between the encoder and decoder to obtain multi-scale features of blood vessels. During training, to deal with the imbalance between background and foreground, a novel joint loss function was proposed based on the dice loss and a cost-sensitive cross entropy, which could greatly reduce the effect of class imbalance. The Sen, Spe, ACC and AUC of the model are 0.8013 and 0.8102, 0.9842 and 0.9809, 0.9700 and 0.9776, and 0.9797 and 0.9837, respectively, for DRIVE and CHASE DB1. The results of the experiments demonstrated the effectiveness and robustness of the model in the segmentation of vessels with complex curvature and small blood vessels.

retinal blood vessels based on feature classification. The blood vessels were extracted from the color fundus image by applying preprocessing methods and segmentation techniques using a matched filter and a modified local entropy thresholding operation. To reduce the time ophthalmologists spend examining retinal images, Fan Guo et al. [7] proposed a supervised method for segmenting blood vessels in retinal images based on the ELM classifier. For these machine learning methods, the features used for classification could have a large impact on the prediction results.
background, we proposed a novel loss function based on the dice loss and cross entropy, which could reduce the effects of sample imbalance during training.

The contributions of our work can be elaborated as follows:
1. To refine the representation features of small retinal vessels, a novel network, the multi-module fusion residual neural network model MF2ResU-Net, was proposed, by which features of blurry small vessels can be detected.
2. A novel loss function, based on the dice loss and cross entropy with an added cost-sensitive matrix, was introduced to achieve a more balanced segmentation between vessel and non-vessel pixels.

The rest of this paper is organized as follows: Section 2 presents the improved method and the fusion model; Section 3 introduces the experimental data sets and analyzes the experimental results; Section 4 summarizes the paper and draws our conclusions.

Because of the difficulty of feature detection on small blurry vessels, it was hard to obtain satisfactory segmentation results with conventional methods. In this paper, we proposed a multi-module fusion residual U-Net model, named MF2ResU-Net, to refine the small vessels and to obtain the segmentation of retinal vessels.

In MF2ResU-Net, we used three cascade-connected U-Nets as the backbone network of the module. U-Net is a classic encoder-decoder network; a distinctive contribution of the U-Net architecture was the introduction of shortcut connections between the corresponding layers before the max-pooling and after the deconvolution operations, as shown in Fig. 1(a). The features coming from the encoder were computed in the earlier layers of the network. In contrast, the decoder features, having gone through convolution, down-sampling and up-sampling, were of a much higher level, because they were computed in the very deep layers of the network. Thus, there were semantic differences between the corresponding layers of the encoder and decoder. In order to remedy these differences, Szegedy et al. [20] used the Inception network. The simplest way to augment U-Net with a multi-resolution analysis capability was to incorporate 3 × 3 and 7 × 7 convolution operations in parallel to the 5 × 5 convolution operation, as shown in Fig. 1(b). Therefore, replacing the convolutional layers with Inception-like blocks should facilitate the U-Net architecture to reconcile the features learnt from the image at different scales. Another possible option was to use strided convolutions, but in our experiments, although performance improved, the introduction of additional convolutional layers in parallel extravagantly increased the memory requirement. We therefore factorized the bigger, more demanding 5 × 5 and 7 × 7 convolutional layers using a sequence of smaller, lightweight 3 × 3 convolutional blocks, as shown in Fig. 1(c).
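The factorization argument can be checked with simple receptive-field arithmetic: a stack of stride-1 convolutions has receptive field 1 + Σ(k − 1), so two 3 × 3 layers see the same window as one 5 × 5 layer and three see a 7 × 7 window, at a lower parameter cost. A small illustrative sketch (not from the paper):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_params(kernel_sizes, channels):
    """Weight count (ignoring biases) for a chain of convs with `channels` in/out."""
    return sum(k * k * channels * channels for k in kernel_sizes)

# Two stacked 3x3 convs cover the same window as one 5x5 conv...
assert receptive_field([3, 3]) == receptive_field([5]) == 5
# ...and three stacked 3x3 convs match a 7x7 conv,
assert receptive_field([3, 3, 3]) == receptive_field([7]) == 7
# at a lower parameter cost (18c^2 vs 25c^2, and 27c^2 vs 49c^2):
assert conv_params([3, 3], 1) < conv_params([5], 1)
assert conv_params([3, 3, 3], 1) < conv_params([7], 1)
```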

Our fusing residual path was a cascade-connected block structure, and each block consisted of convolutional layers with 3×3 filters and a 1×1 filter. This modification greatly reduced the memory requirement. We gradually increased the number of filters in those blocks, to prevent the memory requirement of the earlier layers from propagating excessively to the deeper part of the network. We also added a residual connection because of its efficacy in biomedical image segmentation (Drozdzal et al. [21]), as well as a 1 × 1 convolutional layer, which may have allowed us to capture some additional retinal spatial information. We named this structure 'Res-path', and, according to the different numbers of layers, set Res-paths of different lengths.

In order to refine feature maps of retinal vessels, we used a fusing U-Net, named ResU-Net, as the block of our model, as shown in Fig. 2. To avoid losing features during detection, we used a light U-Net structure with two convolutional layers and two max-pooling layers for down-sampling in the encoder and two convolutional layers and two deconvolution layers for up-sampling in the decoder.
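The data flow of one Res-path block, a 3 × 3 convolution with a parallel 1 × 1 convolution added as a residual shortcut, can be sketched outside any framework. The NumPy sketch below uses hypothetical single-channel weights and only illustrates the block's structure, not the paper's exact implementation:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive stride-1 'same' convolution of a 2-D map with kernel w."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

def res_path_block(x, w3, w1):
    """One Res-path block: 3x3 conv plus a 1x1-conv residual shortcut."""
    return conv2d_same(x, w3) + conv2d_same(x, w1)

x = np.random.rand(8, 8)
w3 = np.full((3, 3), 1.0 / 9.0)   # hypothetical 3x3 weights (averaging)
w1 = np.array([[1.0]])            # 1x1 shortcut (identity here)
y = res_path_block(x, w3, w1)
assert y.shape == x.shape         # spatial size is preserved
```

Chaining several such blocks, with the filter counts growing toward the deeper layers, gives the cascade structure described above.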

According to the tubular characteristics of vessels of various sizes, four atrous convolutions with 3×3 kernels were used for multi-scale feature extraction. The ASPP module was inspired by the spatial pyramid pooling method of DeepLab v2 [22]; however, with too large a dilation rate the network extracts invalid features from blood vessels, so such rates were not suitable for retinal data sets, and the dilated convolution with a rate of 24 in ASPP was removed. The dilation rates in our model were defined as {2, 4, 8, 16}. In order to accelerate computation, a 1×1 convolution was employed after each atrous convolution. The resolution of each feature map after atrous convolution was expanded by bilinear interpolation, which made the sizes of the feature maps of all layers consistent. Finally, the target feature map was formed from the four feature maps through pixel-wise addition. The parameters of the ASPP module are shown in Table 1.
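The choice of rates can be checked with the standard dilation arithmetic: a k × k convolution dilated by rate r has an effective extent of k + (k − 1)(r − 1). A quick check of the rates above:

```python
def effective_kernel(k, rate):
    """Effective spatial extent of a k x k convolution dilated by `rate`."""
    return k + (k - 1) * (rate - 1)

# Rates used in the model's ASPP module give extents of 5, 9, 17 and 33:
assert [effective_kernel(3, r) for r in (2, 4, 8, 16)] == [5, 9, 17, 33]
# A rate of 24 would span 49 pixels -- wider than a 48 x 48 training patch:
assert effective_kernel(3, 24) > 48
```

This is consistent with dropping the rate-24 branch: its effective kernel would exceed the 48 × 48 patches used for training.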

The aim of this study was to build deep learning models to segment retinal vessels in fundus images, as illustrated in Fig. 4.

From statistics on typical retinal blood vessels in the data sets, we found that the ratio of foreground pixels to background pixels was highly imbalanced. The binary cross entropy loss is defined as

L_CE = -(1/N) Σ_k [ y_k log(p_k) + (1 - y_k) log(1 - p_k) ],   (3)

where N was the number of patch pixels, p_k represented the predicted foreground probability of the input pixel k, and y_k was the true label of pixel k, which was either 1 (foreground) or 0 (background) in this task. For the imbalance problem in our task, we improved the binary cross entropy loss function. As presented in Form. (4), we defined a novel cost-sensitive cross entropy loss

L_CE = -(1/N) Σ_k [ λ y_k log(p_k) + (1 - y_k) log(1 - p_k) ],   (4)

where λ is the penalty parameter for predicting the blood vessel, a positive real number. In this loss function, we could set a large value of λ to enlarge the loss of wrong predictions in the foreground. Given the ratio of foreground to background, λ was set to 12 in this paper.

To deal with the imbalance problem and obtain a good criterion on intersection over union, the dice loss function [23] was proposed for segmentation tasks, which can be presented as Form. (5):

L_dice = 1 - 2 |X ∩ Y| / (|X| + |Y|),   (5)

where X represented the fundus blood vessel region segmented by the algorithm, and Y denoted the fundus blood vessel region manually segmented by the expert; |X ∩ Y| represented the overlap between the region segmented by the proposed method and that of the expert. To remedy numerical problems when the denominator approaches zero, we improved the dice loss with a smoothing term:

L_dice = 1 - (2 |X ∩ Y| + 1) / (|X| + |Y| + 1).   (6)

To deal with the vanishing gradient problem of the dice loss and combine the advantages of the two loss functions, a synthetic loss function for the training of the MF2ResU-Net model was proposed in this paper:

L = α L_CE + (1 - α) L_dice,   (7)

where α was a parameter which controls the contributions of the L_CE and L_dice loss functions.
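The joint loss described above can be sketched as follows. The NumPy version below is an illustrative re-implementation: λ = 12 and the combination weight α come from the text, while the smoothing constant and clipping epsilon are standard assumptions:

```python
import numpy as np

def cost_sensitive_ce(p, y, lam=12.0, eps=1e-7):
    """Cross entropy with penalty lam on the foreground (vessel) term."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(lam * y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def dice_loss(p, y, smooth=1.0):
    """Smoothed dice loss: 1 - (2|X∩Y| + 1) / (|X| + |Y| + 1)."""
    inter = np.sum(p * y)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(p) + np.sum(y) + smooth)

def joint_loss(p, y, alpha=0.5, lam=12.0):
    """alpha * L_CE + (1 - alpha) * L_dice."""
    return alpha * cost_sensitive_ce(p, y, lam) + (1.0 - alpha) * dice_loss(p, y)

# A perfect prediction scores (almost) zero; a poor one scores higher.
y = np.array([1.0, 1.0, 0.0, 0.0])
poor = np.array([0.1, 0.2, 0.9, 0.8])
assert joint_loss(y, y) < joint_loss(poor, y)
```

Raising λ inflates the penalty for missed vessel pixels, which counteracts the dominance of the background class during training.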

Before feeding images to the MF2ResU-Net model, a contrast enhancement technique was used to make the retinal blood vessel features more obvious. Since gray scale images showed better contrast than RGB images [25], we used gray scale images as the input of the models. To strengthen the contrast between vessels and background in retinal images, we used three strategies for image preprocessing: normalization, contrast limited adaptive histogram equalization (CLAHE) [26], and gamma correction. Fig. 6 shows the preprocessed results of one typical retinal image using the three strategies. The preprocessed retinal image had a high contrast between the blood vessel outline and the background, with reduced noise.
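This pipeline can be sketched as below. The gray-conversion weights and gamma value are illustrative assumptions, and CLAHE itself is usually delegated to a library (e.g. OpenCV's createCLAHE), so it is only indicated by a comment to keep the sketch dependency-free:

```python
import numpy as np

def to_gray(rgb):
    """Luminance-style gray conversion (standard Rec. 601 weights)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def normalize(img):
    """Min-max normalization to [0, 1]."""
    img = img.astype(float)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

def gamma_correct(img, gamma=1.2):
    """Gamma correction for an image already scaled to [0, 1]."""
    return np.power(img, gamma)

# CLAHE would sit between normalization and gamma correction, e.g.
# cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)) -- not run here.

rgb = np.random.rand(32, 32, 3) * 255.0
g = gamma_correct(normalize(to_gray(rgb)))
assert g.shape == (32, 32) and g.min() >= 0.0 and g.max() <= 1.0
```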

Data augmentation was widely applied in convolutional neural networks because of its efficiency and operability. Since DRIVE and CHASE DB1 were small datasets, the models would be prone to overfitting and show poor classification performance. Therefore, it was necessary to augment the datasets to achieve better results. Four image processing steps were used for augmentation: rotating, mirroring, shifting and cropping. To reduce the overfitting problem, our models were trained on small patches which were randomly extracted from the images. In order to reduce the computational complexity while preserving surrounding local features, small blocks of 48 × 48 pixels were randomly extracted from the preprocessed images and used to train our model. Typical patches and their corresponding labels are presented in Fig. 7.

In order to quantitatively evaluate the segmentation of the proposed algorithm, four evaluation indicators were used in this paper: accuracy (ACC), sensitivity (Sen), specificity (Spe) and F1-Score [28]. In this model, positive referred to blood vessels and negative referred to background. The ACC, Sen, Spe and F1-Score are defined as follows:
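These four indicators follow the standard definitions over the pixel-wise confusion counts (positive = vessel); a minimal sketch:

```python
import numpy as np

def metrics(pred, truth):
    """ACC, Sen, Spe and F1 from binary pixel masks (1 = vessel)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)      # vessel pixels correctly detected
    tn = np.sum(~pred & ~truth)    # background correctly rejected
    fp = np.sum(pred & ~truth)     # background marked as vessel
    fn = np.sum(~pred & truth)     # vessel pixels missed
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)           # sensitivity (recall)
    spe = tn / (tn + fp)           # specificity
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, sen, spe, f1

pred = np.array([1, 0, 0, 0])
truth = np.array([1, 1, 0, 0])
acc, sen, spe, f1 = metrics(pred, truth)
assert (acc, sen, spe) == (0.75, 0.5, 1.0)
```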

The AUC value represents the area under the ROC curve. The ROC curve was an important method for measuring the comprehensive performance of image semantic segmentation; its value ranged from 0 to 1. AUC = 1 indicates a perfect classifier; 0.5 < AUC < 1, a classifier better than random; 0 < AUC < 0.5, a classifier worse than random. Each test image was segmented into many image blocks of size 48 × 48, as shown in Fig. 8(b), and the analysis of the experimental results of Block4 is shown in the line charts in Fig. 8 and in Table 3.

We compared our proposed model with two state-of-the-art networks. One was the customized implementation of U-Net, which we introduced above; the other was DeepLab v2, an advanced segmentation network which combines deep convolutional nets, atrous convolution, and fully connected CRFs. In MF2ResU-Net, we combined atrous convolution and U-Net. In order to highlight the merits of our work, we compared the three models, DeepLab v2, U-Net and MF2ResU-Net, on the DRIVE and CHASE DB1 datasets. We evaluated the models using the test data; Sen, Spe, ACC, F1 and AUC were compared, as shown in Tables 4 and 5.
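AUC can be computed without explicitly tracing the ROC curve, via the Mann-Whitney rank statistic: the probability that a randomly chosen vessel pixel scores higher than a randomly chosen background pixel. A small sketch, assuming untied scores:

```python
import numpy as np

def auc_score(scores, labels):
    """AUC via the Mann-Whitney U statistic (assumes untied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based ranks
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Sum of positive ranks, minus its minimum possible value, over all pairs.
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

scores = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0, 0, 1, 1])
assert auc_score(scores, labels) == 0.75   # 3 of 4 pos/neg pairs are ordered correctly
```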

Moreover, we evaluated the models using ROC curves, which are shown in Fig. 9. The closer the ROC curve to the top-left border of the ROC coordinates, the more accurate the model. These results showed that the curves of MF2ResU-Net were the closest to the top-left among the three models, while the U-Net curve was the lowest of the three.

From the whole images in Fig. 10, the results of MF2ResU-Net were closer to the ground truth than those of the other methods, which meant that our model was superior to the compared methods. From the enlarged segment blocks, MF2ResU-Net could detect small vessels that were disturbed by complex background, while the small vessels in the results of the compared methods were blurred or disappeared. MF2ResU-Net also gave a more distinct expression of vascular features, and could segment subtle blood vessels that were not obvious. Fig. 11 shows the segmentation details on the two datasets.

We also compared our method with several recently published state-of-the-art approaches.

From

In this paper, we presented a novel residual neural network based on U-Net for retinal vessel segmentation. The experimental results proved that the method succeeded both in absolute terms and in comparison with nine other state-of-the-art methods on two well-known publicly available datasets. The proposed method encompasses many elements that jointly contribute to its success. To refine the segmentation features of retinal vessels, we used residual paths to connect the encoder and decoder of U-Net, and ASPP was used between the encoder and decoder to obtain global features.