As early as the 1860s, doctors had started looking for ways to take pictures of the eye. By the 1880s, a partial solution was developed: doctors could take a picture of the eye by placing a camera on the patient's head, but still had to wait about three minutes for the film to develop. Although the procedure appears simple and basic, it was a great development; it was the first time a picture of the eye could be captured by anyone. In 1926, the first fundus camera was invented. This camera could only photograph a portion of the eye. In 1997 (roughly 70 years later), a retinal fundus camera that could capture a 130-degree field was developed. By the 21st century, a non-invasive camera that could capture a 200-degree view was developed [14]. In terms of computers, the first algorithm executed on a machine was created by Ada Lovelace in 1843. Since then, numerous algorithms have been created to perform specific tasks. In 1966, cameras were attached to computers to see whether the machines could identify objects (this was the beginning of computer vision). By the next decade, mathematical analysis and quantitative applications were introduced to computer vision. Scale-space representation [15], contour models [16], and Markov random fields [17] became part of computer vision algorithms.
Marvin Minsky was among the first to attempt to mimic the human brain, and his research opened the way for computers to process information for decision-making. In 1959, Russell Kirsch invented a digital image scanner that transformed images into grids of numbers. In 1963, Lawrence Roberts derived 3D information about solid objects from 2D photographs. In 1980, Kunihiko Fukushima built the precursor of the modern CNN. In 1999, David Lowe described a visual recognition system that uses local invariant features. In 2001, the first real-time face detection framework was introduced. The breakthrough moment in computer vision came in 2012, when AlexNet won the ImageNet competition. Since then, many researchers have used CNN methods to segment and classify medical images (especially retinal fundus images).
2.1 Review of methods
Researchers have proposed several CNN methods to segment and classify retinal fundus images. These methods combine network architectures into sophisticated AI platforms. To give readers a grasp of the latest trends, we have categorized these CNN methods into three groups: (1) CNN methods to classify and segment the optic disc and cup, (2) CNN methods to classify and segment arteries and veins, and (3) CNN methods to classify and segment retinal blood vessels. Each of these categories contains a significant number of CNN methods that have shown good results. A block diagram depicting the different CNN approaches is shown in Fig. 4.
2.2 CNN used for Segmentation and Classification
2.2.1 CNN used for Optic Disk and Optic Cup
Automatic segmentation of the optic disk and optic cup can help remove problems encountered or envisaged in the manual procedure. However, this segmentation task faces challenges such as (1) unclear boundaries, (2) large variability, (3) interference from other components in the image, and (4) mixed pathologies. To solve these problems, researchers have proposed different CNN methods. Reference [18] proposed an encoder-decoder network with two components. The first component is the feature detection sub-network (FDS), while the second is the cross-correction sub-network (CCS). The FDS preserves features by stacking two 3x3 convolutional layers, batch normalization (BN), and a rectified linear unit (ReLU). The CCS is used to reduce the effect of multiple pooling operations. The second part of the network is the decoding block, which improves contrast and combines multiple encoding features. Overall, reference [18] used the FDS and CCS as the encoding layer and a decoding layer to upsample the segmentation. Fu et al. [21] used a combination of UNet and probability bubbles for segmentation of the optic disc (OD). The images were preprocessed with an iterative robust homomorphic surface filtering (IRHSF) method [22]. The UNet then detects the OD and the blood vessels. A position constraint model was introduced to avoid bright lesions that could act as distractions. The Hough transform segments the images and provides a way to avoid distraction from DR lesions. Finally, probability bubbles were modeled and used to fuse the segmented components. Reference [23] segments the OD with RGV-generated CNN images. First, the data was prepared and augmented. Then the RGV images were converted to RGB images. A two-stage technique was used to alleviate the class imbalance problem. The candidate location was determined with a guided search procedure, and weighted neighborhood voting was then conducted to produce the localized OD position. The CNN architecture consists of a convolutional layer, a max-pooling layer, a fully connected layer, and a ReLU. A softmax layer performed the logistic regression that produced the image output.
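For illustration, a minimal PyTorch sketch of such a feature-preserving block is shown below. The two stacked 3x3 convolutions with BN and ReLU follow the description above; the channel sizes and exact layer ordering are our assumptions, not the configuration published in [18].

```python
import torch.nn as nn

# A minimal sketch of an FDS-style feature block: two stacked 3x3
# convolutions, each followed by batch normalization and ReLU.
# Channel sizes are illustrative assumptions.
class FeatureBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```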
Yuan et al. [25] proposed a multiscale CNN method for OD and optic cup (OC) segmentation. First, the images were cropped to obtain the region of interest (ROI). Then, contrast limited adaptive histogram equalization (CLAHE) (see [26] for the usage of CLAHE on medical images) was applied to enhance the image. The ROI image was also transformed into polar coordinates [27], and the CLAHE and polar coordinate images were concatenated. The concatenated image was then passed to the W-Net architecture. The W-Net is an end-to-end CNN architecture that consists of feature extractor and context extractor modules; it uses these modules to create the output segmentation map (see Fig. 5).
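A minimal sketch of this preprocessing is given below, assuming OpenCV; the clip limit, tile size, and output size are illustrative values, not those used in [25].

```python
import cv2
import numpy as np

# Sketch of the ROI preprocessing described above: CLAHE enhancement plus a
# Cartesian-to-polar transform, with the two results concatenated as network
# input. All parameter values are assumptions.
def preprocess_roi(roi_gray):
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(roi_gray)                       # CLAHE image
    h, w = roi_gray.shape
    polar = cv2.warpPolar(roi_gray, (w, h), (w / 2, h / 2),
                          min(h, w) / 2, cv2.WARP_POLAR_LINEAR)
    return np.stack([enhanced, polar], axis=-1)            # concatenated input
```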
Reference [28] proposed a coarse-to-fine deep learning architecture. This method consists of the input, the segmentation architecture, and the output. The vessels are extracted from the original image, then a vessel density map is generated to highlight the location of the OD. Next, the UNet segments the retinal image and the vessel density map, which produces two different segmentations. An overlapping strategy with the disc patch is used to exclude false segmentations. Finally, the UNet segments candidate regions for the final OD segmentation. Reference [29] segments and classifies three components (vasculature, OD, and fovea) with a simple CNN. The color image was normalized by converting it from RGB to LUV color space and back to RGB. The normalized image was fed into the CNN, which has 6 layers (2 convolution, 2 max-pooling, and 2 fully connected layers). At the end of the CNN procedure, four outputs were produced.
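The following sketch shows a network of the stated shape (2 convolution, 2 max-pooling, and 2 fully connected layers with four outputs); the channel counts, patch size, and class assignment are assumptions for illustration, not the design of [29].

```python
import torch
import torch.nn as nn

# A minimal sketch of a 6-layer network in the spirit of [29]: 2 convolution,
# 2 max-pooling, and 2 fully connected layers, ending in four outputs.
class SimpleFundusCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, n_classes),  # four outputs (assumed class set)
        )

    def forward(self, x):               # x: (N, 3, 32, 32) patches (assumed)
        return self.classifier(self.features(x))
```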
Reference [30] proposed a deep learning enhanced CNN. The first step preprocesses the image with a Gaussian filter and image normalization; the filter removes noise while the normalization balances the color boost region and the illumination of the image. Next, a color-texture morphological approach (a combination of erosion and dilation) was used to capture the global distribution features. Then, an edge histogram texture descriptor (CLAHE and the Sobel edge detector [32]) was applied to analyze and detect structures. The watershed algorithm [31] was applied to extract the contour shape and localize the disc part of the OD. A cropping method was used to cut the image to the required size. Finally, the image was fed to the proposed deep learning CNN algorithm. This algorithm is an end-to-end encoder-decoder method with 39 layers (19 convolution, 4 max-pooling, 4 upsampling, 4 dropout, and 11 merge layers). In both training and testing, the complete RGB image was used for segmentation. Reference [33] proposed a semantic method for the segmentation and classification of the OD. The input images were artificially augmented based on the labeled training data, the augmented images were used to train the network, and the dataset was divided into 70% training and 30% testing. Rotation and horizontal/vertical flipping of the retinal fundus images proved effective and enabled good segmentation results. The network uses an encoder-decoder architecture. The encoder block downsamples to keep the required classes in the image and remove unwanted pixels; it consists of 18 layers (13 convolution and 5 pooling layers). The decoder block upsamples the image to restore the original size; it has 20 layers (14 convolution, 5 pooling, and 1 softmax layer). Finally, the pixels are classified and marked according to the classes and the network. The pixel layer gives probabilities for each class in combination with the loss function. The output of the procedure is a segmentation mask (see Fig. 6 for more details).
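As an illustration of the enhancement steps above (CLAHE, the Sobel edge descriptor, and erosion/dilation), the sketch below uses OpenCV with assumed kernel sizes and CLAHE parameters rather than the values used in [30].

```python
import cv2
import numpy as np

# Sketch of the enhancement steps: CLAHE, a Sobel edge-strength map, and
# simple erosion/dilation for the color-texture morphology. Parameters are
# assumptions.
def enhance(gray):
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    eq = clahe.apply(gray)
    gx = cv2.Sobel(eq, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(eq, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.magnitude(gx, gy)                  # edge-strength map
    kernel = np.ones((5, 5), np.uint8)
    eroded = cv2.erode(eq, kernel)                 # morphological features
    dilated = cv2.dilate(eq, kernel)
    return eq, edges, eroded, dilated
```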
Reference [115] developed a coarse-to-fine segmentation process that uses a UNet to obtain a rough segmentation boundary and cropping to secure the boundary area from a contour-centered image. Next, SU-Net (a fully convolutional network) was combined with the Viterbi algorithm to segment boundaries, and a data augmentation method was introduced to avoid overfitting. The network uses the UNet and the centered contour coordinates for segmentation. The SU-Net consists of three blocks: the encoder, the decoder, and the sequence decoder. The first and second blocks contain the traditional encoder and decoder (segmentation mask) of the UNet with skip connections, convolution, ReLU, upsampling, and downsampling. The sequence decoder consists of two parts: a gateway module and cascaded gate units. The gateway module gets its inputs from each layer of the decoder; these inputs are processed and concatenated before being sent to the gate units. The gateway module consists of 3 upsampling and 3 convolution + sigmoid layers, while the gate unit consists of a ReLU and a softmax layer. To model the interaction of predictions and spatial constraints, the Viterbi algorithm [117, 118] was used to decode the output of the sequence decoder, as sketched below. Reference [119] proposed the attention-based fully connected CNN (AFCNN). The images were resized to 512 x 512 pixels, then morphological operations (opening, closing, and erosion) were performed on the resized images. A cropping procedure was then applied to the image with a bounding box of the mask. The AFCNN has 19 layers consisting of 3 attention blocks, 12 convolution layers, 2 dropout layers, and 1 softmax layer. References [120] and [121] both used CNNs for segmentation of the OD in RF images. In [120], image patches were first extracted, then global contrast normalization was used to brighten the image. Zero-phase component analysis was then applied to the image before it was augmented and fed to the CNN. The CNN is a simple traditional method with convolution and pooling layers arranged sequentially. Images segmented with the CNN were segmented again with fuzzy c-means [122] for the object detection procedure. Meanwhile, in [121] a 9-layer CNN was used to segment the image. This network consists of 3 convolutional layers, 4 ReLU, 2 max-pooling, 1 fully connected layer, and 1 softmax. A comparison of the different methods in this section is available in Tables 1 and 2.
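Since the Viterbi step is central to [115], the following generic sketch shows Viterbi decoding over per-step class scores, with a transition matrix standing in for the spatial constraints; the emission and transition matrices here are illustrative, not those of [115].

```python
import numpy as np

# Generic Viterbi decoding over a sequence of per-step class scores.
def viterbi(emissions, transitions):
    """emissions: (T, S) log-scores; transitions: (S, S) log-probabilities."""
    T, S = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # cand[i, j]: score of reaching state j at step t via state i
        cand = score[:, None] + transitions          # (S, S)
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                    # trace the best path back
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]
```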
Table 1: CNN used for Optic Disk and Optic Cup
| Author | Ref | Year | CNN Name | Inspiration for research | Procedure | Additional comments |
|---|---|---|---|---|---|---|
| Wan et al. | [18] | 2021 | Asymmetric deep learning network | UNet [20], M-Net [19] | Segmentation | Uses the FDS and CCS for encoding while the decoder upsamples |
| Fu et al. | [21] | 2021 | Fusing UNet with probability bubbles | UNet | Segmentation | Preprocessing and a combination of UNet and probability Hough bubbles |
| Meng et al. | [23] | 2018 | RGV-generated CNN model | LeNet-5 [24] | Segmentation | Simple CNN model that works with RGV images |
| Yuan et al. | [25] | 2021 | Multi-scale W-Net | M-Net and UNet | Segmentation | Pyramid W-shaped backbone network for OD and OC |
| Wang et al. | [28] | 2019 | Coarse-to-fine deep learning | UNet | Segmentation | Vessel extraction and vessel density map with the UNet for OD segmentation |
| Tan et al. | [29] | 2017 | Single convolutional neural network | Multiple segmentation | Segmentation | Single CNN to segment vasculature, OD, and fovea |
| Veena et al. | [30] | 2021 | Deep learning enhanced CNN | Encoder-decoder CNN | Segmentation | Combined preprocessing, enhancement, and a deep CNN model for OD segmentation |
| Imtiaz et al. | [33] | 2021 | Label-based encoder-decoder semantic segmentation | Encoder-decoder CNN | Segmentation | Augmentation-based label semantic segmentation for OD |
| Xie et al. | [115] | 2020 | SU-Net and Viterbi algorithm | UNet, dilated CNN [116], Viterbi algorithm [117, 118] | Segmentation | UNet, SU-Net, and the Viterbi algorithm connected to segment vessels in RF images |
| Sadhukhan et al. | [119] | 2020 | AFCNN | FCNN | Segmentation | Attention mechanism combined with an FCNN |
| Priyanka et al. | [120] | 2017 | Patches CNN | CNN, fuzzy c-means | Segmentation | Combined the CNN with fuzzy c-means for segmentation |
| Raja et al. | [121] | 2020 | Traditional CNN | CNN | Segmentation | Used a traditional CNN for segmentation |
Table 2: Pros and cons of CNN used for Optic Disk and Optic Cup
| No. of databases used | Ref | Advantages | Disadvantages | Accuracy | AUC |
|---|---|---|---|---|---|
| 3 | [18] | Can train a model directly end-to-end on source and target; easily converts data from one form to another. | May become too lossy; if not properly configured, can produce improper decoding output; easy to miss important features from the encoder. | 0.937 | - |
| 4 | [21] | Can easily detect noise because of the excellent pixel capture procedure; images with fewer symmetric components are captured faster. | Easy to produce misleading output; symmetric components produce poor results. | 0.99 | 0.99 |
| 4 | [23] | Easy to use and less dependent on computational space; easy to detect and extract important features. | The process is cumbersome; may produce poor accuracy for certain images because the architecture is not deep. | 0.98 | - |
| 3 | [25] | Gives a more flexible and detailed image representation; may produce very good accuracy since it uses a very deep network. | Requires large computational space; difficult to implement since it requires some human interaction. | 0.95 | 0.99 |
| 6 | [28] | Allows use of global location and context at the same time; does not need multiple runs to get acceptable segmentation. | Accuracy may be poor because the network is not deep and the features are not robust. | 0.93 | 0.97 |
| 1 | [29] | Segmentation and classification may be effective since the image was already normalized between 0 and 1; noise is reduced effectively. | Produces unstable procedures during training; highly dependent on data for effectiveness. | 0.96 | 0.95 |
| 1 | [30] | Can execute and correlate features that enable faster training and learning. | Uses several deep networks that require large computation; produces several weight parameters that make training slow. | 0.98 | - |
| 2 | [33] | Prevents data scarcity by adding more training data to the model; resolves class imbalance and increases generalization ability. | Introduces data bias that can lead to suboptimal performance; accuracy may drop on very noisy images due to the lack of a preprocessing method. | 0.86 | 0.99 |
| 3 | [115] | Avoids overfitting through the data augmentation procedure; highly satisfactory bit error rate performance with high-speed operation and ease of implementation. | Accuracy may drop on very noisy images due to the lack of a preprocessing method. | - | 0.97 |
| 6 | [119] | Resized images help decrease computational time; effective identification of the information in the input needed to accomplish the task. | Difficult to parallelize the system, which could be time-consuming. | - | - |
| 1 | [120] | Allows gradual membership and can cluster points measured at pixel degrees; may give good accuracy for overlapping datasets. | Involves many iterations to achieve good accuracy, which can take a lot of time. | 0.95 | - |
| 1 | [121] | Recognition accuracy may be very high due to the added ReLU layers. | Does not encode object orientation, which can cause a vanishing gradient; requires a large amount of training data for a good result. | 0.90 | - |
2.2.2 CNN used for Arteries and Veins
Reference [34] proposed an encoder-decoder CNN model for the segmentation of arteries and veins. A median filter (with a kernel size equal to one-tenth) was applied to correct the illumination of the retinal fundus image. Then the image was passed to the CNN model. The encoding layers encode inputs into smaller vectors while the decoding layers upsample. Each encoder layer has 3 stacked convolutions followed by a max-pooling layer; each decoder layer has an upsampling layer followed by a convolutional layer. In the encoding block, there are 32 feature maps in the first stage, 64 in the second, 128 in the third, 256 in the fourth, and 512 in the fifth; in the decoder, the number of feature maps decreases correspondingly. A final convolution layer reduces the maps from 16 channels to 3 classes (background, arteries, and veins). Like previous studies, Morano et al. [35] proposed a simultaneous segmentation module inspired by the UNet. This research preprocessed the fundus images with local intensity normalization and channel-wise global contrast enhancement [35]. Thereafter, the image was passed to the CNN, which uses a UNet to predict masks for three structures (arteries, veins, and vessels). The binary cross-entropy loss and the manually annotated segmentation masks were combined to produce the final mask. A comparison of the different methods in this section is available in Tables 3 and 4.
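A sketch of this style of median-filter illumination correction is given below; reading the kernel size as one-tenth of the image width is our assumption, as is the background-subtraction formulation.

```python
import numpy as np
from scipy.ndimage import median_filter

# Sketch of median-filter illumination correction in the spirit of [34]'s
# preprocessing: a large median filter estimates the slowly varying
# background, which is then removed and the result rescaled.
def correct_illumination(channel):
    k = max(3, channel.shape[1] // 10)             # ~1/10 of width (assumed)
    background = median_filter(channel.astype(np.float32), size=k)
    corrected = channel.astype(np.float32) - background
    return (corrected - corrected.min()) / (np.ptp(corrected) + 1e-8)
```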
Table 3: CNN used for Arteries and Veins
| Author | Ref | Year | CNN Name | Inspiration for research | Procedure | Additional comments |
|---|---|---|---|---|---|---|
| Girard et al. | [34] | 2019 | Joint segmentation model | UNet | Segmentation | Uses the median filter and encoder-decoder semantic segmentation |
| Morano et al. | [35] | 2021 | Simultaneous segmentation | UNet | Segmentation | Preprocessing and UNet multichannel for segmentation |
Table 4: Pros and cons of CNN used for Arteries and Veins
| No. of databases used | Ref | Advantages | Disadvantages | Accuracy | AUC |
|---|---|---|---|---|---|
| 2 | [34] | Effective in edge preservation; requires fewer images to produce accurate results. | A small noise ratio can break up image edges and produce false noise on the edge, which could affect accuracy. | 0.96 | 0.98 |
| 2 | [35] | Does not need a lot of data to perform optimally. | May produce errors from the normalization process; accuracy may not be excellent since the network is not deep. | 0.96 | 0.97 |
2.2.3 CNN used for Retinal Vessels
Retinal vessel segmentation is a long-standing problem in medical image analysis [148, 149]. Vessel segmentation is characterized by several challenges, including:
- Presence of several abnormalities of varying sizes and shapes: several abnormalities surround the vessels, and these can hinder effective segmentation of the vessels in RF images.
- Limited annotated data: the limited amount of annotated data can result in overfitting, which is a major challenge when segmenting vessels in RF images.
- Vessel structural differences: retinal vessels contain both thick and thin structures, so it is difficult to specify a single model or network that suits all kinds of vessels.
- Unstructured prediction: vessel segmentation differs from plain pixel classification, which makes the structure difficult to predict.
In light of the above challenges, several authors have used CNN methods to segment vessels in RF images. Budak et al. [36] proposed the densely connected and concatenated multi-encoder-decoder CNN (DCCMED-CNN). The DCCMED uses a patch-based learning network and consists of a training and a testing phase. For training, the inputs are color patches extracted from raw retina images without preprocessing. The DCCMED was used as the network for the model, and segmented binary masks were produced as the output of the training phase. The training phase also has weights, which were trained with stochastic gradient descent [37]. The DCCMED consists of three encoder-decoder blocks. The first block has 2 max-pooling layers, 2 max-unpooling layers, and 8 concatenated convolution, batch normalization, and ReLU layers; the second and third blocks have the same configuration as the first. Finally, a softmax layer was used for prediction.
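A minimal sketch of such patch-based sampling is shown below; the patch size and count are illustrative assumptions, not the values used in [36].

```python
import numpy as np

# Sketch of patch-based sampling for training: raw color patches are cut from
# the fundus image without preprocessing, paired with their label patches.
def sample_patches(image, mask, patch=48, n=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    xs = rng.integers(0, w - patch, size=n)
    ys = rng.integers(0, h - patch, size=n)
    imgs = [image[y:y + patch, x:x + patch] for x, y in zip(xs, ys)]
    lbls = [mask[y:y + patch, x:x + patch] for x, y in zip(xs, ys)]
    return np.stack(imgs), np.stack(lbls)
```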
Tang et al. [38] proposed the multi-proportion channel ensemble model (MPC-EM) for the segmentation of retinal vessels. The MPC-EM consists of 5 sub-model networks. The green and red channels were combined in 5 proportions, with the green channel taking the larger share (0.6G + 0.4R, 0.7G + 0.3R, and so on). These channels are preprocessed and passed to the sub-models. Each sub-model has an encoder-center-decoder structure. The encoder (consisting of convolutional, max-pooling, and ReLU layers) converts the image into a feature vector representation. The decoder (consisting of ConvTranspose, convolutional, and ReLU layers) converts the feature vectors into a probabilistic map. A center architecture was used as a transitional region to adjust the shape of the feature vectors. Each sub-model jointly used shallow localization to classify pixels in the images. To optimize the sub-networks, a triple convolutional residual block was used to ease training and avoid vanishing gradients [39].
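The channel-mixing step can be sketched as follows; the weight list follows the pattern quoted above, and the exact proportions per sub-model are assumptions rather than the published values.

```python
import numpy as np

# Sketch of proportional green/red channel mixing: each sub-model receives a
# single-channel image built from a different green/red weighting.
def mix_channels(rgb, g_weight):
    r = rgb[..., 0].astype(np.float32)
    g = rgb[..., 1].astype(np.float32)
    return g_weight * g + (1.0 - g_weight) * r

# e.g. sub_inputs = [mix_channels(image, w) for w in (0.6, 0.7, 0.8, 0.9, 1.0)]
```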
Reference [40] proposed the RCNN-based junction proposal network. This network takes 128x128 image patches as input and outputs the bounding boxes of potential junction locations. The network consists of four parts: (1) a backbone for feature extraction, (2) a region proposal network, (3) a head module for bounding-box regression, and (4) a classification branch for mask generation. For the backbone, ResNet50 was used and a pyramid structure [43] was adopted to consider multiple scales. The region proposal network takes image patches as input and outputs rectangular junction proposals using a fully convolutional network (kernel size 3 and stride 1). The resulting features pass through a 1x1 convolutional layer that produces the region regression and region classification outputs. The junction classification network uses the pixel-wise difference between the segmentation task loss and the classification task loss. The proposed network is a combined multi-task CNN with 27 layers: the output of the ResNet50 is processed for classification, while the output of the convolutional layer is used for segmentation. Overall, the model has 15 convolutional layers, 4 duplicated feature maps, and 2 fully connected layers.
Reference [44] used a shallow UNet to segment retinal fundus images. This method has six stages: (1) image registration, (2) vessel probability map generation, (3) postprocessing, (4) second segmentation, (5) region of interest selection, and (6) vessel diameter measurement. The image registration stage aligns the fixed fundus image domain to the moving image domain [45]. The vessel probability stage uses an encoder-decoder CNN (the shallow UNet) to extract vessels from fundus images. The shallow UNet consists of 11 convolutional layers, 5 dropout layers, 2 max-pooling layers, 2 upsampling layers, and 1 softmax layer. In the postprocessing stage, outputs of the shallow UNet were thresholded, binarized, and cropped to a region of interest determined by the optic disc size. Then a manual delineation of veins and arteries (marked in red or blue) was passed to the shallow UNet again, this time to segment the optic disc; the shallow UNet was thus assigned to segment both the vessels and the optic disc (at different stages). Finally, the retinal diameters were measured using a method similar to the central retinal venular equivalent (CRVE) proposed by [46].
Reference [47] proposed the multi-path cascaded UNet (MCU-Net) for segmentation and classification. The MCU-Net takes three inputs (raw FFA, small-scale FFA, and large-scale FFA) and fuses vessel features from these inputs to generate a vascular probability map as output. The MCU-Net contains an attention gate [48] and a residual recurrent unit [49]. The processed inputs are cascaded through the UNet architecture to produce the final output. The MCU-Net has two blocks: (1) the refinement block and (2) the FFA image fusion block. The FFA image fusion block supports three fusion strategies (early, late, and intermediate fusion) that fuse the inputs for further processing. The refinement block accepts the output of the fusion process and produces the final mask (for more on this method, see Fig. 7).
Reference [50] proposed the nested U-shaped attention network (NUA-Net) for the segmentation and classification of retinal images. The images were first transferred from RGB color space to LAB space, CLAHE was applied, and the images were then transferred back to RGB color space. Similar to the experiment conducted in [51], the green channel images were used as the network inputs. The NUA-Net extracts patches as inputs and predicts pixel-wise soft segmentation. This network consists of an encoding stage and 4 decoding stages. The resolution of the feature maps is halved as the scale increases. A 3x3 convolution layer was used to extract shallow features, followed by downsampling of the feature vectors. Each encoding stage has a 2x2 max-pooling followed by convolution with batch normalization, ReLU, and dropout. A simple bottom-up approach derives features at larger scales. The multiscale upsampling attention (MSUA) module was developed to harness the mutual relationships among blocks, and a joint loss was adopted to supervise each decoding stage.
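A sketch of this color-space preprocessing is shown below, with assumed CLAHE parameters; applying CLAHE to the lightness channel is a common reading of "CLAHE in LAB space".

```python
import cv2

# Sketch of the preprocessing described above: RGB -> LAB, CLAHE on the
# lightness channel, then back to RGB. CLAHE parameters are assumptions.
def clahe_lab(rgb):
    lab = cv2.cvtColor(rgb, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(enhanced, cv2.COLOR_LAB2RGB)
```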
Guo et al. [52] proposed the multiscale deeply supervised network with short connections (BTS-DSN). This network uses short connections to transfer semantic information between side-output layers; both bottom-top and top-bottom short connections were considered. The RF images pass through 4 combined pairs of convolution + ReLU + max-pooling. The short connections then carry the signal to the upsampling layers, which are also divided into 4 pairs. Upsampling further moves the connection to a sigmoid that translates the image into a mask, and the four masked images are fused to form a single image mask. The key element of this network is the pair of top-bottom and bottom-top short connection approaches; switching connectivity within layers gives the BTS-DSN a flexible procedure. Reference [53] proposed the multiple deep convolutional neural network (MDCNN) for a combined classification and segmentation task. The MDCNN was constructed by cascading multiple networks with the same structure. Training used an incremental learning strategy, which improves network performance and was introduced to overcome the poor performance of earlier CNNs. The final output was determined by majority voting over the MDCNN results.
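The majority-voting step can be sketched as follows, with `masks` holding one binary prediction per cascaded network; this is a generic formulation, not necessarily the exact rule of [53].

```python
import numpy as np

# Sketch of majority voting over the binary outputs of several cascaded
# networks: a pixel is labeled vessel if more than half the models agree.
def majority_vote(masks):
    masks = np.stack(masks).astype(np.uint8)       # (n_models, H, W)
    votes = masks.sum(axis=0)
    return (votes > masks.shape[0] / 2).astype(np.uint8)
```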
Noh et al. [55] proposed a scale-space approximation for multi-scale representation in CNNs (SSANet). A one-dimensional signal was used as an example for analyzing convolution and upsampling in the frequency domain. The SSANet consists of 3 blocks (feature generation, feature aggregation, and inference) with 33 layers. The feature generation block has 21 layers consisting of 1 convolution layer, 3 upsampling layers, and 17 ResBlock layers; it takes the input and extracts features from it. At every upsampling interval, there is a connection to the feature aggregation stage. The feature aggregation block is the intermediary stage that connects the outputs of the generation stage to the inference stage. The aggregation stage performs two key procedures: it takes inputs before each upsampling in the generation block, and it accepts inputs from the final block of the generation stage, using 9 layers (5 convolutional and 4 upsampling layers) for this procedure. The aggregation block concatenates the upsampled features and sends them to the inference block, which transforms these inputs into a mask using 3 layers (2 convolution and 1 sigmoid layer).
Reference [57] combined size-invariant feature maps [58] with dense connectivity [59] (SID2Net) for the segmentation and classification of RF images. The size-invariant feature maps reduce the loss of small blood vessels, while the dense connectivity reduces the computational cost. The SID2Net takes the green channel as input. Two bottleneck modules and three dense blocks were used to extract features, which are finally merged by two convolutional layers and a sigmoid layer for prediction. The two bottlenecks (bottlenecks 1 and 2) have 36 and 48 output feature maps, respectively, and the network has 3 dense connectivity blocks. To generate probability maps, the output feature maps of the third dense connectivity block are integrated. An ablation experiment was carried out by dividing the network into a dense network (DNet) and a DNet with size-invariant feature maps (SIDNet). Reference [60] used multi-instance heatmap regression to predict RF image segmentation. This method predicts binary maps whose pixels correspond to the location and label of the positive class in the ground truth. The RF images are passed to the UNet framework, which extracts features to create multi-instance heatmaps and their local maxima. Finally, the results were interpolated back onto the original RF images. The UNet architecture used in this research has 19 layers consisting of 1 input, 9 convolution + ReLU, 4 transpose convolution, and 1 output convolution layer.
Reference [61] used the vessel-specific skip chain convolutional network (VSSC Net) for blood vessel segmentation. The VSSC Net involves two stages: preprocessing and segmentation. The preprocessing stage converts the RF image to grayscale, then applies the adaptive fractional difference approach [62] to the grayscale image to form the first plane of interest. CLAHE is applied to the grayscale image to form the second plane of interest, and a Gaussian filter is applied to the CLAHE image to form the third plane of interest. The intensity of the images is reduced by a factor of 2 before they are concatenated to give the final preprocessed image. The segmentation stage (VSSC Net) is an end-to-end framework that takes input images of arbitrary size and produces a probability map. The VSSC Net has two components: a base network architecture and a novel architecture. The base network consists of convolutional layers split into 4 pairs; VGG-16 [63] was used as the base network. The proposed novel network has two blocks placed on top of the base network: the first block (VE_1) consists of 4 vessel-specific convolution (VSC) and 4 skip chain convolution (SC) layers, while the second block (VE_2) consists of 3 VSC and 3 SC layers. A skip connection links VE_1 to VE_2.
The attention-based before-activation residual UNet (BSEResU-Net) proposed by [64] was inspired by a modified UNet architecture. BSEResU-Net exploits the attention mechanism and the DropBlock regularization method to reduce overfitting. The images were preprocessed by transforming the RGB images to grayscale and normalizing them; the CLAHE algorithm was then applied, followed by a gamma adjustment. The preprocessed image was fed into the network. The BSEResU-Net consists of two parts: the BSE residual block and the ResU-Net. The BSE residual block consists of a residual layer, a pooling layer, ReLU, sigmoid, and 2 convolutional layers. The ResU-Net has 33 convolutional layers with 16 residual operations, 2 transpose convolutional layers, 2 downsampling layers, and 1 output map. Reference [65] proposed the multi-path scale network (MPS-Net) for retinal vessel segmentation. The MPS-Net is an end-to-end network that takes one high-resolution RF input and produces a probability map, with two lower resolutions, as output. The image was first converted to grayscale, then passed to the MPS-Net. This network has 16 layers and three branches: the first branch has 8 H layers, while the second and third have 4 M and 3 L layers, respectively. The network has 13 multi-path scale modules, 3 convolution + ReLU, 3 normalization + ReLU, and 1 cropping layer. The multi-path scale module has 3 regional paths concatenated together and arranged horizontally to produce the output. The range entropy [67] definition was introduced to describe the vessel information of the feature maps.
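A sketch of the BSEResU-Net preprocessing chain described above is given below; the normalization range, CLAHE parameters, and gamma value are assumptions.

```python
import cv2
import numpy as np

# Sketch of the preprocessing chain: grayscale conversion, normalization,
# CLAHE, then gamma adjustment via a lookup table.
def preprocess(rgb, gamma=1.2):
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    norm = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(norm.astype(np.uint8))
    table = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype(np.uint8)
    return cv2.LUT(enhanced, table)                # gamma adjustment
```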
Reference [68] proposed a multi-path CNN for RF segmentation. This network converts the original image into low-frequency and high-frequency images with a low-pass Gaussian filter and a high-pass Gaussian filter. The low-frequency image is sent to a CNN consisting of convolution downsampling and convolution upsampling; this CNN has 32 convolutional layers in four blocks of 64, 128, 64, and 32. The downsampling part performs max-pooling while the upsampling employs bilinear interpolation. Meanwhile, the high-frequency image is sent to another CNN with encoding and decoding regions, consisting of max-pooling + convolution, max-pooling + convolution + upsampling, and upsampling + convolution. Finally, the outputs of the first and second CNNs are concatenated (fused) to produce the final segmentation. To find out whether there is a difference between preprocessed images and images without preprocessing, reference [69] used the Sine-Net for segmentation of vessels in RF images. The authors segmented the images both with and without the preprocessing stage. For preprocessing, CLAHE and the multi-scale top-hat transform (MTHT) [70] were used to enhance image contrast. The Sine-Net architecture consists of 17 layers comprising 11 convolution operations, 2 upsampling layers, 2 downsampling layers, and 1 input and 1 output layer. The upsampling and downsampling layers are sandwiched between the convolution operations. Results indicate that the preprocessed images performed slightly better than the images without preprocessing.
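The frequency decomposition can be sketched with a Gaussian low-pass filter, the high-frequency image being the residual; the sigma value is an assumption, not the one used in [68].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Sketch of the frequency decomposition: a low-pass Gaussian gives the
# low-frequency image, and subtracting it leaves the high-frequency detail
# (vessel edges and fine structure).
def split_frequencies(gray, sigma=5.0):
    low = gaussian_filter(gray.astype(np.float32), sigma=sigma)
    high = gray.astype(np.float32) - low
    return low, high
```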
The usage of reinforcement learning on RF images is gaining prominence. Reference [71] used a CNN with reinforcement learning to segment vessels in RF images. The images are divided into smaller patches and sent to the CNN for training. The CNN has five components: convolution, pooling, dropout, fully connected, and loss function layers. The dropout increases the generalization ability of the network, while the fully connected layer acts as the classifier that connects the CNN to the reinforcement method. The reinforcement sample learning component reinforces the samples with poor performance during training. Overall, the network has 2 convolution layers, 2 max-pooling layers, and 1 layer each for dropout, fully connected, and loss function. Deep CNNs [73] have received tremendous recognition in medical image processing. As an example, Wu et al. [72] proposed the network followed network (NFN+). The images were preprocessed with the CLAHE algorithm and divided into patches. The enhanced patched images were fed into the network for training. The NFN+ consists of four modules: (1) the encoder and decoder of the front network, (2) the encoder and decoder of the followed network, (3) the front group of intra-network skip connections, and (4) the second group of inter-network skip connections. The intra-network skip connections connect the first and second modules, while the inter-network skip connections bridge the second and third modules. Information gathered at each skip module is incorporated sequentially into the next module. Overall, the NFN+ has two connected networks (front and followed) with 10 combined parts of convolution, batch normalization, and dropout; at every interval, the network is concatenated.
Fully convolutional networks (FCNs) first gained relevance in tasks outside medical imaging, but such tides are changing: reference [74] used an FCN for segmenting retinal vessels in RF images. They used the method adopted by [75] to pad the region of interest and avoid excessive contrast enhancement at the border of the image. A Gaussian filter, gamma correction, and CLAHE were applied to the image. The preprocessed image was passed to a UNet architecture for segmentation, and the output of the network was subtracted from the original image. Reference [76] proposed the RV-Net for vessel segmentation. This method preprocesses the RF images by replacing the black area with an average color (see [78]), then converting the image to LAB; the CLAHE algorithm is applied, and the channels are merged and converted back to RGB. The preprocessed image is augmented by image transformation, cropping, and patch extraction [79], and the images are then fed into the RV-Net for segmentation. The RV-Net is a U-shaped network that consists of downsampling and upsampling frameworks. The downsampling path has 6 blocks consisting of convolution + ReLU, LCM, and max-pooling, with a max-pooling at the end of each block. Meanwhile, the upsampling path has 6 blocks consisting of upsampling, convolution + softmax, and LCM. Reference [80] proposed a hybrid of a CNN and ensemble random forests (RFs) [81]: the CNN was used for segmentation while the random forest was the trainable traditional method used for classification. The CNN has 5 layers consisting of convolution, subsampling, and fully connected layers; the subsampling layer is a local averaging method that reduces the spatial resolution of the feature maps.
Since the inception of the CNN, several versions and modifications have been proposed in the literature. Hu et al. [82] proposed a multiscale CNN with an improved cross-entropy loss function. The original RF image was augmented and then fed into the network. The multiscale CNN has 4 stages: the first and second stages consist of 4 convolution layers and one max-pooling layer each, with two convolution layers concatenated and the max-pooling transferring to the next stage; the third and fourth stages consist of 6 convolutional layers with a single max-pooling layer. In total, the network has 20 convolution and 3 max-pooling layers. Finally, each map in every stage is upsampled to the original size to either connect to the corresponding side-output or fuse into the feature map. The improved cross-entropy loss function accounts for sample balance and inclines the learning process toward segmenting the vascular parts (see Fig. 8).
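A common formulation of such a sample-balanced loss is sketched below; this inverse-frequency weighting is illustrative and not necessarily the exact loss used in [82].

```python
import torch

# Sketch of a class-balanced binary cross-entropy: vessel pixels are rare, so
# positives are up-weighted by the background-to-vessel pixel ratio.
def balanced_bce(logits, target):
    pos = target.sum()
    neg = target.numel() - pos
    pos_weight = neg / pos.clamp(min=1)            # inverse-frequency weight
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, target.float(), pos_weight=pos_weight)
```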
Reference [84] proposed the symmetric equilibrium generative adversarial network (SEGAN) for vessel segmentation. The SEGAN is an end-to-end network that utilizes the adversarial principle. Three building blocks are used in this research: the SEGAN itself, the multiscale feature refinement block (MSFRB), and the attention mechanism (AM) [86]. The MSFRB is used to extract shallow-layer features, which are high in resolution but low in semantics, and the AM allocates weights to the channels in the MSFRB; both are part of the SEGAN framework. The SEGAN is a combined U-shaped network with a generator (G) and a discriminator (D): D distinguishes real details, while G synthesizes details and enhances recognition. The network consists of 20 layers across two end-to-end networks. In the G network, the MSFRB and AM outputs are concatenated and passed on to the next layer. Overall, there are 13 traditional layers, 5 MSFRB, and 5 AM layers, with downsampling, upsampling, and skip layers used in the network. An ablation experiment using the UNet was also conducted.
Multitask segmentation is becoming popular in deep learning architectures; it creates a procedure to segment images across tasks at different positions. The research in [87] proposed a hybrid multitask deep learning method for segmenting vessels. The original image was annotated before being fed into the deep learning algorithm. The network has two modules: (1) a multitask segmentation module and (2) a fusion network module; for both, an improved UNet framework was adopted. The network is an encoder-decoder segmentation network of 20 layers: 11 convolution + batch normalization + ReLU, 4 max-pooling, 4 upsampling, 1 sigmoid, plus the input and output layers. The output of the segmentation module is passed to the fusion module for the final output.
The deformable U-Net (DUNet) proposed by Jin et al. [88] is a U-shaped architecture with an encoder-decoder framework in which some of the convolutional layers of the traditional UNet are replaced with deformable convolutional blocks. The DUNet integrates low-level features with high-level features, and the receptive fields are trained. The design is constructed with 4 convolution layers, 4 batch normalizations, and 4 ReLU layers; in addition, the model contains 4 convoffset layers, a global average pooling layer, 1 dense layer, and 1 softmax. Three deformable convolutional blocks are mapped in the middle of the network; these blocks have 12 layers, with the final output generated by the softmax. Reference [90] proposed a strided FCNN for the segmentation of vessels in RF images. The images were preprocessed with morphological tactics and principal component analysis (PCA) [91]: the morphological tactics remove the uneven illumination to achieve uniform contrast, and PCA transforms the image to grayscale. The network has 5 consecutive fully convolutional blocks with 16, 32, 64, 128, and 256 feature maps. Apart from the first two, all blocks in the encoder have three convolutional layers. The encoder has 18 layers with 13 convolution + leaky ReLU (LReLU) [93] and 5 strided convolutional layers; the decoder has 20 layers with 10 convolution + LReLU, 5 upsampling + ReLU, 4 concatenation + convolution + LReLU, and 1 convolution + sigmoid layer. No ablation experiment was reported in this research. Reference [94] proposed an improved end-to-end CNN for vessel segmentation. This network uses a multi-encoder-decoder principle with a new progressive reduction module integrated into the network. The network has 4 interconnected components (parallel multi-encoders, an RGB encoder and a green-channel encoder, a decoder component, and progressive reduction components). The RGB encoder consists of six levels (convolution, spatial dropout, batch normalization, and max-pooling layers), while the decoder consists of five levels (deconvolution, concatenation, and batch normalization). The last module consists of convolution, batch normalization, and concatenation layers. Data augmentation was performed to generate more data for the network.
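A sketch of a deformable convolutional block of this kind is shown below, using torchvision's DeformConv2d; the offset-predicting convolution plays the role of the convoffset layer, and the channel sizes are assumptions rather than the DUNet configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

# Sketch of a deformable convolution block: a small "convoffset" layer
# predicts per-location sampling offsets (2 values, x and y, per kernel
# position) that the deformable convolution then consumes.
class DeformBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.deform(x, self.offset(x))))
```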
Reference [95] proposed the contextual information enhanced UNet (CIEU-Net) with a dilated convolution module for vessel segmentation. The cascaded dilated module and the pyramid module are integrated to form the segmentation network: the proposed network is a UNet modified with a cascaded residual dilated module and a pyramid module. There are 13 blocks with 47 layers used as the baseline, plus 5 residual blocks, 2 convolutional layers, and 3 dilated convolutions. Reference [96] proposed a scale- and context-sensitive network (SCS-Net) for the segmentation of vessels. The model consists of three modules: scale-aware feature aggregation (SFA), adaptive feature fusion (AFF), and multi-level semantic supervision (MSS). The SFA adjusts the receptive field dynamically to extract features, the AFF guides the fusion between features efficiently, and the MSS learns distinctive semantic representations. The SFA consists of multiscale feature extraction (MFE) and dynamic feature selection (DFE); it has 6 convolutions, 2 convolution + ReLU layers, and a softmax. The AFF module uses the squeeze-and-excitation operation to model the correlation among feature channels. Finally, the MSS fuses the channel masks to produce the final prediction. The SFA, AFF, and MSS are arranged on the SCS-Net, which consists of 17 layers overall. An ablation experiment was carried out, and the data was augmented before training. Reference [97] proposed the enhanced encoder atrous UNet (EEA-UNet) for retinal vessel segmentation. The images were preprocessed with CLAHE and resized to 512 x 512, and post-processing with morphological operations removed isolated false positives. The EEA-UNet is an asymmetric contraction and expansion path that replaces the standard convolutions with atrous convolutions to increase the receptive field. The contracting part has 5 blocks containing 2 atrous convolutions, batch normalization, pooling, and ReLU layers. The atrous convolution reduces the image size without losing its significant features [98]. Overall, the EEA-UNet consists of dilated convolution + batch normalization + ReLU, max-pooling, depth concatenation, and transpose convolution layers.
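For illustration, an atrous convolution in PyTorch: with padding equal to the dilation, a 3x3 kernel covers a 5x5 (dilation 2) or 7x7 (dilation 3) region while the output keeps the input's spatial size. The channel counts are arbitrary.

```python
import torch
import torch.nn as nn

# Sketch of an atrous (dilated) convolution of the kind EEA-UNet substitutes
# for standard convolutions: a larger receptive field at no extra parameters.
atrous = nn.Conv2d(in_channels=64, out_channels=64,
                   kernel_size=3, dilation=2, padding=2)
out = atrous(torch.randn(1, 64, 128, 128))   # out.shape == (1, 64, 128, 128)
```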
Reference [99] proposed a U-shaped deep learning fusion network (DF-Net) for vessel image segmentation. The method involves 4 stages: multiscale fusion, the U-shaped network, feature fusion, and classifier fusion. The original image was multiscaled with an image pyramid [100], and the multiscale input was integrated into the encoder path for information fusion. The U-shaped network collects the pyramid images and processes them for transmission to the next block. The network is an encoder-decoder network in which the encoder consists of 2 convolution layers and max-pooling with ReLU; the downsampled feature map is concatenated and the number of feature maps doubled for proper learning. Similarly, the decoder has 2 convolutions followed by upsampling, with the number of feature maps halved. The vessel fusion module is attached to the decoder and enhanced with the corresponding output features. The network is combined with the Frangi filter, and a deep neural network is trained. Finally, the vessel fusion module integrates the masked images to produce the final segmentation. Data augmentation was conducted in this research.
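The multiscale pyramid input can be sketched as follows; the number of levels is an assumption, not the value used in [99].

```python
import cv2

# Sketch of a Gaussian image pyramid of the kind DF-Net feeds into its
# encoder path: each level halves the resolution of the previous one.
def gaussian_pyramid(image, levels=3):
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))   # blur + 2x downsample
    return pyramid
```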
Recently, multi-scale and multitask methods have been used in CNNs. The research by Tang et al. [101] adopts this multiscale approach: the authors proposed multi-scale channel importance sorting (MSCS) for vessel segmentation. First, the CLAHE algorithm was used to enhance the image before it was fed to the network. The MSCS is an encoder-decoder that consists of 3 encoder and 2 decoder blocks. Each encoder block consists of a multi-scale module, a channel importance module, and a convolution layer, with a max-pooling operation at the end. The multi-scale module optimizes the local topology, while the channel importance module compresses and regularizes the network to prevent overfitting. Between the encoder and decoder, a spatial attention mechanism was used instead of the traditional skip connection to readjust the output and characterize the encoder, generating the attention coefficients. The research in [102] proposed the cascaded attention residual network (AReN-UNet), which integrates attention and residual modules. The encoder and decoder of the network are connected to produce the final output. The upper block consists of 16 channels and uses less computational memory than the lower part of the network. Downsampling layers are used in the encoder while upsampling layers are used in the decoder. The aggregated residual module consists of concatenated max and average pooling and a shared MLP [105] with a sigmoid layer, while the spatial attention block concatenates the max and average pooling with a convolution layer and sigmoid. A skip connection pairs the encoder with the decoder. Reference [103] proposed the multiscale dense network (MD-Net), which makes good use of multi-scale features and the encoder features. The images were preprocessed with the CLAHE algorithm, and data augmentation and patch segmentation were applied to the CLAHE images to avoid overfitting. A residual atrous spatial pyramid pooling (Res-ASPP) block was blended into the encoder framework, and dense multi-level fusion merges the features of the encoder and decoder. A squeeze-and-excitation (SE) block is applied to the concatenated layer for effective feature channels. The Res-ASPP has 12 layers, all convolution layers with varying dimensions and sizes. The multi-level fusion mechanism and the SE block perform the fusion procedure in the network. Overall, the MD-Net has 3 Res-ASPP layers sandwiched in the encoder framework and 3 SE blocks in the decoder framework, with skip connections linking the encoder to the decoder. Reference [106] used a combination of edge detection and a neural network to segment vessels in RF images. The image was preprocessed by an iterative algorithm that removes the strong contrast between the fundus of the retina and the outer region (see [107]); then median and Roberts filters were used to remove noise. The method uses feature vectors of eight characteristics per pixel, including (1) image gradients obtained with edge detection (Prewitt, Sobel, Canny, and Gaussian [108, 109]), (2) the Laplacian of Gaussian filter, and (3) morphological transformations (erosion, dilation, and top-hat filtering [110]). A cascaded feed-forward network was used for segmentation; it has 1 input and 1 output layer and 4 hidden layers, the hidden layers having different numbers of neurons with a hyperbolic tangent sigmoid as the transfer function.
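A sketch of such per-pixel feature construction is given below; the chosen operators and kernel sizes are assumptions covering only part of the eight features listed above for [106].

```python
import cv2
import numpy as np

# Sketch of per-pixel feature vectors for a classical feed-forward network:
# gradient magnitude, Laplacian of Gaussian, and a morphological top-hat
# response, stacked so that each image pixel yields one feature row.
def pixel_features(gray):
    g = gray.astype(np.float32)
    gx = cv2.Sobel(g, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(g, cv2.CV_32F, 0, 1)
    grad = cv2.magnitude(gx, gy)                       # edge gradient
    log = cv2.Laplacian(cv2.GaussianBlur(g, (5, 5), 1.0), cv2.CV_32F)
    kernel = np.ones((5, 5), np.uint8)
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)
    feats = np.stack([grad, log, tophat.astype(np.float32)], axis=-1)
    return feats.reshape(-1, feats.shape[-1])          # one row per pixel
```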
Reference [111] proposed a simplified UNet for the segmentation of RF images. A combination of residual blocks and batch normalization in the upsampling and downsampling layers produces the required segmentation results. Different patches are extracted from the original images as inputs and trained with a novel loss function to generate a probability for each input pixel; the probability map is then binarized with a thresholding algorithm to generate the vessel segmentation, as sketched below. The simplified UNet has 10 blocks consisting of 1 CONV_ReLU1 layer, 1 convolution layer, 3 Block2 layers, 2 Block I1 layers, and 3 Block I2 layers. Skip connections link Block I1 and Block I2 together. The Block2 layer consists of a transpose convolution, a concatenation layer, and 2 CONV + ReLU layers, while Blocks I1 and I2 consist of 1 CONV + ReLU and 4 batch normalization + ReLU + convolution layers. Reference [113] combined an attention-based neural network with transfer learning, using an optimized learning method to classify and grade RF images. The attention mechanism adds attention over pixels near the vessels. A Gaussian filter was used to normalize the color balance and illumination, and the data was then augmented. Finally, the attention network learns while a traditional CNN performs feature extraction; a fully connected layer and softmax were used for classification. The network comprises a pretrained Inception V3, a combination of batch normalization and dropout, an attention layer (with 4 convolutions), and a classification layer (convolution, pooling, 2 fully connected layers, and softmax). The softmax graded the health risk, with 0 as bad and 2 as good. Reference [114] used morphological processing, thresholding, edge detection, and adaptive histogram equalization for segmentation, with a CNN used for classification. After preprocessing and segmentation, the image was fed into the trained CNN for classification (either normal or diseased). The trained CNN has individual neurons tiled so that they respond to overlapping regions. The network has 3 convolutional layers and 1 layer each of flatten, fully connected, and softmax. The images are classified into 4 classes: normal, mild, moderate, and severe. Reference [123] used a CNN-RNN to segment retinal images. The image was first preprocessed with a median filter for denoising and smoothing, then resized and decomposed into sub-bands with the dual-tree complex wavelet transform; these sub-bands were fed as input into the network. The classification is done with recurrent networks (a CNN combined with the recurrent neural network concept [124]): the RNN captures information from sequence and time-series data, and the CNN concept was incorporated by adding recurrent connections to each convolutional layer. A comparison of the different methods in this section is available in Tables 5 and 6.
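The binarization step mentioned for [111] can be sketched as follows; the fixed 0.5 threshold is an assumption, since [111] uses its own thresholding algorithm.

```python
import numpy as np

# Sketch of the final binarization step: the network's probability map is
# thresholded to produce the binary vessel mask.
def binarize(prob_map, threshold=0.5):
    return (prob_map >= threshold).astype(np.uint8)
```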
Table 5: CNN used for Retinal vessel
Author
|
Ref
|
Author | Ref | Year | CNN Name | Inspiration for research | Procedure | Additional comments
---|---|---|---|---|---|---
Budak et al. | [36] | 2020 | Densely connected/concatenated multi encoder-decoder CNN | Feedforward CNN | Segmentation | Three encoder-decoder blocks with a final softmax layer.
Tang et al. | [38] | 2019 | Multi-proportion channel ensemble model | Ensemble model | Segmentation | Sub-channel, sub-model CNN segmentation.
Zhao et al. | [40] | 2020 | RCNN-based junction refinement network | Masked RCNN model [41] | Segmentation | Multi-task combined RCNN method for segmentation and classification.
Yuan et al. | [44] | 2021 | Shallow U-Net | UNet | Segmentation | Image registration combined with a shallow UNet for segmentation.
Sun et al. | [47] | 2021 | Multi-path cascaded UNet | UNet | Segmentation | Refinement and FFA image fusion block for segmentation of arteries.
Zhao et al. | [50] | 2021 | Nested U-shaped attention network | UNet | Segmentation | Nested U-shaped network to segment and classify RF images.
Guo et al. | [52] | 2019 | Bottom-top and top-bottom short-connection deep supervised network | Deep supervised network | Segmentation | The deeply supervised network embeds the upsampling and max-pooling with the weight fusion.
Guo et al. | [53] | 2018 | Multiple deep CNN | Deep CNN [54] | Segmentation | Multiple DCNNs cascaded for segmentation.
Noh et al. | [55] | 2019 | Scale-space approximated CNN | DRIU [56] | Segmentation | Generation, aggregation, and inference blocks for segmentation.
Zhuo et al. | [57] | 2020 | Size-invariant and dense connectivity network | DenseNet [59] | Segmentation | Synchronization of dense connectivity and size-invariant feature maps.
Hervella et al. | [60] | 2020 | Multi-instance heat map regression | DNN | Segmentation | Combination of UNet and instance heat maps for detection.
P. M. Samuel & T. Veeramalai | [61] | 2021 | Vessel-specific skip-chain CNN | Fully convolutional networks | Segmentation | Preprocessing and the VSSC Net segmentation architecture.
D. Li & S. Rahardja | [64] | 2021 | Attention-based before-activation residual U-Net | Modified UNet | Segmentation | Preprocessing, BSE residual layer, and ResU-Net residual layer.
Lin et al. | [65] | 2021 | Multi-path scale network | HR-Net [66] | Segmentation | Multi-path scale module combined with several other modules.
Tian et al. | [68] | 2020 | Multi-path CNN | UNet | Segmentation | Two CNN frameworks for low- and high-frequency images.
I. Atli & O. S. Gedik | [69] | 2021 | Sine-Net CNN | Fully CNN | Segmentation | Uses 17 layers for segmentation of RF images.
Guo et al. | [71] | 2018 | CNN with reinforcement sample learning | Reinforcement learning | Segmentation | CNN trained with a reinforcement sample learning strategy for segmentation of RF images.
Wu et al. | [72] | 2020 | Network followed network | Deep CNN [73] | Segmentation | Front and followed networks sandwiched with components.
Hemelings et al. | [74] | 2019 | Fully convolutional network | UNet | Segmentation | Preprocessing with a UNet framework for segmentation.
Boudegga et al. | [76] | 2021 | RV-Net | UNet, AlexNet [77], VGG | Segmentation | Preprocessing and RV-Net for vessel segmentation.
Wang et al. | [80] | 2015 | Features and ensemble learning | CNN and random forests (RFs) | Segmentation and classification | CNN and RFs for segmentation and classification of vessels.
Hu et al. | [82] | 2018 | Multiscale CNN | Richer convolutional features [83] | Segmentation | Multiscale CNN with an improved cross-entropy loss function.
Zhou et al. | [84] | 2021 | Equilibrium GAN | UNet, GAN [85] | Segmentation | Uses the MSFRB and AM for vessel segmentation.
Yang et al. | [87] | 2021 | Improved UNet | UNet | Segmentation | Multitask and fusion blocks for vessel segmentation.
Jin et al. | [88] | 2019 | DUNet | UNet, deformable ConvNet [89] | Segmentation | Replaces the convolution block in the UNet network with the ConvOffset block for vessel segmentation.
Soomro et al. | [90] | 2019 | Strided FCNN | SegNet [92] | Segmentation | Morphological tactics and PCA with the encoder-decoder method for vessel segmentation.
Chala et al. | [94] | 2021 | Improved deep CNN | DCNN | Segmentation | Improved DCNN architecture for vessel segmentation.
Sun et al. | [95] | 2021 | CIEU-Net | UNet | Segmentation | UNet with an integrated residual dilated module and a pyramid module.
Wu et al. | [96] | 2021 | SCS-Net | UNet | Segmentation | Combination of SFA, AFF, and MSS modules for vessel segmentation.
Sathananthavathi & Indumathi | [97] | 2021 | EEA UNet | UNet | Segmentation | Modified UNet architecture with atrous convolution for segmentation of vessels.
Yin et al. | [99] | 2021 | DF-Net | UNet | Segmentation | Pyramid U-shaped fusion network for vessel segmentation.
Tang et al. | [101] | 2020 | MSCS | UNet | Segmentation | Multi-scale, channel importance, and spatial attention for segmentation of vessels.
Rahman et al. | [102] | 2021 | Cascaded AReN-UNet | UNet | Segmentation | Concatenated attention and residual modules for vessel segmentation.
Shi et al. | [103] | 2021 | MD-Net | SegNet, PSPNet [104], UNet | Segmentation | Res-ASPP and SE blocks sandwiched to segment RF vessels.
Tchinda et al. | [106] | 2021 | Classical edge detection and neural network | Artificial neural network | Segmentation | Feature vectors combined with a cascaded feed-forward network.
Gegundez-Arias et al. | [111] | 2021 | Simplified UNet | UNet | Segmentation | Combination of residual connections and batch normalization for vessel segmentation.
Maji & Sekh | [113] | 2020 | Traditional method with CNN | CNN | Classification | Combines an attention-based network with transfer learning for vessel classification.
Sangeethaa & Maheswari | [114] | 2018 | Trained CNN | CNN | Segmentation and classification | Thresholding, morphological operations, edge detection, and adaptive histogram equalization for segmentation, then a CNN for classification.
Muthusamy & Tholkapiyan | [123] | 2019 | CNN-RNN | CNN | Segmentation and classification | Feature extraction and CNN-RNN classification.
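As the table shows, the large majority of these methods are UNet-derived encoder-decoders: stacked Conv-BN-ReLU blocks, max-pooling on the encoder path, upsampling on the decoder path, skip connections between the two, and a 1x1 convolutional head. The sketch below, which assumes PyTorch and uses purely illustrative names and layer widths (it is not the implementation of any cited method), shows this shared skeleton in minimal form.

```python
# Minimal sketch (PyTorch assumed) of the UNet-style encoder-decoder pattern
# shared by many entries in the table above. All names and layer widths are
# illustrative; this is not the implementation of any cited method.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two stacked 3x3 Conv-BN-ReLU layers: the basic building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=1):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)          # 64 skip + 64 upsampled channels
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)           # 32 skip + 32 upsampled channels
        self.head = nn.Conv2d(32, n_classes, 1)  # 1x1 conv; sigmoid/softmax applied at inference

    def forward(self, x):
        e1 = self.enc1(x)                        # full resolution
        e2 = self.enc2(self.pool(e1))            # 1/2 resolution
        b = self.bottleneck(self.pool(e2))       # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))  # decoder with skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                     # per-pixel vessel logits

# e.g. TinyUNet()(torch.randn(1, 3, 256, 256)) -> logits of shape (1, 1, 256, 256)
```

The published variants in the table differ mainly in what they attach to this skeleton: attention modules, dense or residual connections, dilated (atrous) convolutions, multi-scale paths, or cascaded copies of the whole network.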
Table 6: Pros and cons of CNNs used for retinal vessel segmentation
No. of databases used | Ref | Advantages | Disadvantages | Accuracy | AUC
---|---|---|---|---|---
2 | [36] | Data handling is performed excellently, which may improve accuracy. The network is very deep, giving it the ability to handle complex representations. | Produces loss of information with more parameters, so the network is time-consuming. Cannot perform translation-invariant procedures. | 0.97 | 0.98
4 | [38] | Effectively built with residual blocks to avoid vanishing gradients. Creates lower gradients and bias and learns the data effectively. | Can easily produce misleading output. Computationally expensive. | 0.98 | -
2 | [40] | Fast training time compared with most deep networks. Produces a good, improved mean average precision, which may result in good accuracy. | The process becomes slow at test time. Training is expensive in both space and time. | 0.78 | 0.70
1 | [44] | Provides a better representation of vessels than some other CNNs. Global and local contexts are captured at the same time, which may help increase accuracy. | Learning becomes slower at the middle layers, which results in slow processing time. | - | -
3 | [47] | Uses hyperdense connections to avoid the vanishing gradient problem. Uses an attention module to select important features and enhance segmentation accuracy. | More weights are added, which can increase training time. May become cumbersome and confusing if not properly managed. | 0.98 | 0.98
5 | [50] | Provides selective focus on segments of the sequence, which may improve accuracy. Since feature dimensions change deeply, they can give insight into the features. | Adds more regularization to the network and interferes with the parameters. | 0.96 | 0.97
3 | [52] | Deep supervision in the BTS is used to avoid the vanishing gradient problem. Learns easily without human supervision. | Requires large data to perform optimally. Very expensive to train due to the complex data model. | 0.82 | 0.97
2 | [53] | Irrespective of the data, the network will still produce results. A deeper network may produce better accuracy (the deeper, the better). | May overfit, which reduces accuracy. Consumes huge processing time. | 0.97 | 0.97
4 | [55] | Produces an approximation of the sequence that would otherwise have been obtained, creating a faster training procedure. | Cumbersome and consumes huge processing time. | 0.99 | 0.99
2 | [57] | Reduces the loss of small vessels using size-invariant feature maps. The dense connectivity reduces the computational cost. | Large gaps between the loss and accuracy curves may reduce accuracy. | 0.96 | 0.98
2 | [60] | Uses a heuristic strategy to increase information, improving the feedback for learning a task. Transforms hard binary labels into soft labels that give the likelihood of the locations targeted by pixels. | If not well planned, the middle of the training procedure can be problematic. | 0.74 | -
6 | [61] | Prevents over-amplification of noise and gives good contrast enhancement. | Cumbersome, and involves many feature-extraction procedures that could produce errors. | 0.97 | 0.99
3 | [64] | Reduces overfitting by using an attention mechanism and regularization. | Large data is needed for optimal performance. The framework may become slow in the middle of training, which may affect the overall training. | 0.95 | 0.98
4 | [65] | Uses the multi-path scale module to avoid having to determine which components perform a certain task, producing better accuracy. Easy to start a new task. | Training and testing are very slow. | 0.98 | 0.98
2 | [68] | With two CNNs, it is easy to produce good accuracy. Fast training and quick implementation. | May suffer from the vanishing gradient problem. | 0.96 | 0.96
3 | [69] | Avoiding dense layers may make the network faster. Can be deployed on a system without large memory. | May be prone to overfitting because of the complexity and model parameters. | 0.97 | 0.98
2 | [71] | Errors that occur during the training process are corrected, so accuracy may improve. May be used to solve other complex problems. | If not properly constructed, may lead to state overload that can diminish results. May not be suitable for simple problems. | 0.92 | 0.96
3 | [72] | Uses augmentation and patch techniques to create more samples, increasing the training data. Reduces noise in local areas. | Cannot perform optimally without lots of data. Cumbersome and may be difficult to understand. | 0.96 | 0.98
2 | [74] | Provides good image quality, which may improve accuracy. | The network may become slow at the middle layers, affecting overall training. | 0.94 | -
2 | [76] | Fast training and testing times. | Prone to vanishing gradients, which affect accuracy. | 0.98 | -
2 | [80] | Simple and can work efficiently on a large dataset. Robust and can perform effectively even in the presence of noise. | May perform poorly compared with more sophisticated networks. | 0.98 | 0.97
2 | [82] | The deeper the better: this network may produce accurate segmentation. Augmentation makes the data sufficient, producing better training. | The network is cumbersome and computationally expensive. | 0.96 | 0.97
4 | [84] | Does not require labeled data, so training becomes faster. May not require preprocessing because the data are sharper. Data interpretation becomes easy, so accuracy improves. | The training process is hard and can become complex if not properly arranged. | 0.96 | 0.98
3 | [87] | Does not need multiple training runs to produce good accuracy. Can learn and perform with few images. | Overall training slows in the middle of the system. | 0.96 | -
6 | [88] | Performs well on large, unstructured datasets. Can be computationally efficient. | Very slow in both testing and training. Requires a large dataset to function optimally. | 0.96 | 0.98
4 | [90] | Improves boundary delineation, which in turn can produce good accuracy. Reduces the number of parameters fed into the network. | Requires large computational cost and large data. | 0.98 | 0.98
2 | [94] | Produces very good accuracy and can reduce the vanishing gradient problem. | Cumbersome and heavily dependent on large datasets. Prone to noise, since there is no preprocessing in the network. | 0.98 | -
3 | [95] | Can easily be localized with other networks and still produce good results. Training patches are larger than the training images. | Very slow in the middle of the network, which can affect the overall network. | 0.97 | 0.98
5 | [96] | Capable of dealing with large-scale variation and extracting representative features. Extracts vessels of varying sizes and shapes. | Very slow in the middle of the network, which can affect the overall network. | 0.97 | 0.98
4 | [97] | Not sensitive to the choice of network depth; modifies the UNet architecture for optimal results. | Very slow in the middle of the network, which can affect the overall network. | 0.95 | -
3 | [99] | Easy to train and gives good inference time for images. Can detect vessels of different sizes. | Heavily dependent on very large quantities of data. | 0.98 | 0.98
2 | [101] | The channel and multi-scale modules are used to prevent the network from overfitting. The spatial attention mechanism characterizes the encoder to generate coefficients and improve accuracy. | The network cannot retain longer sequences of data. More weights are added to the data, which can increase training time. | 0.96 | 0.98
2 | [102] | Concatenated networks help improve the model's spatial and structural representation and generalization. | Computationally expensive and depends on large data. | 0.97 | 0.98
3 | [103] | Data augmentation and patch segmentation are used to avoid overfitting. | Computationally expensive and depends on large data. | 0.97 | -
3 | [106] | Robust to noise and can perform optimally despite errors. Uses fast evaluation to learn and produce good accuracy. | Has no specific rules for the structure, which may make rule assignment difficult and time-consuming. | 0.95 | 0.96
3 | [111] | Reduces the covariate shift and the dependence of gradients on the scale of the parameters. The model is regularized, reducing the need for dropout and other regularization techniques. | Computationally expensive and requires excessive care for segmentation tasks. Regularly changing the construct can result in poorly generalized scale and shift of the input data. | 0.97 | 0.99
2 | [113] | Has a good mechanism for reducing computational cost. May train faster due to the convolution replacement. | The system may be hard to parallelize due to the use of the attention mechanism. | - | -
3 | [114] | Automatically detects important features without human supervision. | Requires a large dataset for effective results. | 0.96 | -
3 | [123] | The model can collect records so that each pattern can assume a dependent status. Extends the pixel neighborhood with the CNN. | Prone to exploding and vanishing gradient problems. Training is very difficult. | - | -
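The Accuracy and AUC columns in Table 6 are pixel-wise measures: accuracy is the fraction of pixels labeled correctly at a fixed threshold, while AUC scores the ranking of predicted probabilities without committing to any threshold. Because background pixels vastly outnumber vessel pixels in a fundus image, accuracy alone can look deceptively high, which is one reason both numbers are usually reported. A minimal sketch of how such metrics are typically computed, assuming NumPy and scikit-learn, with placeholder arrays standing in for real predictions (not data from any cited paper):

```python
# Minimal sketch (NumPy/scikit-learn assumed) of the pixel-wise Accuracy and
# AUC metrics reported in Table 6. The arrays are illustrative placeholders,
# not results or data from any cited paper.
import numpy as np
from sklearn.metrics import roc_auc_score

def vessel_metrics(prob_map, gt_mask, threshold=0.5):
    """Accuracy and AUC over all pixels of one fundus image.

    prob_map: float array in [0, 1], predicted vessel probability per pixel.
    gt_mask:  binary array, 1 = vessel, 0 = background.
    """
    y_true = gt_mask.ravel().astype(int)
    y_score = prob_map.ravel()
    y_pred = (y_score >= threshold).astype(int)
    accuracy = (y_pred == y_true).mean()   # fraction of correctly labeled pixels
    auc = roc_auc_score(y_true, y_score)   # threshold-free ranking quality
    return accuracy, auc

# Toy usage with random data standing in for a 64x64 probability map:
rng = np.random.default_rng(0)
gt = (rng.random((64, 64)) > 0.9).astype(int)              # sparse "vessel" pixels
prob = np.clip(gt * 0.7 + rng.random((64, 64)) * 0.4, 0, 1)
print(vessel_metrics(prob, gt))
```

In practice these per-image values are averaged over a database's test images, which is why the number of databases used is listed alongside each result in the table.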