Colorectal cancer (CRC) is a leading cause of cancer mortality globally [1]. Most colorectal cancers evolve from adenomatous polyps, making early detection and removal of polyps critical for CRC prevention and treatment [2]. Colonoscopy is the gold standard for detecting and removing polyps before they develop into CRC [3]. However, accurately identifying and segmenting polyps during colonoscopy is a complex task due to the diversity of polyps in terms of shape, size, and texture. This can lead to missed or misdiagnosed polyps, which can seriously harm patient health.
Machine learning (ML) algorithms, particularly convolutional neural networks (CNNs), have shown promising results in medical image segmentation and have been applied to polyp detection and segmentation [4, 5]. While deep learning (DL) algorithms can achieve high precision, they typically require large amounts of labeled data [6, 38, 39], which can be costly and time-consuming to obtain [40].
In an effort to improve the accuracy and efficiency of polyp segmentation, researchers have developed various deep learning (DL) architectures that employ different techniques to address this complex task. Examples of DL architectures used for polyp segmentation include U-Net [7], FCN [8], and their variants, such as U-Net++ [9], ResUNet++ [10], and H-DenseUNet [11]. While these methods can achieve precise segmentation results, their performance may be less robust when faced with a wide range of polyp characteristics.
In this study, we present a novel supervised convolutional neural network architecture for image segmentation that builds on the encoder-decoder structure of U-Net [7] with some significant differences. The key feature of our architecture is the combination of our custom-designed convolutional block with residual downsampling. The convolutional block enables our model to accurately locate and predict the borders of polyps with a small margin of error. By incorporating residual downsampling, the model can use the initial image information at each resolution level of the encoder, further improving its performance. We also use the atrous convolutions of DeepLabV3 [12] to capture spatial information and the residual block of ResUNet++ [10] for enhanced feature extraction.
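The residual-downsampling idea, keeping a pooled copy of the original image available at every encoder resolution, can be illustrated with a minimal NumPy sketch. The 2x2 average pooling and the three levels here are assumptions for illustration only, not the exact design of our network:

```python
import numpy as np

def avg_pool2x2(img):
    """Downsample an (H, W, C) image by 2x2 average pooling (H, W even)."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

# Build a pyramid of progressively pooled copies of the input image.
# Each encoder level can then concatenate its feature maps with the
# matching pyramid entry, so the raw field of view is never lost.
image = np.random.rand(64, 64, 3)
pyramid = [image]
for _ in range(3):
    pyramid.append(avg_pool2x2(pyramid[-1]))

print([p.shape for p in pyramid])
# shapes: (64, 64, 3), (32, 32, 3), (16, 16, 3), (8, 8, 3)
```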
The main contributions of this paper are:
- Our custom-built convolutional block, DUCK (Deep Understanding Convolutional Kernel), allows more in-depth feature selection, enabling the model to locate the polyp target accurately and correctly predict its borders.
- Our method uses residual downsampling, which allows it to use the initial image information at each resolution level in the encoder segment. This way, the network always has the original field of view alongside the processed input image.
- Our model does not use external modules and was trained only on the target dataset (no pre-training of any kind).
- Our method accurately identifies polyps regardless of their number, shape, size, and texture.
- Extensive experiments show that our method achieves strong performance and outperforms existing methods on several benchmark datasets.
Related Work
Convolutional Neural Networks
Automatic polyp segmentation is crucial in clinical practice to reduce cancer mortality rates. Medical image segmentation tasks usually employ convolutional neural networks, and several widely utilized architectures have been applied to this problem.
One such architecture is U-Net [7], an encoder-decoder model initially developed for biomedical image segmentation. U-Net has the advantage of being relatively simple and efficient while still achieving good performance on various medical image segmentation tasks. However, it may struggle with more complex or varied input images, in which case alternative methods may be more suitable.
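The encoder-decoder-with-skips pattern that U-Net popularized can be sketched at the shape level in NumPy. The `conv_block` here is an identity stand-in for the real convolutional layers, and the two-level depth is an assumption for brevity:

```python
import numpy as np

def conv_block(x):
    # Stand-in for two 3x3 conv + ReLU layers; identity keeps shapes clear.
    return x

def down(x):
    # 2x2 max pooling: halves spatial resolution.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def up(x):
    # Nearest-neighbour upsampling: doubles spatial resolution.
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(32, 32, 8)
e1 = conv_block(x)                # encoder level 1: (32, 32, 8)
e2 = conv_block(down(e1))         # encoder level 2: (16, 16, 8)
# The skip connection concatenates encoder features with the upsampled
# decoder path at the same resolution, recovering fine spatial detail.
d1 = conv_block(np.concatenate([up(e2), e1], axis=-1))
print(d1.shape)                   # (32, 32, 16)
```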
PraNet [32] is a CNN architecture specifically designed for automatic polyp segmentation in colonoscopy images. It employs a parallel partial decoder to extract high-level features from the images and generate a global map as initial guidance for the following processing steps. Furthermore, it utilizes a reverse attention module to mine boundary cues, which helps to establish the relationship between different regions of the images and their boundaries. PraNet also incorporates a recurrent cooperation mechanism to correct misaligned predictions and improve segmentation accuracy. The results of the evaluations indicate that PraNet significantly improves the segmentation accuracy and has an advantage in terms of real-time processing efficiency, reaching a speed of about 50 frames per second.
DeepLabV3+ [13] is an extension of the DeepLabV3 [12] architecture for semantic image segmentation. It employs atrous convolutions, which allow for a dilated field of view and the extraction of features at multiple scales to improve the capture of long-range contextual dependencies. This approach enables the more accurate segmentation of objects with complex shapes or large-scale variations but also requires more computation and may be slower to train and infer.
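How dilation enlarges the field of view without adding parameters can be shown in 1-D. This is a toy NumPy helper for illustration, not part of the DeepLabV3+ implementation:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D convolution with gaps of `dilation - 1` between kernel taps.

    Returns the output signal and the effective receptive field span."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field grows with dilation
    out = [np.dot(x[i:i + span:dilation], kernel)
           for i in range(len(x) - span + 1)]
    return np.array(out), span

x = np.arange(16, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])  # 3 taps regardless of dilation
for d in (1, 2, 4):
    _, span = dilated_conv1d(x, kernel, d)
    print(d, span)  # spans: 3, 5, 9 — same parameter count, wider context
```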
HRNetV2 [14, 15] is a CNN architecture for human pose estimation that uses a fully connected-style fusion scheme to share multi-scale information between layers at different resolutions. This architecture can improve performance on small or blurry objects but may be more prone to overfitting and require more data to achieve good performance.
Other CNNs designed explicitly for automatic polyp segmentation include ResUNet [16], which incorporates residual blocks to enhance location information for polyps, and HarDNet-DFUS [17], which combines a custom-built encoder block called HarDBlock with the decoder of Lawin Transformer to improve accuracy and inference speed. ResUNet can leverage the powerful expressive capacity of residual blocks but may require more data and computation to achieve good performance. HarDNet-DFUS is designed with real-time prediction in mind but may sacrifice some accuracy in favor of faster inference.
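The residual pattern that ResUNet (and ResUNet++) build on is y = x + F(x); the identity path lets low-level location cues and gradients bypass the learned transform. A minimal sketch, where the transform is only a placeholder for the actual convolutional layers:

```python
import numpy as np

def residual_block(x, transform):
    """Residual connection: y = x + F(x). Even when F learns little,
    the identity path preserves the input signal exactly."""
    return x + transform(x)

x = np.ones((8, 8, 4))
# Placeholder transform; real blocks use convolutions, normalization, ReLU.
y = residual_block(x, lambda t: np.zeros_like(t))
# With a zero transform the block is the identity: y == x.
```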
ColonFormer [18] utilizes attention mechanisms in the encoder and includes a refinement module with attention along the x and y axes at different resolutions to achieve a more refined output, while maintaining a decoder similar to the classical U-Net. Attention mechanisms can be effective for handling large or complex input images but may require more computation and be more challenging to optimize than other methods.
MSRF-Net [21] is a CNN architecture specifically designed for medical image segmentation. It utilizes a unique Dual-Scale Dense Fusion (DSDF) block to exchange multi-scale features with varying receptive fields, allowing the preservation of resolution and improved information flow. The MSRF sub-network then employs a series of these DSDF blocks to perform multi-scale fusion, enabling the propagation of high-level and low-level features for accurate segmentation. However, one limitation of this method is that it may not perform well on low-contrast images.
Transformers
While the previously mentioned methods have achieved good results for automatic polyp segmentation, other approaches that utilize transformers in the encoder perform particularly well on this task. These models typically use a pre-trained vision transformer as an encoder trained on a large dataset, such as ImageNet [22], to extract relevant features from the input image. These features are then fed to the decoder, which processes multi-scale features and combines them into a single, final output. Examples of such approaches include FCN-Transformer [19] and SSFormer-L [20], which have achieved state-of-the-art (SOTA) performance on the Kvasir Segmentation Dataset at the time of their release.
The use of Transformers has gained traction in the field of Computer Vision (CV) in recent years; they have been widely used in Natural Language Processing (NLP) and have shown spectacular results in retaining the global context of the subject at hand. Vision Transformers (ViT) [34], like their NLP counterparts, use a mechanism called Attention [33], which aggregates global context to extract relevant information from large image patches.
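The scaled dot-product attention [33] at the core of ViTs can be sketched in a few lines of NumPy. The token count and embedding size below are hypothetical, chosen only for the example:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Each output row is a context-weighted mix of all value rows,
    which is how global context is aggregated across image patches."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V

# Toy example: 4 image-patch tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one contextualized vector per token
```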
While ViTs [34] perform well in the CV field, traditional CNN methods, such as EfficientNetV2 [35], have outperformed them on popular image classification datasets, such as ImageNet [22] and CIFAR-10 [36], proving that more efficient CNN methods can still be developed.
As such, our proposed method explores the benefits of traditional CNNs over ViT-based architectures in biomedical image segmentation and shows how they can still yield substantial improvements in accuracy metrics.
Overall, this field is an active area of research, with various approaches being proposed and evaluated. Thus, further research is needed to determine the models' optimal design and training strategies. It is essential to carefully consider the trade-offs between accuracy, computational efficiency, and other performance metrics when selecting a method for a specific application.