Colorectal cancer (CRC) remains a challenging clinical problem and is the third leading cause of cancer-related deaths worldwide [1]. Notably, in 2012, CRC accounted for more than 1.36 million cancer cases and was the third most common cancer worldwide, accounting for 10% of cancers in men and 9.2% in women. When CRC is detected at an early stage, patient survival rates can reach nearly 90%, while survival for advanced disease drops to less than 7% [2]. Adenomatous polyps are the most common tumors found in clinical cancer screening. Early detection and removal of these precancerous lesions have been shown to prevent many cancers and effectively reduce mortality [3]. It is widely believed that more than 95% of colorectal cancers originate from adenomatous polyps. Polyps detected at an early stage are usually benign lesions with dysplastic epithelium, but they have the potential for malignant transformation [4]. Conventional polyp detection requires a high degree of hand-eye coordination from clinicians, yet nearly 25% of polyps are still missed during video examinations. It is therefore vital to use computer-aided diagnostic techniques to help clinicians detect polyps that are easily missed by the human eye [5].
Limitations of existing techniques: Traditional image segmentation methods include threshold-based, edge-based, and region-based approaches, among others. Most of these methods rely on manually extracted image features, such as color and texture information. However, hand-designed features are often shallow, which imposes significant limitations and leaves little room for performance improvement [6]. Compared with traditional methods, semantic segmentation based on deep learning has significant advantages in both accuracy and efficiency. Although increasingly capable networks have been designed for semantic segmentation, the results still do not generalize to all types of images: image diversity demands very large amounts of training data, and interference between categories degrades pixel-level prediction accuracy. At the same time, as the neural network deepens, edge information in the image is progressively lost. These factors seriously affect segmentation quality [7].
Traditional image segmentation methods for colon polyps are based on the characteristics of human vision and aim to extract features that are sensitive and discriminative. The features extracted by traditional image processing methods have a specific physical meaning in each dimension, including color, texture, and location [8]. In recent years, deep learning techniques have outperformed traditional methods in polyp detection, extracting polyp regions more comprehensively [12]. Deep neural networks learn features of high-dimensional images through broader and deeper network hierarchies. On public polyp segmentation datasets, however, accuracy still has considerable room for improvement.
This paper proposes a Pyramid Attention Transformer (PAFormer) network based on the fusion of ResNet50 and a Transformer encoder. The Transformer model, built on the attention mechanism, has achieved strong results in machine translation and natural language processing (NLP) [9]. More recently, many works in computer vision (CV) have applied Transformers to end-to-end object detection and image classification [10, 11]. We adopt the Transformer encoder module with ResNet50 as the backbone, and improve segmentation accuracy on common datasets by extracting and fusing semantic information from different layers.
The contributions of this paper are as follows.
1. We propose the PAFormer module, which is integrated with the encoding and decoding modules of segmentation networks and extracts high-level semantic features of images through global aggregation in a progressive pyramidal structure.
2. We propose the SEAP (SE attention and Atrous Spatial Pyramid Pooling) module, which uses local aggregation to recover spatial location information lost in deep convolution and places greater focus on edge detection.
3. Through extensive ablation experiments, we demonstrate that the proposed network (PAFormer) achieves strong results on existing public datasets and is highly competitive with existing methods.
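The SEAP idea from contribution 2 can be sketched as follows. This is a speculative reconstruction assuming the standard squeeze-and-excitation (SE) channel attention combined with ASPP-style parallel atrous convolutions; the channel counts and dilation rates here are illustrative assumptions, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel attention."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global context
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # excitation: channel weights
        )

    def forward(self, x):
        return x * self.fc(x)                              # reweight channels

class SEAP(nn.Module):
    """Hypothetical SEAP sketch: ASPP branches followed by SE attention."""

    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # Parallel atrous convolutions capture context at multiple scales;
        # padding = dilation keeps the spatial size unchanged for 3x3 kernels.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.se = SEBlock(out_ch * len(rates))
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(self.se(y))
```

Because every branch preserves spatial resolution, the module can be dropped between encoder stages to reinject multi-scale spatial detail, which is the role the contribution assigns to local aggregation.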