An Automated Framework for Detection, Localization, and Classification of Colonic Polyps using Deep Learning

Colorectal cancer (CRC) in its advanced stage is one of the leading causes of death worldwide. However, early detection of polyps, which are the precursors to such cancer, can lead to better prognosis and clinical management. This report proposes an automated diagnostic technique to detect, localize, and classify polyps in colonoscopy video frames. Manual detection and localization of polyps across the huge number of acquired colonic frames have many limitations. Our deep learning-based framework employs an attention-based YOLOv4 detector for polyp detection and localization. Finally, leveraging a fusion of deep and handcrafted features of the polyps, the detected polyps are classified as benign or malignant. The individual and cross-database performances on two databases suggest the robustness of our method in polyp localization. A comparison of our approach with current state-of-the-art methods on significant clinical parameters confirms that our method can be used for automated polyp localization in both real-time and offline colonoscopic video frames. Our method achieves an average precision of 0.8971 and 0.9171 and an average IoU of 0.8325 and 0.8179 on the Kvasir-SEG and SUN databases, respectively. Similarly, our proposed classification framework on the detected polyps yields a classification accuracy of 96.66% on a public dataset.


Introduction
Colorectal cancer (CRC) is one of the major health crises across the globe. The high mortality of CRC contributes significantly to total cancer deaths, and it is considered the third most frequently occurring cancer 1 . Such cancer in its initial state is called a polyp and is generally benign. Polyps are abnormal tissue growths and are usually found in the mucosa of the colon 2 . Colonoscopy is one of the medical procedures adopted for detecting such polyps. Early detection of polyps is crucial: it helps in a better prognosis and can increase the possibility of survival. During an entire colonoscopy, a considerable number of images of colon regions are captured. Nowadays, wireless capsule endoscopy (WCE) is also used, which captures thousands of images of the entire gastrointestinal (GI) tract 3 . Doctors inspect each captured frame to detect the presence of an anomaly. However, manually reviewing each of the huge number of acquired colonic frames for polyp detection is very difficult and inefficient. The features of polyps are so indistinct that it is sometimes challenging to distinguish them from normal colon tissue. Also, the maximum polyp detection rate achievable through colonoscopy is less than 50%, as it is highly operator dependent 4 . Therefore, it is essential to reduce the polyp miss rate.
The application of new technologies in health care is on a constant rise. With the advent of new modalities, efforts have been made to enhance the efficiency of colonoscopy. Recently, optical endoscopic modalities using narrow-band imaging (NBI) have been developed for improved detection of colorectal lesions 5,6 . NBI enhances the vascular pattern of the lesion, thereby increasing its discriminability. Blue-light imaging and i-Scan endoscopy are also used for better polyp detection 7,8 . Another endoscopy modality that acquires high-definition (HD) images is linked color imaging (LCI). This imaging technique uses color as an important cue for lesions. Generally, a malignant (adenomatous) polyp looks reddish, and a non-adenomatous polyp looks whitish 9 . LCI enhances the red and white colors, making the red areas look more reddish and the white areas more whitish during colonoscopy. Thus, LCI not only helps in lesion detection but also helps in their classification. Other techniques adopted to improve lesion detection include better bowel preparation, use of a broad-field-of-view camera, flattening of colonic folds, etc. 10 . However, diagnosis using these techniques for better polyp detection during colonoscopy needs a highly experienced and trained expert in this domain 11 .
Another problem that arises during lesion detection in colonoscopy frames is the high variability in polyp characteristics. Typically, small or serrated polyps and diminutive, isochromatic flat polyps are missed during manual inspection. Device- and patient-specific colonoscopic frames will have different image characteristics 12 . Therefore, a particular methodology in colonoscopy image analysis cannot be assumed to generalize. All of the above challenges must be considered when proposing an automated polyp detection system. In this report, an automatic diagnostic assistant system (DAS) is proposed for polyp detection and localization. Subsequently, the detected polyps are classified as benign (hyperplastic) or malignant (adenomatous) by our proposed classification system. Our method does not need an expert during colonoscopy, and our proposed deep learning-based method can perform polyp detection and localization on both offline and real-time colonoscopy frames. We first describe our proposed technique for polyp detection and localization.
Both handcrafted feature learning and deep learning-based methods have been proposed over the years for polyp detection in the literature. Handcrafted techniques use different cues from the polyp image, viz. color, texture, shape, surface properties, etc. In contrast, deep learning-based methods use the hidden features of the image. Most works using handcrafted feature learning are based on supervised learning [13][14][15][16] . However, these methods provide inferior performance, as the features learned during the training of a supervised model may not be sufficient to generalize to the test datasets. The huge variation of image features among data acquired from different endoscopic modalities yields unsatisfactory performance, even within the same modality. Sasmal et al. 17 proposed an unsupervised polyp detection method using saliency maps and particle filtering. Recently, deep learning-based automated polyp detection systems have been proposed for real-time polyp detection [18][19][20][21] . Convolutional neural network (CNN)-based methods have been deployed in medical imaging for various tasks [22][23][24][25][26] . Shin et al. 27 proposed a transfer learning approach for polyp detection in colonoscopy; they used Inception-ResNet and proposed a region-based CNN for the task. Shin et al. 28 proposed a conditional generative adversarial network to generate synthetic colonoscopy images for improved detection performance. Lee et al. 21 employed YOLO-v2 29 for real-time polyp detection and localization. Yamada et al. 30 deployed Faster R-CNN with VGG-16 to detect and localize lesions in endoscopic video frames, achieving real-time detection with a minimal polyp miss rate in colonoscopy video frames.
Polyp detection systems based on deep learning have improved overall performance on colonoscopy video frames 19 . One major issue with deep learning-based techniques is that their generalization ability cannot be established, even though they perform better than handcrafted methods. In medical procedures, especially in polyp detection and localization systems, the following features are desired: 1) consistency in performance, i.e., the DAS must reliably deliver its performance independent of imaging modalities and patients; 2) a minimum polyp miss rate, i.e., a high detection rate; and 3) real-time operation, which enables immediate attention to the patient.
Considering all these requirements for devising an automated polyp detection system, we propose an attention-based YOLOv4 detector for these tasks. The main contributions of our proposed method can be summarized as follows. This work introduces an attention mechanism into the YOLOv4 framework for improved polyp detection. Our approach uses spatial and channel attention modules in the backbone of the YOLOv4 framework. The attention mechanism gives importance to the region of interest (ROI), i.e., the polyp regions in a colonoscopy frame. A comparison of performance with state-of-the-art methods based on important metrics is presented in this article. The performances evaluated on two databases validate the robustness and generalization capability of our approach.

Dataset
We used two databases, viz. 1) Kvasir-SEG 31 and 2) the SUN Colonoscopy Video Database 32 , for the detection and localization tasks. The Kvasir-SEG database is a freely available open-access database, whereas the SUN database can be used after registration and agreement with the source. The SUN (Showa University and Nagoya University) Colonoscopy Video Database is a colonoscopy-video database designed to evaluate automated colorectal-polyp detection systems. It comprises 49,136 polyp frames taken from 100 different polyps using a high-definition endoscope (CF-HQ290ZI and CF-H290ECI; Olympus, Tokyo, Japan). Similarly, the Kvasir-SEG dataset contains 1000 image frames acquired using a ScopeGuide (Olympus Europe) endoscope. Some samples from both datasets are shown in Fig. 1. The details of the datasets are given in Table 1.
A large number of frames are acquired during colonoscopy, and their timely analysis is essential for better diagnosis and early treatment. Also, the decision support system must automatically detect any abnormality in the frames. Therefore, the first stage of our current work focuses on handling real-time data for automatic detection and localization of polyps in colonoscopy frames. For this, a deep learning-based attention YOLOv4 model is proposed in this work. The architecture of the proposed model is shown in Fig. 3. Furthermore, employing our suggested classification network, the localized polyps are classified as benign or malignant. The classification approach is explained in further detail later in this article.

Attention YOLO
YOLO is a single-step object detection model and is considered superior to other deep learning models owing to its optimal accuracy and detection speed 29 . Later, YOLOv2 33 and YOLOv3 34 were proposed, showing improved detection performance. In YOLOv3, the CNN Darknet53 is employed as the backbone of the architecture, efficiently extracting features from the input image. Subsequently, YOLOv4 was proposed by Bochkovskiy et al. 35 to further enhance detection performance and speed; it integrates efficient approaches employed across several domains. Though it performs well on various datasets, its applicability and generalizability to medical imaging cannot be guaranteed. Medical images, especially endoscopic video frames, are generally of low quality and may exhibit high noise, specularity, blur, etc. Also, a lack of annotated data may lead to overfitting of the YOLOv4 model and make it less efficient in polyp detection and localization. Therefore, some changes suited to the polyp characteristics of endoscopic video frames are made in the existing model for better performance. Occlusion, clutter, poor image quality, noise, etc., degrade detection performance, and endoscopy videos generally suffer from such limitations. Also, the bounding boxes (BBoxes) used to localize the target objects may not tightly fit the arbitrary contours of the objects. Therefore, various methods are generally adopted to highlight the real target object while neglecting the background. The attention mechanism is one solution to these problems: it enables the network to focus more on the target object. Attention mechanisms are coupled with deep detection models to learn the key features of the object, mimicking a property of the human visual system. Recently, attention mechanisms have shown promising performance in various computer vision applications [36][37][38][39] . Therefore, attention modules are embedded into the CSPDarknet backbone to focus more on the ROI of the feature maps.
These modules enable the extraction of important features from the polyp regions while ignoring the non-polyp regions of colonoscopy frames. Our method proposes two attention modules, namely a channel attention module and a spatial attention module, which are incorporated in the backbone of YOLOv4. YOLOv4 feeds feature maps to three different branches to obtain feature grid maps at three scales for detecting objects of different sizes. The three YOLO heads then localize the objects with BBoxes. Our proposed attention modules are applied to the feature maps before the three YOLO heads detect and localize polyps.

Channel attention block
The channel attention block is proposed to integrate the interaction among the inter-channel feature maps. It is employed to enhance the vital information of a feature map of an object. Let the input feature map be represented as M ∈ R^(H×W×C), where H, W, and C represent the height, width, and depth of the feature map, respectively. As shown in Fig. 4, a global average pooling operation is employed across all the depth maps to extract contextual information, embedded in the channel descriptor I_c ∈ R^(1×1×C), whose c-th element is given by:

I_c(c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} M(i, j, c).

To further explore the non-linear inter-channel relationship among the channel maps, we employ a two-layer network followed by a sigmoid activation function. In order to reduce the parameter overhead, W_1 is used as a dimensionality-reduction layer with a reduction factor of 16 38 , and W_2 is used to restore the dimensionality. This process is given as:

I'_c = σ(W_2 δ(W_1 I_c)),

where W_1 ∈ R^((C/16)×C), W_2 ∈ R^(C×(C/16)), δ denotes the intermediate non-linear activation (ReLU), and σ denotes the sigmoid function.
Finally, an element-wise summation operation is adopted between the input feature map and the generated channel attention map through a residual connection to mitigate the incurred information loss. The final feature map is given as: M' = I'_c × M + M. The channel attention module is illustrated in Fig. 4.
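As a concrete illustration, the channel attention computation above can be sketched in plain NumPy. The array shapes, the ReLU between the two layers, and all variable names are illustrative assumptions for this sketch, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(M, W1, W2):
    """Channel attention sketch: squeeze, two-layer excitation, residual.

    M  : feature map of shape (H, W, C)
    W1 : (C//16, C) dimensionality-reduction weights
    W2 : (C, C//16) dimensionality-restoration weights
    """
    # Global average pooling across spatial dimensions -> channel descriptor I_c
    I_c = M.mean(axis=(0, 1))                  # shape (C,)
    # Two-layer excitation: reduce, ReLU, restore, sigmoid
    hidden = np.maximum(W1 @ I_c, 0.0)         # shape (C//16,)
    I_c_prime = sigmoid(W2 @ hidden)           # shape (C,)
    # Channel-wise scaling plus residual connection: I'_c x M + M
    return I_c_prime[None, None, :] * M + M

rng = np.random.default_rng(0)
C = 32
M = rng.standard_normal((8, 8, C))
W1 = rng.standard_normal((C // 16, C)) * 0.1
W2 = rng.standard_normal((C, C // 16)) * 0.1
out = channel_attention(M, W1, W2)
print(out.shape)  # (8, 8, 32)
```

Because the attention weights lie in (0, 1) and are added to 1 by the residual path, each channel is amplified by a factor between 1 and 2 while the feature-map shape is preserved.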

Spatial attention
The spatial attention mechanism focuses on the local regions of a feature map. Thus, this module is employed to preserve the local polyp ROI information in the feature maps. Fig. 5 depicts the spatial attention module. As shown in Fig. 5, a 7 × 7 convolutional layer is introduced to aggregate the inter-spatial interaction of the maps and produce a one-channel spatial descriptor. Let the input feature map be represented as M ∈ R^(H×W×C). Then, the generated feature descriptor is given as:

I_s = conv_{7×7}(M),

where I_s ∈ R^(H×W×1) and conv_{7×7}(·) denotes a 7 × 7 convolutional layer. The sigmoid function then activates this feature map to highlight the important regions. Subsequently, it is multiplied with the input feature maps and summed with them through a residual connection to produce the final feature map, given as:

M' = σ(I_s) × M + M.
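The spatial gating can likewise be sketched in NumPy with a naive 7 × 7 convolution. The zero padding, kernel values, and shapes here are assumptions chosen only to make the sketch self-contained:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(M, kernel):
    """Spatial attention sketch: 7x7 conv -> sigmoid gate -> residual.

    M      : feature map of shape (H, W, C)
    kernel : conv weights of shape (7, 7, C) producing a single-channel map
    """
    H, W, C = M.shape
    pad = 3  # keep output spatial size equal to input ("same" padding)
    Mp = np.pad(M, ((pad, pad), (pad, pad), (0, 0)))
    I_s = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Aggregate a 7x7 neighbourhood across all channels
            I_s[i, j] = np.sum(Mp[i:i + 7, j:j + 7, :] * kernel)
    gate = sigmoid(I_s)[:, :, None]  # (H, W, 1), broadcast over channels
    # Spatial gating plus residual: sigma(I_s) x M + M
    return gate * M + M

rng = np.random.default_rng(0)
M = rng.standard_normal((16, 16, 4))
kernel = rng.standard_normal((7, 7, 4)) * 0.01
out = spatial_attention(M, kernel)
print(out.shape)  # (16, 16, 4)
```

In a real detector this loop would be a single convolution layer; the sketch only makes the per-location aggregation explicit.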

Classification of Detected Polyps
Following the identification of polyps, endoscopists isolate the polyp regions and visually assess them for cancer diagnosis. They do this by analysing several polyp features such as shape, color, texture, and surface patterns. Owing to the large number of medical images acquired during colonoscopy and the similarity in pathological manifestations across ailments, physical inspection and labelling of polyps is tedious and inefficient. The polyp characteristics may not always be visible to the human eye, and diagnostic information may be overlooked, making decision-making extremely challenging. To address these problems, an automated polyp classifier for two-class polyp classification, i.e., adenoma (malignant) versus hyperplastic (benign), is proposed in this report.
Hand-crafted feature learning approaches were used in the early research on automatic polyp classification from colonoscopy frames [40][41][42][43][44] . A drawback of these approaches is their inconsistent, poorly repeatable performance. Furthermore, their generalizability and robustness cannot be guaranteed, because pathological presentations vary greatly even within datasets of the same modality. Also, substantial domain knowledge is required to characterize the discriminating features of the polyps.
Deep learning-based techniques are better at handling such variations and give a high degree of generalization. As a result, there has been increasing interest in using such models in medical image and video processing, especially in polyp classification [45][46][47][48][49] . However, one of the primary drawbacks of these methods is that they require a large quantity of labelled data during training in order to attain reasonably good classification performance. Large-scale polyp databases, on the other hand, are hard to come by. The wide range of imaging methods and procedures, as well as privacy concerns and a lack of medical integration, pose a number of obstacles to obtaining high-quality, large-scale polyp images. In light of these issues, we propose a classification technique that does not require large amounts of labelled data. We illustrate how the non-linearity of a small, imbalanced dataset can be correctly described by the features learned via our proposed network. Accordingly, in this work, we present a novel polyp classification technique to address some of these problems. For classification of the detected polyps, we propose using the Triplet Network architecture and its associated triplet loss to learn non-linear representations between polyps. We show that the learned features can serve as a highly discriminative basis for machine learning models. We compare our findings with those of prior research and show that the features acquired by a Triplet Network can characterize the non-linearity of a small dataset, making them suitable for use with a linear classifier. In addition, integrating deep and handcrafted features improves polyp classification efficiency. For deep features, we employed a triplet network based on the Siamese architecture, while the handcrafted features were extracted using the pyramid histogram of oriented gradients (PHOG).
As discussed earlier, the texture and shape information of polyps plays a vital role during dysplasia grading by endoscopists. In our proposed framework, the triplet network learns a distributed embedding through the notion of similarity and dissimilarity, whereas the PHOG extracts the shape and texture information of the polyps 44 . The proposed classification approach is shown schematically in Fig. 7.

PHOG
A polyp's geometry, texture, and color provide considerable information about its nature. The proposed approach uses pyramid histogram of oriented gradients (PHOG) features to describe the geometry or morphology of a polyp. The HOG vector is calculated at each pyramid resolution level l = 0, …, L. Finally, the PHOG descriptor is obtained by concatenating all of the HOG vectors. The PHOG descriptor's dimensionality for the full image is given as K · Σ_{l=0}^{L} 4^l, where K is the number of orientation bins. In this work, the K and L values are taken as 8 and 4, respectively. The details of this feature extraction technique can be found in our previous paper 44 . The feature extraction process using PHOG is shown in Fig. 6.
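The descriptor-length formula above is easy to check numerically; this short sketch simply evaluates K · Σ_{l=0}^{L} 4^l for the values used in this work:

```python
def phog_dim(K, L):
    """PHOG descriptor length: K orientation bins per HOG histogram,
    summed over pyramid levels 0..L, where level l has 4**l cells."""
    return K * sum(4 ** l for l in range(L + 1))

# With K = 8 and L = 4: 8 * (1 + 4 + 16 + 64 + 256) = 2728
print(phog_dim(8, 4))  # 2728
```

So with the settings used here, each image yields a 2728-dimensional PHOG vector.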

Triplet Network
The Triplet Network, inspired by the Siamese network 50 , consists of three identical sub-networks with shared parameters. Each sub-network learns embedded features from one of three samples: the anchor, the positive, and the negative, respectively. A triplet is made up of an anchor, a positive, and a negative sample. The network outputs the L2 distances between the anchor and the positive sample and between the anchor and the negative sample. The cost function is computed using the triplet loss in Eq. 4, where f_i^a, f_i^p, and f_i^n represent the anchor, positive, and negative embeddings, respectively:

L = Σ_i max(‖f_i^a − f_i^p‖₂² − ‖f_i^a − f_i^n‖₂² + α, 0). (4)
The value of α was taken as 0.5, and the dimensionality of the embedding was set to 256.
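The triplet loss and the margin α = 0.5 can be illustrated with a small NumPy sketch; the synthetic embeddings and the batch size are assumptions for demonstration only:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.5):
    """Triplet loss sketch: hinge on squared L2 distances with margin alpha.

    f_a, f_p, f_n : batches of anchor/positive/negative embeddings, shape (N, D)
    """
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)  # anchor-positive distances
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)  # anchor-negative distances
    return np.mean(np.maximum(d_pos - d_neg + alpha, 0.0))

rng = np.random.default_rng(1)
N, D = 4, 256  # D = 256 matches the embedding dimensionality used here
f_a = rng.standard_normal((N, D))
f_p = f_a + 0.01 * rng.standard_normal((N, D))  # positives close to anchors
f_n = rng.standard_normal((N, D))               # negatives far from anchors
print(triplet_loss(f_a, f_p, f_n))  # 0.0: triplets already well separated
```

The loss is zero once every negative is farther from its anchor than the positive by at least α, which is exactly the geometry the embedding is trained toward.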

Training
The anchor and positive samples were labeled as benign polyps, while the negative sample was labeled as malignant. Three Triplet networks were trained with identical hyperparameters, using Adam as the optimizer with a learning rate of 0.0001. Each network was initialized with ImageNet weights rather than trained from scratch. For each of the triplet's images, our Siamese network generates an embedding. We achieved this by attaching a few dense layers to a ResNet50 model pre-trained on ImageNet. The weights of all layers up to conv5_block1_out were frozen, and the last layers were fine-tuned during training.

Evaluation Metrics
In this work, some of the widely recommended standard metrics are used to evaluate the detection and localization performances 18,51 .
• IoU(A, B) = |A ∩ B| / |A ∪ B| measures the overlap between two bounding boxes A and B as the ratio of the area of their intersection to the area of their union.
• AP: Average precision was computed as the mean of the APs at IoU thresholds from 0.25 to 0.75 with a step size of 0.05.
Similarly, standard performance indicators such as accuracy, sensitivity, specificity, precision, recall, and F-score are employed for classification.
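The IoU metric and the threshold grid used for AP averaging can be sketched as follows; the (x1, y1, x2, y2) box convention and the example boxes are illustrative assumptions:

```python
def iou(box_a, box_b):
    """IoU sketch for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds over which AP is averaged: 0.25 to 0.75 in steps of 0.05
thresholds = [0.25 + 0.05 * i for i in range(11)]

# Two 10x10 boxes overlapping on half their width: 50 / 150
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 4))  # 0.3333
```

A higher IoU between the predicted and ground-truth boxes therefore directly reflects tighter polyp localization.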

Experimental Setup and Configuration
Two databases are used in our experiments for this study; their details are given in Table 1. Table 2 shows the detection and localization performances of different state-of-the-art methods on the Kvasir-SEG dataset. It can be observed that our method achieves an average precision (AP) of 0.8971, which is the best among all. The APs achieved at multiple IoU thresholds, i.e., AP25, AP50, and AP75, are 0.9485, 0.9279, and 0.7849, respectively. The IoU measures the precision with which the bounding box localizes the target object. From our results, it is clearly observed that our method is better at localizing polyp ROIs than the state-of-the-art methods. Also, our method can detect polyps at a real-time speed of 50 FPS. Therefore, our method can be employed for accurate polyp detection and localization in real-time colonoscopy video frames.
Fig. 8 shows qualitative results of some samples from the Kvasir-SEG dataset for the polyp detection and localization task. Results from the recent state-of-the-art method, YOLOv4, and our proposed method are shown in Fig. 8. From the figures, it can be observed that both YOLOv4 and YOLOv4+Attention can detect and localize polyps with high confidence. Some of the bounding boxes are annotated with yellow arrows to show that our proposed method is better at localizing the polyps. In YOLOv4, most polyps are localized with wider bounding boxes than with the proposed YOLOv4+Attention. This is also validated by the quantitative results, where the average IoU for YOLOv4+Attention is better than for YOLOv4, as shown in Table 2. Samples with blue bounding boxes are ground truths and are available with the dataset. The performances on the SUN database with YOLOv4 and the proposed method are shown in Table 3. The qualitative localization performances on some of the samples of the SUN database are shown in Fig. 9. It is observed that similar performances are achieved on this dataset.
The YOLOv4 model did not detect the polyp in the second image of the second row, but our proposed model could detect and localize it. To further validate the robustness of our model, we also cross-validated the performances: we evaluated the model on the test data from the Kvasir-SEG dataset while it was trained on the SUN database. The cross-dataset performance is shown in Table 4.

Classification Results
The proposed method is validated on the publicly available labeled polyp dataset for colorectal polyp classification 40 , available at http://www.depeca.uah.es/colonoscopy_dataset/. It contains video sequences acquired using narrow-band imaging (NBI) and white-light (WL) imaging. The dataset contains video sequences for 21 hyperplastic (benign), 15 serrated, and 40 adenoma (malignant) polyps. Fig. 10 shows some samples from both classes of interest. The video sequences are converted to frames, and from each frame the polyps are segmented out using the proposed YOLOv4 attention network. Subsequently, these polyps are fed to the triplet network for classification. In this work, only NBI image frames from the hyperplastic and adenoma classes are considered. Three-fold cross-validation was employed to validate our approach. The extracted features are classified with a linear SVM to separate benign from malignant polyps, achieving a classification accuracy of 90.16%. The embedded features of the image samples of the database are analyzed using t-SNE and are shown in Fig. 11. Further, the PHOG features are fused with the embedded features extracted from the triplet network to enhance the classification performance. The fusion of these features increases the dimensionality and non-linearity of the feature space; therefore, an RBF-kernel SVM was used for classification of the fused features. It was also verified experimentally that the RBF SVM performs better than other classifiers. Table 5 shows the classification accuracies of some handcrafted-feature methods on the same dataset. Similarly, Table 6 shows the classification accuracies on the same dataset using transfer learning approaches. Finally, the results are compared with the state-of-the-art methods, and it is clearly seen that our method gives better performance in a limited-data environment. The results are shown in Table 7.
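The fusion step can be sketched as a simple concatenation of the two descriptors followed by an RBF kernel evaluation, which is the kernel the SVM would operate on. Concatenation as the fusion rule, the γ choice, and the random stand-in features are assumptions for illustration only; the paper does not spell out these details:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF (Gaussian) kernel matrix between row-vector feature sets."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negative round-off

rng = np.random.default_rng(2)
n = 6
deep = rng.standard_normal((n, 256))    # stand-in triplet-network embeddings
phog = rng.standard_normal((n, 2728))   # stand-in PHOG descriptors (K=8, L=4)
fused = np.hstack([deep, phog])         # concatenation fusion of both views
K = rbf_kernel(fused, fused, gamma=1.0 / fused.shape[1])
print(K.shape)  # (6, 6)
```

The resulting kernel matrix is symmetric with a unit diagonal, which is what an RBF SVM consumes during training on the fused features.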
Table 7. Comparison with the existing works.

Discussion
This paper presents a framework for the analysis of colonic polyps using colonoscopy video frames. In the first stage of the study, a deep attention-based YOLOv4 network is proposed to detect and localize polyps. The proposed algorithm outperforms state-of-the-art approaches by a significant margin. The generalizability and robustness of our method are also demonstrated by the consistency of results both within and across datasets. Subsequently, the localized polyps are classified, which is crucial for a better prognosis. We propose a triplet network based on the Siamese architecture, followed by an SVM, to achieve this. Additionally, local polyp features are extracted and fused with deep features, resulting in improved classification results. The effectiveness of our strategy in a limited-data environment is demonstrated by its classification performance on a relatively small dataset. In the future, we hope to improve polyp detection and localization by training the network with features that best characterize the clinical manifestations of polyps. Further, grading of dysplasia in polyps could also help practitioners better comprehend pathological situations.

Data availability
The Kvasir-SEG and SUN databases are publicly available. The dataset used for classification is also an open-access dataset.