Information Fusion for Colorectal Polyps Medical Image Segmentation

Training a deep neural network often requires a large amount of annotated data, which is scarce in the medical image analysis domain. In this work, we present a simple yet effective technique for enhancing medical image segmentation neural network through information fusion. The proposed approach utilizes information from different spatial scales and combines them in a learnable way. Experimental results on two benchmark datasets demonstrate that the proposed fusion module improves the segmentation performance of state-of-the-art neural networks.


Background
Semantic segmentation is a common computer vision task in which each pixel in an image is assigned to a pre-defined category [1]. It has been widely used in applications such as landscape mapping, autonomous driving, and medical image analysis [2,3]. In the last decade, training a deep convolutional neural network (CNN) to segment the image has replaced the handcrafted features to become a common practice due to its superior performance. It is well understood that deep neural networks are notoriously data hungry insofar as effectively training the model is concerned. As op-posed to the application of semantic segmentation in the industry, deploying a segmentation system for medical image analysis has been found challenging. Lack of annotated data has always been a major obstacle in the medical image analysis community [4]. This is attribute to the manner on how medical images are labeled. Unlike natural images that can be easily obtained, medical images are hard to acquire due to the data privacy concern. To make matters worse the annotator is required to have domain expertise for labeling medical images. Colorectal cancer is the third most commonly diagnosed cancer and the fourth cause of cancer death around the world [5]. As a precursor to the colorectal cancer, polyps are abnormal tissue growths on the inner lining of the colon or rectum. In clinical practice, colonoscopy is the preferred technique for detection and assessment of polyps. Therefore, developing a method to accurately segment polyps in colonoscopy images is of great interest as it helps the doctor to identify polyps at an early stage. However, segmenting polyps is a challenging task as the same type of polyps appears in a high variety of color, size and texture. Despite the advancement in colonoscopy, the boundary between polyps and surrounding mucous membrane is usually not very sharp as shown in the Figure 1. To address these challenges, in this paper, we propose a novel information fusion module to improve the performance of existing segmentation networks. The key idea is to combine features from different layers in a data adaptive way. We evaluate the proposed method on two publicly available datasets, Kvasir-SEG [6] and CVC-612 [7], for polyp segmentation in colonoscopy images. The results demonstrate that the proposed fusion module improves on multiple network architectures for medical image segmentation. The remainder of this paper is organized as follows. We introduce the related work in Section 2 and describe the proposed approach in Section 3. Section 4 describes the two datasets along with the segmentation results. Concluding remarks are provided in Section 5.

Related work
To address the label scarcity problem, plenty of machine learning techniques have proposed in the last few years. Existing techniques can be roughly grouped into two broad categories: efficient training and data feature enhancement. Active learning seeks to train the network with as little labeled data as possible by picking the most informative samples to increase network training efficiency [8]. By pre-training the network on a large-scale dataset like ImageNet [9], where labels are relatively easy to collect, transfer learning allows a model to be fine-tuned with a considerably less quantity of data. Semi-supervised learning differs from supervised learning in that it extracts informative characteristics from unlabeled data [10]. Data feature enhancement, on the other hand, focuses on creating more discriminative features and exploiting the relationship between data samples. Synthetic data generated with generative adversarial network (GAN) [11] has proven to be very effective for improving the model performance [12,13]. Statistical shape and appearance models have also been used to generate training samples with higher data variance that helps improve the robust-ness of the model [14]. The hierarchical features in CNN provide a natural way to address the object scale variance issue in medical image segmentation [15]. U-Net [16] and its variants [17][18][19][20] capture the multi-scale information by connecting features from shallow and deep layers. Attention-based methods [21] guide the training to focus on important image regions such as object edges such that the segmentation accuracy can be improved.

Methods
In this section, we introduce the proposed information fusion module. We begin by describing the base network, fully convolutional network (FCN) (1), and the feature refinement unit [22]. We then explain how the multi-scale features are combined through the information fusion module. Figure 2 de-picts our network architecture. Fully convolutional network (FCN) has been extensively used for semantic segmentation since it was introduced in 2015 [1]. To simplify the flowchart, we omit the convolutional layers and the output (softmax) layer in FCN in Figure 2. The convolutional features after four pooling layers, which are commonly known as FCN-32s, FCN-16s, FCN-8s, and FCN-4s, are sent through the information fusion module before making the prediction. The information fusion module not only improves the discrimination ability of features, but also aggregates features at different level in a learnable fashion. In particular, the convolution features at each stage are first connected to a feature refinement unit (FRU). The detailed architecture of a FRU is shown in Figure 2. Inspired by the RefineNet [22], we add two additional convolutional layers and associated ReLU activation layers [23] in combine with a residual connection. This refinement process has shown to be effective in dealing with fine-scale features [22]. When used in colonoscopy image segmentation, the feature refinement is helpful for segmenting polyps that are relatively small as compared to the image size. It is well known that early layers in a CNN are responsible for extracting shallow, fine, and appearance information from the data, while layers in later stages tend to produce features at a coarse scale and are related to the semantic information [24,25]. The success of U-Net [16] has shown that combining these two types of features is critical for producing accurate segmentation map for medical images. Instead of directly concatenating all features, the outputs of FRU are added together and followed by a 1 × 1 convolution. This essentially learns a weight for features from each layer. Compared to a feature concatenation or a feature summation, weights for each component can be learned   through the error back propagation. Therefore, features that contain different information can be combined in a datadriven way. It is worth noting that we choose FCN as the base network for illustration purpose. The proposed information fusion module is flexible and generic. We can attach different number of FRUs based on the feature groups. For example, when the proposed information fusion module is used with U-Net, we can attach one FRU after each skip connection, which results in a 4-stage fusion. The information fusion module can also be used with other network architectures as long as the convolutional features can be extracted from different depths. The standard pixel-wise cross entropy loss is used to train the network.

Experiments
In this section, we evaluate the proposed method by applying them to two publicly available polyp segmentation datasets. We compare the performance of multiple networks with and without adding the information fusion module.

Datasets and Baselines
Kvasir-SEG (6) is a recently released dataset for polyp segmentation. It consists of 1000 images and their corresponding ground truth. These images were manually labeled by a medical doctor and reviewed by an experienced gastroenterologist from Vestre Viken Health Trust in Norway. CVC-612 (7) is a benchmark dataset that has been used in many polyps segmentation studies. It contains 612 images that were extracted from 31 colonoscopy videos. For both datasets, each pixel in the image is labeled as either polyp or non-polyp. To evaluate the efficacy of the proposed method, we compare the performance before and after adding the information fusion module with multiple networks, including FCN [1], U-Net [16], U-Net++ [17], ResUNet++ [19], PraNet [26]. Since the PraNet has a similar multi-level feature aggregation design, we compare it with the proposed method by replacing the parallel reverse attention module in PraNet with the proposed information fusion module.

Results
For both datasets, we follow the common experimental setup used in [19,26] by randomly split the data into three sets, training (80%), validation (10%), and testing (10%). The performance of network is measured with standard semantic segmentation metrics, Dice and mean Intersection over Union (mIoU). All networks are implemented using PyTroch framework [27]. The experiments are conducted on an AWS EC2 p3.2xlarge instance. For baseline networks, i.e., without information fusion module, we use numbers reported by the original authors. To have a fair comparison, we use the same hyperparameters as the authors specified in the original papers when training the network with information fusion module. For networks that the training details are not provided, we train the network 100 epochs with a batch size 16. The initial learning rate is set to 0.01. For optimization, we use the SGD optimizer with cosine decay for learning rate schedule [28]. We also use the standard data augmentation, such as random flip, crop, rotation, and color jittering, etc., to improve the model performance. Table 1 shows the segmentation results of all networks. As can be seen that the information fusion module (indicated by (IF) in the table) consistently improved the model performance on both datasets. When applied to a basic network like FCN [1], the information fusion module improved Dice from 0.794 to 0.822 by 3.5%, and mIoU from 0.763 to 0.792 by 3.8% on the Kvasir-SEG dataset. A similar trend can be observed on the CVC-612 dataset as well as for U-Net. With advanced variants of U-Net, the additional connections in U-Net++ [17] and ResUNet++ [19] are expected to enhance the feature combination from different layers. The model with in-formation fusion module still achieved higher Dice and mIoU. We attribute this to the weighted summation with learned weights compared to simple feature concatenation and feature summation. The improvement on the recently published PraNet [26] is relatively small as opposed to other networks. This is because the PraNet also adopts multi-scale  feature fusion using an attention module, which shares the same motivation as the proposed information fusion module.
It is worth mentioning that the performance achieved by the PraNet is the state-of-the-art on both datasets. Considering that the PraNet results are close to the empirical upper bound of these two datasets, the improvement achieved by the information fusion module is considerably significant. Fig. 3 shows the qualitative results for models with and without the in-formation fusion module on CVC-612 and Kvasir-SEG datasets. We observe similar trends from all the compared networks and choose FCN and PraNet as examples. As can be seen that networks with the proposed information fusion module produced segmentation masks with sharper edges compared to the ones produced by networks without the proposed module. The ba-sic network (FCN and PraNet) tended to make mistake when the boundary between polyps and surrounding mucous membrane is blurry. With the information fusion module, the networks (FCN (IF) and PraNet (IF)) were able to separate these polyps from background. We attribute this improvement to the ability to combine fine-scale features with coarse-scale features.

Conclusion
In this paper, we have proposed a novel approach to improve semantic segmentation networks for medical image segmentation. The proposed information fusion can be easily incorporated as a module for all common segmentation networks. The feature refinement and weighted feature summation that provided by the information fusion module have been demonstrated to be effective in improving the segmentation performance on real polyp segmentation datasets when compared with other basic networks.