LeukoSegmenter: A Double Encoder-decoder Based Network for Leukocyte Segmentation From Blood Smear Images

Segmentation of blood cells is a prerequisite step in automated morphological analysis of blood smear images, cell count determination, and diagnosis of various diseases such as leukemia. It is extremely challenging due to the different sizes, shapes, morphological characteristics, and overlapping of blood cells. Due to its complicated nature, it is generally performed as a sequence of steps. However, sequential segmentation results in restricted accuracy due to the cascading of errors that creep in during each stage. On the contrary, pixel-wise segmentation of blood cells is a single-step task and gives promising results. In this paper, we propose LeukoSegmenter, a double encoder-decoder network for precise pixel-wise segmentation of leukocytes from blood smear images. It uses pre-trained ResNet18-based encoders and U-Net-based decoders. Feature maps obtained from the first network are utilized as attention maps. These are used as input in conjunction with the original 3-channel image to obtain the final mask from the second network. This mechanism allows the latter encoder-decoder pair to focus explicitly on leukocytes and ignore other blood cells and debris, thus improving segmentation accuracy. Experiments on the ALL-IDB1 dataset show that the proposed LeukoSegmenter achieves an intersection-over-union score of 94.6827% and a Dice score of 97.1987%, which is superior to that of state-of-the-art methods.


Introduction
Leukemia, a type of blood cancer, is a deadly disease that causes thousands of fatalities every year globally. More than sixty thousand cases were registered in the United States (US) alone in the year 2019 [1]. In the same year, the Global Cancer Observatory positioned it among the top 10 deadly cancers in India [2]. In general, human peripheral blood consists of three types of cells, namely red blood cells (RBCs), white blood cells (WBCs), and platelets (Fig. 1). Out of these,

Fig. 1 Types of blood cells
WBCs are more prone to become cancerous. In leukemia, blood-forming tissues start producing a large number of abnormal WBCs, slowly killing the normal cells in the bloodstream. Depending on the type of WBCs affected, leukemia is classified into two types: lymphoblastic (affects lymphocytes) and myeloid (affects monocytes and granulocytes). Each of these can be further categorized on the basis of the rate of progression as acute (i.e., the progression rate is fast) or chronic (i.e., the progression rate is slow).
Technological advancements during the past five decades have introduced various semi-automated methods in the field of hematology such as flow cytometry, blood cell count using a hemocytometer and blood smear image processing. Flow cytometry and hemocytometers are used for differential cell count and cannot be used for detailed morphological analysis of the blood cells. On the contrary, the image processing-based systems use pattern recognition algorithms to analyze the geometry, size, color, and texture of blood cells, much like a human expert. The early image processing-based systems capture various hematological features and apply different rules to quantify cancerous characteristics. These rule-based systems are ad hoc, fragile and do not fit into one generic model. Furthermore, various slide preparation issues like variation in optical density, overlapping cells, disrupted cells, stain debris, stain variations and image acquisition issues like lighting, scale, noise, compression cause significant variance in their results.
Advancements in machine learning, specifically deep learning, have led to the development of methods that have given benchmark performances in various computer vision tasks [3][4][5][6]. While traditional machine learning algorithms are based on an explicit selection of feature sets, deep learning algorithms automate this process [7]. Automated feature extraction allows an extremely large number of features to be extracted, beyond human comprehension. In a deep neural network (DNN), input passes through a sequence of stacked layers. Each layer utilizes the low-level features selected from the preceding layer(s) and passes high-level features to the succeeding layer(s). This process continues till the final layer of the network is reached, which gives a comprehensive representation of the data. The depth of a network, or the number of stacked layers in it, significantly impacts the number of parameters and the performance of the network. Fine-tuning a large number of trainable parameters through training requires enormous computational power, generally met using graphics processing units (GPUs). To reduce a DNN's computational requirement, a suggested approach is to preprocess the input image, extract the region-of-interest (RoI) and then pass it to the network. For leukemia, WBCs from blood smear images are segmented, stacked and then passed as a block to a DNN-based computer-aided diagnosis (CADx) system for analysis. This helps the network focus on smaller RoIs and perform faster and more accurate analysis.
Biomedical image segmentation using deep learning is generally performed using convolutional neural networks (CNNs), the networks best suited to processing spatial data. In recent years, CNNs such as the fully convolutional network (FCN) [5], and encoder-decoder networks such as SegNet [8], UNet [9] and others [10][11][12], have been used to perform biomedical segmentation with good results. The encoder-decoder networks generally use models such as AlexNet [3], VGG16 [4] and ResNet [13], pre-trained on a natural-image dataset such as ImageNet [14], as the encoder. The encoder is then followed by a decoder which projects the low-level features extracted by the encoder to a high-resolution pixel space such that each pixel is classified into its respective class. In this paper, we propose LeukoSegmenter, a double encoder-decoder network to segment leukocytes from blood smear images. The contributions of this paper are as follows:
1. The proposed model cascades two encoder-decoders with a concatenation of feature maps in between them. The concatenated input allows the latter encoder-decoder to explicitly focus on leukocytes while ignoring other blood cells and cell debris, thereby giving better segmentation results.
2. The model has a symmetrical structure and is based on the ResNet18 and UNet architectures. It leverages the inherent advantages of the ResNet18 architecture for feature extraction and the UNet architecture for hierarchical upscaling to recreate the segmented image. The usage of pretrained networks also resulted in faster convergence of the proposed model.
3. We present a comprehensive empirical study, including an ablation study validating the need for cascading two encoder-decoder networks. The qualitative and quantitative results from the ablation study indicate that the proposed dual ResNet18-UNet encoder-decoder network gives better segmentation results in terms of IoU and Dice score on the ALL-IDB1 dataset as compared to a single ResNet18-UNet network, a single ResNet34-UNet++ network [15], and a double ResNet34-UNet++ network.
4. The proposed LeukoSegmenter gives consistent and reproducible results when evaluated on the standard dataset using benchmark metrics.
The rest of the paper is organized as follows. Section 2 presents the work related to leukocyte segmentation from blood smear images. The architecture of LeukoSegmenter, dataset details, and the evaluation metrics used to evaluate the performance of the proposed model are presented in Section 3. Section 4 presents an ablation study, the results of the proposed model, and their comparison with those of state-of-the-art methods. Finally, conclusions are drawn in Section 5.

Related Work
The existing methods for leukocyte segmentation from a blood smear image can be broadly classified as traditional methods and deep learning-based methods.

Traditional methods
The traditional methods for leukocyte segmentation are further categorized as: 1. Rule-based methods: These methods use heuristic rules formulated on the basis of prior knowledge of cells, such as their shape, texture and size, to perform segmentation. The methods have the advantage of being simple, and the rules can be applied in different orders, giving notable freedom. Rawat et al. [16] The method needs to be used in conjunction with, or be followed by, other algorithms such as active contours and edge detection. 2. Deformable model-based methods: These methods have the limitations that their convergence is expensive and is highly dependent upon the initialization of the curves.

Deep learning-based methods
Deep learning-based methods have outperformed state-of-the-art methods in several computer vision tasks, including leukocyte segmentation. The main DNN architectures that have been used for segmentation of leukocytes from blood smear images are: 1. SegNet: Introduced by Badrinarayanan et al. [8] in 2017, the SegNet architecture consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The encoder network, which is similar to the VGG16 architecture, is used for feature extraction, whereas the decoder network upsamples the resultant feature map using max-pool indices and the convolution operation. Trans et al. [31] used SegNet to segment WBCs from 42 images of the ALL-IDB database. Using this architecture, an overall IoU score of 89.96% is obtained. 2. Fully-convolutional networks (FCNs): These networks are modified CNNs [5] in which the last fully connected layers are removed and the output map is passed to a decoder network. Skip connections are embedded from the corresponding encoders to the decoders. In contrast to SegNet, FCNs use the deconvolution operation to upsample the feature maps. Depending on the number of times the pooling operation is performed, they are further categorized as FCN8, FCN16 and FCN32. Shahzad et al. [32] used a variant of FCN where VGG16 was used as the feature extractor model to segment RBCs, WBCs and platelets from ALL-IDB blood slide images. A mean accuracy of 91.96% is achieved for all three cellular components of blood, which leaves scope for improvement in the future. 3. U-Net: This network is the most popular form of encoder-decoder network used for semantic segmentation of medical images [9,33].
Its network architecture is similar to SegNet and FCNs, where the encoder downsamples the input image and the decoder performs up-sampling, thus taking the shape of the letter 'U'. The major difference between FCNs and UNet is that the former uses an addition operation between the downsampled and upsampled feature maps at the same level, whereas the latter uses a concatenation operation. Thus, UNet prevents loss of information and is specifically designed for medical images. Lu et al. [15] used a combination of UNet++, a variant of UNet with multiple skip connections, and ResNet50 to segment WBCs from the background using four types of datasets. The mean IoU value obtained for all the datasets is above 90%. However, the major drawback of the model is that it has not been tested on images having multiple WBCs. 4. DeepLab: This model was designed in 2017 for semantic segmentation and consists of encoder and decoder phases [34]. It aims to overcome the major drawback of all the architectures described above, which is the inability to handle varied-sized inputs. This network can be trained on images of various sizes using Spatial Pyramid Pooling (SPP). To reduce the computational complexity and cost involved in this model, dilated convolutions are used. These convolutions introduce spaces between kernel values so that the field of view is increased while the number of parameters does not become excessive. Roy et al. in 2021 trained blood cell images on the DeepLabv3+ model (a variant of DeepLab which uses depthwise separable convolutions in both the encoder and decoder phases), which used ResNet-50 to downsample the images [35]. The average IoU obtained on three datasets was 92.1%. Although the results are satisfactory, the model needs to be tested on images containing multiple overlapped cells too.
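The dilated-convolution idea can be illustrated with a short PyTorch sketch (ours, not from any of the cited papers): a 3×3 kernel with dilation 2 covers a 5×5 field of view while keeping the parameter count of an ordinary 3×3 convolution.

```python
import torch
import torch.nn as nn

# Standard 3x3 convolution vs. a dilated one (dilation=2).
# Padding is chosen so both preserve the spatial size.
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 1, 16, 16)
y_std, y_dil = standard(x), dilated(x)

# Same number of weights: dilation inserts gaps in the kernel instead of
# adding parameters, enlarging the effective field of view from 3x3 to 5x5.
n_std = sum(p.numel() for p in standard.parameters())
n_dil = sum(p.numel() for p in dilated.parameters())
```

Both layers have identical parameter counts and output shapes; only the receptive field differs, which is exactly the trade-off DeepLab exploits.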

Methodology
During the end-to-end training process, the encoder tries to learn patterns in an image with the help of many convolutional layers, denoted by C_m. These layers capture local information in an image and pass it to the next layers in the form of feature maps, denoted by F_m. Thus, each input x is convolved with a weight matrix W_i, and a bias value b_i is added to it. This information is passed through the ReLU activation function, denoted by f(x), to induce non-linearity in the deep-supervised network. This process is given by equation 1, where i = 1, ..., F_1 and m = 1:

F_m^i = f(W_i * x + b_i)    (1)

The encoder branch of the network is generally chosen as one of the convolutional neural networks (CNNs) pre-trained on the ImageNet dataset. This process of transfer learning helps transfer the basic features learned on non-medical datasets to medical image datasets. Such networks are generally used for classification purposes but can also be used for segmentation of objects by passing the downsampled image to a corresponding decoder. In this case, the original ResNet18 network [13] has been used as the encoder. However, the original architecture has been modified: the last fully connected and softmax layers have been removed, and the feature maps are passed directly to the next module, as shown in Table 1. The decoder branch, which forms a directed acyclic graph topology, is used to restore and refine the set of semantic features learned from the encoder module. This restoration process involves transposed convolutions (also known as deconvolutions) that help to upsample the feature maps and thus expand the feature dimensions. Spatial resolution is upsampled by a 2×2 factor, which reduces the number of feature maps by half. It is then followed by a corresponding 3×3 convolution, batch normalization and the ReLU activation function.
The inclusion of skip connections from the encoder to the decoder helps to retain fine information and enhance decoding performance by avoiding the loss of gradient information. A 1×1 convolution used at the end of the network helps to associate each of the n-component feature maps with its corresponding class. Thus, relevant features are learned and the decoder outputs a mask corresponding to the input image. In this paper, a dual encoder-decoder network has been proposed, which consists of two sub-networks X_1 and X_2, as shown in Fig. 3. A 3-channel RGB input is passed to the encoder, which downsamples the image feature map and increases the count of channels using the convolution operation. At this point, the spatial resolution of the image has been reduced and only significant features are retained by the network. This feature map is then passed to the decoder, which up-scales it and helps to compensate for the loss of spatial resolution that occurred during the encoding phase. The motivation behind the double encoder-decoder model proposed in this paper is a network called Wnet [37], which simply stacks two such architectures sequentially in the shape of a 'W'. However, a modification in our model ensures that minute details in an image are preserved. Given x as a 3-channel input image to the first network X_1, X_1(x) is the single-channel output of the first network. This output, along with the original 3-channel input, is passed to the second network as per equation 2:

y = X_2([x, X_1(x)])    (2)

Here, x and X_1(x) are concatenated and hence a four-channel input is passed to the second network X_2.
In general, attention modules are used by CNNs to target contextual information and discard non-useful features learned by the network. In our case, the output of the first U-Net network acts as an attention map to suppress the irrelevant background pixels. Concatenation of the output from the first decoder helps the model focus on fine and significant regions in an image. Although the architectures of the two networks can vary, we have restricted ourselves to using the same architecture for both to simplify the understanding and reasoning of the novel concept.
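The concatenation step can be sketched in a few lines of NumPy (shapes are illustrative and `second_network_input` is our hypothetical name): the single-channel mask produced by X_1 is stacked onto the 3-channel image to form the 4-channel input of X_2.

```python
import numpy as np

def second_network_input(x, attention_map):
    """Concatenate the 3-channel image with the 1-channel output of X_1."""
    # channel axis first: x is (3, H, W), attention_map is (1, H, W)
    return np.concatenate([x, attention_map], axis=0)

rng = np.random.default_rng(0)
x = rng.random((3, 64, 64))        # original RGB input
x1_out = rng.random((1, 64, 64))   # sigmoid output of the first network
x2_in = second_network_input(x, x1_out)
```

The first three channels of `x2_in` are the unchanged image, so the second network still sees the raw pixels alongside the attention map.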

Experimental Setup
In this section, the data augmentations and the experimental setup details are discussed.
1. Training Phase: Since medical image datasets are small in size, it is advised to perform augmentations to prevent over-fitting of the model [38]. Thus, various types of data augmentation have been applied to increase the number of images in our dataset, namely horizontal flip, vertical flip, and rotations (90°, 180° and 270°). In this way, each image in the dataset has been augmented four times randomly, which increased the size of the training set from 88 images to 1408 images. 2. Hyper-parameters: Hyper-parameters such as batch size, number of epochs, optimizer and scheduler have been fine-tuned to adapt to the network as a whole. The optimizer used during training is Adam (adaptive moment estimation) and its initial learning rate has been set to 0.001. In addition, CosineAnnealingLR is used as the scheduler. The whole network is trained for 200 epochs using a batch size of 8.
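The optimizer and scheduler settings above can be sketched as follows (PyTorch; the single convolution is a stand-in for the real network, and no data is actually loaded):

```python
import torch

model = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)       # stand-in for X_2
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)    # initial LR from the paper
# CosineAnnealingLR decays the learning rate along a cosine curve over T_max epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

lrs = []
for epoch in range(200):        # 200 epochs; each epoch would use batches of 8
    # ... forward pass, hybrid loss, backward pass would go here ...
    optimizer.step()            # placeholder step (no gradients computed)
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
```

With these settings the learning rate starts near 0.001 and is annealed smoothly toward zero by the final epoch.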

Mixed Loss Function
The microscopic images consist of various types of cells (WBCs, RBCs and platelets) and background (cytoplasm). The majority of the area in an image is covered by RBCs and cytoplasm, causing a typical class-imbalance problem. This imbalance can significantly dominate the loss function values while the model is training, and the model can fall into local minima. Thus, a new loss is computed which refines the results and makes the model more robust.

1. Binary Cross-Entropy (BCE) loss:
It compares the value of each pixel in the resultant mask image with the original pixel value in the ground truth and is shown in equation 3:

L_BCE = -(1/N) Σ_{i=1}^{N} [g_i log(p_i) + (1 - g_i) log(1 - p_i)]    (3)

where N corresponds to the total number of pixels in an image, p_i corresponds to the probability that a pixel belongs to the indicated class, and g_i is the ground truth for the pixel. If the value of g_i is 1, the pixel belongs to the WBC class; a value of g_i of 0 corresponds to the background class. 2. Dice Coefficient (DC) loss: Dice loss is used to handle the problem of class imbalance by adaptively weighing each class as per the number of pixels. It is given by equation 4:

L_DC = 1 - (2 Σ_i p_i g_i + δ) / (Σ_i p_i + Σ_i g_i + δ)    (4)
where p and g indicate the same as in equation 3, and δ ∈ [0,1] is a small constant value used to prevent divide-by-zero errors and to help gradients propagate in the network. 3. Resultant loss: Based on the above two losses, we propose a hybrid loss function which combines both so that the convergence speed increases and the output results are improved. The two are added to form equation 5:

L_Total = L_BCE + L_DC    (5)
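A NumPy sketch of the three losses (a simplified reference implementation with hypothetical function names, not the actual training code):

```python
import numpy as np

def bce_loss(p, g, eps=1e-7):
    """Binary cross-entropy (equation 3), averaged over all N pixels."""
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -np.mean(g * np.log(p) + (1 - g) * np.log(1 - p))

def dice_loss(p, g, delta=1.0):
    """Dice loss (equation 4); delta guards against division by zero."""
    intersection = np.sum(p * g)
    return 1.0 - (2.0 * intersection + delta) / (np.sum(p) + np.sum(g) + delta)

def total_loss(p, g):
    """Hybrid loss (equation 5): the sum of BCE and Dice losses."""
    return bce_loss(p, g) + dice_loss(p, g)

g = np.ones((8, 8))                        # ground-truth mask (all WBC)
perfect = total_loss(np.ones((8, 8)), g)   # near zero for a perfect prediction
worst = total_loss(np.zeros((8, 8)), g)    # large for an all-background prediction
```

A perfect prediction drives both terms to (nearly) zero, while an all-background prediction is penalized heavily by both terms, which is the behaviour the hybrid loss relies on.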

Evaluation Metrics

1. Intersection-Over-Union (IoU or Jaccard Index):
IoU is a very commonly used metric to evaluate the performance of a segmentation algorithm. It is defined as the area of overlap between the target and the predicted results divided by the area of their union. Hence, it measures how close the resultant output is to the ground truth and is represented by equation 6:

IoU = |G_f ∩ P_f| / |G_f ∪ P_f|    (6)
2. Dice Coefficient (DC or F1 Score): It is defined as twice the area of overlap of the respective classes divided by the total area of both images and is represented by equation 7:

DC = 2 |G_f ∩ P_f| / (|G_f| + |P_f|)    (7)

For equations 6 and 7, G_f and G_b signify the WBC and background regions as per the labels, and P_f and P_b correspond to the WBC and background regions as per the predicted results.

Results

Table 2: Comparison of the performance of state-of-the-art methods with the proposed method.

As shown in Fig. 4, the qualitative similarity between the results of the proposed double encoder-decoder model and the ground truth can be verified. This pixel-level segmentation model also has an upper hand over classic object-based models because it preserves important information at the pixel level, leading to much higher performance. Whole-slide images which contain multiple WBCs have been used to frame the solution for real-world leukemia detection, which is a major limitation of other state-of-the-art models. Moreover, it can be seen from Fig. 5 that the resultant masks obtained from a single encoder-decoder model also segment unwanted areas (highlighted in red). This limitation is removed by the proposed model, which pays attention only to leukocytes and ignores other unwanted areas in the images. In short, the proposed model can effectively segment leukocytes from images of blood slides.
In addition to the qualitative results, quantitative results have also been calculated using standard performance metrics such as IoU, DS and accuracy, as reported in Table 2. The model has attained an IoU of 94.68%, a DS of 97.19% and an accuracy of 98.24%.
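For reference, the IoU and Dice metrics of equations 6 and 7 reduce to a few lines on binary masks (a generic NumPy sketch, independent of the paper's evaluation code):

```python
import numpy as np

def iou_score(pred, gt):
    """Intersection-over-union (Jaccard index) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union

def dice_score(pred, gt):
    """Dice coefficient (F1 score) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

# toy 2x2 masks: one overlapping pixel out of three foreground pixels total
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
```

Note that DC = 2·IoU / (1 + IoU), so the two metrics always rank predictions in the same order; Dice merely weights the overlap more generously.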

Ablation Study
The performance of the proposed segmentation model was tested using three ablation studies. A fair comparison has been made qualitatively (Fig. 6) and quantitatively (Table 3) amongst the various models. Fig. 7 shows the progression of L_Total over the number of epochs for all three studies. The first ablation study was conducted to determine whether two cascaded UNets were superior to a single UNet. In this experiment, the basic encoder (ResNet18) and all the hyper-parameters were kept the same. Although the size and the number of trainable parameters of the model increased twofold, a significant improvement in the segmentation could be observed: IoU increased by 0.62%, Dice score by 0.07% and accuracy by 0.35%. As per the results in Table 2, Lu et al. [15] segmented WBCs using UNet++ and ResNet34, and their method outperformed our suggested model in terms of IoU score. Hence, the same model was used to cross-verify the results on our dataset too. The scores of the various performance metrics imply that our model is more suitable for the segmentation task. Under the third ablation study, a double model of UNet++ and ResNet34 was designed and compared with our model. After various experiments, it can be concluded that the performance patterns obtained by this model are quite similar to those obtained from our proposed model. However, this model's size and number of trainable parameters are higher, which may have a detrimental impact on its practical application.

Conclusion
In this research, we have proposed an innovative design that involves the usage of a double encoder-decoder network to segment WBCs from microscopic blood smear images. The resultant single-channel feature map obtained from the first network is used as an attention map that focuses on the 'important' areas in an image. The second decoder in the network generates the final resultant mask, which is used for performance evaluation. A fair comparison between various strategies has been made to evaluate the advantage of this novel architecture. At this stage, this work advocates the use of identical architectures for the two sub-networks. However, combinations of various novel and pre-trained architectures should be investigated in future work.