This section describes the dataset of peripheral blood cell images, Siamese twin network (STN) training with EfficientNet as the base model, N-way few-shot validation and testing, the performance metrics, and the generation of class activation maps between the query image and the support set.
Table 1
Summary of the cell types in the dataset, number of images for each cell type, number of images used for Siamese twin network training, few-shot validation, and few-shot testing. N: number of images.
Blood Cell Type | Training (N) | Validation (N) | Testing (N) | Total (N) |
Basophils | 125 | 125 | 968 | 1218 |
Eosinophils | 125 | 125 | 2867 | 3117 |
Neutrophils | 125 | 125 | 3079 | 3329 |
Lymphocytes | 125 | 125 | 964 | 1214 |
Monocytes | 125 | 125 | 1170 | 1420 |
Immature Granulocytes | 125 | 125 | 2645 | 2895 |
Erythroblasts | 125 | 125 | 1301 | 1551 |
Platelets | 125 | 125 | 2098 | 2348 |
Total | 1000 | 1000 | 15092 | 17092 |
Dataset
In the current work, an openly accessible dataset of normal peripheral blood cells from the Hospital Clinic of Barcelona was used. The dataset contains 17092 RGB images captured with the CellaVision DM96 analyzer [30]. The images cover eight blood cell types: neutrophils, basophils, eosinophils, lymphocytes, monocytes, immature granulocytes, erythroblasts, and platelets. The predominant image resolution is 360 × 363 pixels, with very few images at 360 × 360 or 359 × 360. The images were annotated with cell types by expert clinical pathologists from the same clinic, and these annotations were used as the ground-truth labels. Complete details of the dataset are given in Table 1. Sample images (one per class) from the dataset are shown in Fig. 1.
Generation of Image Pairs
We considered 125 images from each class, a total of 1000 images, for training. Twenty image pairs were created for each image, giving a total of 20000 image pairs for STN training. When both images of a pair belong to the same cell type, the pair is termed a positive pair and labelled 0; when the images belong to different cell types, the pair is termed a negative pair and labelled 1. To avoid class imbalance, 10000 positive pairs and 10000 negative pairs were randomly produced: of the 20 pairs generated for each image, ten were positive and ten were negative. To maintain uniformity among the negative pairs, we ensured that each image was paired with each of the other seven cell types at least once. Before feeding the image pairs to the STN model, the images were verified to have intensities in the range zero to 255, a primary requirement for EfficientNets.
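The following is a minimal sketch of how such a balanced set of pairs could be generated; it assumes the training images are held in a dictionary keyed by cell type, and the function name and structure are illustrative rather than the original implementation.

```python
import random
import numpy as np

def make_pairs(images_by_class, pairs_per_image=20):
    """Generate balanced positive (label 0) and negative (label 1) image pairs."""
    classes = list(images_by_class.keys())
    pairs, labels = [], []
    for cls in classes:
        others = [c for c in classes if c != cls]
        for img in images_by_class[cls]:
            # Ten positive pairs: partner drawn from the same cell type.
            for _ in range(pairs_per_image // 2):
                pairs.append((img, random.choice(images_by_class[cls])))
                labels.append(0)
            # Ten negative pairs: each of the other seven types at least once,
            # the remaining partners drawn at random from those types.
            neg_classes = others + random.choices(others, k=pairs_per_image // 2 - len(others))
            for neg_cls in neg_classes:
                pairs.append((img, random.choice(images_by_class[neg_cls])))
                labels.append(1)
    return pairs, np.asarray(labels)
```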
Siamese Neural Network Architecture
Figure 2 shows the proposed STN architecture with EfficientNet-B3 as the base model. The final softmax layer of the base model was discarded and a global average pooling layer was added. The EfficientNet-B3 model transforms an input sample from image space to embedding space via a mapping \(\varphi(\cdot)\). For an input image pair \({X}_{1}\) and \({X}_{2}\), \({\varphi (X}_{1})\) and \({\varphi (X}_{2})\) are their respective embeddings in the latent space. Since the goal of the STN is to pull the embeddings of similar pairs closer together and push those of dissimilar pairs apart, the embeddings are compared quantitatively via their absolute differences in a lambda layer. A single sigmoid neuron then outputs a probability between zero and one, where a value below 0.5 indicates a positive (similar) pair and a value of 0.5 or above indicates a negative (dissimilar) pair.
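A condensed sketch of this architecture in Keras is shown below, assuming the TensorFlow/Keras toolchain used in this work; the function name and head configuration are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB3

def build_stn(input_shape=(300, 300, 3)):
    # Shared backbone: EfficientNet-B3 without its softmax head, with global
    # average pooling producing a 1536-dimensional embedding phi(X).
    backbone = EfficientNetB3(include_top=False, weights="imagenet",
                              input_shape=input_shape, pooling="avg")

    x1 = layers.Input(shape=input_shape)
    x2 = layers.Input(shape=input_shape)
    e1, e2 = backbone(x1), backbone(x2)

    # Lambda layer: element-wise absolute difference of the two embeddings (Eq. 3).
    abs_diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([e1, e2])

    # Single sigmoid neuron: weighted sum of the absolute differences squashed
    # to (0, 1); outputs below 0.5 indicate a positive (similar) pair.
    output = layers.Dense(1, activation="sigmoid")(abs_diff)
    return Model(inputs=[x1, x2], outputs=output), backbone
```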
EfficientNet and Tuning of Hyperparameters
With the softmax layer of the EfficientNet-B3 model removed, the base model outputs the feature tensor of its final convolution layer, of size 10×10×1536. This feature tensor is globally averaged to yield a feature vector of length 1536, which is passed to the lambda layer for quantitative comparison with the corresponding 1536-length feature vector of the paired image, as described earlier. The base model contains approximately twenty million parameters; therefore, to reduce computational time and to leverage transfer learning, only the parameters (weights and biases) of the final ten percent of the EfficientNet-B3 layers were allowed to update during backpropagation, while the remaining parameters were frozen (non-trainable).
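Assuming the `backbone` object returned by the sketch above, freezing all but the final ten percent of the layers could look as follows; the exact split used in the original work is not specified beyond this percentage.

```python
# Keep roughly the first 90 % of the backbone layers frozen and
# fine-tune only the final ~10 % during contrastive training.
model, backbone = build_stn()
cutoff = int(len(backbone.layers) * 0.9)
for layer in backbone.layers[:cutoff]:
    layer.trainable = False
for layer in backbone.layers[cutoff:]:
    layer.trainable = True
```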
Several hyperparameters of the proposed model can be tuned, such as the mini-batch size, learning rate, number of epochs, and the choice of optimizer for gradient descent. We evaluated models over different combinations of these hyperparameters with four adaptive gradient descent optimizers, namely RMSprop [31], Adam [32], Nadam [33], and Adadelta [34]. For few-shot testing, we selected the model that gave the best accuracy during few-shot validation, as described in Table 2.
Contrastive Training
For contrastive training, an NVIDIA Tesla P100 GPU with 26 GB of RAM, available in Google Colab Pro, was used, with the Keras API on a TensorFlow backend. As required by EfficientNet-B3, the images were center-cropped to a resolution of 300×300. The loss used for updating the model parameters through backpropagation is the contrastive loss (\({C}_{l}\)) computed using Eq. (1) given below:
$${C}_{l}=\left(1-y\right)*{\sigma \left(d\left(\varphi \left({X}_{1}\right), \varphi \left({X}_{2}\right)\right)\right)}^{2}+y*{\left\{\text{max}\left(0, m-\sigma \left(d\left(\varphi \left({X}_{1}\right), \varphi \left({X}_{2}\right)\right)\right)\right)\right\}}^{2} \quad \left(1\right)$$
where \(d\left(\varphi \left({X}_{1}\right), \varphi \left({X}_{2}\right)\right)\) is the distance metric, which in this study is the weighted sum of the absolute differences between the embeddings \({\varphi (X}_{1})\) and \({\varphi (X}_{2})\). Here, \(y\) is the true label of the image pair, \(m\) is the distance margin, set to one in the current study, and \(\sigma\) is the sigmoid function described in Eq. (2).
$$\sigma \left(d\left(.\right)\right)=\frac{1}{1+{e}^{-d(.)}} \left(2\right)$$
Let \({a}_{d}\) be the tensor of absolute differences between the embedding tensors, as given in Eq. (3).
$${a}_{d}^{i}=\left|\varphi \left({X}_{1}^{i}\right)-\varphi \left({X}_{2}^{i}\right)\right| \quad \forall i \quad \left(3\right)$$
The distance metric \(d\left(\varphi \left({X}_{1}\right), \varphi \left({X}_{2}\right)\right)\) is then expressed as in Eq. (4).
$$d\left(\varphi \left({X}_{1}\right), \varphi \left({X}_{2}\right)\right)=\sum _{i=1}^{N}{w}_{i}{a}_{d}^{i}+{b}_{i} \left(4\right)$$
In Eqs. (3) and (4), \(\varphi \left({X}_{1}^{i}\right)\) and \(\varphi \left({X}_{2}^{i}\right)\) are the ith values of the embedding vectors \({\varphi (X}_{1})\) and \({\varphi (X}_{2})\) respectively, \(N\) is the length of the tensor \({a}_{d}\), and \({w}_{i}\) and \({b}_{i}\) are the weights and biases that need to be learned during backpropagation.
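A possible Keras implementation of this loss is sketched below; it assumes the STN's sigmoid output \(\sigma \left(d\left(\varphi \left({X}_{1}\right), \varphi \left({X}_{2}\right)\right)\right)\) is passed as `y_pred`, with the weighted sum of Eq. (4) already computed by the final dense layer.

```python
import tensorflow as tf

def contrastive_loss(y_true, y_pred, margin=1.0):
    # Eq. (1): y_true is 0 for similar pairs and 1 for dissimilar pairs;
    # y_pred is sigma(d(.)), the sigmoid of the learned weighted distance.
    y_true = tf.cast(y_true, y_pred.dtype)
    positive_term = (1.0 - y_true) * tf.square(y_pred)
    negative_term = y_true * tf.square(tf.maximum(0.0, margin - y_pred))
    return tf.reduce_mean(positive_term + negative_term)
```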
Support Set
A support set of eight images, required for few-shot validation and testing, is formed from the images in the test set. To represent each class in the support set, one image is randomly sampled from the images of the corresponding class for N-way k-shot validation and testing, as detailed below. Whenever k is greater than one, an entirely new support set is used for each shot.
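A minimal sketch of this sampling step is given below, assuming the test images are grouped by class in a dictionary; the function name is illustrative.

```python
import random

def sample_support_set(test_images_by_class):
    # Draw one image per class from the test set (an eight-image support set).
    # A fresh support set is drawn for every shot when k > 1.
    return {cls: random.choice(images) for cls, images in test_images_by_class.items()}
```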
N-way k-Shot Validation and Testing
The value of N in N-way is the number of classes, which is set to eight in the present study, and k in k-shot ranges from one to five for few-shot learning. If k is equal to one, it is called one-shot learning; if k is equal to two, it is two-shot learning, and so on. In this study, we performed 8-way 1-shot, 8-way 2-shot, 8-way 3-shot, and 8-way 5-shot validation and testing for multiclass classification of the eight cell types. The class of the query image is decided based on the highest similarity with respect to the images in the support set. Mathematically, the class prediction for a query image \({x}_{q}\) under 8-way k-shot learning is given in Eq. (5).
$${Y}_{q} = \text{argmin}\left\{\sum _{i=1}^{k}{\Psi \left({X}_{q}, {X}_{s}\right)}_{i}\right\} \quad \left(5\right)$$
Above, \({X}_{q}=\left\{{x}_{q}^{c}={x}_{q}:1\le c\le N \right\}\), i.e., \({X}_{q}\) is the set in which the query image \({x}_{q}\) is repeated N times to match the number of images in the support set \({X}_{s}=\left\{{x}_{s}^{c}:1\le c\le N\right\}\), so that \({x}_{q}\) can be compared with all of \({X}_{s}\) in a single forward pass. \({\Psi }\left({X}_{q}, {X}_{s}\right)\) is the prediction of the STN model, a vector of N similarity scores between \({X}_{q}\) and \({X}_{s}\) (the STN output is below 0.5 for similar pairs, so the smallest score corresponds to the most similar support image), and the summation runs over the k support sets. Finally, \({Y}_{q}\) is the predicted label for the query image \({x}_{q}\), an integer between zero and seven. The overall accuracy for predicting a class c during k-shot validation/testing is computed using Eq. (6), where \({N}_{c}^{{\prime }}\) is the number of correctly predicted images for class c and \({N}_{c}\) is the total number of images in class c.
$$\text{Overall Accuracy} = \frac{\sum _{i=1}^{k}{\left(\frac{{N}_{c}^{{\prime }}}{{N}_{c}}\right)}_{i}}{k} \quad \left(6\right)$$
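A sketch of the prediction and scoring of Eqs. (5) and (6) is given below, assuming each support set is an array with one image per class in a fixed class order and that `model` is the trained STN; the names and data layout are illustrative.

```python
import numpy as np

def predict_query(model, query_image, support_sets):
    # Eq. (5): tile the query N times so all comparisons run in a single pass,
    # sum the STN outputs over the k support sets, and pick the class with the
    # smallest summed score (smaller output = more similar pair).
    n_classes = support_sets[0].shape[0]
    query_batch = np.repeat(query_image[None, ...], n_classes, axis=0)
    scores = np.zeros(n_classes)
    for support in support_sets:                  # one support set per shot
        scores += model.predict([query_batch, support], verbose=0).ravel()
    return int(np.argmin(scores))

def overall_accuracy(per_shot_correct, per_shot_total, k):
    # Eq. (6): average of the per-shot accuracies for a class over the k shots.
    return sum(c / t for c, t in zip(per_shot_correct, per_shot_total)) / k
```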
Creation of Visual Saliency Maps
To provide explainability for the network's decisions, saliency maps are created using the output of the lambda layer of the STN as the weight tensor. The activation maps \({A}^{H\times W\times C}\) of the final convolution layer of EfficientNet-B3 for the query image are multiplied by this weight vector to obtain the weighted activation maps \({A}_{w}^{H\times W\times C}\), as described by Eq. (7).
$${A}_{w}^{H\times W\times i}= {A}^{H\times W\times i}* {a}_{d}^{i} \quad \forall i \quad \left(7\right)$$
In Eq. (7), \({a}_{d}^{i}\) is the weight tensor already described in Eq. (3). Afterward, an average activation map \({A}_{m}^{H\times W}\) of spatial size \(H\times W\) is obtained by averaging all \(C\) weighted activation maps, as described in Eq. (8).
$${A}_{m}^{H\times W}=ReLU\left(\frac{1}{C}\sum _{i=1}^{C}{A}_{w}^{H\times W\times i} \right) \left(8\right)$$
The negative values in the mean activation map are removed using the \(ReLU\) (rectified linear unit) activation function, which is given in Eq. (9).
$$ReLU\left(z\right)= \begin{cases} z & \text{if } z>0\\ 0 & \text{if } z\le 0\end{cases} \quad \left(9\right)$$
For EfficientNet-B3, \(H\times W\times C\) = \(10\times 10\times 1536\); the depth \(C\) of the activation maps and the length \(N\) of the weight tensor are identical. Finally, the 10×10 coarse activation map \({A}_{m}^{H\times W}\) is resized to the spatial resolution of the input query image (300×300) using the Python-based scikit-image toolbox. To identify the most highly activated regions when the query image is compared with the images in the support set, the resized heatmaps are overlaid onto the corresponding RGB images in the support set, highlighting the most similar/dissimilar regions.
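A minimal NumPy/scikit-image sketch of Eqs. (7)-(9) is shown below; extracting the final-convolution activations and the lambda-layer output from the trained STN (for example via an auxiliary Keras model) is assumed and not shown.

```python
import numpy as np
from skimage.transform import resize

def saliency_map(activations, abs_diff_weights, out_size=(300, 300)):
    # activations:      H x W x C final-convolution feature maps for the query
    #                   image (10 x 10 x 1536 for EfficientNet-B3).
    # abs_diff_weights: length-C lambda-layer output (Eq. 3) for the pair.
    weighted = activations * abs_diff_weights.reshape(1, 1, -1)   # Eq. (7)
    mean_map = weighted.mean(axis=-1)                             # Eq. (8)
    mean_map = np.maximum(mean_map, 0.0)                          # ReLU, Eq. (9)
    # Upsample the 10 x 10 coarse map to the 300 x 300 input resolution.
    return resize(mean_map, out_size)
```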