3.1. The design of YOLO-BC detection pipeline
For the original blood cell dataset and the YOLO-BC model, this study designed the training mechanism and model architecture shown in Fig. 1. The original dataset contained only 364 image samples. Data augmentation produced a dataset with more comprehensive scene coverage, which was then divided into a training set (765 images), a validation set (73 images) and a testing set (36 images). In total there were 11,789 blood cell instances, including 10,031 red blood cells, 898 white blood cells, and 851 platelets. The YOLO-BC model was then obtained by training on this dataset.
For blood cell detection and counting, YOLO-BC is designed to automatically identify and count different types of blood cells in blood images, such as red blood cells, white blood cells, and platelets.
YOLO-BC performs image feature extraction and detection through a convolutional neural network (CNN): the image is divided into a grid, and candidate object bounding boxes and categories are predicted for each grid cell (Fig. 2 (a)). This enables YOLO-BC to maintain high detection speed and accuracy while processing large numbers of blood cells.
As shown in Fig. 2 (b), taking white blood cell detection as an example, the YOLO-BC algorithm outputs bounding boxes and the corresponding confidence scores for the detected targets, thereby achieving detection of the corresponding blood cell types. This process design is important for adopting YOLO-BC in practical medical applications. Traditional medical diagnosis often relies on manual visual inspection and analysis, and the introduction of YOLO-BC-based visual detection software has the potential to provide doctors with a fast and accurate method for blood cell-related diagnosis.
Taking the reading of microscope images as an example, doctors usually need to observe and analyze cells in a large number of blood samples and determine whether there are abnormalities. With visual detection software built on the YOLO-BC pipeline, doctors only need to input microscope images into the system, and the software automatically identifies and localizes blood cells, providing fast and accurate diagnostic results.
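As a concrete illustration, the sketch below shows how such an inference call might look through the standard Ultralytics YOLOv8 interface on which YOLO-BC is built; the weight file and image path are hypothetical placeholders rather than artifacts released with this work.

```python
from ultralytics import YOLO

# Hypothetical usage sketch: load trained YOLO-BC weights (file name assumed)
# and run them on a single microscope image of a blood smear.
model = YOLO("yolo_bc_best.pt")                       # assumed path to trained weights
results = model.predict("blood_smear.jpg", conf=0.25)

for r in results:
    # each detection: class index, confidence score, and box in (x1, y1, x2, y2) pixels
    for box, cls, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        print(f"class={int(cls)}  conf={float(conf):.2f}  box={box.tolist()}")
```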
3.2. YOLO-BC algorithm
Based on the YOLOv8 model and incorporating the aforementioned EMSA and ODConv modules, this study designed and implemented an improved network architecture called YOLO-BC (YOLOv8 for Blood Cell), as illustrated in Fig. 3.
EMSA is placed at the end of the C2f module that feeds the second Concat structure in the neck, and the third Conv module in the backbone is replaced with ODConv. The EMSA module effectively aggregates multi-scale feature information, giving the network better perception at different scales; placing it at the end of the C2f module enhances that module's feature representation capability and improves detection accuracy. Introducing the ODConv module into the backbone strengthens the network's ability to extract target features, which further improves detection performance.
The YOLOv8 model itself also brings significant improvements to the network structure. The backbone is primarily used for feature extraction: YOLOv8 replaces the CSP (Cross-Stage Partial) module of YOLOv5 with a lightweight C2f (Context to Features) module, enhancing feature representation through dense residual structures. The tail of the backbone uses an SPPF (Spatial Pyramid Pooling – Fast) layer to enlarge the receptive field and capture features at different levels of the scene. The neck is mainly used for feature fusion, employing a path aggregation network together with the C2f module to fuse feature maps of different scales from the three stages of the backbone.
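For reference, a minimal sketch of the SPPF idea is shown below: three chained 5×5 max-pooling operations approximate pooling at several kernel sizes, and concatenating their outputs enlarges the receptive field at low cost. The plain Conv2d layers here stand in for the Conv blocks (convolution + batch normalization + SiLU) used in the actual YOLOv8 implementation.

```python
import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    """Minimal SPPF sketch: one max-pool applied three times in sequence,
    with the intermediate results concatenated before a 1x1 projection."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, kernel_size=1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)       # receptive field ~ k
        y2 = self.pool(y1)      # receptive field ~ 2k - 1
        y3 = self.pool(y2)      # receptive field ~ 3k - 2
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```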
In blood cell images, occlusion and overlap between blood cells add to the challenge of object detection. Thus, the attention-based model shown in Fig. 3 was introduced to overcome the drawbacks of YOLOv8 in blood cell detection, such as repeated counting of overlapping blood cells and low recognition accuracy on small-target blood cells.
3.3. Efficient Multi-Scale Attention
The EMSA (Efficient Multi-Scale Attention) mechanism is a novel attention module that has gained considerable attention recently [9, 10]. Its core idea is to reduce computational costs while preserving information on each channel.
By reshaping a portion of the channels into the batch dimension, the EMSA mechanism can process channel information more efficiently, as shown in Fig. 4. By grouping the channel dimension into multiple sub-features, the module ensures an even distribution of spatial semantic features within each feature group. Assume the input feature map is X, with m samples in total, n features per sample, and the feature map divided into G groups. For each group g, the mean \({\mu }_{g}\) and variance \({{\sigma }_{g}}^{2}\) of the features within the group are computed as [10]:
$${\mu }_{g}=\frac{1}{(n/G)HW}\sum _{i\in g}{X}_{i},\quad {{\sigma }_{g}}^{2}=\frac{1}{(n/G)HW}\sum _{i\in g}{\left({X}_{i}-{\mu }_{g}\right)}^{2} \left(1\right)$$
Here, H and W denote the height and width of the feature map, respectively.
Then, it normalizes the features \({X}_{i}\) within the group as:
$${Y}_{i}=\frac{{X}_{i}-{\mu }_{g}}{\sqrt{{{\sigma }_{g}}^{2}+\epsilon }} \left(2\right)$$
where ε is a small positive number to prevent the denominator from being zero. Finally, the normalized feature Y is restored to its original shape.
This grouping improves the expression capability of different blood cells’ pixel features and reduces computational costs, enhancing model efficiency.
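The snippet below is a direct transcription of Formulas (1) and (2): the channel dimension is folded into G groups, the per-group mean and variance are computed over the remaining channel and spatial dimensions, and the normalized tensor is restored to its original shape. Variable names follow the notation above; the snippet is illustrative rather than the exact implementation.

```python
import torch

def group_normalize(x, G, eps=1e-5):
    """x: feature map of shape (m, n, H, W) with n channels divided into G groups."""
    m, n, H, W = x.shape
    xg = x.reshape(m, G, n // G, H, W)                           # fold channels into G groups
    mu = xg.mean(dim=(2, 3, 4), keepdim=True)                    # Eq. (1): group mean
    var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)    # Eq. (1): group variance
    y = (xg - mu) / torch.sqrt(var + eps)                        # Eq. (2): normalization
    return y.reshape(m, n, H, W)                                 # restore the original shape
```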
The EMSA mechanism encodes global information and recalibrates channel weights in each parallel branch, further improving channel weight calibration. Taking advantage of this novel EMSA structure, the C2f module of YOLOv8 is improved by introducing a multi-scale attention mechanism capable of addressing pixel-level feature differences across blood cell categories, especially for small targets like platelets, and the backbone is modified accordingly at P4/16, after the C2f module.
In addition, the EMSA mechanism uses cross-dimensional interaction to aggregate the output features of its two parallel branches. This allows the module to capture pairwise relationships at the pixel level of blood cells and to better understand the connections between information at different positions in the image. This cross-dimensional interaction enhances feature consistency and accuracy, further strengthening the expressive power of the model. In the backbone, EMSA is added before the bottleneck module (Fig. 5).
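A simplified, illustrative sketch of such an EMSA-style block is given below, assuming the grouping into the batch dimension, a 1x1 branch that encodes directional global context, a 3x3 branch for local context, and a simplified channel-wise reweighting in place of the full cross-spatial aggregation. It conveys the structure only and is not the exact module used in YOLO-BC.

```python
import torch
import torch.nn as nn

class EMSASketch(nn.Module):
    """Simplified EMSA-style grouped attention (channels must be divisible by groups)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(c, c)                    # per-channel normalization within a group

    def forward(self, x):
        b, c, h, w = x.shape
        xg = x.reshape(b * self.g, c // self.g, h, w)   # fold groups into the batch dimension
        # 1x1 branch: directional global context along height and width
        ctx = self.conv1x1(torch.cat([self.pool_h(xg), self.pool_w(xg).transpose(2, 3)], dim=2))
        ctx_h, ctx_w = torch.split(ctx, [h, w], dim=2)
        x1 = self.gn(xg * ctx_h.sigmoid() * ctx_w.transpose(2, 3).sigmoid())
        # 3x3 branch: local spatial context
        x2 = self.conv3x3(xg)
        # simplified aggregation: each branch reweights the other via channel attention
        w1 = torch.softmax(x1.mean(dim=(2, 3)), dim=1)[..., None, None]
        w2 = torch.softmax(x2.mean(dim=(2, 3)), dim=1)[..., None, None]
        out = (x1 * w2 + x2 * w1).sigmoid() * xg
        return out.reshape(b, c, h, w)
```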
3.4. Omni-Dimensional Dynamic Convolution
ODConv (Omni-Dimensional Dynamic Convolution) [11] is a full-dimensional dynamic convolution method that extends CondConv [12]. It considers the dynamics of multiple dimensions, such as the spatial domain, input channels, and output channels, and improves model performance through a parallel strategy and multi-dimensional attention mechanisms that learn complementary attention.
The core idea of ODConv is to decompose the convolution operation into multiple sub-operations and introduce dynamic weights for feature representation. By using matrix decomposition techniques [13], it decomposes the convolution kernel into the product of two low-rank matrices, reducing the number of parameters and computations, and improving model efficiency.
ODConv acts as a plug-and-play operation that can be conveniently embedded into the existing YOLOv8 model by selectively replacing some of its Conv modules, addressing the low detection accuracy for red blood cells. This study further optimizes ODConv into a "version 2" by appending a batch normalization layer and adopting the SiLU activation function, and applies the adjustment to the Conv module at the P3/8 level of YOLOv8, covering the spatial size, input channel size and output channel size. All subsequent results use ODConv version 2 in YOLO-BC.
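The sketch below illustrates the omni-dimensional idea described above: four attention vectors (kernel-wise, spatial, input-channel, and output-channel) are predicted from the input and multiplied into a small bank of kernels before a single grouped convolution, and a batch normalization layer plus SiLU activation follow the convolution as in the "version 2" variant. The hyper-parameters here are assumptions for illustration, not the exact configuration of YOLO-BC.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConvSketch(nn.Module):
    """Sketch of an omni-dimensional dynamic convolution followed by BN + SiLU."""
    def __init__(self, c_in, c_out, k=3, n_kernels=4, reduction=16):
        super().__init__()
        self.k, self.n = k, n_kernels
        self.weight = nn.Parameter(torch.randn(n_kernels, c_out, c_in, k, k) * 0.02)
        hidden = max(c_in // reduction, 4)
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_in, hidden, 1), nn.ReLU())
        self.att_spatial = nn.Conv2d(hidden, k * k, 1)      # attention over kernel positions
        self.att_in = nn.Conv2d(hidden, c_in, 1)            # attention over input channels
        self.att_out = nn.Conv2d(hidden, c_out, 1)          # attention over output channels
        self.att_kernel = nn.Conv2d(hidden, n_kernels, 1)   # attention over the kernel bank
        self.bn = nn.BatchNorm2d(c_out)                     # "version 2": BN + SiLU afterwards
        self.act = nn.SiLU()

    def forward(self, x):
        b, c_in, h, w = x.shape
        ctx = self.fc(x)                                                    # (b, hidden, 1, 1)
        a_sp = torch.sigmoid(self.att_spatial(ctx)).view(b, 1, 1, 1, self.k, self.k)
        a_in = torch.sigmoid(self.att_in(ctx)).view(b, 1, 1, c_in, 1, 1)
        a_out = torch.sigmoid(self.att_out(ctx)).view(b, 1, -1, 1, 1, 1)
        a_k = torch.softmax(self.att_kernel(ctx).view(b, -1, 1, 1, 1, 1), dim=1)
        # combine the kernel bank with all four attentions, yielding one kernel per sample
        weight = (a_k * a_sp * a_in * a_out * self.weight.unsqueeze(0)).sum(dim=1)
        weight = weight.view(-1, c_in, self.k, self.k)                      # (b*c_out, c_in, k, k)
        out = F.conv2d(x.view(1, -1, h, w), weight, padding=self.k // 2, groups=b)
        return self.act(self.bn(out.view(b, -1, h, w)))
```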
3.5. Dataset
Data augmentation was performed on the original BCCD dataset [13, 14], thereby improving the model's ability to generalize over unseen data. Each image has a 50% probability of being horizontally flipped. This transformation aids the model in learning to recognize objects from different perspectives, enhancing its ability to generalize and potentially improving detection performance. Similarly, each image has a 50% chance of undergoing a vertical flip.
This augmentation is particularly helpful when objects may appear upside down or in different vertical orientations in real-world scenarios. In addition, each image is randomly cropped by 0 to 15%, which helps the model focus on different parts of an image. Furthermore, the exposure of the images is varied within a range of -20% to +20%. In total, 874 clinical blood cell image samples were used as the dataset. Among them, 765 samples were randomly selected for the training set, 73 samples for the validation set, and 36 samples for the test set.
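A minimal augmentation sketch matching this policy is shown below. It is a hypothetical helper operating on a NumPy image and normalized YOLO-style boxes rather than the exact pipeline used to build the dataset; the 0–15% crop is indicated only as a comment, since it additionally requires clipping and re-normalizing the boxes.

```python
import random
import numpy as np

def augment(image, boxes):
    """image: HxWx3 uint8 array; boxes: (N, 4) array of normalized (cx, cy, w, h)."""
    boxes = boxes.copy()
    if random.random() < 0.5:                        # 50% horizontal flip
        image = image[:, ::-1]
        boxes[:, 0] = 1.0 - boxes[:, 0]
    if random.random() < 0.5:                        # 50% vertical flip
        image = image[::-1, :]
        boxes[:, 1] = 1.0 - boxes[:, 1]
    factor = 1.0 + random.uniform(-0.2, 0.2)         # exposure variation of -20% to +20%
    image = np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)
    # a 0-15% random crop would be applied here, followed by box clipping/re-normalization
    return image, boxes
```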
In the training split of the BCCD dataset, there are 739 platelet instances, 8,814 red blood cell instances, and 789 white blood cell instances in total (Fig. 6 (a)). At the same time, the positions of the detection boxes were normalized to the range 0 to 1, and the widths and heights of the boxes were found to be concentrated around 0.2. Here, the x and y axes represent the relative positions of all detection boxes in the blood cell dataset and therefore have no units (Fig. 6 (c)), and the width and height axes represent the relative sizes of the detection boxes (Fig. 6 (d)).
A label correlogram can facilitate the detection of patterns or correlations in the spatial arrangement of object annotations across distinct classes and scales. This technique can uncover instances where particular classes co-occur frequently within a single image or tend to appear at particular scales. Figure 7 shows the label correlogram of the blood cell dataset in xywh space, displaying the pairwise relationships between the x, y, width, and height variables of each label's detection box. The size distribution of the blood cell dataset is relatively balanced, making it well suited for training.
3.6. Statistics
In this study, precision (Precision), recall (Recall), F1 score (F1), mean average precision (mAP) [15, 16], number of parameters (Params), computational complexity (FLOPs), and frames per second (FPS) [17] were used as evaluation metrics for detection performance. A detailed introduction of these metrics and their corresponding formulas follows:
(1) Precision
Precision represents the proportion of true positives among the samples predicted as positive (i.e., predicted as the blood cell's actual class), indicating the detection accuracy of the model. Its formula is [17]:
$$Precision=\frac{TP}{TP+FP} \left(3\right)$$
In Formula (3), TP represents the number of true positives and FP represents the number of false positives.
(2) Recall
Recall represents the proportion of true positives detected among all positive samples, measuring how completely the model retrieves the targets. Its formula is:
$$Recall=\frac{TP}{TP+FN} \left(4\right)$$
In Formula (4), TP represents the number of true positives, and FN represents the number of false negatives.
(3) F1 score
The F1 score combines Precision and Recall to evaluate the performance of the model comprehensively. Its formula is:
$$F1=\frac{2\times Precision\times Recall}{Precision+Recall} \left(5\right)$$
(4) mAP@50 and mAP@50–95
mAP (Mean Average Precision) is a frequently employed measure for evaluating blood cell object detection performance, quantifying how well the inference results match the ground-truth labels. Its overall formula is:
$$mAP=\frac{1}{n}\sum _{i=1}^{n}{AP}_{i} \left(6\right)$$
In Formula (6), n represents the number of object categories and \({AP}_{i}\) represents the average precision of the i-th category. mAP@50 and mAP@50–95 are the mAP values obtained with the IoU threshold set to 0.5 and ranging from 0.5 to 0.95, respectively.
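As a small worked example, the helper below evaluates Formulas (3) through (6) from pooled TP/FP/FN counts and a list of per-class AP values; the numbers in the usage line are hypothetical.

```python
def detection_metrics(tp, fp, fn, ap_per_class, eps=1e-9):
    """Precision (3), Recall (4), F1 (5) from pooled counts; mAP (6) as the mean per-class AP."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    m_ap = sum(ap_per_class) / len(ap_per_class)     # AP at IoU 0.5 gives mAP@50; averaging
    return precision, recall, f1, m_ap               # over IoU 0.5-0.95 gives mAP@50-95

# hypothetical example with three classes (RBC, WBC, platelets)
print(detection_metrics(tp=900, fp=50, fn=40, ap_per_class=[0.93, 0.97, 0.90]))
```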