A real-time detector of chicken healthy status based on modified YOLO

In modern times, the development of an intelligent system that can automatically detect and recognize poultry diseases is vital for efficient poultry farming and for reducing human workloads. This paper presents a real-time detector that can analyze frames captured by monitoring cameras and simultaneously detect chickens and identify their healthy statuses. To overcome the challenge of chickens appearing small and at varying scales in monitoring camera frames, we integrate a scale-aware receptive field enhancement module into the YOLOv5 algorithm to enlarge the receptive field for chickens in the frames, thus improving detection accuracy. In addition, we utilize a slide weighting loss function to calculate the classification loss. This helps the network concentrate on hard-to-classify samples, leading to an improved ability to recognize the healthy statuses of chickens with greater precision. Experimental results demonstrate that the proposed detector outperforms the original YOLOv5 and other one-stage object detectors, thus meeting the requirements for automated poultry health monitoring.

Figure 1 shows an example of an intelligent system running in a chicken ranch. The system is composed of an autonomous patrolling robot and SONY's low-power-consumption IoT module that includes SPRESENSE [1], multiple sensors, and a camera (see Fig. 1a). However, an intelligent system of this kind faces a significant challenge: detecting chickens and recognizing their healthy statuses in real time.
In order to solve the issue of detecting chickens and recognizing their healthy statuses in real time, some researchers have tried traditional machine learning algorithms. For example, Cedric Okinda et al. [2] proposed a machine learning method for classifying sick chickens using an SVM with two-dimensional shape descriptors and motion features. Jianqin Xu et al. [3] improved the Bayesian formula and applied a Bayesian network to classify sick chickens. Bi Minna et al. [4] utilized a method to segment chicken head regions based on color features, then extracted visual features to identify sick chickens. Canlong He et al. [5] proposed a gait detection method that uses logistic regression to classify healthy and unhealthy chickens by determining whether their legs are healthy or not. Zhuang et al. [6] proposed an SVM-based method for classifying sick and healthy chickens using their body posture features. The authors in [7] introduced a classification-tree-based method to classify species-specific behaviors exhibited by chickens. Nevertheless, traditional machine learning methods usually require manual extraction of various features from sick chickens, which makes it difficult to achieve high-precision detection results. Compared with traditional machine learning methods, deep learning techniques have emerged as a promising alternative due to their ability to automatically extract a large amount of useful feature information. As a result, there has been a trend towards using deep learning techniques to identify sick chickens.

Fig. 1 An example of an automatic IoT system used in a smart chicken ranch: a the automatic IoT system consisting of a set of SONY's Spresense and its camera, highlighted in a red circle; b and c examples of images captured by the camera in (a)
Recently, deep learning techniques [8] have demonstrated excellent capabilities in solving image recognition problems. Many CNN-based network models have been proposed, including VGGNet [9], GoogLeNet [10], and ResNet [11]. Mbelwa et al. [12] utilized CNN technology to predict the health statuses (healthy and two classes of diseases) of broiler chickens based on images of bird droppings. The study showcased the effectiveness of employing a pre-trained CNN model, and the results demonstrated the feasibility of leveraging CNNs to accurately determine the health statuses of broiler chickens through the analysis of their droppings. Kholil et al. [13] expanded the classification of disease categories in broiler chickens by employing a CNN model for the analysis of chicken droppings. Pu et al. [14] introduced a CNN-based method to classify the behaviors of broiler chickens using images captured by a depth camera. The images were acquired under three different stocking-density conditions, specifically high, medium, and low crowding in a poultry farm. Fang et al. [15] presented a method for chicken pose estimation utilizing a CNN to detect key body feature points on chicken bodies; the chicken behaviors were then classified based on pseudo-skeletons constructed by connecting the detected body feature points. Nasiri et al. [16] also presented a chicken pose estimation method, utilizing a CNN and an LSTM to improve the accuracy of body feature point detection. However, the above CNN-based methods encounter a challenge in effectively handling and distinguishing multiple targets in one frame within an autonomous patrolling system.
As for detecting chickens and recognizing their healthy statuses, a good solution is to improve a general CNN-based object detector such as Faster-RCNN [17], SSD [18], or the YOLO [19] series. These general object detectors capture common features and broad characteristics of objects, making it possible to turn them into task-specific detectors by leveraging those common features and further enhancing their performance through specialized design. For example, Zhuang et al. [20] proposed a sick chicken detector based on SSD, which demonstrated successful results in detecting chickens and recognizing their healthy statuses in real time by improving a general CNN-based object detector. Liu et al. [21] designed and implemented a compact removal system specifically for the detection and removal of deceased chickens in a poultry house. The system utilized the YOLOv4 network for the identification of dead chickens and achieved good precision. Hao et al. [22] expanded the application scope of the system to houses with caged chickens by using the YOLOv3 network. In contrast to the detection of dead chickens, identifying sick chicken statuses for early disease warning poses a greater challenge due to the limited distinctiveness of appearance features between healthy and sick chickens. In addition to the aforementioned YOLO-based methods, the authors in [23] proposed a face detector based on YOLOv5 [24], another real-time detector, and demonstrated another example of adapting a general CNN-based object detector to a specific task.
Inspired by [20] and [23], we propose a real-time detector that detects chickens and recognizes their healthy statuses by modifying the YOLOv5 detector, owing to YOLOv5's higher performance compared with other detectors. In particular, the modified detector based on YOLOv5 has high potential for running on embedded or mobile devices, which makes it suitable for our autonomous system.
Our main contributions are summarized as follows.
• We apply a receptive field enhancement module to enhance the feature pyramid representation capability of the YOLOv5 detector. This approach aims to improve detection accuracy on small and variably scaled chicken body parts, such as the head and neck.
• We apply a slide weighting loss function to the classification loss, which amplifies the loss values for hard-to-classify samples. Specifically, this is accomplished by assigning higher weights to those samples that are difficult to classify. This adaptive weighting mechanism can improve the overall accuracy of the classification model.

Related works
Recently, many deep learning-based solutions have exhibited impressive performance in identifying the healthy status of poultry. Haiyang Zhang et al. [25] proposed a residual neural network-based method to detect sick chickens, whose final classification results reach 93.7% on the test dataset. Yi Shi et al. [26] proposed a lightweight YOLOv3-based detection network for chicken recognition and monitoring. Notwithstanding the promising results achieved by deep learning-based solutions in poultry health recognition, achieving high recognition rates in complex scenes remains a challenging task. Zhuang et al. [20] proposed an improved SSD (Single Shot MultiBox Detector) model to detect and classify healthy and sick chickens in real time, a good example of utilizing object detection networks for real-time chicken detection and healthy status recognition. Recently, object detection networks have been widely applied to various tasks due to their excellent performance. Famous object detection frameworks include Fast-RCNN [27], FSSD [28], RefineDet [29], RetinaNet [30], and the YOLO [31] series. Several YOLO-based methods such as [21] and [22] have demonstrated successful applications in automated systems within chicken farms. In particular, YOLOv5, a one-stage framework, has become a state-of-the-art (SOTA) framework for object detection due to its high accuracy and fast inference speed. For detecting chickens and recognizing their healthy statuses, the detection of chicken heads plays a crucial role. However, the accuracy of detection and recognition can be substantially diminished in practical scenarios, primarily due to the intricate scale variations exhibited by chicken heads. Therefore, many multi-scale detectors based on scale-invariant features have been proposed to address this issue. Among them, [32-35] have tried to extract scale-invariant features more accurately and efficiently. Some other methods, such as [36, 37], applied fewer down-sampling layers and dilated convolution to improve detection performance. [38-40] have tried to fuse deep semantic information with shallow features. Other methods such as SNIP [41] and TridentNet [42] also provided new approaches to the multi-scale problem.
The quality of training data is critical to learning performance. For instance, correlation occurs in training data if samples are taken at close locations, and data collection with a separation distance has been proposed in [43]. The sample imbalance problem, however, is still a common issue for one-stage detectors and is the focus of this paper. In many real-world scenarios, the number of positive samples (i.e., samples of interest) is significantly smaller than the number of negative samples, resulting in imbalanced training data. This can lead to biased models, where the detector may struggle to correctly detect the positive samples. To deal with this problem, the methods in [30] and [44] suppress the gradients from easy positive and negative samples to focus more on difficult samples. Prime Sample Attention (PISA) [45] assigns weights to positive and negative samples according to different criteria. Even though the above methods can effectively mitigate sample imbalance, they also artificially introduce hyper-parameters, which increases the difficulty of tuning and is inconvenient in practice.

Modified YOLOv5 detector
In this section, we present the key components of our proposed modified YOLOv5 detector. Figure 2 shows its network architecture. As we can see, the architecture of the proposed method is similar to that of the original YOLOv5. It can be divided into three parts: backbone, neck, and heads. The backbone includes a Focus layer, multiple CBS blocks, and multiple C3 blocks. The neck includes SPP and one of our key modifications, the RFEM combined with a C3 module. The heads are utilized to regress the locations and to classify the categories of the targets.

Network architecture
As shown in Fig. 2, a CBS block consists of a convolution layer followed by a batch normalization layer and a SiLU [46] activation function. The C3 block consists of CBS blocks, a convolution layer, and several (n) bottlenecks. Each bottleneck consists of multiple CBS blocks, and the number of these blocks depends on the position of the "container" C3 block in the YOLOv5 model. The YOLOv5 model comes in different types: type "n" is the fastest but has the lowest accuracy, while types "s", "m", "l", and "x" provide better accuracy at the cost of slower speed. The different types have different numbers of layers, channel sizes, and other hyper-parameters that affect their performance.

Fig. 3 The structure of RFEM: a the detailed architecture of the Receptive Field Enhancement Module; b an example of receptive-field-related processing in our method
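For concreteness, the following is a minimal PyTorch sketch of the CBS, bottleneck, and C3 building blocks as described above; the class names, channel handling, and default arguments are illustrative assumptions rather than the official YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution -> BatchNorm -> SiLU, the basic building block."""
    def __init__(self, in_ch, out_ch, kernel_size=1, stride=1):
        super().__init__()
        padding = kernel_size // 2  # "same" padding for odd kernel sizes
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two stacked CBS blocks with an optional residual connection."""
    def __init__(self, ch, shortcut=True):
        super().__init__()
        self.cv1 = CBS(ch, ch, 1)
        self.cv2 = CBS(ch, ch, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    """CSP-style block: n bottlenecks on one branch, a bypass on the other."""
    def __init__(self, in_ch, out_ch, n=1):
        super().__init__()
        hid = out_ch // 2
        self.cv1 = CBS(in_ch, hid, 1)
        self.cv2 = CBS(in_ch, hid, 1)
        self.m = nn.Sequential(*(Bottleneck(hid) for _ in range(n)))
        self.cv3 = CBS(2 * hid, out_ch, 1)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```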
In the neck region of the network architecture, a Spatial Pyramid Pooling (SPP) block is utilized to extract the most significant contextual features. C3RFEM is a network architecture that combines convolution layers with the Receptive Field Enhancement Module (RFEM). RFEM is a module that enhances the receptive field of the network, which helps to capture context information from the input image; it is introduced in Sect. 3.2.
In the head part of the YOLOv5-based model, the architecture involves the concatenation of CBS blocks, C3 blocks, and up-sampling blocks, which are used for target detection. Additionally, the model also employs three output blocks, namely P3, P4, and P5, to finalize the detection process.

Receptive field enhancement module
To ensure the practicality of our detector in automated chicken health monitoring scenarios, the input image resolution cannot exceed VGA resolution. However, the captured area should be maximized, which means capturing more chicken targets in a single input image is preferable. As a result, the regions of interest, such as the chicken head and neck, are typically small in the VGA-resolution input images. Moreover, these target body parts can vary in scale, further complicating the detection and classification tasks. Therefore, to address this challenge, we have modified the YOLOv5 detector by incorporating the Receptive Field Enhancement Module (RFEM). This modification helps increase the receptive field of the feature map, thereby improving the accuracy of detecting and recognizing chicken body parts. As shown in Fig. 3, multiple branches of dilated convolutions are utilized to capture information at multiple scales. In each branch, the 1 × 1 convolution is the same, but the 3 × 3 convolutions differ in their dilation rates, namely 1, 2, and 3. After the multiple branches of convolution and a residual connection, the branch features are weighted to balance the representation of the different branches. As shown in Fig. 3b, the RFEM in our modified YOLOv5 detector utilizes dilated convolutions to emulate the receptive field mechanism of the human visual system. This technique has been demonstrated to be effective in improving the accuracy of multi-scale target detection and recognition.
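The following is a minimal PyTorch sketch of the RFEM idea as read from Fig. 3: a shared 1 × 1 convolution, parallel 3 × 3 dilated convolutions with rates 1, 2, and 3, learnable branch weights, and a residual connection. The exact layer ordering and fusion details are assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class RFEM(nn.Module):
    def __init__(self, ch, dilation_rates=(1, 2, 3)):
        super().__init__()
        # the same 1x1 convolution feeds every branch, as described above
        self.pre = nn.Conv2d(ch, ch, 1, bias=False)
        # one 3x3 dilated-convolution branch per dilation rate;
        # padding = dilation keeps the spatial size unchanged
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(ch),
                nn.SiLU(),
            )
            for d in dilation_rates
        )
        # learnable per-branch weights to balance the branch representations
        self.weights = nn.Parameter(torch.ones(len(dilation_rates)))

    def forward(self, x):
        y = self.pre(x)
        w = torch.softmax(self.weights, dim=0)
        out = sum(wi * b(y) for wi, b in zip(w, self.branches))
        return out + x  # residual connection keeps the original features
```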

Sliding loss function
In real-world scenarios, it can be challenging to distinguish between healthy and sick chickens for various reasons, such as the appearance features of sick chickens not being easily distinguishable (see Fig. 4). Hence, we utilize a sliding loss function in our method. The sliding loss function assigns higher weights to hard-to-classify samples, which helps the network focus more on these samples during training. This can lead to better performance on difficult samples and improve the overall accuracy of the network. The sliding loss function is a form of adaptive weighting, where the weights are adjusted based on the rule we designed. The sliding function is defined as

f(x) = 1, if x ≤ μ − 0.1
f(x) = e^(1−2μ+x+β), if μ − 0.1 < x < μ    (1)
f(x) = e^(1−x+β), if x ≥ μ

where x is the IoU of the prediction bounding box with the ground-truth bounding box and μ represents a threshold based on this IoU. For example, we can use the average IoU of all bounding boxes as the threshold μ to divide positive samples (IoU larger than μ) from negative samples (IoU smaller than μ); β is a shifting parameter that controls the weight values and is set to 0.2 in our experiments. As shown in (1), there is a boundary area from μ − 0.1 to μ in the negative samples: we set the weight to e^(1−2μ+x+β) for this boundary area and to e^(1−x+β) for positive samples. Plots of the sliding weights corresponding to different thresholds μ are shown in Fig. 5. The smaller the IoU-based threshold μ, the higher the sliding weight on the classification loss, which enlarges the classification loss. Moreover, with the sliding loss function, both positive and negative samples close to the threshold μ are assigned high weights, e^(1−x+β) (positive) or e^(1−2μ+x+β) (negative), thus enlarging the classification losses of the samples at the boundary between positive and negative samples. By incorporating the sliding loss function into the classification loss, the samples that are hard to classify (close to the decision threshold) receive higher weights. This results in the network paying more attention to these samples during training. Consequently, the overall accuracy of the network in identifying healthy and sick chickens can be improved in real-world scenarios.
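A minimal sketch of the sliding weight in (1), assuming, as in the original Slide Loss design of [23], a weight of 1 for easy negative samples below μ − 0.1:

```python
import torch

def slide_weight(x, mu, beta=0.2):
    """x: IoU between predictions and ground truth (tensor); mu: threshold."""
    w = torch.ones_like(x)                        # easy negatives: weight 1
    boundary = (x > mu - 0.1) & (x < mu)          # hard negatives near mu
    w[boundary] = torch.exp(1 - 2 * mu + x[boundary] + beta)
    positive = x >= mu                            # positive samples
    w[positive] = torch.exp(1 - x[positive] + beta)
    return w

# The weight multiplies the per-sample classification loss, e.g.:
#   cls_loss = (slide_weight(iou, mu) * bce(pred, target)).mean()
```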

Dataset
Since there is a lack of publicly available datasets for chicken detection and healthy status recognition, we created our own dataset consisting of 9,000 images of healthy chickens and 5,000 images of sick chickens. To be precise, 2/3 of the images in each category were captured manually with a high-quality camera (SONY α1) at a resolution of 1920 × 1080; the remaining 1/3 of the images in each category were captured by our IoT camera at a resolution of 640 × 480 during the robot's autonomous patrolling. To mitigate the impact of data imbalance, we enabled the class-weight and image-weight updating functions predefined in YOLOv5. These functions adjust the sampling frequency of the training data in each training epoch based on the reciprocal of the number of images in each category. Specifically, the sampling frequency of images in a category with fewer images is increased to balance the training process. In our preliminary experiments, we found that for all the YOLOv5 models, the accuracy measured by mean AP (IoU at 0.5 and 0.5:0.95) increased when using the class-weight and image-weight updating functions. Moreover, the magnitude of the increase in mean AP was greater for smaller model sizes. Specifically, for the YOLOv5s model (the second smallest model), the mean AP (intersection over union (IoU) at 0.5 and 0.5:0.95) increased by approximately 3% and 5%, respectively, while for the YOLOv5n model (the smallest model), the mean AP (IoU at 0.5 and 0.5:0.95) increased by approximately 6% and 12%, respectively. We manually labeled the data using publicly available labeling tools such as "LabelImg". Please note that we labeled the data under the guidance of layer-breeding experts affiliated with an academic institution of agricultural sciences. As for the types of chickens in our dataset, we included equal numbers of images of white-feathered and yellow-feathered chickens, the two main types of chickens in chicken farms. Additionally, all the chickens in our dataset are layers between 20 and 60 weeks old.
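The following sketch illustrates the reciprocal-frequency idea behind this weighting with our two category counts; YOLOv5's internals differ in detail, so this is only an illustration of the sampling principle.

```python
# Reciprocal-frequency sampling: categories with fewer images are
# sampled more often in each epoch.
counts = {"healthy": 9000, "sick": 5000}
class_weights = {k: 1.0 / v for k, v in counts.items()}
total = sum(class_weights.values())
probs = {k: w / total for k, w in class_weights.items()}
print(probs)  # sick images (~0.64) are sampled more often than healthy (~0.36)
```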

Implementation details
The computer configuration employed for both training and evaluation comprised an Intel Xeon Silver 4214R CPU with 24 cores operating at 2.40 GHz per core, 256 GB of memory, and an NVIDIA RTX 3090 Ti GPU with 24 GB of GDDR6X memory. The operating system is Ubuntu 18.04, and the versions of Python, PyTorch, CUDA, and cuDNN are 3.8, 1.11.0, 11.3, and 8.2, respectively. We implement the modifications described in Sect. 3 in PyTorch. We use SGD as the optimizer; the initial and final learning rates are 1e−2 and 1e−5, respectively. The weight decay is set to 5e−3. A momentum of 0.8 is used in the first three warm-up epochs; after that, the momentum is changed to 0.937. The batch size and the IoU threshold for NMS are set to 64 and 0.5, respectively.
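A minimal sketch of this optimizer schedule, assuming a cosine decay from the initial to the final learning rate and a hypothetical total of 300 epochs (the decay shape and epoch count are not specified above):

```python
import math
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the detector network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.8, weight_decay=5e-3)

epochs, lr0, lrf = 300, 1e-2, 1e-5  # epoch count is an assumption
for epoch in range(epochs):
    if epoch == 3:  # warm-up finished: raise momentum to 0.937
        for g in optimizer.param_groups:
            g["momentum"] = 0.937
    # cosine decay of the learning rate from lr0 to lrf
    lr = lrf + 0.5 * (lr0 - lrf) * (1 + math.cos(math.pi * epoch / epochs))
    for g in optimizer.param_groups:
        g["lr"] = lr
    # ... run one training epoch here ...
```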
Our dataset, described in Sect. 4.1, is randomly shuffled and split 9:1 into a training set and a validation set; the augmentation methods are the default methods in YOLOv5.
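A minimal sketch of such a 9:1 random split; the dataset path and file pattern are illustrative assumptions:

```python
import glob
import random

random.seed(0)  # fixed seed for a reproducible split
images = sorted(glob.glob("chicken_dataset/images/*.jpg"))
random.shuffle(images)
split = int(0.9 * len(images))
train_images, val_images = images[:split], images[split:]
print(len(train_images), len(val_images))
```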

Ablation study
In this subsection, we discuss the impact of our modifications on the performance of our detector using our own dataset. The metrics used for evaluation are mAP0.5 (mean AP at an IoU threshold of 0.5) and mAP0.5−0.95 (mean AP averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05). In Table 1, ✓ and − indicate that the baseline YOLOv5 detector is with and without the specific modification, respectively. As we can see, both the receptive field enhancement module ("RFEM") and the sliding loss function on the classification loss ("Slide Loss") improve the performance of the original YOLOv5 model. The mAP0.5 and mAP0.5−0.95 both improve when using either modification or both together. Specifically, using the RFEM alone increases the mAP0.5 by 5.8% and the mAP0.5−0.95 by 5.9%, while using the Slide Loss alone increases the mAP0.5 by 5.2% and the mAP0.5−0.95 by 7%. Using both modifications together further improves the performance, with an increase in mAP0.5 of 9.6% and in mAP0.5−0.95 of 9.4%.
These results demonstrate the effectiveness of our modifications in improving the accuracy of chicken detection and recognition on our dataset. Please note that, due to space constraints, we only present the ablation study results for the YOLOv5s6 model here. However, we provide a comprehensive performance comparison between our modified YOLOv5 models and the original YOLOv5 models, such as YOLOv5n6, YOLOv5l6, and others, in Sect. 4.4.

Comparison of different models
In this subsection, we evaluate the performance of our proposed detectors by comparing them to existing detectors such as SSD-based detectors and the original YOLOv5-based detectors. Table 2 presents the results, where mAP0.5 (IoU at 0.5) and mAP0.5−0.95 (IoU at 0.5:0.05:0.95) are still used as the performance metrics.
As shown in Table 2, SSD300v and SSD300m denote the SSD detectors with input images of resolution 300 × 300 and with VGG and MobileNet backbones, respectively. SSD512v and SSD512m are similar, but the input images have a resolution of 512 × 512. It can be observed that utilizing MobileNet as the backbone of the SSD detector leads to a lighter model and faster processing, albeit with a slight decrease in precision. FSSD512v is an abbreviation for "Feature Selective Single Shot Detector" with a VGG backbone and 512 × 512 input images. The further down a model is in the table, the larger its size and the higher its accuracy, but the slower its running speed. However, it should be noted that each model with the proposed modifications showed an improvement in accuracy while experiencing only a minimal loss in speed (see the 11th-14th detectors in Table 2). It is important to mention that we did not consider the largest model, YOLOv5x6, in this study due to its large model size. In order to ensure fair comparisons, the speed measurements were taken after resizing the images to the required input size of each model and were obtained as the average over 20 runs.

Figure 6 shows the precision-recall (PR) curves of the detectors. As we can see, the precision-recall curves demonstrate the effectiveness of our proposed modifications. The proposed YOLOv5 models with modifications are highly effective, outperforming the existing detectors in terms of precision and recall. Our modified YOLOv5 models achieved the highest performance among all detectors tested, surpassing both the original YOLOv5 models and other popular detectors such as SSD and FSSD. Overall, the modifications have proven successful in enhancing the accuracy of the detectors without significant speed loss.

In the qualitative detection results, the bottom rows present the outcomes of sick chicken detection and recognition, marked by pink bounding boxes, while the other rows show the results of healthy chicken detection and recognition, denoted by red bounding boxes. These results demonstrate that our modified YOLOv5s6 detector can accurately detect and recognize healthy and sick chickens.

Conclusion
In this paper, we present a modified YOLOv5 detector for recognizing the healthy statuses of chickens, with the aim of automating the process of monitoring chicken health using IoT modules on a robot patrolling in a chicken ranch. The proposed detector is capable of running in real time and achieving high accuracy. Moreover, our detector outperforms other state-of-the-art detectors on our chicken dataset, which consists of images captured from real chicken farms.
In future work, we plan to address several aspects that can improve the accuracy and efficiency of our detector in practical use cases. (1) Incorporating and integrating multiple sources of information, such as images of chicken bodies, images of droppings, environmental indexes, and auditory cues from chicken sounds, can potentially provide comprehensive information to an automated IoT-based system for accurate identification of chicken health statuses. (2) Since obtaining images of sick chickens is more difficult than obtaining images of healthy chickens in practical chicken ranches, it is still worth exploring more effective methods to address the issue of imbalanced data. (3) Our future work also involves deploying the detector on edge devices, aiming to reduce the model size and computational costs while maintaining a satisfactory level of performance.