2.1. Materials
2.1.1. Image acquisition
All animal handling procedures were approved by the Institutional Animal Care and Use Committee of Shandong Agricultural University (Approval Number: SADUA-2021-053), and all methods were performed in accordance with the relevant guidelines and regulations. The dairy cow feeding behavior dataset used in this study was collected at Tai'an Jinlan Dairy Cows Breeding Company, Tai'an, Shandong, China. The company currently keeps more than 2,000 high-quality Holstein cattle. The No. 3 cowshed was selected as the experimental cowshed; it housed 17 lactating dairy cows aged between 1.5 and 2.5 years and in good condition. Data on dairy cow feeding behavior were collected from August 10, 2021, to September 5, 2021, a total of 27 days, from 8:00 to 12:00 and from 14:00 to 18:00 every day (feed supplementation started at 9:00–9:30 and 15:00–15:30 every day). The data were acquired with a ZED 2 binocular depth camera (STEREOLABS), whose maximum field of view is 110° (H) × 70° (V) × 120° (D). At the camera's maximum resolution the transmission frame rate was only 15 frames per second; therefore, to ensure the stability and accuracy of the data, the monocular resolution was set to 1280×720 pixels, at which the transmission frame rate could stably reach 30 frames per second. The top of the cow pen was 1.35 m above the ground, and the feed belt was approximately 0.8 m wide. To avoid interfering with the normal feeding behaviors of the dairy cows, the cameras were placed 1.75 m above the feeding area and 1.2 m in front of it at a height of 0.8 m, so that feeding behaviors could be collected from different shooting directions.
During image acquisition, a total of 20 groups of video data were collected. Each group contained four videos: top-view and front-view recordings for each of the two time periods, 8:00–12:00 and 14:00–18:00. The ZED API was used to extract images frame by frame from the videos, and duplicated, blurred, ghosted and invalid images were removed. A total of 10,288 images covering the feeding actions of the 17 dairy cows were obtained. Among them, 5,242 images were taken from the front, including 1,324 images containing one dairy cow and 3,918 images containing multiple dairy cows; 5,046 images were taken from above, including 1,263 images containing one dairy cow and 3,783 images containing multiple dairy cows.
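As an illustration of this extraction step, the sketch below reads frames from a recorded ZED SVO file with the ZED SDK Python bindings and saves them as images. The file paths and the simple frame-difference test (a stand-in for the removal of duplicated and invalid images, which was done manually in this study) are assumptions for illustration only.

```python
import os
import cv2
import numpy as np
import pyzed.sl as sl

svo_path, out_dir = "recording.svo", "frames"   # illustrative paths, not the paper's file layout
os.makedirs(out_dir, exist_ok=True)

init = sl.InitParameters()
init.set_from_svo_file(svo_path)                # read a recorded SVO file instead of a live camera
cam = sl.Camera()
assert cam.open(init) == sl.ERROR_CODE.SUCCESS
runtime, image = sl.RuntimeParameters(), sl.Mat()

prev, index = None, 0
while cam.grab(runtime) == sl.ERROR_CODE.SUCCESS:
    cam.retrieve_image(image, sl.VIEW.LEFT)     # left view of the stereo pair
    frame = image.get_data()[:, :, :3].copy()   # drop the alpha channel
    # Crude near-duplicate filter; an illustrative stand-in for the manual cleaning step.
    if prev is None or np.mean(cv2.absdiff(frame, prev)) > 2.0:
        cv2.imwrite(os.path.join(out_dir, f"frame_{index:06d}.jpg"), frame)
        index += 1
    prev = frame
cam.close()
```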
2.1.2. Dataset Labeling
In this paper, the open-source labeling tool LabelImg was used to manually label the 10,288 original images of dairy cow feeding behaviors. The feeding behaviors of the 17 dairy cows were labeled, and the images were divided into a top-view dataset and a front-view dataset. The top-view dataset was split into a training set of 4,320 images and a test set of 726 images (a ratio of approximately 6:1), and the front-view dataset was split into a training set of 4,484 images and a test set of 758 images (also approximately 6:1). During normal feeding, multiple dairy cows may compete for the same feed, which causes their heads to overlap. Therefore, the following labeling rules were formulated: 1) do not label occluded and incomplete dairy cow heads; 2) label adjacent dairy cow heads separately. Each time the head of a dairy cow was manually labeled, LabelImg automatically generated the corresponding XML file, which recorded the coordinates of the upper-left and lower-right corners of the rectangle labeling the head of the dairy cow (that is, the feeding behavior), the length and width of the labeling rectangle, and the labeling category. To improve the training efficiency and speed of the model, the annotation boxes of dairy cow heads were normalized, as shown in Formulas (1)–(4).
$$x=\frac{x_{\max}+x_{\min}}{2R_w} \quad (1)$$
$$y=\frac{y_{\max}+y_{\min}}{2R_h} \quad (2)$$
$$w=\frac{x_{\max}-x_{\min}}{R_w} \quad (3)$$
$$h=\frac{y_{\max}-y_{\min}}{R_h} \quad (4)$$
where x, y, w and h are the normalized center coordinates, width and height of the rectangular box labeling the feeding behavior of a dairy cow; \(x_{\min}\), \(y_{\min}\) and \(x_{\max}\), \(y_{\max}\) are the coordinates of the upper-left and lower-right corners, respectively, of the manually marked rectangle; and \(R_w\) and \(R_h\) are the width and height of the dairy cow feeding behavior image, respectively.
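As an illustrative sketch (the annotation file name is hypothetical), the snippet below reads the corner coordinates from a LabelImg-style Pascal VOC XML file and applies Formulas (1)–(4) to obtain the normalized box parameters.

```python
import xml.etree.ElementTree as ET

def read_labelimg_xml(xml_path):
    """Return the image width/height (R_w, R_h) and labeled boxes (category, corner coordinates)."""
    root = ET.parse(xml_path).getroot()
    R_w = float(root.find("size/width").text)
    R_h = float(root.find("size/height").text)
    boxes = [(obj.find("name").text,                       # labeling category
              float(obj.find("bndbox/xmin").text), float(obj.find("bndbox/ymin").text),
              float(obj.find("bndbox/xmax").text), float(obj.find("bndbox/ymax").text))
             for obj in root.findall("object")]
    return R_w, R_h, boxes

def normalize_box(x_min, y_min, x_max, y_max, R_w, R_h):
    """Apply Formulas (1)-(4): normalized center coordinates, width and height."""
    x = (x_max + x_min) / (2 * R_w)     # Formula (1)
    y = (y_max + y_min) / (2 * R_h)     # Formula (2)
    w = (x_max - x_min) / R_w           # Formula (3)
    h = (y_max - y_min) / R_h           # Formula (4)
    return x, y, w, h

R_w, R_h, boxes = read_labelimg_xml("cow_0001.xml")        # hypothetical file name
labels = [(name, *normalize_box(x1, y1, x2, y2, R_w, R_h)) for name, x1, y1, x2, y2 in boxes]
```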
In the breeding environment of a farm, dairy cows can eat only when they are in the feeding area and their heads are in contact with the feed. When a dairy cow in the feeding area raises its head, it is generally chewing the feed and preparing to continue eating, and it leaves the feeding area once it has eaten enough. Therefore, the feeding behavior of dairy cows was divided into two parts: feeding and chewing. Examples are shown in Fig. 1(a)–(f), which show the chewing, feeding, and grass arching behaviors photographed from the front and the chewing, feeding, and grass arching behaviors photographed from above.
The proportion of each behavior in the dataset is shown in Table 1.

Table 1. Number of dairy cows' feeding behaviors marked in the dataset

| Shooting direction | Number of training images | Number of feeding behaviors | Number of chewing behaviors | Number of grass arching behaviors |
|---|---|---|---|---|
| Front | 4,484 | 5,684 | 792 | 1,613 |
| Above | 4,320 | 6,958 | 960 | 1,946 |
2.2. Methods
The YOLO series models are single-stage detection models; their detection speed is faster than that of two-stage models, but their accuracy is correspondingly lower. For identifying and tracking the feeding behavior of dairy cows, because feeding actions are fast, the model must ensure not only a fast detection speed but also a high detection accuracy. Therefore, based on the YOLOv5 model, we used three enhancement modules, Transformer, CBAM and SE, to strengthen the feature extraction ability of the model and to increase the accuracy of feeding-behavior recognition while maintaining the speed of identifying the 17 dairy cows and their feeding behaviors.
2.2.1. YOLOv5 model
Figure 2 shows the YOLOv5 model, which uses CSPDarknet as the backbone network and also applies CSP structures in the neck network. Compared with the YOLOv4 model, Mosaic data augmentation and adaptive anchor box computation were used at the input, the Focus structure was added to the backbone network, and the BottleneckCSP structure was applied to the neck network. These changes allowed the YOLOv5 model to reduce the number of parameters while keeping the detection accuracy consistent with that of the YOLOv4 model. In addition, YOLOv5 used an adaptive image scaling function, which kept the aspect ratio of the scaled image the same as that of the original image, ensuring that target tracking would not be affected by pixel-position errors caused by changes in the image aspect ratio.
In testing, the detection speed of the YOLOv5 model was very fast: it took only approximately 0.01 s to detect one frame. However, the detection accuracy was not high, and the feature extraction of the model was not comprehensive. Therefore, based on YOLOv5, we used model enhancement modules to improve the detection performance.
2.2.2. Transformer module
The Transformer module was proposed by Vaswani et al.19. Its internal structure is shown in Fig. 3; it consists mainly of an encoding component and a decoding component. The encoding component consists of multiple encoders, the decoding component consists of multiple decoders, and the number of encoders matches the number of decoders. Each encoder is composed of a multi-head self-attention layer and a feed-forward neural network; each sublayer has a residual connection around it and is followed by layer normalization. Each decoder is composed of a masked multi-head self-attention layer, a multi-head self-attention layer and a fully connected feed-forward neural network. The self-attention layer helps the current position attend to the key content it needs to consider. The multi-head self-attention layer expands the model's ability to focus on different positions and forms multiple subspaces that allow the model to focus on different aspects of information, which can effectively combine the identity characteristics and the different feeding behaviors of dairy cows to achieve the integration of high-dimensional global features.
The Transformer module can parallelize computation, so the number of sequential operations it requires is minimal, and the self-attention mechanism reduces the distance between any two positions in the sequence to a constant. The self-attention layer can directly capture global dependencies, whereas the number of operations a CNN needs to relate two locations through convolution grows with the distance between them. As a result, it is easier for the Transformer module to learn more feature information.
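As a minimal sketch of one encoder layer described above (multi-head self-attention plus a feed-forward network, each with a residual connection and layer normalization), the snippet below uses PyTorch; the embedding size and number of heads are illustrative values, not the settings used in TCS-YOLO.

```python
import torch.nn as nn

class TransformerEncoderLayerSketch(nn.Module):
    """One encoder layer: multi-head self-attention + feed-forward, each with
    a residual connection and layer normalization (Vaswani et al.)."""
    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, embed_dim * 4), nn.ReLU(),
                                 nn.Linear(embed_dim * 4, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):                     # x: (sequence_length, batch, embed_dim)
        attn_out, _ = self.attn(x, x, x)      # self-attention: queries = keys = values = x
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))       # feed-forward sublayer + layer norm
        return x
```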
2.2.3. CBAM module
The CBAM module20 included two independent sub-modules, the channel attention module and the spatial attention module, which performed channel and spatial attention work, respectively, as shown in Fig. 4.
Channel attention compressed the spatial dimension of the input feature map: the results of average pooling and maximum pooling were passed through a shared multilayer perceptron, combined, and the channel attention map was obtained through a sigmoid function and applied to the channels of the input feature map. Spatial attention was a supplement to channel attention. The feature map output by channel attention was used as the input of spatial attention; average pooling and maximum pooling were applied along the channel dimension, the two results were concatenated into a feature map, and a convolutional layer was then used for learning.
By combining the channel attention module and the spatial attention module, the CBAM module saves parameters and computing power, remains lightweight, and can be directly integrated into existing network model frameworks.
After the CBAM module was introduced, the features identified by the network covered more of the head features of feeding dairy cows, and the features representing dairy cow head movements became the key information of the model.
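A minimal sketch of the CBAM computation described above is given below; the reduction ratio of 16 and the 7×7 spatial convolution follow the defaults of the original CBAM paper and are assumptions here, not necessarily the values used in this work.

```python
import torch
import torch.nn as nn

class CBAMSketch(nn.Module):
    """Channel attention followed by spatial attention (Woo et al.)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled channel descriptors
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        # Spatial attention: 7x7 convolution over concatenated channel-wise avg/max maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                       # x: (batch, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                      # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                       # max-pooled descriptor
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)        # apply channel attention map
        spatial = torch.cat([x.mean(dim=1, keepdim=True),
                             x.amax(dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.conv(spatial))               # apply spatial attention map
        return x
```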
2.2.4. SE module
The full name of the SE module is Squeeze-and-Excitation21, and its workflow is shown in Fig. 5. The SE module first performed the squeeze operation, compressing the feature map into a 1×1×C vector through global pooling, and then performed the excitation operation, in which channel reduction was carried out through fully connected layers and activation functions to reduce the amount of computation. Finally, the scaling operation was performed, and the feature map was reweighted channel by channel using the resulting 1×1×C vector.
After passing through the SE module, channel attention is enhanced so that the feature map has a global receptive field, which strengthens the network's feature extraction and improves its classification of the target.
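A minimal sketch of the squeeze, excitation and scaling steps is shown below; the reduction ratio of 16 is the commonly used default and is an assumption here.

```python
import torch.nn as nn

class SEBlockSketch(nn.Module):
    """Squeeze-and-Excitation (Hu et al.): global pooling, two fully connected layers,
    then channel-wise scaling of the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                             # squeeze: H x W x C -> 1 x 1 x C
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),      # channel reduction
            nn.Linear(channels // reduction, channels), nn.Sigmoid())   # per-channel weights

    def forward(self, x):                                               # x: (batch, C, H, W)
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                                              # scale: reweight each channel
```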
2.2.5. Establishment of TCS-YOLO model
Since accurate extraction of model features is a prerequisite for target tracking, the Transformer module was added to the efficient YOLOv5 network, and its self-attention mechanism was used to accurately capture and judge the characteristics of dairy cow feeding behaviors. The CBAM module was added after the Focus layer so that the model's attention was focused on the characteristics of dairy cow feeding behaviors, and the SE module was used at the end of the backbone network to improve the global receptive field of the feature map and the classification ability of the model. For the model prediction module, CIOU22 was used instead of GIOU (which is based on the ratio of the intersection and union of the predicted bounding box and the ground truth bounding box). Because CIOU considers the overlap rate, the center distance, the scale and the aspect-ratio penalty term between anchor boxes in the loss function, the target box regression becomes more stable. The formulas are as follows:
$$CIOU = IOU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v \quad (5)$$
$$\alpha = \frac{v}{1 - IOU + v} \quad (6)$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (7)$$
$$Loss_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \quad (8)$$
where c represents the diagonal length of the smallest enclosing box that covers both the predicted bounding box and the ground-truth bounding box; w, h and \(w^{gt}\), \(h^{gt}\) represent the width and height of the predicted box and of the ground-truth box, respectively; α is a weight function; v measures the consistency of the aspect ratios; \({\rho ^2}(b,{b^{gt}})\) represents the squared Euclidean distance between the center points of the predicted box and the ground-truth box; and the corresponding loss is finally obtained as 1 − CIOU.
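A minimal sketch of Formulas (5)–(8) for axis-aligned boxes given as (x1, y1, x2, y2) tensors is shown below; it illustrates the loss terms and is not the YOLOv5 implementation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss, Formulas (5)-(8), for boxes of shape (N, 4) in (x1, y1, x2, y2) format."""
    # Intersection and union -> IoU
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between box centers; c^2: squared diagonal of the enclosing box
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 + \
           ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    cx1 = torch.min(pred[:, 0], target[:, 0]); cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2]); cy2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps
    # v: aspect-ratio consistency term (Formula 7); alpha: trade-off weight (Formula 6)
    w_p = pred[:, 2] - pred[:, 0]; h_p = pred[:, 3] - pred[:, 1]
    w_t = target[:, 2] - target[:, 0]; h_t = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v             # Formula (8)
```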
The workflow of the TCS-YOLO model is shown in Fig. 6. When an input image entered the model, the Focus module first performed the slicing operation, concentrating the W and H information into the channel space. The feature map then passed through the CBAM module and entered the CSPDarknet network, where a series of convolution operations produced a high-dimensional feature map, which entered the Transformer module through the SPP structure. The multi-head self-attention structure was used to learn features in parallel, and the results were input to the SE module to enhance the global receptive field of the model. After that, the feature map entered the Feature Pyramid Network23 and the Path Aggregation Network24, where connections with the backbone network were established through upsampling and tensor splicing, enriching the feature information of the feature map. Finally, the feature map was sent to the YOLO detection head to obtain the detection result.
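The order of operations in the backbone can be sketched schematically as follows; the module arguments stand in for the corresponding YOLOv5 blocks and the illustrative attention modules above, not the exact implementation.

```python
import torch.nn as nn

class TCSBackboneSketch(nn.Module):
    """Schematic order of the TCS-YOLO backbone described above:
    Focus -> CBAM -> CSPDarknet stages -> SPP -> Transformer -> SE -> neck (FPN/PAN)."""
    def __init__(self, focus, cbam, csp_stages, spp, transformer, se):
        super().__init__()
        self.focus, self.cbam = focus, cbam
        self.csp, self.spp = nn.Sequential(*csp_stages), spp
        self.transformer, self.se = transformer, se

    def forward(self, x):                            # x: (batch, 3, H, W) input image
        x = self.cbam(self.focus(x))                 # slicing, then early attention
        x = self.spp(self.csp(x))                    # CSPDarknet convolutions, then SPP
        b, c, h, w = x.shape
        seq = x.flatten(2).permute(2, 0, 1)          # (H*W, batch, C) token sequence
        x = self.transformer(seq).permute(1, 2, 0).reshape(b, c, h, w)
        return self.se(x)                            # global channel re-weighting before the neck
```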
2.2.6. Deep Sort Algorithm
The Deep Sort algorithm25 is often used for multitarget tracking; its workflow is shown in Fig. 7. The individual identity of each dairy cow and the corresponding feeding action were obtained from TCS-YOLO, and the position of the object in the next frame was then predicted by a Kalman filter. The predicted position was compared with the actual detected position in the next frame, and the IOU was calculated to obtain the similarity of targets in adjacent frames. Finally, the corresponding IDs across adjacent frames were matched through the Hungarian algorithm, which tracked the identities and feeding actions of the dairy cows.
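As a sketch of the IoU-based association step (the full Deep Sort algorithm additionally uses appearance features, Mahalanobis gating and cascade matching), the Hungarian assignment can be written with SciPy as follows; the IoU threshold of 0.3 is an illustrative value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-7)

def match_tracks(predicted_boxes, detected_boxes, iou_threshold=0.3):
    """Associate Kalman-predicted track boxes with current detections by maximizing total IoU
    (Hungarian algorithm). Returns (track_index, detection_index) pairs above the threshold."""
    cost = np.zeros((len(predicted_boxes), len(detected_boxes)))
    for i, p in enumerate(predicted_boxes):
        for j, d in enumerate(detected_boxes):
            cost[i, j] = -iou(p, d)                   # negative IoU so the assignment minimizes cost
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_threshold]
```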
The Deep Sort algorithm compares and confirms features across preceding and subsequent frames and adds cascade matching and new-trajectory confirmation. As shown in Fig. 8, after the target detection network obtained the distinctive features of a dairy cow, the algorithm strengthened this feature information through prediction, observation and updating. Therefore, the dairy cow ID assigned by the Deep Sort algorithm was not easily changed, so dairy cow feeding behaviors could be tracked and detected more accurately.