Smoothness-based consistency learning for macaque pose estimation

Macaques are a scarce stand-in for humans and play an important role in the study of human psychology and mental science. Accurate estimation of macaque pose is key to these studies, yet macaque pose estimation remains hindered by the scarcity of labeled images. To address this problem, this work introduces a novel semi-supervised approach called smoothness-based spatio-temporal consistency learning (SSTCL) and a dual network structure (DNS) to leverage large amounts of unlabeled real images. Specifically, SSTCL introduces the smoothness assumption to help the model generalize from the labeled training images to the unlabeled images, and the spatio-temporal consistency leverages both spatial and temporal consistencies to pick the most reliable pseudo-labels. Moreover, the DNS gives the model the ability to self-correct, which prevents the degeneration caused by noisy pseudo-labels in semi-supervised learning. Ablation experiments demonstrate the effectiveness of DNS for pseudo-label quality assurance. We evaluate the proposed method on the public OpenMonkeyPose dataset. The results show that it achieves competitive performance while using fewer labeled images, and the final accuracy surpasses the strong baseline HRNet-w48 by 2.1 AP.


Introduction
Owing to the biological similarity and homology between macaques and humans, macaques have become one of the most important animal stand-ins for humans in disease research and drug development. Quantitative behavioral analysis of macaques is essential for understanding disease mechanisms and evaluating drug effectiveness. Historically, however, quantitative measurement of macaque behavior has required manual analysis by experts, which is time-consuming and costly. Pose estimation based on deep learning has rapidly advanced to become the state-of-the-art technique for automatically locating anatomical keypoints in visual content such as images and videos, which provides a solution with great potential for the efficient and accurate quantification of macaque behavior in medical trials. Inspired by this, recent adaptations [1][2][3][4][5][6] of methods originally developed for human pose estimation [7,8] have made macaque pose estimation possible. However, a problem remains: the existing methods for humans often require large-scale datasets [7,8] for training. With the same network, human pose estimation achieves high accuracy because many large-scale human pose datasets are available. In contrast, macaque pose estimation models suffer from data scarcity, since annotating large-scale macaque pose datasets is costly and often infeasible. Therefore, this paper proposes to train the model with a small number of labeled images and a large number of unlabeled images, so as to make up for the shortage of labeled data.
Corresponding author: Ping Xue, xueping@hrbust.edu.cn
Semi-supervised learning (SSL), which uses labeled data together with unlabeled or related data to boost the accuracy of the model, permits harnessing large amounts of unlabeled data during training. Compared with annotating macaque images, it is cheaper and easier to use unlabeled images directly to assist the training. In the field of image classification, consistency-based semi-supervised learning methods [9][10][11] have achieved remarkable performance. Following this line, recent attempts [12,13] alleviate the data scarcity in animal pose estimation by introducing synthetic or unlabeled animal images during training to improve the generalization and performance of the model. Despite this progress, these studies did not solve the common problem of pseudo-label instability and model degeneration in semi-supervised learning noted in [14]. Macaques in particular adopt postures that are usually more complex than those of other animals such as mice and horses, owing to their four flexible limbs and long tail, which means the predicted pseudo-labels have a higher chance of being inaccurate. When these noisy pseudo-labels are introduced into the training set, model degeneration worsens and hurts the model performance.
To overcome these problems, this work brings a semi-supervised learning framework called SSTCL and a dual network structure to the field of macaque pose estimation. SSTCL is proposed to solve the problem of pseudo-label instability, and it consists of two main parts: the smoothness assumption and the spatio-temporal consistency. The former is introduced into model training and constrains the model to infer similar outputs when the input images are perturbed (for example by Gaussian noise and random cutout [15]), which helps the detector "recognize" the unlabeled images. The latter evaluates the reliability of the pseudo-labels by taking spatial and temporal consistencies into consideration, and is used to eliminate the noisy labels predicted by the pose detector. To handle the degeneration during semi-supervised learning, the dual network structure (DNS) consists of two peer pose detectors, which act as teachers that correct each other's mistakes. As demonstrated by extensive experiments, the proposed DNS stabilizes the pseudo-training and avoids the model degradation observed with a single detector.
The contributions of this work are threefold:
• A novel method called smoothness-based spatio-temporal consistency learning (SSTCL) is proposed to utilize massive unlabeled images to alleviate the scarcity of labeled images in macaque pose estimation, and the experimental results show a remarkable performance improvement when using SSTCL compared with supervised training.
• To avoid the degeneration of pseudo-label accuracy during semi-supervised learning, a dual network structure (DNS) is designed, and the experimental results show that it resolves the degeneration problem.
• The proposed SSTCL and DNS boost the pose detector's performance by 2.1 AP compared with the strong baseline detector HRNet-w48.

Human pose estimation
Human pose estimation has been studied extensively and has advanced rapidly in recent years along with deep learning. The methods can generally be categorized into top-down approaches [16][17][18][19][20][21] and bottom-up approaches [22][23][24][25]. The top-down approaches are two-stage approaches, which first detect instances in the image using a bounding box object detector such as Faster R-CNN [26] and then estimate the poses within these boxes, while the bottom-up approaches start by localizing identity-free keypoints for all semantic entities and then group them into person instances [27]. When predicting the keypoint coordinates, both paradigms usually recast the problem as predicting keypoint confidence maps and then decoding the final coordinates, which has been shown to be more accurate than regressing the keypoint coordinates directly.
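The heatmap-then-decode idea mentioned above can be illustrated with a minimal sketch. It assumes the simplest possible decoder (taking the highest-scoring pixel); the function names, the heatmap size and the value of sigma are illustrative choices and are not taken from the cited methods.

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Build a confidence map that peaks at the keypoint location (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def decode_keypoint(heatmap):
    """Return (x, y, confidence) of the highest-scoring pixel (argmax decoding)."""
    best = (0, 0, float("-inf"))
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best[2]:
                best = (x, y, score)
    return best

# Hypothetical 48 x 64 heatmap with a keypoint at (20, 31).
heatmap = gaussian_heatmap(width=48, height=64, cx=20, cy=31)
x, y, conf = decode_keypoint(heatmap)
```

Practical detectors typically refine the argmax location with sub-pixel offsets; that step is omitted here.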

Animal pose estimation
To quantitatively measure animal behavior, recent studies [1-4, 6, 12, 13, 28, 29] have paid increasing attention to the field of animal pose estimation. Zhou et al. [29] introduce a new challenging mouse dataset, PDMB, for mouse pose estimation and a novel graphical model-based structured context enhancement network (GM-SCENet) to model mouse behavior. Mu et al. [12] generate an animal dataset with 10+ different animal CAD models and propose a consistency-constrained semi-supervised learning framework (CC-SSL) to train on synthetic and real datasets jointly. Mathis et al. [6] develop a novel horse dataset and achieve favorable accuracy with DeepLabCut [1]. Cao et al. [13] propose a novel domain adaptation method called "weakly and semi-supervised cross-domain adaptation" (WS-CDA) to extract cross-domain common features between humans and four-legged mammals (i.e., dogs, cats and horses). With respect to macaque pose estimation, Negrete et al. [4] use the bottom-up method OpenPose [30] to estimate the poses of macaques in the wild, which is mainly an application of an existing human pose detector to macaque pose estimation. Bala et al. [31] propose a system called OpenMonkeyStudio, which provides a multiview data augmentation method that generates multiview coordinates of macaques via 3D reconstruction, and develop the large OpenMonkeyPose dataset with over 190,000 monkey images. Although OpenMonkeyStudio has alleviated the scarcity of annotated macaque pose data, it can only be carried out in an experimental environment and has high hardware requirements, which limits the application and extension of this method. Different from previous methods, this work proposes a semi-supervised training strategy that leverages massive unlabeled macaque images to facilitate the training of a macaque pose detector and is compatible with existing pose estimation methods.

Semi-supervised learning
Semi-supervised learning has been extensively studied in the field of classification and has been proven effective [9][10][11]. It leverages labeled and unlabeled data to jointly train the model and obtains better performance than supervised learning alone. As one of the most representative methodologies of semi-supervised learning, pseudo-training [9, 32-34] first learns a basic model using labeled data and then uses the learned model to assign pseudo-labels to the unlabeled data. After inspection by a pre-defined selection criterion, the qualified pseudo-labels are combined with the labeled data to iteratively train the model. However, the performance of the final model learned by pseudo-training is sensitive to noisy pseudo-labels. If too much wrongly labeled data is added to the training set, pseudo-training can even impede the performance of the model. Therefore, finding a criterion to measure the quality of the pseudo-labels is of vital importance. Laine et al. [9] store the predictions for the dataset from past epochs and require the most recent predictions to be consistent with the past ones. The approach is shown to be more tolerant to incorrect labels because of the temporal smoothing. Mu et al. [12] define invariance, equivariance and temporal consistencies to control the generation of pseudo-labels, where the temporal consistency is designed for frames in a video and uses optical flow as the supervisory signal. Inspired by these methods, this work introduces the smoothness assumption to help the pose model gain better generalization performance and proposes a spatio-temporal consistency to handle noisy pseudo-labels.

Proposed approach
Selecting reliable pseudo-labels and solving the degeneration problem are the main issues that this work addresses. To this end, the proposed approach consists of two components: (1) a smoothness-based spatio-temporal consistency learning strategy that predicts and picks out the most reliable pseudo-labels for training and (2) a dual network structure that relieves the degeneration problem.

Smoothness-based spatio-temporal consistency learning
Smoothness assumption As one of the most widely recognized assumptions in semi-supervised learning, the smoothness assumption requires that if two samples are close in the input space, their labels should be the same [35]. In image classification, this assumption means that the classifier should be robust to perturbations of its input and predict category activations for the perturbed input similar to those for the "clean" input. Pose estimation is the task of locating the keypoints in an image $I$. Current advanced methods [16,21,30,36,37] transform this problem into predicting $K$ Gaussian heatmaps $H$, where each heatmap is the confidence map of one keypoint. Similar to its application in image classification, the smoothness assumption in pose estimation means that the pose detector ought to predict similar heatmaps even if the input image is perturbed. Specifically, the smoothness assumption is introduced to enhance the generalization performance of the pose detector by minimizing the discrepancy between the heatmaps of the "clean" and the "noisy" input images.
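The idea of penalizing the discrepancy between "clean" and "noisy" outputs can be sketched in a few lines. The sketch below is an illustration under loud assumptions: `toy_detector` is a hypothetical linear stand-in for the pose detector, images and heatmaps are flattened lists, only Gaussian noise is used as the perturbation (the conventional augmentation is skipped), and the weight `lam=1.0` is an arbitrary value, not one reported in this work.

```python
import random

def mse(a, b):
    """Mean square error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def add_gaussian_noise(image, sigma, rng):
    """Perturbation: add zero-mean Gaussian noise to every pixel."""
    return [p + rng.gauss(0.0, sigma) for p in image]

def toy_detector(image, weight=0.5):
    """Hypothetical stand-in for the pose detector: a fixed linear map."""
    return [weight * p for p in image]

def smoothness_objective(labeled_img, labeled_target, unlabeled_img, lam, rng):
    """Supervised loss on labeled data plus weighted clean/noisy discrepancy."""
    supervised = mse(toy_detector(labeled_img), labeled_target)
    clean_out = toy_detector(unlabeled_img)
    noisy_out = toy_detector(add_gaussian_noise(unlabeled_img, 0.1, rng))
    discrepancy = mse(clean_out, noisy_out)
    return supervised + lam * discrepancy, supervised, discrepancy

rng = random.Random(0)
labeled_image = [0.2, 0.4, 0.6, 0.8]
labeled_target = toy_detector(labeled_image)  # pretend the labeled sample is already fitted
unlabeled_image = [0.1, 0.3, 0.5, 0.7]
total, supervised, discrepancy = smoothness_objective(
    labeled_image, labeled_target, unlabeled_image, lam=1.0, rng=rng)
```

In this toy setting the supervised term is zero by construction, so the total is driven entirely by how much the output moves under the perturbation, which is the quantity the smoothness assumption asks the detector to keep small.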
As formulated in Eq. 1, given an unlabeled image $I_u$ from the unlabeled set $U$, a conventional augmentation $T_c$ (including scale, rotation and crop) is applied to $I_u$ to generate the "clean" input $I_c$, and then perturbations (such as Gaussian noise and random cutout) are added to $I_c$ to get the "noisy" input $I_n$:
$$I_c = T_c(I_u), \quad I_n = T_n(I_c), \quad H_c = f(I_c), \quad H_n = f(I_n) \tag{1}$$
where $f$ denotes the pose detector and $T_n$ represents the perturbation operation. According to the smoothness assumption, the output heatmaps $H_c$ and $H_n$ of the clean and noisy inputs ought to be close in the output space, and the mean square error (MSE) is used to measure the discrepancy loss between $H_c$ and $H_n$. As illustrated in Eq. 2, the basic detector $f$ is trained by minimizing the sum of the supervised loss and the discrepancy loss:
$$\mathcal{L} = \mathcal{L}_{MSE} + \lambda \mathcal{L}_{SA} = \mathrm{MSE}\big(f(I_l), H_l\big) + \lambda \, \mathrm{MSE}\big(H_c, H_n\big) \tag{2}$$
where $I_l$ denotes an image sampled from the labeled set $L$, $H_l$ is the corresponding ground-truth heatmaps and $\lambda$ is the scale factor of the discrepancy loss. The overall framework is shown in Fig. 1.
Fig. 1 The framework of the proposed smoothness-based spatio-temporal consistency learning (SSTCL); the blue forward path represents the supervised training, the green forward path shows the proposed smoothness-based training scheme, and the orange path denotes the generation of the pseudo-labels. $\mathcal{L}_{MSE}$ and $\mathcal{L}_{SA}$ are the supervised loss and the smoothness assumption loss, respectively
Spatio-temporal consistency Given a basic pose detector $f$ that abides by the smoothness assumption, it can be inferred that, for two input images $I_l$ and $I_u$ sampled from the labeled and unlabeled sets that are close in the input space, the detector $f$ should be able to predict reliable labels for $I_u$ because it has seen the similar image $I_l$ during supervised learning. Ideally the detector would predict every label correctly, but this is impossible in practice due to the discrepancy between labeled and unlabeled images. Instead of providing more useful general knowledge, pseudo-labels may mislead the detector and undermine the performance if many of them are noisy. Hence, the procedure for evaluating the quality of pseudo-labels is of particular importance. In this section, the spatio-temporal consistency is proposed to generate the final pseudo-labels, which requires reliable predictions to be consistent under different augmentation transforms and across different training epochs. The generation of pseudo-labels is illustrated in Algorithm 1.
For an image $I_u$ sampled from the unlabeled set, two different affine transform operations $T_a$ and $T_b$ are applied to $I_u$ to generate different inputs. As formulated in Eq. 3, a well-behaved detector should predict consistent labels: no matter how $I_u$ is transformed, the predictions ought to be close after the outputs are transformed back to the original image space by the inverse transformations $T_a^{-1}$ and $T_b^{-1}$:
$$\mathrm{Decode}\big(T_a^{-1}(f_n(T_a(I_u)))\big) \approx \mathrm{Decode}\big(T_b^{-1}(f_n(T_b(I_u)))\big) \tag{3}$$
where Decode means decoding the keypoint coordinates from the predicted heatmaps and $f_n$ denotes the pose detector at epoch $n$. In practice, $N$ variant inputs are generated from image $I_u$ by applying different augmentations. The spatial variance of each keypoint in the image is calculated from the $N$ labels predicted for the variant inputs. If the spatial variance of a keypoint is below the spatial threshold, it is a qualified pseudo-label. As illustrated in Eq. 4, keypoints whose variance is smaller than the spatial threshold are chosen to calculate the spatial consistent pseudo-labels:
$$P_{sc}^{k} = \delta(\tau_s - \mathrm{var}_k) \cdot \frac{1}{N} \sum_{i=1}^{N} C_i^{k} \tag{4}$$
where $P_{sc}^{k}$ and $C_i^{k}$ are the spatial consistent pseudo-label coordinates and the $i$th predicted coordinate of keypoint $k$, $\tau_s$ is the spatial variance threshold, $\mathrm{var}_k$ is the spatial variance of keypoint $k$ and $\delta$ is the step function.
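The spatial consistency check can be sketched as follows. This is an illustration rather than the authors' implementation, and it rests on several assumptions: the $N$ predictions have already been mapped back to the original image space, coordinates are normalized to [0, 1], the variance of a 2D keypoint is taken as the sum of the per-axis population variances, and the pseudo-label of an accepted keypoint is the mean of its $N$ predictions.

```python
from statistics import fmean, pvariance

def spatial_consistent_labels(predictions, tau_s):
    """predictions: N poses for one image, each a list of K (x, y) tuples.

    Returns one entry per keypoint: the mean coordinate if the keypoint passes
    the variance threshold, otherwise None (rejected).
    """
    num_keypoints = len(predictions[0])
    labels = []
    for k in range(num_keypoints):
        xs = [pose[k][0] for pose in predictions]
        ys = [pose[k][1] for pose in predictions]
        var_k = pvariance(xs) + pvariance(ys)  # assumed definition of 2D variance
        labels.append((fmean(xs), fmean(ys)) if var_k <= tau_s else None)
    return labels

# Three hypothetical predictions (N = 3) for a pose with two keypoints.
preds = [
    [(0.50, 0.50), (0.10, 0.10)],
    [(0.52, 0.48), (0.90, 0.20)],
    [(0.48, 0.52), (0.20, 0.90)],
]
labels = spatial_consistent_labels(preds, tau_s=0.05)
```

Here the first keypoint is predicted almost identically under all three variants and is kept, while the second jumps across the image and is rejected.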
After the spatial consistency validation, the temporal consistency is used to further eliminate unqualified pseudo-labels. Specifically, pseudo-labels predicted in different training epochs are stored to calculate the temporal variance of each keypoint within a temporal window. Similar to the spatial consistency, keypoints with a small temporal variance are used to obtain the temporal consistent pseudo-labels, which can be formulated as:
$$P_{tc}^{k} = \delta(\tau_t - \mathrm{var}_k^{T}) \cdot \frac{1}{T} \sum_{t=n-T+1}^{n} P_{sc}^{k,t} \tag{5}$$
where $P_{tc}^{k}$ is the temporal consistent pseudo-label coordinates of keypoint $k$, $P_{sc}^{k,t}$ is the spatial consistent pseudo-label of keypoint $k$ at epoch $t$, $\mathrm{var}_k^{T}$ is its variance within the window, $T$ is the temporal window and $\tau_t$ denotes the temporal variance threshold.
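The temporal check can be sketched with a fixed-length history per keypoint. As before, this is a hedged illustration: the class name is hypothetical, the 2D variance is assumed to be the sum of per-axis population variances, and the accepted label is assumed to be the mean over the window. A keypoint is only evaluated once the window is full.

```python
from collections import deque
from statistics import fmean, pvariance

class TemporalConsistency:
    """Keeps the last `window` per-epoch labels of one keypoint."""

    def __init__(self, window, tau_t):
        self.history = deque(maxlen=window)
        self.tau_t = tau_t

    def update(self, xy):
        """Store this epoch's label; return the windowed mean if stable, else None."""
        self.history.append(xy)
        if len(self.history) < self.history.maxlen:
            return None  # not enough epochs observed yet
        xs = [p[0] for p in self.history]
        ys = [p[1] for p in self.history]
        if pvariance(xs) + pvariance(ys) <= self.tau_t:
            return (fmean(xs), fmean(ys))
        return None

checker = TemporalConsistency(window=3, tau_t=0.05)
first = checker.update((0.50, 0.50))
second = checker.update((0.51, 0.50))
stable = checker.update((0.49, 0.50))    # window full, small variance
unstable = checker.update((0.90, 0.90))  # sudden jump inside the window
```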
Algorithm 1 Pseudo-Label Generation Algorithm
Input: 1. The unlabeled dataset $U$; 2. The labeled dataset $L$; 3. The verified pseudo-labels $P_{n-1}$; 4. The supervised model $f_{n-1}$; 5. The initial predicted keypoints $P_{init}$; 6. Spatial consistent pseudo-labels $P_{sc}$; 7. Temporal consistent pseudo-labels $P_{tc}$; 8. Spatial variance threshold $\tau_s$; 9. Temporal variance threshold $\tau_t$
Output: 1. The trained model $f_n$; 2. The verified pseudo-labels $P_n$
for step = 1, 2, ... do
    Sample a mini-batch $B$ from $L \cup P_{n-1}$
    Update pose detector $f_{n-1}$ by training on batch $B$
end for
for $I_u$ in $U$ do
    for keypoint $K_i$ in $P_{init}$ do
        if spatial variance of $K_i \le \tau_s$ then
            Update $P_{sc}$ using $K_i$
        end if
        if temporal variance of $K_i \le \tau_t$ then
            Update $P_{tc}$ using $K_i$
        end if
    end for
    Generate the pseudo-labels $P_n$ by EMA
end for
return $f_n$, $P_n$

To minimize the impact of outliers, the exponential moving average (EMA) is adopted to update the pseudo-labels, and the process is formulated as:
$$P_n = \beta P_{n-1} + (1 - \beta) P_{tc} \tag{6}$$
where $P_{tc}$ is the valid pseudo-label coordinates after the spatio-temporal consistency check, $\beta$ represents the decay factor and $P_n$ is the pseudo-label generated at epoch $n$, which is combined with the labeled data to iteratively train the model.

Dual network structure
Although the spatio-temporal consistency is designed to pick out the most reliable pseudo-labels, it cannot handle the systematic error of the trained pose detector. For example, if the detector learns a wrong pattern, its predictions can be consistent but wrong. As training goes on, the wrong pattern introduces an increasing number of noisy pseudo-labels and degrades the performance. As illustrated in Fig. 2, the dual network structure employs two peer pose detectors that supervise each other and reduce the systematic error. Specifically, the peer detectors are initialized with different random seeds so that they respond differently to the same input image. During semi-supervised training, each detector generates pseudo-labels for its peer at the inference stage and is trained in a supervised way by regarding the pseudo-labels predicted by its counterpart as the ground-truth labels.
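The exchange of pseudo-labels between the two peers can be sketched with a deliberately tiny model. Everything here is a toy assumption: `ToyDetector` is a hypothetical one-parameter model, the learning rate is arbitrary, and the labeled data and the consistency filtering are left out. The sketch only shows the schedule, in which both detectors first label the data for each other and only then are updated on the labels received from the peer.

```python
class ToyDetector:
    """Hypothetical one-parameter 'detector': predicts input + bias."""

    def __init__(self, bias):
        self.bias = bias

    def predict(self, x):
        return x + self.bias

    def train_step(self, x, target, lr=0.25):
        # One gradient step on the squared error (predict(x) - target)^2 w.r.t. bias.
        self.bias -= lr * 2.0 * (self.predict(x) - target)

def cross_supervision_round(det_a, det_b, unlabeled):
    # Inference stage: each detector labels the data for its peer before any update.
    labels_from_a = [det_a.predict(x) for x in unlabeled]
    labels_from_b = [det_b.predict(x) for x in unlabeled]
    # Training stage: each detector treats its peer's predictions as ground truth.
    for x, target in zip(unlabeled, labels_from_b):
        det_a.train_step(x, target)
    for x, target in zip(unlabeled, labels_from_a):
        det_b.train_step(x, target)

# Different initial parameters play the role of different random seeds.
det_a, det_b = ToyDetector(bias=0.2), ToyDetector(bias=-0.2)
gap_before = abs(det_a.bias - det_b.bias)
cross_supervision_round(det_a, det_b, unlabeled=[1.0])
gap_after = abs(det_a.bias - det_b.bias)
```

Generating both sets of labels before either update keeps the exchange symmetric, so neither detector is trained on labels that already reflect its own latest update.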

Experimental setup
Dataset We conduct our experiments on the OpenMonkeyPose dataset [31], which consists of 195,228 images covering a large variety of monkey poses recorded by 62 cameras in different locations. For each macaque instance, 13 body landmarks are annotated to represent the pose information. The whole dataset is manually split into four sets, TRAIN, VAL, TEST and UNLABELED, with 80K, 10K, 10K and 90K images, respectively. In the ablation study, four mini-training sets with 1K, 5K, 10K and 20K images are randomly sampled from the TRAIN set, and the rest of the TRAIN set is then regarded as the corresponding unlabeled set. In some experiments, all images in the TRAIN set and the UNLABELED set are used, which will be specified.
Table note: Bold values represent the best performance among the methods compared under the same backbone. All models are trained on the whole 80K TRAIN set and tested on the 10K TEST set; the 10K VAL set is used to select the best model, the 90K UNLABELED set is utilized in our proposed SSTCL, and the backbone is the one corresponding to SimpleBaseline or HRNet. SSTCL-D denotes training with the dual network structure (DNS). Gain denotes the AP gain over the baseline method.
Metric We use the object keypoint similarity (OKS) proposed with the MS-COCO dataset [7] to evaluate the detector. It indicates the similarity between the predicted and the ground-truth monkey poses and is formulated in Eq. 7:
$$\mathrm{OKS} = \frac{\sum_{i=1}^{N} \exp\!\big(-d_i^2 / (2 s^2 k_i^2)\big)\, \delta(v_i > 0)}{\sum_{i=1}^{N} \delta(v_i > 0)} \tag{7}$$
where $N$ denotes the number of keypoint categories, $d_i$ represents the Euclidean distance between the $i$th ground-truth and predicted keypoints, $s$ and $k_i$ are the object scale and the deviation of the $i$th keypoint, respectively, and $\delta(v_i > 0)$ is the visibility function. Note that the deviation values are recomputed for the OpenMonkeyPose dataset due to the difference between humans and macaques.
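A minimal sketch of the OKS computation for a single instance follows, using the standard MS-COCO form of the metric. The coordinates, the scale and the per-keypoint deviation values in the example are made-up numbers for illustration; they are not the values recomputed for OpenMonkeyPose.

```python
import math

def oks(pred, gt, visibility, scale, kappas):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt:   lists of (x, y) keypoints in pixels.
    visibility: list of flags v_i; only keypoints with v_i > 0 are scored.
    scale:      object scale s (square root of the object area in MS-COCO).
    kappas:     per-keypoint deviation constants k_i.
    """
    total, visible = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visibility, kappas):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2.0 * scale ** 2 * k ** 2))
            visible += 1
    return total / visible if visible else 0.0

gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(10.0, 10.0), (20.0, 23.0), (0.0, 0.0)]
visibility = [1, 1, 0]        # the third keypoint is not labeled and is ignored
kappas = [0.1, 0.1, 0.1]      # hypothetical deviation values
score = oks(pred, gt, visibility, scale=30.0, kappas=kappas)
perfect = oks(gt, gt, [1, 1, 1], scale=30.0, kappas=kappas)
```

The badly predicted third keypoint does not lower the score because it is not visible, while the 3-pixel error on the second keypoint contributes exp(-0.5) instead of 1.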
Implementation details SimpleBaseline [16] and HRNet [21] are used as the baseline pose detectors and are initialized with weights pretrained on ImageNet [40]. The pose detectors are trained for 140 epochs in all experiments, and the Adam optimizer [41] is used with an initial learning rate of 1e−3, which is dropped to 1e−4 at epochs 90 and 120. Concerning data preprocessing, each image is first cropped to a size of 256 × 192 and then goes through a series of affine transforms including scale (±30%), rotation (±40°) and random flipping, as in the original work. With respect to the configuration of SSTCL, the spatial variance threshold ($\tau_s$) and the temporal variance threshold ($\tau_t$) are both set to 0.05, the number of input image variants $N$ is 5, the temporal window is 3 epochs, and the decay factor $\beta$ of the EMA is set to 0.5. We perform multi-process distributed training on eight RTX 3080 GPUs, and the batch size per GPU is set to 16.

Comparison with the state of the art
To show the generality of our method, it is applied to several baseline networks of various depths. The results in Table 1 demonstrate that the proposed approach consistently improves the performance of different baseline pose detectors. We report AP (average precision), AP.5, AR (average recall) and AR.5 as defined in the COCO API as evaluation metrics for pose estimation. They measure the accuracy and coverage of the predictions. AP averages the precision over OKS thresholds from 0.50 to 0.95 in steps of 0.05, while AP.5 is the precision at the single OKS threshold of 0.50. AR and AR.5 are the corresponding recall values, averaged over the same OKS thresholds and at an OKS threshold of 0.50, respectively. Specifically, the presented method brings a remarkable improvement of 3.0 AP to SimpleBaseline-34, and its performance is even better than that of SimpleBaseline-101 trained in a supervised way. With the larger and more powerful HRNet-w48, the improvement of our method still reaches 2.1 AP, which further shows the effectiveness of the proposed SSTCL.

Ablation study
In this section, extensive experiments are conducted to study the effectiveness of each component of the proposed method. SimpleBaseline-34 is used as the pose detector in all ablation experiments.
Smoothness-based spatio-temporal consistency learning As given in Table 2, the proposed SSTCL surpasses the baseline by a large margin, and the improvement tends to be larger when the baseline is weaker. We also observe that the proposed method achieves performance comparable to supervised learning while using fewer labeled samples, which demonstrates that it can leverage the common knowledge contained in the unlabeled images to help the pose detector predict accurate keypoint locations. More notably, without the smoothness constraint, semi-supervised learning even hurts the performance when using 1K training images, and the improvement appears once the smoothness assumption is used, which is consistent with the analysis of the smoothness assumption in Sect. 3.1.
Dual network structure As analyzed in Sect. 3.2, the DNS, which consists of two pose detectors that teach each other, is used to alleviate the degradation problem in semi-supervised learning; the network model is HRNet-w48. The experimental results in Table 2 show that the DNS further improves the performance of the proposed SSTCL. The AP values are also plotted in Fig. 3 to study why the proposed DNS brings this improvement. The orange AP curve with dual detectors is more stable and has smaller fluctuations, while the blue AP curve with one pose detector declines significantly as training proceeds. In the process of generating pseudo-labels, the two identical detectors verify each other to improve the accuracy of the pseudo-labels, which indicates that the proposed DNS can prevent the degeneration brought by noisy pseudo-labels.

Conclusion
In this work, a complete solution is proposed to address the scarcity of macaque pose estimation data, using small amounts of labeled data and large amounts of unlabeled data to improve the performance of macaque pose estimation. Specifically, a novel smoothness-based spatio-temporal consistency learning method is proposed to help the model achieve better generalization performance and produce reliable pseudo-labels for semi-supervised learning. To handle the systematic errors that the SSTCL model can make, a dual network structure lets two detectors automatically correct each other's errors during pseudo-label generation, which resolves the degradation of pseudo-label accuracy. As demonstrated by extensive experiments on the OpenMonkeyPose dataset, the proposed methods succeed in solving the problem of pseudo-label instability. Specifically, the presented method brings a remarkable improvement of 3.0 AP to SimpleBaseline-34, and its performance is even better than that of SimpleBaseline-101 trained in a supervised way. With the more powerful HRNet-w48, the improvement of our method still reaches 2.1 AP. We hope that our approach will be helpful in primate pose estimation.
Author Contributions Xue Ping and Deng Shixiong prepared the manuscript text. Xue Ping prepared Table 1, Table 2 and Algorithm 1.
Funding This research was funded by the Natural Science Foundation of Heilongjiang Province of China (F201310).

Data Availability
The public OpenMonkeyStudio dataset used in this article is available at https://github.com/OpenMonkeyStudio/OMS_Data
Declarations
Competing Interests All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Ethical Approval This work did not require ethical approval under the research governance guidelines operating at the time of the research.