Lightweight human pose estimation algorithm based on polarized self-attention

In recent years, human pose estimation has been widely used in human-computer interaction, augmented reality, video surveillance, and many other fields, but the task still faces many challenges. To address the large parameter counts and computational complexity of current mainstream human pose estimation networks, this paper proposes a lightweight pose estimation network (Lightweight Polarized Network, LPNet) based on a polarized self-attention mechanism. First, ghost convolution is used to reduce the number of parameters of the feature extraction network. Second, a polarized self-attention module is introduced to better solve the pixel-level regression task, compensating for the feature loss caused by the reduced parameter count and improving the accuracy of human keypoint regression. Finally, a new coordinate decoding method is designed to reduce the error in the heatmap decoding process and further improve keypoint regression accuracy. The proposed method was evaluated on the COCO and MPII human keypoint detection datasets and compared with current mainstream methods. The experimental results show that the proposed method greatly reduces the number of model parameters while incurring only a small loss in accuracy.


Introduction
Human pose estimation is an important research direction in the field of computer vision. Its purpose is to locate the coordinates of human keypoints in video or image data. This task is a preprocessing step for many visual tasks, such as pose tracking and human action recognition, and it has rich application scenarios, such as virtual fitting [31] and motion monitoring. Current research on conventional human pose estimation networks proceeds along the directions of deepening the network, enlarging the resolution of the feature maps, and designing networks at different resolutions for multi-scale feature fusion and feature extraction. Such networks require high-performance computing equipment and face many problems, such as large parameter counts, long training times, and difficulty deploying on low-performance computing equipment, and hence they cannot be implemented in practical applications. Therefore, under the premise of little loss in keypoint detection accuracy, further reducing the number of model parameters is a problem to be solved in current human pose estimation tasks. Human pose estimation approaches based on deep learning can be divided into top-down and bottom-up methods. The top-down method first performs human object detection on the input image to obtain human objects with bounding boxes. Each bounding box is then cropped to the size of a single human body, and feature extraction is performed using the pose estimation network to obtain the coordinates of each keypoint of the human body. In 2016, Wei et al. [25] designed the convolutional pose machine network, which uses convolutional layers to express texture and spatial information and adopts a multi-stage structure to improve the detection performance for individual keypoints. In 2017, Fang et al.
[6] designed the regional multi-person pose estimation network, focusing on the problems of bounding-box positioning error and repeated detection in the object detection stage of top-down methods. The human body bounding box is optimized by a spatial transformer network, which overcomes the influence of object detection errors on the subsequent keypoint detection task. In 2018, Chen et al. [4] designed the cascaded pyramid network (CPN), which focuses on the difficulty of detecting different types of joint points and uses a two-stage design, GlobalNet and RefineNet, to further improve detection accuracy for harder (e.g., occluded) keypoints. In 2019, Sun et al. designed a more representative network, HRNet (high-resolution network) [19], characterized by a new parallel multi-resolution fusion architecture that better extracts high-resolution features and improves detection performance for small and medium-sized people. In 2021, Rawal et al. [11] designed the MIPNet structure to better cope with crowding in pose estimation. Although networks designed with the top-down approach can guarantee keypoint detection accuracy, their parameter counts are large, and they are difficult to port to lightweight devices.
The bottom-up method first performs global keypoint detection on the input image to obtain all keypoints in the image. Then, using the positional relationships of human joints, a clustering algorithm groups the joint points into multiple independent sets of human keypoints. In 2017, Cao et al. [1] proposed OpenPose and designed a classic keypoint grouping algorithm, part affinity fields, which simultaneously encodes the position and orientation of joint points to balance keypoint detection speed and accuracy. In 2018, George et al. [18] proposed PersonLab, which combines a heatmap with offsets to predict joint positions and better handles mutual occlusion between joint points. In 2020, Cheng et al. [5] designed HigherHRNet, an improved version of HRNet applied to the bottom-up approach. A performance gap remains between bottom-up and top-down methods, and hence, in 2021, Geng et al. [7] applied adaptive convolution to the keypoint regression part of the pose estimation task, further advancing the performance of bottom-up methods. Although networks designed with the bottom-up approach can better cope with occlusion in human pose estimation, they still fall short in balancing model accuracy and parameter count.
Lightweight network research includes both the exploration of network structure design and the application of model compression techniques such as knowledge distillation and model pruning, which further promote the application of deep learning in mobile and embedded devices. Lightweight network design refers to reducing the number and complexity of model parameters while maintaining model accuracy. MobileNet, proposed by Google in 2017 [9], was the first convolutional neural network that is small in size, low in computational complexity, and suitable for mobile devices; it relies mainly on depthwise separable convolution and careful structural design to reduce network parameters. In the same year, Zhang et al. proposed ShuffleNet [29], which mainly adopts pointwise convolution and a channel shuffle structure that greatly reduces the model's computation while preserving its accuracy. In 2020, Kai et al. [8] designed GhostNet, which overcomes the feature redundancy of ordinary convolution, further improving model speed and reducing computation. In 2021, Yu et al. [28] designed Lite-HRNet, which integrates ShuffleNet into a high-resolution network and reduces computational complexity while improving performance. Network models such as MobileNet and ShuffleNet make the model smaller and faster by employing a more efficient network structure rather than by compressing or migrating a large trained model; the advantage of an efficient network structure is that it can be applied more readily to visual tasks.
We choose to design a lightweight network to cope with the large parameter counts and computational complexity of pose estimation networks. The main contributions of the proposed method are as follows: 1. We design an efficient, lightweight human pose estimation network that maintains high detection accuracy while reducing the number of network parameters. 2. We combine the ghost module and the polarized self-attention mechanism to design a lightweight PSA module, which replaces the basic blocks in the feature extraction network; this reduces the number of network parameters while retaining important spatial and channel information to preserve model accuracy. 3. We design a new coordinate decoding method that effectively addresses the quantization error in the coordinate decoding process, decodes the predicted heatmap keypoints accurately, refines the final regressed keypoint coordinates, and improves keypoint detection accuracy.
In summary, we propose a lightweight method for human pose estimation, redesign the high-resolution network, and conduct experiments on two mainstream datasets, MPII and COCO, to verify the effectiveness of the designed network. The rest of this paper is organized as follows: Sect. 2 briefly describes the existing methods applied in the network design highlighted in this paper; Sect. 3 introduces the proposed method; Sect. 4 describes the experimental results in detail; and Sect. 5 presents the conclusions.

High-resolution network
Because high-resolution networks maintain high-resolution representations throughout the network, they are widely used in pixel-level regression tasks, such as semantic segmentation and human pose estimation, and have achieved remarkable results. Most current pose estimation methods use high-resolution networks as the backbone, and pose estimation networks proposed in the past two years, such as HigherHRNet [5] and DEKR [7], have also been designed and improved on this basis. A high-resolution network can improve the extraction of local joint information, and hence the commonly used high-resolution network HRNet was selected as the basic network for the proposed method. Its structure differs from the traditional serial structure, in which feature information at different resolutions cannot be fused through simple connections, resulting in poor joint regression. HRNet instead uses a parallel design to fuse information between feature maps of different resolutions, realizing multi-scale feature fusion through repeated cross-parallel convolutions that enhance the high-resolution feature information, so that the entire network maintains a high-resolution representation. This improves the accuracy of joint regression in human pose estimation. A brief overview of the HRNet structure is shown in Fig. 1.
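The parallel multi-resolution fusion idea can be illustrated with a minimal two-branch sketch in PyTorch. This is a hedged illustration of the fusion pattern only, not HRNet's actual configuration: the channel counts (32/64) and single fusion step are assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """Minimal sketch of HRNet-style multi-resolution fusion with two
    parallel branches: the high-resolution branch receives upsampled
    low-resolution features, and the low-resolution branch receives
    downsampled (strided-conv) high-resolution features."""
    def __init__(self, c_hi=32, c_lo=64):
        super().__init__()
        self.hi = nn.Conv2d(c_hi, c_hi, 3, padding=1)   # high-res branch conv
        self.lo = nn.Conv2d(c_lo, c_lo, 3, padding=1)   # low-res branch conv
        self.lo_to_hi = nn.Conv2d(c_lo, c_hi, 1)        # 1x1 conv, then upsample
        self.hi_to_lo = nn.Conv2d(c_hi, c_lo, 3, stride=2, padding=1)

    def forward(self, x_hi, x_lo):
        y_hi, y_lo = self.hi(x_hi), self.lo(x_lo)
        fused_hi = y_hi + F.interpolate(self.lo_to_hi(y_lo),
                                        size=y_hi.shape[2:], mode="nearest")
        fused_lo = y_lo + self.hi_to_lo(y_hi)
        return fused_hi, fused_lo

hi, lo = torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24)
out_hi, out_lo = TwoBranchFusion()(hi, lo)
print(out_hi.shape, out_lo.shape)
```

The key property is that both resolutions are kept alive in parallel and exchanged repeatedly, rather than being serially downsampled and then recovered.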

Attention mechanisms
In recent years, attention mechanisms [21,22] have been widely used in various computer vision tasks. The main function of an attention mechanism is to improve a feature extraction network's ability to extract pixel information in pixel-level regression tasks, overcome the loss of spatial information in traditional convolution operations, and achieve better regression results for subtle joints in pose estimation tasks. Attention mechanisms can be roughly divided into two categories: strong (hard) attention and soft attention. Because strong attention makes random, dynamically changing predictions, its performance is good but its application is very limited owing to its non-differentiable nature. By contrast, soft attention is differentiable everywhere and can be trained by gradient descent, so it is much more widely applied. Soft attention mechanisms are divided according to the dimension over which attention operates; the current mainstream mechanisms fall into three types: channel attention, spatial attention, and self-attention. Channel attention models the correlation between different channels (feature maps), automatically learns the importance of each feature channel, and finally assigns a different weight to each channel; these weight coefficients are used to reweight the feature maps. Representative methods include SENet [10] and ECANet [24]. Spatial attention improves the feature expression of key regions: the spatial information in the original image is transformed into another space, key information is retained through a spatial transformation module, and a weight mask is generated for each position and applied to the output, thereby enhancing specific target regions of interest while weakening irrelevant background regions. Representative methods include CBAM [26] and A²Net [3].
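As a concrete illustration of channel attention, the squeeze-and-excitation idea behind SENet can be sketched in a few lines of PyTorch. This is a minimal sketch, not the exact SENet implementation; the reduction ratio of 16 follows common practice and is an assumption here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SENet-style channel attention: squeeze (global average pooling)
    followed by excitation (two FC layers) producing per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight each channel

x = torch.randn(2, 32, 16, 16)
print(SEBlock(32)(x).shape)  # torch.Size([2, 32, 16, 16])
```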
Self-attention is a variant of the attention mechanism whose purpose is to reduce dependence on external information and, as much as possible, use the information inherent in the features themselves for attention interaction. In the self-attention mechanism, an attention tensor is computed from each input tensor, and the input is then reweighted by that attention tensor. Following its success in sequence modeling and generative modeling tasks, self-attention has become a standard component for capturing long-range interactions. Representative methods include NLNet [23], GCNet [2], and SCNet [15].

Proposed method
Our goal is to build a model that addresses two problems in pose estimation tasks: the large parameter counts of mainstream networks and the difficulty of real-time detection. We first describe the overall structure of the model and the improved parts. We then detail how each significant component of the network architecture was designed: first, a thorough analysis of the attributes of the ghost module and the PSA module; next, the design ideas behind the lightweight PSA module; and finally, a detailed analysis of the principle of the new coordinate decoding method.

LPNet architecture
Because of the particular characteristics of pixel-level regression tasks, high-resolution networks perform better on them, and the design presented in this paper is therefore based on a high-resolution network. As shown in Fig. 2, the network is divided into two parts. The first part improves feature extraction: it introduces the lightweight method into the feature extraction network and uses the designed lightweight PSA module to replace the basic modules in the four stages, reducing the number of feature extraction parameters. Moreover, it learns finer pixel-level information in both the channel and spatial dimensions, overcoming some shortcomings of traditional convolutional networks and ensuring the efficiency of the feature extraction process. The second part consists of a new coordinate decoding method that overcomes the error of the traditional heatmap coordinate decoding process and improves the accuracy of the joint coordinates decoded from the heatmap.

Ghost convolution module
Because human pose estimation must often be deployed on lightweight embedded devices, we propose to reduce the number of parameters of the pose estimation network by designing a lightweight network. Ghost convolution is primarily used in this paper to redesign the backbone network. The original feature extraction network consists of many convolutions, which incurs a large computational overhead. In recent years, MobileNet and ShuffleNet have introduced depthwise convolution and channel shuffle operations to build efficient convolutional neural networks with fewer floating-point operations (FLOPs), but their 1 × 1 convolution layers still occupy a considerable amount of memory and FLOPs.
Unlike the above two networks, ghost convolution proceeds in two steps. First, a normal convolution is used to obtain real feature maps with a small number of channels; then a cheap operation passes the real features through a linear transform to obtain similar ("ghost") feature maps. The real feature maps are identity-mapped and concatenated with the similar feature maps to form the output. The ghost convolution module is shown in Fig. 3.

Fig. 2 Architecture of LPNet. LPNet is divided into two main parts. The first part of the network extracts the features of the input image and predicts the generation of joint heatmaps. The second part decodes the coordinates of the predicted heatmap to obtain the joint point coordinates.
In the specific calculation, assume the input is $X \in \mathbb{R}^{c \times h \times w}$, where c is the number of input channels and h and w are the height and width of the input. The convolutional layer generates the feature map as follows:

$$Y = f * X + b \quad (1)$$

where $*$ represents the convolution operation, b is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with n channels, and $f \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution filter of this layer. In addition, h′ and w′ are the height and width of the output, and k × k is the kernel size of filter f. For this general convolution, the number of floating-point operations is $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$; because the number of filters n and the number of channels c are usually large, this count is usually very large.
As shown in Eq. (1), the number of parameters to be optimized (in f and b) is determined by the dimensions of the input and output feature maps. The output of an ordinary convolutional layer contains redundant feature maps, some of which are very similar to one another; generating such feature maps wastes a great deal of computation. If these feature maps are instead obtained by linear transformation from a subset of real (intrinsic) feature maps, the amount of calculation is significantly reduced. Such intrinsic features are usually few in number and are produced by ordinary convolution. Specifically, m intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$ are generated by one convolution:

$$Y' = f' * X \quad (2)$$

The filter in Eq. (2) is $f' \in \mathbb{R}^{c \times k \times k \times m}$, where m is less than the number of convolution kernels n (the bias term is omitted for simplicity). The other hyperparameters are consistent with ordinary convolution so that the output feature map size is preserved. To obtain the required n feature maps, an inexpensive linear transform is applied to each intrinsic feature in Y′, producing s ghost features:

$$y_{ij} = \Phi_{i,j}\left(y'_i\right), \quad i = 1, \dots, m, \quad j = 1, \dots, s \quad (3)$$

where $y'_i$ is the ith intrinsic feature map in Y′ and $\Phi_{i,j}$ is the jth linear operation used to generate the jth ghost feature map $y_{ij}$; that is, each $y'_i$ can yield one or more ghost feature maps $\{y_{ij}\}_{j=1}^{s}$. The last operation $\Phi_{i,s}$ preserves the identity mapping of the intrinsic feature map. Using these inexpensive linear operations, $n = m \cdot s$ feature maps $Y = [y_{11}, y_{12}, \dots, y_{ms}]$ are obtained as the output of the ghost module. Each linear operation acts on a single channel, so its computational cost is much lower than that of an ordinary convolution.
In terms of computational complexity, the ghost convolution has one identity mapping and $m \cdot (s - 1) = \frac{n}{s} \cdot (s - 1)$ linear operations, and the average convolution kernel size in each linear operation is d × d. Suppose the input tensor is c × h × w, denoting the number of input channels and the feature map's height and width. After one convolution, the output tensor is n × h′ × w′, denoting the number of output channels and the height and width of the output feature map. The regular convolution kernel size is k, and the linear-transform kernel size is d. After s transformations, the computation of the ordinary convolution (BN and ReLU are excluded from this comparison) relative to that of the ghost convolution gives the theoretical speedup ratio:

$$r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s-1}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \quad (4)$$

In Eq. (4), n/s is the number of output channels in the first (ordinary) transformation, and the factor s − 1 appears because the identity mapping requires no computation, so only s − 1 of the s linear transforms are actually performed; the approximation assumes d × d ≈ k × k and s ≪ c. Hence, ghost convolution can significantly reduce the computational effort. The corresponding parameter compression ratio is

$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c \cdot k \cdot k}{c \cdot k \cdot k + (s-1) \cdot d \cdot d} \approx s \quad (5)$$

According to Eqs. (4) and (5), replacing the standard convolution kernel with ghost convolution yields an approximately s-fold decrease in both the computational effort and the number of parameters of the model. Moreover, ghost convolution can be easily embedded into other network models in a plug-and-play manner. Still, some extracted features may be lost when reducing the number of parameters and amount of computation. Therefore, the use of ghost convolution should be weighed against specific requirements, and reducing the number of parameters should not be pursued blindly at the expense of model performance.
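The two-step ghost convolution described above can be sketched in PyTorch. This is a hedged reading of the ghost module, not the authors' exact code: the cheap linear transform is realized as a depthwise convolution, and `ratio` plays the role of s. Comparing parameter counts against a plain convolution illustrates the roughly s-fold reduction.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of ghost convolution: a standard conv produces
    m = out_ch // ratio intrinsic maps, then a cheap depthwise conv
    generates the remaining (ratio-1)*m 'ghost' maps."""
    def __init__(self, in_ch, out_ch, kernel=1, ratio=2, dw_kernel=3):
        super().__init__()
        m = out_ch // ratio                       # intrinsic channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, m, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(m), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(               # depthwise = cheap linear op
            nn.Conv2d(m, m * (ratio - 1), dw_kernel,
                      padding=dw_kernel // 2, groups=m, bias=False),
            nn.BatchNorm2d(m * (ratio - 1)), nn.ReLU(inplace=True))
        self.out_ch = out_ch

    def forward(self, x):
        y = self.primary(x)                       # intrinsic maps (identity-kept)
        return torch.cat([y, self.cheap(y)], dim=1)[:, :self.out_ch]

def n_params(mod):
    return sum(p.numel() for p in mod.parameters())

ghost = GhostModule(64, 128, kernel=3, ratio=2)
plain = nn.Conv2d(64, 128, 3, padding=1, bias=False)
print(n_params(ghost), n_params(plain))  # ghost uses roughly half (s = 2)
```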

PSA module
To solve the problem of computational complexity and memory explosion that arises when channels and spatial positions are modeled simultaneously without dimension reduction, the PSA mechanism was proposed. The PSA mechanism adopts polarized filtering, analogous to an optical lens: during photography, transverse light is reflected and refracted, and a polarizing filter only allows light orthogonal to the transverse direction to pass through, improving imaging contrast. However, total intensity is lost during filtering, and the filtered light usually has a small dynamic range, so additional amplification is needed to restore the details of the original scene. The design of the PSA mechanism (Fig. 4) is based on these ideas. It compresses the current features in one direction and restores the lost intensity range, and is divided into the following two main structures: (i) the filtering module, which completely collapses the features of one dimension (such as the channel dimension) while keeping the orthogonal dimension (such as the spatial dimension) at high resolution, and (ii) the HDR (high dynamic range) module, in which the softmax function is applied to the smallest features in the attention module to increase the attention range, and the sigmoid function is used for dynamic mapping.
As shown in Fig. 4, the PSA module is divided into two branches, the channel branch and the spatial branch. When the input only passes through the channel branch, 1 × 1 convolution is used to convert the input feature X into Q and V, where the channel of Q is completely compressed and the channel of V retains its higher dimension ( C 2 ). Because the channel of Q is compressed, based on the idea of the PSA mechanism, information needs to be converted to HDR, so the softmax function is used to enhance the information of Q. Then, matrix multiplication is performed between Q and V, and 1 × 1 convolution and LayerNorm are used to restore the channel dimension to C. Finally, the sigmoid function is used to normalize all parameters.
The weight of the channel branch is expressed as $A^{ch}(X) \in \mathbb{R}^{C \times 1 \times 1}$ and is calculated as follows:

$$A^{ch}(X) = F_{SG}\left[ W_{z}\left( \sigma_1\left(W_v(X)\right) \times F_{SM}\left( \sigma_2\left(W_q(X)\right) \right) \right) \right] \quad (6)$$

where $W_q$, $W_v$, and $W_z$ are 1 × 1 convolution layers ($W_z$ is followed by layer normalization); $\sigma_1$ and $\sigma_2$ are tensor reshape operators; $F_{SM}(\cdot)$ is the softmax function; $F_{SG}(\cdot)$ is the sigmoid function; and "×" represents matrix multiplication.

Fig. 4 Structure of the PSA module. The module maintains high channel and spatial resolution: the input tensor is collapsed along the corresponding channel and spatial dimensions, and the important features in the different dimensions are enhanced using softmax and sigmoid. Compared with conventional convolution, this module has less computational overhead and retains more important pixel-level information.

The softmax function is defined as $F_{SM}(X) = \sum_{j=1}^{N_p} \frac{e^{x_j}}{\sum_{m=1}^{N_p} e^{x_m}} x_j$, and the internal number of channels between $W_v|W_q$ and $W_z$ is $C/2$. The output of the channel-only branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X \in \mathbb{R}^{C \times H \times W}$, where $\odot^{ch}$ is the channelwise multiplication operator.
When the input only passes through the spatial branch, as in the channel branch, 1 × 1 convolution is used to convert the input feature X into Q and V, and for feature Q, the spatial dimension is compressed by global pooling and converted to a size of 1 × 1 ; by contrast, the spatial dimension of feature V remains high ( H × W ). Because the spatial dimension of Q is compressed, based on the idea of the PSA mechanism, the softmax function is used to enhance the information of Q. Then, matrix multiplication is performed between Q and V, a matrix transform is used to reshape the result, and the sigmoid function is used to normalize all parameters.
The weight of the spatial branch is expressed as $A^{sp}(X) \in \mathbb{R}^{1 \times H \times W}$ and is calculated as follows:

$$A^{sp}(X) = F_{SG}\left[ \sigma_3\left( F_{SM}\left( \sigma_1\left(F_{GP}\left(W_q(X)\right)\right) \right) \times \sigma_2\left(W_v(X)\right) \right) \right] \quad (7)$$

where $W_q$ and $W_v$ are both standard 1 × 1 convolution layers with $C/2$ internal channels; $\sigma_1$, $\sigma_2$, and $\sigma_3$ are tensor reshape operators; and $F_{SM}(\cdot)$ is the softmax function. Furthermore, $F_{GP}(\cdot)$ denotes the global pooling operator, $F_{GP}(X) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X(:, i, j)$, and "×" represents matrix multiplication. The output of the spatial-only branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X \in \mathbb{R}^{C \times H \times W}$, where $\odot^{sp}$ is the spatial multiplication operator.
The channel and spatial branches are combined in parallel as follows:

$$PSA_p(X) = Z^{ch} + Z^{sp} = A^{ch}(X) \odot^{ch} X + A^{sp}(X) \odot^{sp} X \quad (8)$$

where + represents the elementwise addition operator. In contrast to other self-attention mechanisms, PSA retains the highest attention resolution in both the channel ($C/2$) and spatial ([W, H]) dimensions and can capture finer channelwise and spatial details when processing pixel-level tasks. In addition, the channel-only branch adopts softmax re-weighting together with squeeze and excitation, an approach from which both the Squeeze-and-Excitation Network (SENet) and the Global Context Network (GCNet) benefit. The spatial-only branch not only maintains full spatial resolution but also retains more learnable parameters internally for nonlinear softmax re-weighting, a more powerful structure than existing self-attention mechanisms. Because of these advantages, PSA can achieve optimal performance improvements on pixel-level regression tasks.
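The parallel PSA layout can be sketched in PyTorch as follows. This is a hedged sketch based on the description above (the channel branch compresses space while keeping C/2 channels; the spatial branch compresses channels while keeping H × W), not the reference implementation; details such as the exact placement of LayerNorm may differ.

```python
import torch
import torch.nn as nn

class PSAParallel(nn.Module):
    """Sketch of the polarized self-attention (PSA) module, parallel layout."""
    def __init__(self, c):
        super().__init__()
        self.ch_q = nn.Conv2d(c, 1, 1)            # channel branch: collapse C
        self.ch_v = nn.Conv2d(c, c // 2, 1)
        self.ch_z = nn.Conv2d(c // 2, c, 1)
        self.ln = nn.LayerNorm([c, 1, 1])
        self.sp_q = nn.Conv2d(c, c // 2, 1)       # spatial branch
        self.sp_v = nn.Conv2d(c, c // 2, 1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, c, h, w = x.shape
        # channel-only branch: A_ch in R^{C x 1 x 1}
        q = self.softmax(self.ch_q(x).view(b, 1, h * w))      # HDR via softmax
        v = self.ch_v(x).view(b, c // 2, h * w)
        z = torch.matmul(v, q.transpose(1, 2)).view(b, c // 2, 1, 1)
        a_ch = torch.sigmoid(self.ln(self.ch_z(z)))
        # spatial-only branch: A_sp in R^{1 x H x W}
        q = self.sp_q(x).mean(dim=(2, 3), keepdim=True)       # global pooling
        q = self.softmax(q.view(b, 1, c // 2))
        v = self.sp_v(x).view(b, c // 2, h * w)
        a_sp = torch.sigmoid(torch.matmul(q, v).view(b, 1, h, w))
        return a_ch * x + a_sp * x                # parallel composition

x = torch.randn(2, 64, 16, 16)
print(PSAParallel(64)(x).shape)  # torch.Size([2, 64, 16, 16])
```

Note that both attention maps broadcast against the input, so the output keeps the full C × H × W shape.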

Lightweight PSA module
To meet the requirement for a lightweight network for the human pose estimation task, the ghost convolution and PSA modules were combined into a redesigned block, called the lightweight PSA module, as shown in Fig. 5. This module is similar to the BasicBlock module in a high-resolution network: it extracts features while reducing the number of parameters of the overall network. The lightweight PSA module consists mainly of two ghost convolutions and a PSA module. The first ghost convolution expands the number of channels, and the data are then processed by normalization and ReLU functions. The processed data are sent to the PSA module to capture finer channelwise and spatial features while maintaining high resolution, with almost no increase in the number of parameters or calculations. The data are then normalized again and fed to the second ghost convolution, which restores the original number of channels. Finally, following the residual structure principle, the result is summed with the input feature map to obtain the final output.
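The block structure described above (ghost conv expand, attention, ghost conv restore, residual sum) can be sketched as follows. This is a structural sketch only: `GhostConv` is a compact stand-in without internal batch normalization, the expansion factor of 2 is an assumption, and `nn.Identity()` stands in for the PSA module purely for illustration; in the real block a PSA module would be plugged in as `attn`.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Compact ghost convolution: half the channels from a real 1x1 conv,
    half from a cheap depthwise transform of those real maps."""
    def __init__(self, cin, cout):
        super().__init__()
        m = cout // 2
        self.primary = nn.Conv2d(cin, m, 1, bias=False)
        self.cheap = nn.Conv2d(m, cout - m, 3, padding=1, groups=m, bias=False)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class LightweightPSABlock(nn.Module):
    """Sketch: ghost conv (expand) -> BN/ReLU -> attention -> BN ->
    ghost conv (restore) -> residual sum with the block input."""
    def __init__(self, channels, expansion=2, attn=None):
        super().__init__()
        mid = channels * expansion
        self.g1 = GhostConv(channels, mid)
        self.bn1, self.act = nn.BatchNorm2d(mid), nn.ReLU(inplace=True)
        self.attn = attn if attn is not None else nn.Identity()  # PSA goes here
        self.bn2 = nn.BatchNorm2d(mid)
        self.g2 = GhostConv(mid, channels)

    def forward(self, x):
        y = self.act(self.bn1(self.g1(x)))
        y = self.g2(self.bn2(self.attn(y)))
        return x + y                               # residual connection

x = torch.randn(1, 32, 16, 16)
print(LightweightPSABlock(32)(x).shape)  # torch.Size([1, 32, 16, 16])
```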

New coordinate decoding method
The ultimate goal of the human pose estimation task is to obtain the coordinate positions of each joint point of the human body in the original image. After predicting the heatmap of human joint points through the pose estimation network, the corresponding resolution recovery is required to convert the results back to the original coordinate space. This conversion process is called coordinate decoding.
The traditional coordinate decoding method is designed according to the empirical performance of specific models. Given the heatmap h predicted by the trained model, the peak m and the sub-peak s (the location of the second-largest activation) are determined. The joint position is predicted as follows:

$$p = m + 0.25 \frac{s - m}{\|s - m\|_2} \quad (9)$$

where $\|\cdot\|_2$ represents the magnitude of the vector. Equation (9) shifts the prediction 0.25 pixels from the largest activation toward the second-largest activation in heatmap space. The final coordinate prediction in the original image is

$$\hat{p} = \lambda p \quad (10)$$

where $\lambda$ is the resolution reduction ratio.
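The standard shift-by-0.25 decoding of Eqs. (9) and (10) can be sketched in NumPy. The `stride=4.0` default reflects the common 1/4-resolution heatmap and is an assumption of this sketch.

```python
import numpy as np

def standard_decode(heatmap, stride=4.0):
    """Standard decoding: peak m shifted 0.25 px toward the second-largest
    activation s (Eq. 9), then scaled back to image space (Eq. 10)."""
    flat = heatmap.flatten()
    top2 = np.argsort(flat)[-2:][::-1]            # indices of m, then s
    m = np.array(np.unravel_index(top2[0], heatmap.shape), dtype=float)[::-1]
    s = np.array(np.unravel_index(top2[1], heatmap.shape), dtype=float)[::-1]
    d = s - m
    n = np.linalg.norm(d)
    p = m + 0.25 * d / n if n > 0 else m          # shift toward sub-peak
    return stride * p                             # (x, y) in image space

hm = np.zeros((8, 8))
hm[3, 4], hm[3, 5] = 1.0, 0.8                     # peak at (x=4, y=3)
print(standard_decode(hm))                        # [17. 12.]
```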
The main purpose of the pixel shift in Eq. (9) is to compensate for the quantization error caused by the downsampling operation. The predicted maximum activation position in the heatmap is not equal to the exact position of each joint point in the original coordinate space. Instead, it is only a rough estimate. Hence, this paper introduces a new decoding strategy.
The new coordinate decoding method focuses on the distribution structure of the predicted heatmap to infer a more accurate maximum activation position. To accurately locate the underlying maximum, it is assumed that the predicted heatmap, like the real heatmap, follows a two-dimensional Gaussian distribution. The predicted heatmap is therefore expressed as

$$G(x; \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right) \quad (11)$$

where x represents a pixel location in the predicted heatmap and $\mu$ is the Gaussian mean (center) corresponding to the joint location to be estimated. The covariance $\Sigma$ is a diagonal matrix, consistent with the coordinate encoding process:

$$\Sigma = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix} \quad (12)$$

where $\sigma$ is the standard deviation in both directions. To simplify the analysis, a logarithmic transform is applied to Eq. (11) and the derivative is taken:

$$P(x; \mu, \Sigma) = \ln G = -\ln(2\pi) - \frac{1}{2} \ln|\Sigma| - \frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \quad (13)$$

The ultimate goal is to estimate $\mu$. Because $\mu$ is an extreme point of the distribution, the first derivative at $\mu$ must satisfy

$$D'(\mu) = \left. \frac{\partial P}{\partial x} \right|_{x=\mu} = -\Sigma^{-1} (x - \mu) \Big|_{x=\mu} = 0 \quad (14)$$

To exploit this condition, Taylor's theorem is used: $P(\mu)$ is approximated by a Taylor series (up to the quadratic term) expanded at the maximum activation m of the predicted heatmap:

$$P(\mu) \approx P(m) + D'(m)^{\top} (\mu - m) + \frac{1}{2} (\mu - m)^{\top} D''(m) (\mu - m) \quad (15)$$

Here, $D''(m)$ is the second derivative (Hessian) of P evaluated at m:

$$D''(m) = \left. \frac{\partial^2 P}{\partial x^2} \right|_{x=m} = -\Sigma^{-1} \quad (16)$$

The point m is chosen as the expansion point because it represents the optimal joint prediction close to $\mu$. Differentiating Eq. (15), applying the extremum condition of Eq. (14), and combining Eqs. (14) to (16) yields

$$\mu = m - \left( D''(m) \right)^{-1} D'(m) \quad (17)$$

Here, $D''(m)$ and $D'(m)$ can be efficiently estimated from the heatmap. Once $\mu$ is available, the coordinates in the original image space can be predicted using Eq. (10).
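The refinement $\mu = m - D''(m)^{-1} D'(m)$ described above can be sketched with finite differences on the log-heatmap. This is a hedged sketch: estimating the derivatives by central differences at the peak is one common realization, and the boundary fallback simply returns the unrefined peak.

```python
import numpy as np

def taylor_decode(heatmap, stride=4.0, eps=1e-10):
    """Distribution-aware decoding sketch: assume a Gaussian heatmap, take
    the log, estimate D'(m) and D''(m) at the peak m by finite differences,
    and refine mu = m - D''(m)^-1 D'(m)."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    p = np.log(np.maximum(heatmap, eps))
    if 1 <= x < w - 1 and 1 <= y < h - 1:
        dx = 0.5 * (p[y, x + 1] - p[y, x - 1])         # first derivatives D'(m)
        dy = 0.5 * (p[y + 1, x] - p[y - 1, x])
        dxx = p[y, x + 1] - 2 * p[y, x] + p[y, x - 1]  # Hessian entries D''(m)
        dyy = p[y + 1, x] - 2 * p[y, x] + p[y - 1, x]
        dxy = 0.25 * (p[y + 1, x + 1] - p[y + 1, x - 1]
                      - p[y - 1, x + 1] + p[y - 1, x - 1])
        hess = np.array([[dxx, dxy], [dxy, dyy]])
        if np.linalg.det(hess) != 0:
            offset = -np.linalg.solve(hess, np.array([dx, dy]))
            return stride * (np.array([x, y], dtype=float) + offset)
    return stride * np.array([x, y], dtype=float)      # boundary fallback

# Gaussian centred at (4.3, 3.6) on an 8x8 grid; the peak pixel is (4, 4)
yy, xx = np.mgrid[0:8, 0:8]
hm = np.exp(-((xx - 4.3) ** 2 + (yy - 3.6) ** 2) / (2 * 1.5 ** 2))
print(taylor_decode(hm, stride=1.0))                   # close to [4.3, 3.6]
```

For an exact Gaussian the log-heatmap is quadratic, so the finite-difference refinement recovers the sub-pixel center essentially exactly.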
In contrast to standard methods that only consider the second-largest activation in the heatmap, the new coordinate decoding fully exploits the heatmap's distribution statistics to reveal the underlying maximum more accurately. Theoretically, the method is based on a principled distribution approximation under the assumption of consistent training supervision, namely that the heatmap follows a Gaussian distribution. The method is also very computationally efficient, requiring only the first and second derivatives at one location in each heatmap. It can therefore be easily integrated into existing heatmap-based human pose estimation pipelines, further reducing heatmap decoding error without increasing the number of parameters or the amount of computation.

Experimental results and analysis
The experiments reported in this paper were run on an Ubuntu 18.04.6 LTS 64-bit operating system on a computer equipped with an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10 GHz, 188.6 GiB RAM, and an RTX 3090 GPU, with CUDA v11.0.207, cuDNN v8.2, PyTorch v1.8.0, and Python v3.6.13. The pre-trained network parameters were taken from a model trained on the ImageNet dataset. The Adam optimizer was used, the initial learning rate was set to 0.001, and the learning rate decay coefficient was 0.1: the learning rate was decayed to 10⁻⁴ after 170 epochs and to 10⁻⁵ after 200 epochs. Training ended after 210 epochs. Datasets: The MPII dataset is a mainstream human pose estimation dataset with single-person and multi-person data. It includes 25,000 annotated images of more than 40,000 people, all sourced from YouTube videos. The test set also includes annotations such as body part occlusion, three-dimensional torsos, and head orientation.
The COCO dataset is a large, rich dataset for object detection, segmentation, and captioning. It targets scene understanding and was mainly collected from complex everyday scenes, with objects in the images annotated by precise segmentation. The full dataset covers 91 object classes with 328,000 images and 2,500,000 labels; the commonly used release covers 80 categories across more than 330,000 images, 200,000 of which are labeled, making it one of the largest segmentation datasets available. The number of person instances in the entire dataset exceeds 1.5 million.

Evaluation indicators
The MPII dataset uses the percentage of correct keypoints (PCK) to evaluate experimental performance. PCK is defined as the proportion of correctly estimated keypoints, i.e., keypoints whose normalized distance to the corresponding ground-truth label is less than a set threshold. PCK is calculated as
\[
\mathrm{PCK}^{k}_{i}=\frac{\sum_{p}\delta\!\left(\dfrac{d_{pi}}{d^{\mathrm{def}}_{p}}\le T_{k}\right)}{\sum_{p}1},\qquad
\mathrm{PCK}^{k}_{\mathrm{mean}}=\frac{\sum_{i}\mathrm{PCK}^{k}_{i}}{\sum_{i}1}
\]
Here, \(i\) indexes the \(i\)th keypoint, \(k\) indexes the \(k\)th threshold \(T_{k}\), and \(p\) indexes the \(p\)th person. \(d_{pi}\) is the Euclidean distance between the predicted and manually labeled positions of the \(i\)th keypoint of the \(p\)th person, and \(d^{\mathrm{def}}_{p}\) is the scale factor of the \(p\)th person; how this factor is calculated differs between public datasets. The MPII dataset uses the head diameter of the current person as the scale factor, i.e., the distance between the upper-left point \(LT\) and the lower-right point \(RB\) of the head. The threshold \(T_{k}\) is set manually, with \(T_{k}\in[0:0.01:0.1]\); \(\mathrm{PCK}^{k}_{i}\) denotes the PCK of the \(i\)th keypoint under threshold \(T_{k}\), and \(\mathrm{PCK}^{k}_{\mathrm{mean}}\) the mean PCK over all keypoints under threshold \(T_{k}\).
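The PCK computation above can be sketched in a few lines of NumPy. The array shapes and names below are our own assumptions, not part of any standard evaluation toolkit:

```python
import numpy as np

def pck(pred, gt, scale, thresholds):
    """Per-joint PCK over a batch of people.

    pred, gt   : (P, K, 2) predicted / ground-truth keypoint coordinates
    scale      : (P,) per-person normalization factor d_p^def (head size on MPII)
    thresholds : iterable of threshold values T_k

    Returns a (len(thresholds), K) matrix of per-joint PCK values and the
    per-threshold mean PCK over all joints.
    """
    # Normalized distances d_pi / d_p^def, shape (P, K).
    d = np.linalg.norm(pred - gt, axis=-1) / scale[:, None]
    # Fraction of people for which each joint falls within threshold T_k.
    per_joint = np.stack([(d <= t).mean(axis=0) for t in thresholds])
    return per_joint, per_joint.mean(axis=1)
```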
The experimental evaluation index of the COCO dataset is the object keypoint similarity (OKS), computed as
\[
\mathrm{OKS}=\frac{\sum_{j}\exp\!\left(-\dfrac{d_{j}^{2}}{2s^{2}k_{j}^{2}}\right)\delta(v_{j}>0)}{\sum_{j}\delta(v_{j}>0)}
\]
where \(d_{j}\) is the Euclidean distance between the detected keypoint coordinates and the ground truth, \(v_{j}\) indicates whether the \(j\)th keypoint of the human body is visible, \(s\) is the scale of the detection target, and \(k_{j}\) is the per-keypoint attenuation coefficient.
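For a single person, the OKS equation translates directly into code. This is an illustrative sketch with names of our choosing, not the official COCO evaluation implementation:

```python
import numpy as np

def oks(pred, gt, vis, s, k):
    """Object keypoint similarity for one person.

    pred, gt : (J, 2) detected / ground-truth keypoint coordinates
    vis      : (J,) visibility flags v_j (> 0 means labeled)
    s        : object scale
    k        : (J,) per-keypoint attenuation constants k_j
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)          # squared distances d_j^2
    e = np.exp(-d2 / (2 * s ** 2 * k ** 2))         # per-keypoint similarity
    labeled = vis > 0
    # Average similarity over labeled keypoints only.
    return e[labeled].sum() / max(labeled.sum(), 1)
```

A perfect prediction yields OKS = 1, and the similarity decays smoothly as the distance grows relative to the object scale.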
In the experiments, the OKS is used to determine AP50 (the average precision when the OKS threshold equals 0.5) and AP75 (the average precision when the OKS threshold equals 0.75). Furthermore, mean average precision (mAP) is the AP averaged over a range of OKS thresholds, AP^M is the average precision for medium-scale human bodies, and AP^L is the average precision for large-scale human bodies.

Analysis of the results
The LPNet algorithm proposed in this paper is compared with other advanced pose estimation algorithms from recent years. Table 1 shows the results on the MPII validation set. LPNet uses approximately a quarter of the parameters of the baseline network HRNet, yet achieves a 0.5 percentage point improvement in accuracy. Compared with other recent pose estimation methods, LPNet is superior in both parameter count and accuracy. Example results from the MPII validation set are shown in Fig. 6. Table 2 shows the experimental results on the COCO val2017 dataset. When the input resolution is 256 × 192, the AP of LPNet is 74.0, only 0.4 percentage points below the baseline HRNet, while using far fewer parameters than the baseline network. Increasing the input image scale and the number of input channels further improves the detection accuracy: when the input resolution is 384 × 288 and the number of channels is 48, the best performance is achieved.

Analysis of the ablation results
In this study, ablation experiments were performed on the COCO dataset using a high-resolution network with 32 input channels as the backbone and an input image size of 256 × 192. The ablation experiment gradually replaced the basic feature extraction modules in the four stages of the high-resolution network with the lightweight PSA module. The results are shown in Table 3, where 0 indicates that no basic modules were replaced and 1–4 indicate that the basic modules were replaced in the corresponding stages. As Table 3 shows, as the basic modules in the feature extraction network are replaced stage by stage with the lightweight PSA module, the number of parameters decreases rapidly while the average precision decreases only slightly; thanks to the support of the PSA module, the average precision remains 74.0. The data in the table show that replacing the basic modules in all four stages with the lightweight PSA module achieves the best trade-off between network parameter count and keypoint detection accuracy.
Furthermore, ablation experiments were performed on the PSA module and the new coordinate decoding (NCD) method. The ghost module, PSA module, and new decoding method were added to the base network in turn. The results in Table 4 show that adding only the ghost module sharply reduces the number of parameters and GFLOPs but causes a large loss in model accuracy. Adding the NCD method further improves accuracy with almost no increase in parameters or computation. Adding the PSA module improves the accuracy of the network model while increasing the parameter count only slightly. The final results show that combining the ghost module, the PSA module, and the NCD method achieves the best pose estimation performance. In addition to the above experiments, an independent comparison of the proposed NCD method was conducted on the COCO dataset using HRNet as the baseline (Fig. 8). The keypoint visualization results show that the NCD method further refines and corrects the predicted keypoints, bringing them closer to the labeled human keypoint coordinates.

Conclusions
The LPNet proposed in this paper is an improved version of the high-resolution network. The ghost module was combined with the PSA module to form a lightweight PSA module that replaces the basic module in the feature extraction network, reducing the number of network parameters while retaining the accuracy of the model. In the final heatmap decoding stage, a new coordinate decoding method was introduced that further improves keypoint detection accuracy and refines the predicted joint coordinates. The advantages of this network are its lightweight architecture, high scalability, and ease of use, offering a new approach to the challenges of complex models and large parameter counts in current pose estimation tasks. Extensive experimental results on different datasets demonstrate that the model generalizes well. How to further reduce the network's parameters and deploy the model on embedded devices while substantially improving keypoint detection accuracy will be the focus of future research.