MVHANet: multi-view hierarchical aggregation network for skeleton-based hand gesture recognition

Skeleton-based hand gesture recognition (SHGR) is a challenging task due to the complex articulated topology of hands. Previous works often learn hand characteristics from a single observation viewpoint, so the rich context information hidden in multiple viewpoints is disregarded. To resolve this issue, we propose a novel multi-view hierarchical aggregation network for SHGR. Firstly, two-dimensional non-uniform spatial sampling, a novel strategy for forming the extrinsic parameter distributions of virtual cameras, is presented to enumerate viewpoints from which to observe hand skeletons. Afterward, we adopt coordinate transformation to generate multi-view hand skeletons and employ a multi-branch convolutional neural network to extract multi-view features. Furthermore, we exploit a novel hierarchical aggregation network, including a hierarchical attention architecture and global context modeling, to fuse the multi-view features for final classification. Experiments on three benchmark datasets demonstrate that our work is competitive with state-of-the-art methods.


Introduction
Dynamic hand gesture (DHG), an efficient and natural communication modality for human-computer interaction (HCI), has been adopted in many applications, such as robotics control [1], sign language recognition [2], virtual assembly [3], and others. In particular, dynamic hand gesture recognition (DHGR) has remained a bottleneck for intelligent hand interaction over the past decades. Following [4], the approaches used in DHGR can be grouped into two main categories according to the types of their inputs: image-based approaches [5] and skeleton-based approaches [6,7]. Compared with images, hand skeletons are robust to variations in lighting conditions and surrounding distractions [8]. Additionally, thanks to the rapid advances of low-cost depth cameras, hand skeletons can be reliably estimated by hand pose estimation algorithms [9]. In this paper, we focus on skeleton-based hand gesture recognition (SHGR).
As for SHGR, since long short-term memory (LSTM) networks possess the ability to capture temporal dependencies, Deep-LSTM [10] and the motion feature augmented network (MFA-Net) [11] have been proposed to recognize hand gestures with LSTMs. However, LSTMs have a limited ability to learn the spatial relationships of hand joints. Convolutional neural networks (CNNs) can extract spatial and temporal features simultaneously [12]. Some CNN-based approaches [13,14] build a pseudo-image from raw hand skeletons and then feed it into CNNs for recognition. However, these approaches often use single-view hand skeletons as inputs, so the spatial structure of hand skeletons cannot be analyzed explicitly. Moreover, as mentioned in [15], some hand gestures or actions may be easier to recognize from other viewpoints. Therefore, observing the hand skeletons from reasonable viewpoints can greatly facilitate recognition.
Multi-view-based methods, namely methods that utilize multi-view information by considering the diversity of different views, have been extensively used in various recognition tasks such as human action recognition and 3D object classification [16]. The multi-view information, such as multi-view images and multi-view human skeletons, is usually recorded by several virtual cameras arranged around the 3D object or human body. A multi-view network [17] has been proposed to learn features from each view with parameter-shared CNNs and aggregate the multi-view features into a compact representation. Recently, a hierarchical architecture [18] has been proposed to aggregate multi-view features from the view level to the group level. However, on the one hand, only the correlation of local views can be learned from the hierarchical architecture, while the global context of all the views is often neglected. The global context can describe a hand gesture from a macroscopic perspective, which may improve the generalization of the model. On the other hand, skeletons from diverse viewpoints contribute differently to gesture classification, and how to learn the discriminative weights of multiple views is an urgent problem to be solved.
Additionally, as the foundation of multi-view skeleton generation, the arrangement of virtual cameras, also termed viewpoints, has a great impact on recognition performance. Hand gesture performers usually deliver gestures deliberately in line with the observer's perspective. Consequently, hand gestures observed from unreasonable views may be ambiguous. For example, the gesture "Swipe Right" may be misunderstood as "Swipe Left" if observed from behind the performer. Therefore, how to determine suitable viewpoints for capturing multi-view hand skeletons is challenging.
To tackle these challenges, as depicted in Fig. 1, we present a novel multi-view hierarchical aggregation network (MVHANet) for SHGR. Firstly, to determine an appropriate distribution of viewpoints, we propose a two-dimensional non-uniform spatial sampling (2DNUSS) strategy that considers the direction in which the gesture performer conveys information. Secondly, multi-view hand skeletons are generated by coordinate transformation and multi-view features are extracted by a multi-branch CNN. Finally, a novel hierarchical aggregation network is proposed to fuse the multi-view features effectively. Specifically, our network has a dual structure including a hierarchical attention architecture (HAA) and global context modeling (GCM). HAA is designed to learn the discriminative weights and local relationships of different views, and then aggregate the multi-view features from the view level to the group level. GCM aims to learn a global understanding of all the views in each level. The outputs of HAA and GCM are fused for gesture classification. Extensive experiments on the DHG-14/28 dataset [6], the SHREC'17 Track dataset [7] and the First-Person Dynamic Hand Actions dataset [19] demonstrate that MVHANet can achieve state-of-the-art performance.

Two-dimensional non-uniform spatial sampling
Existing sampling strategies for multi-view data generation mainly include uniform circular (UC) sampling [16] and uniform spherical (US) sampling [20]. In UC, the cameras are arranged on a horizontal circle; however, this yields a limited observation area and inevitably results in information loss. US places cameras uniformly on the surface of a sphere, but it is mainly suitable for observing 3D objects. Hand gestures need to be understood from a specific direction, since unreasonable viewpoints may lead to ambiguity. Therefore, we propose a novel sampling strategy called two-dimensional non-uniform spatial sampling (2DNUSS) to generate reasonable viewpoints for hand gestures.
As shown in Fig. 2, given the bounding sphere around the hand skeletons, we first conduct icosahedron-based loop subdivision [21] on its surface to generate a set of candidate positions, shown as green dots. The proposed 2DNUSS consists of two steps: region segmentation of the sphere and in-region sampling. In the first step, we divide the sphere into different regions along two dimensions, latitude and longitude. The location of the fixed camera P_F can be regarded as the pole of the sphere. Since hand gesture performers usually convey gesture information in line with the observer's perspective, P_F is the best viewpoint from which to observe gestures. That means the distribution of sampling points near P_F should be dense, while that away from P_F should be sparse. In the latitudinal dimension, the sampling process essentially samples the values of θ, hence the values of θ should follow a non-uniform distribution. Thus, we adopt the normal distribution as the distribution function:
$$Q(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(\theta-\mu)^2}{2\sigma^2}\right)$$

where Q denotes the probability density of θ, and μ and σ denote the mean and standard deviation of θ, respectively. Afterward, the Box-Muller transform [22] is utilized to sample θ:

$$\theta = \mu + \sigma\sqrt{-2\ln u_1}\,\cos(2\pi u_2)$$
where u_1 and u_2 denote two independent uniformly distributed random numbers. By generating L different pairs of u_1 and u_2, we can calculate the values of θ and obtain L latitude lines accordingly. In the longitudinal dimension, in order to guarantee a wide observation horizon for each viewpoint, we sample ϕ at the same angular interval of 2π/M, where M represents the number of longitude lines. In this way, the spherical surface is divided into L × M regions non-uniformly along the two dimensions. In the second step, we sample one point from the candidate positions in each region. To ensure sufficient intervals between sampling points, we sample the point at the relative center position of each region. Thus, we obtain N = L × M sampling points in total, which serve as the positions of the virtual cameras. As shown in Fig. 2, the red dots denote the final sampling result.
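To make the sampling procedure concrete, the following is a minimal sketch of the two-step strategy under stated assumptions: the fixed camera direction corresponds to θ = μ, the latitude spread is controlled by a hypothetical σ, and the function names are our own rather than the paper's.

```python
import numpy as np

def sample_latitudes(L, mu=0.0, sigma=0.5, seed=0):
    """Sample L latitude values theta via the Box-Muller transform so that
    viewpoints cluster around the fixed camera direction (theta = mu)."""
    rng = np.random.default_rng(seed)
    u1 = rng.uniform(1e-12, 1.0, size=L)          # avoid log(0)
    u2 = rng.uniform(size=L)
    z = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)   # standard normal
    return mu + sigma * z                                        # theta ~ N(mu, sigma^2)

def sample_viewpoints(L, M, mu=0.0, sigma=0.5):
    """Form L x M regions: non-uniform in latitude, uniform (2*pi/M) in longitude,
    and return one (theta, phi) viewpoint per region."""
    thetas = np.sort(sample_latitudes(L, mu, sigma))
    phis = np.arange(M) * 2.0 * np.pi / M
    return [(theta, phi) for theta in thetas for phi in phis]

viewpoints = sample_viewpoints(L=4, M=4)          # 16 virtual camera positions
```

In practice, each sampled (θ, ϕ) pair would be mapped to the nearest candidate position produced by the loop subdivision; that step is omitted here for brevity.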

Multi-view hand skeleton generation and feature extraction
The raw hand skeletons are first transformed from the fixed camera coordinate system O_F to the skeleton-centered coordinate system O, i.e., h_{t,j} = R h^F_{t,j} + D, where h_{t,j} denotes the joint coordinates under O, h^F_{t,j} denotes the joint coordinates under O_F, and R and D denote the rotation matrix and the translation matrix from O_F to O, respectively. Note that observing the hand skeleton from a certain viewpoint P_i is equivalent to rotating the hand skeleton around the axes of O. Therefore, the hand joint coordinates h^i_{t,j} under viewpoint P_i can be obtained by applying the corresponding rotation to h_{t,j}. Thus, the multi-view hand skeletons V = {H_n ∈ ℝ^{T×J×3}}_{n=1}^{N} can further be derived. Afterward, using V as inputs, we adopt a multi-branch CNN to extract the multi-view features F ∈ ℝ^{N×H×W×C}, where H × W × C denotes the dimension of the feature from a single view. The CNN is implemented with six convolution layers and four pooling layers, and is a simplified version of the hierarchical co-occurrence network (HCN) [12].
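As a rough illustration, the sketch below rotates a skeleton sequence once per viewpoint to obtain the multi-view set V; the rotation axis convention and helper names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def rotation_matrix(theta, phi):
    """Rotation about the y-axis by theta (latitude) followed by the z-axis by phi
    (longitude); the paper's exact axis convention may differ."""
    ry = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
    rz = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                   [np.sin(phi),  np.cos(phi), 0.0],
                   [0.0, 0.0, 1.0]])
    return rz @ ry

def generate_multiview_skeletons(H, viewpoints):
    """H: skeleton sequence of shape (T, J, 3) in coordinate system O.
    Returns N rotated copies, one per viewpoint (theta, phi)."""
    return [np.einsum('ij,tkj->tki', rotation_matrix(t, p), H) for t, p in viewpoints]

H = np.random.randn(32, 22, 3)                        # e.g. 32 frames, 22 joints
V = generate_multiview_skeletons(H, [(0.3, 0.0), (0.3, np.pi / 2)])
```

Each element of V would then be fed to its own CNN branch to obtain the per-view feature maps that make up F.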

Global context modeling
As depicted in Fig. 3, the global context modeling consists of two steps: global context generation (GCG) and global context fusion (GCF). As for GCG, since the generation processes of the global-view context G_view and the global-group context G_group are similar, we take the generation of G_view as an example. As shown in Fig. 4, the multi-view features F are first concatenated along the channel dimension to form a concatenated feature F_concat. Combining average pooling and max pooling can provide more effective statistical characteristics for inferring channel-wise attention [24]. Therefore, we learn the view weight W_view by applying a sigmoid function σ to the combined average-pooled and max-pooled statistics of F_concat. After that, the features F are fused into G_view according to W_view by dot-product. The global-group context G^k_group can likewise be derived by GCG when the input is the group feature G^k, where k denotes the index of the grouping, K denotes the number of groupings, and k ∈ [1, K].
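The sketch below illustrates one plausible form of GCG, assuming a channel-attention design in the spirit of [24]: a shared MLP applied to the average- and max-pooled statistics of F_concat, followed by a sigmoid, producing one weight per view. The MLP width and the exact weighting granularity are assumptions, not the paper's verified design.

```python
import torch
import torch.nn as nn

class GlobalContextGeneration(nn.Module):
    """Sketch of GCG: infer per-view weights from pooled statistics of the
    concatenated multi-view feature, then fuse the views by a weighted sum."""
    def __init__(self, num_views, channels, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_views * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_views),
        )

    def forward(self, F):                        # F: (B, N, C, H, W)
        b, n, c, h, w = F.shape
        F_concat = F.reshape(b, n * c, h, w)     # concatenate views along channels
        avg = F_concat.mean(dim=(2, 3))          # (B, N*C) average-pooled statistics
        mx = F_concat.amax(dim=(2, 3))           # (B, N*C) max-pooled statistics
        w_view = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # (B, N) view weights
        return (w_view.view(b, n, 1, 1, 1) * F).sum(dim=1)     # G_view: (B, C, H, W)
```

The same module, applied to the group features G^k instead of F, would yield the global-group context G^k_group.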
Then, G_view and G^k_group are sent to GCF for feature fusion. Based on element-wise addition, we incorporate a convolution operation to adaptively fuse the global contexts, which can be formulated as

$$G_f^k = \mathrm{Conv}\big(G_{view} + G_{group}^k\big)$$

where Conv(·) denotes a 1 × 1 convolution and G_f^k denotes the fused global context after the k-th grouping. In this way, G_f^K serves as the final global context G_F.
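A correspondingly small sketch of GCF, continuing the module above (and reusing its imports), fuses the two global contexts by element-wise addition followed by a 1 × 1 convolution:

```python
class GlobalContextFusion(nn.Module):
    """Sketch of GCF: G_f^k = Conv1x1(G_view + G_group^k)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, G_view, G_group):          # both: (B, C, H, W)
        return self.conv(G_view + G_group)       # fused global context
```

Iterating this fusion over the K groupings would produce G_f^K, which serves as the final global context G_F.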

Hierarchical attention architecture
To capture the dependencies of local views, we design a novel hierarchical attention architecture (HAA) with a global context-guided attention mechanism. As shown in Fig. 1, the multi-view features F are first clustered into different groups at the view level. Then, we learn the discriminative weights of local views, also called the view-level attention α. Based on G_view, we design a similarity function between each view feature and the global context to calculate α, where p indexes the positions in the feature map and ε is a very small positive number that prevents the denominator from being zero. A higher score means that a view feature and the global context are more similar. In order to aggregate the features in the same group, we use a softmax function to normalize the view-level attention into α_{v,g}, where g denotes the index of the group. Then, the gth group feature G_g can be derived by view-level attentional aggregation

$$G_g = \sum_{v=1}^{V_g} \alpha_{v,g}\, f_{v,g}$$

where f_{v,g} denotes the vth feature in group g and V_g denotes the total number of views in the group.
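The following sketch shows view-level attentional aggregation for one group, using a cosine-style similarity between each view feature and G_view as a stand-in for the paper's similarity function (whose exact form is not reproduced here); ε plays the same role of keeping the denominator nonzero.

```python
import torch

def view_level_aggregation(F_group, G_view, eps=1e-8):
    """F_group: (V, C, H, W) view features in one group; G_view: (C, H, W).
    Returns the aggregated group feature and the normalized attention weights."""
    v = F_group.flatten(1)                        # (V, C*H*W)
    g = G_view.flatten(0).unsqueeze(0)            # (1, C*H*W)
    scores = (v * g).sum(-1) / (v.norm(dim=-1) * g.norm(dim=-1) + eps)   # (V,)
    alpha = torch.softmax(scores, dim=0)          # view-level attention alpha_{v,g}
    G_g = (alpha.view(-1, 1, 1, 1) * F_group).sum(dim=0)                 # (C, H, W)
    return G_g, alpha
```

Group-level aggregation in the next step follows the same pattern, with the group features G^k and the global-group context G^k_group in place of the view features and G_view.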
At the group level, after obtaining the group features G^1 = {G_g}_{g=1}^{N_G^1} from the first grouping, a recurrent clustering [18] is conducted on G^1 for further grouping until the clusters do not change. Then, G^k_group can also be derived by GCG. Given G^k and G^k_group as inputs, the group-level attention β^k_g can be calculated by Eq. (7). After the last grouping, we learn the fusion feature F_fusion by group-level aggregation according to β^K_g:

$$F_{fusion} = \sum_{g=1}^{N_G^K} \beta_g^K\, G_g^K$$

where N_G^K denotes the number of groups after K groupings.
Finally, in order to leverage the complementary information from the local features and the global context, we concatenate F_fusion and G_F along the channel dimension into the comprehensive feature F_C for gesture classification.

Implementation details
The proposed MVHANet is implemented on the PyTorch platform with an NVIDIA 2080Ti GPU. The stochastic optimization method Adam [25] is adopted to optimize the parameters of the entire network. We set the learning rate to 1e-3 and utilize cross-entropy to calculate the loss. The batch size for training is 64 and the dropout rate is set to 0.5. The loss value of the model tends to be stable when the model is trained for 50 epochs. We employ one-dimensional linear interpolation to sample T = 32 frames from each hand sequence as inputs.
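For concreteness, the snippet below sketches the reported input preprocessing and optimization settings; `MVHANet` is only a placeholder for the model described above, and the data pipeline is assumed.

```python
import torch
import torch.nn.functional as F

def resample_sequence(seq, T=32):
    """One-dimensional linear interpolation of a (T_orig, J, 3) skeleton sequence
    to T frames, used to build fixed-length inputs."""
    t_orig, j, _ = seq.shape
    x = seq.permute(1, 2, 0).reshape(1, j * 3, t_orig)                # (1, J*3, T_orig)
    x = F.interpolate(x, size=T, mode='linear', align_corners=True)   # (1, J*3, T)
    return x.reshape(j, 3, T).permute(2, 0, 1)                        # (T, J, 3)

clip = resample_sequence(torch.randn(45, 22, 3))    # -> torch.Size([32, 22, 3])

# Reported optimization settings (Adam, lr 1e-3, cross-entropy, batch size 64,
# dropout 0.5, ~50 epochs); `model` stands for an MVHANet instance:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# criterion = torch.nn.CrossEntropyLoss()
```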

Datasets
In this work, we adopt three benchmark datasets to evaluate the effectiveness of our MVHANet. The SHREC'17 Track dataset (SHREC'17) [7] is a challenging dataset for SHGR. The individuals perform 14 kinds of gestures using one finger or the whole hand. Each frame contains the three-dimensional coordinates of 22 hand joints. Following the evaluation protocol in [7], 1960 sequences are employed as the training set and 840 sequences as the testing set. The DHG-14/28 dataset (DHG-14/28) [6] adopts the same collection method as SHREC'17. All of the sequences are performed by 20 subjects, and the leave-one-subject-out cross-validation strategy is used for evaluation. The First-Person Dynamic Hand Actions dataset (FPHA-45) [19] comprises 1175 hand action sequences, and each frame contains 21 hand joints. A total of 600 action sequences are employed for training and the remaining 575 for testing.

Different sampling ranges
As shown in Table 1, we compare different ranges of three sampling strategies: uniform circular sampling (UC), uniform spherical sampling (US) and our 2DNUSS. As our baseline, we sample ten viewpoints at equal intervals on a circle. To ensure fairness, we sample the same number of viewpoints in the subsequent experiments.
Compared to UC, the accuracy is improved when we adopt US, which means that multiple viewpoints can increase the diversity of hand gesture information. Furthermore, when we narrow the range down to a half sphere (−π/2 < θ < π/2), where the angle θ is shown in Fig. 2, the accuracy is further increased. It suggests that observing the hand gestures from behind may lead to ambiguity; when this unreasonable sampling range is excluded, the performance of our model improves. When we narrow the range down to a quarter of the sphere (0 < θ < π/2), the accuracy increases to 93.21%, 86.35% and 83.64% for SHREC'17, DHG-14/28 and FPHA-45, respectively, suggesting that observing the hand gestures from below is also unreasonable. However, when we narrow the range to 0 < θ < π/4, the accuracy decreases, likely because the sampling range becomes too small and lacks sufficient multi-view information. Note that the sampling ranges have a greater impact on SHREC'17 and DHG-14/28 than on FPHA-45. The reason is that many gestures in SHREC'17 and DHG-14/28 are related to movement direction and are therefore sensitive to the sampling range. Furthermore, when we adopt the proposed 2DNUSS, the accuracy is better than that of the other two sampling methods regardless of the sampling range, which demonstrates the superiority of non-uniform sampling.

Different numbers of viewpoints
As described in Sect. 2.1, the number of sampling points is L × M. To reduce the parameters of the models, we let L be equal to M. Besides, the initial viewpoint is also included, so the actual number of viewpoints is N = L² + 1; for example, L = 4 yields N = 4² + 1 = 17 viewpoints. We assign different values to L to change the number of viewpoints. As shown in Table 2, for all the datasets, the recognition accuracy increases as the number of viewpoints increases, which suggests that more viewpoints provide richer multi-view information. However, when N increases to 17, the accuracy hardly improves anymore, meaning that simply increasing the number of viewpoints yields limited improvement in accuracy.

Different fusion formats
We mainly consider two fusion formats: score-level fusion and feature-level fusion. In score-level fusion, each single-view feature is passed through a classifier to obtain a score or probability, and these scores are then fused for prediction. We adopt averaging (Ave) and maximum (Max) to perform score-level fusion. In feature-level fusion, all the single-view features are fused into a comprehensive feature for prediction. The methods used for feature-level fusion include average pooling (AP), max pooling (MP), the hierarchical architecture (HA), the hierarchical attention architecture (HAA) and global context modeling (GCM). All the experimental results are shown in Table 3. Compared with the single-view CNN (SVCNN), both score-level fusion and feature-level fusion improve the accuracy by more than 3%. This indicates that the diverse and informative cues embedded in multi-view features have a significant impact on accuracy. As for score-level fusion, the accuracies of averaging and maximum are similar, which means the choice of score-level fusion method has only a weak effect on the results. As shown in Table 3, MP is better than AP; the reason is that AP loses the specific information of each viewpoint. Compared with HA, our HAA brings further improvements, suggesting that HAA can learn the importance of different features and fuse them effectively. When we combine the outputs of HAA and GCM, the accuracy increases by 1.1%, 0.5% and 2.0% for SHREC'17, DHG-14/28 and FPHA-45, respectively. This demonstrates that both the local and the global information of multi-view features are beneficial for recognition.
In addition, we further analyze the convergence of the different feature-level fusion methods; the results are shown in Fig. 5. For SHREC'17 and DHG-14/28, our model starts to converge after 20 epochs, while it needs 30 epochs for FPHA-45. One possible reason is that FPHA-45 has more categories and fewer training samples. Overall, compared with the other methods, "HAA + GCM" has a higher convergence rate and smaller oscillations on all datasets, which proves the effectiveness of our method.

SHREC'17 Track Dataset
As shown in Table 4, compared with the traditional method [26], CNN- and GCN-based methods make great improvements, which demonstrates the superiority of deep networks for recognition tasks. Our method outperforms ST-GCN [28] by nearly 2% and 5% under the 14- and 28-gesture settings, respectively. Our method also achieves performance competitive with the excellent GCN-based method [29]. We attribute this to our HAA and GCM structure for the integration of multi-view skeleton information.
The confusion matrix of our method on SHREC'17 is shown in Fig. 6. Our method achieves a recognition accuracy of over 90% for most gestures. In particular, for gestures that express the direction of hand movement, such as "Swipe Right" and "Swipe Left," our model performs excellently, suggesting that observing hand skeletons from reasonable viewpoints provides helpful information for recognition. Besides, we find that "Grab" is sometimes misclassified as "Pinch" by our model. The reason is that these two gestures are physically similar and small changes in joint movements are not easily recognized.

DHG-14/28 Dataset
It can be seen from Table 5 that our method achieves state-of-the-art results with an accuracy of 92.36% on the 14-gesture setting and 89.56% on the 28-gesture setting. Compared with ST-GCN [28], our proposed method brings 1.1% and 2.4% improvements in recognition accuracy for the 14- and 28-gesture settings, respectively. This indicates that our method, by taking advantage of multi-view hand skeletons, can encode more spatial characteristics of hands.

First-Person Dynamic Hand Actions Dataset
For FPHA-45, comparisons with other state-of-the-art methods are shown in Table 6. Compared with ST-GCN [28], our method brings a 6% improvement in recognition accuracy, showing that our method has greater potential in recognizing hand movements with more categories. As shown in Fig. 7, there is only a slight difference in the spatial structures of the hand skeletons for the "open wallet" and "open soda can" actions. In the future, we may need to introduce object information and utilize multimodal-based methods to improve the recognition performance.

Conclusion
In this paper, a novel MVHANet is proposed for SHGR. To derive a reasonable distribution of viewpoints for observing hand skeletons, we propose the 2DNUSS strategy. To cope with the limitations of the existing hierarchical architecture, we exploit a novel hierarchical aggregation network including HAA and GCM. The experimental results demonstrate that our method achieves state-of-the-art performance. In the future, we will explore how to effectively capture the dynamic characteristics of finger movements.