Direction-guided two-stream convolutional neural networks for skeleton-based action recognition

In skeleton-based action recognition, treating skeleton data as pseudoimages processed by convolutional neural networks (CNNs) has proven effective. However, most existing CNN-based approaches model information at the joint level and ignore the size and direction of the skeleton edges, which play an important role in action recognition; such approaches may therefore be suboptimal. In addition, existing approaches rarely use the directionality of human motion to portray how an action varies over time, although doing so is more natural and reasonable for action sequence modeling. In this work, we propose a novel direction-guided two-stream convolutional neural network for skeleton-based action recognition. In the first stream, our model focuses on our defined edge-level information (edge and edge_motion information) with directionality in the skeleton data to explore the spatiotemporal features of the action. In the second stream, since motion is directional, we define different skeleton edge directions and extract different motion information (translation and rotation information) in each direction to better exploit the motion features of the action. In addition, we propose a description of human motion as a combination of translation and rotation, and we explore how to integrate the two. We conducted extensive experiments on two challenging datasets, the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, to verify the superiority of our proposed method over state-of-the-art methods. The experimental results demonstrate that the proposed direction-guided edge-level information and motion information complement each other for better action recognition.


Introduction
Human action recognition involves endowing a computer with "human intelligence" so that it can recognize types of human actions. It plays a key role in "human-centered computing" and forms one of the important branches of human activity research (Trelinski and Kwolek 2021). With the rapid development of digital image processing and intelligent hardware manufacturing technology, human action recognition has been widely used in human-computer interaction, virtual reality, industrial systems, healthcare, and rehabilitation. Compared with red-green-blue (RGB) and depth data, skeleton data are a high-order representation of the human body that has the advantages of simplicity and easy storage (Yun et al. 2021). In recent years, skeleton-based action recognition methods have received much attention.
Since skeleton data are time-series data, many early works introduced recurrent neural network (RNN) architectures designed for time-series data to solve such problems (Ren et al. 2020). However, RNN-based models mainly focus on modeling the temporal features of skeleton data, ignoring the natural connections between human body joints, and it is difficult for them to build deep networks that extract deeper features (Krizhevsky et al. 2012). To better extract the spatial structure information of skeleton data, graph convolutional network (GCN) architectures that model the human skeleton as a graph have been applied, but encoding graph information is complex and diverse and requires a large number of parameters to train the model. Because of the advantages of lightweight modeling and easily building deep networks, convolutional neural network (CNN) models are extensively used. Most researchers regard the 3D coordinates of the joint data as the R, G, and B channels of an image or define higher-order features such as the distance and trajectory of joint data, encode the original joint information into pseudoimages, and use an image recognition method to solve the skeleton-based action recognition problem. Although the above CNN-based methods are effective in improving recognition accuracy, some problems remain. First, most researchers have chosen to model joint-level information, rarely considering action recognition with edge-level information (representing the size and direction of skeleton edges), which takes the body's own connections into account and can play an important role in action recognition (Shi et al. 2019). Second, in the study of motion variation, previous work has often neglected the importance of skeleton edge direction for motion description, as joint-level information is limited; motion is directional, and higher-order motion variation features (translation, rotation, etc.) guided by different directions may better characterize action motion information. Third, most of the above works only consider the rationality of a single type of feature, without defining an effective way to describe human motion for feature fusion.
To solve the above problems, we propose a new model that integrates the use of skeleton information. First, we use the directed edge vectors and their motion information, which we call edge-level information, to extract the spatiotemporal characteristics of human action. Compared with traditional methods using joint-level information, edge-level information better characterizes the natural connectivity of the human body. Second, we design the translational and rotational variations of the motion based on the edge vectors in two different directions, which we call the motion information of human action. Finally, we propose combining translation and rotation information to extract integrated motion information, and we construct a new DG-2sCNN architecture to synthesize the edge-level information and motion information. The experimental results show that the DG-2sCNN architecture can be well integrated into existing CNN frameworks and that the two streams are effectively complementary.

Feature definition for skeleton-based human action recognition
To describe skeleton sequences more precisely, researchers usually define joint-level or higher-order features of the raw skeleton information to describe human actions. Du et al. (2015) proposed mapping the 3D coordinates of the skeleton data to the R, G, and B channels of an image, transforming the skeleton data into pseudoimages so that action recognition can be performed with an image processing method. Wang et al. (2018) employed a joint trajectory map of the motion to represent human action and projected this trajectory map onto three Cartesian planes for a three-way fusion of scores, which synthesizes joint-level information from the skeleton. To explore higher-order features between joints, Li et al. (2017) introduced distance encoding between joints as a feature map representation, synthesizing information from four spaces for action recognition. Caetano et al. (2019) performed feature extraction by calculating magnitude and direction values of the skeleton to re-encode the skeleton sequence and map it into an image. Qin et al. (2021) combined spatiotemporal GCNs to extract higher-order action features by defining several special angular encodings between joints. Motivated by the above studies, we first consider the possibility of modeling at the edge level; second, since the edge vectors of the human body exist in different directions, we design higher-order features in different directions to capture direction-specific motion features.

Two-stream CNN-based human action recognition
In CNN-based action recognition, two-stream architectures have been introduced to better portray human actions. Jing et al. (2020) extracted the appearance and optical flow features of each video frame to model the spatial structure information of the action. To better distinguish similar actions, Hou et al. (2021) proposed a spatial two-stream attention mechanism that first employs a network of multiple spatial transformers in parallel to locate discriminative regions associated with human actions; feature fusion between local and global features then enhances the representation of human actions. Liu et al. (2017a) proposed mapping skeleton joints to a 3D coordinate space, encoding the temporal and spatial information separately, and extracting information from the two streams using a 3D CNN model. To improve the model's ability to recognize new actions in one shot, Liu et al. (2019) introduced semantic information into action recognition, proposing a combination of action and body streams that emphasizes the subject part of each new class of actions. To explore the effect of different color spaces on action recognition, Hou et al. (2016) encoded skeleton edge motion and velocity information in HSV space and encoded the spatiotemporal skeleton sequence as spectral maps for feature extraction. To use joint and edge information together, Shi et al. (2020) presented a two-stream CNN considering both joint and edge features and used asymmetric convolutional blocks to reduce the effects of skeleton sequence rotation and deformation. Building on the above work, we propose using edge-level information to represent the spatiotemporal features of an action and motion information to represent its motion variation, forming two streams of information for human action recognition.

Methods
In this section, the proposed DG-2sCNN method is described. First, a preliminary introduction is given, covering the skeleton data and the CNN modeling approach for skeletons. This is followed by the construction of the edge-level and motion information and the introduction of the partial and co-occurrence feature learning unit (PCFLU), which is modified with reference to previous work. Finally, the feature fusion approach is presented.

Introduction of skeleton data
As shown in Fig. 1, skeleton data portray the human body state using the 3D coordinate information of 25 joint points in different parts of the human body. In particular, we divide these joint points into five parts: the trunk, left upper limb, right upper limb, left lower limb, and right lower limb. In this study, we likewise divide the edge vectors formed by connecting adjacent joint points within these parts into the same five parts, as shown in Table 1.
Modeling of skeleton data

A CNN modeling approach is used for the skeleton data, as shown in Fig. 2: CNN-based skeleton action recognition constructs a "pseudoimage" from a skeleton sequence, with the number of frames, the skeleton points (or skeleton edges), and the 3D coordinates playing the roles of the length, width, and channels of an image, respectively. Specifically, we use 24 skeleton edges, 32 frames, and 3D coordinates for modeling.
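For concreteness, the following minimal sketch (our own illustration, not the authors' released code) shows this pseudoimage layout; the input name and the uniform time-resampling strategy are assumptions:

```python
import numpy as np

# A minimal sketch of the pseudoimage layout described above.
def to_pseudoimage(edge_seq: np.ndarray, n_frames: int = 32) -> np.ndarray:
    """Map an edge sequence of shape (T, 24, 3) to a (3, 32, 24) pseudoimage:
    3D coordinates -> channels, frames -> height, skeleton edges -> width."""
    idx = np.linspace(0, len(edge_seq) - 1, n_frames).astype(int)  # resample time
    return edge_seq[idx].transpose(2, 0, 1)
```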

Edge-level information construct
The edge-level information comprises both edge and edge_motion information. Most traditional CNN-based methods choose skeleton point-level information to extract the spatiotemporal features of an action. Since the human body is a chain structure, the skeleton edges formed by connecting adjacent skeleton points may be more consistent with the physical structure of the human body, so we use directed skeleton edge-level information to characterize an action. A diagram of the skeleton edge-level information construction is shown in Fig. 4.
First, we choose the base of the spine, which is more stable during movement, as the center point of all joints. Then, we define the direction pointing to the center as the inward direction and construct each edge vector formed by the connection of adjacent joints in this direction, called the edge information. In time, we extract the variations in motion of the edge vectors between adjacent frames, called the edge_motion information. Specifically, suppose skeleton sequence J consists of T frames, each frame has N joints, and j_i^t denotes the (x, y, z) coordinates of the ith joint in the tth frame. E and E_M represent the edge information and edge_motion information, respectively. They are formulated as follows:

$$E^t(m, n) = j_n^t - j_m^t, \quad (m, n) \in D,$$
$$E_M^t(m, n) = E^{t+1}(m, n) - E^t(m, n),$$

where t denotes the frame, m and n are two adjacent joints, and D is the set of edge vectors in Table 1 (for example, "(21,5)" represents the edge between joint 21 and joint 5), so the number of elements in D is 24.

Fig. 3 Direction-guided two-stream CNN architecture. First, the skeleton edge sequence is constructed from the skeleton point sequence; then the skeleton edge sequence is divided into five parts to obtain the CNN-based model, which is shown in the rectangular square (the specific modeling is shown in Fig. 2). Finally, action classification is performed using the constructed network model
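A compact sketch of the edge-level quantities E and E_M defined above is given below; it assumes a joint array of shape (T, 25, 3) and an edge set D of inward-oriented (m, n) pairs, and the function name is illustrative:

```python
import numpy as np

def edge_features(joints, D):
    """joints: (T, 25, 3) array; D: list of 24 inward-oriented (m, n) pairs."""
    # E^t(m, n) = j_n^t - j_m^t : directed edge vector for each pair in D.
    E = np.stack([joints[:, n] - joints[:, m] for (m, n) in D], axis=1)  # (T, 24, 3)
    # E_M^t(m, n) = E^{t+1}(m, n) - E^t(m, n): edge_motion between frames.
    E_M = E[1:] - E[:-1]                                                 # (T-1, 24, 3)
    return E, E_M
```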

Different directions of motion information generation
The motion information comprises both translation and rotation information. Since human action can be decomposed into translational and rotational motion, to extract the variation information of the motion, we first define edge information in two different directions and then, combining the motion characteristics of the human skeleton, design distance or angle features along the different edge directions to represent the translational and rotational variation of the motion, that is, the translation information and the rotation information, respectively. The details are as follows. Using the method introduced in Sect. 3.2, we first define the edge vectors in both the inward and outward directions. Then, as shown in Fig. 5, different motion features are designed according to the motion characteristics of the skeleton edges. As shown in Fig. 5(a), in the inward direction, the motion involves mainly variations of the edge vector anchor points (gray dots in the figure), and we use the distance D generated by the anchor point of each edge in that direction between adjacent frames to describe the motion, called the translation information of the motion. As shown in Fig. 5(b), in the outward direction, the motion is more related to the change in the edge vector itself, and we use the angle A generated by each edge between adjacent frames in this direction to portray the motion, called the rotation information of the motion.

Fig. 4 Diagram of the definition of edge-level information with directions. The red node "center" represents the base of the spine joint. Taking the skeleton edge connecting the wrist joint and the elbow joint as an example, "T" and "T+1" represent the positions of the elbow and wrist joints in adjacent frames, represented by the yellow solid line and yellow dashed line, respectively. Therefore, for this skeleton edge, the edge vector information at frame T is shown as the yellow solid line and the edge_motion information is shown as the green solid line
Specifically, D_n() and A_n() denote the distance between two points and the angle between two edges in Euclidean n-space, respectively. Suppose r = (r_1, ..., r_n) and s = (s_1, ..., s_n) are two points and u = (u_1, ..., u_n) and v = (v_1, ..., v_n) are two edges; D_n() and A_n() are calculated as:

$$D_n(r, s) = \sqrt{\sum_{i=1}^{n} (r_i - s_i)^2},$$
$$A_n(u, v) = \arccos\left(\frac{u \cdot v}{\|u\| \, \|v\|}\right).$$
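In code, these two quantities are straightforward transcriptions (the epsilon guard against zero-length edges and the clipping are our additions):

```python
import numpy as np

def D_n(r: np.ndarray, s: np.ndarray) -> float:
    """Euclidean distance between points r and s in R^n."""
    return float(np.sqrt(np.sum((r - s) ** 2)))

def A_n(u: np.ndarray, v: np.ndarray) -> float:
    """Angle between edges u and v in R^n."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding
```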

Partial and co-occurrence feature learning unit
Consider that an action usually involves interactions and combinations only among the set of joints participating in that action. For example, when drinking water, a joint-level feature set consisting of the elbow, wrist, and head joints characterizes the action well. In the HCN method, the first step uses two layers of convolution kernels of size 1 in the joint dimension to partially aggregate the coordinate information of the joints. The second step converts the joint dimension to the channel dimension using a channel transformation to synthetically extract the co-occurrence features between joints. We follow the main HCN idea and introduce a partial and co-occurrence feature extraction method into our module. Unlike in HCN, the features in this module are edge-level and motion information; for the same action of drinking water, the set of edge features of the hands, arms, and head also characterizes the action well. Specifically, as shown in Fig. 6, for a given edge vector input, features are encoded with convolution layers that keep the dimensionality of the edge axis constant and extract partially aggregated information in time, i.e., partial feature learning. Then, the edge dimension is swapped with the channel dimension through a dimensional transformation, and the co-occurrence information between different edge features in time is extracted using two convolution layers, i.e., co-occurrence feature learning.

Fig. 5 Schematic representation of the definition of motion information in different directions. Taking the edge connecting the wrist joint with the elbow joint as an example and representing the positions of the elbow and wrist at adjacent frames, "center" represents the base of the spine joint; the distance variation and angle variation are shown in Fig. 5a and b, respectively

Fig. 6 Partial and co-occurrence feature learning unit
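The following PyTorch sketch illustrates the two stages; the channel widths, kernel sizes, and absence of pooling are our own illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

# A schematic sketch of the PCFLU, following the HCN-style two-stage design.
class PCFLU(nn.Module):
    def __init__(self, in_ch: int = 3, mid_ch: int = 32,
                 out_ch: int = 64, n_edges: int = 24):
        super().__init__()
        # Partial feature learning: kernels of size 1 along the edge axis
        # aggregate each edge's information over time, keeping the edge
        # dimension unchanged.
        self.partial = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=(3, 1), padding=(1, 0)),
            nn.ReLU(),
        )
        # Co-occurrence feature learning: once edges are moved into the
        # channel axis, these convolutions mix features across edges.
        self.cooccur = nn.Sequential(
            nn.Conv2d(n_edges, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, T, n_edges) pseudoimage as in Fig. 2.
        y = self.partial(x)          # (batch, mid_ch, T, n_edges)
        y = y.permute(0, 3, 2, 1)    # swap edge and channel axes
        return self.cooccur(y)       # (batch, out_ch, T, mid_ch)
```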

Feature fusion
This section describes the feature fusion method. For both the edge-level information and the motion information, feature extraction is performed using the partial and co-occurrence feature learning unit (PCFLU). For the motion information, since the motion state at any nonzero moment can be represented by the translation and rotation information of the previous moment, we propose fusing the translation and rotation information to portray the action. Finally, the edge-level information is integrated using a convolution unit (CU) to obtain the final recognition result. Specifically:

Fig. 7 Convolution unit
After PCFLU feature extraction, four types of deep features (edge, edge_motion, translation, and rotation) are obtained. In our definition of the distance and angle information, the set of edges involved in an action always yields distance information but not necessarily angle information (e.g., a set of edges that performs purely translational movement produces only distance variations), so there is a risk of confusion between similar actions if the translational and rotational feature information is used without further processing. To reduce this effect, we fuse the translation and rotation information to obtain a new branch of information as the feature input to the subsequent CU.
As shown in Fig. 7, the CU contains two convolution layers, which are used to extract deep features from the edge, edge_motion, and fused translation-rotation information. Finally, the feature map is flattened into a vector that passes through two fully connected layers for the final classification. In particular, for interactive actions involving multiple persons, we follow Li et al. and adopt an elementwise max scheme over the features of the multiple persons at the front of the CU.
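A minimal sketch of this stage follows; the elementwise-max fusion and the two-layer CU structure come from the text, while the channel widths are assumed:

```python
import torch
import torch.nn as nn

def fuse_motion(trans_feat: torch.Tensor, rotat_feat: torch.Tensor) -> torch.Tensor:
    # Merge translation and rotation feature maps of equal shape by
    # elementwise max (the fusion scheme Table 4 finds best).
    return torch.maximum(trans_feat, rotat_feat)

class CU(nn.Module):
    """Convolution unit (Fig. 7): two convolution layers; widths are guesses."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2 * ch, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)
```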

Algorithm process
Our algorithm flow is shown in Algorithm 1. Initialize denotes the initialization operations, such as cropping and filling of the raw data. VectorConstruct constructs the edge information, and VectorDiff computes the difference of the constructed edge vectors in time. TranslationCount and Rotation compute the translation information and rotation information, respectively. E_in and E_out represent the edge vector information in the inward and outward directions, respectively, and D, A, and F represent the distance information, the angle information, and their fused feature information, respectively.
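The following hedged NumPy sketch mirrors this flow; the outward direction is assumed to be the sign-flipped inward edge vector, D_SET is an illustrative subset of Table 1, and the anchor-point and fusion details are our reading of the motion and fusion sections:

```python
import numpy as np

D_SET = [(20, 4), (4, 5), (5, 6), (6, 7)]  # illustrative subset of Table 1

def vector_construct(J, direction="inward"):
    sign = 1.0 if direction == "inward" else -1.0  # outward = reversed edge
    return sign * np.stack([J[:, n] - J[:, m] for (m, n) in D_SET], axis=1)

def dg2scnn_features(J):
    """J: (T, 25, 3) skeleton sequence after Initialize (crop / fill)."""
    E_in = vector_construct(J, "inward")            # edge information
    E_out = vector_construct(J, "outward")
    E_M = E_in[1:] - E_in[:-1]                      # VectorDiff: edge_motion
    # TranslationCount: distance moved by each edge's anchor joint per frame.
    anchors = [m for (m, _) in D_SET]
    D = np.linalg.norm(J[1:, anchors] - J[:-1, anchors], axis=-1)
    # Rotation: angle swept by each outward edge between adjacent frames.
    cos = (E_out[1:] * E_out[:-1]).sum(-1) / (
        np.linalg.norm(E_out[1:], axis=-1)
        * np.linalg.norm(E_out[:-1], axis=-1) + 1e-8)
    A = np.arccos(np.clip(cos, -1.0, 1.0))
    F = np.maximum(D, A)                            # fused motion feature
    return E_in, E_M, F                             # inputs to the two streams
```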

Experiments
We verified the effectiveness of the DG-2sCNN model on the NTU-RGB+D 60 and NTU-RGB+D 120 (Liu et al. 2019) benchmark datasets. To explore the impact of the constructed components on the model, ablation studies on the NTU-RGB+D 60 dataset are conducted, and the results of different fusion structures are compared and analyzed.

Algorithm 1 Main Framework of DG-2sCNN. Input: S (the raw skeleton sequences). Output: class (the final result)

Datasets and implementation details
The NTU-RGB+D 60 dataset, an existing large-scale indoor dataset for human action recognition, was created by the Rose Lab at Nanyang Technological University and contains RGB video, depth map sequences, 3D skeleton data, and infrared video data. It was collected using three Kinect 2.0 cameras located in the −45°, 0°, and 45° directions, with a total of 40 subjects performing 60 actions for a total sample size of 56,880, and it includes both cross-subject (C-Sub) and cross-view (C-View) evaluation settings. Specifically, in the C-Sub setting, the samples of half of the subjects are used for the training set, and the samples of the other half are used as the test set; in the C-View setting, the samples collected by cameras 2 and 3 are used as the training set, and the samples collected by camera 1 are used as the test set.
The NTU-RGB+D 120 dataset (Liu et al. 2019) is the largest available skeleton dataset for human action recognition and is an extension of the NTU-RGB+D 60 dataset. It contains 114,480 sequences of 106 subjects performing 120 action categories. The skeleton information includes the 3D positions of 25 joints per frame. Two standard evaluations are given: cross-subject (C-Sub) evaluation and cross-setup (C-Set) evaluation. For the C-Sub evaluation, the 106 subjects are divided into a training set and a test set of 53 subjects each. For the C-Set evaluation, samples with an even setup ID are selected for training, and samples with an odd setup ID are selected for testing.
The proposed model is based on the PyTorch framework [39], and the Adam optimizer was selected for training for 600 epochs. The learning rate was initially set to 0.001 and decayed by a factor of 0.1 at epochs 300, 400, and 550. The weight decay was set to 0.0001, and the batch size was set to 64 for both the NTU-RGB+D 60 and NTU-RGB+D 120 datasets.

Table note: inward and outward denote the edge directions toward and away from the base of the spine, respectively
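A sketch of this training configuration in PyTorch (the placeholder model and the loop body are ours; only the hyperparameters come from the text):

```python
import torch

model = torch.nn.Linear(3, 60)  # placeholder standing in for the DG-2sCNN
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300, 400, 550], gamma=0.1)  # decay lr by 0.1

for epoch in range(600):  # 600 training epochs, batch size 64
    # ... one training epoch over mini-batches of 64 samples ...
    scheduler.step()
```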

Comparison of joint-level information and edge-level information
To verify the effectiveness of our proposed method based on edge-level features (edge and edge_motion) without making any other improvements, comparisons were made with a method based on joint-level features (joint and joint_motion) proposed in Ref. However, in reproducing that method, its performance was slightly different from the original text. The experimental results in Table 2 show that whether the inward or the outward direction of the edge-level information is used, the result is comparable to that of the joint-level method, with improvements of 1.1% and 0.9% under the C-Sub and C-View settings, respectively, which indicates the feasibility of selecting edge-level information for feature extraction. At the same time, the information in the inward direction is found to be better than that in the outward direction, so we choose the edge vector information in the inward direction for all subsequent experiments.

Impact of adding motion information (translation and rotation)
Sect. 3.3 introduces the motion information. Here, we verify the validity of describing motion with translation (trans) and rotation (rotat) by separately adding the distance and angle variations to the model in Sect. 4.2. The results in Table 3 show that, compared with the model using only the edge stream, adding translation information improves the results by 1.8% and 0.9% under the C-Sub and C-View settings, respectively, and adding rotation information improves the results by 1.5% and 0.4%, respectively. This shows that the edge stream combined with the motion stream provides feature enhancement that better represents human action.

Exploration of methods to integrate motion characteristics
For the feature fusion module introduced in Sect. 3.5, as shown in Table 4, we tried three combinations of the translation and rotation information: maximum, concatenation (cat), and mean. The experimental results show that maximum fusion outperforms mean and cat fusion, which may be because the cat and mean approaches use both the translational and angular information directly and are prone to producing zero values in the subsequent convolutional operations, while maximum fusion better integrates the translational and rotational information to characterize human action features. We also added a time-efficiency analysis for each module, as shown in Table 5. Compared with the baseline method, our model has slightly lower time efficiency with improved accuracy.
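For concreteness, the three variants compared in Table 4 correspond to the following tensor operations (shapes are illustrative):

```python
import torch

trans, rotat = torch.randn(2, 64, 32, 24), torch.randn(2, 64, 32, 24)
fused_max = torch.maximum(trans, rotat)        # best performer in Table 4
fused_cat = torch.cat([trans, rotat], dim=1)   # doubles the channel count
fused_mean = (trans + rotat) / 2
```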

Comparison with state-of-the-art methods
The proposed DG-2sCNN method is compared with other state-of-the-art methods on the NTU-RGB+D 60 and NTU-RGB+D 120 (Liu et al. 2019) datasets. As in Sect. 4.2, the reproduced baseline performs slightly differently from the original text; here, we call it the RHCN. The recognition performance on the NTU-RGB+D 60 dataset is shown in Table 5. "ElAtt-GRU (2018)" and "ST-JDMs (2021)" are two representative RNN-based and CNN-based methods, respectively. The DG-2sCNN method outperforms them by 6.4% and 6.9% in accuracy under the C-Sub setting, respectively. To better explore the structural information of the skeleton, some methods (Si et al. 2020a, b) combine an RNN and a GCN or a CNN and an RNN. Our proposed DG-2sCNN method is also 2.3% and 7.9% more accurate than (Si et al. 2020a) and (Si et al. 2020b), respectively, under the C-Sub setting. Compared with the STA-CNN method, the DG-2sCNN method achieves comparable results under the C-Sub setting and slightly lower results under the C-View setting, which may be due to the better stability of the STA-CNN model across different views.
A comparison of our proposed method with state-of-the-art methods was also performed on the NTU-RGB+D 120 dataset, and the results are given in Table 6. To further analyze our model, we computed a confusion matrix under the C-Sub setting on the NTU-RGB+D 60 dataset, as shown in Fig. 8. From the figure, we can see that the misclassifications of our model occur mainly between extremely similar actions. For example, "reading (11)" and "writing (12)" are two similar actions, and "putting on shoes (16)" and "taking off shoes (17)" have extremely similar skeleton sequences. These errors may stem from the fact that only two joints on the fingers are detected in the NTU-RGB+D dataset. We also visualize the loss curves of model training and testing, as shown in Fig. 9; the results show that the model converges well across several runs, which demonstrates the effectiveness of the model.

Conclusion
In this work, we have presented a novel direction-guided two-stream neural network for skeleton-based human action recognition. Considering the importance of bone direction for action recognition, we explicitly introduced edge-level information, i.e., edge and edge_motion, as part of the network input.
Since motion is directional, we design distance and angle variations in different directions to portray the translational and rotational variations of human motion, respectively, which we call motion information. In addition, we propose describing human motion as translation together with rotation and design a feature fusion method that combines the translation and rotation information. On the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, our model achieves excellent performance compared with other advanced methods. Owing to its lightweight modeling and the ease of building deep networks, we chose a CNN architecture to solve the action recognition problem. Considering that our direction-guided information is based on edges, a GCN framework that models the edge graph structure may also perform well, which we will try in subsequent work. In addition, our architecture uses a two-stream structure for the edge-level information and the motion information. Considering the connection between motion and edge-level information, exchanging information between the two streams may further enhance the features, which will also be explored in future work.