Action feature extraction based on transfer learning
Transfer learning is a widely used method in the field of DL. It refers to transferring knowledge between two different domains: the knowledge learned in the source domain is used to help the learning task in the target domain. The target domain often has only a small number of labeled samples, and the knowledge transferred from the source domain allows it to achieve good learning results despite this scarcity [12]. There are generally four approaches to transfer learning, as shown in Fig. 1.
Transfer learning has been applied very successfully in image classification. Many image classification models are pre-trained on the ImageNet dataset, and these pre-trained models can be transferred to a target task, where they generally achieve high accuracy on other image datasets [13]. Here, an Inflated 3D Convolutional Network (I3D) is used as the basic feature extraction network. First, the I3D network is pre-trained on the large-scale Kinetics video dataset. Kinetics covers 400 action classes, so the classification output of the pre-trained I3D network must be changed slightly before the model is used on the action recognition dataset (UCF-101), which has 101 classes. In addition, Kinetics is much larger than UCF-101, so transfer learning improves the model’s generalization. After transfer learning, the I3D network converges quickly, which greatly reduces computation and training time. The features generated by different layers of a DNN model transfer with different effectiveness. The high-level abstract features generated by the later network layers are suitable for transfer learning. Therefore, the features generated by the last layer of the Inception module are selected as input to the subsequent time series modeling network [14].
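The change from a 400-class output to a 101-class output can be sketched as follows. This is a minimal NumPy illustration rather than the I3D implementation: the backbone is a stand-in random projection, and the feature width of 1024, the input size of 2048, the helper names, and the random clips and labels are all assumptions made for the example. Only the class counts come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 1024      # assumed width of the pooled backbone features
N_KINETICS = 400     # classes in the pre-training task (from the text)
N_UCF101 = 101       # classes in the target task

# Stand-ins for pre-trained weights: a frozen backbone and its 400-way head.
backbone_w = rng.normal(size=(2048, FEAT_DIM)) * 0.01   # hypothetical shape
kinetics_head = rng.normal(size=(FEAT_DIM, N_KINETICS)) * 0.01

def extract_features(clips):
    """Frozen backbone: maps flattened clips to high-level features."""
    return np.maximum(clips @ backbone_w, 0.0)   # ReLU stand-in

def new_head(n_classes):
    """Fresh classification head that replaces the 400-way one."""
    return np.zeros((FEAT_DIM, n_classes))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy target data: 8 flattened clips with placeholder labels.
clips = rng.normal(size=(8, 2048))
labels = rng.integers(0, N_UCF101, size=8)

feats = extract_features(clips)          # backbone weights are not updated
kinetics_logits = feats @ kinetics_head  # 400 outputs: wrong size for UCF-101

ucf_head = new_head(N_UCF101)
probs = softmax(feats @ ucf_head)

# One gradient step of softmax regression, applied to the new head only.
onehot = np.eye(N_UCF101)[labels]
grad = feats.T @ (probs - onehot) / len(clips)
ucf_head -= 0.01 * grad
```

Only `ucf_head` is updated here, which mirrors the idea that the pre-trained layers supply reusable high-level features while the task-specific output is relearned on the smaller dataset.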
Unsupervised human action transfer methods
Researchers have recently proposed an unsupervised human action transfer method, which provides a new idea for modeling action sequence data. Its action redirection network can be trained end-to-end on unlabeled web data in a 2D keypoint space. The researchers designed a new invariance-based loss function that enables the network to decouple action feature representations without supervision. Applied to the human action transfer task, this action redirection network and invariance-based loss function outperform the previous state-of-the-art methods in both qualitative and quantitative evaluations, especially on complex real-world actions. The cost of obtaining human motion information has recently fallen dramatically with the popularity of mobile computing and the application of DL in computer vision [15]. To deal with the large differences in structure and perspective between the basketball player’s footwork movement video and the target movement video, the action transfer process is divided into three stages, as shown in Fig. 2.
The original and target motion videos have large structural and perspective differences, so it is difficult to establish the source-target mapping at the pixel level. The accuracy of traditional action transfer methods is especially low when the initial object performs complex actions or when the structural difference between the initial object and the target object is relatively large. The three stages of the action transfer process are human keypoint detection, action redirection, and video rendering. Decomposing the task in this way makes it possible to focus only on action redirection, whose input and output are both 2D human keypoint sequences [16]. Figure 3 displays the overall framework of action transfer.
Paired action and character data, which would provide effective supervision signals for human action transfer tasks, are generally hard to find in the real world. Human motion is also highly nonlinear, so it is difficult to establish accurate models and parameters that characterize the process of human action transfer. To deal with these difficulties, the invariance of three factors in human motion data is exploited. The first is motion, which refers to the semantic information of the movement of various body parts. The second is structure, which refers to the proportions of the body. The third is view, which refers to the relative orientation of the body and the camera. In theory, the overall motion can be reconstructed from these three pieces of information. The three factors are independent of each other, and each is invariant to perturbations of the other two [17]. Specifically, they have the properties shown in Fig. 4.
In training, rotation of the 3D human body serves as the perturbation of view information, and limb scaling serves as the perturbation of structural information. Motion information does not need to be perturbed explicitly, as it changes over time. Under these perturbations, the features encoded by the network are required to have the invariances mentioned above, from which a series of completely unsupervised loss functions can be derived. By training an auto-encoder, the human keypoint sequence information is decoupled into three mutually orthogonal components: motion, structure, and view [18].
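The two explicit perturbations can be illustrated with a small NumPy sketch. The four-joint skeleton, the function names, the single uniform scaling factor, and the orthographic projection are assumptions chosen for the example. The encoder and the invariance losses themselves are not shown.

```python
import numpy as np

# Hypothetical 4-joint chain (hip -> knee -> ankle -> toe) in 3D, one frame.
joints = np.array([[0.0, 1.0, 0.0],
                   [0.1, 0.5, 0.0],
                   [0.1, 0.0, 0.1],
                   [0.3, 0.0, 0.1]])
parents = [-1, 0, 1, 2]   # parent index of each joint; -1 marks the root

def rotate_view(j3d, yaw):
    """View perturbation: rotate the body about the vertical (y) axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return j3d @ rot.T

def scale_limbs(j3d, parents, factor):
    """Structure perturbation: rescale every bone while keeping its direction."""
    out = np.zeros_like(j3d)
    for j, p in enumerate(parents):   # parents come before children here
        if p < 0:
            out[j] = j3d[j]
        else:
            out[j] = out[p] + factor * (j3d[j] - j3d[p])
    return out

def project_2d(j3d):
    """Orthographic projection into the 2D keypoint space (drop depth)."""
    return j3d[:, :2]

def bone_lengths(j3d, parents):
    return np.array([np.linalg.norm(j3d[j] - j3d[p])
                     for j, p in enumerate(parents) if p >= 0])

rotated = rotate_view(joints, np.pi / 6)      # view changes, structure fixed
scaled = scale_limbs(joints, parents, 1.2)    # structure changes, view fixed
keypoints_2d = project_2d(scale_limbs(rotated, parents, 1.2))
```

Rotation leaves every bone length unchanged, while limb scaling leaves every bone direction unchanged. This is why each perturbation can be used to test invariance of the other factors.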
Multi-source isomorphic transfer method based on graph convolution
The multi-source isomorphic transfer method based on graph convolution addresses the transfer of graph-structured data from multiple source domains to a target domain whose data is unlabeled, which is an unsupervised transfer learning problem. The goal is to mine the spatial features in the graph structure while reducing the distribution difference between the source and target domains, thereby solving the adaptation problem between them. Finally, the labeled source-domain data is used to classify the target domain [19]. Figure 5 shows the overall framework of the transfer model.
The idea of the graph CNN model is derived from the spectral decomposition of the graph Laplacian matrix, that is, its eigendecomposition. Laplacian matrices are constructed from graph structures and are often used in graph theory. A Laplacian matrix can be regarded as a linear transformation that plays the same role as the Laplacian operator in mathematical analysis, so it is also called the Laplacian operator or the discrete Laplacian operator [20]. The Laplacian matrix is described below by taking Fig. 6 as an example.
It is assumed that graph G has N nodes, and the function defined on it is an N-dimensional vector \(f=({f}_{1},\dots ,{f}_{i},\dots ,{f}_{N})\), where \({f}_{i}\) is the function value at node i. Suppose a perturbation is applied at node i so that it may become any adjacent node j, \(j\in {N}_{i}\), where \({N}_{i}\) represents the set of nodes adjacent to node i. Then, the gain brought to node i by the change to any node j is expressed as:
$$\varDelta {f}_{i}=\sum _{j\in {N}_{i}}\left({f}_{i}-{f}_{j}\right) \tag{1}$$
Suppose the edge between node i and node j has weight \({w}_{ij}\), with \({w}_{ij}=0\) when there is no edge between them. The sum can then run over all N nodes, and after substituting the edge weights, Eq. (1) becomes:
$$\varDelta {f}_{i}=\sum _{j=1}^{N}{w}_{ij}\left({f}_{i}-{f}_{j}\right) \tag{2}$$
Expanding Eq. (2) gives Eq. (3):
$$\begin{aligned}\varDelta {f}_{i}&=\sum _{j=1}^{N}{w}_{ij}{f}_{i}-\sum _{j=1}^{N}{w}_{ij}{f}_{j}\\ &={d}_{i}{f}_{i}-{w}_{i:}f\end{aligned} \tag{3}$$
In Eq. (3), \({d}_{i}=\sum _{j}{w}_{ij}\) represents the degree of vertex i, and \({w}_{i:}\) is the i-th row of the weight matrix W. Generalizing to all nodes gives the change gain, which is expressed as:
$$\varDelta f=\left(\begin{array}{c}\varDelta {f}_{1}\\ \vdots \\ \varDelta {f}_{N}\end{array}\right)=\left(\begin{array}{c}{d}_{1}{f}_{1}-{w}_{1:}f\\ \vdots \\ {d}_{N}{f}_{N}-{w}_{N:}f\end{array}\right)=Df-Wf=(D-W)f \tag{4}$$
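As a numerical check of Eqs. (2) to (4), the following NumPy sketch evaluates the node-wise gain with explicit loops and compares it with the matrix product \((D-W)f\). The 4-node weight matrix and the signal values are hypothetical and are not taken from Fig. 6.

```python
import numpy as np

# Hypothetical symmetric weight matrix for a 4-node undirected graph;
# W[i, j] = 0 means nodes i and j share no edge.
W = np.array([[0.0, 1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 3.0],
              [0.0, 0.0, 3.0, 0.0]])
f = np.array([1.0, 4.0, 2.0, 0.5])   # arbitrary signal, one value per node

D = np.diag(W.sum(axis=1))           # degree matrix, d_i = sum_j w_ij
L = D - W                            # the matrix D - W from Eq. (4)

# Eq. (2) evaluated node by node with explicit loops.
n = len(f)
gain = np.array([sum(W[i, j] * (f[i] - f[j]) for j in range(n))
                 for i in range(n)])

delta_f = L @ f                      # matrix form of the same quantity
```

Both routes give the same vector. Every row of `L` sums to zero, so a constant signal produces zero gain at every node.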
In Eq. (4), D-W is the Laplacian matrix, denoted as L. Laplacian matrices are often used in spectral clustering algorithms. First, the similarity matrix W between data points is defined using the k-nearest neighbor algorithm according to the distance between them. Then, the Laplacian matrix L is obtained from the similarity matrix, and spectral decomposition is performed. Finally, the eigenvectors obtained from the spectral decomposition are used to cluster the original data points. D is the diagonal degree matrix and is therefore symmetric. W represents the graph adjacency matrix, which is also symmetric for an undirected graph. The Laplacian matrix (D-W) is therefore symmetric, and it is positive semi-definite because \({f}^{T}Lf=\frac{1}{2}\sum _{i,j}{w}_{ij}{({f}_{i}-{f}_{j})}^{2}\ge 0\) when all weights are non-negative. It therefore admits an eigendecomposition, called spectral decomposition, which is expressed as follows:
$$L=U\left(\begin{array}{ccc}{\lambda }_{1}& & \\ & \ddots & \\ & & {\lambda }_{N}\end{array}\right){U}^{T} \tag{5}$$
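The decomposition in Eq. (5) and its use for clustering can be tried on a small example. In the sketch below, the six-node graph with two tightly connected groups joined by one weak edge is an assumption. Splitting the nodes by the sign of the second eigenvector is only the simplest two-cluster variant of spectral clustering, and no k-nearest-neighbor step is included.

```python
import numpy as np

# Hypothetical graph: groups {0, 1, 2} and {3, 4, 5} joined by one weak edge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W

# eigh targets symmetric matrices and returns eigenvalues in ascending order.
eigvals, U = np.linalg.eigh(L)

reconstructed = U @ np.diag(eigvals) @ U.T   # right-hand side of Eq. (5)

# Eigenvector of the second-smallest eigenvalue: its sign pattern
# separates the two groups.
second_vec = U[:, 1]
labels = (second_vec > 0).astype(int)
```

All eigenvalues are non-negative and the smallest is zero, in line with positive semi-definiteness. Which group receives label 0 depends on the arbitrary sign of the eigenvector.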
In Eq. (5), U is the matrix whose columns are the eigenvectors of L, and \({\lambda }_{i}\) is the eigenvalue corresponding to the i-th column \({u}_{i}\). This decomposition corresponds to the traditional Fourier transform: each eigenvalue plays the role of a frequency, and each eigenvector plays the role of a basis function. Therefore, the Fourier transform of f at the eigenvalue \({\lambda }_{i}\) is the inner product of f with the corresponding eigenvector, \(\widehat{f}\left({\lambda }_{i}\right)={u}_{i}^{T}f\). So, the matrix form of the Fourier transform of f is obtained as:
$$\widehat{f}=\left(\begin{array}{c}\widehat{f}\left({\lambda }_{1}\right)\\ \vdots \\ \widehat{f}\left({\lambda }_{N}\right)\end{array}\right)={U}^{T}f \tag{6}$$
Similarly, the inverse Fourier transform recovers f by summing over all eigenvalues \({\lambda }_{i}\), weighting each eigenvector by its Fourier coefficient, which is expressed as:
$$f=\left(\begin{array}{c}{f}_{1}\\ \vdots \\ {f}_{N}\end{array}\right)=U\widehat{f} \tag{7}$$
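Eqs. (6) and (7) form a transform pair, which can be checked with a short NumPy sketch. The 4-node path graph, the signal values, and the function names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 4-node path graph 0 - 1 - 2 - 3 with unit edge weights.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
eigvals, U = np.linalg.eigh(L)       # columns of U are the eigenvectors

def graph_fourier(f):
    """Eq. (6): project the signal onto every eigenvector."""
    return U.T @ f

def inverse_graph_fourier(f_hat):
    """Eq. (7): rebuild the signal as a weighted sum of eigenvectors."""
    return U @ f_hat

f = np.array([3.0, 1.0, 4.0, 1.0])   # arbitrary node signal
f_hat = graph_fourier(f)
f_back = inverse_graph_fourier(f_hat)

# A constant signal is concentrated entirely at the zero eigenvalue.
const_hat = graph_fourier(np.ones(4))
```

Because U is orthogonal, the round trip returns the original signal and the transform preserves its Euclidean norm.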
The ultimate goal of the graph CNN model is to introduce learnable parameters. Therefore, a convolution kernel h is defined. According to the convolution theorem, the Fourier transform of the convolution of two functions is equal to the product of their Fourier transforms. The graph Fourier transform of the convolution of h and f is therefore the element-wise product of \(\widehat{h}\) and \(\widehat{f}={U}^{T}f\). Multiplying this product by U, which applies the inverse transform, gives the convolution of h and f in the original domain, which is expressed as:
$${(f\ast h)}_{G}=U\left(\begin{array}{ccc}\widehat{h}\left({\lambda }_{1}\right)& & \\ & \ddots & \\ & & \widehat{h}\left({\lambda }_{N}\right)\end{array}\right){U}^{T}f \tag{8}$$
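A minimal sketch of Eq. (8) follows. The path graph, the signal, and the choice of filter are assumptions: the spectral response \(\widehat{h}(\lambda )={e}^{-t\lambda }\) is a fixed low-pass filter used only to make the effect visible, whereas in a graph CNN the values of \(\widehat{h}\) would be the parameters to learn.

```python
import numpy as np

# Hypothetical 4-node path graph 0 - 1 - 2 - 3 with unit edge weights.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
eigvals, U = np.linalg.eigh(L)

def spectral_conv(f, h_hat):
    """Eq. (8): transform f, scale each frequency by h_hat, transform back."""
    return U @ np.diag(h_hat) @ U.T @ f

f = np.array([3.0, 1.0, 4.0, 1.0])   # arbitrary node signal

# Assumed filter for illustration: h_hat(lambda) = exp(-t * lambda).
t = 0.5
h_hat = np.exp(-t * eigvals)
smoothed = spectral_conv(f, h_hat)

# An all-ones response leaves the signal unchanged.
identity_out = spectral_conv(f, np.ones_like(eigvals))
```

The filter keeps the zero-frequency component and attenuates the others, so the sum of the signal is preserved while its variation across nodes shrinks.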
In this way, a multi-source isomorphic transfer algorithm based on graph convolution is obtained. Graph convolution can be viewed from two perspectives. The spectral perspective uses the theory of spectral decomposition to perform the convolution operation, whereas the spatial perspective performs the convolution over a node’s neighbors. The derivation above presents graph convolution from the spectral perspective.