ML-WiGR: a meta-learning-based approach for cross-domain device-free gesture recognition

Accurate sensing and understanding of gestures can improve the quality of human–computer interaction and has great theoretical significance and application potential in fields such as smart home, assisted medical care and virtual reality. WiFi channel state information (CSI)-based device-free gesture recognition requires no wearable sensors and offers a series of advantages, such as applicability to non-line-of-sight scenarios, low cost, preservation of personal privacy and the ability to work in the dark. Although most current WiFi CSI-based gesture recognition approaches achieve good performance, they struggle to adapt to new domains. Therefore, this paper proposes ML-WiGR, a novel approach for device-free gesture recognition in cross-domain applications. ML-WiGR applies convolutional neural networks (CNN) and long short-term memory (LSTM) neural networks as the basic model for gesture recognition to extract spatial and temporal features. Combined with a meta-learning training mechanism, ML-WiGR adaptively adjusts the learning rate and meta-learning rate during training and optimizes the initial parameters of the basic model, so that only a few samples and several iterations are needed to adapt to a new domain. In the experiments, the approach is tested under a variety of scenarios. The results show that ML-WiGR achieves comparable performance against existing approaches with only a small number of training samples in cross-domain settings.


Introduction
In recent years, with the development of artificial intelligence and computer technology, as well as people's increasing attention to quality of life, device-free wireless sensing has gradually become an important research topic in intelligent applications. Device-free wireless sensing utilizes wireless radio frequency signals to sense targets without requiring them to carry any device. Compared with camera-based (Neethu et al. 2020), wearable (Tramontano et al. 2019), ultrasound (Chen et al. 2017) and other technologies (Qin et al. 2019), it has a series of advantages, such as applicability to non-line-of-sight scenarios, low cost, preservation of personal privacy and the ability to work in the dark.
WiFi is a common radio frequency technology and can readily be employed for intelligent sensing. For example, Halperin et al. released the CSI Tool (Halperin et al. 2011) to extract WiFi channel state information (CSI) from commercial network interface cards, which greatly facilitates the acquisition of CSI from commercial WiFi devices. WiFi-based sensing with CSI has been extensively studied for various applications, including sleep detection (Gu et al. 2019), fall detection (Ding and Wang 2020), gesture recognition (Tang et al. 2021), WiFi imaging (Adib et al. 2015), crowd detection (Xi et al. 2014), lip recognition, daily behavior detection, breathing and heartbeat detection (Abdelnasser et al. 2015; Liu et al. 2015), gait recognition, indoor localization (Zhang et al. 2020a, b) and trajectory tracking.

Fig. 1 A scenario in which a gesture affects the received WiFi signal
Accurate sensing and understanding of gestures can improve the quality of human–computer interaction and has great theoretical significance and application potential in fields such as smart home, assisted medical care and virtual reality. As shown in Fig. 1, when a human hand moves in the sensing area of WiFi, it disturbs the normal propagation of the signal, producing phenomena such as reflection, refraction, diffraction and scattering. By analyzing the changes in the received WiFi signal, the state of the human hand can be inferred.
We collectively refer to the factors related to gesture recognition as domains, and we speak of crossing domains when these factors change. Although many WiFi-based gesture recognition systems have performed well in a fixed domain, such as WiGest (Abdelnasser et al. 2015), WiG (He et al. 2015), WiGeR (Al-qaness and Li 2016), WiFinger (Li et al. 2016), Smokey (Zheng et al. 2016) and WiMU (Venkatnarayan et al. 2018), they struggle to adapt to new domains when the person or environment changes. The performance of a classifier trained on signal features from one domain usually drops sharply in another domain.
At present, some works address cross-domain problems. Widar3.0 (Zheng et al. 2019) extracts the body-coordinate velocity profile (BVP) feature from the Doppler frequency shift (DFS) of WiFi CSI and performs well with zero effort in cross-domain tasks. However, Widar3.0 needs more network links to achieve good performance, which may consume more device resources. CrossSense (Zhang et al. 2018) and EI (Jiang et al. 2018) use transfer learning and adversarial learning, respectively, to deal with cross-domain problems. However, both need to expend considerable effort collecting samples in the new domains to achieve good performance.
To improve the ability to adapt to new domains, a new approach is needed for the cross-domain problem in WiFi-based device-free wireless sensing: one that achieves good performance with fewer network links and a smaller cost of collecting samples and training models when domains change.
Meta-learning can learn general knowledge from a series of tasks in source domains that are similar to the target task. This knowledge helps new tasks iterate quickly and achieve good performance with few samples. Thus, to address the cross-domain problem, in this paper we propose ML-WiGR, a meta-learning approach for cross-domain gesture recognition with WiFi CSI. ML-WiGR mainly contains two aspects: the improved training mechanism and the model structure.
We modify the training mechanism of Reptile (Nichol et al. 2018), a first-order meta-learning algorithm, to improve its performance on cross-domain tasks. The new meta-learning strategy introduces online adaptive adjustment of the learning rate and meta-learning rate to avoid expensive hyperparameter tuning during training. ML-WiGR trains the initial parameters of the basic model so that only a small number of gradient steps and a small amount of training data from a new domain are needed to generate a model adapted to the cross-domain task. Meanwhile, we design a simple but practical neural network model for gesture recognition. It applies convolutional neural networks (CNN) and long short-term memory (LSTM) networks as a basic model to extract the spatial-temporal characteristics of the BVP extracted from the DFS of CSI, and it performs well with the improved meta-learning mechanism.
The rest of this paper is arranged as follows. Section 2 reviews the related work. Section 3 gives the description and formulation of the problem. Section 4 presents the details of the proposed ML-WiGR, including the recognition mechanism and the model structure. Section 5 validates the proposed approach with extensive experiments. Finally, Sect. 6 concludes this paper.

Device-free WiFi-based gesture recognition
WiFi-based device-free wireless sensing has attracted a great amount of interest, as it provides a ubiquitous sensing solution using the pervasive WiFi infrastructure (Wu et al. 2017). Interest has shifted from the received signal strength indicator (RSSI) (Abdelnasser et al. 2015; Zhang et al. 2020b) to CSI, which is adopted by more recent applications (Tang et al. 2021; Ding and Wang 2020). The existing work is mainly divided into two categories: pattern-based and model-based recognition solutions. Different gestures or activities exert different influences on surrounding wireless signals; however, the influence patterns are so weak and complex that many systems are designed based on machine learning. For example, WiG (He et al. 2015) uses statistical features and an SVM to recognize four gestures; its recognition rate reaches 92% on the LOS path and 88% on the NLOS path. WiFinger (Li et al. 2016) uses a series of signal processing methods and the KNN algorithm to recognize 9 American Sign Language gestures, achieving finger-level recognition with an accuracy of 90.4%. In general, although existing machine learning approaches enable WiFi-based device-free gesture recognition to perform well in a fixed domain, they have limitations in complex cross-domain scenarios, including the lack of consistent features and the need for a large number of samples.
Some works deal with cross-domain problems. Widar3.0 (Zheng et al. 2019) extracts the BVP feature from the DFS of WiFi CSI and performs well with zero effort in cross-domain tasks, but it needs more links to achieve good performance, which may consume more device resources. CrossSense (Zhang et al. 2018) and EI (Jiang et al. 2018) use transfer learning and adversarial learning, respectively, but must pay a higher cost for collecting samples in the new domain to achieve good performance. Therefore, a new approach is required that solves cross-domain issues with fewer network links and a small number of samples.

Meta-learning
In recent years, with the development of artificial intelligence, deep learning techniques have achieved remarkable successes in various scenarios, including recommender systems (Wu et al. 2019), power systems (Luo et al. 2019), autonomous vehicles (Kebria et al. 2019) and service computing (Wu et al. 2020). Despite these advances and solutions (Luo et al. 2020), ample challenges remain, for example, the large amounts of data and training needed to achieve good performance. These requirements severely constrain the ability of deep neural networks to learn new concepts quickly. Therefore, meta-learning (Sung et al. 2018; Hu et al. 2018) has been suggested as one strategy to overcome these challenges. These methods train on a series of tasks that are similar to the target task, and the meta-learner learns general knowledge over multiple epochs of task iterations. This knowledge helps the training of new tasks iterate quickly and achieve better performance.
Currently, meta-learning techniques mainly include metric-based, model-based and optimization-based approaches (Huisman et al. 2021). Metric-based meta-learning (Sung et al. 2018; Chen et al. 2020) requires no adjustment for the test tasks, but may perform badly when the distribution of the test set is far from that of the training set. Due to their flexibility, model-based methods (Kuznetsova et al. 2020) have wider applicability than most metric-based meta-learning, but perform worse on many supervised tasks. Compared with the above two, optimization-based meta-learning approaches, such as MAML (Finn et al. 2017), LEO (Anderson et al. 2018) and the meta-learner LSTM (Baydin et al. 2017), achieve better performance when the tasks are more widely distributed. However, they need to optimize the base model for each task, at the cost of more expensive computation. In order to improve performance on cross-domain device-free gesture recognition tasks, in this paper an optimization-based meta-learning approach is chosen as the learner to classify gesture categories.

Cross-domains
Although many previous systems achieve good performance under fixed conditions, the problem of crossing domains needs to be considered before deployment. Taking WiG as an example, it performs well in the source domain, but its accuracy drops significantly when it is applied to other domains directly.

Different gestures exert different influences on surrounding WiFi signals. However, the influence patterns are so weak and complex that, unless the most consistent characteristics can be found in the changing patterns of gestures across domains, performance may degrade significantly.

These problems are often caused by changes of the environment or of the person in the sensing area. In this paper, we classify cross-domain problems into single cross-domain and multiple cross-domain problems. A single cross-domain problem involves only one changing factor in the environment. Figure 2 shows the single cross-domain problems caused by the change of different single factors. In real scenes, multiple cross-domain issues are often faced, that is, multiple factors change together. Therefore, when verifying the feasibility of the approach, a variety of cross-domain influences need to be considered.

Problem description
In this paper, we hope that when facing changes of domains, only little effort is needed in collecting samples and training models to achieve good performance. Traditionally, a machine learning algorithm focuses on a specific task T. In the field of gesture recognition based on WiFi CSI, we assume that a basic task is to apply a classifier to determine the category of a gesture. The main goal of training is to establish a classifier f, where a sample x_i is the input of the model and the estimated value of the corresponding label y_i is the output. In a general supervised learning scenario, the number of labels L is limited, the number of samples K is large, and all samples are simply divided into two parts: a training set D(train) and a testing set D(test). When facing cross-domain problems, we hope to learn a good classifier in a new domain with only a small number of samples and little training cost. In this case, the number of samples K is small, and the problem can be regarded as a few-shot learning problem. Directly training on such a small number of samples causes serious overfitting.

Problem formulation
To solve the above problem, a meta-learning mechanism is introduced, and the proposed approach no longer focuses on a specific gesture classification task but instead formulates a meta-task model F. It obtains the ability to estimate a new task T_N by learning from the tasks in the task set T = {T_1, T_2, T_3, ..., T_{N−1}}. The classification task T_N is different from the other N − 1 tasks, which are in different domains.
In applying meta-learning to WiFi-based gesture recognition, we define a specific task T_i as a classification task in a certain domain, which needs to distinguish L different categories of gestures. There is a set of tasks T = {T_1, T_2, T_3, ..., T_{N−1}}, where each task T_i contains L labels of gestures. The number of samples of each gesture, k, is very small, generally k < 20, but these samples are usually easy to obtain. When the same gestures appear in a new domain, their samples constitute a new classification task T_N. At this time, for each category of gestures in the new domain, only k samples are available. The main goal of modeling is to complete the final classification task T_N by learning from L × k new samples and to obtain the classifier model f_N. In task T_N, the L × k samples constitute a support set, denoted as S = {(x_1, y_1), (x_2, y_2), ..., (x_{L×k}, y_{L×k})}, x_i ∈ R^d, y_i ∈ {0, 1, ..., L}; the samples to be classified constitute the testing set.
Since k is small, L × k samples alone are not enough to complete the task T_N. However, learning experience can be borrowed from similar tasks T_i in other domains. The L × k samples in T_i constitute a sample set, denoted as S_u = {(x_1, y_1), (x_2, y_2), ..., (x_{L×k}, y_{L×k})}, x_i ∈ R^d, y_i ∈ {0, 1, ..., L}. The labeled samples to be verified in T_i constitute the query set. Here, the sample set and query set of the task T_i simulate the support set and testing set of the final task T_N, respectively. The difference is that the sample set and query set are sampled from easily available labeled datasets.
A set of tasks T(train) is constructed, which contains a large number of tasks similar in structure to the test task. The task set T(test) contains tasks such as T_N and is used as the set of tasks for testing. From the perspective of meta-learning, T(train) and T(test) are the training set and test set of the meta-task model F, respectively. Therefore, T(train) and T(test) can be called the meta-training set and meta-testing set, respectively.
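As a minimal sketch of this task construction (the dictionary-based dataset layout, function name and defaults are illustrative assumptions, not the authors' implementation), an L-way k-shot sample/query split for one task can be built as follows:

```python
import random

def make_task(dataset, L=5, k=5, n_query=10, seed=None):
    """Build one L-way k-shot task from a dict mapping
    gesture label -> list of samples collected in one domain.
    Returns a sample (support) set of L*k pairs and a query set."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), L)       # choose L gesture categories
    support, query = [], []
    for new_label, cls in enumerate(classes):
        samples = rng.sample(dataset[cls], k + n_query)
        support += [(x, new_label) for x in samples[:k]]   # L*k labeled samples
        query += [(x, new_label) for x in samples[k:]]     # held-out samples
    return support, query

# toy usage: 8 hypothetical gesture classes with 30 placeholder samples each
toy = {c: [f"{c}_{i}" for i in range(30)] for c in "abcdefgh"}
sup, qry = make_task(toy, L=5, k=5, seed=0)
```

Sampling many such tasks from the source domains yields T(train); a task built the same way from target-domain samples plays the role of T_N in T(test).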

Approach overview
ML-WiGR is an approach for the cross-domain problem in WiFi-based gesture recognition. Figure 4 shows its overall structure. The approach uses a deep neural network model to mine the spatial and temporal characteristics of CSI and improves the training mechanism of Reptile, a first-order meta-learning method. An online adaptive adjustment strategy for the learning rate and meta-learning rate is introduced to deal with the problem that the original Reptile requires expensive hyperparameter adjustment during training. By training the initial parameters of the basic model, only a small number of samples and a few steps of gradient descent are needed to generate a new model adapted to a new task.

Training mechanism
ML-WiGR first constructs a series of tasks, each with only a small number of samples, in the source domains. The model trained on these samples is evaluated on the query set, and the resulting supervision signal is used to optimize the original network, making the model more general after learning. In this process, the optimized model from the previous task is directly used for training on the next task. After multiple iterations, the model can be evaluated on the test set. ML-WiGR adopts a first-order meta-learning algorithm to speed up training for new tasks, and overfitting to the original data is unlikely because of the stepwise optimization over a series of similar tasks. In the training process, the original Reptile is sensitive to the choice of hyperparameters. It has two important hyperparameters, the task-level learning rate α and the meta-learning rate β, which correspond to the learning rates of the inner loop and outer loop, respectively. Adding one more hyperparameter dimension to a grid search increases the computation by an order of magnitude and consumes more time and resources. In addition, good values of α and β matter more for Reptile than for algorithms based on traditional stochastic gradient descent (SGD), because only a small number of samples are available in few-shot learning problems.
In the original Reptile algorithm, given model parameters θ, SGD is used to adjust the parameters to adapt to a new task T_t:

θ_t = θ − α ∇_θ L_{T_train(t)}(f_θ),    (1)

where α is the task-level learning rate, t is the task number, and T_train(t) and T_test(t) represent the training and test sets of task t. These tasks are sampled from the defined task distribution p(T). The goal of meta-learning is

min_θ E_{T_t ∼ p(T)} [ L_{T_test(t)}(f_{θ_t}) ].    (2)

The model aims to optimize the parameters θ so that it can adapt to new tasks with only one or a few steps of stochastic gradient descent. For the optimization in (2), we have

θ ← θ − β ∇_θ L_{T_test(t)}(f_{θ_t}),    (3)

where β is the meta-learning rate. Since ∇_θ L_{T_test(t)}(f_{θ_t}) involves a second derivative, (θ − θ_t) is used to replace it in Reptile (Nichol et al. 2018), denoted as

θ ← θ − β (θ − θ_t) = θ + β (θ_t − θ).    (4)

Therefore, only the first derivative is required when calculating θ_t. However, in order to find appropriate hyperparameters for each training stage, updating rules for α and β need to be added.
First of all, our goal is to update α toward the optimal value α*_i for a single task so as to minimize the training loss L_{T_train(t)}(f_{θ_i}), where i is the number of iterations within a single task. We assume that the optimal α does not change much between iterations, so it can be estimated from α_{i−1}. Therefore, performing one step of gradient descent on the previous α_{i−1}, and noting that θ_{i−1} = θ_{i−2} − α ∇_θ L_{T_train(t)}(f_{θ_{i−2}}), the hypergradient of α is

∇_α L_{T_train(t)}(f_{θ_{i−1}}) = −∇_θ L_{T_train(t)}(f_{θ_{i−1}}) · ∇_θ L_{T_train(t)}(f_{θ_{i−2}}).    (5)

α_i is then obtained as

α_i = α_{i−1} + α_hyperlr ∇_θ L_{T_train(t)}(f_{θ_{i−1}}) · ∇_θ L_{T_train(t)}(f_{θ_{i−2}}),    (6)

where α_hyperlr is the hyper-learning rate of α. At the same time, the meta-learning rate β needs to be updated. Similarly to α, β is updated toward the optimal value β* to minimize the loss L_{T_test(t)}(f_{θ_t}) during training. It is assumed that the optimal β also changes little between iterations, so it can be estimated from β_{i−1}.
Denote the approximate meta-gradient in (4) at the i-th outer iteration as g̃_i = θ − θ_t. According to Equation (4) and the same hypergradient rule, the hypergradient of β is

∇_β L_{T_test(t)}(f_θ) = −g̃_i · g̃_{i−1}.    (7)

β_i is then obtained as

β_i = β_{i−1} + β_hyperlr g̃_i · g̃_{i−1},    (8)

where β_hyperlr is the hyper-learning rate of β. The pseudocode of ML-WiGR is given in Algorithm 1.
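The α update in (5)–(6) is an instance of hypergradient descent: the new learning rate is the old one plus the scaled dot product of two consecutive gradients. A minimal numerical sketch on a hypothetical quadratic loss (not the authors' gesture model) follows:

```python
import numpy as np

def hd_sgd(grad_fn, theta, alpha=0.05, alpha_hyperlr=1e-4, steps=50):
    """SGD whose step size alpha is itself adapted by hypergradient descent:
    alpha_i = alpha_{i-1} + alpha_hyperlr * grad_i . grad_{i-1}, as in (5)-(6)."""
    prev_grad = None
    for _ in range(steps):
        g = grad_fn(theta)
        if prev_grad is not None:
            # the hypergradient of the loss w.r.t. alpha is -g . prev_grad,
            # so one descent step on alpha ADDS alpha_hyperlr * g . prev_grad
            alpha += alpha_hyperlr * float(g @ prev_grad)
        theta = theta - alpha * g     # the usual task-level SGD step
        prev_grad = g
    return theta, alpha

# toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta0 = np.array([1.0, -2.0])
theta, alpha = hd_sgd(lambda t: t, theta0)
```

On this convex toy loss consecutive gradients stay positively correlated, so α grows slightly while θ shrinks toward the minimizer; the same bookkeeping transfers to any `grad_fn`.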

Algorithm 1 ML-WiGR
Require: distribution over tasks p(T)
Require: initial learning rates α_0, β_0
Require: hypergradient learning rates α_hyperlr, β_hyperlr
1: Randomly initialize θ
2: while not done do
3:   Sample a task T_t from p(T)
4:   for iteration i = 1, 2, ... do
5:     Compute loss L_{T_train(t)}(f_θ)
6:     Compute the gradient of α according to (5)
7:     Update α according to (6)
8:     Compute adapted parameters with gradient descent: θ_t = θ − α ∇_θ L_{T_train(t)}(f_θ)
9:   end for
10:  Compute the gradient of β according to (7)
11:  Update β according to (8)
12:  Update θ ← θ + β (θ_t − θ)
13: end while

ML-WiGR modifies the original Reptile by adaptively adjusting the learning rate α and the meta-learning rate β. The solution is based on the hypergradient descent (HD) algorithm (Baydin et al. 2017), which updates a learning rate by performing gradient descent on it alongside the original optimization step. This algorithm requires no additional gradient computations and only needs to store the gradients from the previous optimization steps.
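Algorithm 1 can be sketched end to end on a toy problem; the quadratic per-task loss, the Gaussian task distribution and the constants below are illustrative assumptions, not the gesture recognition model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def task_grad(theta, center):
    """Gradient of the toy per-task loss L_t(theta) = 0.5 * ||theta - center||^2."""
    return theta - center

def ml_wigr_train(theta, alpha=0.05, beta=0.5,
                  alpha_hyperlr=1e-5, beta_hyperlr=1e-3,
                  meta_iters=100, inner_steps=5):
    prev_meta_grad = None
    for _ in range(meta_iters):
        center = rng.normal(size=theta.shape)   # sample a task T_t ~ p(T)
        phi, prev_grad = theta.copy(), None
        for _ in range(inner_steps):            # inner loop: adapt alpha, then SGD
            g = task_grad(phi, center)
            if prev_grad is not None:
                alpha += alpha_hyperlr * float(g @ prev_grad)   # (5)-(6)
            phi = phi - alpha * g
            prev_grad = g
        meta_grad = theta - phi                 # Reptile's first-order meta-gradient
        if prev_meta_grad is not None:
            beta += beta_hyperlr * float(meta_grad @ prev_meta_grad)  # (7)-(8)
        theta = theta - beta * meta_grad        # theta <- theta + beta*(phi - theta)
        prev_meta_grad = meta_grad
    return theta, alpha, beta

theta0 = np.array([5.0, -5.0])
theta, alpha, beta = ml_wigr_train(theta0)
```

With task optima drawn around the origin, the meta-update pulls the initialization θ toward the region from which each sampled task is reachable in a few gradient steps, which is the behavior Algorithm 1 aims for.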

Recognition mechanism
ML-WiGR applies a deep neural network as the basic model to mine spatial and temporal features in gesture recognition tasks; its overall structure is shown in Fig. 4. The model consists of a convolutional neural network (CNN) and a recurrent neural network (RNN), used for spatial feature extraction and temporal modeling, respectively.

Spatial feature extraction
The BVP extracted from the DFS of CSI is used as the input of this approach. The input data of the basic model are similar to a sequence of images: each BVP describes the energy distribution over velocities in a short time interval. A CNN can identify simple patterns in the data and then compose them into more complex patterns in higher-level layers. Therefore, in order to fully understand the generated BVP data, we need to extract the spatial features from each single BVP and model the time dependence of the entire series. A simple model is designed to learn these higher-level features.

Temporal modeling
In addition to the local spatial features within each BVP, the BVP sequence also contains dynamics over time. An RNN can understand the context of the input features and model complex temporal dynamics, so LSTM layers are added after the CNN. Finally, a fully connected layer with Softmax as the activation function outputs a category vector of length L whose values represent the probability of each label.
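A minimal PyTorch sketch of such a basic model follows; the 20×20 BVP grid, the layer sizes and the class name are illustrative assumptions, since the paper does not specify exact dimensions:

```python
import torch
import torch.nn as nn

class WiGRNet(nn.Module):
    """CNN per BVP frame for spatial features, LSTM over the frame sequence
    for temporal modeling, Softmax head over L gesture labels."""
    def __init__(self, num_classes=5, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 20x20 -> 10x10
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 10x10 -> 5x5
            nn.Flatten(),                                                # 16*5*5 = 400
        )
        self.lstm = nn.LSTM(input_size=400, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, num_classes), nn.Softmax(dim=-1))

    def forward(self, x):                 # x: (batch, frames, 20, 20)
        b, t, h, w = x.shape
        feats = self.cnn(x.reshape(b * t, 1, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)         # temporal modeling over the BVP sequence
        return self.head(out[:, -1])      # classify from the last time step

model = WiGRNet()
probs = model(torch.randn(2, 30, 20, 20))   # 2 sequences of 30 BVP frames each
```

The per-frame CNN output is reshaped back into a sequence before the LSTM, so the same spatial extractor is shared across all time steps.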

Performance
This section shows the dataset and the performance of the proposed approach.

Experiment setting
In this paper, we use the Widar3.0 gesture recognition dataset (Zheng et al. 2019) of Tsinghua University to verify our proposed gesture recognition method. The dataset contains raw CSI data and extracted signal features (DFS and BVP), including about 260,000 action examples collected in 75 different scenarios (including different positions, orientations and environments). The experiments in Widar3.0 are mainly carried out in 3 indoor environments: an empty classroom equipped with tables and chairs, a spacious hall and an office equipped with sofas, tables and other furniture.
The L-way K-shot cross-domain problem is defined as follows: select L categories from known gestures in the source domains, provide the model with K instances of each of the L labels, and evaluate the model's ability to classify new instances of these L categories in the target domain. In the experiments, we consider a 5-way 5-shot scenario, using 5 samples per category to quickly learn a 5-label gesture classifier with only 10 steps of gradient descent.
We implement the proposed approach with PyTorch and run it on a computer with Intel Xeon E5-1620 CPU, NVIDIA GeForce GTX 1080 Ti GPU and 16GB RAM. We extracted the BVP from DFS of four links.

Evaluation for crossing domains
Crossing domains is often caused by changes in factors such as environments, locations, orientations and persons. Specifically, a single cross-domain problem is caused by only one changing factor, while a multiple cross-domain problem is caused by several. In order to evaluate the feasibility of the proposed approach in detail, we decompose cross-domain issues into crossing a single domain and crossing multiple domains, and discuss them separately.

Crossing single domain
First, consider the single cross-domain problem. In the experiments, we strictly control the variables of the domains and select five categories of gestures, namely push and pull, sweep, clap, slide and draw zigzag. Specifically, for the cross-room problem, all actions come from the same person with the same orientation in the sensing area. Figures 5, 6, 7 and 8 show the recognition performance when crossing different domains with 5 shots. It can be seen that the proposed approach performs well when crossing locations and environments under the 5-way 5-shot scenario, while the performance when crossing orientations and crossing persons drops slightly.

Crossing multiple domains
Next, consider the cross-domain problem when multiple factors change simultaneously. We again select the same five categories of gestures: push and pull, sweep, clap, slide and draw zigzag. When the user, location, orientation and room all change, we try to quickly learn the 5-category gesture classifier with only 5 samples per category. Figure 9 shows the performance of the proposed approach when multiple factors change at the same time; the accuracy is slightly reduced.

Comparison of different sample sizes
Different sample sizes influence the performance of the approach. In order to verify whether the proposed approach has learned "how to learn," we again select the above five gestures and draw 1, 5 and 10 samples from each category, constituting 5-way 1-shot, 5-way 5-shot and 5-way 10-shot tasks, respectively. Table 1 shows the performance of the proposed approach under different numbers of samples. From the table, we find that as the sample size increases, the recognition accuracy also improves. The approach can still achieve good performance when the number of samples is small. However, in real scenarios, we should obtain as many samples as possible under the given constraints to improve the recognition accuracy of the model.

Compared with the state-of-the-art works
In addition to comparing the classification capability of the proposed approach under different numbers of samples, we also compare it against several state-of-the-art approaches, including Widar3.0, EI, CARM and WiG, where the first two are feasible for cross-domain recognition. Figure 10 shows the performance comparison of these approaches. For approaches lacking consistent cross-domain features, performance may degrade significantly in new domains. Although our proposed approach requires additional training data in new domains, it only needs the cost of collecting a few samples and acceptable training to obtain a satisfactory performance improvement.

Compared with different features
Although the learner has the ability to cross domains, it is also important for the input features to adapt to the new domain. Different features affect the performance of the model differently, so we need to examine which input features are most suitable. Raw CSI data and DFS are often chosen as inputs of other systems, so they are also considered as inputs to validate ML-WiGR. Figure 11 shows the results of ML-WiGR with different input features. The model performs worst with unprocessed raw CSI, which lacks cross-domain capability. Both DFS and BVP have cross-domain capability, and BVP is the more suitable input for ML-WiGR.

Compared with Reptile
As described in Sect. 2, optimization-based meta-learning can achieve better performance, but it needs to optimize the base model for each task with more expensive computation overhead. Therefore, it is necessary to improve the convergence speed of training. ML-WiGR extends the Reptile algorithm with an online adaptive adjustment strategy for the learning rate and meta-learning rate, solving the original Reptile's problem of expensive hyperparameter adjustment during training. As the comparison with Reptile in Fig. 12 shows, ML-WiGR converges faster in the training phase and thus saves computation.

Discussions
The above experiments show that although ML-WiGR requires additional training data for new domains, it only needs the cost of collecting a few samples and acceptable training to obtain a satisfactory performance improvement. Especially when combined with BVP, a feature that remains consistent across domains, ML-WiGR improves performance in cross-domain tasks even further. How the samples for cross-domain training are obtained may influence the performance of the proposed approach, which may degrade when very similar samples are chosen. Selecting samples by considering their similarities deserves further study.

Conclusions
This paper proposes ML-WiGR, a meta-learning approach for WiFi-based device-free gesture recognition. It applies CNN and LSTM deep neural networks as a basic model to extract the spatial-temporal characteristics of BVPs. We present a modified meta-learning strategy that introduces online adaptive adjustment of the learning rate and meta-learning rate to avoid expensive hyperparameter tuning during training. ML-WiGR trains the initial parameters of the basic model so that a new model adapted to cross-domain classification tasks can be generated. The experimental results show that ML-WiGR achieves superior performance against existing approaches with only a small number of training samples across domains. In this paper, a few samples are used to train the gesture recognition model across domains; future work can further enhance the performance of ML-WiGR by considering the similarities of the collected samples. Extending ML-WiGR to more complex non-line-of-sight scenarios and to multi-target gesture recognition also deserves further study.

Data availability
The data that supports the findings of this study are openly available in Widar3.0 at https://doi.org/10.1145/3307334. 3326081, reference number (Zheng et al. 2019).

Conflict of interest
The authors declare that they have no conflict of interest regarding the publication of this article.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent I consent the journal to review the paper. I inform that the manuscript has not been submitted to other journal for simultaneous consideration. The manuscript has not been published previously. The study is not split up into several parts to increase the quantity of submissions and submitted to various journals or to one journal over time. No data have been fabricated or manipulated (including images) to support my conclusions. No data, text or theories by others are presented as if they were of my own. Proper acknowledgements to other works are provided, and I use no material that is copyrighted. I consent to submit the paper, and I have contributed sufficiently to the scientific work and I am responsible and accountable for the results.