An Efficient Data Imputation Technique for Human Activity Recognition

Applications of human activity recognition now range from health monitoring systems to virtual reality, which has made the automatic recognition of daily living activities increasingly important. In recent years, many datasets have been proposed for training machine learning models to monitor and recognize these activities. However, the performance of such models degrades substantially when a dataset contains incomplete activities, i.e., captures with missing samples. In this work, we therefore propose a methodology for extrapolating the missing samples of a dataset so that human daily living activities can be recognized more reliably. The proposed method pre-processes the data captures and uses the k-Nearest Neighbors (KNN) imputation technique to extrapolate the missing samples. The extrapolated samples follow a pattern of activities similar to that found in the real dataset.


INTRODUCTION
Recent technological advances (Cardinale and Varley, 2017; Kim et al., 2016; Ni Scanaill et al., 2011; Patel et al., 2012; Sousa et al., 2015) and the presence of sensors in commonly used off-the-shelf mobile devices (Shahriyar et al., 2010; Stankevich et al., 2012; Steele, 2011; Tian et al., 2019; Ventola, 2014) enable the development of solutions that identify human daily living activities in order to monitor people's lifestyles, e.g., the creation of a Personal Digital Life Coach (Garcia, 2016). However, the accuracy of these systems and their resilience to failures are essential for recognizing activities in different environments (Dimitrievski et al., 2016a; Pires et al., 2017, 2018a; Zdravevski et al., 2015). Sportspeople, older adults, and other persons with special needs often live in conditions with poor network connectivity, yet solutions for these groups are vital to improving their quality of life (Dimitrievski et al., 2016b; Sendra et al., 2012; Seneviratne et al., 2017).
There are different types of daily activities, and most of them can be detected with inertial sensors, e.g., walking, running, moving upstairs, moving downstairs, and standing (Ferreira et al., 2020; Pires et al., 2018b, 2018c, 2018d, 2019, 2020). These are simple activities with different types of motion that can be distinguished with low-cost sensors. Combining artificial intelligence methods with the data acquired from the sensors available in off-the-shelf mobile devices may support the monitoring of older adults and the development of intelligent solutions for sports and medicine (Costa et al., 2015; Zdravevski et al., 2017; Kumar and Venkatesan, 2014).
Currently, the devices used for data acquisition may fail due to memory, battery, and processing power constraints (Pires et al., 2017, 2018a), which causes missing samples during data acquisition. As a result, activities are wrongly recognized. However, the correct recognition of activities is essential to develop solutions that support various types of people in different daily activities. Data imputation mitigates the effect of these constraints by estimating the data lost during faults. Because the proposal uses three sensors (i.e., accelerometer, magnetometer, and gyroscope), their data should have the same number of records and the same acquisition frequency so that they can be fused and compared. Before this study, we used different statistical features for the identification of activities and environments. Data imputation additionally allows different techniques to be applied to activity recognition, which may improve the accuracy of the recognition performed by our framework (Ferreira et al., 2020; Pires et al., 2018b, 2018d, 2019, 2020). The primary motivation of this paper is to increase the reliability of the previously proposed framework for the recognition of daily activities (Ferreira et al., 2020; Pires et al., 2018b, 2018d, 2019, 2020) by imputing the missing data so that the captures are complete and the daily activities are correctly identified. This paper proposes implementing the k-Nearest Neighbors (KNN) imputation algorithm to estimate the missing values in the different captures so that each capture contains the expected number of samples. The inertial sensors, including the accelerometer, magnetometer, and gyroscope, have three axes (i.e., x, y, and z), and each measurement is associated with one of these axes. The proposed method first identifies the missing data, then performs data segmentation, and, finally, imputes the missing data.
In the literature, KNN imputation is one of the most used methods to support the recognition of activities and environments. With our implementation, it is verified that the pattern of the imputed data is similar to that of the original data but with higher amplitude. This paper is organized as follows: Section 2 presents the methodology, including the definition of data acquisition, identification of missing samples, and implementation of data segmentation and imputation algorithms. The discussion and results are presented in Section 3. Finally, Section 4 presents the conclusions of the study.

METHODOLOGY
Figure 1 shows the flow diagram of the proposed methodology to extrapolate the missing samples in a human activity recognition dataset. The proposed method consists of four major stages, i.e., data acquisition (Section 2.1), missing samples identification (Section 2.2), data segmentation (Section 2.3), and data imputation (Section 2.4). These stages are explained in the subsequent sections.

Data Acquisition
Data acquisition is the first stage, on which all subsequent steps depend. In this stage, we acquire the dataset whose missing samples are to be extrapolated. We used a publicly available dataset (Pires, 2018) that includes five daily living activities, i.e., walking, running, standing, moving upstairs, and moving downstairs. The dataset was acquired using three motion sensors, i.e., accelerometer, gyroscope, and magnetometer. The five activities were performed by 25 subjects aged 20 to 60 with sedentary and active lifestyles.
The dataset contains several captures, each with 5 seconds of sensor data acquired every 5 minutes. Each activity was recorded with an Android application while the smartphone was carried in the pocket. Hundreds of hours of data were collected to implement methods for the identification of the activities. Due to memory, battery, and processing power constraints, the data acquisition sometimes fails (Pires et al., 2017, 2018a), and imputation of the missing data is needed.

Missing Samples Identification
Once the dataset (Pires, 2018) is acquired, the next step is to check whether any samples are missing in each recorded activity. Missing samples have a negative impact on the training of machine learning (ML) models, because the models cannot learn the activity patterns properly from incomplete captures. Samples are missing mainly for two reasons: the user did not perform the activity for the full defined duration, or there was an issue in the device used for recording the activity.
To identify the missing samples in the dataset, we first analyzed the duration and sampling frequency of each activity. In our case, each activity was performed for 5 seconds. The sampling frequency of the accelerometer and gyroscope was 100 Hz (i.e., 100 samples/s), while that of the magnetometer was 10 Hz (i.e., 10 samples/s). So, for a 5-second activity, there should be 5 x 100 = 500 samples for the accelerometer and for the gyroscope. On the other hand, there should be 5 x 10 = 50 samples for the magnetometer.
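The arithmetic above can be written as a small helper. This is only an illustrative sketch: the names `CAPTURE_SECONDS`, `SAMPLING_RATE_HZ`, and `expected_samples` are hypothetical and do not come from the original script; the rates and the capture duration are the ones stated in the text.

```python
# Hypothetical constants taken from the values stated in the text.
CAPTURE_SECONDS = 5
SAMPLING_RATE_HZ = {"accelerometer": 100, "gyroscope": 100, "magnetometer": 10}


def expected_samples(sensor: str, duration_s: int = CAPTURE_SECONDS) -> int:
    """Number of samples a complete capture of `sensor` should contain."""
    return duration_s * SAMPLING_RATE_HZ[sensor]


for sensor in SAMPLING_RATE_HZ:
    print(sensor, expected_samples(sensor))
```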
Once we had determined the expected number of samples for each activity across the three sensors, we wrote a Python script to identify the missing samples in each activity before performing the further steps. While identifying the missing samples, we ignored activities with missing samples covering more than 1 second, i.e., captures with more than 100 missing samples for the accelerometer or gyroscope, or more than 10 missing samples for the magnetometer. This keeps the retained captures close to the original data, rather than filling large gaps entirely with synthetic samples.
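The original script is not reproduced in the paper, so the following is a minimal sketch of the check it describes, under stated assumptions. The function names `missing_count` and `is_usable` are hypothetical, and a capture missing exactly 1 second of data is assumed to be kept, since the text only excludes captures missing more than 1 second.

```python
def missing_count(n_recorded: int, rate_hz: int, duration_s: int = 5) -> int:
    """Samples missing from a capture, given its sampling rate and duration."""
    return max(0, rate_hz * duration_s - n_recorded)


def is_usable(n_recorded: int, rate_hz: int, duration_s: int = 5,
              max_gap_s: int = 1) -> bool:
    """Keep a capture only if at most `max_gap_s` seconds of data are missing.

    Assumption: the boundary case (exactly 1 s missing) is kept.
    """
    return missing_count(n_recorded, rate_hz, duration_s) <= rate_hz * max_gap_s


# A 100 Hz capture with 450 samples is missing 50 samples and is kept.
print(missing_count(450, 100), is_usable(450, 100))
# A 10 Hz capture with 30 samples is missing 20 samples and is discarded.
print(missing_count(30, 10), is_usable(30, 10))
```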

Data Segmentation
After identifying the missing samples, we first inserted Null values in their place. We then segmented the data according to the number of missing samples. If more than 10 samples were missing, we formed a chunk of 100 samples containing 90 original samples and the first 10 Null-valued samples. If 10 or fewer samples were missing, we formed a chunk from the last 100 samples, which contains the newly inserted Null-valued samples together with original samples.
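The segmentation rule can be sketched as follows. The paper does not say where in a capture the missing samples lie, so this sketch assumes they all fall at the end of the capture; the helper names `pad_with_nulls` and `chunk_bounds` are hypothetical, and `None` stands in for the Null value.

```python
def pad_with_nulls(samples, expected):
    """Append None placeholders so the capture has the expected length.

    Assumption: all missing samples are at the end of the capture.
    """
    return list(samples) + [None] * (expected - len(samples))


def chunk_bounds(padded, chunk_size=100, max_nulls=10):
    """Return (start, stop) of the 100-sample chunk to impute next."""
    # Index of the first placeholder (or the end if nothing is missing).
    first_null = next((i for i, v in enumerate(padded) if v is None), len(padded))
    n_missing = len(padded) - first_null
    # Take at most `max_nulls` placeholders; the rest of the chunk is
    # made up of the samples that immediately precede them.
    stop = first_null + min(n_missing, max_nulls)
    start = max(0, stop - chunk_size)
    return start, stop


padded = pad_with_nulls(range(450), 500)   # 50 samples missing
start, stop = chunk_bounds(padded)
chunk = padded[start:stop]
print(start, stop, chunk.count(None))
```

With 50 missing samples the chunk holds 90 original samples followed by 10 placeholders; with 10 or fewer missing samples the same function returns the last 100 samples of the capture.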

Data Imputation
Once the data is segmented, we applied the KNN imputation algorithm (Beretta and Santaniello, 2016) to extrapolate the missing values. The KNN imputation technique is based upon the KNN algorithm. In KNN imputation, we first find the k closest neighbors of the sample with missing data and then impute the missing values from the known values of these k neighbors.
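To make the idea concrete, here is a simplified, dependency-free sketch of KNN imputation. It is not the implementation used in this work: the function `knn_impute` is hypothetical, distances are plain Euclidean distances over the columns observed in both rows (without the rescaling some libraries apply), and the example uses the sample index as the first column, which is an assumption about the feature space that the paper does not specify.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows that observe
    the same column. `rows` is a list of equal-length lists."""

    def distance(a, b):
        # Euclidean distance over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Columns: sample index, x-axis value (one value missing).
rows = [[0, 1.0], [1, 2.0], [2, None], [3, 4.0], [4, 5.0]]
print(knn_impute(rows, k=2)[2])
```

Distances are always computed on the original rows, so a value imputed for one sample does not influence the neighbors chosen for another within the same call.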
We checked the number of missing samples each time before applying the KNN imputation. If 10 or fewer samples were missing, the missing data was filled in the first iteration. However, if more than 10 samples were missing, we repeated the segmentation step to form another chunk of data and applied the KNN imputation algorithm again to extrapolate the next missing values. As shown in Figure 1, this process continues until all the missing samples are extrapolated.
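The loop described above can be sketched as follows, again under explicit assumptions: missing samples are taken to lie at the end of the capture, each iteration imputes at most 10 of them inside a 100-sample chunk, and `impute_chunk` is a hypothetical stand-in that averages the k observed values nearest in sample index rather than the exact imputer used in this work.

```python
def impute_chunk(chunk, k=5):
    """Stand-in imputer: each None becomes the mean of the k observed
    values nearest to it in sample index (assumed feature space)."""
    observed = [(i, v) for i, v in enumerate(chunk) if v is not None]
    out = list(chunk)
    for i, v in enumerate(chunk):
        if v is None:
            nearest = sorted(observed, key=lambda p: abs(p[0] - i))[:k]
            out[i] = sum(val for _, val in nearest) / len(nearest)
    return out


def impute_capture(samples, expected, chunk_size=100, batch=10, k=5):
    """Repeat segmentation and imputation until the capture is complete.

    Assumptions: the capture is non-empty and its gap is at the end.
    Returns the completed capture and the number of iterations used.
    """
    data = list(samples) + [None] * (expected - len(samples))
    filled_to = len(samples)
    iterations = 0
    while filled_to < expected:
        stop = min(filled_to + batch, expected)   # at most `batch` Nulls
        start = max(0, stop - chunk_size)         # preceded by known samples
        data[start:stop] = impute_chunk(data[start:stop], k)
        filled_to = stop
        iterations += 1
    return data, iterations


capture = [1.0] * 450                  # 50 samples missing at 100 Hz
completed, n_iter = impute_capture(capture, expected=500)
print(len(completed), n_iter)
```

In this sketch later chunks include values imputed in earlier iterations, so errors can accumulate across iterations; whether the original implementation behaves the same way is not stated in the text.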

RESULTS AND DISCUSSION
The proposed data imputation technique is applied to a publicly available dataset (Pires, 2018) for human activity recognition. Given the capture duration of 5 seconds (500 samples per file for the accelerometer and gyroscope and 50 samples per file for the magnetometer), the dataset had many missing samples. Table 1 shows the missing samples in the dataset (Pires, 2018) for the accelerometer data across each activity. In Table 1, the second column shows the total number of records available in the dataset for each activity. The subsequent columns show the number of missing samples for each corresponding activity. It can be verified that most of the missing samples are related to the moving upstairs activity. To cope with this challenge, we proposed an imputation technique for extrapolating the missing samples by following the steps described in the methodology section. Figure 2 presents an excerpt of accelerometer data for the moving downstairs activity, taken from the dataset. The selected excerpt had 50 missing samples. We first applied the proposed methodology to identify the missing samples and validate the actual missing count. The proposed method correctly found 50 missing samples in the given excerpt.
After discovering that 50 samples are missing in the given excerpt, we filled the missing samples with Null values, as shown in Figure 3, in preparation for the KNN imputation algorithm. Afterwards, we performed data segmentation, as explained in Section 2.3. Figure 4 shows how the data is segmented for the given excerpt.
Finally, we applied the KNN imputation algorithm to extrapolate the unknown samples, as explained in the previous sections. Figure 5 shows the values extrapolated for the missing samples by the KNN imputation algorithm. This technique is applied to files with fewer than 100 missing samples for the accelerometer and gyroscope and fewer than 10 missing samples for the magnetometer. If more than 100 samples are missing, the capture should be discarded. In other words, we ignored captures with more than 1 second of missing data, which keeps the retained data closer to the original than filling the gaps entirely with synthetic samples, as explained in Section 2.2. Figure 6 shows an example of an activity recorded while moving downstairs, in which the file has 375 samples. As the file should have 500 samples, 125 samples are missing. After applying KNN imputation, we obtained data with the same shape but a higher amplitude. Thus, the peaks for the moving downstairs activity are similar, as presented in Figure 7.
In future work, data classification should be implemented with different machine learning methods, which can now use the variables from our previous works, e.g., the mean, standard deviation, variation, and median of the values acquired and imputed for each axis. In this way, we can discover different patterns to improve the development of a method for the recognition of daily activities and environments.

CONCLUSION
Missing samples in a dataset capture affect the performance of machine learning models. Therefore, in this work, we proposed a methodology to extrapolate the missing samples of every dataset capture of motion sensor data recorded for human activity recognition. The proposed method first identifies the missing samples in each excerpt with respect to the duration and sampling frequency of the dataset. Secondly, it inserts Null values to fill the gap left by the missing samples. Thirdly, it performs the segmentation of the data, and finally, it applies the KNN imputation technique to extrapolate the missing samples. The proposed methodology extrapolated the missing samples with the same pattern as found in the original dataset but with a somewhat higher amplitude.
As future work, we aim to bring the amplitude of the imputed data closer to that of the real data and to train different machine learning models to better detect human activity patterns.