Materials.
In this study, we used the ETRI-Activity3D dataset, which films the daily lives of older adults from a robot's point of view to address the problems of an aging society and to learn the behavior of older adults32. The data were collected in a 102 m² apartment setting reflecting a real home environment and were constructed to elicit the natural behaviors of older adults as accurately as possible. Figure 1 shows a typical example of the video data. The subset of the ETRI-Activity3D data actually used is summarized in Table 1. Use of ETRI-Activity3D was approved on April 20, 2021, by submitting an agreement on the ETRI AI Nanum website (https://nanum.etri.re.kr/share/dhkim008/robot_environment2?lang=en_KR).
Table 1
Definition of ETRI-Activity3D data

| Items | Content |
| --- | --- |
| Total number of samples | 5,339 (train 4,091 / test 1,248) |
| Number of behavior classes | 13 |
| Number of people filmed | 20 people (10 older men, 10 older women) |
| Filming environment | 102 m² apartment living environment |
| Filming location | Bathroom, kitchen, living room, etc. |
| Used data format | RGB videos |
| FPS | 25 |
The ETRI-Activity3D dataset contains a total of 55 classes. However, in accordance with the behavior classes for intrinsic risk outlined above, the categories were defined as (1) behaviors that can be used to assess the ability to perform ADLs as a basic daily routine; (2) behaviors that may be harmful to health in the long term; and (3) behaviors that can indicate the social relationships of older adults. We therefore defined a total of 13 classes; the classes satisfying each condition are presented in Table 2. Several behavior classes were merged and redefined into a single class. The behavior of using a vacuum cleaner and that of cleaning the floor while bending forward were combined and defined as “Cleaning the room.” Similarly, reading a book and reading a newspaper were combined and defined as “Reading.” The action of making or receiving a call and the behavior of operating a smartphone were likewise combined and defined as “Using a phone.”
Table 2
Definition of behavior classes used by the developed system

| Class criteria | Behavior class | Total number of data |
| --- | --- | --- |
| (1) Behaviors that can be used to assess the ability to perform ADLs as a basic daily routine | Using a gas stove | 266 |
| | Cleaning the room | 398 |
| | Cleaning the furniture | 286 |
| | Hanging laundry | 392 |
| | Reading | 558 |
| | Using a remote | 324 |
| | Lying | 378 |
| (2) Behaviors that may be harmful to health in the long term | Eating | 507 |
| | Taking medicines | 580 |
| | Drinking | 376 |
| | Smoking | 362 |
| (3) Behaviors that can indicate the social relationships of the older adults | Talking | 330 |
| | Using a phone | 582 |
Action recognition
Action recognition technology is essential for monitoring older adults’ daily lives, as it has been widely applied to video information retrieval, daily life security, and CCTV surveillance23. The technology covers two tasks: action classification and action detection24. Action classification identifies the type of action a person is performing in a video; consequently, the input video should be a trimmed clip containing a single action. In contrast, action detection determines which action occurs at which point in an untrimmed video, based on a specific class criterion23,24. Since most videos collected in real life are unedited and contain multiple actions rather than a single specific action, action detection is essential for recognizing a person’s target actions in such videos24,25.
Action recognition methods can be broadly divided into (i) methods that use RGB image data directly26,27,34 and (ii) methods that detect actions using human skeleton coordinates derived from the RGB image data28,29. When RGB images are used directly, however, information unrelated to the action, such as background and color, can also affect detection19,21,22,34. Performing action recognition on the skeleton coordinates extracted from the RGB images overcomes this limitation by capturing only motion information, unaffected by background or lighting30,31. For this reason, Posec3d28, which performs skeleton-based action recognition, was used in this study.
Posec3d can be divided into two main stages: a pose estimator and a behavior detector. In the pose estimator, the input video is sliced into images at a fixed frame rate, and human skeleton coordinates are derived from each image. In this study, the human pose was estimated using the top-down approach, which has an advantage over the bottom-up approach in terms of accuracy36: the human body is first detected, and the skeleton coordinates are then derived. Specifically, the object detection algorithm Faster Region-based Convolutional Neural Network (Faster R-CNN)37 was used to detect a person, and the pose estimation algorithm High-Resolution Network (HRNet)35 was used to extract the skeleton coordinates of the detected person. Each set of extracted skeleton coordinates is converted into a 2D heatmap, and the heatmaps are stacked over time to construct a 3D heatmap. The behavior detector feeds this 3D heatmap into a 3D ResNet-based 3D convolutional neural network (CNN) to detect and identify the actions performed by a person in the video. Figure 2 shows the framework of Posec3d.
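To make the heatmap construction concrete, the sketch below (ours, not the authors’ implementation; the function name, the reduced heatmap resolution, and the sigma value are illustrative assumptions) renders each keypoint as a confidence-weighted 2D Gaussian and stacks the per-frame maps along the time axis into a K × T × H × W volume:

```python
import numpy as np

def keypoints_to_3d_heatmap(keypoints, scores, h=56, w=56, sigma=2.0):
    """Stack confidence-weighted 2D Gaussian keypoint heatmaps over time.

    keypoints: (T, K, 2) array of (x, y) coordinates, already scaled to (w, h)
    scores:    (T, K) array of per-keypoint confidence values
    Returns a (K, T, h, w) volume for the 3D-CNN behavior detector.
    """
    T, K, _ = keypoints.shape
    volume = np.zeros((K, T, h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    for t in range(T):
        for k in range(K):
            x, y = keypoints[t, k]
            c = scores[t, k]
            if c <= 0:
                continue  # skip keypoints that were not detected
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            volume[k, t] = c * g  # weight the Gaussian by its confidence
    return volume
```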
Development of an ADL Monitoring System
Framework
The study framework (Figure 3) records the behaviors of older adults using Posec3d in order to evaluate their ADLs and detect intrinsic risks. Posec3d achieves state-of-the-art performance in action recognition using human skeleton coordinates33.
In this study, we evaluated behaviors of older adults related to social isolation, self-neglect, and long-term unhealthy habits. The behaviors examined fall into three categories: (1) basic behaviors that can be used to evaluate the personal capacity for ADLs, (2) behaviors that may be harmful to long-term health, and (3) behaviors that can identify the social relationships of older adults.
Data Processing
All processing was carried out using Python. Figure 4 shows the series of steps used to preprocess the data. These steps produce the annotation that serves as the learning source for the Posec3d algorithm.
The data were divided into training and test sets at a ratio of 8:2 based on the number of participants: a training dataset (8 males, 8 females) and a test dataset (2 males, 2 females). The ETRI-Activity3D videos were filmed at 25 fps, and each video was sliced into individual frame images at 25 fps using ffmpeg.
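As an illustration of this slicing step (the study’s exact commands are not given; the file paths and output pattern below are hypothetical), a minimal Python wrapper around ffmpeg might look like this:

```python
import subprocess
from pathlib import Path

def slice_video(video_path: str, out_dir: str, fps: int = 25) -> None:
    """Extract frame images from a video at the given rate using ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,   # input video
         "-vf", f"fps={fps}",          # sample frames at 25 fps
         str(Path(out_dir) / "img_%05d.jpg")],
        check=True,
    )

# Hypothetical example for one ETRI-Activity3D clip:
# slice_video("A001_P001_G001_C004.mp4", "frames/A001_P001_G001_C004")
```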
The annotation is a collection of information about the data used for learning and contains, for each video, the fields shown in Table 3. This information is stored in a dictionary format, and the final annotation is a list containing the annotation of each video. Here, “Frame_dir” distinguishes the name of the video, and “Img_shape” and “Original_shape” are (height, width) = (1080, 1920). Because each video has a different length, the total number of frames also varies.
Table 3
Annotation definitions and examples used in the system

| Item | Explanation | Data format | Explanation of data format |
| --- | --- | --- | --- |
| Frame_dir | Name of the video | String | e.g., A001_P001_G001_C004 |
| Img_shape | Size of the image data | Tuple | (height, width) = (1080, 1920) |
| Original_shape | Original size of the video | Tuple | (height, width) = (1080, 1920) |
| Total_frames | Total number of video frames | Integer | e.g., 456 |
| Keypoint | Skeleton (x, y) coordinates of the people in each frame of the video | Array | [N (number of people), T (number of frames), K (number of keypoints = 17), 2 (x, y coordinates)] |
| Keypoint_score | Confidence value of the skeleton of the person in each frame of the video | Array | [N (number of people), T (number of frames), K (number of keypoints = 17)] |
| Label | Class of the video | Integer | e.g., 0 |
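Following the field definitions in Table 3, a single annotation entry can be sketched as a Python dictionary (the values below are illustrative, mirroring the examples in the table):

```python
import numpy as np

# One illustrative annotation entry; N = people, T = frames, K = 17 keypoints.
N, T, K = 1, 456, 17
annotation = {
    "frame_dir": "A001_P001_G001_C004",     # name of the video
    "img_shape": (1080, 1920),              # (height, width)
    "original_shape": (1080, 1920),         # (height, width)
    "total_frames": T,                      # total number of frames
    "keypoint": np.zeros((N, T, K, 2)),     # (x, y) coordinates per keypoint
    "keypoint_score": np.zeros((N, T, K)),  # confidence per keypoint
    "label": 0,                             # behavior class number
}
# The final annotation is a list with one such dictionary per video.
annotations = [annotation]
```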
The keypoints were extracted using HRNet35; a total of 17 skeleton coordinates were used, and their order is specified in Table 4. Here, “Keypoint_score” indicates the confidence value of each keypoint: the higher the confidence value, the more accurately the human skeleton coordinate was detected. Furthermore, “Label” represents the behavior filmed in the video, with each behavior expressed as a number (Table 4).
Table 4
Definitions of the keypoints used for pose extraction and the behavior class numbers
Algorithm for judging the behavior of older adults
Posec3d is an action classification algorithm that uses skeleton coordinates; that is, it classifies an entire video into a single class. However, it was difficult to use in its original form because older adults’ daily behaviors are too complex, and it must be determined which behavior is performed at which time within the daily routine. Therefore, a window of 90 frames was defined as the duration of an action, and the problem was mitigated by re-running Posec3d every five frames. Using this sliding-window approach, action detection was performed every five frames to identify what the older adult was doing, and each detected action and its corresponding time were recorded in the database. The video is divided into frame images according to the set frame rate; at 25 fps, five frames correspond to 0.2 s and 250 frames to 10 s. Time information was derived in this way, which allowed us to record the start and end time of each activity, build a database, and derive from it what activities the older adult performed at what time. Through this process, repeated daily behaviors can be identified and used as baseline data for ADLs via the stored database.
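A minimal sketch of this sliding-window detection loop follows (our illustration; classify is a hypothetical wrapper around the trained Posec3d model):

```python
FPS = 25      # frames per second of the source video
WINDOW = 90   # frames per classification window (the defined action duration)
STRIDE = 5    # slide the window by five frames (0.2 s at 25 fps)

def detect_actions(frames, classify):
    """Classify each 90-frame window and convert frame indices to seconds."""
    records = []
    for start in range(0, len(frames) - WINDOW + 1, STRIDE):
        label = classify(frames[start:start + WINDOW])  # Posec3d on the clip
        records.append([start / FPS, (start + WINDOW) / FPS, label])
    return records

def merge_records(records):
    """Merge consecutive windows with the same label into start/end intervals
    suitable for storing in the activity database."""
    merged = []
    for start, end, label in records:
        if merged and merged[-1][2] == label:
            merged[-1][1] = end  # extend the current interval's end time
        else:
            merged.append([start, end, label])
    return merged
```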
Experiment
In this experiment, we used the Ubuntu 18.04 LTS operating system and trained the model on two RTX 3090 GPUs. The total batch size was set to 64, the learning rate to 0.025, and training ran for 1000 epochs with stochastic gradient descent as the optimizer and cross-entropy as the loss function. The model was validated on the test annotation every 10 epochs, and the checkpoint with the highest accuracy on the target behaviors of older adults was selected as the final model. Specifically, among the 100 validations over the 1000 epochs, the model at epoch 940 showed the best performance (Figure 5).
<Figure 5 about here>
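The reported setup corresponds to a standard supervised training loop; the sketch below is our generic PyTorch rendering of those hyperparameters (the study’s actual training code is not provided, and the function and loader names are assumptions):

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cuda"):
    """SGD with lr = 0.025, cross-entropy loss, 1000 epochs,
    and validation on the test annotation every 10 epochs."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.025)
    criterion = nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None
    for epoch in range(1, 1001):
        model.train()
        for clips, labels in train_loader:   # total batch size 64
            optimizer.zero_grad()
            loss = criterion(model(clips.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
        if epoch % 10 == 0:                  # validate every 10 epochs
            model.eval()
            correct = total = 0
            with torch.no_grad():
                for clips, labels in val_loader:
                    pred = model(clips.to(device)).argmax(dim=1).cpu()
                    correct += (pred == labels).sum().item()
                    total += labels.numel()
            if correct / total > best_acc:   # keep the best checkpoint
                best_acc = correct / total
                best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_state, best_acc
```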