Reinforcement Learning-Based Assistive Framework for Disabled Persons

Robot programming by demonstration (PbD) enables human users to teach a robot and extend its capabilities through interactive teaching, without having to program the robot manually. PbD combined with intelligent machine learning algorithms can help develop autonomous robots for various industrial and domestic tasks. One such task is pouring liquid from a bottle into a cup or glass. In this paper, the first step is for a human user to teach the robot the liquid pouring task, in order to help physically disabled people prepare various types of drinks; a dataset is then created from the user demonstrations. In the training stage, this dataset is fed to a decision-making algorithm based on reinforcement learning. The algorithm enables the robot to learn to pour different liquids under different pouring conditions with the minimum number of human demonstrations needed for learning the task. The acquired results show that the robot can learn to pour different liquids and can accurately adapt to different pouring conditions by using a reward-based system and online feedback. Furthermore, the results show that the proposed framework can work with different types of liquid and container sizes without any further reprogramming or demonstrations.


Introduction
Autonomous assistive robots aim to help humans with tasks that humans cannot perform themselves, or to support people in their daily life activities for better living standards. Recent developments in the field of robotics and the availability of computationally fast algorithms enable robots to interact with humans and learn skills from human examples. Assistive robots can provide the most efficient way of assisting humans, especially physically disabled people, in making their lives independent [1] [2].
There are different ways to execute the pouring process, such as tilting-type pouring [3], pressure-type pouring [4], etc. The main task in this research is to control the falling position of the pouring container in such a way that the required amount of liquid is poured without spilling from the container. Numerous control schemes such as [5] [6], as well as machine learning techniques combined with PbD [7] [8], have been developed to control the liquid level and the flow rate of the liquid in order to avoid spilling and sloshing.
In [9], teleoperation teaching is used for the demonstrations and force-based perception is used for the pouring operation. Kinesthetic teaching is used in [10] for learning motion primitives using a motion refinement tube. In [11], the task is demonstrated using motion-based sensors and markers attached to the robot in order to design humanoid behaviors. In our research, precise control of the robot gripper rotation is achieved by using the Robot Operating System (ROS) for the collection of demonstrations. Furthermore, the liquid flow rate and gripper rotation are controlled by measuring the weight of liquid poured into the cup. This eliminates the need for multiple sensors.
A tilting-type automatic pouring system is developed in [5] using a mathematical model based on the Navier-Stokes and two-dimensional continuity equations; proportional control schemes are used in [5] for control. The container's falling pouring position, as well as the height between the cup and the container, is controlled in [6] by a feedforward control system consisting of a motor model, the pouring model, and the free-fall process. A tilting-type automatic liquid pouring system is proposed in [12], where the flow rate model is designed using mass equilibrium and Bernoulli's theorem. In our research, since the robot learns the pouring task from human demonstrations, no complex mathematical modeling is required to establish a relationship between the liquid flow rate and the robot gripper rotation. Visual feedback is used in [8] to pour a specified amount of liquid. In this work, a weight scale is used to acquire the amount of water poured into the cup for controlling the gripper rotation, which reduces computational complexity due to less processing.
The current work is based on programming by demonstration for the convenience of the user, and it uses reinforcement learning to encode the relation between the liquid flow rate and the robot gripper rotation. The developed system learns new tasks with the help of online feedback.

Literature Review
In recent decades, industrial robots have become a significant commercial success. Due to growing technology in robotics and the development of intelligent and fast algorithms, robots are no longer limited to industrial applications; their use is also extending to domestic and assistive applications. An assistive robot is one that provides aid or support to a human user. Assistive robots can be used to improve the independence and quality of life of people with disabilities. Assistive robotics covers a wide range of research areas, such as adaptable robots that act as aides or companions [13] or robots designed to provide physical assistance for mobility and support [14].
In the domestic domain, several kinds of assistive robots have been developed to assist elderly people in homes and hospitals. The robot Pearl was designed and developed to autonomously guide elderly people with cognitive and physical disabilities [15]. Different robots have been designed to emulate domestic animals, such as a cat or a seal, in order to provide companionship to elderly people. The purpose of that research was to reduce the stress level of elderly people and nursing staff [16].
Various studies have been carried out to help people with physical impairments. Assistive robots in this domain are usually designed as wheelchairs or robot manipulators. The FRIEND robot [17] [18] is an example of such an intelligent wheelchair. The robotic wheelchair is equipped with a robotic manipulator and different sensors for perceiving its surrounding environment, such as stereo camera vision, force-torque sensors, and weight-measuring sensors. The robot was designed not only to provide mobility to the disabled person but also to help in daily life activities, such as serving food and drinks, and to provide support at work [16]. Four generations of the FRIEND robot are shown in Fig. 1.
Research has shown that assistive robots can help children with cognitive disorders through therapeutic interaction [19]. In another study [20], an assistive robot was employed to help people with cognitive impairments. The robot in that case acted as a motivator and assisted people through cognitively stimulating challenges.

The approaches proposed in this research for the liquid pouring task are based on machine learning techniques. As described in [5] [6] and [12], a mathematical model of the liquid flow rate and its environment is needed for controlling the pouring of a liquid. However, by using machine learning techniques such as [7] and [21], only human demonstrations are needed for learning, and the robot performs quite well even in the presence of disturbance and in unknown scenarios. The first approach proposed here is somewhat similar to [7], in which the hidden Markov model, Gaussian mixture model, and Gaussian process regression are used to encode the relation between the robot's gripper trajectory and the weight of the liquid in the cup, and Gaussian mixture regression is used to retrieve the learned trajectory. The second proposed approach is based on reinforcement learning as in [21], but instead of learning DMPs, a decision-based control system is developed by learning the required gripper actions for the given pouring conditions. The learning algorithm is reward-based, and the robot tries to find the optimal path by selecting actions that yield maximum reward values. In this way, a generalized algorithm is developed that can work with any type of pouring container and conditions, because the algorithm can be updated online and can thus learn new pouring actions that were not given at the time of training.

Experimental Setup
In our experiment, a square table is used on which the robotic arm is placed at a fixed position. It is assumed that the robot has already gripped the bottle and is already at the pouring location. The cup is placed on the weight scale at the target location. With the help of experimental results, the initial tilt angle of the gripper, i.e., the angle after which the pouring process starts, is calculated. Different cup and bottle sizes are used in different experiments. The experimental setup is shown in Fig. 2.
The robot is controlled by the user through ROS. The gripper rotation is discretized, and only predefined gripper rotations (or actions) are allowed. The gripper rotation steps that can be taken by the robot are 2°, 3°, 4°, 5°, 6°, 7°, and 8°. The user takes demonstrations with each gripper action, one by one, by completely filling the empty cup; the pouring bottle is initially full. In this way, seven demonstrations in total are needed for learning. However, the second learning approach used in this research needs fewer human demonstrations and is explained in Sect. 5.3.2. The demonstrated gripper data and the corresponding weight scale data are synchronized with the help of timestamps.
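The timestamp synchronization can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: it assumes each gripper sample is simply paired with the nearest-in-time weight sample.

```python
from bisect import bisect_left

def synchronize(gripper_log, weight_log):
    """Pair each gripper sample with the nearest-in-time weight sample.

    gripper_log, weight_log: lists of (timestamp_seconds, value) tuples,
    each sorted by timestamp.  Returns a list of
    (timestamp, gripper_angle, weight) triples.
    """
    w_times = [t for t, _ in weight_log]
    synced = []
    for t, angle in gripper_log:
        i = bisect_left(w_times, t)
        # choose whichever neighbouring weight sample is closer in time
        if i == 0:
            j = 0
        elif i == len(w_times):
            j = len(w_times) - 1
        else:
            j = i if w_times[i] - t < t - w_times[i - 1] else i - 1
        synced.append((t, angle, weight_log[j][1]))
    return synced
```

A design note: nearest-neighbour matching keeps one weight reading per gripper sample, which matches the statement that each demonstration contains an equal number of weight and gripper angle samples.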
The change in the quantity of liquid filled into the cup for all demonstrations, using the different gripper rotation step sizes (actions), is shown in Fig. 3, where the y-axis shows the amount of liquid for each sample on the x-axis. Figure 3 shows that bigger gripper rotations pour more liquid, but as the level in the cup rises sufficiently, smaller actions are needed to pour less water and avoid overspilling. The proposed method uses this feature to fill the cup with the desired amount of liquid without spilling.
The hardware used in this work consists of a Jaco 2, a 7-DOF spherical robotic arm, for pouring the liquid, and an electronic weight scale to measure the weight of the liquid poured into the cup after every robot gripper rotation. The robot gripper consists of three fingers and can manipulate different physical objects. The movement of the robot can be controlled by a controller or a computer. The weight scale used is a PCE-PM 1.5 T. The weight data with timestamps are transferred online from the scale to the computer via serial communication. ROS is used for manipulating the robotic arm and for data logging. Many functions can be performed in ROS, such as control and planning, sensor processing and evaluation, etc.
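A minimal sketch of handling the scale's serial output could look like the following. The line format (`seconds;grams`) is a hypothetical placeholder, since the actual PCE-PM output protocol is not described here and should be taken from the device manual.

```python
def parse_weight_line(line):
    """Parse one serial line of the assumed form '<seconds>;<grams>'
    into a (timestamp_seconds, weight_grams) tuple.  The field layout
    is a hypothetical stand-in for the real PCE-PM protocol.
    """
    ts_field, grams_field = line.strip().split(";")
    return float(ts_field), float(grams_field)

# Reading the stream with pyserial might then look like (untested sketch,
# assuming port name and baud rate):
#   import serial
#   with serial.Serial("/dev/ttyUSB0", 9600, timeout=1) as port:
#       ts, grams = parse_weight_line(port.readline().decode())
```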
There are three ways to control the Kinova robotic arm. When the cup is about to be filled completely, small gripper angle steps are required. Figure 11 shows the change in the amount of water filled into the cup during the respective demonstrations.

Feature Selection
After demonstrating the task to the robot, the next step is to create a database for learning algorithms.
Only the gripper joint is considered for the task, and the rest of the joint angles are discarded. The weight data and gripper data are synchronized with each other with the help of timestamps. Each demonstration contains an equal number of weight and gripper angle samples. Several databases with different numbers of demonstrations are created to compare the efficiency of different machine learning algorithms. First, databases are created using only a single demonstration in order to check the response of the system for each demo; other databases with two and four demonstrations are also considered for comparing and improving system performance. Figure 12 shows the change in the amount of liquid in the cup during the demonstration with respect to gripper rotation.
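The database construction can be illustrated with a short sketch: the synchronized cumulative cup weights from each demonstration are converted into per-action pour increments, which is the quantity the learning algorithms later use. The dictionary layout is an assumed representation for illustration, not the paper's actual file format.

```python
def build_database(demos):
    """demos: {action_deg: [cumulative cup weight after each step]} for
    the demonstrations included in a database.  Returns
    {action_deg: [grams poured per step]}, i.e. the per-action weight
    increments derived from consecutive scale readings."""
    return {action: [after - before for before, after in zip(w, w[1:])]
            for action, w in demos.items()}
```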

Liquid Pouring Framework
The framework of open-loop control of the pouring process through machine learning algorithms is described in Fig. 13.
In open-loop robot control, the robot only learns the pouring actions from the given demonstrations, and it transfers the same amount of liquid each time. There is no feedback from the output of the system. The overall system performance depends on the parameters of the learning algorithm. The open-loop robot control is implemented in MATLAB. The closed-loop decision-based pouring system is shown in Fig. 14.
In closed-loop control, the user has more options and flexibility while teaching the robot. The algorithm needs information about the size of the bottle, the required amount of water, and online feedback from the user. Due to the user-provided feedback, the algorithm becomes more robust and accurate, and at the same time it allows pouring different fluids from different bottles with little information provided by the user.
Implementation of Pouring Task Using Reinforcement Learning

Reinforcement learning provides learning from interaction and is based on how the environment responds to the performed action. Bigger rewards are assigned to those actions that take the system closer to its target goal. The relationship between gripper actions and the liquid flowing out of the bottle is developed using reinforcement learning. The overall system is based on reward-based learning. The task of the system is to achieve the goal (i.e., to fill the cup with a defined amount of liquid) by taking actions that give maximum rewards and lead the system towards its final goal. The flowchart in Fig. 17 shows the approach used for learning by the robot using reward-based reinforcement learning.
The reinforcement learning components are defined as:
I. A set of K states S = {s1, s2, ..., sK}.
II. A set of possible actions A = {a1, a2, ..., aN}; each action affects the state of the system.
III. A set of rewards R(s, a) for every possible action a, based on the current state s of the system.
IV. A set of transition probabilities P(s' | s, a) that defines the possibility of moving from one state to another.
In our approach, three parameters play an important role in learning the task: the states of the system, the actions that can be performed by the robot, and the reward, which is based on the amount of liquid poured in response to a robot action.

Reward Matrix
Each possible state of the system is represented by the reward matrix and it also shows the actions that can be taken from any given state in order to carry out a transition from one state to another state. Two approaches are used in this research for constructing the reward matrix.

Approach I
The first demonstration is taken using only the 2-degree rotation, completely filling the empty cup. The amount of liquid poured during each action is taken as the reward value. The total amount of liquid present in the cup after each action is regarded as the state of the system. The state sequence after the 2-degree demonstration is shown in Table 1. The user then performs the same experiment using all the remaining gripper actions. With the help of these demonstrations, information about each action from any given state is stored in the reward matrix. The elements of the reward matrix represent the amount that will be poured in response to any action from the given state. Table 2 shows the reward matrix for pouring water, where −100 indicates that the respective action is not allowed from the current state.
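A sketch of Approach I's reward-matrix construction is given below. The state discretization (a fixed grams-per-state step) and the matrix layout are assumptions for illustration; the paper's Tables 1 and 2 define the actual values.

```python
import numpy as np

DISALLOWED = -100  # marks actions not observed from a state, as in Table 2

def reward_matrix(demos, n_states, state_step):
    """Build the reward matrix from full-cup demonstrations.

    demos: {action_deg: [cumulative cup weight after each step]} -- one
    demonstration per gripper action, as in Approach I.
    n_states: number of discretized cup-content states.
    state_step: grams per state (state index = weight // state_step);
    this uniform discretization is an assumption.
    Returns (R, actions): R[state, a_idx] is the grams poured by action
    actions[a_idx] from that state, or DISALLOWED if never observed.
    """
    actions = sorted(demos)
    R = np.full((n_states, len(actions)), DISALLOWED, dtype=float)
    for a_idx, action in enumerate(actions):
        weights = demos[action]
        for before, after in zip(weights, weights[1:]):
            state = int(before // state_step)
            R[state, a_idx] = after - before  # grams poured by this action
    return R, actions
```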

Approach II
In the first approach, seven demonstrations are needed for learning, which is quite time-consuming and difficult for the user. In this approach, the user has to take only two demonstrations, with a 2-degree gripper action and a 3-degree gripper action. Data for the rest of the actions are estimated by the algorithm based on the given demonstrations. The reward matrix with estimated reward values is given in Table 3. The calculation of the estimated reward values from the given demonstrations is shown in Table 4, where the state variable represents the state of the system.
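The estimation step can be illustrated as a simple extrapolation: since only the 2-degree and 3-degree pours are observed, rewards for larger actions can be guessed from the per-degree increase between them. This linear rule is an assumption standing in for the paper's actual calculation in Table 4, and online training is expected to correct its errors.

```python
def estimate_reward(r2, r3, action_deg):
    """Estimate the reward of an undemonstrated gripper action in a state.

    r2, r3: grams poured by the 2-degree and 3-degree actions in that
    state.  A linear extrapolation in the rotation step is assumed here
    (an illustrative guess, not the paper's exact rule from Table 4).
    """
    slope = r3 - r2                      # extra grams poured per extra degree
    return r2 + slope * (action_deg - 2)  # extrapolate from the 2-degree base
```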

Working Process
Once the reward matrix has been created with the help of the demonstrations, the robot is able to pour the liquid and its online training begins, which depends on the data of the reward matrix. In online training, the Q-matrix is initialized from the reward matrix. The initial amount of liquid in the cup, the liquid present in the bottle, and the user's required amount of liquid are also taken into consideration when choosing suitable actions. Depending on the initial amount of liquid in the bottle, the algorithm determines the initial tilt angle of the bottle after which the pouring process begins. According to the required amount of liquid and the current state of the system (the initial amount of liquid in the cup), the algorithm chooses the most appropriate action from the given state. After each action, the algorithm gets feedback on the actual amount of liquid poured in response to that action. This value is then compared with the predicted reward value; if there is a significant difference between the predicted and actual reward values, the reward value for that action is updated in the Q-matrix. The algorithm has completely learned the pouring task when the reward values in the Q-matrix stop updating and the algorithm converges.
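The working process above can be sketched as a greedy action selector plus an online correction rule. The "largest pour not exceeding the remaining requirement" policy and the tolerance threshold are assumptions filling in details the text leaves open (it only says a "significant difference" triggers an update).

```python
DISALLOWED = -100  # marks actions not available from a state

def choose_action(Q, state, remaining):
    """Greedy policy: among allowed actions from `state`, pick the one
    whose predicted pour is largest without exceeding the remaining
    grams the user still needs.  Returns an action index or None."""
    best_idx, best_r = None, None
    for idx, r in enumerate(Q[state]):
        if r == DISALLOWED or r > remaining:
            continue
        if best_r is None or r > best_r:
            best_idx, best_r = idx, r
    return best_idx

def online_update(Q, state, action_idx, actual_poured, tol=2.0):
    """If the measured pour deviates from the stored prediction by more
    than `tol` grams (an assumed threshold), overwrite the Q value with
    the observed amount."""
    if abs(Q[state][action_idx] - actual_poured) > tol:
        Q[state][action_idx] = actual_poured
```

Under this sketch, convergence corresponds to `online_update` no longer changing any entry over successive trials.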

Results
In this research, various experiments are done with different liquids and container sizes and with different initial conditions. In all experiments, there are some user-given requirements needed by the algorithm to characterize the current environment, such as the maximum cup capacity, the required amount of water, the initial amount of water in the cup and bottle, and the maximum capacity of the bottle.
I. The first experiment is done by pouring water, and the reward matrix used here is constructed from seven user demonstrations. The specifications of the experiment are: maximum cup capacity: 350 g; required amount of water: 350 g; initial amount of water in the cup: 0 g; weight of liquid in the bottle: 434 g; maximum capacity of the bottle: 500 g. The Q-matrix used for reward prediction is given in Table 5. In the 1st trial, the actions performed by the robot to fill the required amount into the cup are given in Table 6. In the 1st trial, the robot fills the cup with 344 g of liquid, and the difference between the actual and predicted amounts is only 6 g. However, three reward values are updated in the Q-matrix, so in the next trial the robot should not make a similar mistake. The updated matrix is given in Table 7; the value highlighted in red represents the value that is updated in the Q-matrix.
II. In the second experiment, water is poured into the cup, but the reward matrix used here is constructed from only two demonstrations, and the rest of the values are predicted. The user requirements are the same as in experiment 1. The Q-matrix from the predicted data is given in Table 8. In the 1st trial, the actions performed by the robot to achieve the task are given in Table 9. The final amount of water in the cup is 330 g, which represents an error of 20 g because the required amount was 350 g. However, as the robot performs the task, it learns from the mistakes it made through online learning. In the 2nd trial, the robot action sequence is given in Table 10. In the second trial, the robot fills the cup with 335 g of water, which is closer to the required amount than in the first trial. In this way, the level of accuracy increases in proportion to the number of iterations performed.
III. In this experiment on pouring soft drinks, the robot is first trained offline with soft drinks by taking two demonstrations together with predicted data, and the resultant reward matrix represents the amount poured for every action from a given state. The reward matrix used in the 1st experiment could also be used here, but it would take a long time to converge because it was trained for water. The user requirements in this experiment are: maximum cup capacity: 280 g; required amount of liquid: 280 g; initial amount of liquid in the cup: 0 g; weight of liquid in the bottle: 396 g; maximum capacity of the bottle: 424 g. In this scenario, the system states are given in Table 11. The reward matrix from the predicted soft drink data is given in Table 12. In the 1st trial, the robot action sequence is given in Table 13. The total amount poured into the cup is 227 g, with an error of 53 g. Here, the offset angle defines the amount of rotation required to start the pouring process. The actions performed in the second trial are given in Table 14. In the 2nd trial, the performance of the robot improves: it pours 260 g of liquid overall, with an error of only 20 g, which is much less than the error of 53 g in the first trial. Due to online feedback, the algorithm continuously updates its reward values in the Q-matrix until it achieves satisfactory results.
IV. In the last experiment, water is poured into a glass in which some water is already present. This represents a non-zero initial condition. The updated Q-matrix from the 1st experiment is used here. The user requirements are: maximum cup capacity: 350 g; required amount of water: 350 g; initial amount of water in the cup: 77 g; weight of liquid in the bottle: 263 g; maximum capacity of the bottle: 500 g. The actions performed by the robot to complete the task are given in Table 15. The final amount in the cup is 348 g, while the required amount is 350 g, an error of only 2 g. This shows that as the robot performs the operation more and more, its performance improves due to online adaptation.

Gaussian Process For Regression
Gaussian process regression (GPR) is also used to encode and retrieve the trajectories required for the pouring process using different demonstrations. A single demonstration is given to GPR, which produces a trajectory for pouring liquids after processing the input trajectory. GPR using two demonstrations gives better results than using a single demonstration, except at the start of the pouring process, where the gripper moves back slightly before pouring begins. This is because of the large difference between the starting points of the two demonstrations. The rest of the regions are successfully reproduced according to the requirements. The results are shown in Fig. 19. The results of GPR with four demonstrations are not very encouraging: as can be seen in Fig. 20, an almost linear relationship between the gripper angle and the amount of liquid is established, which is not the case here.
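A minimal GPR sketch in the spirit of this section is shown below, using scikit-learn and synthetic data in place of the recorded demonstrations; the kernel choice (an RBF term plus a white-noise term) and the data shape are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Demonstration: cumulative cup weight (input) vs. gripper angle (output).
# The data below is synthetic; real demonstrations would come from the logs.
weight = np.linspace(0, 350, 40).reshape(-1, 1)   # grams of liquid in the cup
angle = 20 + 0.1 * weight.ravel() \
        + np.random.default_rng(0).normal(0, 0.5, 40)  # noisy gripper angles

gpr = GaussianProcessRegressor(kernel=RBF(50.0) + WhiteKernel(0.25),
                               normalize_y=True)
gpr.fit(weight, angle)

# Retrieve the gripper trajectory for a range of target fill levels,
# together with the model's uncertainty at each query point.
query = np.linspace(0, 300, 10).reshape(-1, 1)
retrieved, std = gpr.predict(query, return_std=True)
```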
The performance of the decision-based pouring process gives satisfactory results for both demonstrated and predicted data. In both cases, the algorithm is able to perform suitable actions for the pouring process. However, for demonstrated data, the user has to perform seven demonstrations to teach the robot. In the pouring process, it is difficult to maintain the same initial conditions for all the demonstrations. For example, it is difficult for the user to keep the same grasping point of the bottle for the robot gripper at the start of each demonstration. Also, refilling the bottle with the same amount of liquid each time is not possible for a human user. Varying initial conditions for the demonstrations may produce wrong reward values for the actions. Additionally, it is time-consuming for the user to demonstrate the pouring task for each gripper action. To avoid this situation, predicted data can be used so that the user has to take only two demonstrations. If the predicted reward values differ from the actual reward values, the algorithm updates them during online training.

Conclusion
Programming by demonstration combined with a reinforcement learning-based approach is used here to teach the pouring task to an assistive robot so that it can prepare drinks, especially for disabled people. The advantage of using PbD is that the robot can be reprogrammed easily by any unskilled user. The proposed approach is based on reward-based learning, where the learning agent continually tries to improve its action sequence based on the reward values it receives in consequence of particular actions. Due to online learning, a generalized pouring model is developed that can work with various sizes of glasses and bottles and with different types of liquids. The results show satisfactory performance of the algorithm even in unknown environments, together with the capability to correct action sequences, which results in better performance. In future work, the amount of liquid will be measured from visual data instead of the weight of the liquid. The weight scale takes more space, and its sampling frequency is not high enough for the pouring process. Furthermore, to measure the weight of the liquid, a separate weight scale has to be installed to give real-time feedback to the algorithm. Visual devices, in contrast, take less space and are easy to install. With the help of visual data, the volume of the fluid can be estimated, which can be further used to control the pouring process. In this research, only the gripper rotation is controlled; however, the gripper movement could also be controlled in the other Cartesian dimensions to eliminate fluid sloshing.

Figure 12: Gripper rotation with respect to change in weight
Figure 13: Robot control for pouring liquids using probabilistic modeling approaches
Figure 14: Decision-based robot control for pouring liquids
Figure 15: Caption not included with this version
Figure 16: Caption not included with this version
Figure 17: Reward-based reinforcement learning of the liquid pouring task
Figure 18: Demonstrated and retrieved trajectories using GPR
Figure 19: GPR results using two demonstrations
Figure 20: GPR results using four demonstrations