Dynamic Goal Tracking for Differential Drive Robot Using Deep Reinforcement Learning

Stable control is one of the basic requirements for steady robot navigation, yet the selection of control values is highly environment dependent. To make control parameters reusable, the system must generalize over the environment. Reinforcement learning is a promising approach for adding the adaptability that lets robots perform effectively in environments with no prior knowledge. However, tuning hyperparameters and attaining a correlation between the state space and the reward function that yields a stable reinforcement learning agent is a challenge. This paper focuses on designing a continuous reward function that minimizes sparsity and stabilizes policy convergence, in order to attain control generalization for a differential drive robot. To achieve this, Twin Delayed Deep Deterministic Policy Gradient (TD3) is implemented on the PyBullet Racecar model in the OpenAI Gym environment. The system was trained to achieve a smart primitive control policy: moving forward in the direction of the goal while maintaining an appropriate distance from walls to avoid collision. The resulting policy was tested on unseen environments, including a dynamic goal environment, a boundary free environment, and a continuous path environment, on which it outperformed Deep Deterministic Policy Gradient (DDPG).


Introduction
Wheeled robots constitute one major paradigm of mobile robots. They are gaining importance in routine tasks by effectively assisting humans across a diverse range of disciplines, including healthcare and elderly assistance, social and commercial roles at marketplaces and in transportation, and assistance at schools, offices, and even homes, covering both indoor and outdoor environments. Roomba, TurtleBot, Pioneer 3-AT, and the Pepper robot are a few examples [1,2].
Such robots interact with human-inhabited environments. These environments are dynamic, typically unstructured, and highly uncertain. Traditional control systems in such environments suffer degraded performance or even fail [3,4], because traditional control techniques are based on manual pre-programming of hand-crafted models or on classical physics-based modelling tools. They therefore fail to generalize over the environment setting and cannot effectively adapt to a changing environment [4,5].
Model-learning controls have the capacity to generalize over the problem setting and can adapt to a dynamic environment. Contrary to traditional controls, the parameters of the model can be estimated directly from data collected on the real system, so they can effectively incorporate the unknown nonlinearities of the system [4]. However, to deploy model-predictive controls, state-of-the-art approaches often adopt a sequential pipeline: state estimation, contact handling with the environment, trajectory prediction and optimization, model-based control prediction, and finally an operational space command [6,7]. Designing and developing such a pipeline depends on the availability of accurate dynamic models of the robot, which is not a trivial task and requires technical expertise with the system.
In contrast, end-to-end deep reinforcement learning (DRL) can effectively abstract away the low-level, system-dependent specifications with a relatively simple reward function, excluding the need for any prior knowledge about the dynamics of the robot and the environment. This can ensure the performance of the robot without explicit system identification or manual engineering [8]. If DRL is implemented effectively, it can automate controller design by eliminating the requirement for system identification, producing controls that are directly tuned and generalized for a specific robot and environment [7].
For training an agent with a continuous action space using reinforcement learning (RL), Twin Delayed Deep Deterministic Policy Gradient (TD3) [9] is the state-of-the-art algorithm. It is a successor of Deep Deterministic Policy Gradient (DDPG) [10] that ensures a stable and robust actor update, minimizes overestimation bias in the critic network, and consequently stabilizes learning for continuous action spaces. Implementations of DDPG and TD3 on numerous robotic systems are available in the literature [8], extensively on unmanned aerial vehicles (UAVs) [11] and on the legged robots of the OpenAI Gym environments, including Ant, HalfCheetah, and Walker [9]. A few implementations of DDPG on differential drives are also available, e.g., autonomous driving cars [12], skid steering in differential drive [13], optimal torque distribution [14], and obstacle detection for differential drive [15]. However, to the best of our knowledge, TD3 has not yet been implemented on a differential drive. This paper focuses on designing and testing a forward-moving primitive policy with TD3 for a differential drive, using the PyBullet Racecar model in the OpenAI Gym environment. We ensure movement in the direction of the goal while maintaining an appropriate distance from the surrounding obstacles. In this paper, we make the following contributions:
i. Design a continuous reward function to train TD3 on a differential drive car.
ii. Test the performance of the trained TD3 policy on various unseen environments.
iii. Compare the performance of TD3 with DDPG under the same training and testing setup.
The structure of this paper is as follows: RL preliminaries and the required algorithms are covered in Sect. 2; the design of the system is discussed in Sect. 3; the implementation of the experimental setup and testing details are given in Sect. 4; results are provided in Sect. 5; Sect. 6 covers the discussion of results, followed by the conclusion in Sect. 7.

Background
This section introduces the essential terminology and architecture of the RL framework, along with the fundamentals of the RL algorithms DDPG and TD3.

Reinforcement Learning Preliminaries
The RL framework solves a problem by trial-and-error interaction with the environment. An agent observes an environment state, takes an action, receives a reward for every action, and experiences the transition to the next state. The goal is achieved by selecting actions that maximize the received reward. The RL agent therefore aims to learn, over time steps, a policy that generates the maximum cumulative reward for the given problem setting. The data is sequential, as the RL agent applies the policy at each step and improves itself over time to estimate the best action for the given problem state. To deal efficiently with sequential data, the RL problem is typically modeled as a Markov decision process (MDP), since an MDP tuple can effectively represent one sequential transition. To model the interaction of the agent with the environment over discrete time steps t = 0, 1, 2, ..., ∞, a single MDP tuple consists of the current state s_t ∈ S (the state space of the agent), the executed action a_t ∈ A (the action space of the agent), the reward r_t received according to the reward function R(s, a), the next state s_{t+1} ∈ S experienced according to the transition probability P(s_{t+1} | s_t, a_t), and the discount factor γ, which is used to compute the return G_t, the discounted reward over time steps.
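Written in the standard discounted-sum form (stated here for reference), the return is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$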
Here R_{t+1} and R_{t+2} are the expected future rewards from the given state s_t, and γ is bounded in [0, 1], ranging from myopic to farsighted evaluation. The purpose is to incorporate the reward of future states into the current state by crediting s_t with the expected sequence of future rewards that follow from s_t and a_t. G_t is intended to improve learning for the RL agent by considering the expected future trail of reward before selecting the next action.
An agent starts from a state s_0 ∼ P(s_0). For a given s_t it samples an action a_t using the given policy π(A | S), a distribution over the action space given the state space, receives a reward signal r_t from the reward function R, and experiences the transition to the next state s_{t+1}. Consequently, experience tuples (s_t, a_t, r_t, s_{t+1}) are constructed and utilized as environment experience to solve the MDP, i.e., to optimize the policy, the distribution that selects actions with maximum expected reward.
To evaluate the policy and judge the goodness of a state-action pair, the Q-value function approximator Q(s, a) is used. The Q-value function computes the expected return G_t, the discounted reward for the given state-action pair, Eq. (2). Maximizing the Q-value leads to the optimal policy, i.e., the actions selected in every state are expected to produce the maximum reward.
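In the usual notation, the action value of a state-action pair under a policy π is this expected return:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s,\ a_t = a \,\right]$$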

DDPG
DDPG is a model-free RL algorithm [10], i.e., no environment model is required; it learns by directly interacting with the environment. It is based on the actor-critic architecture [16] and contributes significantly for continuous action spaces. In the actor-critic architecture, a policy approximator is utilized as the actor and a Q-value function approximator is used as the critic; both approximators are updated concurrently to solve the MDP. The DDPG architecture comprises 2 critic approximators and 2 actor approximators, and typically all the approximators are modeled as neural networks. As demonstrated in Fig. 1, one of the two critic networks is the target Q-network Q_θtarget. It uses tuples (s_t, a_t, r_t, s_{t+1}) ∼ D, the distribution of stored environment transitions, to generate the discounted target, i.e., the discounted estimated Q-value of the given s_{t+1} and the next action a_{t+1} estimated by the target policy network π_θtarget. The sum of the actual r_t and the discounted target is compared with the Q-value estimated for s_t and a_t by the current Q-network Q_θ; to minimize this difference, the loss L(Q_θ, D) of Eq. (3) is computed for the Q_θ update. The current policy π_θ is used to explore the environment by computing a_t for s_t. As shown in Fig. 2, π_θ is updated in the direction that maximizes Q_θ for the action approximated by π_θ in the continuous action space; to update π_θ, the loss L(π_θ, D) of Eq. (4) is computed. Both the π_θtarget and Q_θtarget networks experience soft updates, i.e., the weights of the target networks are moved a small fraction toward the current networks π_θ and Q_θ respectively on each epoch.
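For reference, the usual DDPG formulation of these two losses and of the soft update, consistent with the roles of Eqs. (3) and (4) described above, is

$$L(Q_\theta, D) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}\!\left[\Big(Q_\theta(s_t, a_t) - \big(r_t + \gamma\, Q_{\theta_{\text{target}}}\!\big(s_{t+1}, \pi_{\theta_{\text{target}}}(s_{t+1})\big)\big)\Big)^{2}\right]$$

$$L(\pi_\theta, D) = -\,\mathbb{E}_{s_t \sim D}\!\left[\, Q_\theta\big(s_t, \pi_\theta(s_t)\big) \,\right], \qquad \theta_{\text{target}} \leftarrow \tau\,\theta + (1-\tau)\,\theta_{\text{target}}$$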

TD3
TD3 [9] is a successor of DDPG that effectively enhances performance by reducing the overestimation bias problem, stabilizing actor network updates, and ensuring robustness against noise. As shown in Fig. 3, to deal with overestimation bias, TD3 introduces a twin pair of critic networks, i.e., 4 in total: 2 target Q-networks, Q_θtarget1 and Q_θtarget2, and 2 current Q-networks, Q_θ1 and Q_θ2. The minimum of the two target estimates contributes to the loss computation, biasing the estimate toward underestimation rather than overestimation; this changes the loss function as in Eq. (6).
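In the standard TD3 formulation, the shared critic target takes the minimum of the two target critics and adds clipped noise to the target action, which is the modification Eq. (6) refers to:

$$y_t = r_t + \gamma \min_{i=1,2} Q_{\theta_{\text{target},i}}\!\big(s_{t+1},\, \pi_{\theta_{\text{target}}}(s_{t+1}) + \epsilon\big), \qquad \epsilon \sim \operatorname{clip}\!\big(\mathcal{N}(0, \sigma),\, -c,\, c\big)$$

Both current critics Q_θ1 and Q_θ2 are then regressed toward this single target.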
As shown in Fig. 4, the number of actor networks remains 2. However, to stabilize actor learning, delayed actor updates are introduced, i.e., the actor networks are updated less frequently than the critic networks. This reduces the chance of the actor converging on an unstable spike of the critic update. Moreover, to ensure robustness of the policy, noise regularization is applied by adding clipped noise to the generated action, and training is conducted on the corresponding noise-regularized actions. Both π_θtarget and the pair of Q_θtarget networks experience soft updates, i.e., the weights of the target networks are moved a small fraction toward the current networks π_θ and the pair of Q_θ respectively; this update is not performed on every epoch but with a delay.
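A minimal sketch of the delayed actor update and soft target update described above, assuming PyTorch-style networks; the function and argument names (`critic_1`, `policy_delay`, `targets`) are illustrative, not the paper's implementation:

```python
def soft_update(target_net, current_net, tau=0.005):
    """Move target weights a small fraction toward the current network (soft update)."""
    for t_param, c_param in zip(target_net.parameters(), current_net.parameters()):
        t_param.data.copy_(tau * c_param.data + (1.0 - tau) * t_param.data)

def delayed_actor_update(step, policy_delay, actor, actor_opt, critic_1, states, targets):
    """Update the actor and soft-update all target networks only every `policy_delay` steps."""
    if step % policy_delay != 0:
        return
    # actor loss: negative Q-value of the actor's action for the sampled states
    actor_loss = -critic_1(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # soft-update every (current, target) network pair passed in
    for current_net, target_net in targets:
        soft_update(target_net, current_net)
```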

State Space and Action Space
The state space is the agent's only interpretation of its environment. It must therefore be composed of sufficient factors for the agent to make sense of its environment in terms of the problem at hand.
To approximate the desired policy, a continuous state space is designed that supports localization and mobility towards the goal along with obstacle detection. The state space consists of 7 components (Table 1). The action space of the robot consists of 2 components, the velocity of the robot and the steering angle. Both components are continuous in nature and bounded in [−1, 1]; the dimension of the action space is (1 × 2), the resultant velocity and the steering angle respectively. To ensure robustness against external noise, a random noise factor is added to the generated action. This noise is normally distributed with mean 0 and standard deviation 0.1.
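A small sketch of this exploration noise, following the description above; the function name and the clipping back into the [−1, 1] action bounds are illustrative assumptions:

```python
import numpy as np

def noisy_action(action, noise_std=0.1, low=-1.0, high=1.0):
    """Add zero-mean Gaussian noise to a (velocity, steering) action and keep it
    inside the bounded continuous action space [-1, 1]."""
    action = np.asarray(action, dtype=np.float64)
    noise = np.random.normal(loc=0.0, scale=noise_std, size=action.shape)
    return np.clip(action + noise, low, high)

# Example: a forward command with a slight right steer
print(noisy_action([0.8, 0.1]))
```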

Reward Function Design
The reward function sets the foundation of the solution formulation in RL, since the performance of the RL agent depends entirely on the received reward signal. It is therefore a vital component of RL agent training [17].
The reward function binds the environment's state space to the problem under consideration in order to acquire an effective solution. The stability of learning is therefore closely associated with the correlation between the state space and the reward function. Sparsity in the reward function, however, can lead the training system to diverge [18].
The performance of the learning system depends on a stable reward function that can effectively map the state space to the action space in the direction of the goal. To attain compatibility, a continuous reward function is designed for the continuous state space. The designed reward function comprises 4 components, each adding a distinct feature that contributes a step toward the goal: a collision avoidance reward, a closeness to goal reward, a linear velocity reward, and an angular velocity reward.
Collision avoidance is ensured by the collision reward (CR), Eq. (7). It generates a sharp −1 in case of collision with a wall and 1 otherwise.
To add perception of the goal and encourage displacement in the direction of the goal, a closeness to goal reward (CGR) component is incorporated, Eq. (8).
Here the current distance is the perpendicular distance between the car base and the goal, whereas the total distance is the perpendicular distance between the car base and the goal at the initial position of the episode. CGR is bounded between −1 and 1; it approaches −1 if the car moves away from the goal and approaches 1 if the displacement is in the direction of the goal.
To encourage rectilinear motion in the direction of the goal, the linear velocity reward (LVR), Eq. (9), suppresses the velocity in the lateral direction (LVX) and encourages high velocity in the vertical direction (LVY). The resultant velocity of the system is bounded between −1 and 1, so the maximum attainable velocity in either direction (lateral or vertical) is 1 and the minimum is −1 when the other component is suppressed to 0. Consequently, the difference between the vertical and lateral components of linear velocity is bounded between −1 and 1, representing movement in the positive and negative direction of the axis respectively. To avoid local maxima, the weight of the linear velocity component is multiplied by a factor of 2.
To avoid deflection from the straight path, the angular velocity of the car is suppressed. The angular velocity reward component (AVR), Eq. (10), yields a reward of −1 if the angular velocity exceeds an empirically set threshold, and approaches 1 as the angular velocities about all axes are suppressed toward zero.
The sum of all components of the reward function is normalized, Eq. (11), to ensure stability in learning [8] and to obtain a continuous, composite reward (Reward) bounded between −1 and 1, where −1 represents the worst strategy and 1 the best strategy as the car approaches the goal.
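A rough sketch of how such a composite reward could be assembled from the verbal descriptions above; the exact forms of Eqs. (7)-(11), the angular velocity threshold, and the normalization constant are assumptions for illustration, not the paper's equations:

```python
import numpy as np

def composite_reward(collided, current_dist, total_dist,
                     lvx, lvy, angular_velocity, ang_threshold=0.5):
    """Illustrative composite reward built from the four described components."""
    # Collision reward (CR): sharp -1 on collision, 1 otherwise
    cr = -1.0 if collided else 1.0

    # Closeness to goal reward (CGR): grows toward 1 as the car closes on the goal
    cgr = np.clip((total_dist - current_dist) / max(total_dist, 1e-6), -1.0, 1.0)

    # Linear velocity reward (LVR): reward forward (vertical) motion, suppress lateral
    # motion, weighted by a factor of 2 as described in the text
    lvr = 2.0 * np.clip(lvy - abs(lvx), -1.0, 1.0)

    # Angular velocity reward (AVR): -1 beyond the threshold, toward 1 as rotation vanishes
    ang = abs(angular_velocity)
    avr = -1.0 if ang > ang_threshold else 1.0 - ang / ang_threshold

    # Normalize the sum so the composite reward stays in [-1, 1]
    return (cr + cgr + lvr + avr) / 5.0
```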

Experimental Evaluation
This section describes the implementation and training of the designed RL agent. Section A discusses the experimental setup, Section B covers the training specifications of both DDPG and TD3, and Section C compares their training performance.

Experimental Setup
To train and evaluate the performance of the system on a differential drive, the PyBullet racecar model [19] in the OpenAI Gym [20] environment is utilized. The race car is equipped with a LiDAR sensor for obstacle detection; a LiDAR of 100 rays is fixed to the Hokuyo joint, covering the front portion of the car. A forward-moving path environment, bounded by walls on all four sides and following the constraints of a typical Gazebo simulator world model [21], was designed to train the RL agent. The policy approximator (actor) and Q-value function approximators (critics) of the RL agents are modeled as artificial neural networks. The actor network architecture, visualized in Fig. 6, comprises 2 hidden layers along with an input and an output layer. The input layer consists of 414 neurons to accommodate the state space, with ReLU activation to add nonlinearity. Hidden layer 1 is composed of 400 neurons with ReLU activation, and hidden layer 2 of 300 neurons with tanh activation. Finally, the output layer consists of 2 neurons to accommodate the components of the action space.
Similarly, the critic network architecture, shown in Fig. 7, comprises an input layer and 2 hidden layers followed by the output layer. The input layer consists of 416 neurons to accommodate the 414 components of the state space and the 2 components of the action space, with nonlinearity provided by the ReLU activation function. Hidden layer 1 comprises 400 neurons with ReLU nonlinearity, followed by a 300-neuron linear hidden layer 2, converging to a single output neuron that yields the Q-value.
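A minimal PyTorch-style sketch of networks matching these layer sizes; the class names are illustrative, and the final tanh on the actor output is an assumption added here only to keep actions within the stated [−1, 1] bounds:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 414, 2

class Actor(nn.Module):
    """414-dim state -> 400 (ReLU) -> 300 (tanh) -> 2 action components."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.Tanh(),
            nn.Linear(300, ACTION_DIM), nn.Tanh(),  # assumption: bound actions to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """(414-dim state + 2-dim action) -> 400 (ReLU) -> 300 (linear) -> 1 Q-value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 400), nn.ReLU(),
            nn.Linear(400, 300),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```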

Training
Both algorithms, TD3 and DDPG, were implemented and trained in the same experimental setup in order to draw an effective comparison of both performance and training efficiency.

TD3
The training of the implemented TD3 system produced precisely converging graphical results, Fig. 8. The losses of both the actor (Fig. 8a) and critic (Fig. 8b) networks converge effectively. The actor loss plot in Fig. 8a shows the loss decreasing against the training time step, starting from zero and settling around −30 after 150k training steps. Since the actor loss is the negative Q-value of the estimated action for the given state, loss minimization corresponds to Q-value maximization: the lower the actor loss, the higher the Q-value of the estimated action. The critic loss measures the closeness between the estimated target Q-value and the estimated current Q-value; a small difference, i.e., close to zero, indicates convergence. The critic loss plot in Fig. 8b starts from zero, due to identical target and critic networks, and progresses to 4 after 150k steps of training, while the average reward gained increases with the training steps. To estimate the performance of the trained models, they are saved at regular intervals and evaluated in terms of gained reward, as shown in Fig. 8c. High reward-generating peaks and the stable plateau of the episode reward plot are evaluated; the highest reward-producing model, the peak at 26k steps with a reward value of 235, was selected for testing on the unseen environments. The overall training performance can be evaluated from the average reward plot against training steps, Fig. 8d: it starts increasing after a sharp decline in the initial training steps, which reflects the agent exploring the environment and eventually converging in the anticipated direction.

DDPG
Training the DDPG system likewise produced converging graphical results, Fig. 9. The losses of both the actor and critic networks converge effectively while maximizing the received average reward. The best trained policy was attained at 105k training steps, generating an average reward value of 210.69 when evaluated over 10 episodes. The actor loss plot, Fig. 9a, shows the loss decreasing against the training time step, starting from zero and approaching −20 after 200k training steps. The actor loss is the negative Q-value of the estimated action for the given state, so loss minimization corresponds to Q-value maximization. The critic loss measures the closeness between the estimated target Q-value and the estimated current Q-value, with values close to zero indicating convergence. The critic loss plot in Fig. 9b starts from zero, due to identical target and critic networks, and progresses to 1 after 200k steps of training, while the average reward gained increases with the training steps. To estimate the performance of the trained models, the episode rewards at different training steps were evaluated, as shown in Fig. 9c. Models with high reward-generating peaks and the stable plateau of Fig. 9c are evaluated; the highest reward-producing model, the peak at 105k steps with a reward value of 210.69, was selected for testing on the unseen environments. The overall training performance can be evaluated from the average reward plot against training steps in Fig. 9d, whose trend shows the average reward increasing as training progresses, confirming convergence in the anticipated direction.

TD3 Versus DDPG Training Comparison
To conduct a fair training performance comparison between TD3 and DDPG, both algorithms were trained under the same conditions, with the values of corresponding hyperparameters kept equal. Performance was then observed in terms of convergence and reward maximization over the training steps.
Training updates of TD3 and DDPG were based on a batch size of 100 steps. The discount factor γ was set to 0.99 to maximize reward foresight. Neural-network-dependent hyperparameters were tuned accordingly; e.g., the actor update rate was set to 2 for TD3 and 1 for DDPG. Although the architecture of both actor and critic networks was kept the same for both algorithms, the learning rate α differed to aid the convergence of the different setups: for TD3 both actor and critic used a learning rate of 10^−6, whereas the actor and critic learning rates for DDPG were 10^−4 and 10^−2 respectively.
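Collected as a configuration sketch; the values are those reported above, while the dictionary keys are purely illustrative:

```python
# Hyperparameters as reported in the text; keys are illustrative.
COMMON = {
    "batch_size": 100,
    "discount_gamma": 0.99,
}

TD3_CONFIG = {
    **COMMON,
    "actor_update_rate": 2,   # actor updated every 2nd critic update
    "actor_lr": 1e-6,
    "critic_lr": 1e-6,
}

DDPG_CONFIG = {
    **COMMON,
    "actor_update_rate": 1,   # actor updated on every critic update
    "actor_lr": 1e-4,
    "critic_lr": 1e-2,
}
```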
Despite the same training architecture, the convergence performance of the two algorithms differs visibly, as summarized in Table 2. The actor network loss of DDPG converged to −20 and its critic network loss to 1, with a maximum average reward of 165.26 even after 250k training steps. Contrary to that, the actor network loss of TD3 converged to −30 and its critic loss settled near 4, with its highest-reward policy (reward 235) obtained after only 26k training steps.

Results
To analyze the stability and performance of the trained policy, it was tested on several unseen environments. Evaluation was conducted on the basis of reward gained and performance in the simulated environments. The best policies of both DDPG and TD3 were tested under the same conditions, and the performance of each agent is evaluated on a 400-step episode using a reward-per-step plot. The two policies were tested on the three environments described in the following subsections.
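A short sketch of such an evaluation episode, assuming the classic Gym reset/step API; the `policy` callable and function name are placeholders for the trained deterministic policy:

```python
import numpy as np

def evaluate_episode(env, policy, max_steps=400):
    """Run one 400-step evaluation episode and record the reward gained per step."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)                     # deterministic action, no exploration noise
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break
    return np.sum(rewards), rewards                # episode total and per-step trace
```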

Dynamic Goal Environment
The performance of the trained agent was evaluated on a continuous path in which the goal is shifted forward whenever the robot reaches a defined vicinity of the goal; the environment is shown in Fig. 12a. The TD3 reward plot starts below 0.5 and increases toward 1 over the steps, reaching a maximum per-step reward of 0.63. It shows a sharp decline whenever the goal is shifted but rapidly recovers. The episode ended with a total reward of 241.38 over 400 steps. The DDPG reward plot, Fig. 12c, is comparatively noisy, starting from 0.25 and approaching 1 with a maximum experienced reward of 0.58; that episode ended with a total reward of 217.58 over 400 steps.

Boundary Free Environment
The boundary free environment removes the notion of obstacles by removing the boundary walls of the environment, visualized in Fig. 13a. This effectively changes the informative state space values, as the agent experiences no active LiDAR returns, unlike in the training environment. The evaluation results of TD3 in Fig. 13b clearly show that the change of environment does not affect the agent's performance, as evidenced by the gained reward plot: the trend is an increasing reward, starting from around 0.44 and progressing to 0.54 as the agent moves in the direction of the goal without deviating from the straight path. The total reward over the 400-step episode is 238.73. According to the reward plot of the DDPG agent in Fig. 13c, the starting reward is 0.52 and it increases to a maximum of 0.625, with a sharp spike in the initial steps showing the noisy performance of that policy. The total reward gained over the episode is 213.78.

Continuous Path Environment
To evaluate the agent on a continuous path, visualized in Fig. 14a, the front wall of the training environment, the closest obstacle to the goal during training, is removed. This changes the states of the environment experienced by the agent. The performance of the TD3 agent on this environment can be judged from the gained reward plot shown in Fig. 14b: the plot shows an increasing reward, starting from around 0.53 and progressing to 0.625 within the 400-step episode as the agent moves in the direction of the goal. The total reward of the episode is 238.60. The performance of the DDPG agent on the continuous path environment is shown in Fig. 14c, starting from 0.20 and reaching 0.60, though the reward plot is considerably noisier. The total reward of that episode is 213.79.

Conclusion
TD3 was effectively implemented on a differential drive system, using the PyBullet racecar model in the OpenAI Gym environment with a continuous action space. A continuous reward function and state space were designed to attain the required policy. The trained TD3 agent was tested on various unseen environments, including a dynamic goal environment, a boundary free environment, and a continuous path environment. To evaluate its performance, the TD3 agent was compared with a DDPG agent trained and tested under the same settings. According to the training and evaluation results of the designed system, training TD3 for a differential drive is stable and efficient compared to DDPG: the TD3 agent converged earlier, with its maximum-reward policy (reward 235) achieved at 26k training steps, whereas the DDPG agent's maximum-reward policy was achieved at 105k training steps with a reward of 210. The smoothness of the reward plots further indicates that TD3 is more stable than DDPG. Moreover, the correlation between the designed reward function and the state space is confirmed by the effective training of both TD3 and DDPG, and the testing results verify that the trained policy can be adapted effectively to unobserved environments. A future direction of this work is therefore the implementation of the trained policy on hardware. Moreover, this policy can be combined with other primitive policies to generate a composite policy that performs a complex task. Another possible future direction of this research can be its implementation on wheeled

Fig. 1 DDPG critic network update architecture: gray networks are the target networks and colored networks are the current networks. Each epoch updates the target critic networks toward the current critic networks via soft updates

Fig. 3 TD3 critic network update architecture: gray networks are the target networks and colored networks are the current networks. The target critic networks are updated toward the current critic networks via delayed soft updates

Fig. 5 All points and vectors are in the world coordinate frame. The point R(x, y, z) represents the robot position and G(x, y, z) the position of the goal. D is the shortest distance between the robot and the goal, and the LV vector represents the linear velocity of the robot

Fig. 8 TD3 training graphs: a actor loss converging to −30 in 150k training steps, b critic loss close to zero, starting from zero and reaching 4 in 150k training steps, c episode reward over training steps with peaks and a stable plateau, and d average reward during training increasing to a maximum peak of 200 after a sharp decline in the initial steps, indicating exploration followed by convergence in the expected direction

Fig. 10 DDPG reward against training steps, with the highlighted coordinate marking the best reward-generating policy at 105 × 10^3 training steps with reward value 210.69. The trend of the plot is relatively smooth at the start, with noise rising as the training steps increase

Fig. 11 TD3 reward against training steps, with the highlighted coordinate marking the best reward-generating policy at 26 × 10^3 training steps with reward value 235.27. The overall trend is a noisy start leading to smoothness over increasing training steps

Fig. 13 Testing results of TD3 and DDPG on the boundary free environment: a the boundary free environment, b the testing reward plot for TD3, and c the testing reward plot for DDPG

Table 1 State space components

Table 2 Training performance comparison between TD3 and DDPG

Table 3 Testing results of DDPG and TD3 on unseen environments

The testing results of DDPG and TD3 on the unseen environments clarify the performance and stability of the TD3 agent relative to the DDPG agent. The results in Table 3 show the performance of both agents in previously unobserved environments, tested over 50 episodes of 400 steps each. It is evident that TD3 outperformed DDPG in all environments, generating average rewards of 241.38, 238.72, and 238.60 in the dynamic goal, boundary free, and continuous path environments respectively, whereas DDPG generated 217.58, 213.79, and 210.69 in the same environments. Moreover, the training performance comparison of TD3 and DDPG in Table 2 shows that training TD3 is efficient and stable in terms of early convergence and reward maximization: TD3 produced its maximum-reward policy, with a reward value of 235.26, within 26k training steps, resulting in a relatively smooth episode reward plot over training steps, whereas the maximum-reward policy of the DDPG agent was acquired after 105k steps with a maximum reward value of 210.69.