With the continuous development of robotics and artificial intelligence, intelligent robots are playing an increasingly important role across industries, and their types are increasingly diverse: household cleaning robots, shopping-guide robots, automated sorting robots in logistics, and so on. Path planning is the basis of intelligent robot motion and a current research hotspot: it enables a robot to perceive its surroundings through sensors, determine its own pose, and then find an optimal path from its current position to the target position. Traditional path planning algorithms mainly include the Dijkstra algorithm[1], the A* algorithm[2], the ant colony algorithm[3], and various algorithms optimized and improved on these foundations. These algorithms perform well in known environments, but in unknown environments they suffer from slow convergence, long running time, and poor environmental adaptability[4–5].

In recent years, deep reinforcement learning (DRL) has been widely applied in many fields[6–7], and path planning algorithms that incorporate DRL have gradually become a research hotspot. DRL does not require the robot to know the environment in advance; instead, the robot predicts its next action by sensing the surrounding environment. After performing the action, the robot receives a reward fed back by the environment and transitions from its current state to the next state. These steps are repeated until the robot reaches the target point or the set maximum number of steps[8–10]. The Q-learning algorithm[11], one of the traditional reinforcement learning algorithms, stores state-action values in a dynamic Q-table and is suited to scenes with small state-action spaces. In each iteration, an appropriate action is selected from the Q-table and executed, and the Q-table is then updated according to the reward fed back by the environment. As the robot's motion environment becomes more complex and its motion area grows larger, more and more states and actions must be stored, and the capacity of the Q-table increases exponentially; the resulting long Q-table search time seriously affects the robot's learning efficiency[12]. To solve this problem, DeepMind combined neural networks with the Q-learning algorithm and proposed the Deep Q-Network (DQN) algorithm[13]. DQN uses a neural network instead of a Q-table to store the data: it takes the state and action as the input of the neural network and obtains the optimal control strategy through iterative learning. However, DQN is mainly applicable to discrete environments and cannot solve continuous action space problems[14].
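As a concrete illustration of the tabular update that DQN replaces with a neural network, a minimal Q-learning sketch is shown below (the state/action counts, learning rate, and discount factor are illustrative assumptions, not values drawn from any cited work):

```python
import numpy as np

# Minimal tabular Q-learning sketch on a toy problem.
# n_states, n_actions, alpha (learning rate) and gamma (discount) are
# illustrative assumptions.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.9
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-table update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Example: reward of 1.0 for moving from state 0 to state 1 via action 2.
q_update(0, 2, 1.0, 1)
```

The exponential growth problem mentioned above is visible here: the table holds one entry per state-action pair, so enlarging the motion area multiplies the rows directly.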
DeepMind then proposed the Deep Deterministic Policy Gradient (DDPG) algorithm[15], which combines the Actor-Critic framework with DQN and uses neural networks to approximate the policy function and the Q function, so that the output is a deterministic action value. DDPG thus solves the problem that DRL cannot be applied, or performs extremely poorly, on high-dimensional or continuous action tasks, making it one of the more effective path planning algorithms at present. However, because DDPG obtains experience transitions through a uniform random experience replay mechanism during training, experience transitions are under-utilized, resulting in poor environmental adaptability, a low success rate, and slow convergence[16–18].
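The actor-critic update loop at the core of DDPG can be sketched with linear function approximators (the linear models, dimensions, learning rates, and the soft-update coefficient tau below are simplifying assumptions for illustration; the actual algorithm uses neural networks):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 1
gamma, tau = 0.99, 0.005  # discount and soft-update rate (assumed values)

# Linear actor mu(s) = s @ W_a and linear critic Q(s, a) = [s, a] @ w_c.
W_a = rng.normal(scale=0.1, size=(state_dim, action_dim))
w_c = rng.normal(scale=0.1, size=state_dim + action_dim)
W_a_targ, w_c_targ = W_a.copy(), w_c.copy()  # target networks

def critic_q(w, s, a):
    return np.concatenate([s, a]) @ w

def ddpg_step(s, a, r, s_next, lr=1e-2):
    # Target value y = r + gamma * Q'(s', mu'(s')) from the target networks.
    a_next = s_next @ W_a_targ
    y = r + gamma * critic_q(w_c_targ, s_next, a_next)
    # Critic update: one gradient step on (Q(s, a) - y)^2.
    q = critic_q(w_c, s, a)
    grad_c = 2.0 * (q - y) * np.concatenate([s, a])
    w_c_new = w_c - lr * grad_c
    # Actor update: deterministic policy gradient dQ/da * dmu/dW_a.
    dq_da = w_c_new[state_dim:]
    W_a_new = W_a + lr * np.outer(s, dq_da)
    return W_a_new, w_c_new

# One training step on a single transition (s, a, r, s_next).
s = rng.normal(size=state_dim)
a = s @ W_a
s_next = rng.normal(size=state_dim)
W_a, w_c = ddpg_step(s, a, 1.0, s_next)
# Soft target update: theta' <- tau * theta + (1 - tau) * theta'.
W_a_targ = tau * W_a + (1 - tau) * W_a_targ
w_c_targ = tau * w_c + (1 - tau) * w_c_targ
```

In the full algorithm, the transition would be drawn from the replay pool rather than generated on the spot, which is exactly where the sampling strategy discussed next matters.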

The uniform random experience replay mechanism stores the experience \(\left[ {{s_t},{a_t},{r_t},{s_{t+1}}} \right]\) generated by the robot in an experience pool and randomly selects transitions to train the neural networks[19], which breaks the temporal correlation between transitions and accelerates the robot's learning. However, uniform random replay applies the same sampling probability to every transition and ignores the fact that different transitions have different importance for the robot's learning, so highly important experience is not fully exploited, which harms the training efficiency of the neural networks[20]. To solve this problem, DeepMind proposed prioritized experience replay (PER), which measures the priority of an experience by the absolute value of its temporal-difference (TD) error: the larger the absolute TD-error, the more important the experience is considered for learning, and the smaller the absolute TD-error, the less important. The robot can thus focus on highly important transitions, improving experience utilization and speeding up learning[21]. However, PER ignores the effect of immediate rewards and of transitions with small TD-errors on learning, which leads to a lack of transition diversity during the learning process[22]. To improve the utilization of experience transitions, many scholars have conducted in-depth studies and achieved a series of results. Cicek et al.[23] used KL divergence for batch prioritization of experience replay (KLPER), measuring the deviation between the batch-generating policy and the latest policy with the KL divergence of a zero-mean multivariate Gaussian distribution and using it as the batch prioritization criterion.
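The proportional TD-error prioritization described above can be sketched as follows (the exponent alpha, the small constant eps, and the importance-sampling exponent beta follow the common PER formulation, but the specific values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional PER sketch: priority p_i = (|TD-error_i| + eps)^alpha,
# sampling probability P(i) = p_i / sum_j p_j.
alpha, eps = 0.6, 1e-6  # assumed hyperparameters
td_errors = np.array([0.01, 0.5, 2.0, 0.1])  # |TD-error| per stored transition
priorities = (np.abs(td_errors) + eps) ** alpha
probs = priorities / priorities.sum()

# Larger |TD-error| => higher sampling probability.
batch = rng.choice(len(td_errors), size=2, p=probs, replace=False)

# Importance-sampling weights correct the bias of non-uniform sampling.
beta = 0.4  # assumed annealing start value
weights = (len(td_errors) * probs[batch]) ** (-beta)
weights /= weights.max()
```

Note how the transition with the smallest TD-error almost never gets sampled; this is precisely the diversity problem that PER's critics, and the present paper, address.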
Han et al.[24] proposed the Regularly Updated Deterministic (RUD) policy gradient algorithm, which stores experience transitions in the experience pool in blocks so that exploration and learning can be executed alternately and experience utilization improved. Xu et al.[25] proposed a DDPG algorithm with averaged state-action estimation (Averaged-DDPG), which computes action rewards by averaging previously learned Q-value estimates, reducing fluctuation during training and improving the performance of the algorithm. Cao et al.[26] integrated TD-error, Q-value, and data volume, focusing on different importance indicators at different training stages of the neural network and dynamically adjusting each indicator's weight to achieve adaptive estimation of experience importance. Li et al.[27] introduced an intrinsic curiosity module (ICM) to provide intrinsic rewards during the robot's training, combined them with the extrinsic rewards provided by environmental feedback, and then applied prioritized experience replay and transfer learning to improve the success rate and convergence speed of path planning. In summary, DRL algorithms aimed at improving transition utilization efficiency have achieved some results, but problems such as slow convergence and insufficient experience utilization remain.

In this paper, we propose a prioritized experience replay mechanism based on multi-dimensional transition priorities. It gives full consideration to the importance of different experience transitions to the training process, ensuring that highly important transitions are fully learned while preventing low-importance transitions from never being learned, thereby preserving transition diversity while improving transition utilization. The main contributions of this paper are as follows:

(1) The priority of each experience transition is computed separately from the immediate reward, the TD-error, and the loss function of the Actor network; the weighted sum of the three, with weights derived from information entropy, is then taken as the transition's final priority, from which the sampling probability of each transition is calculated;

(2) Giving full consideration to the beneficial effect of positive transitions on the robot's learning process, a higher priority is assigned to positive transitions on top of the calculated final priority, so that positive experience is sampled first and network convergence is accelerated;

(3) To preserve the diversity of experience transitions during training, to ensure that low-priority transitions can still be sampled, and to let the robot learn the latest high-priority transitions as soon as possible, the priority of a positive transition is decreased exponentially after it is sampled.
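The three contributions above could be sketched together as follows. All of the details here are hypothetical placeholders rather than the paper's actual definitions: the normalization of each indicator, the entropy-weighting formula, the positive-transition boost factor, and the decay rate are assumptions made purely for illustration.

```python
import numpy as np

# Toy per-transition indicators (assumed example values).
rewards    = np.array([1.0, -0.1, 0.5, -0.1])   # immediate rewards r_t
td_errors  = np.array([0.2, 1.5, 0.3, 0.05])    # |TD-error| per transition
actor_loss = np.array([0.4, 0.1, 0.8, 0.2])     # Actor-network loss per transition

def normalize(x):
    """Turn an indicator into a distribution over transitions (assumed scheme)."""
    x = np.abs(x)
    return x / x.sum()

P = np.vstack([normalize(rewards), normalize(td_errors), normalize(actor_loss)])

# (1) Entropy-weight method: indicators with lower entropy (more discriminative
# across transitions) receive larger weights.
eps = 1e-12
H = -np.sum(P * np.log(P + eps), axis=1) / np.log(P.shape[1])
w = (1.0 - H) / (1.0 - H).sum()
priority = w @ P  # final multi-dimensional priority per transition

# (2) Boost positive transitions (here: positive immediate reward; the
# boost factor 1.5 is an assumption).
priority = np.where(rewards > 0, 1.5 * priority, priority)

# (3) Exponentially decay a positive transition's priority after sampling
# (decay rate 0.7 is an assumption).
decay, sampled = 0.7, 0
if rewards[sampled] > 0:
    priority[sampled] *= decay

probs = priority / priority.sum()  # sampling probabilities
```

Because the decay only shrinks a positive transition's priority rather than zeroing it, low-priority and already-sampled transitions retain a nonzero chance of being drawn, which is the diversity property contribution (3) aims for.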