A Modified Q-Learning Algorithm for Robot Path Planning in a Digital Twin Assembly System

Abstract: Product assembly is an important stage in complex product manufacturing. How to intelligently plan the assembly process based on dynamic product and environment information has become a pressing issue that needs to be addressed. For this reason, this research constructs a digital twin assembly system, including virtual-real interactive feedback, data fusion analysis, and decision-making iterative optimization modules. In the virtual space, a modified Q-learning algorithm is proposed to solve the path planning problem in product assembly. The proposed algorithm speeds up convergence by adding a dynamic reward function, optimizes the initial Q-table by introducing knowledge and experience through the case-based reasoning (CBR) algorithm, and prevents entry into trapped areas through an obstacle avoiding method. Finally, the six-joint robot UR10 is taken as an example to verify the performance of the algorithm in a three-dimensional pathfinding space. The experimental results show that the modified Q-learning algorithm's pathfinding performance is significantly better than that of the original Q-learning algorithm.

Reinforcement learning is a "trial and error" method: through interaction with the environment, rewards guide behavior so that the agent can obtain the greatest cumulative reward [7]. Q-learning, one of the commonly used algorithms in reinforcement learning, is a value-based algorithm whose Q value is the expected benefit of taking a given action in a given state.
The environment feeds back corresponding rewards based on the agent's actions.
To sum up, the fundamental idea of this algorithm is to select the action with the greatest benefit from the constructed Q-table, whose entries represent the states and actions of the agent. With the application of reinforcement learning, human involvement in the assembly system can be reduced, and the robustness of the system can be increased. It also allows the assembly robot to learn complex behaviors through interaction with the outside environment. In the path planning process, it is quite important for the agent to connect with the surrounding environment and make the right decision on its next behavior, which means the agent can perform a high-precision assembly process instead of requiring human behaviors to be translated into a robot program [8].
The rest of this paper is organized as follows. Section 2 reviews the related research literature. In Section 3, the digital twin framework for assembly is described. The detailed algorithm definition and improvement of the Q-learning model are analyzed in Section 4. Computational experiments of the UR10 robot are described in Section 5.
Finally, conclusions are given in Section 6, together with some discussion of potential future work.

II. Related works
A. Path planning
Path planning has always been a vital part of the production process. It has a great influence on production efficiency and safety. Thus, how to let industrial robots automatically plan their paths has become a very important issue. Path planning can be divided into global path planning and local path planning. In global path planning, the locations of the obstacles are known. Thus, the main mission of global path planning is to build up a model of the environment and then optimize the path over it. Commonly used methods are: the visibility graph method, the free space method, the grid method, Dijkstra's algorithm, and the A* algorithm. However, the accuracy of global path planning is hard to ensure in complex and changing environments. Local path planning, by contrast, requires the industrial robots to collect information and data from the environment throughout the whole assembly process, which requires the robots to have learning ability. The artificial potential field method, fuzzy control algorithms, hierarchical algorithms, neural network algorithms, and reinforcement learning methods are the most commonly used methods in local path planning. Khatib et al. (1986) first introduced path planning with the artificial potential field method, a virtual force method, to the robot field, which abstracts the robot's movement in the surrounding environment as motion in an artificial gravitational field. This method generally yields a smooth and safe path, but may also lead the robot into the problem of local optima [9]. Tsong-Li Lee et al. (2003) introduced a fuzzy algorithm, which can determine the priority of several possible paths, and repeatedly drives the robot toward the final configuration in the highest-priority forward direction [10].
Zuo, L. et al. proposed a hierarchical path planning approach, a new layered path planning method containing two levels of structure. A grid-based A* algorithm was applied at the first level to find the geometric path and select multiple sub-goal path points for the next level. The paper applied Least-Squares Policy Iteration (LSPI), an approximate policy iteration algorithm, at the second level to generate a smooth path under the restriction of the robot's kinematics [11]. Mihai et al. applied neural networks together with reinforcement learning in a new approach to solve the problem of robots in environments with static and dynamic obstacles. This approach allows the robot to be controlled at any speed, avoids local minima, and converges in complex environments [12].
In recent years, researchers have also made great progress in modifying the methods above. Luviano et al. (2017) modified classic multi-agent algorithms so that they do not need to visit every state. To estimate the unknown environment, the paper applied kernel smoothing and neural networks to approximate the greedy actions, and verified the feasibility of this modified algorithm with experiments and simulation [13]. Zhuang et al. (2020) proposed a digital twin-based assembly data management and process traceability approach for complex products, namely the Digital Twin-based Assembly Process Management and Control System (DT-APMCS), and verified the efficiency of the proposed approach [14]. Saeed et al. (2020) generated collision-free paths with grid models and potential functions. In order to find an initial feasible path and avoid obstacles, they developed a boundary node method and a path enhancement method. These two methods were proved feasible by testing in work environments of different degrees of complexity [15].

B. Reinforcement learning
As one of the machine learning methods, reinforcement learning is used to describe and solve the problem of learning strategies that achieve goals or maximize returns in the process of interaction between an agent and its surrounding environment. Samuel introduced the concept of reinforcement learning into the game of checkers, and used a value function represented by a linear function to guide move selection [16]. After years of research, reinforcement learning has developed rapidly. More and more algorithms have been proposed, among which the Q-learning method has been widely used.
Recently, more and more modifications have been applied to Q-learning. Konar, A. et al. (2013) provided a new deterministic Q-learning with presumed knowledge, which used four derived properties to update several entries of the Q-table at a time. Compared with the traditional Q-learning approach, this modified method had lower time complexity and smaller storage requirements [17]. Robots sometimes face the problems of slow convergence and overly long planned paths when applying Q-learning to path planning in an unknown environment. Aiming to solve these problems, Meng Zhao et al. (2020) proposed the Experience-Memory Q-Learning (EMQL) algorithm, which continuously updates the shortest distance from the current state node to the start point [18]. Chunyang Hu et al. (2020) also introduced an experience aggregative reinforcement learning method with Multi-Attribute Decision-Making (MADM) to accomplish real-time obstacle avoidance by decomposing the original obstacle avoidance task into two separate sub-tasks. Each sub-task is trained individually by double Q-learning with a simple reward function. This method can solve the problem that robots can hardly choose an appropriate action in the control task resulting from the exploration-exploitation dilemma [19].

III. Digital Twin Framework for assembly
The digital twin system integrates the information control system and the hardware implementation system, including man-machine interaction. It provides a new virtual assembly mode for industrial production, comprising a virtual control environment and an actual physical system. The integration of the computing process and the physical process maintains the space-time consistency of the virtual model and the equipment entity during motion. This provides strong support for improving robot path planning in unknown environments and for reducing the error between virtual and actual assembly, thereby improving the efficiency of virtual assembly in actual industrial production and making it more convenient for six-axis industrial robots and other new robots to be widely used in industry.

Figure 1 Digital Twin Framework for assembly
The sensing layer contains several measuring, feedback and assembly devices, which have sensing, communication and executing abilities respectively. The sensing devices, such as grating rulers, laser sensors and vision sensors, send timely measurement data to the monitor after sensing the status attribute information of interest to the user; the execution devices, including the flexible assembly part on the assembly robot as well as the whole digital production line, complete the corresponding operations according to the received instructions, such as adjusting posture, drilling, trimming and so on. All these data can be transmitted through the network. The ability to sense and control the physical elements is the basis for realizing interaction with the information system.
The model library, together with the perception of the underlying data, constitutes the assembly situation, including the data of each joint of the assembly robot and the position and size data of each product. Through matching and knowledge reasoning over the assembly task and situation, the corresponding assembly process schemes are formed, and the needed assembly resources are obtained on that basis. The algorithm library sequentially generates the corresponding instruction code and adjustment scheme to drive the assembly equipment. During the real-time operation of the assembly equipment, the feedback equipment receives the corresponding feedback data and interacts with the simulation results. This off-line pre-assembly process and CPS on-line simulation control further ensure smooth assembly.
IV. Theoretical Framework
a) Q-learning
Q-learning is a value-based algorithm in reinforcement learning. Q, also represented as Q(s, a), is the obtainable feedback when taking action a under a certain state s.
The main objective of this algorithm is to obtain the optimal Q value through iteration. A Q-table is created to store the values. The agent selects an action a based on the ε-greedy policy under a randomly given state s. After taking the given action, a new state s' is generated, and a reward r is given by the outside environment. The reward value is then stored in a matrix R for the later calculation of the Q value in the next step. After using the reward to calculate the Q value, the Q value is stored in a new matrix, called P. Finally, the action sequence is generated based on the optimal (i.e. maximum) Q value. The detailed algorithm is shown below.

Table 1 The Q-learning algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using the policy derived from Q (i.e. ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal

In this algorithm, s is the current state, while s' is the next state; a and a' represent the current action and the next action. Q(s, a) is the value corresponding to s and a, and r is the corresponding reward. γ (0 < γ < 1) is a discount factor that reflects the influence of future feedback on the current situation, and α is the learning rate.
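To make the procedure in Table 1 concrete, the following minimal Python sketch implements tabular Q-learning with ε-greedy selection. The environment callbacks step and reward, the episode count, and the parameter values are illustrative assumptions, not the settings used in this study.

import numpy as np

def q_learning(n_states, n_actions, step, reward, goal,
               episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, start=0):
    """Tabular Q-learning as in Table 1.

    step(s, a) -> next state s'; reward(s, a, s_next) -> scalar reward r.
    """
    Q = np.zeros((n_states, n_actions))       # initialize Q(s, a) (zeros here)
    for _ in range(episodes):                 # repeat for each episode
        s = start                             # initialize s
        while s != goal:                      # repeat for each step of episode
            if np.random.rand() < epsilon:    # epsilon-greedy action selection
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next = step(s, a)               # take action a, observe s'
            r = reward(s, a, s_next)          # observe r
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q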

b) Dual reward in improved Q-Learning
In Q-learning, the reward function plays an important part. It helps the agent find the right way toward the goal. With a well-designed reward function, the agent can find the optimal path in a short time with the fewest steps. In most cases, the goal node is represented by an especially large reward value, while the common nodes that contain neither the goal nor obstacles have only a small positive value or even zero. The nodes that represent obstacles have a negative value, which prevents collisions between the agent and obstacles. However, with this kind of static reward function alone, the agent is easily trapped in blind searching. Therefore, in this paper, a dual reward function, which includes a static reward function and a dynamic reward function, is introduced.

 The static reward function
In most cases, the state nodes can be divided into obstacle nodes, common nodes and the goal node. In this research, the static reward function only distinguishes two kinds of nodes, common nodes and the target node (obstacle nodes are handled separately by the obstacle avoiding method in section d). The static reward function is shown below.

r_s = 0, if the node is a common node
r_s = 10, if the node is the target node

 The dynamic reward function
In an unknown environment, a robot with only a static reward function is easily trapped in blind searching. For example, when the potential next state nodes of the robot are all common nodes, the robot is neither punished nor highly rewarded no matter what action it takes next. It can choose an action in any direction, or even move backwards, which increases the distance between the robot and the goal node. Therefore, a dynamic reward function is applied, based on the distance between the current node and the target node. The dynamic reward function is shown below.
r_d = λ(d_t − d_{t+1})

In this equation, d_t represents the distance between the current state node and the target node, and d_{t+1} represents the distance between the next state node and the target node. λ is a factor tied to the static reward, since it determines the proportion of the dynamic reward in the total reward r = r_s + r_d that the agent receives. In this research, we set λ to 0.1.
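A minimal Python sketch of this dual reward follows. It assumes the reconstructed form r_d = λ(d_t − d_{t+1}) given above and Euclidean distance on grid coordinates; both are illustrative assumptions rather than the paper's exact implementation.

import math

LAMBDA = 0.1  # proportion factor for the dynamic reward (0.1 in this research)

def static_reward(node, target):
    # 10 for the target node, 0 for a common node
    return 10.0 if node == target else 0.0

def dual_reward(current, nxt, target):
    """Static reward plus distance-based dynamic reward.

    d_t is the distance from the current node to the target; d_{t+1} is the
    distance from the next node. Moving closer yields a positive dynamic reward.
    """
    d_t = math.dist(current, target)
    d_t1 = math.dist(nxt, target)
    return static_reward(nxt, target) + LAMBDA * (d_t - d_t1)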

c) Q-table optimized by case-based reasoning
In the training process of Q-learning, the initial parameter values are difficult to choose. The reward matrix in the Q-learning algorithm is a constant matrix, and the initial Q-table is often initialized to a zero-valued matrix. Therefore, the algorithm is more sensitive to the parameter values in its early stage, and these values are often given directly based on empirical experience. However, the same initial parameter values cannot be applied to networks of different scales and densities, so multiple experiments are required in different networks to determine the parameter values. Moreover, a large space must be explored. At the beginning of the algorithm, lacking the guidance of prior knowledge, the agent stays in the random exploration stage for a long time, and it is difficult to approach the optimal solution quickly. At the same time, random exploration may cause the algorithm to converge prematurely to a local optimum, and under the positive feedback effect of the roulette exploration strategy, exploration of the sub-optimal path is reinforced, so the algorithm fails to jump out of the local optimum. In order to make better use of the experience of historical cases, we use the case-based reasoning method to initialize the Q-table.

Figure 2 Flow chart of Q-table optimization based on CBR
The case-based reasoning method has the capability of analogical learning, which relies on past experience for learning and problem solving. A case library is constructed to store historical path planning cases. The solution to a new case can be obtained by modifying the case in the case library that is most similar to the current situation. First, the information used or generated in path planning is obtained. Each case in the case library contains information such as the task, the environment, and the path generated during each path planning run. Each case can be retrieved through a specific index, whose value is generated from the current planning task, environment and other information.
By matching the context and task of the current task with the cases in the library, and evaluating and correcting the matched candidate cases, the Q-table used by the case with the best performance is finally taken as the initial Q-table of the current task. At the same time, the case-based planning method continuously expands and updates the case library through iterative learning. In this study, the task similarity is calculated as follows:

sim(π_1, π_2) = N(π_1 ∩ π_2) / max(N(π_1), N(π_2))

where π_1 and π_2 respectively represent the collections of obstacle points in the two scenes, N(π_1 ∩ π_2) represents the number of points in the area covered by obstacles in both scenes, and max(N(π_1), N(π_2)) represents the number of points in the obstacle set with the larger coverage area.
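The similarity measure and the Q-table retrieval can be sketched directly from the definitions above. In the following minimal Python sketch, the case library layout (a list of obstacle-set/Q-table pairs) and the retrieval threshold are assumptions for illustration.

import numpy as np

def task_similarity(pi1, pi2):
    """sim(pi1, pi2) = N(pi1 ∩ pi2) / max(N(pi1), N(pi2)).

    pi1, pi2: Python sets of obstacle grid coordinates in the two scenes.
    """
    if not pi1 or not pi2:
        return 0.0
    return len(pi1 & pi2) / max(len(pi1), len(pi2))

def initial_q_table(case_library, obstacles, n_states, n_actions, threshold=0.6):
    """Return the Q-table of the most similar stored case, else a zero table.

    case_library: list of (obstacle_set, q_table) pairs from past plans.
    """
    best_sim, best_q = 0.0, None
    for case_obstacles, q_table in case_library:
        sim = task_similarity(obstacles, case_obstacles)
        if sim > best_sim:
            best_sim, best_q = sim, q_table
    if best_q is not None and best_sim >= threshold:
        return best_q.copy()                  # reuse the similar case's Q-table
    return np.zeros((n_states, n_actions))    # no similar case: zero initialization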

d) Obstacle avoiding method
In some cases, a large negative reward is assigned to the obstacle nodes in the reward function in order to keep the robot from encountering obstacles. However, this research instead records the obstacle nodes and, every time the agent is about to move to the next state, first determines the feasible actions and states. With this method, the possibility of hitting an obstacle is reduced compared with giving the obstacle nodes a large negative reward, since the Q-learning method will sometimes choose actions and states randomly, ignoring the reward values.
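A minimal sketch of this feasibility check is given below, assuming a 3D grid workspace with six axis-aligned moves; the action set and data layout are illustrative assumptions.

# Six axis-aligned moves in a 3D grid (illustrative action set)
ACTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def feasible_actions(state, obstacles, bounds):
    """Return indices of actions whose next state stays inside the workspace
    and is not a recorded obstacle node."""
    ok = []
    for i, (dx, dy, dz) in enumerate(ACTIONS):
        nxt = (state[0] + dx, state[1] + dy, state[2] + dz)
        if all(0 <= nxt[k] < bounds[k] for k in range(3)) and nxt not in obstacles:
            ok.append(i)
    return ok

Restricting the ε-greedy choice in the Q-learning loop to the indices returned by feasible_actions keeps even the random exploration steps away from recorded obstacle nodes.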

V. Experiments and analysis
In order to verify the effect of path planning based on the improved Q-learning algorithm proposed in the previous section, the manipulator and its working environment in the assembly digital twin system were modeled and tested. For a specific assembly task, the end-effector of the six-joint industrial robot needs to start from the starting coordinate point in its workspace, avoid possible obstacles, and plan a path to finally reach the target coordinate point. This study selects the path planning scenario shown in the figure below, and takes the typical six-joint robot UR10 as an example to explore its pathfinding performance in this scenario. The virtual environment contains obstacles and targets in the shape of rectangular solids. Although most obstacles and targets in reality have various irregular shapes, we modeled them as rectangular solids for convenient calculation (a sketch of this rasterization is given at the end of this section). The constructed virtual environment is shown below. The software and hardware environments for the system implementation are as follows: (1) Application server: a ThinkPad desktop with an Intel Core i7 CPU (3.40 GHz), 32 GB memory, a 2 TB SCSI HD, and a Windows Server operating system; (2) Programming platform: PyCharm and Python 3.
As shown in Figure 4, when λ is set to 0, the learning algorithm degenerates into the original Q-learning algorithm. Therefore, if λ is too small, the effective gain of each update is annihilated quickly. If λ is too large, the number of repeated steps explodes and the search tends to become locally optimal, so the iterations of some rounds do not converge and the movement is repeated in the trap area. With an equivalent number of iterations, in scenes with trap-area obstacles, the pathfinding performance of the Q-learning algorithm with the dynamic reward function and CBR is significantly better than that of the original Q-learning algorithm.
Figure 6 The GUI of manipulator modeling
In order to realize the centralized management of distributed resources by the cloud service architecture, and to meet the assembly requirements of centralized resources for decentralized services, a prototype system was developed using the web-based B/S (Browser/Server) architecture model. Figure 6 shows an example of editing the UR10 in the equipment model library within the integrated V-REP modeling software.
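For reference, the rasterization of rectangular-solid obstacles onto the pathfinding grid mentioned above can be sketched as follows; the box corner coordinates and workspace size are illustrative, not the experimental scene.

from itertools import product

def box_to_nodes(lo, hi):
    """Grid nodes covered by an axis-aligned rectangular solid with opposite
    corners lo=(x0, y0, z0) and hi=(x1, y1, z1), inclusive."""
    return set(product(range(lo[0], hi[0] + 1),
                       range(lo[1], hi[1] + 1),
                       range(lo[2], hi[2] + 1)))

# Illustrative scene: two box obstacles in a 20 x 20 x 20 workspace
obstacles = box_to_nodes((3, 3, 0), (5, 5, 8)) | box_to_nodes((10, 8, 0), (12, 14, 6))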

VI. Conclusions
The assembly path planning of manipulators plays an increasingly important role in modern manufacturing. The digital twin system can provide an intelligent assembly paradigm, which realizes the real-time synchronization of virtual assembly and actual assembly through the integration of the intelligent computing process and the physical process. In this research, a digital twin system for manipulator assembly is constructed, which includes functions such as real-time perception, data exchange, communication, intelligent decision-making, and virtual-real interaction. Moreover, this research also proposes a modified Q-learning algorithm, which speeds up convergence by adding a dynamic reward function, optimizes the initial Q-table by introducing knowledge and experience through the CBR algorithm, and prevents entry into trapped areas through the obstacle avoiding method. Finally, the six-joint robot UR10 is taken as an example to verify the performance of the algorithm in the three-dimensional pathfinding space. The experimental results show that the improved Q-learning algorithm's pathfinding performance is significantly better than that of the original Q-learning algorithm.
As far as future research is concerned, there is great potential for extending the current model and solution algorithm. The proposed algorithm can be improved in the following respects: in theoretical research, a more in-depth discussion of specific scenario modeling methods and control algorithms is still needed; application exploration also needs to be expanded from more perspectives, and the research on the system architecture needs to be more detailed.