Adaptive disassembly sequence planning for VR maintenance training via deep reinforcement learning

VR training equipped with meta-heuristic disassembly planning algorithms has been widely applied in pre-employment training in recent years. However, these algorithms are usually authored for specific sequences of a single product, and it remains a challenge to generalize them to maintenance training with unpredictable disassembly targets. As a promising method for settling dynamic and stochastic problems, deep reinforcement learning (DRL) provides new insight into dynamically generating optimal sequences. This study introduces the deep Q-network (DQN), a successful DRL method, to fulfill adaptive disassembly sequence planning (DSP) for VR maintenance training. A disassembly Petri net is established to describe the disassembly process, and the DSP problem is then defined as a Markov decision process that can be solved by DQN. Two neural networks are designed and updated asynchronously, and the training of DQN is further achieved by backpropagation of errors. In particular, we replace the long-term return in DQN with the fitness function of the genetic algorithm to avoid dependence on the immediate reward. Several experiments have been carried out to demonstrate the great potential of our method in on-site maintenance where faults are uncertain.


Introduction
Given the high proportion of maintenance costs in the lifecycle costs of products, seeking efficient maintenance schemes has become a pressing need [1]. Due to the uncertainty of damaged parts caused by the randomness of faults, current maintenance methods rely heavily on the accurate memory and rich experience of workers, which leads to great instability and low efficiency. With the development and maturity of human-computer interaction technology in industry, VR training equipped with disassembly planning algorithms has become a promising alternative [2][3][4][5]. These algorithms are usually authored for specific sequences and stored locally or on a server, called when an external command is received, and visualized by a VR device. As product complexity increases, many intelligent algorithms have been introduced into the field of disassembly sequence planning (DSP) to improve disassembly efficiency [6].
Over the past decade, a great number of meta-heuristic methods have been developed for DSP problems [7], including genetic algorithm (GA) [8][9][10][11], artificial bee colony (ABC) [12][13][14], ant colony optimization (ACO) [15,16], particle swarm optimization (PSO) [17,18], simulated annealing (SA) [14], and tabu search (TS) [19]. These methods are independent of specific problems and show acceptable performance in searching for optimal or near-optimal solutions and in generalization. Recently, several works have integrated the advantages of different algorithms, attempting to make up for the limitations of classic algorithms in local search ability and convergence speed. Tao et al. [20] presented a TS-based hyper-heuristic algorithm with an exponentially decreasing diversity management strategy, which proved more efficient for complex DSP problems. Inspired by the regenerative properties of the flatworm, Tseng et al. [21] proposed a novel flatworm algorithm with effective mechanisms for avoiding the local optima of DSP problems. Moreover, other novel algorithms such as artificial fish swarm (AFS) [22], fireworks algorithm (FWA) [23], firefly algorithm (FA) [24], and flower pollination algorithm (FPA) [25] have successively been proposed for solving DSP problems. Most of the swarm intelligence and evolutionary algorithms above are inspired by random phenomena in nature and are widely applied to goal-determined DSP tasks. However, these traditional meta-heuristics are ill-suited to on-site maintenance tasks, owing to the unpredictability of damaged targets.
In view of this, Tuncel et al. [26] applied a Monte Carlo-based reinforcement learning method to flexible disassembly problems, opening a new approach to solving DSP problems in stochastic environments. Nonetheless, as with Xia et al. [27], who employed the Q-learning method for the selective disassembly of waste electrical and electronic equipment, a memory explosion is inevitable as product complexity grows. By introducing deep neural networks to approximate the action-value function [28], deep reinforcement learning (DRL) methods have achieved significant breakthroughs in several fields such as games [29][30][31][32], path planning [33][34][35][36], and dynamic scheduling [37][38][39]. Even so, when applied to DSP problems, it remains quite difficult to associate actions with returns, since the reward is only observed after a series of disassembly operations; this is referred to as the temporal credit assignment problem under sparse rewards.
To address all the issues mentioned above, this paper presents an improved deep Q-network (DQN) optimized by GA, namely GO-DQN, for adaptive disassembly sequence planning (ADSP) in a VR maintenance training system. Firstly, a disassembly Petri net (DPN) is utilized to model the disassembly process. Then, an improved DQN is employed to solve for sequences by defining the DSP problem as a Markov decision process (MDP). Specifically, the fitness function of GA takes over from the long-term return in DQN to avoid dependence on the immediate reward. Finally, a VR maintenance training system is established to verify the effectiveness of the proposed method within a laboratory environment. Experimental results demonstrate the potential of the proposed method to improve on-site maintenance work.

DRL method for adaptive disassembly planning
Taking into account the uncertainty of the targets to be disassembled at the maintenance site, this section presents an approach to ADSP based on DQN.

Modeling for the disassembly process
To describe the disassembly process during training, we define a nine-tuple DPN as follows:

DPN = (P, T, I, O, u₀, G, C, D, W)

where:
1. P is the set of places, which denotes the disassembly compositions at each step, i.e., it represents the product at the root node, the subassemblies at intermediate nodes, and the components at the leaves.
2. T is the set of transitions, where each transition denotes a disassembly operation on a component, with P ∩ T = ∅ and P ∪ T ≠ ∅.
3. I_{n×m}: P × T → ℕ is the n × m matrix that indexes all directed arcs from places to transitions, where n denotes the number of places and m the number of transitions.
4. O_{n×m}: T × P → ℕ is the n × m matrix that indexes all directed arcs from transitions to places.
5. u₀ is an n-dimensional vector representing the initial state of disassembly, in which the product is intact and no parts have been disassembled.
6. G is the time gradient representing the time spent on each disassembly operation; to make the process easy to calculate, it is divided into 16 grades ranging from 30 s to 30 min.
7. C is the disassembly cost, covering the tools used and changes of direction; analogous to the time, the costs are divided into 4 gradients.
8. D is the degree of disassembly difficulty; for convenience, it is divided into 5 levels: very easy, easy, normal, hard, and very hard.
9. W is the set of weights on the above three indexes.
Let E be the evaluation of the disassembly process; it can then be computed as the weighted sum of the three indexes over the executed operations:

E = ∑_t (w₁G_t + w₂C_t + w₃D_t)

A simple example of a product is shown in Fig. 1, together with its corresponding DPN and the evaluation of a sample disassembly order.
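To make the model concrete, the following Python sketch shows one possible encoding of the DPN and its evaluation; the class layout, method names, and the linear weighting of the three indexes are our illustrative assumptions rather than the system's actual implementation.

```python
# A minimal sketch of the DPN described above; names are illustrative.
import numpy as np

class DPN:
    def __init__(self, I, O, u0, G, C, D, weights):
        self.I = np.asarray(I)       # n x m input-arc matrix (places -> transitions)
        self.O = np.asarray(O)       # n x m output-arc matrix (transitions -> places)
        self.u0 = np.asarray(u0)     # initial marking: product intact
        self.G, self.C, self.D = G, C, D   # time, cost, difficulty per transition
        self.w1, self.w2, self.w3 = weights

    def enabled(self, marking):
        """Indices of transitions whose input places all hold tokens."""
        return [t for t in range(self.I.shape[1])
                if np.all(marking >= self.I[:, t])]

    def fire(self, marking, t):
        """Fire transition t: consume input tokens, produce output tokens."""
        return marking - self.I[:, t] + self.O[:, t]

    def evaluate(self, sequence):
        """Weighted sum of time, cost, and difficulty over a disassembly sequence."""
        return sum(self.w1 * self.G[t] + self.w2 * self.C[t] + self.w3 * self.D[t]
                   for t in sequence)
```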

Markov decision process
Constructing an environment in which a DRL agent can take actions and gain rewards is of great importance when applying the DRL method to ADSP. Furthermore, the newly established environment must conform to the MDP framework.
The MDP framework describes the environment with a 5-tuple ⟨S, A, P, R, γ⟩, where S indicates the finite state space, A the finite action space, P the probability of a state transition, R the reward, and γ the discount factor. For the disassembly problem in VR maintenance training, the DPN can be used as a bridge to convert it into an MDP. As mentioned above, the disassembly states can be expressed by the set of places (P) in the DPN, where the presence of a token in a place indicates whether the corresponding part can be disassembled. By establishing a mapping from the places (P) to the state space (S), the disassembly states are transformed into the finite state space of the MDP framework. Similarly, the disassembly operations are conveyed by the transitions (T) in the DPN as well as the action space (A) in the MDP. All possible disassembly states of the agent's environment are described as a set S, with the current state represented by S_t ∈ S. During the interaction process, the action A_t is selected and executed from the set of all possible disassembly operations A_{S_t}. After that, the environment returns a reward value R_t ∈ R and reaches a new state S_{t+1}. Since the probability and reward of transitioning to the next disassembly state depend only on the current disassembly operation and state, and not on historical ones, the disassembly task fulfills the MDP condition.
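As a toy illustration of this mapping (our own encoding, not necessarily the authors'), the marking, i.e., the token vector over the places P, can serve directly as the MDP state, and the transitions enabled by that marking form the admissible action set A_{S_t}:

```python
# Toy sketch of the DPN-to-MDP mapping: the marking is the state, and the
# enabled transitions are the admissible actions. Names are illustrative.
import numpy as np

def to_state(marking):
    return tuple(int(x) for x in marking)        # hashable element of S

def admissible_actions(marking, I):
    """Transitions whose input places (columns of I) all carry tokens."""
    return [t for t in range(I.shape[1]) if np.all(marking >= I[:, t])]

I = np.array([[1, 0],
              [0, 1]])                            # two places, two transitions
marking = np.array([1, 0])                        # token only in place p1
print(to_state(marking), admissible_actions(marking, I))  # (1, 0) [0]
```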
Further, we define the cumulative reward as the return G_t, which represents the cumulative sum of the reward R from the current moment until the end of the task. Since the impact of a reward on the action gradually weakens over time, G_t is defined as follows:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = ∑_{k=0}^{∞} γ^k R_{t+k+1}

where the discount rate γ (0 < γ ≤ 1) denotes the present value of future rewards.
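For clarity, a minimal sketch of computing G_t over a finite episode follows (the function name is ours):

```python
# Discounted return: fold the episode's rewards backwards so that
# G_t = R_{t+1} + gamma * G_{t+1}, matching the definition above.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0.9801 with gamma = 0.99
```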

Sequence planning using DQN
To effectively trade off exploration against exploitation during the disassembly process, the ε-greedy algorithm is adopted to select the action a. After a is executed in the state s, a new state s′ is reached. The reward R_i of the action a, which represents the i-th operation in the state s, can then be computed from the evaluation indexes defined in the DPN. The state-action value function Q(s, a) is then updated with the reward R_i according to the Q-learning rule

Q(s, a) ← Q(s, a) + α[R_i + γ max_{a′} Q(s′, a′) − Q(s, a)]

where α is the learning rate. On this basis, we construct two neural networks to perform error backpropagation and update the weights: one generates the value-function approximation of the current state, called Q_evaluate_net, while the other generates that of the next state, called Q_target_net. Both networks can be expressed as Q(s, a, θ), where θ is the parameter of the fitting function, representing the weights of the connections between two layers of neurons in the network.
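A minimal sketch of the ε-greedy selection over the admissible actions is given below; the function and variable names are our assumptions:

```python
# Epsilon-greedy: explore with probability epsilon, otherwise exploit the
# action with the highest estimated Q-value among the admissible ones.
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)               # exploration
    return max(actions, key=lambda a: q_values[a])  # exploitation

q = {0: 0.2, 1: 0.8, 2: -0.1}
print(epsilon_greedy(q, actions=[0, 1, 2]))         # usually returns 1
```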
As shown in Fig. 2, the Q_evaluate_net outputs Q_evaluate(s, a, θ₁), while the Q_target_net evaluates the action for the next state that yields arg max_{a′} Q(s′, a′, θ₁) in the Q_evaluate_net, i.e., Q_target(s′, a′, θ₂). Based on this, the action a is selected by the ε-greedy algorithm and executed in the state s, receiving the reward R and transferring to the next state s′, all of which are stored in the replay memory D as a tuple (s, a, R, s′). During this process, T mini-batch samples are randomly extracted from D to update θ₁. After that, the fitting action value y_i = R + γ max_{a′} Q(s′, a′, θ₂) can be calculated, and the loss is further computed as

L(θ₁) = E[(y_i − Q(s, a, θ₁))²]

Then, the gradient descent method is applied to update θ₁ of the Q_evaluate_net, and θ₂ is updated as θ₂ = θ₁ after a fixed interval.
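The two-network update can be condensed into the following sketch, assuming a PyTorch implementation (the paper does not name a framework); q_eval and q_target stand for hypothetical networks mapping a batch of states to per-action Q-values:

```python
import torch
import torch.nn as nn

def dqn_update(q_eval, q_target, optimizer, batch, gamma=0.99):
    s, a, r, s_next = batch                                  # from replay memory D
    q_sa = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a, theta_1)
    with torch.no_grad():
        y = r + gamma * q_target(s_next).max(dim=1).values   # y_i = R + gamma*max Q'
    loss = nn.functional.mse_loss(q_sa, y)                   # L(theta_1)
    optimizer.zero_grad()
    loss.backward()                                          # backpropagate errors
    optimizer.step()                                         # update theta_1
    return loss.item()

def sync_target(q_eval, q_target):
    q_target.load_state_dict(q_eval.state_dict())            # theta_2 = theta_1
```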

Optimization of DQN based on GA
In DSP problems, the agent receives a reward only when the entire disassembly process is finished, which makes it quite difficult to associate each action with rewards. GA has a powerful ability to address temporal credit assignment: by using a fitness indicator that merges the returns of an entire episode, GA is indifferent to the sparsity of the reward distribution and robust over long-term horizons [40]. Inspired by this, we propose an improved DQN optimized by GA, named GO-DQN, for solving the optimal sequence in a VR training system, which uses the return in RL as the fitness function

F = (1/N) ∑_{i=1}^{N} G_i,  with G_i = ∑_{t=0}^{l} γ^t R_t

where N denotes the number of episodes, l denotes the length of an episode, and G_i denotes the total return of episode i. Figure 3 illustrates the workflow of GO-DQN, which mainly includes the following four steps (a condensed sketch is given after Step 4):

Step 1: initialization. A population of Q_evaluate_nets is initialized as pop_k, together with an empty cyclic replay buffer D. Moreover, one additional Q_evaluate_net and a Q_target_net are initialized as Q_eva and Q_tag with weights θ_eva and θ_tag, respectively.
Step 2: elite selection. The individuals in the population repeatedly interact with the environment, and the cumulative reward is used as each individual's fitness to select the elites.

Step 3: crossover and mutation. Individuals are selected from the population by segmented weight selection as parents and, together with the elites, undergo crossover and mutation to create the next generation of Q_evaluate_nets.
Step 4: network update. Periodically, we sample a random mini-batch of T transitions (s_i, a_i, R_i, s′_i) from D and compute the action-value target

y_i = R_i + γ max_{a′} Q_tag(s′_i, a′, θ_tag)

Then, the loss function is computed as

L(θ_eva) = E[(y_i − Q_eva(s_i, a_i, θ_eva))²]

Thereby Q_eva can be updated by minimizing the loss, and its weights are updated as well.
Finally, we copy the updated Q_evaluate_net Q_eva into the population pop_k, and for the optimum Q_eva ∈ pop_k: Q_eva ⇒ Q_tag.
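The condensed sketch below ties Steps 1 to 4 together on a toy objective; fitness, crossover, mutate, and gradient_step are simplified stand-ins for the episodic-return fitness, segmented weight selection, Gaussian-noise mutation, and DQN mini-batch update described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(w):
    # Stand-in for the mean episodic return; real GO-DQN rolls out the net.
    return -float(np.sum((w - 1.0) ** 2))

def crossover(p1, p2):
    mask = rng.random(p1.shape) < 0.5            # uniform crossover of weights
    return np.where(mask, p1, p2)

def mutate(w, frac=0.1, strength=0.1):
    hit = rng.random(w.shape) < frac             # mutate a fraction of weights
    return w + hit * rng.normal(0.0, strength, w.shape)  # Gaussian noise

def gradient_step(w, lr=0.1):
    return w - lr * 2.0 * (w - 1.0)              # stand-in for the DQN update

pop = [rng.normal(size=8) for _ in range(10)]    # Step 1: init population
q_eva = rng.normal(size=8)                       # extra Q_evaluate_net
for gen in range(50):
    pop.sort(key=fitness, reverse=True)          # Step 2: rank, keep elites
    elites = pop[:1]
    children = []
    while len(elites) + len(children) < len(pop) - 1:
        i, j = rng.choice(len(pop), size=2, replace=False)
        children.append(mutate(crossover(pop[i], pop[j])))   # Step 3
    q_eva = gradient_step(q_eva)                 # Step 4: gradient update
    pop = elites + children + [q_eva.copy()]     # inject Q_eva into pop_k
print(round(fitness(max(pop, key=fitness)), 4))  # best fitness approaches 0
```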

System design and implementation
Embedding the GO-DQN method, the overall architecture of the training system consists of three major layers: the data layer, the function layer, and the interaction layer, as illustrated in Fig. 4. Three-dimensional models and the interference matrices of the industrial products to be disassembled are prepared in advance and stored in the data layer, where they are used for training the RL models. Once the sequences are generated in the function layer, the system automatically creates disassembly animations for demonstration and stores them in the data layer; VR guidance instructions for training and evaluation are created as well. The interaction layer supplies an interactive interface that allows users to specify their input preferences and receive the corresponding feedback. Other VR modules, such as collision detection, haptic feedback, tracking, and rendering, have been developed to realize immersive training.
To conduct the data transmission between function layers, a server-client communication module has been established to integrate the disassembly sequence planning module with the VR application. A local server with Python installed is set up to handle requests from Unity.
When Unity receives inputs from the user, it sends requests to the server via the UnityWebRequest system. The server then calls the Python scripts for sequence calculation and returns the results to Unity in JSON format with the help of the Flask web framework. The results are further presented through the user interface (UI) and simultaneously communicated to the HTC VIVE to achieve multi-channel information feedback, as shown in Fig. 5.
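A minimal sketch of such a server endpoint is shown below; the route name, request fields, and the plan_sequence placeholder are our assumptions, since the paper only specifies that Unity posts requests and receives JSON responses via Flask:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def plan_sequence(targets, excluded):
    # Placeholder for the GO-DQN planner; returns a canned sequence here.
    return [25, 24, 23, 22, 21, 11, 10, 9]

@app.route("/plan", methods=["POST"])
def plan():
    data = request.get_json(force=True)                  # request sent by Unity
    seq = plan_sequence(data.get("targets", []), data.get("excluded", []))
    return jsonify({"sequence": seq})                    # JSON consumed by Unity

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)                 # local Python server
```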
Given the poor adaptability and the waste of storage space caused by storing disassembly animations locally in most VR systems, this study achieves automatic generation of disassembly animations by obtaining, from the server, prefabricated templates associated with the parts to be disassembled. Specifically, the 3D model of each part to be disassembled in the input product is stored as a Unity template and matched with the corresponding interference matrix. After the optimal sequence is generated, the system successively calls the templates stored in the Unity resource folder according to the order of the parts in the sequence and determines the removal direction (one of six: +x, +y, +z, −x, −y, −z) of each part according to the interference matrix. Finally, by combining all the called templates, the automatic synthesis of disassembly animations is achieved.
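The removal-direction lookup could be sketched as follows, assuming the interference matrix records, for each part and direction, which other parts block its motion (a common encoding; the paper's exact layout may differ):

```python
# Sketch of choosing a removal direction from an interference matrix.
import numpy as np

DIRECTIONS = ["+x", "+y", "+z", "-x", "-y", "-z"]

def removal_direction(part, interference, removed):
    """Return the first direction along which `part` collides with no
    still-installed part; interference[part, d, j] is 1 if part j blocks
    motion of `part` along direction d."""
    for d, name in enumerate(DIRECTIONS):
        blockers = np.flatnonzero(interference[part, d])
        if all(j in removed for j in blockers):
            return name
    return None   # no feasible direction: the part cannot be removed yet

# Toy case: part 0 is blocked along +x by part 1, but free along +y.
inter = np.zeros((2, 6, 2), dtype=int)
inter[0, 0, 1] = 1
print(removal_direction(0, inter, removed=set()))  # "+y"
```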
To provide the best experience, multi-sensory feedback integrating visual, auditory, and haptic channels is realized in this system. Visual feedback includes feedback on part status and text information; the former is achieved by a selected-part highlighting algorithm. When the virtual hand touches the part to be disassembled, its outline is highlighted to indicate that the selection is correct. After the disassembly operation is completed, the part returns to its original state, and the panel prompts the user for the next operation. If the user makes a wrong choice, the handle vibrates as a warning and the helmet gives voice prompts at the same time, as shown in Fig. 6.

Experimental demonstration on aircraft engine
The model of the aircraft engine used in this demonstration is shown in Fig. 7 and consists of 25 parts. To train the RL model based on the proposed method, a modified MountainCar-v0 environment has been developed based on OpenAI Gym. Each step is defined as an instance in which the agent takes an action and receives a reward from the environment, and all the steps taken by the Q_evaluate_nets in the population are accumulated. During each training generation, the Q_evaluate_net with the highest fitness is selected as the best and then tested on 5 task instances, with the average score logged as its performance.
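A skeleton of such an environment, patterned on the classic gym.Env API and reusing the DPN sketch from the modeling section, might look as follows; the reward shaping and termination condition are illustrative assumptions rather than the authors' exact modification:

```python
import gym
import numpy as np
from gym import spaces

class DisassemblyEnv(gym.Env):
    """Toy environment: the DPN marking is the observation, and each
    transition (disassembly operation) is a discrete action."""

    def __init__(self, dpn, targets):
        self.dpn, self.targets = dpn, targets
        n, m = dpn.I.shape
        self.observation_space = spaces.MultiBinary(n)   # token vector over P
        self.action_space = spaces.Discrete(m)           # one action per transition

    def reset(self):
        self.marking = self.dpn.u0.copy()                # product intact
        return self.marking

    def step(self, action):
        if not np.all(self.marking >= self.dpn.I[:, action]):
            return self.marking, -1.0, False, {}         # infeasible operation
        self.marking = self.dpn.fire(self.marking, action)
        done = bool(all(self.marking[p] for p in self.targets))
        reward = 1.0 if done else -0.01                  # sparse terminal reward
        return self.marking, reward, done, {}
```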
Then, three experiments were designed to verify the effectiveness of our method. In experiment I, we select the rear main bearing (#9) as the only disassembly target. For experiment II, we select the nose cone (#2), the rear main bearing (#9), and the tapped covers (#12) as the disassembly targets with no additional preferences. For experiment III, we add preferences that exclude the cam pulleys (#22) and the exhaust pipes (#11).

Results and discussion
As for on-site maintenance, it is usually desirable to complete the training in a shorter time and at lower cost, thereby fixing the fault as soon as possible. Therefore, we set the weights of each index in formula (6) as follows: w₁ = 0.5, w₂ = 0.4, w₃ = 0.1. The population size is set to 10 and the elite fraction to 0.1. The mutation probability is set to 0.5 in the expectation of wider searches of the solution space, the mutation fraction is set to 0.1, and the mutation strength corresponds to 10% Gaussian noise. The Adam optimizer with a learning rate of 0.0001 is used, and the discount rate is set to 0.99. The training process is shown in Fig. 8, and the results of the above three experiments are presented in Table 1 and Fig. 9.
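For reference, the reported hyperparameters can be collected into a single configuration, as in the sketch below (the dict structure itself is ours; the values are those stated above):

```python
# Experiment hyperparameters as reported in the text.
CONFIG = {
    "weights": {"w1": 0.5, "w2": 0.4, "w3": 0.1},  # time, cost, difficulty
    "population_size": 10,
    "elite_fraction": 0.1,
    "mutation_probability": 0.5,
    "mutation_fraction": 0.1,
    "mutation_strength": 0.1,      # 10% Gaussian noise
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "discount_rate": 0.99,
}
```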
As indicated in Table 1, the developed system is able to automatically generate the optimal disassembly sequence according to different inputs from the trainee. Compared with the previous RL methods [26, 27], the proposed GO-DQN algorithm handles the temporal credit assignment in DSP tasks quite well, which is attributable to two improvements: (1) the return in DQN is replaced with the fitness function of GA, and (2) the optimal individuals of the population are updated in each iteration by gradient descent. Since the states experienced by individuals with high fitness yield a higher long-term return, their experience is more likely to be stored in the replay buffer, a form of implicit prioritization that is effective for domains with long-term horizons and sparse rewards.
In experiment I, after the user selects the target part #9, the system calculates the optimal disassembly sequence 25-24-23-22-21-11-10-9 and converts it into operable instructions; the whole process takes 0.7 s on average over 50 attempts. When the targets increase to three, as set in experiment II, the system is still able to calculate the optimal sequence immediately, taking only 0.2 s longer than in experiment I. In experiment III, after the user chooses the additional preferences excluding #22 and #11, the system cannot obtain any feasible sequence; this is because these are parts that must be removed before disassembling parts #2, #9, and #12, as determined by the constraint relationships between parts. In summary, GO-DQN inherits GA's invariance to sparse rewards over long-term horizons as well as DQN's ability to leverage gradients for higher sample efficiency. Benefiting from these advantages, the built system can generate optimal disassembly sequences within a short time of receiving user commands.
For most VR training systems, the adoption of traditional meta-heuristics such as GA makes them suitable only for specific sequences of a single product [5, 41, 42]. Once the disassembly targets fall outside the sequences designed in advance, these algorithms will work abnormally and need to be modified or even rewritten. By contrast, the proposed RL-based method can flexibly output the corresponding optimal sequences according to different input targets, which is better suited to on-site maintenance, where the disassembly targets often change due to the randomness of faults. It is noted that this study is based on a real engineering issue of our cooperative enterprise. However, owing to its complex production line involving a large number of products and components, as well as complicated environmental restrictions including illumination, radiation, and electromagnetic compatibility, it will take some time to launch this training system. Presently, it is still in the design and simulation stage and will go online after all production line tests are completed in a later phase. Admittedly, as the complexity of the input products increases, the training cycle of the RL model will lengthen; how to shorten the training process is one direction of our future exploration. In addition, the system can also be used in different production lines and maintenance workshops by importing the models and interference matrices of other new products.

Conclusion and future work
To deal with the high cost and steep skill threshold of equipment maintenance caused by the randomness of faults, this paper puts forward GO-DQN, a DQN method coupled with GA optimization, which has been confirmed to be applicable to ADSP in an on-site VR maintenance training system. The disassembly process is modeled by DPN, and the DQN method is then adopted to solve the DSP problem defined as an MDP. After that, two networks with different parameters are constructed to perform the backpropagation of errors for training. To minimize the impact of sparse rewards, the fitness function of GA is used in place of the long-term return in DQN. In summary, the main contributions of our work are as follows:
1. From the method perspective, GO-DQN inherits GA's competence in handling temporal credit assignment through its adoption of a fitness metric that integrates the return of an entire episode, as well as DQN's gradient-descent-based update method for speeding up iteration.
2. From the perspective of technical application, the VR maintenance training system equipped with the GO-DQN method can dynamically generate optimal sequences based on different input targets, thus meeting the needs for cost-saving and efficient repair in on-site maintenance environments with uncertain faults.
For future work, we will explore the following aspects:
1. Combine Petri nets with graph neural networks to seek more efficient representations of the disassembly process.
2. Introduce heuristic knowledge for training, such as expert experience, and cloud servers if possible, so as to accelerate the training process.
3. Verify the robustness and efficiency of the system in much more complex and changeable environments.
Author contribution Haoyang Mao: conceptualization, methodology, investigation, data curation, software, formal analysis, experiment, and writing of the original draft. Zhenyu Liu: conceptualization, resources, funding acquisition, supervision, project administration, and writing including review. Chan Qiu: methodology, software, equipment, visualization, validation, supervision, and writing including review and editing.
Funding This study is supported by the National Natural Science Foundation of China (52075480, 51875517), Key Research and Development Program of Zhejiang Province (2021C01008), and High-level Talent Special Support Plan of Zhejiang Province (2020R52004).

Data availability
The data that support the findings of this study are available upon reasonable request.

Declarations
Ethics approval This manuscript has not been published elsewhere in part or entirety and is not under consideration by any other journals.

Conflict of interest
The authors declare no competing interests.