Tool remaining useful life prediction using deep transfer reinforcement learning based on long short-term memory networks

Tool wear and faults affect the quality of the machined workpiece and disrupt the continuity of manufacturing. Accurate prediction of remaining useful life (RUL) is important for guaranteeing processing quality and improving the productivity of automated systems. At present, the most commonly used methods for tool RUL prediction are trained on historical fault data. However, when new types of tools are studied or high-value parts are processed, fault datasets are difficult to acquire, which makes RUL prediction under limited fault data a challenge. To overcome this shortcoming, a deep transfer reinforcement learning (DTRL) network based on long short-term memory (LSTM) networks is presented in this paper. Local features are extracted from consecutive sensor data to track the tool states, and the size of the trained network can be dynamically adjusted by controlling the time sequence length. In the DTRL network, an LSTM network is employed to construct the value function approximation so that temporal information is processed smoothly and long-term dependencies are mined. On this basis, a novel strategy of Q-function update and transfer is presented to transfer the deep reinforcement learning (DRL) network trained on historical fault data to a new tool for RUL prediction. Finally, tool wear experiments are performed to validate the effectiveness of the DTRL model. The prediction results demonstrate that the proposed method achieves high accuracy and generalizes to similar tools and cutting conditions.


Introduction
The tool is a key part of manufacturing processes such as turning, milling, and cutting. Tool wear during manufacturing degrades machining performance and reduces the productivity of high-speed computer numerical control (CNC) machines. Thus, effective tool wear monitoring and remaining useful life (RUL) prediction are of great significance for improving machining quality and enabling predictive maintenance [1][2][3]. Generally, tool monitoring and RUL prediction methods can be roughly classified into statistical model-based and data-driven methods.
In the statistical model-based methods, the core idea is to establish a failure mechanisms model for RUL prediction on the basis of stochastic process. Si et al. [4] developed a Wiener process-based degradation model for RUL prediction, and recursive filter was used to reduce the estimation error. Yan et al. [5] designed a stage-based Gamma process to predict the probability density function of tool unobservable degradation. Wang et al. [6] presented a particle filtering method for tool wear state prediction.
In the data-driven methods, machine learning and deep learning approaches are used to process the observation data for diagnosis and prognosis [7]. In this kind of method, vibration sensors, torque sensors, or other kinds of sensors are installed on machining centers to monitor the tool working states. Signal processing techniques are applied to the sensor signals to obtain discriminative signal features [8][9][10]. Owing to their high prediction accuracy and ease of modeling, data-driven methods have become a research hotspot for tool state monitoring and RUL prediction. For instance, Widodo et al. [11] reviewed the implementation of support vector machines (SVM) in machine condition monitoring and diagnosis. Chen et al. [12] utilized a logistic regression model to process vibration signals for cutting tool monitoring. Karandikar et al. [13] evaluated the performance of two different machine learning methods in predicting the tool life curve. Yang et al. [14] established a v-support vector regression (v-SVR) model to study the relationship between fused features and actual tool wear for tool wear monitoring. Zhang et al. [15] used a least square support vector machine (LS-SVM) to predict tool wear at the cutting edge position under the joint effect of machining conditions. Kong et al. [16] presented a Gaussian process regression technique for accurately monitoring flank wear width. Kong et al. [17] developed a Gaussian mixture hidden Markov model to determine the tool wear states. Zhou et al. [18] utilized extension neural networks (ENNs) to quickly recognize cutting tool conditions with high precision.
The machinery health prognostic program generally follows a similar technical process: first, artificially designed features are extracted from the acquired signals to determine the state change of the tool; then, a nonlinear mapping function between the extracted features and the tool state is established by regression methods. However, artificial neural network (ANN)-based fault prognosis approaches of this kind have two main shortcomings. First, the inputs rely heavily on signal preprocessing techniques. Second, the simple architecture of ANNs lacks sufficient breadth and depth to map complex nonlinear relationships. The development of deep learning has relieved these problems to a certain extent [19,20]. Deep learning can adaptively learn hierarchical representations without extracting the fault features manually [21,22], which improves the adaptability of the model. In addition, more hidden layers are added to process nonlinear inputs, which makes the model more likely to learn deeper hidden information and thus to improve prediction accuracy. Deep learning models have attracted increasing attention in fault diagnosis and prognosis. Jia et al. [23] designed a deep neural network (DNN)-based method for fault diagnosis in rolling element bearings and planetary gearboxes. Shao et al. [24] constructed a convolutional deep belief network for fault diagnosis of rolling bearings, which used compressed sensing (CS) to reduce the amount of data. Wu et al. [25] utilized a bidirectional long short-term memory neural network (BiLSTM) to process singular value decomposition features and predict the current tool wear value. Zhao et al. [26] proposed a deep residual network with dynamically weighted wavelet coefficients for planetary gearbox fault diagnosis.
Different from the above-mentioned approaches, deep reinforcement learning can directly map raw extracted features to the corresponding tool wear state, which helps to further improve the intelligence of prediction methods. Combining the advantages of deep learning and reinforcement learning, deep reinforcement learning is able to construct the environment according to extracted features, from which artificial agents can learn observations and rewards. Reinforcement learning gives agents the ability to interact with their environment, while deep learning enables agents to learn better decisions and to scale to problems with high-dimensional state and action spaces [27]. As a significant breakthrough in the field of artificial intelligence, AlphaGo demonstrated the effectiveness of the DRL mechanism [28]. Since then, DRL algorithms have been widely applied in modern manufacturing systems, natural language processing, and automated machine learning. In modern manufacturing systems, as a common trial-and-error solution for optimization problems, DRL has already been used in fields such as robot training [29,30], management of the Industrial Internet of Things [31,32], dynamic scheduling of flexible job shops [33,34], and machinery fault diagnosis [35]. However, how to transfer a DRL network effectively when the training data available for RUL prediction are limited remains an open issue in accurate tool RUL prediction.
To overcome the deficiencies of limited data and to further improve the accuracy and intelligence of prediction methods, a deep transfer reinforcement learning (DTRL) method is studied in this paper. Two optimization strategies, namely value function approximation through LSTM and Q-function update and transfer, are investigated to realize the transfer of a trained DRL network to a new application scene. In the DTRL method, local features are first extracted from consecutive time series data to reduce the network size. Then, in the DRL prediction method, an LSTM network is adopted to construct the value function approximation for deeply mining temporal information. A novel strategy of Q-function update and transfer is proposed to guarantee the transferability of the trained network to a new domain. Finally, tool wear experiments are carried out and the effectiveness of the proposed method is verified by analyzing the datasets.
The rest of this paper is organized as follows. Theoretical foundation about DRL is introduced in Section 2, based on which the framework of proposed DTRL method is shown in Section 3. Tool wear experiments and case study on RUL prediction are conducted in Section 4. Model comparison and validation are shown in Section 5. Finally, the conclusions are summarized in Section 6.

Theoretical foundation
Deep reinforcement learning (DRL) is a branch of dynamic programming-based reinforcement learning, in which agents interact with the environment while learning. The interactive learning process can be modeled by a Markov decision process (MDP) expressed by a tuple:

$$M = (S, A, T, R, \gamma)$$

where $S$ is a finite set of states, $A$ is a finite set of actions, $T$ is the transition function, $R$ is the reward function, $\gamma$ is the discount factor, and $\pi: S \to A$ is a deterministic policy that specifies the action taken in each state. The state-action value function $Q: S \times A \to \mathbb{R}$ following policy $\pi$ can be defined as:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi\right]$$

The goal of each MDP is to find an optimal policy $\pi^{*}$, which attains the optimal expected return $V^{*}(s)$ and value function $Q^{*}(s, a)$. The Q-function in DRL satisfies the Bellman optimality equation. Therefore, an optimal state-action value satisfies the following equation:

$$Q^{*}(s, a) = \mathbb{E}\left[R + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]$$

One of the most popular methods to estimate the state-action value is the Q learning algorithm. The basic idea of deep Q learning is to estimate Q-values based on rewards and the agent's Q-value function. The Q-update rule for model-free online learning can be expressed as:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[R + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

where $\alpha$ is the learning rate. The max error is utilized to evaluate the quality of the Q-function:

$$\left\| Q - Q^{*} \right\|_{\infty} = \max_{s, a} \left| Q(s, a) - Q^{*}(s, a) \right|$$

Proposed DTRL architecture
In the practical application of deep learning, there are two common problems: the first is dealing with the extremely large state space of tabular Q learning in time series analysis; the second is processing unlabeled data. To solve these issues, the DTRL architecture is designed, which combines deep learning with transfer reinforcement learning. More specifically, the DTRL method takes the current state and action as input, adopts an LSTM to estimate the value of Q(s,a), and finally transfers the Q-values to another LSTM network. The estimated value of Q(s,a) is given by Eq. (6).

Parameter reinforcement Q learning
To solve the problem caused by the large state space in RUL prediction and to improve the generalization ability of the deep Q-function, a Deep Q-Network (DQN) is adopted. DQN is a model parameterized by weights and biases collectively denoted as $\theta$. In DQN, the Q-values at training iteration $t$ are denoted by $Q_{\theta_t}(s, a)$. More specifically, Q-values are estimated by performing forward propagation and then querying the output nodes. To obtain the estimation of Q-values shown in Eq. (6), the proposed DTRL method adopts the experience replay method [36]. After each prediction, the experience at the current time step, denoted as $e_t = (s, a, R, s')$, is recorded in the replay memory $M = \{e_1, e_2, \ldots, e_t\}$ and later sampled randomly at training time. Instead of updating Q-table entries, the network parameters $\theta$ are updated with a stochastic gradient algorithm to minimize the differentiable loss function:

$$L(\theta_t) = \mathbb{E}_{(s, a, R, s') \sim M}\left[\left(R + \gamma \max_{a'} Q_{\theta_t}(s', a') - Q_{\theta_t}(s, a)\right)^{2}\right]$$

When the Q-function changes rapidly, the updates may oscillate or diverge, and when too many iterations are needed the algorithm becomes inefficient. To avoid these problems, the proposed DTRL method adopts the fixed Q-targets method. Instead of using the latest parameters $\theta_t$ to calculate the maximum possible reward of the next state, $\gamma \max_{a'} Q_{\theta_t}(s', a')$, a separate set of parameters $\theta'$ is used, which is refreshed only after a fixed number of iterations. Differentiating the loss function with respect to the parameters gives the gradient:

$$\nabla_{\theta_t} L(\theta_t) = -2\,\mathbb{E}_{(s, a, R, s') \sim M}\left[\left(y - Q_{\theta_t}(s, a)\right) \nabla_{\theta_t} Q_{\theta_t}(s, a)\right]$$

where $y = R + \gamma \max_{a'} Q_{\theta'}(s', a')$ is the stale update target.
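The combination of experience replay and fixed Q-targets described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear approximator with one weight vector per action, the function names (`q_value`, `td_target`, `sgd_step`, `train`), the batch size, and the synchronization interval `sync_every` are all introduced here for illustration. Only the values γ = 0.9 and α = 0.05 come from the experimental settings reported later in the paper.

```python
import random

GAMMA = 0.9   # discount factor (value reported in the paper)
ALPHA = 0.05  # learning rate (value reported in the paper)


def q_value(theta, state, action):
    """Linear Q approximation: one weight vector per action (an assumption)."""
    return sum(w * x for w, x in zip(theta[action], state))


def td_target(theta_target, reward, next_state, n_actions):
    """Stale target y = R + gamma * max_a' Q_theta'(s', a')."""
    best_next = max(q_value(theta_target, next_state, a) for a in range(n_actions))
    return reward + GAMMA * best_next


def sgd_step(theta, theta_target, batch, n_actions):
    """Stochastic-gradient updates on the squared TD error for a sampled batch."""
    for state, action, reward, next_state in batch:
        error = td_target(theta_target, reward, next_state, n_actions) - q_value(theta, state, action)
        # For a linear model, grad_theta Q(s, a) is the state vector itself.
        # The constant factor 2 of the squared loss is absorbed into ALPHA.
        theta[action] = [w + ALPHA * error * x for w, x in zip(theta[action], state)]


def train(experiences, n_features, n_actions, steps=200, batch_size=4, sync_every=20, seed=0):
    """Sample experiences (s, a, R, s') from the replay memory and train with fixed targets."""
    rng = random.Random(seed)
    theta = [[0.0] * n_features for _ in range(n_actions)]
    theta_target = [row[:] for row in theta]  # frozen copy theta'
    for step in range(1, steps + 1):
        batch = rng.sample(experiences, min(batch_size, len(experiences)))
        sgd_step(theta, theta_target, batch, n_actions)
        if step % sync_every == 0:
            theta_target = [row[:] for row in theta]  # refresh theta' only periodically
    return theta
```

Keeping `theta_target` frozen between synchronizations means the regression target does not move with every gradient step, which is the stabilizing effect the fixed Q-targets method is meant to provide.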

Deep Q-network based on LSTM
In the DQN-based RUL prediction method, densely connected networks are used to capture the state correlations. However, real-world RUL prediction tasks also involve temporal correlations and suffer from vanishing gradients, which may degrade the performance of DQN. Therefore, in the proposed DTRL method, LSTM layers are adopted instead of dense layers to carry information across many timesteps. More specifically, the Q-function in the DTRL-based RUL prediction can be written as:

$$Q_{\theta_t}(s_t, h_{t-1}, a_t)$$

where $h_{t-1}$ is the supplementary input calculated by the LSTM layers from the previous information. Consequently, the gradient of the loss function becomes:

$$\nabla_{\theta_t} L(\theta_t) = -2\,\mathbb{E}\left[\left(y_t - Q_{\theta_t}(s_t, h_{t-1}, a_t)\right) \nabla_{\theta_t} Q_{\theta_t}(s_t, h_{t-1}, a_t)\right]$$

where $y_t = R + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, h_t, a')$ is the fixed Q-target. The DTRL method thus uses LSTM layers instead of simple dense layers to process temporal series. As shown in Fig. 1, the LSTM layers take as input the temporally ordered experiences $e = \{e_1, e_2, \ldots, e_t\}$, and the Q-values are calculated after the output layer. In the practical prediction process, because the LSTM layers retain information from multiple timesteps, experience traces of a certain length are sampled instead of single experiences.
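To make the role of the hidden state concrete, the following sketch runs a single LSTM layer over a trace of state vectors and reads Q-values from a linear output layer. It is a forward-pass illustration only, written with NumPy under assumed names and sizes (`LSTMQNetwork`, 8 hidden units, 3 actions, random initial weights); the paper does not specify these details, and training is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMQNetwork:
    """One LSTM layer followed by a linear output layer (illustrative sizes)."""

    def __init__(self, n_features, n_hidden, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for the input, forget, candidate and output gates.
        self.W = rng.normal(0.0, 0.1, size=(4 * n_hidden, n_features + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.W_out = rng.normal(0.0, 0.1, size=(n_actions, n_hidden))
        self.b_out = np.zeros(n_actions)
        self.n_hidden = n_hidden

    def step(self, s_t, h_prev, c_prev):
        """Advance the cell by one timestep given state s_t and memory (h_prev, c_prev)."""
        z = self.W @ np.concatenate([s_t, h_prev]) + self.b
        i, f, g, o = np.split(z, 4)
        c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
        h_t = sigmoid(o) * np.tanh(c_t)
        return h_t, c_t

    def q_values(self, trace):
        """Return one Q-value per action after processing the whole trace."""
        h = np.zeros(self.n_hidden)
        c = np.zeros(self.n_hidden)
        for s_t in trace:
            h, c = self.step(s_t, h, c)  # h carries h_{t-1} into the next step
        return self.W_out @ h + self.b_out
```

A call such as `LSTMQNetwork(6, 8, 3).q_values(trace)` with `trace` of shape `(T, 6)` returns three Q-values. Because the output depends on the whole trace through `h`, sampling traces rather than single experiences from the replay memory is the natural way to train such a network.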

Deep Q transfer learning
After deep reinforcement Q learning, the Q-function of each state-action pair can be calculated, which helps to select the values with the least error in the prediction tasks. However, the above algorithms require a large amount of experimental fault data, which is almost impossible to acquire at the beginning of practical manufacturing. Hence, the transfer learning method is introduced to reduce the amount of training data required and to make full use of the trained Q-function.
In the RUL prediction tasks, the proposed DTRL method transfers the Q-function calculated from different tool tasks (source domain) to another tool task (target domain), which aims to improve the learning ability in new prediction tasks by introducing knowledge from a similar learned prediction task.
According to the DRL formulation, the tool RUL prediction tasks can be defined as $M = (S, A, T, R, \gamma)$, and the tasks differ in the transition function $T$, the reward function $R$, and the discount factor $\gamma$.
As shown in Fig. 1, the source domain, denoted as $M_1 = (S, A, T_1, R_1, \gamma_1)$, is the trained DRL network, and the target domain, denoted as $M_2 = (S, A, T_2, R_2, \gamma_2)$, is the new DRL network. Assume $Q_1^{*}$ and $Q_2^{*}$ are the corresponding optimal Q-functions. The main goal of the DTRL method is to use the information of $M_1$ and $Q_1^{*}$ to update $Q_2$ and to improve the training speed of $M_2$ while ensuring the prediction accuracy.
To learn similar and joint Q-functions for the source domain and the target domain, the distance between the Q-functions calculated by the two networks is minimized. The distance between two tasks is defined in Eq. (12). By minimizing this distance, the Q-functions learned in the target domain are restricted to be similar to those in the source domain, and deep Q transfer learning is thereby achieved. According to the distance between two tasks calculated in Eq. (12), the Q-function update in the new DRL network is performed. Finally, the forward iteration process is implemented in the new DRL network, and the RUL prediction results of the target domain are presented.
The DTRL network is realized through above-mentioned steps, and the whole architecture is shown in Fig. 1.
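The idea of keeping the target Q-function close to the source Q-function can be illustrated with a small tabular sketch. Everything specific in it is an assumption made for illustration: the mean squared difference used as the distance (the paper's own distance is the one given in Eq. (12)), the weight `lam` on the pull toward the source, and the function names `q_distance` and `transfer_update`.

```python
def q_distance(q_source, q_target):
    """Mean squared difference over the (state, action) keys both tables share.

    Assumed distance for illustration; not the definition used in the paper.
    """
    keys = q_source.keys() & q_target.keys()
    return sum((q_source[k] - q_target[k]) ** 2 for k in keys) / len(keys)


def transfer_update(q_source, q_target, key, td_target, alpha=0.05, lam=0.5):
    """Update one target Q-value with a TD step plus a pull toward the source value.

    alpha is the learning rate; lam (hypothetical) weights the transfer term.
    """
    td_error = td_target - q_target[key]   # ordinary Q-learning error
    gap = q_source[key] - q_target[key]    # how far the target is from the source
    q_target[key] += alpha * (td_error + lam * gap)
    return q_target[key]
```

With `lam = 0` the update reduces to plain Q-learning in the target domain; larger values of `lam` keep the target Q-function closer to the trained source Q-function, which is useful when target-domain data are scarce.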

DTRL for RUL prediction
In real prediction tasks, the first step is feature extraction and data preprocessing. We model the mapping function between the extracted features and the corresponding RUL sampling points, which can be expressed as:

$$[f_1, f_2, \ldots, f_m] \rightarrow [t_1, t_2, \ldots, t_n]^{T} \qquad (13)$$

where $[f_1, f_2, \ldots, f_m]$ denote the extracted features, with $m$ denoting the number of input features at a certain point in time, and $[t_1, t_2, \ldots, t_n]^{T}$ denote the RUL, with $n$ denoting the number of sampling time points during run-to-failure. In real tool processing, the tool performance is generally nonlinear because of factors such as crack growth in the material and sudden changes of machining parameters. Therefore, a nonlinear activation function $\phi(\cdot)$ is first adopted to fit the mapping function between the feature matrix and the RUL series, and then linear regression is conducted on the transformed features. The nonlinear mapping can be defined as:

$$\mathrm{RUL} = \phi(F)\,\omega + b \qquad (14)$$

where $\mathrm{RUL} = [t_1, t_2, \ldots, t_n]^{T}$ is the time sample series, $\phi(F)$ is the nonlinear feature matrix, $\omega$ is the corresponding weight vector for nonlinear regression, and $b$ is the bias.
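In the kernel-based method [37], the weights can be expressed by feature samples as $\omega = \phi(F)^{T}\alpha$, with $\alpha$ denoting the representation coefficients. Then the product can be transformed as:

$$\phi(F)\,\omega = \phi(F)\,\phi(F)^{T}\alpha = K\alpha$$

where $K = \phi(F)\,\phi(F)^{T}$ is a kernel matrix. Equation (14) can then be written as:

$$\mathrm{RUL} = K\alpha + b$$

where the coefficients $\alpha$ and the bias $b$ can be calculated by the least square method.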
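The kernel form RUL = Kα + b can be solved numerically as a small linear system. The sketch below is one possible realization under explicit assumptions: it uses an RBF kernel of the form exp(-γ‖x - x'‖²), adds a small ridge term for numerical stability, and imposes the LS-SVM style constraint that the coefficients sum to zero so that α and b are uniquely determined. None of these choices, nor the names `rbf_kernel`, `fit_kernel_least_squares`, and `predict`, are taken from the paper.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_kernel_least_squares(F, rul, gamma=1.0, ridge=1e-3):
    """Solve (K + ridge*I) alpha + b = rul subject to sum(alpha) = 0."""
    n = F.shape[0]
    K = rbf_kernel(F, F, gamma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                      # constraint row: sum(alpha) = 0
    A[1:, 0] = 1.0                      # bias column
    A[1:, 1:] = K + ridge * np.eye(n)   # regularized kernel matrix
    rhs = np.concatenate([[0.0], rul])
    solution = np.linalg.solve(A, rhs)
    return solution[1:], solution[0]    # alpha, b


def predict(F_train, alpha, b, F_new, gamma=1.0):
    """Evaluate RUL = K(F_new, F_train) alpha + b for new feature rows."""
    return rbf_kernel(F_new, F_train, gamma) @ alpha + b
```

Here `F` is an `(n, m)` array of feature rows and `rul` the matching length-`n` vector; the same `gamma` must be passed to `predict` as was used for fitting.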
The DTRL-based RUL prediction algorithm is summarized in Algorithm 1.

Benchmarking data description
To validate the effectiveness of the proposed DTRL method, extensive experiments on tool performance degradation were conducted. The experiments were carried out on a high-speed CNC machine, as shown in Fig. 2. The cutting tools were end-milling cutters with 4 teeth, and the diameter and the length were 6 mm and 50 mm, respectively, as shown in Fig. 3b. The workpiece was a steel alloy sheet made of Cr12MoV. The key signals collected synchronously were cutting force and vibration signals. The Kistler dynamometer was installed under the workpiece to collect cutting force signals by monitoring the workpiece. Meanwhile, the vibration signals were collected in the same way by installing the accelerometer on the workpiece, and the radial vibration of the cutting tools was collected. Other experimental conditions for tool processing were as follows: the spindle rotation frequency was 3000 r/min; the feed rate was 1200 mm/min; the cutting depth in the radial and axial directions was 0.5 mm; the sampling frequency was 10 kHz; and a run-to-failure tool produced a total of 312 cutting segments. A Kistler compact multi-component dynamometer 9129AA was mounted to collect cutting force signals in real time. A DAQ Elsys TraNET 404S8 and a triaxial accelerometer were used to collect vibration signals synchronously, and the measuring range and the frequency range were 50 g pk and 10 kHz, respectively. The tool was considered to have failed when the tool wear exceeded 0.3 mm. Six kinds of sensor signals were acquired, including force in three directions and vibration in three directions. As shown in Fig. 3, the flank wear of the cutting tools was measured by a digital measuring microscope INSIZE ISM-WF200. Two sets of tool wear experiments were carried out: data sampled from tool1 were used for training, and data sampled from tool2 were used for testing. The cutting parameters were kept constant to ensure the stability of the external

Data preprocessing and DRL training
Fig. 3 Tool wear. a Tool wear monitoring system. b Tool used for processing. c Tool wear measurement
Fig. 4 Cutting force signals, vibration signals, and their energy in x direction calculated by OWPT
To make the raw sensor signals more amenable to modeling, the first step is data preprocessing, including feature extraction and normalization. As the energy features can effectively reflect the tool wear state [38,39], orthogonal wavelet packet transform (OWPT) is used to extract the energy features of the cutting force and vibration signals in three directions for comparison. OWPT is effective for noise elimination and dimension reduction, which can improve the prediction accuracy and the calculation speed. In this paper, to acquire discriminative features, six-level OWPT using db1 is adopted to calculate the energy of each sub-band. Each signal is thereby separated into 64 sub-bands, of which the energy is defined as

$$E_{i,n} = \sum_{m} \left| x_{i,m}^{n} \right|^{2}$$

where $x_{i,m}^{n}$ is the wavelet coefficient at scale $2^{i}$, and $n$ is the oscillation parameter. The magnitude and the distribution of energy can effectively reflect the tool wear state. The energy spectrum of the cutting force and vibration signals in each cutting segment can be converted into the form shown in Fig. 4, in which the energy features are dimensionless indicators. Then, each energy feature is normalized independently to accelerate model convergence.
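The six-level db1 decomposition into 64 sub-band energies can be reproduced without any wavelet library, because db1 is the Haar wavelet. The sketch below is an illustration using only the standard library; the function names `haar_split` and `wavelet_packet_energy` are chosen here, and the sub-bands are returned in the natural order of the decomposition tree rather than sorted by frequency.

```python
import math


def haar_split(x):
    """One db1 (Haar) analysis step: return the approximation and detail halves."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    return approx, detail


def wavelet_packet_energy(signal, levels=6):
    """Energy (sum of squared coefficients) of each of the 2**levels terminal sub-bands.

    Every node is split at every level (full packet tree). If a node has odd
    length, its last sample is dropped, so energy is conserved exactly only
    when len(signal) is a multiple of 2**levels.
    """
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            next_nodes.extend(haar_split(node))
        nodes = next_nodes
    return [sum(c * c for c in node) for node in nodes]
```

For a cutting segment stored as a list of samples, `wavelet_packet_energy(segment)` yields 64 energies. Since the transform is orthogonal, these energies sum to the energy of the input segment, which gives a convenient sanity check before normalization.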
After data preprocessing, the frequency spectrum of the signals monitored during the tool wear process is extracted as weighted average energy, which is used as the input to DTRL for model training and RUL prediction. To configure the network structure of the first Q-function of tool1, the number of nodes in both the input and output layers is set to 6, which is equal to the number of extracted features. Experiments were conducted to evaluate the state and reward settings; the discount factor γ is set to 0.9 and the learning rate α is set to 0.05 in Eq. (10). To promote convergence of the training results toward the global optimum, the learning rate is set to decrease exponentially with the number of training epochs. Then, to accelerate the convergence of the network during training, RMSprop is used to optimize the gradient descent process. Finally, the Q-function is trained on the training dataset by repeated updates until it converges.
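The two training devices mentioned above, an exponentially decaying learning rate and RMSprop, can be sketched as follows. The initial rate 0.05 is the value reported in the paper; the decay rate, the RMSprop smoothing constant `rho`, and `eps` are not given in the text and are illustrative assumptions, as are the function names.

```python
import math


def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Exponential schedule: alpha0 * decay_rate ** epoch (decay_rate is assumed)."""
    return alpha0 * decay_rate ** epoch


def rmsprop_step(params, grads, cache, lr, rho=0.9, eps=1e-8):
    """One RMSprop update; returns the new parameters and the new running average.

    cache holds the running average of squared gradients, one entry per parameter.
    """
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = rho * c + (1.0 - rho) * g * g                   # update running average
        new_params.append(p - lr * g / (math.sqrt(c) + eps))  # scaled gradient step
        new_cache.append(c)
    return new_params, new_cache
```

In a training loop, `lr = decayed_learning_rate(0.05, decay_rate, epoch)` would be recomputed once per epoch and passed to `rmsprop_step` together with the gradient of the loss for the current batch.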

Tool prediction results
After model training stage, a well-trained Q-function is obtained and applied in next testing stage. The network structure of the trained Q-function is first updated and transferred to establish a new DRL network for feature learning of tool2. Then the sensor data of tool2 is preprocessed and the wavelet packet energy is calculated by OWPT. To clearly display the prediction results, six features are drawn separately on two graphs. As shown in Fig. 5, the force and vibration energy features (f1 to f6) learned by the DTRL network are drawn independently. The X axis represents the cutting steps of tool2, and the Y axis represents the corresponding energy features. It can be seen that with the increase in time of tool running, the predicted value has a similar increasing trend with the real value. Increasing trend of each extracted feature is consistent with that of tool wear degree in tool2, which shows the extracted features can effectively reflect the degradation of tool performance. The result demonstrates that DTRL network is an effective feature prediction model for tool wear monitoring.
To further perform RUL prediction with the DTRL network for tool2, a nonlinear mapping function is first established between the predicted energy features and the RUL time series. Support vector regression (SVR) is used to construct the regression model, in which the radial basis function (RBF) is chosen as the kernel function, with γ denoting its width parameter. The energy features and corresponding RUL times of tool1 form the training samples. To minimize the training error, the width parameter γ is set to 30.
Finally, the energy features of tool2 are input to the trained regression model to further evaluate the effectiveness of the proposed DTRL method. The RUL prediction result is shown in Fig. 6, and it can be seen that the predicted curve is very close to the real curve. To quantitatively measure the performance of the DTRL method on RUL prediction tasks, two indicators of prediction precision are used: mean absolute error (MAE) and root mean squared error (RMSE). The corresponding equations are expressed as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\tilde{T}_i - T_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\tilde{T}_i - T_i\right)^{2}}$$

where $\tilde{T}_i$ and $T_i$ are the true and predicted tool RUL, respectively, and $n$ is the number of samples.
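For completeness, the two error indicators can be computed with a few lines of Python. The function names `mae` and `rmse` and the sample values in the usage note are illustrative and are not data from the experiments.

```python
import math


def mae(true_rul, pred_rul):
    """Mean absolute error between true and predicted RUL sequences."""
    return sum(abs(t - p) for t, p in zip(true_rul, pred_rul)) / len(true_rul)


def rmse(true_rul, pred_rul):
    """Root mean squared error between true and predicted RUL sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(true_rul, pred_rul))
    return math.sqrt(squared / len(true_rul))
```

For example, with true values `[10, 20, 30]` and predictions `[12, 18, 33]`, the absolute errors are 2, 2, and 3, so `mae` returns 7/3 and `rmse` returns the square root of 17/3. RMSE is never smaller than MAE and penalizes large individual errors more strongly.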
The prediction error is shown in Table 1. It can be seen that the validation MAE and RMSE translate to RUL errors of 16.54 min and 20.23 min after denormalization. The results show a low prediction error, which demonstrates the effectiveness of the DTRL method in RUL prediction.

Model comparison and validation
In Section 4, the proposed DTRL method is used to predict the RUL of a new tool, but the tool types and cutting conditions considered there are rather limited. To verify the effectiveness of the deep Q transfer learning (DQTL) strategy in improving prediction performance, a comparative study without transfer learning is performed for tool2. Furthermore, to investigate the generalization ability of the proposed DTRL method, additional experiments are conducted.

Model comparison
The energy features of tool2 are input to the trained regression model without DQTL. RUL prediction result is shown in Fig. 7, which shows a larger prediction error compared with Fig. 6, especially from middle stage of cutting processing. The corresponding prediction error is shown in Table 2. It can be seen that the quantitative error is higher compared with that in Table 1, which demonstrates the proposed DQTL is an effective method to improve prediction accuracy of DRL network.

Model validation
To verify the generality of the method, two additional kinds of experiments are conducted. The first uses a new type of tool under the same cutting conditions: the diameter of the original tool is 6 mm and the diameter of the new type is 8 mm. The second uses different cutting conditions, in which the spindle rotation frequency is 10000 r/min and the cutting depth in the radial and axial directions is 0.25 mm.
The compared parameters include tool geometries and cutting parameters. During the experiments, the tool geometries are fixed, so the cutting parameters have the greater influence on force and vibration amplitude. Cutting parameters such as spindle speed and cutting depth may cause different dynamic ranges even under similar cutting conditions. To eliminate these effects and improve the generalization ability of the proposed model, a zero-centering step is first applied to the collected signals. Then, wavelet packet energy features are extracted from the cutting force and vibration signals of each sampling point. In this paper, 6-level wavelet packet decomposition is adopted, so the harmonics are decomposed into 64 wavelet packet energy coefficients. Finally, energy features are input into the model instead of the original signals (see Fig. 4), which makes the model robust to various cutting conditions. Table 3 shows the MAE and RMSE translated to RUL in the two extended experiments. The results indicate that the DTRL method can effectively predict tool wear under various cutting conditions.

Conclusion
In this paper, a DTRL method is introduced to conduct deep Q-function transfer in a reinforcement learning network for tool wear and RUL prediction. Two strategies, namely introducing LSTM into DQN and Q-function update and transfer, are designed to realize the DTRL network. Experimental results show that introducing the deep Q transfer learning strategy yields more accurate and reliable tool wear prediction results. Furthermore, a DRL network trained on similar tools or conditions can be transferred to the target tool when new types of tools are studied or high-value parts are processed. As reinforcement learning has been widely used in control systems, future work will extend the model to cutting edge control and cutting path planning.
Author contribution Jiachen Yao performed the data analyses and wrote the manuscript; Baochun Lu contributed to the conception of the study; Junli Zhang performed the experiment and helped perform the analysis.
Data availability The data sets supporting the results of this article are included within the article.

Declarations
Ethics approval NA.
Consent to participate NA.
Consent to publish Written informed consent for publication was obtained from all participants.
Competing interests The authors declare no competing interests.