Layer-wise relevance propagation for interpreting LSTM-RNN decisions in predictive maintenance

Predictive maintenance (PdM) is an advanced technique to predict the time to failure (TTF) of a system. PdM collects sensor data on the health of a system, processes the information using data analytics, and then establishes data-driven models that can forecast system failure. Deep neural networks are increasingly being used as these data-driven models owing to their high predictive accuracy and efficiency. However, deep neural networks are often criticized as being “black boxes”: their multi-layered, non-linear structure provides little insight into the underlying physics of the system being monitored and makes their predictions nontransparent and difficult to trace. In order to address this issue, the layer-wise relevance propagation (LRP) technique is applied to analyze a long short-term memory (LSTM) recurrent neural network (RNN) model. The proposed method is demonstrated and validated in a bearing health monitoring study based on vibration data. The obtained LRP results provide insights into how the model “learns” from the input data and demonstrate how contribution/relevance to the neural network classification is distributed over the input space. In addition, comparisons are made with gradient-based sensitivity analysis to show the power of LRP in interpreting RNN models. LRP is shown to have promising potential for interpreting deep neural network models and improving model accuracy and efficiency for PdM.


Introduction
In the era of Industry 4.0, the Industrial Internet of Things (IIoT), smart manufacturing, and cyber-physical systems [1] are providing stimulus to manufacturing sectors, where smart sensors, artificial intelligence, and related technologies are employed to improve manufacturing performance. As an advanced maintenance technique, predictive maintenance (PdM), also known as prognostics and health management (PHM), is broadly applied to monitor machine health to improve productivity, safety, and reliability and to reduce costs and waste. PdM not only provides diagnostic information on what is wrong and where the problem is, but also gives prognostic information on when and what failure is going to happen. An effective PdM system can detect an incipient failure or fault and then schedule maintenance procedures at an early stage.
The concept of PdM has been applied to predict equipment failure for decades, and its methodologies have evolved considerably. Conventionally, statistical approaches (e.g., regression-based methods [2][3][4], Gamma process-based methods [5][6][7], and Markovian-based methods [8][9][10]) have been widely employed to simulate the deteriorating patterns of machines. These approaches regard the machine health state as a random variable and use a history of condition monitoring data to build its density function. Despite the success that traditional statistical methods have achieved in certain applications, they are often inaccurate and incapable of modelling complex machinery components or systems. To face this challenge, machine learning (ML) based methods, especially deep learning (DL), have gained momentum for automatically and quickly learning subtle patterns in big data and responding to specific needs with new methodological advancements. Considering the abundance of sensing data, advances in computational infrastructure, and ever-developing data analytical techniques, researchers have explored a variety of approaches to apply DL in PdM. Among deep learning models, the recurrent neural network (RNN) is one of the most frequently used owing to its capability in dealing with time-series data. As an advanced variant of the RNN, long short-term memory (LSTM) is broadly used in sequence prediction problems. Guo et al. [11] constructed an LSTM-RNN-based health indicator to predict the remaining useful life (RUL) of bearings. Huang et al. [12] developed an LSTM-based RNN method for machine health prognostics under multiple operational conditions. Gugulothu et al. [13] used an RNN to create an embedded structure that captures machine degradation patterns for time to failure (TTF) estimation.
Although RNN-based algorithms have been widely reported as a cutting-edge technique in PdM and have achieved great prediction accuracy in various applications, they are commonly regarded as "black box" approaches, owing to their multi-layered, non-linear structure with hundreds of thousands of parameters that must be determined through training [14]. These properties make an RNN nontransparent, and its predictions are difficult to trace. The consequence is that an RNN does not offer a comprehensive explanation of how decisions are made. What is needed is a method that allows for interpretation of the model and provides insight into what the RNN has actually "learned." Such an interpretation would enable a better understanding of how the RNN associates the input features with its output decisions.
Interpreting neural networks is actually not a completely new concept in areas such as image recognition and natural language processing, where a few studies on the topic have been published. Bohle et al. [15] successfully linked and explained patterns learned by a DL image classifier in clinical magnetic resonance imaging (MRI)-based diagnosis of Alzheimer's disease. Arras et al. [16] focused on explaining predictions in sentiment analysis, verifying whether an RNN classifier can (or cannot) detect important patterns in text datasets. In spite of the increasing interest in interpreting RNN models, little work exists in the PdM literature that involves an RNN model interpretation approach in machine health analysis.
To fill this gap, we conducted analyses on an LSTM-RNN-based degradation diagnosis and prognosis model for monitoring manufacturing equipment performance. The model analyzes features extracted from vibration sensing data and predicts bearing health conditions. A layer-wise relevance propagation (LRP) technique was applied to decompose the model output relevance (also known as a relevance score) into contributions from each input neuron. During this process, the relevance was redistributed backwards from layer to layer, starting at the output layer and ending at the input layer, with the total amount of relevance kept consistent across layers. The relevance scores of the input layer are mapped into a 2D space with a time-series dimension and a feature-wise dimension. The LRP results are analyzed along both dimensions, with different labels, to interpret individual classification decisions. Furthermore, the LRP results are also used to identify the most/least relevant features and time-step inputs contributing to the RNN model. The obtained LRP results provide insights into how the data flow inside the model and demonstrate the contribution/relevance to the neural network classification in the input space. LRP is shown to be effective in explaining neural network models and improving model accuracy and efficiency for PdM.
The remainder of the paper is organized as follows. Section 2 reviews the theory of the LSTM-RNN algorithm and its application in PdM. Section 3 describes the LRP technique for LSTM-RNN models. Section 4 illustrates a case study experiment where an RNN model for bearing health prediction is established and LRP is implemented. Results and discussion of the LRP analysis are given in Section 5. Conclusions are drawn in Section 6.

Background on LSTM-RNN models
Compared to general deep neural networks, the RNN is a specialized deep architecture for sequential modeling. Its variant LSTM was designed to further address the challenge of learning long-term dependencies [17] and has drawn increasing attention recently in many research areas. The following parts of this section review the basic theory of RNN and LSTM, as well as related work in PdM using LSTM-RNN.

Recurrent neural network and long short-term memory models
When dealing with the time-sequential data generated from a machine condition monitoring system, the recurrent neural network is well known for its capability of capturing the dynamics of sequential data. Different from a conventional neural network, neurons in an RNN are strengthened by including edges that span adjacent time steps. These edges, called recurrent edges, augment the model data space with a dimension of time by forming self-connected cycles of a neuron to itself across time. For a simple recurrent network as shown in the left of Fig. 1a, the dynamics of a neuron with recurrent edges can be described as:

h^(t) = F(W h^(t−1) + U x^(t) + b_h)
y^(t) = G(V h^(t) + b_y)

where h^(t) is the hidden layer activation at time t, h^(t−1) is the previous hidden representation, and x^(t) is the current input from the input layer. The RNN has hidden-to-hidden connections, input-to-hidden connections, and hidden-to-output connections parametrized by the weight matrices W, U, and V, respectively. b_h and b_y are the bias parameters in the hidden layer and output layer to allow offset learning. F(·) and G(·) are the activation functions for the two layers. y^(t) is the output of the simple RNN.
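As a quick illustration (a toy sketch, not the authors' implementation; all dimensions and weights here are made up), the recurrence above can be written in NumPy with F taken to be tanh and G the identity:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b_h, b_y):
    """One forward step of a simple RNN:
    h(t) = F(W h(t-1) + U x(t) + b_h),  y(t) = G(V h(t) + b_y)."""
    h_t = np.tanh(W @ h_prev + U @ x_t + b_h)  # hidden state update (F = tanh)
    y_t = V @ h_t + b_y                        # output (G = identity)
    return h_t, y_t

# Toy dimensions: 3 inputs, 4 hidden units, 2 outputs; weights are shared
# across all time steps, exactly as in the unfolded network.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
U = rng.standard_normal((4, 3)) * 0.1
V = rng.standard_normal((2, 4)) * 0.1
b_h, b_y = np.zeros(4), np.zeros(2)

h = np.zeros(4)
for t in range(5):                             # unroll over 5 time steps
    x_t = rng.standard_normal(3)
    h, y = rnn_step(x_t, h, W, U, V, b_h, b_y)
```

Note how the same W, U, and V are reused at every step; only the hidden state h carries information forward in time.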
If we unfold the network in Fig. 1a from left to right, it can be observed that the data propagation in the network is not cyclic but one way along the time direction, similar to the propagation across layers. The difference lies in that the weights (W) are shared across time steps. Thus, the network is trainable across multiple time steps using the backpropagation algorithm. The algorithm introduced by Werbos [18] is called backpropagation through time (BPTT). However, since the weights across all time steps are always the same, the contribution of the input at time step t1 to the time step t2 will either move to infinity (explode) or decay to zero (vanish) as t2 − t1 grows large. The gradient of the loss with respect to the input will likewise explode or vanish, depending on whether the magnitude of the recurrent weight is greater or less than 1 (|W| > 1 or |W| < 1) and on the activation function F.
In order to overcome the gradient exploding/vanishing problem, Hochreiter and Schmidhuber [19] introduced the long short-term memory model. The model has the architecture of a standard recurrent neural network with a hidden layer, but each neuron in the hidden layer is replaced by a memory cell structure (shown in Fig. 1b) with a core node called the state unit s^(t). Like a normal neuron in a hidden layer, the cell has external inputs from both the previous layer and the previous state, and external outputs to the next time step and the following layer. But it also has an internal system of gating units that controls the flow of information via multiplication. For every time step t, the values of the forget gate unit f^(t), the input gate unit g^(t), and the output gate unit q^(t) are computed, and the state and hidden output are updated. The computation procedure for an LSTM model at every step is provided below:

f_i^(t) = σ(b_i^f + Σ_j U_{i,j}^f x_j^(t) + Σ_j W_{i,j}^f h_j^(t−1))
g_i^(t) = σ(b_i^g + Σ_j U_{i,j}^g x_j^(t) + Σ_j W_{i,j}^g h_j^(t−1))
q_i^(t) = σ(b_i^q + Σ_j U_{i,j}^q x_j^(t) + Σ_j W_{i,j}^q h_j^(t−1))
s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ(b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} h_j^(t−1))
h_i^(t) = q_i^(t) tanh(s_i^(t))

The three gate units and the state unit each have their own bias b_i, input weights U_{i,j}, and recurrent weights W_{i,j} (for the connection of input j to cell i), and they are all activated by the sigmoid function σ(·). Finally, the hidden layer vector h_i^(t) is updated as the output of the LSTM cell. In the forward direction, the LSTM can learn when and to what extent to let values in/out by activating the input/output gate. If both gates are closed, the value of the hidden layer will neither grow nor shrink, so no outputs or intermediate time steps will be affected. Similarly, in backward propagation the gradients can propagate back across many time steps stably without exploding or vanishing. In other words, gates are capable of learning when to let error in and when to let error out [17]. LSTM has been widely used for a variety of practical applications and has been shown to learn long-term dependencies more efficiently than regular recurrent architectures.
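The gating equations above can be sketched in NumPy as follows (a toy illustration with made-up dimensions and random weights, not the model used later in the case study; a tanh candidate activation is used in the state update, as in common LSTM variants):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, P):
    """One LSTM step. P holds input weights U*, recurrent weights W*,
    and biases b* for the forget (f), input (i), and output (o) gates
    and the candidate signal (g)."""
    f = sigmoid(P['Uf'] @ x_t + P['Wf'] @ h_prev + P['bf'])  # forget gate
    i = sigmoid(P['Ui'] @ x_t + P['Wi'] @ h_prev + P['bi'])  # input gate
    o = sigmoid(P['Uo'] @ x_t + P['Wo'] @ h_prev + P['bo'])  # output gate
    g = np.tanh(P['Ug'] @ x_t + P['Wg'] @ h_prev + P['bg'])  # candidate signal
    s_t = f * s_prev + i * g      # accumulated state unit s(t)
    h_t = o * np.tanh(s_t)        # gated hidden output of the cell
    return h_t, s_t

rng = np.random.default_rng(1)
d, n = 3, 4                       # toy input and hidden sizes
P = {}
for k in 'fiog':
    P['U' + k] = rng.standard_normal((n, d)) * 0.1
    P['W' + k] = rng.standard_normal((n, n)) * 0.1
    P['b' + k] = np.zeros(n)

h, s = np.zeros(n), np.zeros(n)
for t in range(10):
    h, s = lstm_step(rng.standard_normal(d), h, s, P)
```

The multiplicative gates (f, i, o) are exactly the "gate" neurons that the LRP multiplicative-interaction rule in Section 3.2 must handle.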

Related work based on LSTM-RNN in PdM
The goal of a PdM system is to provide diagnostic and prognostic information on a monitored system (e.g., manufacturing equipment or a turbine engine) from in-process sensory signals. Since these sensory signals are continuously collected by the monitoring system, they naturally form a time-series that can be used to estimate and predict the health condition of a manufacturing component (e.g., bearings and motors) or system (e.g., machine tools and turbine engines).
Let us assume that we periodically (perhaps once a day or once an hour) collect dynamic data from multiple sensors monitoring a system. Let X_i = [x^(1), …, x^(T)] be a dataset consisting of a series of consecutive observations for the ith machine condition (e.g., the machine condition for the ith day; in such a case, the monitoring interval between machine conditions is 1 day), where x^(t) ∈ ℝ^d is a vector of length d that contains data for multiple sensors sampled at time step t. There are in total T samples in the dataset. The machine health condition corresponding to X_i is denoted as y_i. An LSTM-RNN model is used to predict y_i based on the sequential data X_i.
Given the LSTM-RNN's strength in learning from sequential data, it has been used widely in many PdM applications and has achieved better performance compared to other methods. Among those applications, two kinds of research work have been particularly popular in the field, namely anomaly detection and RUL prediction. Using the prediction error, or a function of the prediction error, as a measure of the severity of an anomaly, time-series prediction based on LSTM has been proved effective for anomaly detection. Wu et al. [20] proposed a synergy of the LSTM and the Gaussian Bayes model for outlier detection in the IIoT. The proposed technique outperformed the best-known competitors on real-life datasets that involve both long-term and short-term time dependence. A new trend is to employ an encoder-decoder with LSTM for anomaly detection, which learns a representation of the time-series and reconstructs the sequence; the reconstruction error is then used to identify an anomaly. Wang et al. [21] proposed an LSTM-based encoder-decoder architecture with an attention unit incorporated for anomaly detection. The method was assessed with data from a diesel engine assembly process and achieved great performance. Li et al. [22] developed an LSTM-based classifier model for detecting abnormal conditions of rotating equipment, realized 99% accuracy in an unsupervised learning environment, and offered an alternative method for anomaly detection without empirical knowledge.
As for RUL prediction, Dudukcu et al. [23] proposed an LSTM-based model combined with WaveNet to predict remaining useful life. A turbofan engine degradation simulation dataset was used to examine the model, and close estimation to actual data was achieved. Guo et al. [11] constructed an LSTM-RNN-based health indicator to predict the remaining useful life (RUL) of bearings, where related-similarity features were employed. Zhang et al. [24] conducted an LSTM-based analysis to forecast equipment working condition. The method was evaluated for a cooling pump in a power station, and the LSTM-based model had a smaller RMS error than the autoregressive integrated moving average (ARIMA) model. Similarly, Zhao et al. [25] developed an LSTM-based health monitoring system for cutting tool wear prediction. Basic and deep LSTMs were used to predict the tool wear from sensory data, and a deep LSTM was able to outperform several state-of-the-art baseline methods.
As is evident, many investigators have applied LSTM-RNN to PdM with great success. They have not, however, examined how the LSTM-RNN evolves in response to newly acquired input data. The next section introduces LRP as a tool to explain how the LSTM-RNN method "learns" from data.

General LRP theory
Given a trained neural network classifier, one of the things we are interested in knowing is how much each input contributes to a target class of interest, c, or, equivalently, a relevance score of each input with respect to c. Layer-wise relevance propagation, proposed by Binder and Bach et al. [26] in 2015, is such an explanation technique for computing relevance. The main idea of LRP is to assign a relevance score to each individual input by tracing back its contribution to the final prediction, f(x), layer by layer. The LRP process follows a conservation principle: the total amount of relevance distributed to one layer should equal the total amount of relevance distributed in the previous layer. If m and n are two consecutive layers of a neural network, the relevance scores satisfy the following rule:

Σ_i R_i^(m) = Σ_j R_j^(n)    (8)

where R_i^(m) and R_j^(n) are the relevance scores of individual neurons i and j in layers m and n, respectively. In order to fit the characteristics of different neural network structures, there are various rules defining how the relevance scores propagate between two layers in compliance with Eq. (8). A basic rule, namely LRP-0, is shown in Eq. (9):

R_i^(m) = Σ_j (z_{i,j} / Σ_k z_{k,j}) R_j^(n)    (9)

where z_{i,j} is the amount of input contributed by activated neuron i in layer m to neuron j in layer n, and Σ_k z_{k,j} is the total amount of contribution sent to neuron j from all connected neurons in layer m before a non-linear activation function is applied. The conservation principle is evident in this equation; it also applies in situations such as zero weights, deactivation, and disconnected neurons.
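A minimal NumPy sketch of the LRP-0 redistribution step between two dense layers might look like this (the activations and weights are illustrative; positive values are used so the denominators are safely nonzero, and the bias is omitted from the denominator, matching Eq. (9) as written):

```python
import numpy as np

def lrp_0(a, W, R_upper):
    """LRP-0 (Eq. 9): R_i = sum_j z_ij / sum_k z_kj * R_j,
    with z_ij = a_i * W_ij the contribution of neuron i (layer m)
    to neuron j (layer n)."""
    z = a[:, None] * W               # z_{i,j}, shape (m, n)
    return (z / z.sum(axis=0)) @ R_upper

rng = np.random.default_rng(2)
a = rng.uniform(0.1, 1.0, size=5)        # activations of lower layer m
W = rng.uniform(0.1, 1.0, size=(5, 3))   # positive toy weights
R_upper = np.array([0.0, 0.0, 1.0])      # all relevance on the predicted class
R_lower = lrp_0(a, W, R_upper)
```

Summing R_lower reproduces the conservation principle of Eq. (8): the layer below receives exactly the relevance the layer above held.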
Although the LRP-0 rule has many appealing properties, when applied to real cases, robustness needs to be considered along with other enhancements. In this research, two enhanced LRP rules were used:

Epsilon rule (LRP-ε):

R_i^(m) = Σ_j (z_{i,j} / (ε + Σ_k z_{k,j})) R_j^(n)

Compared to the basic rule LRP-0, the denominator includes a small positive term ε to ensure numerical stability.

Alpha-beta rule (LRP-αβ):

R_i^(m) = Σ_j (α · z_{i,j}^+ / Σ_k z_{k,j}^+ + β · z_{i,j}^− / Σ_k z_{k,j}^−) R_j^(n)

where z_{i,j}^+ and z_{i,j}^− refer to the positive and negative parts of the contribution from neuron i in layer m to neuron j in layer n, respectively, and α and β are two parameters controlling the weights of the positive and negative contributions. In accordance with the conservation principle, α + β should equal one. The rule is designed to favor the effect of positive contributions over negative ones, so that a certain stability and interpretability are brought to the final results. By choosing the values of the coefficients α and β carefully, one can manually control the importance of positive and negative contributions.
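Both enhanced rules can be sketched in a few lines of NumPy (an illustrative implementation with made-up activations and weights; α = 2 and β = −1 are the values used later in the case study, and the toy weight matrix is chosen so every column has both positive and negative contributions):

```python
import numpy as np

def lrp_eps(a, W, R_upper, eps=1e-8):
    """LRP-epsilon: a small stabilizer eps (matching the sign of the
    denominator) avoids division by near-zero column totals."""
    z = a[:, None] * W
    denom = z.sum(axis=0)
    return (z / (denom + eps * np.sign(denom))) @ R_upper

def lrp_alpha_beta(a, W, R_upper, alpha=2.0, beta=-1.0):
    """LRP-alpha-beta with alpha + beta = 1: positive and negative
    contributions are redistributed separately, then reweighted."""
    z = a[:, None] * W
    z_pos = np.clip(z, 0.0, None)
    z_neg = np.clip(z, None, 0.0)
    frac_pos = z_pos / z_pos.sum(axis=0)
    frac_neg = z_neg / z_neg.sum(axis=0)
    return (alpha * frac_pos + beta * frac_neg) @ R_upper

a = np.ones(5)                              # toy lower-layer activations
W = np.array([[ 1.0, -0.5,  0.3],           # mixed-sign toy weights
              [-0.2,  0.8, -0.4],
              [ 0.5, -0.1,  0.6],
              [-0.3,  0.4, -0.2],
              [ 0.7,  0.2,  0.5]])
R_upper = np.array([0.2, 0.3, 0.5])
R_ab = lrp_alpha_beta(a, W, R_upper)
R_e = lrp_eps(a, W, R_upper)
```

Because α + β = 1, the α-β rule conserves the total relevance exactly; the ε rule conserves it up to the small stabilizer.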

Applying LRP to LSTM
Besides the linear mapping computations found in multilayer perceptron structures, the LSTM-RNN, like other gated neural networks, has a special computation known as a multiplicative interaction. In such a computation, two neurons are multiplied by each other: one serves as a signal while the other serves as a gate, which controls how much the signal influences the output:

a_p = f(z_g) · g(z_s)

where f(·) is the activation function for the gate unit and g(·) is the activation function for the signal unit; z_g and z_s are two neuron values from previous layers sent to the gate and signal units, respectively, and a_p is the output controlled by the gate. Unlike linear mapping, the nonlinearity in a multiplicative interaction brings its own difficulties in redistributing relevance to the previous layer. In such a case, where an activation is obtained by multiplying the value of a gate neuron with the value of a signal neuron, one widely accepted redistribution strategy [25] is called "signal-take-all":

R_g = 0,  R_s = R_p

where R_g and R_s are the relevance scores assigned to the gate neuron and the signal neuron. The signal neuron takes all the relevance R_p from the upper layer while the gate neuron takes zero, obeying the conservation principle. One way of interpreting this strategy is that the gate controls the flow of information but is not information itself; rather, the information is wholly embedded within the signal. Although it seems that the contribution of z_g is totally neglected, the effect of z_g has actually been taken into account during the computation of R_p in the upper layers.
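The forward product and the signal-take-all backward step can be illustrated as follows (toy values; the sigmoid gate and tanh signal mirror an LSTM cell):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: gated product a_p = f(z_g) * g(z_s).
z_g, z_s = 0.5, 1.2
a_p = sigmoid(z_g) * np.tanh(z_s)   # gate value scales the signal

# Backward pass (signal-take-all): the signal neuron inherits all of
# R_p; the gate neuron gets none, so total relevance is conserved.
R_p = 0.42                           # relevance arriving from the upper layer
R_s, R_g = R_p, 0.0
```

The gate still matters: its value already shaped a_p in the forward pass, and therefore shaped the R_p that arrives from above.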
Another particular operation in the LSTM is the accumulation as stated in Eq. (4). Controlled by the forget gate, the new state unit is updated from its previous value at each time step. If we define the relevance R_p = a_p · c_k and a_p = a_f · a_{k−1}, then, through the iterating accumulation operation during the whole redistributing process [27], we get Eq. (14):

R_p = a_f · a_{k−1} · c_k    (14)

where a_f is the forget gate value controlling the output of the gate and c_k is a coefficient converting a neuron value into relevance. The signal-take-all strategy is applied to the product (the gate does not receive relevance but influences the relevance the signal neuron receives).
The corresponding LRP rule is similar to the one for the multiplicative interaction. However, while redistributing relevance scores to the inputs of previous time steps, the value tends to decrease exponentially backwards with the time step. That means only the few most recent time steps will receive appreciable relevance. It should also be noticed that the value of the product neuron a_p directly influences the relevance, while the term c_k is just a constant that does not change with the time steps.
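The exponential decay can be illustrated with a constant, assumed forget-gate value (in a real LSTM the gate value varies per time step, but the repeated multiplication through the accumulation is the same):

```python
# Assumed constant forget-gate activation a_f in (0, 1); each step back
# in time multiplies the relevance reaching that step by a_f again, so
# relevance decays geometrically with the distance into the past.
a_f = 0.6
R_current = 1.0
relevance_back = [R_current * a_f ** n for n in range(6)]  # n steps back
```

With a_f = 0.6, five steps back the relevance has already fallen below 8% of its current-step value, matching the observation that only the most recent time steps receive appreciable relevance.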

Case study experiment
It is now desired to examine the effectiveness of using LRP techniques to interpret LSTM-RNN models. To do this, a model developed by Wu et al. [28] based on data published by the Center for Intelligent Maintenance Systems (IMS) at the University of Cincinnati [29] was used as a case study. The model takes time-series data of multiple features extracted from the vibration signal as input and outputs classification results in terms of 4 categories/labels as an estimate of the health state of the bearing.

Experiment setup and data preprocessing
The data for this experiment [29] came from a setup with four Rexnord ZA-2115 double row bearings installed on a shaft. The shaft was driven by an AC motor via friction belts at a constant rotational speed of 2000 RPM. A radial load of 6000 lb (2722 kg) was applied to the shaft and bearings by a spring mechanism. Accelerometers (PCB model #353B33) were installed on the housing of each of the four bearings. A sketch of the test rig including the sensor placement is shown in Fig. 2. Run-to-failure experiments on bearing health degradation were conducted, and 12 sets of accelerometer data were collected. For each set of accelerometer data, a 1-s vibration signal time-series snapshot was collected every 10 min. Each snapshot consisted of 20,480 data points (i.e., the accelerometer signal was sampled at a rate of 20 kHz). An NI DAQ Card 6062E was used in the data collection process.
Through the life of the bearings, each vibration snapshot was analyzed in the time and frequency domains, and 35 features were extracted and normalized as a feature set. The names of the features and corresponding domains are listed in Table 3 in the Appendix. The listed features were selected from the most commonly used features in the bearing health diagnosis and prognosis research literature [31]. In order to provide the model with time-series information for learning, a series of consecutive feature sets was utilized (one feature set every 10 min). The time sequence window was chosen to be 15 in this case; thus, a 15 × 35 matrix was formed as the input to the model. This time sequence window moved along the feature-set series starting from the first 15 feature sets. Every time the window moved towards the end of bearing life by one time step, a new matrix was generated as one input matrix. For example, a feature-set series with a length of 2000 can generate 1986 matrices based on this "moving window." The health conditions of the bearings were classified into 4 states (labeled "normal I," "normal II," "warning," and "failing") in terms of TTF ranges of 100-50%, 50-20%, 20-5%, and 5-0%. For each input matrix, the corresponding label/output is the health state at the last time step of that window (time step 15).
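The moving-window construction can be sketched as follows (random data standing in for the real feature sets), reproducing the 2000 → 1986 count from the example:

```python
import numpy as np

def make_windows(feature_sets, window=15):
    """Stack sliding windows over a (T, d) feature-set series into a
    (T - window + 1, window, d) input tensor; the label of each window
    would be the health state at its last time step."""
    T = feature_sets.shape[0]
    return np.stack([feature_sets[t:t + window] for t in range(T - window + 1)])

# 2000 feature sets of 35 features each -> 1986 input matrices of 15 x 35.
series = np.random.default_rng(0).standard_normal((2000, 35))
X = make_windows(series, window=15)
```

Each step of the window produces one training sample, so a series of length T yields T − 15 + 1 samples.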

LSTM-RNN architecture
In this case study, a multi-layered deep neural network architecture was designed to classify the health states of bearings. The LSTM-RNN model with specific layer details is shown in Table 1. The input layer takes a 15 × 35 matrix and sends it into the LSTM layer. As the core layer in this network, the LSTM layer has 30 neurons with a hyperbolic tangent (tanh) activation function. Each neuron in this layer spans 15 time steps in accordance with the input time dimension. The LSTM layer is connected to a fully connected (dense) layer with 50 neurons (dense #1) and a rectified linear unit (ReLU) activation function, followed by two dense layers with 10 (dense #2) and 4 neurons (dense #3), respectively, both also activated by ReLU. The last layer is a softmax output layer that takes the maximum output from the previous layer (dense #3) as the classification result. The selection of hyperparameters, such as the number of neurons in each layer, was based on existing successful LSTM-RNN structures in the literature. The design follows the idea that the network should first expand the data space from the input to generate abundant information, then compress the space to automatically select the necessary information, and finally make a decision via a single output.
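A minimal Keras sketch of this architecture is given below; the layer sizes follow the description above, but other details (initializers, regularization, etc.) of the original model in [28] may differ:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Layer sizes as described in the text / Table 1.
model = keras.Sequential([
    keras.Input(shape=(15, 35)),          # 15 time steps x 35 features
    layers.LSTM(30, activation="tanh"),   # core LSTM layer, 30 neurons
    layers.Dense(50, activation="relu"),  # dense #1: expand
    layers.Dense(10, activation="relu"),  # dense #2: compress
    layers.Dense(4, activation="relu"),   # dense #3: one neuron per class
    layers.Softmax(),                     # output: 4-class probabilities
])
model.compile(optimizer="adam", loss="mse")  # Adam + MSE, as in the text
```

The parameter count is dominated by the LSTM layer: 4 × (30 × (35 + 30) + 30) = 7920 weights, versus roughly 2100 in the three dense layers combined.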
The LSTM-RNN model was constructed in Python using the Keras API for TensorFlow. During the training phase, 10 of the 12 sets of experiment data were used to train the model. The mean squared error (MSE) was used as the loss function, and adaptive moment estimation (Adam) was selected as the optimizer. After 100 epochs, the model reached an accuracy above 99% and a loss below 0.001. Then, the other two sets of experiment data were imported into the trained model to evaluate its adequacy. In the testing phase, an overall accuracy of 90.07% was observed. It should be noted that the classification accuracy for the "warning" and "failing" states was over 95%. The results from the testing phase are shown in Fig. 3 as a confusion matrix.

Applying LRP to the LSTM-RNN model
As noted above, the LSTM-RNN model was evaluated using two of the 12 datasets. For each of these evaluations, an input matrix from the dataset was processed by the LSTM-RNN model to estimate the bearing health state. The LRP technique was then used in concert with the structure of the LSTM-RNN model, the health states, and the input matrices to analyze the degradation patterns and the distinguishable information from different categories.
A flowchart demonstrating the whole process is shown in Fig. 4. The upper flow demonstrates how the LSTM-RNN model generated a prediction by forward propagation, and the lower flow shows how LRP back-propagated the relevance values to the input space. At the beginning of the LRP process, for each sample, the relevance score of the output layer neuron was set to 1 for consistency. The relevance was then back-propagated to the dense #3 layer, where the values of the 4 neurons representing the 4 classes were compared; only the neuron of the predicted class received the whole relevance score of 1 (the other three received 0). The rest of the relevance propagation towards the input layer followed the principles introduced in Section 3.1 and used a basic linear relevance mapping function, where the LRP-αβ rule was employed to reduce the effect of negative contributions (relative to positive ones), with α and β set to 2 and −1. This function was directly applied to all the dense layers (dense #1, 2, 3) to calculate the relevance scores for the lower layers. When the back-propagation reached the LSTM layer, in addition to the linear mapping function, the multiplication and accumulation operations (discussed in Section 3.2) were considered, and an epsilon term ε = 1.0 × 10⁻⁸ was introduced for numerical stability. Finally, each sample received a new 15 × 35 matrix mapping the relevance scores to the input space.
In these analyses, three aspects were of particular interest: (1) how the inputs, time step-wise and feature-wise, contribute to the classification decision; (2) how the features contribute to the output through different degradation stages; and (3) how LRP analysis could improve the accuracy and efficiency of the LSTM-RNN model.Results are shown and discussed in the next section.

General LRP results in features and time steps
An LRP analysis provides relevance scores that are assigned to the input space (in this case a 2D space). Thus, a 2D relevance score matrix is created for each input sample.

Fig. 4 Flowchart for interpreting LSTM-RNN via LRP

LRP provides insight into how the LSTM-RNN model uses contributions from the input layer to make a decision. Of course, it is necessary to first scrutinize the LRP technique to verify its ability to interpret the model accurately. Also, in many cases, a heat map is utilized to visualize a relevance matrix; the relevance scores are converted to colors to create the map. However, unlike LRP analysis for an image recognition application, where the heat map overlaid on an image gives an intuitive explanation of the patterns learned by the model, a little more care must be exercised here in interpreting a heat map overlaid on a time-series feature matrix.
As is evident from Fig. 4, the vertical dimension of the relevance score matrix is associated with time steps, and the horizontal dimension with features. If one wishes to know which time steps are most relevant, the matrix can be collapsed along the feature dimension by summing the relevance scores across features to form a column vector, from which the key time steps can easily be discerned. Likewise, by summing the relevance scores across time steps, a row vector can be obtained that identifies the key features. After performing the summation over the time-step dimension for all the correctly classified samples, the resulting distribution over features is shown in Fig. 5 (the features are listed in Table 3 in the Appendix). This summed relevance score illustrates the impact of each feature on the output of the LSTM-RNN model. It is noted that there are significant variations in the scores associated with the features. Some features (such as #4, 10, 20, and 21) have large sums, indicating a strong impact on the output, while others (such as #2, 3, 16, and 28) have small sums, suggesting little importance in terms of the output.
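The two summations can be expressed directly as axis-wise sums over the 15 × 35 relevance matrix (random values standing in for real relevance scores):

```python
import numpy as np

# Stand-in relevance matrix: rows = 15 time steps, columns = 35 features.
R = np.abs(np.random.default_rng(1).standard_normal((15, 35)))

feature_scores = R.sum(axis=0)    # collapse time steps -> one score per feature
timestep_scores = R.sum(axis=1)   # collapse features -> one score per time step
top_feature = int(np.argmax(feature_scores))   # index of the most relevant feature
```

Both vectors sum to the same total (the whole matrix), so neither summation loses relevance; they simply project it onto different dimensions.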
In order to validate LRP's capability of interpreting the LSTM-RNN model, a comparison was made with a commonly used approach for examining the importance of variables within a nonlinear model, namely sensitivity analysis (SA) [32]. SA evaluates relevance based on the model's locally evaluated gradient or some other local measure of variation. A common formulation of sensitivity analysis based on gradients defines relevance scores as:

R_i(x) = (∂f/∂x_i)²

where the gradient is evaluated at the data point x. The summed relevance scores of the features based on this SA approach are also shown in Fig. 5 for comparison. It presents a different profile, i.e., a different set of important features, from the LRP method. To quantitatively validate that LRP provides a better interpretation, feature removal experiments were performed. The idea of these experiments is that removing the features most relevant to the model output is most likely to change the model decision. In these experiments, the LSTM-RNN model was retrained with the same number of important features removed from the input based on either the LRP or the SA approach, and the losses in testing accuracy were compared for the two approaches. Four experimental trials were conducted (trial 1: the most relevant feature removed; trial 2: the two most relevant features removed; and so forth) for each approach (LRP vs. SA). For each trial, 5 sets of initializing parameters were considered during the training phase to obtain 5 models; the same parameter sets were used for each approach (LRP and SA) in a given trial.
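The gradient-based SA score can be illustrated on a toy differentiable function, approximating the gradient by central finite differences (the function f here is made up for illustration; for a real Keras model one would use automatic differentiation instead):

```python
import numpy as np

def sa_relevance(f, x, h=1e-5):
    """Gradient-based sensitivity: R_i(x) = (df/dx_i)^2, with the
    gradient approximated by central finite differences."""
    R = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        R[i] = ((f(x + e) - f(x - e)) / (2 * h)) ** 2
    return R

f = lambda x: 3.0 * x[0] + 0.5 * x[1] ** 2   # toy model output
x = np.array([1.0, 2.0])
R_sa = sa_relevance(f, x)   # gradients are (3, x[1]) -> scores (9, 4)
```

Note the contrast with LRP: SA measures how sensitive the output is to a local perturbation of each input, not how much each input actually contributed to the prediction.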
It should be noted that the parameters changed from trial to trial because of changes in the dimensionality of the input matrix. For each trial, the model was initialized with the same parameters for both approaches (but, of course, the input matrix differed from trial to trial), and an average accuracy loss was calculated from 5 trainings (corresponding to the 5 sets of initial parameters). The features removed in the experiments are listed in Table 2. The results are shown in Fig. 6. It is clearly seen that the accuracy losses caused by the removal of relevant features identified by LRP were much more significant than those for SA. This suggests that the features identified by LRP are more important to the accuracy of the LSTM-RNN model than those identified with SA.
In a manner analogous to the summation across time steps for each feature, a summation was performed across features to show the summed relevance scores for the time steps, as shown in Fig. 7. The same relevance scores were also computed using SA for comparison. While the SA-based relevance score does not show a distinct pattern related to time, it is clearly seen that the LRP-based relevance score grows with time. This result indicates that the LSTM-RNN model assigns more weight to recent information than to past information. The LRP result also matches the health state label, which corresponds to the last time step in the input; the most recent assessment of the health condition is obviously very relevant.
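The two summations above are simple axis reductions over the LRP relevance map. The sketch below uses an illustrative map shape and random values (assumptions, not the paper's data) to show the per-feature and per-time-step views.

```python
import numpy as np

# Illustrative LRP relevance map: rows = features, columns = time steps.
rng = np.random.default_rng(0)
relevance = rng.random((5, 4))          # 5 features x 4 time steps (assumed)

per_feature = relevance.sum(axis=1)     # summed over time steps (Fig. 5 view)
per_step = relevance.sum(axis=0)        # summed over features (Fig. 7 view)
```

Both reductions conserve the total relevance, so the two views are just complementary marginals of the same map.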

LRP results regarding different categories
The results in Section 5.1 demonstrate the extent to which features and time steps contribute to the decisions made by the model. However, this discussion did not examine how the important features, from a relevance standpoint, may differ depending on the health state (category). To gain insight into how features could influence an individual classification decision, the relevance scores of the features for each category were computed separately. The relevance scores were normalized to eliminate the effect of different sample sizes for the categories. Figure 8 shows the relevance scores of the features for the 4 health states ("normal I," "normal II," "warning," and "failing"). It is seen that the impact of a given feature on one state can be dramatically different from its impact on another state. For example, feature #4 contributes more to the classification of the "failing" state than to the other three states, while feature #10 has major relevance to the identification of the "normal II" state but little relevance to the "failing" state. Such observations suggest that a feature can strongly influence classification into one health state while playing a very small role in classification into another. Evidently, when the model is identifying different bearing health states, the features are weighted differently by the LSTM-RNN model. In other words, the relevance of the features to the model output varies by category.
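The per-category profiles can be built by grouping per-sample relevance vectors by class label and normalizing each group's sum, so that categories with more samples do not dominate. The sample counts, feature count, and values below are illustrative assumptions; only the state names come from the text.

```python
import numpy as np

states = ["normal I", "normal II", "warning", "failing"]
rng = np.random.default_rng(1)
sample_relevance = rng.random((200, 10))         # 200 samples x 10 features (assumed)
labels = rng.integers(0, len(states), size=200)  # each sample's health state

profiles = {}
for c, name in enumerate(states):
    # Sum relevance over this category's samples, then normalize so that
    # differing sample sizes do not bias the comparison across categories.
    total = sample_relevance[labels == c].sum(axis=0)
    profiles[name] = total / total.sum()
```

Each resulting profile sums to 1, so profiles of different categories are directly comparable feature by feature.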
To validate these findings, experiments were conducted by retraining the model with a single feature removed and examining how this influences the classification accuracy for each state. In PdM, the major focus is on identifying when a critical state is entered ("warning" and "failing") for safety and maintenance purposes. In this experiment, feature #4 (entropy) was removed from the input, since it has the most relevance to the "failing" state. A second set of input data (serving as a control case) was constructed by removing a relatively unimportant feature (feature #2, variance). The method was applied to both data inputs to obtain LSTM-RNN models. These two models were trained and tested in the same manner as the original model but with one less feature in the input (the input is a 14 × 35 matrix). Confusion matrices of the testing results from the two retrained models, as well as the original model, are presented in Fig. 9.
Compared to the original model, the removal of feature #4 (entropy) resulted in accuracy declines of 13.7% (95.9 to 82.2%) and 15.3% (95.6 to 80.3%) in the identification of the "warning" and "failing" states, respectively. It aggravated the confusion between those two states as well. In comparison, the removal of feature #2 did not cause prominent changes in the accuracy of true state prediction, which corresponds to its slight relevance in the LRP analysis.
This result can be explained by taking a closer look at the two features discussed above. According to the definitions in [33], variance simply measures the dispersion of a signal about its mean value, while entropy quantifies the uncertainty and randomness in the information content of the data, which may better reflect the dynamics of bearing vibration. In Fig. 10, the entropy and variance obtained from a raw vibration signal are plotted throughout the lifespan of a bearing. As is evident, there is merely a sudden surge in the variance just before the bearing fails; virtually no warning is provided. By comparison, the entropy starts increasing when the bearing approaches the later stage of its lifespan, so a relatively early warning could be provided to the model for identifying the critical states. Again, this can be regarded as evidence that LRP's interpretation reveals the patterns behind the data.
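The contrast between the two features can be sketched on synthetic vibration windows. The histogram-based Shannon entropy below is one common estimator and an assumption here; the exact definition in [33] may differ, and the signals are purely illustrative.

```python
import numpy as np

def window_features(signal, bins=32):
    """Variance and a histogram-based Shannon entropy (in bits,
    bounded above by log2(bins)) of one vibration window."""
    var = float(np.var(signal))
    hist, _ = np.histogram(signal, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                              # drop empty bins before the log
    entropy = float(-np.sum(p * np.log2(p)))
    return var, entropy

rng = np.random.default_rng(2)
healthy = 0.1 * rng.standard_normal(4096)                          # background noise
faulty = healthy + np.sin(np.linspace(0.0, 400.0 * np.pi, 4096))   # added fault tone

var_h, ent_h = window_features(healthy)
var_f, ent_f = window_features(faulty)
```

The added tone raises the variance sharply, while the entropy captures the change in the amplitude distribution's shape rather than just its spread.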

Improving the model efficiency and accuracy with LRP
The analysis above has clearly shown how LRP can be used to explain the relevance between the input and output of an LSTM-RNN model. The reflected relevance has been validated by several experiments. By analyzing the relevance scores of the input layer and the middle layer neurons, several observations were made. Based on these observations, some methods for improving model efficiency and accuracy are discussed below.
When computing the relevance score of each feature to the model output, some features were found to have very limited impact on the model decision-making. When checking the relevance scores for each of the four categories, the relevance scores of certain features remain low, which suggests that those features do not substantially contribute to model decision-making. This observation leads to a promising approach to reducing model complexity: removing insignificant features so as to reduce the model input size while achieving performance comparable to the original model. Iterative trials of retraining the model were performed. These trials removed different combinations of features while keeping the same model architecture, except for the input layer (the number of input layer neurons must equal the number of input features). Finally, a new model with 23 input features (12 features removed: #2, 3, 6, 12, 13, 16, 18, 19, 25, 28, 30, 34) achieved an acceptable overall testing accuracy of 87.79%, compared to the original 90.6%. This model also retained good performance in critical state identification: 90.34% and 100% for the "warning" and "failing" states, respectively.
Fig. 8 Relevance distribution of features with different classes
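The pruning idea reduces to ranking the features by their summed LRP relevance and keeping the top ones before retraining. The relevance values and counts below are illustrative assumptions, not the paper's scores.

```python
import numpy as np

def keep_relevant(relevance, n_keep):
    """Indices of the n_keep most relevant features, in ascending order,
    so the kept columns of the input matrix stay in their original order."""
    order = np.argsort(np.asarray(relevance))[::-1]  # most relevant first
    return np.sort(order[:n_keep])

# Illustrative summed per-feature relevance scores (assumed values).
feature_relevance = np.array([0.9, 0.02, 0.01, 0.8, 0.5, 0.03])
kept = keep_relevant(feature_relevance, 3)
```

In this toy case the three high-relevance features (indices 0, 3, and 4) survive; the low-relevance ones would be dropped from the input before retraining the smaller model.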
The relevance scores of the neurons in the middle fully connected layers (dense #1, dense #2) were also computed. It was found that around 40% (dense #1) and 20% (dense #2) of the neurons in these layers had relevance scores equal or nearly equal to 0. This indicates redundancy in the original model, meaning that fewer neurons could be used. Therefore, the number of neurons in those layers was reduced by 40% in dense #1 and 20% in dense #2, respectively. With this adjustment, the total number of trainable parameters was reduced from 9904 to 5394. After training the reduced model with the same training procedure as the original model, an overall test accuracy of 89.81% (compared to the original 90.6%) was achieved, and the classification performance was comparable to that of the original model.
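The redundancy check itself is simple: count the fraction of neurons in a layer whose summed relevance is (near) zero. The relevance vector below is an illustrative assumption standing in for a dense layer's per-neuron LRP scores.

```python
import numpy as np

def redundant_fraction(neuron_relevance, tol=1e-6):
    """Fraction of neurons whose summed LRP relevance is effectively zero;
    these are candidates for removal when shrinking the layer."""
    neuron_relevance = np.asarray(neuron_relevance)
    return float(np.mean(np.abs(neuron_relevance) < tol))

# Illustrative per-neuron relevance for a small dense layer (assumed values).
dense1_relevance = np.array([0.0, 0.2, 0.0, 0.5, 0.0, 0.1, 0.0, 0.3, 0.0, 0.4])
frac = redundant_fraction(dense1_relevance)   # half of these neurons are inert
```

A measured fraction like this one would then set the layer-size reduction used when rebuilding the leaner model.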
When checking the heat maps of individual samples that were correctly and incorrectly classified, it was seen that the incorrectly classified samples exhibit a large number of pixels with excessive relevance scores (the yellow pixels in Fig. 11 that exceed 0.10). Based on this observation, a double-check mechanism was established, which can be used to evaluate whether a prediction made by the model is trustworthy. The threshold was set as follows: when the number of pixel values exceeding 0.1 is greater than 3, the prediction made by the model is regarded as questionable. For a test dataset containing 2142 samples, the model predicted 2035 samples correctly and 107 samples incorrectly. With this mechanism applied, more than 80% of the incorrectly predicted samples were flagged as questionable. Thus, this mechanism based on LRP analysis proved to be an effective way to further improve model accuracy.
Fig. 9 Confusion matrices for the three models
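The double-check rule stated above can be sketched directly. The heat-map values below are illustrative; only the two thresholds (pixel value 0.10, pixel count 3) come from the text.

```python
import numpy as np

def is_questionable(heatmap, pixel_thresh=0.10, count_thresh=3):
    """Double-check rule: a prediction is questionable when more than
    count_thresh relevance pixels exceed pixel_thresh."""
    return int(np.sum(np.asarray(heatmap) > pixel_thresh)) > count_thresh

clean = np.full((14, 35), 0.05)   # every pixel below the 0.10 threshold
suspect = clean.copy()
suspect[0, :5] = 0.2              # five pixels exceed the threshold
```

A prediction flagged this way would be routed to a fallback (e.g., manual review) rather than trusted outright.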

Conclusion
Many research efforts have applied recurrent neural network models in the field of predictive maintenance. However, few studies have focused on interpreting such models and understanding their decision-making process. This paper addressed this gap by employing model explanation techniques to investigate the factors contributing to the decisions made by an LSTM-RNN model for bearing health condition estimation.
The LRP technique was used to inspect the relevance distribution in the input space. Analysis was conducted on the feature contributions and the time step contributions. A comparison was made with a common approach to understanding the role of factors in a model, namely sensitivity analysis, and it demonstrated LRP's superior capability for model interpretation. Further analysis of the individual classification categories showed that the LSTM-RNN model adopted information from the input data differently when making individual decisions. The LRP analysis also proved useful for improving model efficiency by removing irrelevant features from the input and optimizing the model architecture so as to use fewer parameters. The potential for LRP to help achieve higher classification accuracy was also illustrated.
Overall, these results demonstrate that LRP has great potential for improving neural network transparency and changing the "black box" impression of RNNs. It is also promising to see LRP applied in a predictive maintenance scenario. Additional maintenance challenges are expected to be addressed in future research.

Fig. 1 Graphical depictions of the RNN model. a Recurrent neural network unfolded in time steps. b Long short-term memory (LSTM) cell

Fig. 5 Summed relevance scores for features based on LRP and SA

Fig. 6 Accuracy loss caused by removing features according to LRP and SA
Fig. 7 Summed relevance scores for time steps

Fig. 10 Entropy and variance during the lifespan of a bearing

Table 1 Layer details of the LSTM-RNN

Table 4 List of notations
h_t Hidden layer activation at time t
W Weight matrix of the hidden-to-hidden connection
U Weight matrix of the input-to-hidden connection