An Improved CNN-LSTM Network Based On Hierarchical Attention Mechanism For Motor Bearing Fault Diagnosis

Motors are widely used in industrial production, but frequent motor bearing faults pose a serious safety hazard. Traditional fault diagnosis methods often require prior knowledge of signal processing and are inefficient. To address this problem, artificial intelligence methods have been applied to motor bearing fault diagnosis. Using the raw motor running-state signals collected by sensors, non-invasive real-time detection of motor bearing faults can be realized. This paper presents an improved CNN-LSTM network based on a hierarchical attention mechanism (CALSTM) for motor bearing fault diagnosis. In this method, a convolutional neural network learns the fault characteristics of the raw data, and a hierarchical attention mechanism then estimates the importance of the learned features. Finally, the weighted results are sent to the LSTM network for selection along the time dimension. The method requires no signal processing and adaptively weights the features of each sample learned by the neural network, which improves the interpretability of the learning process. Experiments on the CWRU data set indicate that, compared with several common models, the CALSTM method achieves better diagnostic performance, with an overall accuracy of 99.22%.


1. Introduction
As common industrial equipment, motors are widely used in modern factories for automation and large-scale production. The motor bearing is a common component of motor equipment, and its health has a great impact on the performance, stability and service life of the whole machine. Due to complex operating environments and high-load operating conditions, rolling bearings are prone to damage, and bearing faults account for up to 40% of all motor faults (Lau et al. 2010). In actual production, the failure of a single motor may halt an entire production line and cause huge economic losses. Therefore, finding an accurate and effective fault diagnosis method for motor bearings has become an urgent need in industrial production.
With the development of modern electronic, sensor and detection technology, motor fault diagnosis has advanced further. In 1965, Cooley published the theory of the fast Fourier transform, and spectrum analysis became a research hotspot (Cooley et al. 1965). Various spectrum analyzers emerged and were applied to motor bearing fault diagnosis. By comparing the characteristic frequencies of the vibration signal of a damaged motor bearing with the output of a spectrum analyzer, one can judge whether a fault is present. Since the 1980s, the rapid development of computer technology has given computers powerful information storage and processing capacity, which facilitates the integration of multiple technologies for condition monitoring and fault diagnosis of motor bearings. Because motor running-state data exhibits the characteristics of big data, higher requirements are placed on the robustness, generalization ability and real-time performance of diagnostic technology.
Intelligent diagnosis methods fall mainly into two groups: signal processing methods and artificial intelligence methods. Signal processing methods focus on the effective extraction of hand-crafted characteristic parameters and depend on numerical calculation and signal processing technology. In these methods, the signal is extracted, transformed and analyzed, and the characteristics of mechanical faults are obtained through numerical calculation. The characteristic values commonly used in motor bearing fault diagnosis include time-domain, frequency-domain and time-frequency-domain features. Envelope analysis (Tsao et al. 2012), spectral kurtosis, the Wigner-Ville distribution (WVD) (Choy et al. 2009), the wavelet transform (Siddiqui et al. 2016; Zhang et al. 2013), empirical mode decomposition (Mohanty et al. 2015) and variational mode decomposition (Jinde et al. 2016) are widely used in rolling bearing condition monitoring and fault diagnosis. Signal processing methods are often combined with artificial intelligence methods, where they serve for data processing and feature optimization.
Compared with signal processing methods, methods based on artificial intelligence focus on learning from historical and empirical data, rely less on the calculation and analysis of signals, and show good prospects for fault diagnosis. They do not require the researcher to have a professional background in motors, which makes them more suitable for data-driven research. For example, (Wen and Gao et al. 2017) (Shao et al. 2015). They also made a detailed comparison between the traditional fault detection method and the DBN deep learning method, and showed that the fault diagnosis model using DBN has better accuracy and robustness. Shao H et al. also used an SAE network optimized by the fish swarm algorithm to diagnose motor faults and improve the classification accuracy (Shao et al. 2017).
CNN can capture sensitive fault information without expert knowledge, so it is becoming increasingly popular among methods for motor bearing fault diagnosis. CNN can extract invariant features from the raw vibration data, but it cannot take into account the temporal characteristics of the motor bearing vibration signal itself. Therefore, considering the effectiveness of LSTM in processing time-series data, this study combines CNN and LSTM to form a CLSTM network.
The CLSTM network takes into account both feature learning and the temporal correlation of the data, but the CNN is still a black box and cannot automatically identify which features are more important. Therefore, this paper uses a hierarchical attention mechanism to identify important features. The hierarchical attention mechanism was first applied to text classification, where it improves classification performance (Yang et al. 2016). In this paper, the hierarchical attention mechanism connects the features extracted by the CNN with the fault diagnosis results, so that the effect of CNN feature extraction can be seen intuitively. Then, without changing the temporal order of the CNN output features, the weighted results are sent in sequence to the LSTM network for further learning. We call this method CALSTM. This paper discusses the application of the CALSTM method to motor bearing fault diagnosis, which requires neither complex data preprocessing nor signal processing of the motor bearing vibration signal.
Researchers without a professional background in electrical engineering or signal processing can easily obtain rolling bearing fault diagnosis results with this model. Compared with the results of other methods, the fault diagnosis accuracy obtained by the proposed method is also improved to some extent.
The rest of this paper is arranged as follows: Section 2 introduces the method, Section 3 presents the experiments and discusses the experimental results, and Section 4 draws conclusions.

2. Methodology
In this section, the proposed CALSTM method is introduced. First, we introduce the convolutional neural network (CNN), then the hierarchical attention mechanism and the LSTM model, and finally the specific architecture of the CALSTM network, which combines these three parts.

2.1 Convolutional neural network (CNN)
The convolutional neural network (CNN) is inspired by the receptive field mechanism in biology and is an artificial neural network (ANN) with a special structure. Unlike in a traditional fully connected network, each neuron in a feature map of a CNN layer is only sparsely connected to a small part of the neurons in the previous layer. CNN is characterized by local receptive fields, shared weights and spatial sub-sampling, and its hidden layers are divided into convolutional layers, activation layers and pooling layers.
The convolution operation, which extracts features by sliding a filter over the original image, can be defined as the multiplication between the input information $I$ and the filter (convolution kernel) $w$:

$$Z = f(I \otimes w + b) \tag{1}$$

where $b$ stands for the bias and $f$ represents the activation function, which performs a nonlinear mapping and increases the nonlinear separation capability. After convolution and activation, several feature maps are obtained, and the overall feature map group is $Z = \{Z^1, Z^2, \dots, Z^D\}$.
Pooling not only extracts the most important local information in each feature map but also significantly reduces the feature size. The pooling layer therefore compresses the amount of data and parameters, reduces overfitting and lowers the complexity of the network. Each feature map $Z^d$ is divided into regions $R^d_{l,h}$. These regions may or may not overlap, depending on the step size of the sliding window at the time of sampling. Each region is down-sampled to a single value that summarizes the region. The pooling layer uses max pooling, that is, the maximum value of all neurons in each region $R^d_{l,h}$ is selected as the representative of that region:

$$Y^d_{l,h} = \max_{i \in R^d_{l,h}} z_i \tag{2}$$

where $z_i$ is the value of each neuron in the region $R^d_{l,h}$. For each feature map $Z^d$, the output of the pooling layer $Y^d$ can be expressed as:

$$Y^d = \{ Y^d_{l,h} \} \tag{3}$$

In the convolutional neural network, both the sliding of the filter in the convolutional layer and the sliding window in the pooling layer rely on the padding ("filling") setting to control the size of the features.
That is, the structure of the input data is assumed to be $W_1 \times H_1 \times D_1$, and the structure of the output data is $W_2 \times H_2 \times D_2$. Because the data structures of the convolutional layer and the pooling layer are calculated in a similar way, the formulas under the "no padding" option can be expressed as:

$$W_2 = \frac{W_1 - F}{S} + 1 \tag{4}$$

$$H_2 = \frac{H_1 - F}{S} + 1 \tag{5}$$

$$D_2 = K \tag{6}$$

where $F$ is the size of the filter (or of the pooling window), $S$ is the size of the sliding step and $K$ is the number of filters.
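The convolution, max pooling and no-padding size rule described above can be illustrated with a minimal NumPy sketch. The 20 × 20 input, the 32 filters of size 5 × 5 with stride 1, the ReLU activation and the random weights are illustrative assumptions, not settings confirmed by this section; the function names are hypothetical.

```python
import numpy as np

def output_size(n_in, f, s):
    """Side length after an unpadded sliding window: (n_in - F) / S + 1."""
    return (n_in - f) // s + 1

def conv2d_valid(image, kernels, bias, stride=1):
    """Unpadded 2-D convolution (cross-correlation form) followed by ReLU.

    image: (H, W); kernels: (K, F, F); bias: (K,). Returns (K, H2, W2).
    """
    k, f, _ = kernels.shape
    h2 = output_size(image.shape[0], f, stride)
    w2 = output_size(image.shape[1], f, stride)
    out = np.zeros((k, h2, w2))
    for d in range(k):
        for r in range(h2):
            for c in range(w2):
                patch = image[r * stride:r * stride + f, c * stride:c * stride + f]
                out[d, r, c] = np.sum(patch * kernels[d]) + bias[d]
    return np.maximum(out, 0.0)  # ReLU is an assumed activation

def max_pool(feature_maps, f=2, stride=2):
    """Max pooling: keep the largest value in each f x f region of every map."""
    k, h, w = feature_maps.shape
    h2, w2 = output_size(h, f, stride), output_size(w, f, stride)
    out = np.zeros((k, h2, w2))
    for r in range(h2):
        for c in range(w2):
            region = feature_maps[:, r * stride:r * stride + f, c * stride:c * stride + f]
            out[:, r, c] = region.max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((20, 20))      # assumed 20 x 20 input sample
kernels = rng.standard_normal((32, 5, 5))  # assumed: 32 filters of size 5 x 5
bias = np.zeros(32)

z = conv2d_valid(image, kernels, bias, stride=1)
y = max_pool(z, f=2, stride=2)
print(z.shape, y.shape)
```

Under these assumed settings the convolution gives 32 maps of 16 × 16 and the pooling gives 32 maps of 8 × 8, which is what the size rule predicts for each step.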

2.2 Hierarchical attention mechanism
Yang Z et al. proposed the concept of hierarchical attention. The hierarchical attention network was first used for text classification. The network is divided into two parts: a "word attention" part and a "sentence attention" part (Yang et al. 2016). The whole network divides a sentence into several parts and maps each part into a vector by using a bidirectional RNN combined with the attention mechanism. Finally, the resulting sequence of vectors passes through another bidirectional RNN layer combined with the attention mechanism, so that text classification can be realized. The classification performance of this method is clearly better than that of other methods.
In this study, a similar hierarchical attention network is designed: a hierarchical attention layer is inserted between the CNN and the LSTM of the CLSTM model. First, hierarchical attention replicates the output of the CNN and then maps the copy to a set of vectors that reflect the importance of each feature in the feature map. Here $Y$ is the set of all $Y^d$, that is, the feature maps output by the CNN for the feature map group $Z$. The vectors obtained by softmax are multiplied by the output matrix of the CNN to obtain the weighted feature $X$.
Hierarchical attention assigns a different weight to each output vector of the CNN, so that the model can focus on key features and reduce the role of other features. Using the attention mechanism makes the features expressed by the CNN better suited to the current task.
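The weighting step described above can be sketched as follows, under stated assumptions. The text only says that softmax weights are multiplied by the CNN output; the linear scoring layer (`W_a`, `b_a`), the softmax taken over the positions of each feature map, and the 32 × 8 × 8 shape of the stand-in CNN output are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weighting(feature_maps, W_a, b_a):
    """Weight each feature map by a softmax attention matrix of the same shape.

    feature_maps: (D, H, W). W_a: (H*W, H*W) and b_a: (H*W,) are hypothetical
    learnable parameters of the scoring layer.
    """
    d, h, w = feature_maps.shape
    flat = feature_maps.reshape(d, h * w)  # copy of the CNN output, one row per map
    scores = flat @ W_a + b_a              # assumed linear scoring
    weights = softmax(scores, axis=1).reshape(d, h, w)
    return weights * feature_maps, weights  # element-wise weighting keeps the shape

rng = np.random.default_rng(0)
Y = rng.standard_normal((32, 8, 8))        # stand-in for the CNN output
W_a = rng.standard_normal((64, 64)) * 0.1  # hypothetical parameters
b_a = np.zeros(64)

X, A = attention_weighting(Y, W_a, b_a)
print(X.shape, A[0].sum())
```

Because the scores are computed from each sample's own features, two different samples receive two different weight matrices, and the output keeps the shape of the CNN output.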

2.3 Long Short-Term Memory (LSTM)
The Recurrent Neural Network (RNN) has the ability to process sequential information and can represent the relationship between the current output of a sequence and the previous information.

Fig. 1 Structure diagram of recurrent neural network
As shown in Fig. 1, the neurons in the RNN hidden layer are connected to each other. After the network receives the input $x_t$ at time $t$, the value of the hidden layer is $s_t$ and the output value is $o_t$. The key point is that the value of $s_t$ depends not only on $x_t$ but also on $s_{t-1}$.
The following formulas represent the computation of the recurrent neural network:

$$o_t = g(V s_t) \tag{9}$$

$$s_t = f(U x_t + W s_{t-1}) \tag{10}$$

where $V$ and $U$ respectively represent the weight matrices from the hidden layer to the output layer and from the input layer to the hidden layer, $W$ is the weight matrix applied to the previous hidden state $s_{t-1}$, and $g$ and $f$ are nonlinear activation functions.
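The recurrence described above can be sketched in a few lines of NumPy. Choosing tanh for f and softmax for g, the hidden-to-hidden matrix name `W`, the layer sizes and the random weights are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Run a simple recurrent network over a sequence.

    xs: (T, n_in) inputs; U: (n_hid, n_in); W: (n_hid, n_hid); V: (n_out, n_hid).
    """
    s = np.zeros(U.shape[0])          # initial hidden state
    states, outputs = [], []
    for x_t in xs:
        s = np.tanh(U @ x_t + W @ s)  # s_t depends on x_t and on s_{t-1}
        o = softmax(V @ s)            # o_t is computed from s_t
        states.append(s)
        outputs.append(o)
    return np.array(states), np.array(outputs)

rng = np.random.default_rng(0)
xs = rng.standard_normal((5, 3))
U = rng.standard_normal((4, 3)) * 0.5
W = rng.standard_normal((4, 4)) * 0.5
V = rng.standard_normal((2, 4)) * 0.5

states, outputs = rnn_forward(xs, U, W, V)
print(states.shape, outputs.shape)
```

Changing only the first input changes the hidden states at later time steps as well, which is the dependence on previous information that the text refers to.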
However, gradients tend to vanish during back propagation in an RNN, so an RNN may fail to capture long-term dependencies in the data. Long Short-Term Memory (LSTM), a special RNN, was therefore proposed to address the vanishing and exploding gradient problems when training on long sequences (Hochreiter and Schmidhuber 1997). LSTM performs better than a standard RNN on longer sequences. Its structure is shown in Fig. 2:

Fig. 2 LSTM network structure diagram
The LSTM network can delete or add cell state information through a structure called a gate.
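Generally, the LSTM is controlled by three gates, called the forget gate, the input gate and the output gate. At each time step, the cell units are accessed, written and cleared through these gates to control the flow of information along the data sequence, thereby enhancing the ability to learn long-term dependencies.
First, the LSTM has to forget some information. This part is handled by a sigmoid unit called the forget gate, as shown in formula (11):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{11}$$

After that, the LSTM network adds new information. The input gate decides which values to update, and a tanh unit creates the candidate cell state $\tilde{C}_t$:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{12}$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{13}$$

The old cell state $C_{t-1}$ is then updated to the new cell state $C_t$. Formula (14) can be expressed as:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{14}$$

The output gate decides the output of the final RNN unit, as shown in formulas (15) and (16):

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{15}$$

$$h_t = o_t * \tanh(C_t) \tag{16}$$

In the above formulas, $W$ and $b$ denote the weight matrix and bias of the corresponding operation, $\sigma$ is the sigmoid function, $h_t$ is the hidden state at time $t$, and $*$ denotes element-wise multiplication.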
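The gate operations described above can be sketched as a single LSTM time step. The sketch follows the standard LSTM formulation; the parameter names, the hidden size of 16, the 64-dimensional input and the random weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step. p holds weights W_* of shape (n_hid, n_hid + n_in)
    and biases b_* of shape (n_hid,)."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])    # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])    # input gate
    c_hat = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat          # updated cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])    # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

n_in, n_hid, n_steps = 64, 16, 32             # assumed sizes
rng = np.random.default_rng(0)
params = {}
for gate in ("f", "i", "c", "o"):
    params["W_" + gate] = rng.standard_normal((n_hid, n_hid + n_in)) * 0.1
    params["b_" + gate] = np.zeros(n_hid)

sequence = rng.standard_normal((n_steps, n_in))
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

The cell state is only ever scaled by the forget gate and incremented by the gated candidate, which is the additive path that helps gradients survive over long sequences.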

2.4 Combining method (CALSTM)
For the convenience of calculation, the signal is normalized in this study. Then, the 10 kinds of signals are divided into samples, each consisting of 400 consecutive data points. Each sample must meet the input requirements of the CNN, so it is reshaped accordingly.
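The preprocessing steps above (normalize, cut into 400-point samples, reshape) can be sketched as follows. Min-max normalization, non-overlapping segments, and the synthetic signal standing in for a vibration recording are assumptions; the 20 × 20 target shape follows the input layer described for the experimental network.

```python
import numpy as np

def make_samples(signal, sample_len=400, side=20):
    """Normalize a 1-D signal and cut it into square 2-D samples.

    Assumes min-max normalization to [0, 1] and non-overlapping segments;
    trailing points that do not fill a whole sample are dropped.
    """
    if side * side != sample_len:
        raise ValueError("sample_len must equal side * side")
    signal = np.asarray(signal, dtype=float)
    normalized = (signal - signal.min()) / (signal.max() - signal.min())
    n_samples = len(normalized) // sample_len
    return normalized[: n_samples * sample_len].reshape(n_samples, side, side)

# Synthetic stand-in for one vibration recording (not CWRU data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 100.0, 12345)
signal = np.sin(t) + 0.1 * rng.standard_normal(t.size)

samples = make_samples(signal)
print(samples.shape)
```

Each class of signal would be processed in the same way, with the class label attached to every sample cut from that recording.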
Data preprocessing of the vibration signal is shown in Fig. 5. The data volume changes as it passes through the network. The activation layer only changes the mapping of each neuron and does not change the structure of the data, so the data structure after the activation layer is the same as after the convolutional layer; the final feature maps are obtained after pooling. According to formulas (4)-(6), the output data structure of the CNN is 32 × 8 × 8. Then, the results of the CNN are copied and processed by the hierarchical attention layer to obtain the weighted features without changing the data dimension, that is, after the attention mechanism the output data dimension remains 32 × 8 × 8. The results are sent into the LSTM network, where the input size of each time step is 8 × 8 and the number of time steps is 32. To perform diagnostic classification, a fully connected layer is added to the output of the LSTM network, followed by softmax, to obtain the 10 kinds of fault diagnosis results.
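The shape bookkeeping between the attention output, the LSTM input and the classification layer can be illustrated with a short sketch. The LSTM itself is not run here: `h_last` is a random stand-in for its final hidden state, and the hidden size of 64 and the random weights are assumptions. The class names are the ten bearing states used in this study.

```python
import numpy as np

CLASS_NAMES = ["Normal", "IR007", "IR014", "IR021", "OR007",
               "OR014", "OR021", "B007", "B014", "B021"]

def to_sequence(features):
    """Turn 32 weighted 8 x 8 maps into 32 time steps of 64 values each."""
    return features.reshape(features.shape[0], -1)

def classify(h_last, W_fc, b_fc):
    """Fully connected layer followed by softmax over the fault classes."""
    logits = W_fc @ h_last + b_fc
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
weighted = rng.standard_normal((32, 8, 8))  # stand-in for the attention output
sequence = to_sequence(weighted)            # shape (32, 64): time steps x features

hidden_size = 64                            # hypothetical LSTM hidden size
h_last = np.tanh(rng.standard_normal(hidden_size))  # stand-in for the LSTM output
W_fc = rng.standard_normal((len(CLASS_NAMES), hidden_size)) * 0.1
b_fc = np.zeros(len(CLASS_NAMES))

probs = classify(h_last, W_fc, b_fc)
predicted = CLASS_NAMES[int(np.argmax(probs))]
print(sequence.shape, probs.shape, predicted)
```

The reshape keeps the order of the 32 feature maps, so the LSTM reads them as an ordered sequence rather than as an unordered set.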

Experimental results
First, the improvement brought by hierarchical attention was evaluated. In Fig. 6 and Fig. 7, the abscissa represents the predicted labels for each health state, and the ordinate represents the true labels. Adding the hierarchical attention mechanism to both the LSTM and the BiLSTM networks improves the overall diagnostic performance to some extent. In Fig. 6, the "LSTM+Attention" method performs markedly better than the LSTM network in diagnosing the moderate fault in the outer raceway(OR021), the minor fault in the ball (B007) and the severe fault in the ball (B021). In Fig. 7, the BiLSTM+Attention method also achieves better results than the plain BiLSTM network, with higher accuracy in the categories of normal state (Normal), moderate fault in the outer raceway (OR014), severe fault in the outer raceway (OR021), minor fault in the ball (B007) and severe fault in the ball (B021).
In Fig. 9, the darker the grid color, the higher the weight coefficient of the corresponding feature point, and the more important the feature learned in the feature map obtained by the CNN. For example, as shown in Fig. 9 (a), the weight coefficient corresponding to the darkest grid is 0.0356231, which is the maximum value in the matrix. Therefore, after feature weighting, the feature point at this position has a greater impact on the classification result. The hierarchical attention mechanism adds a weight representation to the feature maps learned by the CNN and strengthens the link between the importance of a feature and the result. Moreover, the learning effect of the CNN can be seen intuitively through the visualization of the hierarchical attention mechanism, which enhances the interpretability of the neural network to a certain extent.
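The confusion matrices discussed for Fig. 6 and Fig. 7 place true labels on one axis and predicted labels on the other. A minimal sketch of how such a matrix and the per-class accuracy can be computed is given below; the toy labels are hypothetical and are not results from this study.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

def overall_accuracy(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Hypothetical toy labels for three classes.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(true_labels, predicted_labels, n_classes=3)
print(cm)
print(per_class_accuracy(cm), overall_accuracy(cm))
```

Comparing the diagonal entries of two such matrices class by class is what allows statements such as "higher accuracy in category B007" to be made for two models.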

4. Conclusion
This study proposes a CALSTM fault diagnosis method. The method does not require complex data preprocessing; for the convenience of calculation, it only normalizes the raw signal and divides it into sample segments. Such simple processing does not require the researcher to have a professional background in motors, or even to know signal processing. The method improves the CNN-LSTM method by adding a hierarchical attention layer that adaptively assigns weights to the features of each sample learned by the neural network. This improvement enhances the performance of the model, which achieves an accuracy of 99.22%. The diagnostic performance of this method surpasses that of common machine learning and deep learning methods, and it has great potential in the field of motor bearing fault diagnosis. In future work, we will consider deploying the model on embedded terminals and improving its computing efficiency through model pruning.
The visualization of hierarchical attention on a test set
Using transfer learning, the maximum mean discrepancy (MMD) between the source problem and the target problem was minimized, realizing domain-adaptive cross-domain fault diagnosis. In this research background, the most common motor fault diagnosis methods are CNN, DBN and SAE. CNN is a supervised neural network model that is good at image processing, and it now also shows strong capability in fault diagnosis. Hoang D T et al. proposed a new CNN-based method to diagnose rolling bearing faults (Hoang et al. 2018). By converting 1-D vibration signals into 2-D images and exploiting the effectiveness of CNN in image classification, this method achieves very high accuracy and robustness in noisy environments without any feature extraction technique. Wen L et al. converted the signal into a two-dimensional (2-D) image and proposed a new CNN method based on LeNet-5 for fault diagnosis, which improved the diagnostic performance (Wen et al. 2017). As an unsupervised learning method, DBN is one of the earliest methods proposed in the field of deep learning. Tamilselvan et al. proposed a new multi-sensor fault diagnosis method using a deep belief network (DBN) and successfully applied it to the fault diagnosis of aircraft engines (Tamilselvan and Wang 2013). Shao H et al. used DBN to realize the fault diagnosis of rolling bearings under variable working conditions and high noise.

Fig. 3 The basic framework of the CALSTM model

3. Experiments

3.1 Dataset and Data preprocessing
The CWRU bearing data set provided by the Bearing Data Center of Case Western Reserve University is widely used in the study of bearing fault diagnosis. Its acquisition platform is shown in Fig. 4 and is mainly composed of a motor, a torque sensor, a power meter and electronic control equipment. The motor is set to run at speeds of 1797, 1772, 1750 and 1730 rpm, corresponding to loads of 0, 1, 2 and 3 HP respectively. Bearing wear faults of 7, 14, 21 and 28 mils are introduced manually, and the vibration signal is sampled at 12 kHz and 48 kHz.

Fig. 4 CWRU rolling bearing data set acquisition platform (Picture from https://csegroups.case.edu/bearingdatacenter/pages/apparatus-procedures)
In this study, drive end signals with a sampling frequency of 12 kHz, a bearing speed of 1797 rpm and a load of 0 HP were used. The bearing states include the normal state, inner raceway fault, outer raceway fault and ball fault, with minor (7 mils), moderate (14 mils) and severe (21 mils) wear degrees. The data can therefore be divided into 10 types of bearing states, namely normal state (Normal), minor fault in the inner raceway (IR007), moderate fault in the inner raceway (IR014), severe fault in the inner raceway (IR021), minor fault in the outer raceway (OR007), moderate fault in the outer raceway (OR014), severe fault in the outer raceway (OR021), minor fault in the ball (B007), moderate fault in the ball (B014) and severe fault in the ball (B021).

Fig. 5 Data preprocessing process diagram of vibration signal

3.2 Data preprocessing
3.2.1 Experimental network
The input layer of the CALSTM model has 400 neurons arranged in a 20 × 20 shape. The input is first fed into the CNN for two-dimensional convolution. The size of the convolution kernel is 2 × 2, and the sliding step is 2 × 2. The sliding window of the pooling layer is 2 × 2, and its sliding step is 2 × 2. To prevent external padding from distorting the distribution of the data itself, neither the convolutional layer nor the pooling layer uses padding.

Fig. 9 shows the visualization of the attention layer corresponding to four arbitrary feature maps in the test set. It can be seen intuitively that the hierarchical attention mechanism gives different weights to different feature points. In Fig. 9, the four hierarchical attention layers correspond to four randomly selected samples, which means that the proposed CALSTM method does not apply a fixed weighting coefficient to the learned features of every sample but adaptively weights the learned features of each sample. Although this approach increases the computational cost, it makes the model more flexible and ensures that the learned features of each sample obtain their own weight coefficient matrix.

Table 1 Fault diagnosis accuracy of different algorithms on CWRU data set
Table 1 shows the experimental results of different methods on the CWRU data set. The results show that the CALSTM method proposed in this study achieves an accuracy of 99.22%, which is higher than that of the other methods. The accuracies of traditional machine learning methods such as decision tree, SVM and KNN are 52.67%, 59.44% and 68.89% respectively. And the diagnostic results of common neural network models such as DFCN (Deep