Faulty gear diagnosis using weighted PCA with swish activated BLSTM classifier

Early diagnosis of faulty gears is essential in industry. In the current decade, with the tremendous growth of artificial neural networks (ANN), researchers have turned to deep learning (DL) methods to detect gear faults at an early stage. Traditional gear fault diagnosis methods mostly use deep neural networks (NN) that follow the time sequence of the gathered signals. In this setting, feature extraction in the reverse time direction of the signal is commonly ignored. To overcome this issue, this paper proposes Weighted Principal Component Analysis (WPCA) and Bi-Directional Long Short-Term Memory (BLSTM) with the Swish activation function for faulty gear diagnosis from vibration signals. WPCA is used to extract multi-scale features related to gear faults from the vibration signal, and BLSTM is used to classify the extracted features so that the fault is diagnosed at an early stage. Several experiments were conducted on three datasets to evaluate how accurately the proposed work classifies the type of gear fault from the vibration signal. The results indicate that the proposed work classifies gear faults more efficiently than existing methods.


Introduction
Gearboxes are widely used in rotary machines such as engines, industrial gearboxes and power stations, and are essential elements of many mechanical systems. However, a gearbox is subjected to tedious and harsh working conditions, which make components such as gears prone to failure [18]. Fatigue in the rotor elements of a dynamic rotor system produces transverse vibration, which can cause catastrophic failures and damage to machinery. Vibration signals arise from defects in machine components such as pulley and belt faults, looseness, rotor unbalance, cracks, misalignment, shaft bow, and coupling and rotor hub defects. Gear failure affects the regular operation of mechanical equipment and, in some cases, damages the entire machine, which threatens people's property and lives. It is therefore necessary to extract and separate the fault features of faulty gears in order to maintain the working efficiency of mechanical equipment and to safeguard people's property and lives. Related studies report that a localized gear fault produces a periodic transient pulse at constant speed [22]. Because of this characteristic, vibration signal analysis is an effective and reasonable technique. Apart from the transient component with larger amplitude, a gear fault signal contains harmonic components produced by the meshing vibration of the gear pair [29]. Both components are easily submerged in strong noise [30,31], which makes extracting the transient component difficult, so the accurate extraction of fault features from faulty gears is a key research problem. Figure 1a shows a multi-body dynamic simulation used to investigate the relationship between the system and the vibration modes of the gearbox.
To overcome the aforementioned issues, this work uses Weighted Principal Component Analysis (WPCA) to extract multi-scale features of gear faults. For classification, BLSTM (Bidirectional Long Short-Term Memory) with the Swish activation function is used; BLSTM supports the classification of faults in the gearbox. Figure 1b illustrates the vibration experiment carried out to evaluate modal optimization for decreasing gear vibration.

Objective of paper
- To classify faulty gears from the vibration signal of the gearbox.
- To classify gear faults using the BLSTM method along with the Swish activation function.

Organization of Paper
The remainder of the paper is organized as follows. Section 2 discusses related work on WPCA, BLSTM and the Swish activation function. Section 3 describes the proposed work for extracting and classifying gear faults. Section 4 discusses the performance analysis of the proposed work, and Section 5 presents the conclusion.

Related works
Karim et al. [12] proposed a multivariate time series classification model based on LSTM-FCN (Long Short-Term Memory Fully Convolutional Network), augmenting the convolutional blocks with squeeze-and-excitation blocks to enhance accuracy. The model outperforms existing models while requiring fewer preprocessing steps. The architecture performs efficiently on various complex multivariate time series classification tasks such as action recognition. Additionally, it is efficient at test time and small enough to be deployed on memory-constrained systems.
Greff et al. [6] presented the first large-scale analysis of eight LSTM variants on three representative tasks: handwriting recognition, speech recognition and polyphonic music modeling. The hyperparameters of every variant were optimized for each task using random search, and their importance was assessed using the fANOVA framework. The results show that none of the variants improves significantly on the standard LSTM architecture. Additionally, the analysis of the individual hyperparameters offers guidelines for their efficient adjustment.
Karim et al. [11] proposed augmenting a CNN (Convolutional Neural Network) with LSTM-RNN (Recurrent Neural Network) sub-modules to classify time series. The approach markedly improves CNN performance with only a minor increase in model size and requires little preprocessing of the dataset. The proposed LSTM-FCN (Fully Convolutional Network) attains better performance than other methods. Additionally, an attention mechanism is used to enhance time series classification (Attention LSTM-FCN). Refinement is then proposed as a method to improve the performance of trained models. The overall performance of the approach is evaluated and compared with other existing techniques.
Akram et al. [2] compared EMD (Empirical Mode Decomposition) with traditional PSD (Power Spectral Density) and time waveform analysis for detecting localized spur gear faults. A test rig was first built to analyze gear vibration at various RPM, and a specific fault was then introduced in the driven gear under different damage conditions. Data recorded by a wireless tri-axial accelerometer were analyzed using the PSD and EMD methods, and the results indicate that EMD performs better than the traditional time waveform and PSD methods.
Mishra et al. [19] introduced a fault diagnosis technique that integrates three methods: EMD, PSO-SVM (Particle Swarm Optimization-Support Vector Machine) and fractal box dimension. First, the vibration signal of the faulty gear is decomposed into several IMFs (Intrinsic Mode Functions) by EMD. Then, features are computed from the time, frequency, energy and fractal (box dimension) domains, which yields the characteristics of the gear fault under various load excitations. Finally, the extracted features are given to the PSO-SVM model to classify the gear fault. The results indicate that the method identifies the kind of gear fault effectively under various load excitations.
Huang et al. [9] proposed using the minimax concave penalty function to construct the objective function. Compared with other penalties, this function better preserves high-amplitude components. The dictionaries were then constructed from Fourier bases; the elements of each dictionary form a tight frame, which simplifies calculation and avoids computing a high-dimensional inverse matrix during iterative optimization. By constructing multiple dictionaries, multiple components of the gear vibration signal were extracted. The sparse representation coefficients were computed using the split augmented Lagrangian shrinkage algorithm, and with the constructed dictionaries the transient components could be extracted. The experimental results indicate that the harmonic components and fault transients were extracted accurately without underestimating high-amplitude components, and the comparison shows that the method outperforms existing work.
Li et al. [14] presented a feature extraction technique that integrates EMD and ALC (Autocorrelation Local Cepstrum) to diagnose faults in a multistage gearbox. First, in the preprocessing step, signal reconstruction was applied to address the oversampling problem caused by the high resolution of the angular sensor and the test speed. Adaptive EMD was then used to obtain several IMFs (Intrinsic Mode Functions). Different IMFs have different sensitivity to a fault, so not all IMFs are used for further analysis; the cosine similarity metric is used to choose the most sensitive IMF. The IMF alone is not sufficient for fault diagnosis, so ALC is used for feature extraction and signal denoising. The robustness of the method was evaluated experimentally on two gear test rigs with gears in various working conditions. The results indicate that the method is efficient in diagnosing gear faults.
Li et al. [15] proposed VMD (Variational Mode Decomposition) and a DNN (Deep Neural Network) for diagnosing gear faults. Three-axis vibration signals of the gear were gathered and decomposed by VMD into narrowband components with various centre frequencies and bandwidths. PSE (Power Spectral Entropy) is used as the original feature to describe the distribution and spectral amplitude of every component. A DNN based on an AE (Autoencoder) and a backpropagation NN is used for feature reduction and for classifying the gear states. The results indicate that the method is able to extract sensitive features and diagnose faults.
Athiwaratkun and Stokes [3] proposed several malware classification architectures based on LSTM and GRU (Gated Recurrent Unit) language models. An attention mechanism was also presented, and finally a single-stage malware classifier based on a character-level CNN was introduced. The experimental results indicate that LSTM with logistic regression and a max-pooling layer offers a clear improvement over the existing system.
Yildirim [35] proposed DBLSTM-WS (Deep Bidirectional LSTM with Wavelet Sequences) for classifying ECG signals. A wavelet-based layer was employed to produce signal sequences from the ECG: the ECG signal was decomposed by this layer into sub-bands at various scales, and these sub-bands were used as input to the LSTM network. Unidirectional and bidirectional LSTM were both used for comparison. Experiments were conducted on five kinds of heartbeats from the datasets, and the proposed work was observed to outperform existing work in terms of recognition performance.
Jinsakul et al. [10] aimed to provide an experimental modification of the deep learning (DL) model Xception with the Swish activation function, and to assess the possibility of building a preliminary colorectal polyp screening system by training the model on a colorectal dataset. The results indicate that the modified model achieves better prediction accuracy than the traditional model.
Cerrada et al. [4] addressed feature selection through attribute clustering. The proposed algorithm was inspired by existing methods in which the relative dependency among attributes is used to compute dissimilarity values. The centroid of each resulting cluster is chosen as the representative attribute. The selection algorithm uses a random process to propose centroid candidates, so that a random search is included in the inherent exploration. A hierarchical procedure was proposed for this algorithm: at every level of the hierarchy, the whole set of attributes is divided into disjoint subsets and the selection process is applied to each subset. Once the representative attributes have been selected for every subset, a new set of features is formed and the selection process continues at the next level. The hierarchical execution aims to refine the search space at every level into a reduced set of chosen attributes. The results indicate better diagnosis precision than existing methods.
Park et al. [20] classified gear tooth faults by applying Ensemble Empirical Mode Decomposition (EEMD) to the TE (Transmission Error) measured by encoders on the input and output shafts. Gears with two kinds of faults were modelled, and the TE was obtained by simulating the loaded contact of the faulty gears to identify their characteristics. A test bed was built to evaluate the method, on which the TE of the gears was measured. EEMD was used to extract the fault features of the gear from the noise in the measured TE. The experimental results show that the method measures gear faults directly with little noise and supports successful diagnosis.
Sun et al. [26] proposed a fault diagnosis technique based on SSTFA (Sparsity Time-Frequency Analysis) to extract transient features from the vibration signal of a gear. The impulsive and steady modulation components of the faulty gear vibration signal were extracted simultaneously by selecting different time-frequency neighbourhoods and a standard thresholding operator. The diagnostic conclusion was drawn from the envelope spectrum of the impulsive components or from the periodicity of the impulses. The evaluation indicates that the method is more effective than existing work.
Tripathi et al. [27] proposed a DNN for digital pre-distortion (DPD) that uses the Swish activation function, also called the sigmoid-weighted linear unit, instead of ReLU and sigmoid to eliminate the dead neuron and vanishing gradient issues. Various activation functions were compared for a real-valued focused time-delay NN. The proposed work attains better results than existing work.
Xie et al. [33] proposed a stacking method based on WPCA. The principal components of a data matrix, such as sound signals, were extracted using a low-rank decomposition obtained by solving an optimization problem with low-rank constraints. The optimization problem is solved with the standard singular value decomposition algorithm. The low-rank decomposition of the data matrix reduces the impact of non-Gaussian random noise and of erratic and abnormal traces, and it is more robust than current alternatives. The method was evaluated on field and synthetic data and performed well on both.
Qi et al. [21] investigated the correlation among the QoS criteria of multimedia services and analyzed the adverse effect of criteria correlation on multimedia service selection in the cloud. W-PCA_MSSM was then proposed to remove the criteria correlation and simplify the service selection process. Several experiments were conducted to assess the feasibility of the method in terms of efficiency and effectiveness, with favourable results. In the future, the method could be extended to applications such as service replacement and composition.
Shao et al. [23] developed a DL framework for highly accurate machine fault diagnosis that uses transfer learning to enable and accelerate the training of deep NNs. Compared to current techniques, the framework was more accurate and much faster to train. First, the original sensor data were converted to images by a wavelet transformation to obtain time-frequency distributions. Low-level features were then extracted by the pretrained network, and the higher levels of the NN were fine-tuned on labelled time-frequency images. Experiments carried out to evaluate the framework show that it outperforms the compared methods.
Wen et al. [32] proposed TCNN (ResNet-50), a transfer convolutional neural network with 51 convolutional layers, for fault diagnosis. Combined with transfer learning, the method uses ResNet-50 trained on ImageNet to extract features for diagnosis. First, a signal-to-image method converts the time-domain fault signal into RGB image format, the input data type of ResNet-50. Experiments on three datasets show that the method predicts faults accurately.

Proposed work
A BLSTM-WPCA based gear fault diagnosis method is proposed to classify the kind of fault in a gear from the vibration signal acquired from the machine or gearbox (Fig. 2).

Weighted principal component analysis (WPCA)
Since PCA does not account for time-varying behaviour, a weighted criterion is introduced into the dimension reduction process to extract feature information. WPCA merges SFA (Slow Feature Analysis) and PCA (Principal Component Analysis), so that it captures both the time-varying features and the slowly varying features hidden in the process. PCA can be done by solving

$$I_1:\ \max_{o}\ \operatorname{tr}\left(o^{\top} Y Y^{\top} o\right) \quad \text{s.t.}\quad o^{\top} o = I \qquad (1)$$

where $o \in \mathbb{R}^{t \times t}$ is a transformation matrix and $Y \in \mathbb{R}^{t \times t}$ is a matrix of high-dimensional data. Unlike PCA, SFA minimizes the variation of the hidden features. Its optimization problem is

$$I_2:\ \min\ \langle \dot{x}_i^{2} \rangle \qquad (2)$$

where $\dot{x}_i$ is the first-order derivative of the $i$-th component of $x$, and $x$ is a $t$-dimensional slow feature. $\langle \cdot \rangle$ denotes the sample mean over the whole available time. $x$ satisfies the constraints $\langle x_i \rangle = 0$, $\langle x_i^{2} \rangle = 1$ and $\langle x_i x_j \rangle = 0$ for $i \neq j$.

The objective functions $I_1$ and $I_2$ have the same structure. By merging the two objective functions, the properties of SFA and PCA can be taken into account simultaneously. After merging, WPCA can be done by computing the $o$ that satisfies

$$\max_{o}\ \operatorname{tr}\left(o^{\top} Y' o\right) \quad \text{s.t.}\quad o^{\top} o = I \qquad (3)$$

where $Y' = \alpha Y Y^{\top} - (1-\alpha)\,\Delta Y \Delta Y^{\top}$ and $\alpha$ is the weighting parameter ($0 \le \alpha \le 1$). For optimal detectability of WPCA, $\alpha$ should be selected carefully because it indicates the relative influence of SFA and PCA. The optimal solution of Eq. (3) corresponds to the eigenvalues and optimal transformation vectors given by

$$Y' q_j = \lambda_j q_j, \quad j = 1, \dots, t$$

where $Q = [q_1\ q_2\ \dots\ q_t]$ collects the optimal transformation vectors, also referred to as loading vectors. WPCA aims to extract low-dimensional linear features, so only the first few PCs that summarize most of the variance are retained. The loading matrix then becomes $Q_{new} = [q_1\ q_2\ \dots\ q_p]$ where $p \ll t$. The PCs can now be computed as

$$d = Q_{new}^{\top} y$$

where $y$ is a data sample (a column of $Y$) and $d = [d_1, \dots, d_p]^{\top}$ collects the PC scores. As with PCA, Hotelling's $T^2$ statistic (written $D^2$ here) and the SPE (Square Prediction Error) are used in the WPCA-based process monitoring method. $D^2$ denotes the distance between the current variable's value and the mean of its PC subspace, taking the variance of the data into account:

$$D^2 = \sum_{j=1}^{p} \frac{d_j^{2}}{\sigma_j^{2}}$$

where $d_j$ is the score on the $j$-th PC and $\sigma_j$ is the standard deviation of the scores on the $j$-th PC. The score on the $j$-th PC is obtained from WPCA.
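The weighted objective and the $D^2$ statistic described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `wpca` and `d_squared`, the centring step, and the use of a first-order difference for $\Delta Y$ are choices made for the example.

```python
import numpy as np


def wpca(Y, alpha, p):
    """Weighted PCA sketch (hypothetical helper, not the authors' code).

    Y     : array of shape (t, n), t variables observed at n time steps
    alpha : weighting between the PCA term and the SFA term, 0 <= alpha <= 1
    p     : number of principal components to keep
    """
    Y = Y - Y.mean(axis=1, keepdims=True)      # centre every variable (assumption)
    dY = np.diff(Y, axis=1)                    # first-order difference along time
    Y_prime = alpha * (Y @ Y.T) - (1.0 - alpha) * (dY @ dY.T)
    eigvals, eigvecs = np.linalg.eigh(Y_prime)  # Y' is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1][:p]       # indices of the p largest eigenvalues
    Q_new = eigvecs[:, order]                   # loading vectors q_1 ... q_p
    scores = Q_new.T @ Y                        # PC scores, shape (p, n)
    return Q_new, scores, eigvals[order]


def d_squared(scores):
    """D^2 for every time step: sum over PCs of d_j^2 / sigma_j^2."""
    sigma = scores.std(axis=1, keepdims=True)
    return np.sum((scores / sigma) ** 2, axis=0)
```

With `alpha = 1` the sketch reduces to ordinary PCA on the centred data, which is a convenient sanity check when choosing the weighting.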
The SPE (Square Prediction Error) denotes the error between the WPCA model and the process data in the residual space:

$$\mathrm{SPE} = \left\lVert y - Q_{new} Q_{new}^{\top} y \right\rVert^{2}$$

where $y$ is a process data sample. If the process data have a non-Gaussian distribution, the control limits of SPE and $D^2$ are computed using KDE (Kernel Density Estimation). When SPE or $D^2$ exceeds its control limit, an alarm is raised. KDE is computed as

$$\hat{e}(y) = \frac{1}{m g} \sum_{j=1}^{m} K\!\left(\frac{y - y_j}{g}\right)$$

where $\hat{e}$ denotes the density function, $m$ is the number of samples, $g$ is the smoothing parameter, $y_j$ is the observed value and $K$ denotes the kernel function. $D^2$ aggregates the PC variations into a single statistic, so when the value of an individual PC goes above its control limit, detection through $D^2$ is not easy. Based on KDE, the multi-scale features are predicted.
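As a rough illustration of how a KDE-based control limit could be obtained, the sketch below evaluates the kernel density estimate with a Gaussian kernel and integrates it numerically until a chosen confidence level is reached. The Gaussian kernel, the 99% level, the integration grid and the names `kde` and `control_limit` are assumptions for the example; the source does not specify them.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(y, samples, g):
    """Density estimate at y from m observed values with smoothing parameter g."""
    m = len(samples)
    return sum(gaussian_kernel((y - yj) / g) for yj in samples) / (m * g)


def control_limit(samples, g, confidence=0.99, steps=2000):
    """Smallest grid value whose estimated cumulative probability reaches `confidence`."""
    lo = min(samples) - 5.0 * g
    hi = max(samples) + 5.0 * g
    dy = (hi - lo) / steps
    cumulative = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy            # midpoint rule
        cumulative += kde(y, samples, g) * dy
        if cumulative >= confidence:
            return y + 0.5 * dy            # right edge of the current cell
    return hi
```

A monitoring statistic (SPE or $D^2$) computed on new data would then be compared against the returned limit, and values above it would be treated as alarms.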

Convolutional layer
The convolutional layer extracts features from the input feature map using convolution kernels. Downsampling is performed by the pooling layer. The main characteristics of this kind of NN (Neural Network) are weight sharing, local perception and pooling. In our architecture, the first layer is a convolutional layer, computed as

$$y_i^{l} = e\!\left(\sum_{j=1}^{N} y_j^{l-1} * k_{ji}^{l} + a_i^{l}\right)$$

where $y_i^{l}$ is the $i$-th feature map of the $l$-th layer, $e(\cdot)$ denotes the activation function, $N$ denotes the number of input feature maps, $y_j^{l-1}$ denotes the $j$-th feature map of layer $l-1$, $*$ denotes the convolution operation, $k_{ji}^{l}$ denotes the trainable convolution kernel, and $a_i^{l}$ denotes the $i$-th bias of the $l$-th layer.
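The layer computation can be illustrated for one-dimensional feature maps with plain Python lists. This is only a sketch under assumptions: the function names, the "valid" output length, and the default Swish activation are choices made for the example, and deep learning frameworks usually skip the kernel flip that a strict convolution performs.

```python
import math


def swish(v):
    """Swish activation, v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def conv1d_valid(signal, kernel):
    """Strict 1-D convolution (kernel flipped), keeping only fully overlapping positions."""
    k = len(kernel)
    return [
        sum(signal[s + r] * kernel[k - 1 - r] for r in range(k))
        for s in range(len(signal) - k + 1)
    ]


def conv_layer(inputs, kernels, biases, activation=swish):
    """inputs: N feature maps of layer l-1; kernels[j][i]: kernel from input j to output i;
    biases[i]: bias of output map i. Returns the feature maps of layer l."""
    outputs = []
    for i, a_i in enumerate(biases):
        acc = None
        for j, y_j in enumerate(inputs):
            c = conv1d_valid(y_j, kernels[j][i])
            acc = c if acc is None else [p + q for p, q in zip(acc, c)]
        outputs.append([activation(v + a_i) for v in acc])
    return outputs
```

For example, `conv_layer([[1, 2, 3, 4]], [[[1, 0]]], [0.0], activation=lambda v: v)` uses a single input map, a single kernel and an identity activation, so the result is just the convolution of the signal with the kernel.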

Pooling layer
The pooling layer is connected to the convolutional layer and downsamples the feature map according to a pooling scheme to obtain a lower-resolution feature map. The most frequently used pooling scheme is max pooling. The pooling layer reduces the number of output nodes and improves the robustness of the network to variations in the input. Layer $l+1$ is the pooling layer, computed as $y_i^{l+1} = \mathrm{down}(y_i^{l})$, where $\mathrm{down}(\cdot)$ denotes the downsampling function. Considering overfitting and convergence speed, the Swish activation function is used here.
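Max pooling as a downsampling function can be sketched for a one-dimensional feature map as follows. The window size and stride of 2 and the name `max_pool1d` are illustrative assumptions; the source does not state the pooling parameters.

```python
def max_pool1d(feature_map, size=2, stride=2):
    """Keep the maximum of each window; incomplete trailing windows are dropped."""
    return [
        max(feature_map[s:s + size])
        for s in range(0, len(feature_map) - size + 1, stride)
    ]


pooled = max_pool1d([1, 3, 2, 5, 4, 0])  # three windows: (1, 3), (2, 5), (4, 0)
```

Each output node summarizes one window of the input, which is how the layer reduces the number of nodes while keeping the strongest response in each neighbourhood.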

Swish activation function
The most widely used activation function in NNs is ReLU, which is e(y) = max(0, y). Although several alternatives to ReLU have emerged, none has been able to replace it. To address this, the Google Brain team introduced a new activation function named Swish, which is e(y) = y · sigmoid(y).
This function is similar to ReLU, but it differs from ReLU in the region around 0. It is a smooth function that does not change direction abruptly as ReLU does near y = 0. Instead, it bends smoothly from 0 toward values below 0 and then moves upward again (Fig. 3). This observation indicates that it is non-monotonic: unlike ReLU, it does not move in a single direction. These properties make Swish preferable to other activation functions. Experiments indicate that Swish works better than ReLU on various challenging datasets. Being smooth and non-monotonic, Swish achieves test accuracy that is consistently better than ReLU on NNs for challenging domains such as machine translation and image classification.
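The difference between the two activations near zero can be checked numerically with a small sketch. The function names and the numerically stable form of the sigmoid are choices made for the example.

```python
import math


def relu(y):
    """ReLU: max(0, y)."""
    return max(0.0, y)


def sigmoid(y):
    """Logistic sigmoid, written to avoid overflow for large negative y."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def swish(y):
    """Swish: y * sigmoid(y)."""
    return y * sigmoid(y)


# Swish dips below zero for negative inputs and then rises again,
# whereas ReLU stays at exactly zero for every negative input.
for y in (-5.0, -1.278, -0.5, 0.0, 2.0):
    print(f"y={y:+.3f}  relu={relu(y):.4f}  swish={swish(y):+.4f}")
```

The printed values show the non-monotonic dip of Swish around y ≈ -1.28 and its approach to the identity for large positive y.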

BLSTM unit
The BLSTM network improves on the conventional neurons of an RNN (Recurrent Neural Network) with a series of gates that enforce constant error flow through the internal network state. The current sequence value $y_d$ is the input of the memory cell, the previous internal state is denoted $E_{d-1}$, and the previous prediction is denoted $g_{d-1}$. The current prediction $g_d$ is computed using the input gate $j_d$, forget gate $o_d$ and output gate $i_d$, while the internal state $E_d$ is updated using the input and forget gates. Several memory cells are connected in series to form a single BLSTM-NN layer.
Fig. 3 Swish activation function [1]
The BLSTM network is used for forecasting time series through sequence-to-sequence regression, where the responses $x_d$ are the input time series shifted by one time step, $x_d = y_{d+1}$.
For the gearbox, the vibration signal is taken as the input time series $y_d$. The BLSTM is trained on a dataset acquired from a gearbox in the normal state. If the gearbox is in the normal state, the prediction matches the response well, that is, $g_d \approx x_d$, and the prediction error $r_d = x_d - g_d$ is small. When a fault occurs in the gearbox, the response time series consists of the healthy vibration $x_d$ and the defect-induced vibration $a_d$. The fault-state response is denoted $X_d$, so $X_d = x_d + a_d$. If $X_d$ is fed to the trained network, then since $g_d \approx x_d$, the prediction error approximates the fault signal, $r_d = X_d - g_d \approx a_d$. In this way the BLSTM network acts as a filter that exposes anomalies.
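The residual argument can be illustrated with synthetic data. In the sketch below the trained network is replaced by a stand-in that reproduces the healthy signal exactly, which is an idealization of $g_d \approx x_d$; the sine-wave healthy vibration, the periodic impulses used as the defect signal, and all variable names are assumptions made for the example.

```python
import math


def prediction_error(response, prediction):
    """r_d = response_d - g_d for every time step d."""
    return [x - g for x, g in zip(response, prediction)]


# Synthetic healthy vibration x_d and a periodic defect-induced component a_d.
healthy = [math.sin(0.2 * d) for d in range(50)]
defect = [0.5 if d % 10 == 0 else 0.0 for d in range(50)]
faulty = [x + a for x, a in zip(healthy, defect)]   # X_d = x_d + a_d

# Stand-in for the trained network: it predicts the healthy signal (g_d = x_d).
prediction = healthy

residual = prediction_error(faulty, prediction)     # approximately a_d
```

In this idealized setting the residual recovers the injected impulses, which is the sense in which the predictor filters out the healthy vibration.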

Softmax layer
The final layer of the network is a softmax layer, which is non-linear. This layer supports handling multiple classes. The softmax function squeezes the output for every class into the range between 0 and 1 by exponentiating each output and dividing it by the sum of the exponentiated outputs, so that the outputs sum to 1. In short, this layer determines the probability of each output class. The softmax function is mostly used in the output layer of a classifier.
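A minimal sketch of the softmax computation is given below. Subtracting the maximum before exponentiating is a common numerical-stability choice and is an addition of this example, as is the function name.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that lie in (0, 1) and sum to 1."""
    shift = max(scores)                          # stabilizes the exponentials
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
```

The class with the largest raw score always receives the largest probability, so taking the index of the maximum gives the predicted fault class.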

Performance analysis
To evaluate the performance of the proposed work, three datasets were used. For the first dataset, random numbers were generated to pick data regarding local fault with noise and distributed fault [17]. The second dataset is a gearbox fault diagnosis dataset of vibration recorded with the SpectraQuest gearbox fault diagnostics simulator. This dataset was created by recording four vibration sensors positioned in four different directions, with loads varying from 0% to 90%. For our purpose, the data related to the broken tooth condition are taken [24]. The third dataset is GFDS2, which groups data related to faulty gear diagnosis from the sound signal. Here the load normal, load pitting, load tooth fracture, load wear, normal, pitting, tooth fracture and wear datasets were used. Figure 4 shows various faulty gear signals; these signals are used for training the classifiers so that faults occurring in gears can be analyzed easily. Table 1 compares the proposed method with and without multi-scale features and with the VGG16 and ResNet-50 methods. The comparison illustrates that the proposed method with multi-scale features outperforms all other methods. Figure 5 illustrates the performance of the proposed method in terms of accuracy, sensitivity, specificity, precision, recall and F-measure for various testing sets.
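For reference, the reported metrics can be computed from the counts of a binary confusion matrix as sketched below. The function name and the example counts are hypothetical and are not taken from the experiments; for multi-class fault labels the same formulas would be applied per class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # recall is the same quantity as sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, used only to show the calculation.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

Sensitivity and recall are two names for the same ratio, which is why they always coincide in such a report.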
The learning rate is a hyperparameter that controls how much the weights of the network are adjusted with respect to the loss gradient. Table 2 reports the performance of the proposed method in terms of accuracy, sensitivity, specificity, precision, recall and F-measure across cross-validation sets for various learning rates.
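The role of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional loss $(w - 3)^2$; the loss function, the step count and the two learning rates are illustrative assumptions and are unrelated to the values in Table 2.

```python
def gradient_descent(learning_rate, steps=50, w0=0.0):
    """Minimize (w - 3)^2 with the update w <- w - learning_rate * dL/dw."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)          # derivative of the toy loss
        w -= learning_rate * grad
    return w


w_small = gradient_descent(0.1)   # moves steadily toward the minimum at w = 3
w_large = gradient_descent(1.1)   # overshoots on every step and diverges
```

A rate that is too large makes each update overshoot the minimum, while a moderate rate converges, which is why the learning rate is tuned by cross-validation.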

Conclusion
Fault diagnosis plays a crucial role in maintaining equipment. This paper presents a gear fault diagnosis method based on multi-layer BLSTM models. Additionally, WPCA is used for feature extraction, and BLSTM with the Swish activation function is used to classify the extracted features and diagnose or predict the kind of fault in the gear. The performance of the proposed work was evaluated on challenging datasets, and the results indicate that it performs better than existing techniques.

Declarations
Conflicts of interest/competing interests No financial/personal interest.