A deep network architecture can be generative or discriminative depending on its operation; this classification is given in Fig. 4. A generative model learns the joint probability distribution p(x, y), while a discriminative model learns the conditional probability distribution p(y|x), where x is the input variable space and y is the target. A generative model can synthesize new samples from the underlying data distribution, whereas a discriminative model only maps from the input space to the target. The main deep learning architectures are summarized in Fig. 5, which shows the basic architectures of: 1. the Restricted Boltzmann Machine (RBM) and its variants, 2. the auto-encoder (AE) and stacked AE (SAE), 3. the Convolutional Neural Network (CNN), 4. the Recurrent Neural Network (RNN), both unidirectional and bidirectional, and 5. the Generative Adversarial Network (GAN).
In the next section, a brief description of the different deep learning models is presented, along with some of their application areas in the PHM community.
4.1 Auto-Encoder and stacked Auto-Encoder
An auto-encoder involves two parts: an encoder and a decoder. The encoder compresses the input data into a hidden layer with a reduced number of neurons, and the decoder tries to reconstruct the input data from this compressed representation. Training minimizes the average reconstruction loss. AEs are mainly used in unsupervised learning. Their main features include: 1. By reducing the number of neurons in the hidden layers, the network must learn representative features of the input data to achieve a successful reconstruction. 2. Nonlinear activation functions, such as ReLU, tanh and sigmoid, enable the learning of complex feature representations. 3. Training is done in a greedy layer-wise manner.
A stacked auto-encoder (SAE) consists of deeply stacked AE layers, where the hidden representation of each layer is passed as input to the next layer. Training follows the greedy layer-wise technique. Variants of the AE include the sparse auto-encoder, sparse stacked auto-encoder (SSAE), denoising auto-encoder (DAE), contractive auto-encoder and variational auto-encoder. Auto-encoder and stacked auto-encoder architectures are shown in Fig. 5. AEs have been used heavily for fault diagnosis in different applications; some of the latest studies are described next.
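The training objective described above can be made concrete with a minimal sketch, not taken from any of the cited studies: a single-hidden-layer auto-encoder with a sigmoid encoder and linear decoder, trained by gradient descent on the mean squared reconstruction error. All sizes and constants here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples of 8-dimensional input (all sizes here are arbitrary).
X = rng.normal(size=(200, 8))

n_in, n_hidden = 8, 3                  # bottleneck: 3 < 8 forces compression
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder weights
lr = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(500):
    H = sigmoid(X @ W1)                # encoder: compress into the hidden layer
    X_hat = H @ W2                     # decoder: reconstruct the input
    err = X_hat - X
    losses.append(np.mean(err ** 2))   # average reconstruction loss
    # Gradient descent on the reconstruction loss (constant factors folded into lr).
    grad_W2 = H.T @ err / len(X)
    grad_H = err @ W2.T
    W1 -= lr * (X.T @ (grad_H * H * (1 - H)) / len(X))
    W2 -= lr * grad_W2
```

Because the bottleneck is narrower than the input, the network cannot simply copy its input and is forced to learn a compact representation, which is the property PHM studies exploit for feature extraction.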
Shao et al. [33] used a deep AE architecture for fault diagnosis of gearboxes and electric locomotive roller bearings. A new auto-encoder loss function based on correntropy was designed to enhance the feature learning process, and the parameters of the deep auto-encoder were optimized using the artificial fish swarm algorithm (AFSA). The method proved better than earlier ones in terms of accuracy.
In a DAE, Gaussian noise can be added to the input data before it is fed to the hidden layer; binary noise can also be used. Another fault diagnosis study was conducted by Meng et al. [34] using a novel DAE for rolling bearings. To overcome the limitations of the DAE in feature learning, especially in the case of non-substantial input data, this study used a modified AE with an enhanced norm penalty and an improved preprocessing method. Both studies used only vibration data, neglecting other signals such as acoustic emission, which is considered important in machinery applications. This opens horizons for future work in fault diagnosis of rolling bearings.
Jiang et al. [35] proposed a sliding-window DAE (SW-DAE) algorithm for fault detection of wind turbines. First, a sliding window is applied to the multivariate time-series data to capture current and previous temporal information, and then the DAE model reconstructs the windowed input. This study, however, addresses only fault detection, which is a limited scope. Other studies considered fault diagnosis for applications such as wind turbines [36] and solid oxide fuel cell systems [37]. We can conclude that the AE and its variants are employed directly in fault diagnosis applications or as feature extractors in other PHM applications.
4.2 Restricted Boltzmann Machine (RBM)
The RBM is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. Its units form a bipartite graph of visible and hidden units with no intra-layer connections, hence the name "restricted". The training algorithm is the gradient-based contrastive divergence algorithm [38]. In supervised settings, the RBM is mostly used as a pre-processor for the classification stage in other DL-based approaches, though it can also act as a classifier itself. Variants of the RBM include the deep belief network (DBN) and the deep Boltzmann machine (DBM).
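As a rough sketch of how one-step contrastive divergence (CD-1) trains an RBM with binary visible and hidden units, consider the following toy example. It is illustrative only; the data, sizes and constants are invented here, not taken from [38].

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4                       # illustrative layer sizes
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: each sample is [l, l, l, 1-l, 1-l, 1-l] for a random bit l.
labels = rng.integers(0, 2, size=100)
V = np.stack([np.concatenate([np.full(3, l), np.full(3, 1 - l)])
              for l in labels]).astype(float)

W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

for step in range(200):
    # Positive phase: hidden activations driven by the data.
    p_h = sigmoid(V @ W + b_hid)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Negative phase (CD-1): one Gibbs step back through the visible layer.
    p_v = sigmoid(h @ W.T + b_vis)
    v_neg = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_neg = sigmoid(v_neg @ W + b_hid)
    # Update: data-driven statistics minus model-driven statistics.
    W += lr * (V.T @ p_h - v_neg.T @ p_h_neg) / len(V)
    b_vis += lr * (V - v_neg).mean(axis=0)
    b_hid += lr * (p_h - p_h_neg).mean(axis=0)
```

Because there are no intra-layer connections, all hidden units can be sampled in parallel given the visible layer (and vice versa), which is what makes this Gibbs step cheap.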
4.2.1 Deep belief network (DBN)
Stacking multiple RBMs results in a deep belief network (DBN). In a DBN, the connections between the top two layers are undirected, while the lower layers have top-down directed connections. Training is done in two phases: an unsupervised pretraining phase using a greedy layer-wise bottom-up procedure [39], followed by a fine-tuning phase using the back-propagation algorithm in a top-down process.
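The greedy layer-wise pretraining scheme can be sketched as follows: each RBM is trained on the hidden activations of the layer below it, bottom-up. This is a simplified illustration using one-step contrastive divergence with mean-field reconstructions; all sizes, data and constants are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hid, steps=100, lr=0.1):
    """Train one RBM with CD-1 and return its weights and hidden biases."""
    n_vis = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_vis, n_hid))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(steps):
        p_h = sigmoid(data @ W + b_h)                 # positive phase
        h = (rng.random(p_h.shape) < p_h).astype(float)
        v_neg = sigmoid(h @ W.T + b_v)                # one Gibbs step (mean-field)
        p_h_neg = sigmoid(v_neg @ W + b_h)
        W += lr * (data.T @ p_h - v_neg.T @ p_h_neg) / len(data)
        b_v += lr * (data - v_neg).mean(axis=0)
        b_h += lr * (p_h - p_h_neg).mean(axis=0)
    return W, b_h

X = (rng.random(size=(200, 12)) < 0.5).astype(float)  # toy binary data

# Greedy layer-wise pretraining: one RBM at a time, bottom-up.
layer_sizes = [8, 4]
activations, weights = X, []
for n_hid in layer_sizes:
    W, b_h = train_rbm(activations, n_hid)
    weights.append((W, b_h))
    # The hidden activations become the "visible" data of the next RBM.
    activations = sigmoid(activations @ W + b_h)

# 'weights' would then initialize the DBN for back-propagation fine-tuning.
```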
DBNs were among the first deep networks to be trained effectively and used in PHM applications. Again, optimization algorithms were added to improve the performance of earlier studies. Optimization is the process of maximizing benefit, i.e., minimizing error, by selecting the best hyperparameters of the model. Examples of hyperparameters are the number of hidden layers, the number of neurons in a single layer and the learning rate. In their study, Shao et al. [40] investigated a rolling element bearing dataset to find the optimal hyperparameters for fault diagnosis. Tang et al. [41] used Nesterov momentum (NM) to increase training speed and enhance performance. Another fault diagnosis problem, concerning traction motor bearings in high-speed trains, was investigated using a DBN in [42]; the learning rate in this method was adaptive. DBNs may also serve as the feature extraction part of a DL model.
Yuan et al. [43] conducted their research on wavelet packet transform (WPT) features, using a pair of different DBNs to extract features and temporal dependencies. Most of these studies focused on fault diagnosis applications that might be utilized as a prior step for RUL prediction. In [44], Tang et al. proposed a new fault diagnosis method called Fisher discriminative sparse representation (FDSR), in which a DBN is also used as a feature extractor. Dictionary learning yields smaller within-class scatter and greater between-class scatter, so the reconstruction error and sparse coefficients are discriminative, a major advantage of this method. In the domain of health assessment, Peng et al. [45] employed a DBN in RUL prediction by constructing a health indicator for the degradation process. A particle filter was used for RUL estimation on an aircraft engine dataset and was improved using a fuzzy inference system. Some researchers developed end-to-end models, which is one merit of DL.
Xie et al. [46] proposed a fault diagnosis model based on an adaptive DBN for extracting deep features that represent rotating machines, in order to distinguish bearing fault types and degrees. They employed a DBN with an adaptive learning rate optimized by NM; compared against SVM and a standard DBN, it achieved higher accuracy. Another fault diagnosis study that utilized the ability of the DBN to capture higher-level representations was presented by Liu et al. [47]. In this study, raw signals output from analog circuits are applied to a Gaussian-Bernoulli (GB) DBN to perform fault detection and isolation (FDI). This is a multi-class classification task, and the results showed that fault diagnosis based on the GB-DBN outperforms earlier methods. DBNs can also be used in the detection of malware on Android systems [48] and in traffic prediction considering weather factors [49].
We can conclude from this that DBNs are mostly used for feature extraction as a pre-phase of the model, i.e., for dimensionality reduction, which is also one phase of the PHM cycle. They can also be used in an ensemble with other DL architectures to perform the prediction process, as will be discussed.
4.2.2 Deep Boltzmann Machine (DBM)
The deep Boltzmann machine (DBM) may be seen as a deep RBM with multiple hidden layers, where all connections are undirected. In the training phase, all the layers are jointly trained using a stochastic maximum likelihood (SML) based algorithm. More information about the training of DBMs can be found in [50].
Few studies have employed the DBM in PHM, mostly for fault diagnostics [51, 52]; both studies were applied to gearboxes for diagnosis and fault classification. Hu et al. [52] used an ensemble of a DBM and a random forest (RF) for fault classification to deal with industrial big data. They proposed a collaborative method combining DBMs with a multi-grained scanning forest ensemble. The research was carried out on the Tennessee Eastman Process (TEP) fault diagnosis benchmark and proved to have classification accuracy competitive with earlier methods. However, they employed DBMs to generate binary (0/1) features, which may waste some information; this drawback may open avenues for later research.
4.3 Convolutional Neural Network (CNN)
The Convolutional Neural Network (CNN) has proven successful in various applications, including natural language processing (NLP), speech recognition and computer vision. Figure 4 shows the architecture of a 2-D CNN with three different parts: convolutional layers, pooling layers and a number of fully connected layers. A convolutional layer performs a convolution operation between its input and a sliding window (filter or kernel); the output of this layer is the feature map. One merit of the network is that these filters or kernels are learned automatically rather than handcrafted.
The convolutional layer output is processed by the pooling layer, which extracts the most important local features. This reduces the dimensionality of the intermediate layers, thus avoiding overfitting. Moreover, the dimensionality reduction of the feature map lowers the number of variables while increasing the shift-invariance property. In [53], a CNN-based technique was introduced to form a health indicator (HI) and applied to the prognostics of a rolling bearing system. In a similar way, a DCNN was used by Belmiloud et al. [54] to estimate the RUL of rolling bearings, and a DCNN was used for bearing defect size estimation in [55].
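The two operations just described, convolution producing a feature map and pooling shrinking it, can be illustrated directly. The following is a minimal sketch with a single hand-chosen kernel; in a real CNN the kernels are learned during training.

```python
import numpy as np

# A toy 6x6 single-channel "image".
x = np.arange(36, dtype=float).reshape(6, 6)

# A 3x3 vertical-edge kernel; in a real CNN this is learned, not handcrafted.
k = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])

def conv2d(img, kernel):
    """Valid 2-D cross-correlation, the operation in a CNN convolutional layer."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keeps the strongest local response."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

fmap = conv2d(x, k)        # feature map: 6x6 input -> 4x4
pooled = max_pool(fmap)    # pooling: 4x4 -> 2x2, fewer variables
```

Pooling halves each spatial dimension here, which is exactly the dimensionality reduction and shift-invariance effect described above.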
CNNs were originally used to analyze images. Therefore, many researchers proposed methods to preprocess and convert time-series data into 2-D inputs for system health assessment.
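A common form of this preprocessing is to slice the multivariate time series into fixed-length windows, yielding 2-D (channels × time) samples that a CNN can consume. The sketch below is generic, not the exact scheme of any cited study; the window length, stride and sensor count are arbitrary.

```python
import numpy as np

# Toy multivariate time series: 4 sensor channels, 100 time steps.
signal = np.random.default_rng(0).normal(size=(4, 100))

def to_windows(x, win, stride):
    """Slice a (channels, time) series into overlapping 2-D windows."""
    starts = range(0, x.shape[1] - win + 1, stride)
    return np.stack([x[:, s:s + win] for s in starts])

windows = to_windows(signal, win=20, stride=10)
# Each windows[i] is a 4x20 "image" covering current and previous samples,
# ready to be fed to a 2-D CNN.
```

Overlapping strides (stride < window length) let consecutive windows share temporal context, which is the same idea behind the sliding-window schemes used for fault detection above.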
Huang et al. [56] proposed a reshaped time-series convolutional neural network (RTSCNN) method based on multi-sensor raw signal fusion to predict tool wear of a CNC machine under milling operations. The raw sensor signals (3-D forces, 3-D vibrations and AE) are collected and reshaped into an image-like form. Three convolutional layers and three pooling layers are applied to this reshaped matrix of raw signals to extract highly discriminative features, and a fully connected layer with a ReLU activation function followed by a regression layer completes the prediction of each flute's tool wear. The proposed architecture performs well compared with methods that use handcrafted features, in terms of both root mean squared error (RMSE) and mean absolute error (MAE). Experiments were conducted to determine the best number of training epochs and the dropout percentage for training acceleration, and Nesterov momentum proved to accelerate training best. The same group also deployed multi-domain feature fusion from three-dimensional cutting force and vibration signals, constructed an input matrix from them, and applied a deep convolutional network to predict tool wear of three cutters of a high-speed CNC machine under milling operations [57]. It outperforms [56], which uses a similar DCNN architecture on the milling dataset but with raw sensor data, in terms of MAPE and RMSE. A DCNN was also used to deal directly with raw data, merely normalized and requiring no domain expertise, for RUL estimation on the C-MAPSS dataset [58]. We can conclude that the DCNN is a promising method and can be combined with other DL methods as the feature extraction part of a prognostics model, as can be noticed in the following subsections.
4.4 Recurrent neural network (RNN)
A Recurrent Neural Network (RNN) acts as a memory cell that saves the status of previous cells, making it the most suitable architecture for sequential data applications such as NLP and those involving time-series data. In the training phase, the hidden-unit state is updated from the previous cell state and the current input through an activation function. RNNs are capable of capturing long-term and transient dependencies from time-series and sequential data, but they have several drawbacks, such as the vanishing and exploding gradient problems. To overcome these issues, new versions of the RNN were introduced: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), as shown in Fig. 6. A gating mechanism is the key feature behind these two architectures; it allows important features of the input to be maintained. The bidirectional LSTM depends on both previous and next states, increasing the flexibility and power of the RNN and making it useful in time-series applications.
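The gating mechanism can be made concrete with a single LSTM cell step. The update below follows the standard LSTM equations (with biases omitted for brevity); the weights are random placeholders, not trained values, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5                       # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [h_prev, x] concatenated.
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(n_hid + n_in, n_hid))
                      for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(z @ W_f)                 # forget gate: what to discard
    i = sigmoid(z @ W_i)                 # input gate: what to write
    o = sigmoid(z @ W_o)                 # output gate: what to expose
    c_tilde = np.tanh(z @ W_c)           # candidate cell content
    c = f * c_prev + i * c_tilde         # gated memory update
    h = o * np.tanh(c)                   # new hidden state
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(10):                      # unroll over a toy sequence
    x_t = rng.normal(size=n_in)
    h, c = lstm_step(x_t, h, c)
```

The additive form of the cell update, c = f * c_prev + i * c_tilde, is what lets gradients flow across many time steps, mitigating the gradient problems of the plain RNN.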
RNNs are well suited to the computational and storage demands of big PHM data because of the following characteristics: (1) effective storage of past states due to the distributed hidden state, and (2) adaptation of the hidden states to complex mappings due to their non-linear dynamics.
Many studies have investigated the use of LSTM networks to monitor the health of machinery systems. Zheng et al. [59] proposed an LSTM-based RUL prediction method. Experiments were done on the C-MAPSS and milling datasets, and it outperforms other methods such as CNN and SVR in terms of both RMSE and the score function.
A bidirectional LSTM (BiLSTM) network can capture the relationships in sensory data both forward and backward, extracting the maximum benefit from the input; thus it was used by Wang et al. [60] for RUL prediction of turbofan engines. The C-MAPSS dataset was again used, and the PHM08 Data Challenge scoring function was computed for evaluation. The data were divided into training and testing parts, and comparison against previous research showed better performance in terms of RMSE. Zhao et al. [61] proposed an algorithm based on a local-feature GRU. The algorithm starts with handcrafted features extracted from time-series data split into fixed-size windows. These features are fed to a bidirectional GRU to capture higher-level representations, and then a supervised layer performs the learning and prediction. However, this work can be criticized for using handcrafted features, which require domain expertise. In [62], a health index is constructed using KPCA and an exponentially weighted moving average (EWMA) to depict the degradation of rolling bearings. This HI is fed to a hierarchical GRU, built by stacking layers, to estimate the future HI and predict RUL; the method proved to outperform earlier ones. In [63], an enhanced similarity-based RUL estimation method was combined with an RNN-based auto-encoder scheme and applied to the turbofan engine dataset, which demonstrated the advantage of the suggested ensemble.
4.5 Generative adversarial network (GAN)
Generative adversarial networks (GANs) are powerful generative models first introduced by Goodfellow et al. [64]. A GAN consists of two parts: a generator and a discriminator. The generator learns the distribution of the input data, while the discriminator, playing the adversarial role, takes fake and real data as input and evaluates them for authenticity [65]. GANs now pervade deep learning applications, improving their performance and prediction capability. The variational auto-encoder (VAE), a generative variant of the auto-encoder, can also be trained adversarially in combination with a GAN. Many studies have adopted the VAE in prognostic applications, and it has shown good performance in anomaly detection and RUL prediction tasks.
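The adversarial interplay between the two parts corresponds to the minimax objective of [64]:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] +
  \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

Here D(x) is the discriminator's estimate of the probability that x came from the real data distribution, and G(z) maps a noise sample z to a generated sample; the discriminator maximizes V while the generator minimizes it.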
Yao et al. [66] used a VAE to capture dominant features for unsupervised fault detection applications. Comparisons were established using the KDD CUP 99 [67] and MNIST [68] datasets. Experiments showed that features extracted by the VAE may improve the performance of unsupervised anomaly detection techniques; compared against an AE and KPCA, the VAE achieved the best performance. Huang et al. [69] used a VAE trained with a GAN for long-term prediction of degradation progression and RUL without specifying a particular failure threshold. Critical degradation features are extracted using monotonicity and correlation metrics, and health indicators are constructed and fed with these features to train the model. The VAE consists of an encoder based on a bidirectional LSTM and a decoder based on an auto-regressive LSTM-GMM, whose output is fed to a fully connected layer of a Gaussian mixture model. Experiments on the C-MAPSS, HSSB and lithium-ion battery datasets proved that adversarial training improves the VAE's ability to learn the true distribution of the degradation process, which leads to improved prediction accuracy.