Machine health surveillance system using a deep learning sparse autoencoder

Deep learning is a rapidly growing research area with state-of-the-art achievements in various applications, including but not limited to speech recognition, object recognition, machine translation, and image segmentation. In modern industrial manufacturing systems, Machine Health Surveillance Systems (MHSS) are gaining popularity because of the widespread availability of low-cost sensors and internet connectivity. Deep learning architectures provide useful tools to analyze and process these vast amounts of machinery data. In this paper, we review the latest deep learning techniques and their variants used for MHSS. We use the Gearbox Fault Diagnosis dataset, which contains sets of vibration attributes recorded by SpectraQuest's Gearbox Fault Diagnostics Simulator. In addition, we use variants of the autoencoder for feature extraction to achieve higher accuracy in machine health surveillance. The results show that a bagging ensemble classifier based on voting achieved 99% accuracy.


Introduction
Deep learning is currently a promising area of research in artificial intelligence. It is a subcategory of machine learning that uses neural networks to design highly accurate systems. A neural network architecture contains several layers, made of neurons, that apply linear and nonlinear transformations between the layers.
There are a number of successful implementations of supervised and unsupervised deep learning techniques in computer vision and natural language processing (Almiani et al. 2020). Deep learning models learn in a hierarchical fashion, where higher-level features are derived from lower-level ones. For example, in an image classification task, a deep learning algorithm takes pixel values at the input layer and assigns a label to the object at the output layer. Between these two layers are a number of internal layers, known as hidden layers, that assemble successively higher-order features (LeCun et al. 1999). The term ''deep'' in DL refers to the several layers of representation that lie between the model's inputs and outputs. There is no standard number of layers for a deep neural network, but most research in this area requires at least more than two layers. One reason for the success of deep learning is that it avoids the process of feature engineering (Komar et al. 2018). In conventional machine learning, feature engineering is the task of choosing the relevant features the algorithm needs to work efficiently. This task is complex and time-consuming, as the precise selection of features is important to the performance of the algorithm (Bouktif et al. 2018). In this paper, four deep learning algorithms, i.e., the Auto-Encoder (AE), the Restricted Boltzmann Machine (RBM), the Convolutional Neural Network (CNN), and the Recurrent Neural Network (RNN), along with their variants, are discussed and applied to MHSS in detail.
Data-driven algorithms and the industrial Internet of Things (IoT) are driving a revolution in manufacturing, empowering computer networks to collect massive amounts of data from connected machines and transform it into valuable information (O'Donovan et al. 2015; Tao et al. 2018). As an important module of the modern manufacturing system, machine health monitoring has fully embraced the big data revolution (Luo et al. 2016). New bottom-up, data-driven solutions for fault detection have overtaken traditional physics-based techniques. Data-driven MHSS accomplishes this by diagnosing faults arising in the system and predicting the remaining useful life of the machine (Nuhic et al. 2013). Developing physical models for complex and dynamic systems is difficult because noisy working conditions delay their construction; the effectiveness and flexibility of physics-based models are further hindered because they cannot incorporate real-time data updates (Mosterman 1997). Deep learning has proven to act as a bridge between huge volumes of machinery data and intelligent MHSS, using a multi-layer approach to classify data patterns. Deep learning goes back as far as the 1940s, but its recent popularity stems mostly from computer vision, speech recognition, bioinformatics, and audio recognition (Liu et al. 2017). The increased use of deep learning models can be attributed to cheaper GPUs, exponential growth in data, and the strength of deep learning research.
Cheaper GPUs: Deep learning models require high-end GPUs, and the recent availability of cheaper GPUs has made it easier to build more efficient models, significantly reducing the time required to run deep learning algorithms. As shown in Raina et al. (2009), the training time for a four-layer Deep Belief Network (DBN) with over 100 million parameters was reduced from several weeks to a single day.
Exponential growth in data: Nearly every operation is now digitized, captured by sensors and computers, connected to the Internet, and stored in the cloud. As shown in Yin et al. (2014), industry-associated systems such as electronics and industrial informatics produce 1000 exabytes per annum, and a 20-fold rise can be expected in the coming ten years. Research in Al-Sarawi et al. (2020) predicts that a minimum of 30 billion devices will be connected by the end of 2025.
Deep learning research strength: the first major success of deep learning was unsupervised pre-training (Hinton 2007); in 2007, Hinton suggested training one layer at a time using RBMs and then fine-tuning with backpropagation.
The main contribution of this article is monitoring the health of a gearbox using the Gearbox Fault Diagnosis dataset, which contains sets of vibration attributes recorded by SpectraQuest's Gearbox Fault Diagnostics Simulator. The rest of the paper is organized as follows: Sect. 2 reviews recent work on deep learning models used for MHSS. In Sect. 3, we present our proposed methodology in detail. In Sect. 4, an experimental study is carried out and the results are reported. In Sect. 5, we conclude the paper and outline future directions.

Related work
In the area of MHSS, multi-layer conventional neural networks have been used for many years (Su and Chong 2007). Recently, the number of deep learning algorithms applied in MHSS has increased significantly. Deep neural networks based on the Restricted Boltzmann Machine or the autoencoder smooth the training phase and improve the power to classify data. Recurrent neural networks and convolutional neural networks provide highly complex and advanced compositions for extracting patterns from machine data. For deep-learning-based MHSS, the first layer is used for input. For diagnosis tasks with discrete outputs, the top layer is a softmax layer; for tasks with continuous outputs, linear regression is used. A survey of deep learning for MHSS across three different architectures is summarized in Table 1.

MHSS using auto encoder
Autoencoder (AE) algorithms can automatically extract high-level information from machine data. Hoang and Kang (2019) suggested a single-layer AE solution based on a neural network to classify induction motor faults, as shown in Figs. 1 and 2. The authors focus on how to overcome overfitting, because the available dataset has limited dimensions; their model uses dropout to randomly hide outputs of the hidden layer. Lu et al. (2017) presented a comprehensive study on stacked denoising autoencoder techniques with three hidden layers to diagnose faults of rotary machines. In the suggested study, experimental work was carried out in cross-working and single-working environments to study the effects of deep architecture, input size, the denoising operation, and sparsity control. Tao et al. (2015) presented several deep learning structures based on a two-layer SAE; to perform the fault-diagnosis classification task, the proposed system consists of various hidden layers with masking probability. The input dataset is huge, which causes overfitting and high computation, so some authors suggested normalizing the input data using AE models. Jia et al. (2016) use the frequency spectra of time series to extract features from raw input data; the extracted features are then used for fault-diagnosis classification. Sun et al. (2017) proposed a soft-threshold nonlinear digital wavelet model to process the vibration signal, with an SAE used to classify faults over the preprocessed input signal. Liu et al. (2016a, b) proposed a Fourier wavelet transform to extract features, with an SAE-based deep neural network used to classify roller bearing faults; the SAE uses dropout and the ReLU activation function, which help prevent overfitting. At the input, the dataset is normalized by the short-time Fourier transform to generate a normalized spectrogram.
A two-layer SAE-based DNN is then used to classify bearing rolling faults. Galloway et al. (2016) extract spectrograms from tidal turbine data using a two-layer SAE-based DNN to diagnose faults. In Li and Wang (2015), an SAE-based DNN is used to classify faults, with principal component analysis used to extract features from the input. Autoencoders have also been used to extract useful patterns from multiple sensors and time-series data by Chen and Li (2017) and Reddy et al. (2016): informative statistical input features in the high-frequency and time domains are derived from vibration signals and then used for pattern classification. Some researchers have gone a step further and investigated combining autoencoders with other machine learning approaches. Jiang et al. (2017) proposed a scheme that extracts frequency-domain features using autoencoders and then applies conventional classifiers such as random forest and support vector machines to perform the classification task. Wang et al. (2016) suggested a novel unsupervised feature-learning autoencoder, which prevents overfitting and changes the gradient direction for a fault classification task. Mao et al. (2017) presented a novel autoencoder variant for fault recognition based on the extreme learning machine. After that, a Self-Organizing Map (SOM), an unsupervised algorithm, was used to transform the recognized pattern into a health value, which was then used to predict RUL with a harmony methodology. A support vector multi-model classification technique was also suggested for fault recognition in gearboxes. The algorithm extracts three types of features from the vibration signal: time-frequency, frequency, and time. A three-layer Gaussian-Bernoulli Deep Boltzmann Machine (GDBM) is then applied to the extracted features, with a softmax unit layer attached at the top of each GDBM.
After fine-tuning, the statistical outputs of the softmax layers of the three GDBMs are assessed by a support vector classification structure to produce the final classification. Another study developed a single-unit GDBM over features in three different domains, i.e., time-frequency, frequency, and time; to predict faults, a combination of softmax and stacked layers is applied on top of the GDBM. A further work applied a two-layer DBM to extract deep information from statistical parameters based on the wavelet packet transform (WPT) of the input sensory signal to recognize gearbox faults. The authors focus on data fusion: two different DBMs were trained over vibratory and acoustic signals, and a random forest was applied to fuse the information recognized by the two DBMs. To make a useful application of a DBN-based DNN, Ma et al. proposed a novel model for evaluation under a bearing accelerated life test.
Root mean square (RMS), a statistical feature fitted with Weibull statistics, is used to avoid fluctuation regions, and frequency-domain features are then extracted from the raw input data. The proposed methodology is given in Fig. 3: the red block represents online training, while the second part shows offline training. Shao et al. (2017) proposed a DBN framework to diagnose induction motor faults using the vibration signal data continuously. Tao et al. (2016) proposed a novel multi-sensor DBN-based information fusion model to predict bearing faults: in the first phase, three vibration signals extracted from three sensors are combined to extract 14 time-domain features.

MHSS using CNN
CNN is an approach that has gained considerable attention from researchers for high-dimensional data such as time series and images. A CNN is a neural network that extracts features during the training phase, while the assigned weights are adjusted during training. The fundamental problem CNN resolves is manual feature extraction: the main feature and advantage of CNN is that feature extraction is automated (Hussain et al. 2019). CNN is a specialized form of neural network that uses the convolution operation instead of conventional matrix multiplication.
In some frameworks, the raw machinery information can be gathered in a 2D format such as a time-frequency spectrum, while in other cases the data is in a 1D format such as a time series. CNN algorithms can learn robust and complex patterns with convolutional layers from both formats. Convolutional layers, with the help of filters, extract local patterns from raw data, and convolutional layers can be stacked to create higher-level patterns. Liu et al. (2016a, b) presented a 2D-CNN framework to classify four rotating-machinery conditions. Two sensors, set perpendicular to each other, generate two accelerometer signals processed by the DFT. The adopted CNN model consists of a fully connected layer and a single convolutional layer, after which a top softmax layer is applied for classification. Babu et al. (2016) created a 2D deep convolutional neural network to predict the RUL of a machine from normalized variable-length time sequences collected from sensor signals. In the proposed study, mean pooling is applied rather than max pooling, and since RUL is a continuous value, the top layer is a linear regression layer.
Ding and He (2017) suggested a novel approach based on a deep convolutional network (ConvNet), in which a wavelet packet energy (WPE) image is used as input to predict spindle bearing faults. To fully capture the hierarchical structure, a multi-scale layer is appended after the last convolutional layer, concatenating the output of the last convolutional layer with the first pooling layer. Guo et al. (2016) suggested an algorithm based on an adaptive deep convolutional neural network (ADCNN). A hierarchical module is designed to predict fault size and fault pattern: in the fault-pattern decision phase, an ADCNN is first applied to recognize the fault type, and an ADCNN with an identical structure is then used to recognize the fault size. A function f is used to classify the fault type, defined as a sum of probabilities as shown in (1), where a_i is the instance and P_j is the probability.
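To illustrate how a convolutional layer extracts local patterns from a 1D vibration signal, the following NumPy sketch passes a noisy sinusoidal signal through one convolution-ReLU-pooling stage. The filter here is random; in a trained CNN its weights would be learned, so this is only a structural sketch, not any of the networks surveyed above.

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation, as used in CNN layers."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling."""
    m = len(x) // size
    return x[:m * size].reshape(m, size).max(axis=1)

# A noisy vibration-like signal through one conv -> ReLU -> pool stage.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20 * np.pi, 256)) + 0.1 * rng.standard_normal(256)
kernel = rng.standard_normal(8) * 0.1   # one filter (random stand-in for a learned one)
features = max_pool(relu(conv1d(signal, kernel)))
print(features.shape)                   # (124,)
```

Stacking several such stages, followed by a softmax or linear-regression top layer, gives the architectures described in this section.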

Methodology
In this section, the proposed methodology is discussed in detail to demonstrate the effectiveness of the proposed MHSS algorithm for fault diagnosis. A dataset of six gearboxes was recorded under varying environments with different rotation speeds. The framework of the proposed methodology is shown in Fig. 4.

Data set
The dataset used in this paper contains vibration data recorded using SpectraQuest's Gearbox Fault Diagnostics Simulator (ZhiQiang et al. 2015). It was collected with the aid of four sensors placed at four different locations, under load conditions ranging from zero to 90 percent, in two different scenarios: broken-tooth condition and healthy condition.

Preprocessing
The running time span of every signal was 0.5 s, and a total of 120 fragments were extracted from the original signal for every condition. Every sample was then preprocessed using the CAE. Each type of gearbox fault therefore had 120 records, giving a balanced dataset of 720 samples over six health conditions. The dimension of each sample was 6140, so the fault gearbox data matrix was 6140 × 720. The output layer has six neurons, representing the six health states. The number of iterations was set to 100 both for training each hidden layer and for the fine-tuning process, with corresponding learning rates of 0.01 and 0.1. A subset of 30% of the samples was used for testing, and the remaining data were used for training.

Unsupervised bagging ensemble classifier
In the proposed MHSS model, an ensemble classifier was built using the bagging model. The features extracted from the dataset are input to the ensemble technique to form a set of classifiers. Using this newly constructed set of classifiers, new data points are classified by a vote over their predictions, improving the predictive performance of fault diagnosis. Figure 5 shows the block diagram of the bagging ensemble classifier model (Table 2). As demonstrated in Fig. 5, the features extracted using the machine learning technique form the input layer of the proposed model. The samples 's1, s2, s3, …, sn' in the dataset are preprocessed into features 'SFE1, SFE2, SFE3, …, SFEn'. Several classifiers 'C1, C2, …, Cn' are created from these samples. Lastly, a combined classifier 'C' is obtained to achieve higher fault-diagnosis accuracy. The proposed bagging ensemble classifier processes the obtained samples (i.e., features) concurrently.
Every sample gets an equal weight. A classifier is then trained for each bootstrap sample in the bagging ensemble model; every sample's features are used to train a classifier, and the final decision is made by a vote of the components.
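The bootstrap-and-vote procedure can be sketched as follows. A toy nearest-centroid learner stands in for the RBM/DBN/DBM components used in the paper, which are of course far more powerful; the point is the bagging mechanics, not the base learner.

```python
import numpy as np

class NearestCentroid:
    """Toy base learner standing in for the RBM/DBN/DBM components."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.stack([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return self.classes[d.argmin(axis=1)]

def bagging_predict(X_train, y_train, X_test, n_estimators=5, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap sample
        clf = NearestCentroid().fit(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    votes = np.stack(votes)                                  # (n_estimators, n_test)
    # final decision: majority vote across the ensemble
    return np.array([np.bincount(col).argmax() for col in votes.T])

# two well-separated synthetic classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
pred = bagging_predict(X, y, np.array([[0.0, 0.0], [3.0, 3.0]]))
print(pred)   # [0 1]
```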

Autoencoders (AE)
The idea of the AE was presented by LeCun (Rolfe and LeCun 2013). An AE comprises two parts, an encoder and a decoder (Fig. 6), and is designed to learn from input data by reconstructing a new representation of the data. The encoder and decoder can be viewed as two different functions (Hinton and Zemel 1994).
The function f(z) maps a data point z from data space to feature space, while g(w) reconstructs the data point z by mapping w from feature space back to data space. In a typical AE the two mappings, i.e., w = f(z) and y = g(w), are parametric functions, where f is the encoder (w given z) and g is the decoder (y given w); y is the reconstruction of z. It is important to note that the AE does not simply learn to copy the input z (Bourlard and Kamp 1988).
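A minimal sketch of this encoder/decoder pair is a tied-weight linear autoencoder trained by gradient descent. The sizes, learning rate, and linearity are illustrative choices for brevity, not the paper's settings; real autoencoders use nonlinear units.

```python
import numpy as np

# Tied-weight linear autoencoder: encoder w = f(z) = zW, decoder y = g(w) = wW^T,
# trained to minimize the mean squared reconstruction error ||y - z||^2.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 8))       # data points z
W = rng.standard_normal((8, 3)) * 0.1   # 8-d input -> 3-d code

initial_loss = np.mean((Z @ W @ W.T - Z) ** 2)
lr = 0.01
for _ in range(500):
    err = Z @ W @ W.T - Z               # reconstruction error y - z
    grad = (Z.T @ err @ W + err.T @ Z @ W) / len(Z)   # d(loss)/dW for tied weights
    W -= lr * grad
final_loss = np.mean((Z @ W @ W.T - Z) ** 2)
print(final_loss < initial_loss)        # reconstruction improves with training
```

Because the code is lower-dimensional than the input, the model cannot simply copy z; it must learn a compressed representation.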

Sparse auto encoder (SpAE)
To expose the internal structure of the data, our model applies an extra supervision check on the hidden layer. A neuron is called active when its output is near one, and inactive when its output is near zero. The main motivation of the SpAE is to keep most hidden neurons inactive. Given a set of training samples x_1, x_2, x_3, …, x_m, the mean activation of the ith hidden neuron is p̂_i = (1/m) Σ_{j=1}^{m} h_i(x_j). The SpAE imposes the restriction p̂_i = p, where p is the desired average activation. Since most activations in the hidden layer should be near zero, this restriction is enforced using the following equation.
where p̂_i is the mean activation and p is the predetermined target average activation of the ith hidden neuron over the complete dataset. This added sparsity check makes the hidden layer learn a sparse representation; hence, this type of AE is called the SpAE, as shown in Fig. 7.
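The sparsity restriction is commonly enforced with a KL-divergence penalty between the target activation p and the measured mean activations p̂_i. A small sketch follows; the target value 0.05 and the activation values are hypothetical.

```python
import numpy as np

def kl_sparsity(p_hat, p=0.05):
    """KL(p || p_hat) summed over hidden units: the usual sparsity penalty
    that pushes each mean activation p_hat_i toward the target p."""
    return np.sum(p * np.log(p / p_hat)
                  + (1 - p) * np.log((1 - p) / (1 - p_hat)))

# mean activations of three hidden units over a batch (hypothetical values)
p_hat = np.array([0.05, 0.2, 0.5])
penalty = kl_sparsity(p_hat)
print(round(float(penalty), 3))   # 0.589
```

The penalty is zero when a unit's mean activation equals the target and grows as the activation drifts away, which is exactly the restriction p̂_i = p described above.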

Contractive autoencoders (CAE)
Real-life applications of autoencoders include data labeling for segmentation, denoising input data, detecting outliers, preprocessing, and replacing missing values in data; many of these applications also work with SAEs. In the CAE, inputs are not zeroed out completely; instead, small changes to the input are penalized through an alternative regularization that stabilizes the learned mapping on the training dataset. The CAE is obtained with the help of Eq. 4.
It is easy to observe that the squared Jacobian term corresponds to the Frobenius norm, acting as a weight decay on Y; g(f(y)) is an identity function in the linear case.
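The contractive penalty, i.e., the squared Frobenius norm of the encoder's Jacobian, can be computed in closed form for a sigmoid encoder. The sketch below uses illustrative sizes and random weights, not the paper's model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contractive_penalty(y, W, b):
    """Squared Frobenius norm of the Jacobian of the encoder h = sigmoid(Wy + b).
    For sigmoid units, row i of the Jacobian is h_i * (1 - h_i) * W_i."""
    h = sigmoid(W @ y + b)
    J = (h * (1 - h))[:, None] * W
    return np.sum(J ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))   # 6-d input -> 4 hidden units (illustrative)
b = np.zeros(4)
y = rng.standard_normal(6)
print(contractive_penalty(y, W, b))
```

Adding this term to the reconstruction loss penalizes the encoder's sensitivity to small input perturbations, which is what makes the mapping "contractive".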

Stacked autoencoders (SAE)
An SAE is a combination of several autoencoders stacked together, with multiple layers of encoding and decoding. This permits the algorithm to fit more weights and more layers and, most importantly, to be more robust. A Stacked Denoising Autoencoder (SDA) can yield a productive pre-training solution for initializing the weights of a deep neural network (DNN). A form of supervised fine-tuning is then applied to reduce the classification error over the labeled training dataset, as shown in Fig. 6.
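Greedy layer-wise pretraining of such a stack can be sketched with simple tied-weight linear layers: each layer is trained to reconstruct the codes produced by the layer below. Layer sizes are illustrative, and a real SAE/SDA would use nonlinear units and input corruption; this only shows the stacking mechanics.

```python
import numpy as np

def train_layer(X, n_hidden, lr=0.01, steps=300, seed=0):
    """Train one tied-weight linear autoencoder layer; return its encoder weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden)) * 0.1
    for _ in range(steps):
        err = X @ W @ W.T - X                               # reconstruction error
        W -= lr * (X.T @ err @ W + err.T @ X @ W) / len(X)  # gradient step
    return W

# Greedy layer-wise pretraining: layer 2 learns to encode layer 1's codes.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 16))
W1 = train_layer(X, 8)
H1 = X @ W1                      # codes from layer 1 feed layer 2
W2 = train_layer(H1, 4, seed=2)
codes = H1 @ W2
print(codes.shape)               # (200, 4)
```

After pretraining, the stacked encoders would be unrolled into a DNN and fine-tuned with labels, as described above.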

Training and testing
After applying the deep learning variants to extract the important features, the feature maps of the trained variants, the actual data, and its labels are used to train the ensemble model. We used the extracted feature vectors, together with class labels and actual data, to train the ensemble model; the trained ensemble classifiers then compute the labels of new records given as feature vectors. Later, the performance of the proposed model is calculated. In this research, we used an ensemble model based on three different deep learning algorithms, namely RBM, DBN, and DBM. The dataset was divided into training and test sets with a 70-30 ratio.

Classification
The classification phase plays a key role in MHSS. The training dataset is used to train the model, and the test dataset is used to validate the model's results. For classification, a bagging approach is used that combines three classifiers, i.e., RBM, DBN, and DBM.

Restricted Boltzmann machine (RBM)
The RBM is a generative stochastic neural network that is robust for feature learning, classification, filtering, and dimensionality reduction. It is built from probabilistic elements, i.e., neurons, that make up the whole network. An RBM is a two-layer neural network forming a bipartite graph consisting of two groups, the hidden units d and the visible units b, with the restriction that there are no connections within the hidden layer or within the visible layer; connections exist only between the two groups.
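A single contrastive-divergence (CD-1) update for a binary RBM can be sketched as follows. The dimensions, learning rate, and the tiny two-pattern dataset are illustrative; this is a minimal sketch of the training rule, not the configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) update for a binary RBM with
    visible bias b, hidden bias c, and weight matrix W (visible x hidden)."""
    rng = rng or np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + c)                      # p(h = 1 | v0), positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                    # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)                     # negative phase
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Tiny usage: fit the two complementary 4-bit patterns.
rng = np.random.default_rng(0)
V = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
W, b, c = rng.standard_normal((4, 2)) * 0.1, np.zeros(4), np.zeros(2)
for _ in range(500):
    W, b, c = cd1_step(V, W, b, c, rng=rng)
ph = sigmoid(V @ W + c)
pv = sigmoid(ph @ W.T + b)
print(np.mean((V - pv) ** 2))   # mean squared reconstruction error
```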

Deep belief network
A deep belief network (DBN) is a type of deep learning model in which several RBMs are combined; Fig. 8 illustrates a generic deep belief network (Jia et al. 2016). Training is performed in a greedy, layer-wise fashion, with weight fine-tuning, to extract hierarchical features from the given data. The purpose of the DBN is to model the distribution of the data between the hidden and input layers such that there is a consistent relationship between upper-layer and lower-layer nodes. The training process is performed layer-wise by balancing the weight parameters and applying contrastive divergence. Under these restrictions, the distributions modeled by the DBN are stable and robust to noise transformations.

Deep Boltzmann machine
A Deep Boltzmann Machine (DBM) is formed by combining RBMs into a deep structure in which the hidden layers are arranged hierarchically. Following the RBM connectivity restriction, there are no connections between non-neighboring layers, and full connectivity is established between subsequent layers. The DBM can also be described as a network of symmetrically coupled stochastic binary units. The main difference between the DBM and the DBN is that the DBM is a completely undirected graphical model, whereas the DBN is a mixed directed/undirected model (Jia et al. 2016).

Experimental results
This section presents the experimental evaluation of our proposed ensemble model. The mean eigenvalues for different numbers of hidden neurons in the various layers are shown in Fig. 9. The first layer yields the highest mean value, 0.745, with 451 hidden neurons. The mean eigenvalue of the second layer lies between 0.76 and 0.805, with the highest value obtained at 381 hidden neurons, as shown in Fig. 10.
In the third layer, the highest mean eigenvalue is 0.966, with 201 hidden neurons, as shown in Fig. 11; results for the fourth and final layer are shown in Table 3. At the input layer, the neurons correspond to the input data; the first layer has 451 neurons, the second 381, the third 201, the fourth 191, and the output layer has 7 neurons. Figure 13 demonstrates the training performance of the proposed MHSS method, showing a decrease in mean squared error for MHSS with increasing hidden layers. The mean squared errors of the MHSS algorithm are higher than 50 except for the first layer. The results also show that MHSS has a higher feature-learning ability than the traditional algorithms.
The main features obtained from the peer techniques are shown in Fig. 14. Note that the two main feature values were normalized into [0, 1] along the projection direction. The AE, SpAE, and CAE could distinguish the ''G20,'' ''Normal,'' and ''G22 and B'' samples from the other samples. However, the peer algorithms, except for the SAE method, struggle especially with the ''B'' and ''G20 and B'' samples.
To verify the effectiveness of the proposed MHSS, the most commonly used techniques were employed in testing, with results listed in Table 4. The classification accuracy of all algorithms is above 90 percent except for the SAE. These results make it clear that the performance of the SAE and the proposed MHSS algorithm is more promising than that of the AE, CAE, and SpAE. The Lc value of the MHSS is 0.99, which is better than the SAE's 0.92; the SpAE method has the lowest Lc value, 0.64. This means the features learnt by MHSS are more distinguishable than those learnt by the other techniques.

Conclusion and future directions
Deep learning is a rapidly growing research area with state-of-the-art achievements in various applications such as speech recognition, object recognition, machine translation, and image segmentation. In modern industrial manufacturing, MHSS is gaining popularity because of the widespread availability of low-cost sensors and internet connectivity. Deep learning architectures provide useful tools to analyze and process these huge amounts of machinery data; in contrast to traditional machine learning models, deep learning algorithms have achieved superior results in machine health monitoring. Performing pre-processing with a sparse autoencoder can improve the accuracy of MHSS, and CAE applications are crucially important for MHSS. CNNs and their variants can handle MHSS feature extraction, although, due to model complexity, hyper-parameter selection is needed to acquire state-of-the-art performance.
Authors' Contribution Faizan Ullah and Abdu Salam conducted the research, completed the original draft, and revised it. Muhammad Abrar, Masood Ahmad, and Atif Khan contributed to data processing. Abdullah Alharbi, M. Irfan Uddin and Wael Alosaimi revised the manuscript.

Declarations
Conflict of interest The authors declare that they have no conflicts of interest regarding the publication of this article.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent I consent to the journal reviewing the paper. I confirm that the manuscript has not been submitted to another journal for simultaneous consideration and has not been published previously. The study has not been split into several parts to increase the number of submissions, whether to various journals or to one journal over time. No data have been fabricated or manipulated (including images) to support my conclusions. No data, text, or theories by others are presented as if they were my own. Proper acknowledgements of other works are provided, and no copyrighted material is used. I consent to submit the paper, I have contributed sufficiently to the scientific work, and I am responsible and accountable for the results.