A Feature Selection Committee Method Using Empirical Mode Decomposition for Multiple Fault Classification in a Wind Turbine Gearbox

Gearboxes are widely used in various industries such as aircraft, automobiles, wind turbines, and ships, among others. Due to their complex configuration, identifying fault and failure patterns is a challenging task. Their internal components, such as bearings and gears, have different fault patterns, which can appear in one or in both components. The vibration signals were processed using the Empirical Mode Decomposition (EMD), and the Pearson Correlation Coefficient (PCC) was used to select the significant Intrinsic Mode Functions (IMFs), from which 18 features were extracted. Four feature-ranking techniques [ReliefF, Chi-Square, Max Relevance Min Redundancy (mRMR) and Decision Tree] were used in a committee to select, among the 10 highest-ranked features, those that appear in at least 3 of the 4 methods. The new feature set was used as input to Support Vector Machine (SVM), Random Forest (RF) and Artificial Neural Network (ANN) algorithms. The results showed that using the PCC value as a tool for selecting the significant IMFs, combined with the feature committee, led to good results for this classification problem. In this case study, the ANN model outperformed the SVM and RF algorithms, using only 4 features to achieve 95.42% accuracy and 6 features to achieve 100% accuracy.


Introduction
Gearboxes are widely used in several industries such as aircraft, automobiles, wind turbines, and ships, among others [1,29]. They play an important role in industry for motion and torque transmission [28].
Due to its complex configuration, it is a challenging task to recognize fault and failure patterns in a gearbox, because its internal components, such as bearings and gears, have different fault patterns, and a fault can appear in one or in both components. Therefore, gearbox fault detection has gained much attention and has become a widely studied topic in the last few years [2,3].
To diagnose gearbox fault patterns, data-driven methodologies for condition monitoring use four basic stages: acquisition and conditioning of the signals, feature extraction, feature selection and classification [2].
The signals can be acquired from various physical variables such as vibration and acoustic emission, among others. Vibration signal analysis is the most commonly used technique for condition monitoring because the signal is easy to measure and, in addition, the machinery does not need to be stopped during the process [2].
The acquired signals can be characterized and analyzed in three domains: time domain, frequency domain and time-frequency domain. The features extracted from these domains are used together or separately for pattern recognition and condition monitoring with machine learning algorithms such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Time domain analysis can be computationally less expensive to implement, needing only signal conditioning as preprocessing, while the frequency domain approach employs complex algorithms to extract representative features [2]. Lei et al. [4] used specially designed time domain features together with frequency domain features as input to a K-Nearest Neighbors (KNN) algorithm, and Praveenkumar et al. [5] used time domain features and the area under the waveform in the frequency domain separately, comparing them to propose an online condition monitoring method for an automobile gearbox.
The major disadvantage of these methods lies in their inability to analyze non-stationary signals, which are generally related to component defects [6]. To overcome this drawback, recent studies have used time-frequency analysis methods for fault feature extraction from machinery signals [7]. Among them, the Wavelet Transform (WT) has gained attention in many fields as an important signal processing tool for fault detection and diagnosis of rotating machinery [2].
WT has been widely used for feature extraction, combined with many feature selection algorithms and classifiers for fault detection. Saravanan et al. [8] used the Discrete Wavelet Transform (DWT) with an ANN; Cerrada et al. [9] used features extracted from the Wavelet Packet Transform (WPT) along with features extracted in both the time and frequency domains, selecting the best set of them with a Genetic Algorithm (GA); and Zarnaq et al. [10] used DWT features with a Correlation-based Feature Selection (CFS) method as input to Random Forest and Multilayer Perceptron (MLP) Neural Network algorithms.
However, WT has a few limitations, such as the need for predetermined basis functions: its results depend on the choice of a proper mother wavelet, and an incorrect choice can lead to a diagnosis error. To overcome these limitations, the Empirical Mode Decomposition (EMD) has been proposed as a signal processing tool because of its self-adaptive nature, i.e., EMD decomposes a complex signal into a set of Intrinsic Mode Functions (IMFs) based on its local maxima and minima [1].
The EMD technique has also been used as a feature extraction tool with many classifiers and feature selection algorithms for fault detection. Desavale et al. [3] used EMD features as input to an ANN algorithm, selecting the most sensitive ones with the Euclidean distance; Vernekar et al. [11] used EMD features with a Naive Bayes classifier; and Suresh et al. [12] used EMD features with an SVM classifier.
Several studies have focused on single or multiple faults in gears and bearings separately, but in actual practice there may be faults in bearings and gears simultaneously. Dhamande et al. [13] used both time and frequency domain features of a gearbox with combined bearing-gear faults in an ANN, and later the same authors [14] used features extracted from the time and frequency domains together with features extracted from the DWT and the Continuous Wavelet Transform (CWT) as input to ANN, SVM and Naive Bayes classifiers.
The purpose of this work is to propose a method to classify a wind turbine gearbox into four condition states, based on a feature committee that selects the most relevant features among those ranked by ReliefF, Chi-Square, mRMR and Decision Tree. The EMD technique is used as a signal preprocessing tool to extract the IMFs, and the PCC value is used to select the most significant IMFs, from which the features are extracted.
This paper is structured as follows. Section 2 describes the EMD technique and the ANN, SVM and Random Forest classifiers; Sect. 3 explains the methodology used in this work, including signal acquisition, signal processing, feature extraction and feature selection; Sect. 4 shows the results; and lastly, Sect. 5 presents the conclusion.

Empirical Mode Decomposition (EMD)
EMD decomposes any complex signal into a finite number of components, called intrinsic mode functions (IMFs). An IMF is a function in which the number of extrema and the number of zero crossings are equal or differ at most by one, and whose envelopes are symmetric with respect to zero [15,16].
The process of extracting IMFs is called sifting and can be described as follows (a minimal code sketch is given after this list):
1. Identify all the local maxima and minima of the vibration signal x(t);
2. Connect them with cubic splines to produce the upper and the lower envelopes;
3. Calculate the mean of the envelopes, m(t);
4. Calculate the difference between the data and the mean, d(t) = x(t) − m(t);
5. Verify whether the conditions for d(t) to be considered an IMF are satisfied, namely: (a) the number of extrema and the number of zero crossings must be equal or differ at most by one; and (b) at any point, the mean value of the upper and lower envelopes is zero;
6. If these two conditions are not satisfied, repeat the steps above on d(t);
7. If the conditions are satisfied, d(t) is defined as an IMF, c(t);
8. Calculate the residue r(t) = x(t) − c(t) and repeat the steps above on r(t) to extract the next IMF;
9. The operation ends when the residue contains no more than one extremum, i.e., it becomes monotonic.
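The sifting procedure maps directly to code. Below is a minimal Python sketch using NumPy and SciPy; the actual work used MATLAB's EMD implementation, which applies more refined boundary handling and stopping rules, so the tolerance sd_tol and iteration caps here are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x):
    """One sifting pass: subtract the mean of the cubic-spline envelopes."""
    t = np.arange(len(x))
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 4 or len(minima) < 4:
        return None  # too few extrema to build cubic envelopes
    upper = CubicSpline(maxima, x[maxima])(t)
    lower = CubicSpline(minima, x[minima])(t)
    return x - (upper + lower) / 2.0

def emd(x, max_imf=10, max_sift=100, sd_tol=0.05):
    """Minimal EMD: extract IMFs until the residue is (near-)monotonic."""
    imfs, residue = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imf):
        if sift_once(residue) is None:
            break
        h = residue.copy()
        for _ in range(max_sift):
            d = sift_once(h)
            if d is None:
                break
            sd = np.sum((h - d) ** 2) / (np.sum(h ** 2) + 1e-12)
            h = d
            if sd < sd_tol:  # standard-deviation stopping criterion
                break
        imfs.append(h)
        residue = residue - h
        n_ext = (len(argrelextrema(residue, np.greater)[0])
                 + len(argrelextrema(residue, np.less)[0]))
        if n_ext <= 1:  # step 9: residue has at most one extremum
            break
    return imfs, residue
```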

Support Vector Machine (SVM)
The Support Vector Machine tries to set a linear boundary between two classes in such a way that the distance between the boundary and the nearest data point in each class is maximal, i.e., it maximizes the margin. The nearest data points, known as support vectors (SVs), are used to define the margin [7,16].
In some cases, if a linear boundary in the input space is not enough to separate the two classes properly, it is possible to create a hyperplane that allows linear separation in a higher dimension. In SVM, this can be achieved by using a transformation φ(x) that converts the data from the input space to a higher dimensional space. A kernel function can be used to perform this transformation [7,16], and it can be defined as:

K(x, x_i) = φ(x) · φ(x_i)    (1)

By introducing the transformation performed by the kernel function, the basic form of the SVM decision function can be obtained:

f(x) = sgn( Σ_{i=1}^{l} v_i K(x, x_i) + b )    (2)

where b is a scalar threshold, x is the input vector, x_i are the SVs obtained from training, l is the number of training samples and v_i is a weighting factor that determines which of the inputs are actually SVs [7]. Among the kernel functions in common use are the linear, polynomial, radial basis and sigmoid functions [7]. Figure 1 shows how the SVM operates.
Fig. 1 SVM operation [7]

In this work, the SVM model was trained using the Gaussian kernel and varying the regularization index from 2^−5 to 2^15.
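As a rough illustration of this training setup, the scikit-learn sketch below grid-searches the regularization parameter C over 2^−5 to 2^15 with an RBF (Gaussian) kernel; the data variables are placeholders standing in for the normalized features and labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(160, 18)       # placeholder: 18 features per sample
y_train = np.random.randint(0, 4, 160)  # placeholder: 4 condition classes

# Regularization index swept from 2^-5 to 2^15, Gaussian (RBF) kernel.
param_grid = {"C": 2.0 ** np.arange(-5, 16)}
svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
svm.fit(X_train, y_train)
print(svm.best_params_, svm.best_score_)
```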

Random Forest (RF)
Random Forest is an ensemble learning method for classification that consists of many decision trees (DTs) [10].
To classify a new instance, each DT provides a classification for the input data; RF then collects the classifications and chooses the most voted prediction as the result. Essentially, RF enables a large number of weak or weakly correlated classifiers to form a strong classifier [10].
RF is a combination of tree predictors which depends on the random selection of the input variables and the random selection, with replacement, of samples from the data set. The selected variables and the bootstrap samples are used to grow every tree in the forest [9].
Figure 2 shows the structure of an RF. The complement set OOB_b of each tree T_b is the Out-Of-Bag sample, and it is used as the cross-validation set for the tree, i.e., as the test set during the process of growing the tree [9].

Fig. 2 Structure of a Random Forest algorithm [9]
In this work, the RF model was trained using the Random Search and Grid Search methods, varying the number of trees from 5 to 50, with 5 observations per leaf.
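A comparable setup in scikit-learn might look like the sketch below (a sketch under assumed placeholder data, not the exact MATLAB configuration): the number of trees is grid-searched from 5 to 50 in steps of 5, with a minimum of 5 observations per leaf.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(160, 18)       # placeholder feature matrix
y_train = np.random.randint(0, 4, 160)  # placeholder labels (4 classes)

# Trees swept from 5 to 50; 5 observations per leaf as in this work.
param_grid = {"n_estimators": list(range(5, 51, 5))}
rf = GridSearchCV(RandomForestClassifier(min_samples_leaf=5), param_grid, cv=5)
rf.fit(X_train, y_train)
print(rf.best_params_)
# RandomizedSearchCV can replace GridSearchCV for the Random Search variant.
```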

Artificial Neural Network (ANN)
The Artificial Neural Network algorithm tries to mimic biological neurons by simulating some of the functions of the human brain. ANNs are made up of processing elements that simulate neurons or networks of neurons in the brain. The artificial neurons, which are computational units, receive input from other neurons [17,18].
An ANN has three layers: the input layer, which receives information from an external source; the hidden layer, which processes and transforms the information; and the output layer, which sends the processed information back to the external source. Figure 3 shows the architecture of an ANN [17,18].

Fig. 3 Artificial Neural Network architecture [8]
A feed-forward neural network with one hidden layer, trained with the Scaled Conjugate Gradient backpropagation algorithm, was used with a hyperbolic tangent activation function, which is expressed in Eq. 3:

f(x) = (e^x − e^−x) / (e^x + e^−x)    (3)
The output of the hidden layer's neurons is the output of the activation function, and it is given as input to the neurons of the output layer. If the output of the network is not equal to the output of the desired class, the loss is computed. The loss function used is the cross-entropy, represented in Eq. 4 [17]:

E = − Σ_{i=1}^{n} t_i ln(y_i)    (4)

where t_i is the target value and y_i is the network output for class i.
The weights are adjusted so that the cross-entropy is minimized. The learning rate determines the speed of the weight updates: if it is too small, convergence may be very slow, and if it is too large, the gradient descent may overshoot and fail to converge [17]. In this work, a learning rate of 0.05 was used [8]. As an ANN structure with one hidden layer achieved good results in [8] and [17], it was decided to proceed in the same way here.
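The sketch below reproduces the spirit of this setup in scikit-learn, under assumed placeholder data. Note that scikit-learn has no Scaled Conjugate Gradient solver, so plain SGD is substituted here; MLPClassifier minimizes the cross-entropy loss of Eq. 4 for classification.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.random.rand(160, 18)       # placeholder feature matrix
y_train = np.random.randint(0, 4, 160)  # placeholder labels (4 classes)

# One hidden layer with tanh activation (Eq. 3); SGD stands in for the
# Scaled Conjugate Gradient solver, which scikit-learn does not provide.
best = None
for n_hidden in range(2, 26):  # hidden neurons swept from 2 to 25
    ann = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="tanh",
                        solver="sgd", learning_rate_init=0.05,
                        max_iter=2000, random_state=0)
    ann.fit(X_train, y_train)
    acc = ann.score(X_train, y_train)  # in practice: score on validation data
    if best is None or acc > best[0]:
        best = (acc, n_hidden)
```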
These algorithms were chosen because they are well established in the literature and have achieved good results.

Data Acquisition
The data was collected from a "healthy" and a "damaged" gearbox of the same design under GRC dynamometer tests. Vibration data was collected by accelerometers at a sampling frequency of 40 kHz per channel, for 60 s, along with the high-speed shaft RPM signal, using a National Instruments PXI-4472B [19].
The damaged gearbox had been sent for field testing, where it experienced two loss-of-oil events that damaged its internal bearing and gear elements [19]. This dataset therefore contains compound real faults.
The two gearboxes are composed of one low speed planetary stage and two parallel stages.
This work analyzed only the parallel axes, i.e., the High Speed and Intermediate Speed shafts; the dataset was provided by the National Renewable Energy Laboratory (NREL).
Table 1 shows the basic information of the gears and bearings, Fig. 4a shows the bearing locations, Fig. 4b shows the gearbox arrangement, and Table 2 shows the faults and their locations.
As can be seen in Fig. 4a, the two parallel axes are each supported by one cylindrical roller bearing on the upwind side and by two tapered roller bearings on the downwind side of the assembly.

Signal Processing
The signals collected from the parallel axes (High Speed and Intermediate Speed) under the two conditions described above were processed using the EMD technique in the MATLAB workspace.
Each condition has 10 signals; initially, 2 signals were set aside for testing and the other 8 for training and validation. On these 8 signals, k-fold cross-validation with 5 folds was used to separate the data into 80% for training and 20% for validation.
For the ANN, a simple split of 70% for training, 15% for validation and 15% for testing was used, because the MATLAB ANN code makes this division automatically.
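For the SVM and RF pipelines, the split described above can be sketched as follows; the signal indices are a hypothetical layout standing in for the 10 signals of each condition.

```python
import numpy as np
from sklearn.model_selection import KFold

signals = np.arange(10)                            # 10 signal indices per condition
test_ids, trainval_ids = signals[:2], signals[2:]  # 2 for test, 8 remain

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(kf.split(trainval_ids)):
    train_ids = trainval_ids[tr]  # ~80% of the 8 signals
    val_ids = trainval_ids[va]    # ~20% of the 8 signals
```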
Each signal was partitioned into 10 equal parts before the separation procedure described above, and then the IMFs were extracted from each of the segmented parts. Figure 5 shows a flowchart of the proposed method and Fig. 6 shows the IMFs extracted from a partitioned signal.
The EMD technique is based on the assumption that any complex signal can be decomposed into a number of intrinsic mode functions (IMFs), i.e., any signal consists of different simple intrinsic modes of oscillation [7]. However, not all IMFs contain valid information, so the selection of IMFs is a vital step. For this purpose, a statistical parameter, the Pearson Correlation Coefficient (PCC), is calculated between each IMF and the original signal. The PCC measures the linear relationship between two signals and its value ranges from −1 to +1: the closer to the extremes, the stronger the correlation between the signals, while a value of 0 or close to 0 indicates a weak correlation or none at all [1].
The IMFs can be categorized as noise-part, signal-part and trend-part. Typically, the noise is captured by the IMFs with low indices, and the trend is captured by the IMFs with high indices. The remaining IMFs contain only the signal part and can be regarded as the significant IMFs [20].
According to [20], the IMFs with low indices and low PCC values are identified as the noise-part IMFs, the IMFs with high indices and low PCC values are identified as the trend-part IMFs, and the remaining ones are identified as the signal-part IMFs.
As can be noted in Fig. 7, the PCC value becomes almost steady after 0.01; therefore, a threshold of 0.01 was set. The IMFs with PCC values above the threshold are chosen as the desired IMFs, i.e., the signal-part IMFs. From Fig. 7, it can be noted that the first 7 IMFs have PCC values greater than the threshold, so 7 IMFs were chosen first; later, the same analysis was repeated with only 4 IMFs, because these have the greatest PCC values among the desired IMFs.
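The IMF selection step can be sketched as follows, assuming each IMF has the same length as the raw segment; the 0.01 threshold is the one adopted above.

```python
import numpy as np

def select_imfs(signal, imfs, threshold=0.01):
    """Keep the signal-part IMFs: those whose |PCC| with the raw signal
    exceeds the threshold (0.01 in this work)."""
    pcc = [np.corrcoef(signal, imf)[0, 1] for imf in imfs]
    return [i for i, r in enumerate(pcc) if abs(r) > threshold], pcc
```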
After extracting the features, the feature set was normalized using the min-max technique, given by Eq. 5 [5]:

x_norm = (x − x_min) / (x_max − x_min)    (5)

where x_min and x_max are the minimum and maximum values of each feature.
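In scikit-learn, Eq. 5 corresponds to MinMaxScaler; fitting on the training set only, so that the test set reuses the training minima and maxima, is a common practice assumed here.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                    # implements Eq. 5 column-wise
X_train_n = scaler.fit_transform(X_train)  # fit on training features only
X_test_n = scaler.transform(X_test)        # reuse training min/max
```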

Feature Extraction
Feature extraction is a way of representing the original signal with reduced dimensionality [12]. Analyzing very long observation records is time-consuming and hard to interpret, so it is important to process and reduce the data in a way that makes it easy to identify the characteristics corresponding to faults without losing the original information [12,17]. It can be considered the most important step in condition monitoring, as the features are used as input to the machine learning algorithms. A good feature should have the following attributes: computationally inexpensive, mathematically well defined, easily explainable, insensitive to noise and uncorrelated with the other features [17].
After selecting the significant IMFs using the PCC value, 18 features were extracted from each of them. The features are shown in Table 3.
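Table 3 lists the paper's full set of 18 features; the sketch below computes a representative subset of common time-domain statistics per IMF (not the paper's exact set) to illustrate the step.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def extract_features(imf):
    """A few common time-domain features of one IMF (illustrative subset)."""
    rms = np.sqrt(np.mean(imf ** 2))
    return {
        "mean": np.mean(imf),
        "std": np.std(imf),
        "rms": rms,
        "energy": np.sum(imf ** 2),
        "kurtosis": kurtosis(imf),
        "skewness": skew(imf),
        "crest_factor": np.max(np.abs(imf)) / rms,
        "peak_to_peak": np.ptp(imf),
    }
```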

Feature Ranking and Selection
Feature selection is the process of selecting the useful features among those extracted from the raw data or from the IMFs. This is done with the purpose of increasing the accuracy of the classifier and reducing the size of the data, which in turn reduces the computation time [17].
Four techniques were used for feature ranking and selection: ReliefF, Chi-Square, Max Relevance Min Redundancy (mRMR) and Decision Tree. These techniques were used for feature ranking, and then a new feature set, composed of the features that appear in at least 3 of the 4 methods, was used as input to the SVM, Random Forest and ANN. In the case of the RF, the Out-of-Bag importance was used instead of the DT because, the RF being a set of DTs, it would make little sense to use the ranking provided by a single tree.
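The committee rule itself is simple; a sketch follows, assuming each ranker returns feature names ordered best-first.

```python
from collections import Counter

def committee_select(rankings, top_k=10, min_votes=3):
    """Keep features appearing in the top-k lists of at least min_votes rankers."""
    votes = Counter(f for ranked in rankings.values() for f in ranked[:top_k])
    return [f for f, v in votes.items() if v >= min_votes]

# Example: rankings = {"relieff": [...], "chi2": [...], "mrmr": [...], "oob": [...]}
```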

ReliefF
ReliefF is a supervised algorithm for feature ranking and is usually applied in data preprocessing as a feature subset selection method [6]. This method scores attributes according to how well their values distinguish between instances that are close to each other. The basic idea is to randomly select an instance R_i, then search for k of its nearest neighbors of the same class, called the nearest hits H_i, and k nearest neighbors from different classes, called the nearest misses M_j. The algorithm updates the quality estimation in the weight vector W, giving higher weights to attributes that discriminate the instance from neighbors of different classes [21,22].
The basic difference from Relief is the selection of k hits and misses, which ensures greater robustness of the algorithm with respect to noise [22]. In this work, the k parameter was set to 5.
The attribute weight is updated as in Eq. 6:

W(f_i) ← W(f_i) + (1/k) Σ_{j=1}^{k} P(f_{t,i}, f_{DC(x_i)}) − (1/k) Σ_{j=1}^{k} P(f_{t,i}, f_{SC(x_i)})    (6)

where f_{t,i} describes the value of the instance x_i in the attribute f_i, P represents the distance measure, and f_{DC(x_i)} and f_{SC(x_i)} represent the values of the ith attribute of the neighboring points of x_i with different and same class labels, respectively [21].
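One way to run ReliefF in Python is via the third-party skrebate package (an assumption; MATLAB's relieff function is an alternative), with n_neighbors matching the k = 5 used here.

```python
import numpy as np
from skrebate import ReliefF  # third-party package: pip install skrebate

relieff = ReliefF(n_neighbors=5)  # k = 5 hits/misses, as in this work
relieff.fit(X_train_n, y_train)   # numpy arrays assumed
ranking = np.argsort(relieff.feature_importances_)[::-1]  # best first
```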

Chi-Square
The Chi-Square test is used in statistics to verify the independence of two events. In feature selection, it is used to verify the independence between the occurrence of a specific feature value and the occurrence of a specific class: the statistic in Eq. 7 is estimated for each feature, and the features are ranked by their scores. When a feature is independent of the class, it can be discarded [6,21].
χ² = Σ_j (Y_j − u_j)² / u_j    (7)

where χ² is the test statistic, Y_j is the number of observations in class j, and u_j represents the expected value of Y_j, with u_j = N·p_j, N being the number of observations and p_j the probability of occurrence [21].
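In scikit-learn this ranking is available as chi2; it requires non-negative inputs, which the min-max normalized features (Eq. 5) already satisfy.

```python
import numpy as np
from sklearn.feature_selection import chi2

scores, p_values = chi2(X_train_n, y_train)  # normalized features assumed
ranking = np.argsort(scores)[::-1]  # highest score = most class-dependent
```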

Max Relevance Min Redundancy (mRMR)
Max Relevance Min Redundancy (mRMR) is a feature selection technique whose approach is to select an optimal feature subspace that maximizes the correlation between each feature and the class target while minimizing the correlation between the features themselves [23,24]. The following incremental criterion is used to select the mth feature (Eq. 8):

max_{f_i ∈ F − S_{m−1}} [ I(C, f_i) − (1/(m−1)) Σ_{f_j ∈ S_{m−1}} I(f_i, f_j) ]    (8)

where F is the initial feature set, S is the feature subset, I(f_i, f_j) is the mutual information between two features f_i and f_j, and I(C, f_i) quantifies the relevance of the feature f_i in S to the target class C [24].
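A greedy implementation of the criterion in Eq. 8 can be sketched with scikit-learn's mutual information estimators; this is an illustrative implementation, not the one used in the paper.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_rank(X, y, n_select):
    """Greedy mRMR: relevance I(C, f) minus mean redundancy with selected set."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected])
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```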

Decision Tree
A decision tree is essentially a tree-based knowledge representation methodology used to represent classification rules. It consists of a number of branches, one root, a number of nodes and a number of leaves. A branch is a chain of nodes from the root to a leaf, and each node involves one attribute. The presence of an attribute and its position in the tree provide information about the importance of that attribute [25]. The Decision Tree algorithm used in this work was CART (Classification and Regression Trees).

Out-of-Bag (OOB) Feature Importance
RF can provide a measure of Feature Importance (FI) as follows: each tree b in the RF model keeps its misclassification error rate on the OOB data (i.e., the percentage of instances in the OOB data incorrectly classified by tree b). Then, the values of predictor variable j are randomly permuted in the OOB data and the misclassification rate is recomputed for each tree. The difference in classification rates, averaged over all trees in the RF model, is the permutation FI measure [26], as shown in Eq. 9:

FI_j = (1/B) Σ_{b=1}^{B} ( e_OOB_{b,j} − e_OOB_b )    (9)

where e_OOB_b is the OOB misclassification rate for tree b, e_OOB_{b,j} is the OOB misclassification rate for tree b when the values of predictor x_j are randomly permuted in the OOB data, and B is the number of trees [26]. The e_OOB is calculated as in Eq. 10:

e_OOB = (1/N) Σ_{i=1}^{N} ( y_i − y_i^OOB )²    (10)

where N is the number of observations, y_i is the prediction over all bootstrap samples and y_i^OOB is the averaged OOB prediction [27].
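scikit-learn does not expose the per-tree OOB permutation measure of Eq. 9 directly (MATLAB's TreeBagger does); permutation importance on held-out data, as sketched below with assumed validation variables, is a close and commonly used substitute.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5,
                            oob_score=True, random_state=0)
rf.fit(X_train_n, y_train)
result = permutation_importance(rf, X_val_n, y_val,  # validation data assumed
                                n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
```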
These ranking techniques were used separately, achieving good results, in previous works [2,3,25], so it was decided to use them together in this work.

Classification Process
The features extracted from the models with 4 and 7 IMFs were fed to the four ranking techniques in order to select the best set to be used, separately, as input to the SVM, the Random Forest and the ANN, by identifying the most relevant features, i.e., the features that could lead to good results (the Out-of-Bag importance was used for feature selection instead of the Decision Tree only in the Random Forest models).
The algorithms were run with combinations from the highest-ranked feature to the lowest-ranked one, adding one feature per turn, as sketched below.
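This incremental evaluation can be sketched as follows, where make_model is any classifier factory and the data variables are assumed.

```python
def incremental_accuracy(make_model, ranked_idx, X_tr, y_tr, X_va, y_va):
    """Train on the top-1, top-2, ... ranked features and record accuracy."""
    accuracies = []
    for k in range(1, len(ranked_idx) + 1):
        cols = ranked_idx[:k]
        model = make_model().fit(X_tr[:, cols], y_tr)
        accuracies.append(model.score(X_va[:, cols], y_va))
    return accuracies
```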
Figures 8 and 9 show the bar graphs of the 4 ranking methods, ranking the features from the most to the least relevant, for the models with 4 and 7 IMFs, respectively. The machine learning (ML) models using features extracted from 4 IMFs achieved better accuracy than the ML models using features extracted from 7 IMFs. Moreover, the ML models with 4 IMFs needed fewer features to achieve outstanding results, except for the mRMR with the SVM model, as can be seen in Table 4, and with the Random Forest, as can be seen in Table 5. The use of the decision tree for the SVM model with 4 IMFs is noteworthy because of the low number of features selected, which can lead to a loss of the model's robustness.
To achieve better results with the ANN algorithm, the number of neurons in the hidden layer was increased from 2 to 25 in every turn of the feature combination, for each of the four techniques. The ANN models with 4 IMFs achieved 100.00% accuracy with almost the same number of features, but the number of neurons in the hidden layer was quite different: 18 for mRMR, 17 for ReliefF, 16 for the Decision Tree and 24 for Chi-Square.
The ANN models using features extracted from 7 IMFs also achieved good results, reaching 97.38% accuracy, but they needed more features than the ANN models with 4 IMFs.
As the 3 ML models with 4 IMFs gave the best results, the next step was to reduce the number of features used in those models. The 10 features with the highest rank given by each of the four ranking techniques were used in a committee to select those that appear in at least 3 of the 4 techniques. The performance of the remaining 9 features was compared using the ML models with 4 IMFs.
The reason for selecting only the 10 features with the highest rank is that the Decision Tree showed that only 9 features are relevant, so it was decided to use n + 1 features, where n = 9.
Table 7 shows the 10 features with the highest ranks of each ranking technique for models with 4 IMFs, and Table 8 shows the features which appear at least in 3 of the 4 techniques.
As can be seen in Table 7, the selected features are almost the same across techniques, but they appear in different positions.
The features shown in Table 8 were used as input to the SVM, Random Forest and ANN models, with combinations from the first to the last feature, i.e., adding one feature per turn. Once again, to achieve better results with the ANN algorithm, the number of neurons in the hidden layer was increased from 2 to 25 in every turn of the feature combination. Table 9 shows the accuracy of the SVM and Random Forest models with these feature combinations. The number of features in the first column of Table 9 corresponds to how many features were present in each combination: number 1 matches the first feature, number 2 matches the first and second features, and so on.
As shown in Table 9, the SVM model achieved higher accuracy (99.38%) than the RF model (98.44%), but the SVM model used 8 features against 5 features for the RF model. Taking as reference that all models above 95% accuracy are considered to have an outstanding performance, it is concluded that the RF model using only 5 features is the best one.
Although, from the results given in Table 5, the RF model with 4 IMFs using Chi-Square achieved 100% accuracy, it did so using 8 features, so one still concludes that the results given in Table 9 are better. Tables 10 and 11 show the confusion matrices of the SVM and RF models, respectively, with the common features.
The misclassified data can probably be explained by the fact that the gears of the High-Speed Stage (HS-ST) and of the Intermediate Speed Stage (IMS-ST) have the same fault, scuffing, and the gear of the HS-ST and the pinion of the IMS-ST share the same shaft.
Table 12 shows the best results of the ANN model for each combination of features and number of neurons in the hidden layer.
From Table 12, it can be noted that the ANN model achieved its best accuracy values using 4, 6, 7, 8 and 9 features. The results presented in this table are the best achieved by combining the features with the number of neurons. The ANN achieved the same result as when using the four ranking techniques separately, but with fewer neurons in the hidden layer, which makes the model less complex and easier to implement.
In order to assess the contribution of the feature committee, the 3 algorithms were also run with all 18 features, without any previous feature selection, and the results are shown in Table 13.
As can be seen, the feature selection provided by the feature committee improved the performance of all classifiers.

Conclusion
A gearbox dataset with compound real faults provided by NREL was used to classify 4 condition states: HSS healthy, HSS damaged, IMSS healthy and IMSS damaged. The signals were processed using the EMD technique to extract the IMFs, and the PCC value was then used to select the significant IMFs. 18 features were extracted from these IMFs, and four feature ranking and selection techniques were used to select the best feature set: ReliefF, mRMR, Chi-Square and Decision Tree. The Out-of-Bag importance was used with the RF instead of the DT because, the RF being a set of trees, it would make little sense to use a rank provided by a single tree. The features ranked by each technique separately were used as input to the SVM, Random Forest and ANN models, using combinations from the highest-ranked to the lowest-ranked feature, adding one feature per turn. Comparing the results, the models with 4 IMFs gave better accuracy and used fewer features than the models with 7 IMFs. The ANN achieved the best results among the models with 4 IMFs (100.00%).
The 10 features with the highest rank in the 4 ranking methods with 4 IMFs were compared, and the features that appear in at least 3 of the 4 methods were selected and used as a new feature set for the SVM, Random Forest and ANN. The same procedure was applied to the new feature set, i.e., combinations from the first feature to the last one, adding one feature per turn; in the ANN model, the number of neurons in the hidden layer was increased from 2 to 25 in each turn of the feature combination to achieve better results.
The performance of the SVM, Random Forest and ANN models was compared; the three models showed good results and, once again, the ANN achieved the best ones. Although the parallel axes of the gearbox present faults with similar patterns, the use of the PCC value as a tool to select the most significant IMFs improved the performance of the algorithms, and its combination with a feature committee, which selected the features extracted from those IMFs, together with the strong learning and classification performance of the ANN, allowed the model to achieve outstanding results in this classification problem. The group of features selected by the committee was considered the best one because it achieved better results than those obtained by the ranking methods individually and by the models with no feature selection. After comparing the performance of the three classifiers, the ANN was chosen as the main classification algorithm for this problem.
For future work, the usefulness of this method can first be further tested under variable speed conditions. Second, the method can also be tested to differentiate a gear fault from a bearing fault.

Fig. 5 Flowchart of the proposed method

Fig. 8 Ranked features for the model with 4 IMFs: a ReliefF, b mRMR, c Chi-Square, and d Decision Tree

Fig. 9 Ranked features for the model with 7 IMFs: a ReliefF, b mRMR, c Chi-Square, and d Decision Tree

Table 1 Gears and bearings basic information

Table 3 Features extracted from the selected IMFs. Note: N is the length of the IMF, x_i is the ith value of the IMF, μ is the mean of the IMF, and E is the energy of the IMF [3,25]

Table 4 Comparative results of the SVM model with 4 and 7 IMFs

Table 7 The 10 features with the highest rank in the 4 methods for models with 4 IMFs

Table 8 Features that appear in at least 3 of the 4 methods

Table 9 SVM and Random Forest accuracy for the feature combinations

Table 13 Results with all features and no feature selection method