## 3.1 Decision tree algorithm

A decision tree consists of decision nodes, branches and leaf nodes. The topmost node is called the root, and every decision path in the tree starts from it. The child nodes generated at each node depend on the decision tree algorithm used. The ID3 algorithm uses information gain as the feature selection criterion at each node and constructs the tree recursively, which is a top-down, divide-and-conquer induction process. The C4.5 algorithm improves on ID3 by selecting the test attribute with the highest information gain ratio. It can handle not only discrete attributes but also continuous ones. The CART algorithm can also be used with highly skewed or multimodal numerical data, as well as ordered or unordered categorical data. It uses a hyperplane to divide the input space into two parts, then divides each part into two again, repeating until the stopping requirement is met, so that each leaf node finally corresponds to one region of the space.
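The two selection criteria mentioned above can be illustrated with a minimal sketch. It assumes discrete attributes and entropy measured in bits; the function names and the toy data are hypothetical and are not taken from the experiments in this paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """ID3-style criterion: entropy reduction obtained by splitting on an attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the label subsets produced by the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5-style criterion: information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    if split_info == 0:
        return 0.0  # attribute takes a single value, so it cannot split the data
    return information_gain(values, labels) / split_info

# Hypothetical toy data: one attribute that separates the two classes perfectly
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
gain = information_gain(outlook, play)
ratio = gain_ratio(outlook, play)
```

Dividing by the split information penalizes attributes with many distinct values, which plain information gain tends to favor.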

Assume that the CART regression tree divides the input space into M units R_1, R_2, ..., R_M, and that each unit R_m has a fixed output value c_m. The regression tree model can then be expressed as follows, where I is the indicator function:

$$f(x)=\sum_{m=1}^{M}{c}_{m}I(x\in {R}_{m})$$

The optimal output value of each unit R_m is the mean of the outputs y_i of all input vectors x_i that fall in that unit, i.e.,

$$\widehat{c}_{m}=\operatorname{avg}\left(y_{i}\mid x_{i}\in R_{m}\right) \tag{1}$$

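When the input space is divided, the optimal splitting variable j and the optimal split point s are selected. The split point s divides the space along variable j into two regions R_1(j, s) and R_2(j, s), and applying Eq. (1) to each region gives the corresponding output values:

$$\widehat{c}_{1}=\operatorname{avg}\left(y_{i}\mid x_{i}\in R_{1}(j,s)\right) \tag{2}$$

$$\widehat{c}_{2}=\operatorname{avg}\left(y_{i}\mid x_{i}\in R_{2}(j,s)\right) \tag{3}$$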
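As an illustration of how j and s might be chosen, the following minimal sketch searches all candidate pairs exhaustively. The squared-error criterion, the use of midpoints as candidate split points, the function names and the toy data are assumptions made for this sketch; the text above does not specify them.

```python
def mean(values):
    return sum(values) / len(values)

def best_split(X, y):
    """Search all (j, s) pairs and return the one with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        column = sorted(set(row[j] for row in X))
        # Candidate split points: midpoints between consecutive distinct values
        for lo, hi in zip(column, column[1:]):
            s = (lo + hi) / 2.0
            r1 = [yi for row, yi in zip(X, y) if row[j] <= s]  # outputs in R1(j, s)
            r2 = [yi for row, yi in zip(X, y) if row[j] > s]   # outputs in R2(j, s)
            c1, c2 = mean(r1), mean(r2)                        # Eqs. (2) and (3)
            loss = sum((v - c1) ** 2 for v in r1) + sum((v - c2) ** 2 for v in r2)
            if best is None or loss < best["loss"]:
                best = {"j": j, "s": s, "c1": c1, "c2": c2, "loss": loss}
    return best

# Hypothetical one-dimensional data with an obvious jump between x = 3 and x = 10
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
split = best_split(X, y)
```

Applying the same search recursively to each of the two resulting regions yields the repeated binary partition described in Sect. 3.1.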

## 3.2 BP neural network algorithm

The basic BP algorithm includes two operations: forward propagation of the signal and backpropagation of the error. In other words, the output is computed in the direction from the input to the output, whereas the error is propagated from the output back to the input and used to adjust the weights and thresholds. During forward propagation, the input signal passes through the hidden layer and is transformed nonlinearly to generate the output signal. During error backpropagation, the output error is passed back to the input layer through the hidden layer and apportioned to the units of each layer. By adjusting the connection strengths between the input nodes and the hidden nodes and between the hidden nodes and the output nodes, the error is reduced along the direction of gradient descent. After training, the neural network can process nonlinearly transformed information so that the error for similar input samples is minimized.

The output of a hidden-layer node is given by Formula (4), and the output of an output-layer node by Formula (5), where f is the activation function and net_j and net_k are the net inputs of the respective nodes.

$$y_{j}=f\left(net_{j}\right) \tag{4}$$

$$O_{k}=f\left(net_{k}\right) \tag{5}$$

If forward propagation does not produce the desired output, the weights are adjusted according to the error. For the p-th sample of the training set, the error between the desired output d_k^p and the actual output y_k^p over the m output nodes is:

$$E^{(p)}=\frac{1}{2}\sum_{k=1}^{m}{\left(d_{k}^{p}-y_{k}^{p}\right)}^{2} \tag{6}$$

When all N samples are fed into the neural network model, the total error is:

$$E_{s}=\sum_{p=1}^{N}E^{(p)} \tag{7}$$

The minimum mean square error is the learning criterion, and the ultimate goal of learning is to bring the error between the actual output and the expected output below the error tolerance through continuous weight updates. With learning rate η, output-layer error signal σ_o and hidden-layer output y_j, the weights of the output layer are corrected to:

$$w_{jk}^{N+1}=w_{jk}^{N}+\eta \sigma_{o}y_{j} \tag{8}$$

Similarly, with hidden-layer error signal σ_h and input x, the weights of the hidden layer are corrected to:

$$w_{jk}^{N+1}=w_{jk}^{N}+\eta \sigma_{h}x \tag{9}$$
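The forward pass and the two weight corrections can be sketched as follows. This is only an illustration under stated assumptions: a single hidden layer, a sigmoid activation for f, error signals σ_o and σ_h defined by the standard delta rule (the text does not define them), thresholds folded into a constant input, and a hypothetical toy task (logical OR). The names `train_bp`, `V` and `W` are not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_error(X, D, V, W):
    """E_s of Eq. (7): sum over all samples of the per-sample error of Eq. (6)."""
    Y = sigmoid(X @ V)  # hidden-layer outputs, Eq. (4)
    O = sigmoid(Y @ W)  # output-layer outputs, Eq. (5)
    return 0.5 * np.sum((D - O) ** 2)

def train_bp(X, D, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))  # input -> hidden weights
    W = rng.normal(scale=0.5, size=(n_hidden, D.shape[1]))  # hidden -> output weights
    initial = total_error(X, D, V, W)
    for _ in range(epochs):
        for x, d in zip(X, D):
            y = sigmoid(x @ V)                       # forward pass, hidden layer
            o = sigmoid(y @ W)                       # forward pass, output layer
            sigma_o = (d - o) * o * (1.0 - o)        # assumed output-layer error signal
            sigma_h = (W @ sigma_o) * y * (1.0 - y)  # error passed back to the hidden layer
            W += eta * np.outer(y, sigma_o)          # output-layer correction, Eq. (8)
            V += eta * np.outer(x, sigma_h)          # hidden-layer correction, Eq. (9)
    return V, W, initial, total_error(X, D, V, W)

# Hypothetical toy task (logical OR); the last input column is a constant 1
# that plays the role of the hidden-layer threshold.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
D = np.array([[0], [1], [1], [1]], dtype=float)
V, W, e_before, e_after = train_bp(X, D)
```

In practice, training would stop as soon as the total error falls below the chosen tolerance rather than after a fixed number of epochs.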

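For the extreme learning machine (ELM), given an activation function g(·) that is infinitely differentiable on any interval and an arbitrarily small error ε (ε > 0), then for any random assignment of the input weights and hidden-layer thresholds, the hidden-layer output matrix H, the output weights B and the target matrix T satisfy:

$$\left\|H^{T}B-T\right\|=\epsilon \tag{10}$$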

Solving Formula (10) in the least-squares sense gives

$$\widehat{B}={\left(HH^{T}\right)}^{-1}HT \tag{11}$$

Based on the ridge regression principle, a regularization coefficient λ is introduced to reduce the error, from which the output weights are obtained:

$$\widehat{B}={\left(HH^{T}+\lambda I\right)}^{-1}HT \tag{12}$$

So, the output function of the extreme learning machine is

$$y(x)=h(x)\widehat{B} \tag{13}$$

Because the hidden-layer nodes of ELM are generated randomly, redundant (duplicate) hidden nodes may appear. Therefore, in order to make the model more accurate, ELM needs a large number of hidden nodes.
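The ridge-regularized solution of Formula (12) and the output function of Formula (13) can be sketched in a few lines. The sketch assumes tanh as the activation g, normally distributed random input weights and thresholds, and H arranged with one row per hidden node so that H^T B = T as in Formula (10). The function names, the value of λ and the sine-fitting toy task are hypothetical.

```python
import numpy as np

def elm_fit(X, T, n_hidden=20, lam=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden-layer thresholds
    H = np.tanh(X @ A + b).T                     # shape (n_hidden, n_samples)
    # Formula (12): B = (H H^T + lam I)^{-1} H T, solved without an explicit inverse
    B = np.linalg.solve(H @ H.T + lam * np.eye(n_hidden), H @ T)
    return A, b, B

def elm_predict(X, A, b, B):
    return np.tanh(X @ A + b) @ B                # Formula (13): y(x) = h(x) B

# Hypothetical toy regression task: fit sin(x) on [-3, 3]
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X)
A, b, B = elm_fit(X, T)
mse = float(np.mean((elm_predict(X, A, b, B) - T) ** 2))
```

Only the output weights B are learned; increasing `n_hidden` trades a larger linear system for a richer set of random hidden features.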

## 3.3 Parameter Learning

Structure learning of a Bayesian network can be defined as an optimization problem: find the structure G that maximizes the scoring function F subject to the constraints Ω. Here G is the network structure over all variables in the sample data set D, which may be related to each other; Ω is the set of constraints that the network structure must satisfy; and F is the scoring function used to evaluate the network structure. Generally speaking, there are three types of structure learning methods: those based on score and search, those based on constraints, and those based on random sampling.

Assuming a uniform prior distribution, the K2 score of Formula (14) can be obtained.

$$F_{K2}(G\mid D)=\log P(G)+\sum_{i=1}^{n}\sum_{j=1}^{q_{i}}\left[\log \frac{(r_{i}-1)!}{(m_{ij}+r_{i}-1)!}+\sum_{k=1}^{r_{i}}\log (m_{ijk}!)\right] \tag{14}$$

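With m_ijk denoting the number of samples in which node i takes its k-th value while its parent set takes its j-th configuration, and m_ij the sum of m_ijk over k, the maximum likelihood estimate of the conditional probability is shown in Formula (15):

$$\theta_{ijk}=m_{ijk}/m_{ij} \tag{15}$$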
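To make the counts in Formulas (14) and (15) concrete, the sketch below computes the contribution of a single node to the K2 score and its estimated conditional probabilities. It assumes complete discrete data stored as dictionaries; the variable names `valve` and `fault`, the function names and the four-row data set are hypothetical. The factorials are evaluated through `math.lgamma`, using log(n!) = lgamma(n + 1).

```python
import math
from collections import Counter

def counts(data, child, parents):
    """m_ijk: how often `child` takes each value under each parent configuration."""
    m = {}
    for row in data:
        j = tuple(row[p] for p in parents)
        m.setdefault(j, Counter())[row[child]] += 1
    return m

def k2_local_score(data, child, parents, r):
    """Contribution of one node to the double sum in Formula (14); r is its number of states."""
    score = 0.0
    for m_ij_k in counts(data, child, parents).values():
        m_ij = sum(m_ij_k.values())
        score += math.lgamma(r) - math.lgamma(m_ij + r)            # log((r-1)! / (m_ij+r-1)!)
        score += sum(math.lgamma(c + 1) for c in m_ij_k.values())  # sum over k of log(m_ijk!)
    return score

def mle_cpt(data, child, parents):
    """Formula (15): theta_ijk = m_ijk / m_ij for every observed parent configuration."""
    return {j: {k: c / sum(cnt.values()) for k, c in cnt.items()}
            for j, cnt in counts(data, child, parents).items()}

# Hypothetical data set with one parent (valve) and one child (fault)
data = [
    {"valve": 0, "fault": 0},
    {"valve": 0, "fault": 0},
    {"valve": 0, "fault": 1},
    {"valve": 1, "fault": 1},
]
score = k2_local_score(data, "fault", ["valve"], r=2)
cpt = mle_cpt(data, "fault", ["valve"])
```

Parent configurations that never occur in the data contribute zero to the score, so skipping them does not change the result.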

Parameter learning is the process of quantifying the relationships between random variables. Given the Bayesian network structure G and the sample data set D, it determines the conditional probability distribution at each node of the model. Maximum likelihood estimation and Bayesian estimation are the main methods.

Parameter learning in this paper adopts maximum likelihood estimation. According to the relationship between the parameters and the data set, the likelihood can be expressed as follows:

$$L(\theta \mid D,G)=P(D\mid \theta ,G) \tag{16}$$

Maximum likelihood estimation selects the parameter that maximizes the likelihood function value, namely:

$$\theta^{*}=\arg \max_{\theta }L(\theta \mid D,G) \tag{17}$$

Under the assumption that the m samples d_i of the data set are independent and identically distributed, and given the structural characteristics of the Bayesian network, the log-likelihood is obtained:

$$L(\theta \mid D,G)=\log \prod_{i=1}^{m}P(d_{i}\mid \theta ) \tag{18}$$
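The link between Formula (18) and the estimate in Formula (15) can be sketched as follows, assuming complete data and writing l for the sample index to avoid a clash with the node index i. Because the joint probability of a Bayesian network factorizes over its nodes, the log-likelihood groups into the counts m_ijk used in Formula (14):

$$\log \prod_{l=1}^{m}P(d_{l}\mid \theta )=\sum_{i=1}^{n}\sum_{j=1}^{q_{i}}\sum_{k=1}^{r_{i}}m_{ijk}\log \theta_{ijk}$$

Maximizing this sum subject to the normalization constraint on each conditional distribution, with a Lagrange multiplier λ_ij per constraint, gives

$$\frac{\partial }{\partial \theta_{ijk}}\left[\sum_{k=1}^{r_{i}}m_{ijk}\log \theta_{ijk}-\lambda_{ij}\left(\sum_{k=1}^{r_{i}}\theta_{ijk}-1\right)\right]=\frac{m_{ijk}}{\theta_{ijk}}-\lambda_{ij}=0$$

so that θ_ijk = m_ijk / λ_ij. Summing over k and using the constraint yields λ_ij = m_ij, and therefore

$$\theta_{ijk}^{*}=\frac{m_{ijk}}{m_{ij}}$$

which agrees with Formula (15).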

## 3.4 Simulation results and evaluation

The characteristic curves of the electric and hydraulic valve openings are normalized and used as input to the DA-ELM model. The activation function and the number of hidden nodes, which affect the performance of DA-ELM, are analyzed experimentally. See Fig. 1.

The XGBoost fault diagnosis model is trained on the training set data. After training, the test set is fed into the "fault model application" module for model testing. The accuracy rate (precision), recall rate, F1 measure and confusion matrix are used for quantitative evaluation in the "model test evaluation" module. The evaluation confusion matrix of the XGBoost fault diagnosis model is shown in Table 1.

**Table 1** Evaluation confusion matrix of XGBoost fault diagnosis model

| Forecast result (rows) / True situation (columns) | Label | Normal (0) | Slight (1) | Moderate (2) | Serious (3) | Mean value |
|---|---|---|---|---|---|---|
| Normal | 0 | 75 | 5 | 4 | 3 | |
| Slight | 1 | 6 | 66 | 4 | 1 | |
| Moderate | 2 | 7 | 7 | 89 | 7 | |
| Serious | 3 | 3 | 5 | 6 | 89 | |
| Accuracy rate (%) | Precision | 87.61 | 87.47 | 82.36 | 87.75 | 86.29 |
| Recall rate (%) | Recall | 85.67 | 84.55 | 86.78 | 86.79 | 85.95 |
| F1 measure (%) | F1-score | 86.63 | 85.99 | 83.98 | 87.26 | 85.97 |
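For reference, the per-class definitions of precision, recall and F1 can be sketched from a confusion matrix as shown below. The counts are copied from Table 1, and rows are taken as forecast classes and columns as true classes, which is an assumption about the table's orientation. The percentages this sketch yields depend on that assumption and are not guaranteed to match the rates reported in the table, which may have been computed from different runs or data.

```python
# Counts from Table 1; rows = forecast class, columns = true class (assumed orientation)
confusion = [
    [75, 5, 4, 3],
    [6, 66, 4, 1],
    [7, 7, 89, 7],
    [3, 5, 6, 89],
]

def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for every class of a square confusion matrix."""
    n = len(cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[c])                    # all samples forecast as class c
        actual = sum(cm[r][c] for r in range(n))  # all samples truly in class c
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

metrics = per_class_metrics(confusion)
# Unweighted (macro) mean of each metric over the four classes
macro = [sum(m[i] for m in metrics) / len(metrics) for i in range(3)]
```

The macro mean weights every fault class equally, regardless of how many test samples it contains.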

Based on the original software, changes in motor fault diagnosis accuracy are observed by adjusting the size of the convolution kernel and the number of channels in the convolutional layer. The relationship between motor fault diagnosis accuracy and convolution kernel size for different numbers of channels is shown in Table 2:

**Table 2** Relationship between motor fault accuracy and convolution kernel size in different channels (%)

According to the results in Table 2, when the number of channels in the convolutional layer is fixed, the error rate becomes higher as the convolution kernel size goes from its minimum to its maximum. When the convolution kernel dimension is very small, covering the whole data input requires significantly more convolution operations, which lengthens the computation and degrades network performance. When the convolution kernel dimension exceeds the normal range, the features of the input data cannot be identified accurately, so the extracted features may be unclear and repetitive, resulting in overfitting.