2.1 Support vector machine
The support vector machine (SVM) is a generalized linear classifier that performs binary classification under the supervised learning paradigm; its decision boundary is the maximum-margin hyperplane fitted to the training samples. The one-class support vector machine (OC-SVM) is a single-class variant that is mainly used for outlier detection (Steinwart and Christmann, 2008). Unlike the traditional SVM, it is an unsupervised method and does not require manually labeled data. The algorithm fits a minimal hypersphere to the observed high-dimensional samples: the hypersphere encloses the bulk of the background data, while data falling outside it are treated as abnormal, thereby identifying outliers. In practical applications, the SVM encounters the problem of imbalance between positive and negative data, i.e., normal data are plentiful while abnormal data are scarce (Osuna et al., 1997). For example, in the track circuit fault diagnosis studied in this paper, high-quality fault label data are difficult to obtain, so the traditional SVM algorithm is no longer applicable. The OC-SVM, however, effectively handles the imbalance between abnormal and normal data and can still detect signal data of different fault categories.
Let the sample data be \(\left\{{\chi }_{1},{\chi }_{2},\dots ,{\chi }_{l}\right\}\in {X}^{n}\), where \(l\) is the number of samples and \(\varphi \left(\chi \right)\) is the function mapping samples to the feature space. Let \(\omega\) and \(\rho\) denote the normal vector and offset of the separating hyperplane in the feature space, so the separating hyperplane is \({\omega }^{\mathrm{T}}\varphi \left(\chi \right)-\rho =0\). The objective is to maximize the distance between the separating hyperplane and the origin, so the OC-SVM solves the following optimization problem:
$$\underset{\omega ,\rho ,\xi }{\text{min}}\ \frac{1}{2}{\Vert \omega \Vert }^{2}+\frac{1}{\nu l}\sum _{i=1}^{l}{\xi }_{i}-\rho$$
$$\mathrm{s.t.}\ {\omega }^{\mathrm{T}}\varphi \left({\chi }_{i}\right)\ge \rho -{\xi }_{i}$$
$$\begin{array}{c}{\xi }_{i}\ge 0,\ i=1,2,\dots ,l\#\left(1\right)\end{array}$$
In this expression, \({\xi }_{i}\) is a slack variable that allows outliers to exist, and \(\nu \in (0,1]\) is a parameter that sets an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors.
By using Lagrange multiplier method, the dual problem of the above optimization problem can be obtained, namely:
$$\underset{\alpha }{\text{min}}\ \frac{1}{2}\sum _{i=1}^{l}\sum _{j=1}^{l}{\alpha }_{i}{\alpha }_{j}\kappa \left({\chi }_{i},{\chi }_{j}\right)$$
$$\mathrm{s.t.}\ \sum _{i=1}^{l}{\alpha }_{i}=1,\ 0\le {\alpha }_{i}\le \frac{1}{\nu l}$$
$$\begin{array}{c}i=1,2,\dots ,l\#\left(2\right)\end{array}$$
where \({\alpha }_{i}\) is the Lagrange multiplier corresponding to the sample \({\chi }_{i}\), and the kernel function \(\kappa \left({\chi }_{i},{\chi }_{j}\right)=⟨\varphi \left({\chi }_{i}\right),\varphi \left({\chi }_{j}\right)⟩\) replaces the inner product in the feature space.
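As an illustration, the widely used Gaussian (RBF) kernel evaluates this inner product without ever computing \(\varphi\) explicitly. The sketch below assumes an RBF kernel with a hypothetical bandwidth parameter \(\gamma\); the section does not fix a particular kernel, so this is only one common choice:

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=0.5):
    """Gaussian (RBF) kernel: kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).

    This equals the inner product <phi(x_i), phi(x_j)> in an implicit
    feature space, so phi never has to be evaluated explicitly.
    """
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# kappa(x, x) = 1 for any x, and the value decays with distance.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # 1.0
```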
After solving optimization problem (2), the samples \({\chi }_{i}\) whose Lagrange multipliers satisfy \({\alpha }_{i}>0\) become the support vectors. The normal vector \(\omega =\sum _{i=1}^{l}{\alpha }_{i}\varphi \left({\chi }_{i}\right)\) and the hyperplane offset can be determined from these support vectors, with \({\chi }_{SV}\) denoting a support vector. The classification decision function is then obtained:
$$\begin{array}{c}f\left(\chi \right)=\mathrm{sgn}\left[{\omega }^{\mathrm{T}}\varphi \left(\chi \right)-\rho \right]\\ =\mathrm{sgn}\left[\sum _{i=1}^{l}{\alpha }_{i}\kappa \left({\chi }_{i},\chi \right)-\rho \right]\#\left(3\right)\end{array}$$
The track circuit test data are substituted into Eq. (3): when the result is +1, the data point is considered normal; when the result is −1, it is an outlier (Chapelle et al., 2002; Awad and Khanna, 2015; Maldonado et al., 2021).
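This train-then-classify workflow can be sketched with scikit-learn's `OneClassSVM`. The synthetic Gaussian data below stand in for track circuit signals, which are not reproduced here, and the chosen `nu` value is illustrative only:

```python
# Hedged sketch of OC-SVM outlier detection; synthetic data stand in
# for the track circuit signals used in the paper.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "background" data
outliers = rng.uniform(low=5.0, high=6.0, size=(5, 2))   # far from the bulk

# nu plays the role of the parameter nu in Eq. (1): an upper bound on the
# fraction of training outliers and a lower bound on the fraction of
# support vectors.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)

# predict() returns +1 for normal points and -1 for outliers, as in Eq. (3).
print(clf.predict(outliers))
```

Only the unlabeled normal data are needed for training, which matches the unsupervised setting described above.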
2.2 Deep learning
Deep learning is a feature-learning method with multi-level representations, which transforms raw data into higher-level, more abstract representations through a stack of simple nonlinear modules. Deep learning developed from deep neural networks (DNNs), whose number of hidden layers (the depth of the network) is large. Increasing the depth reduces the number of features each layer must fit and allows complex functions to be represented with fewer parameters, which enables the extraction of high-level feature information. Deep neural networks have therefore been widely applied (Mukherjee et al., 2021; Attique Khan et al., 2021; Zhang et al., 2021).
By learning a deep nonlinear network structure, a DNN approximates complex functions and obtains a distributed representation of the input data. As shown in Fig. 1, the neurons are connected in the form of an acyclic graph, and each layer's output serves as the representation fed into the next layer. Input values propagate forward from the input-layer neurons through the weighted connections layer by layer, pass through the hidden layers, and finally reach the output layer to produce the output. The output layer evaluates a loss function that measures the difference between the network's actual output and the expected output. The loss is then propagated backward, layer by layer, from the output end: the gradient of the loss function with respect to the intermediate variables is computed, the gradients of all parameters are obtained by the chain rule, and the network adjusts its parameters according to these gradients until the loss function reaches a minimum (Sharma and Singh, 2017).
Although the DNN model structure looks complex, examining a local unit shows that it is simply a linear combination \(z=\sum {\omega }_{i}{\chi }_{i}+b\) followed by an activation function \(\sigma \left(z\right)\). A DNN involves two processes, forward propagation and backward propagation: the forward pass uses the output of each layer to compute the output of the next layer, while the backward pass propagates the error signal from the output back toward the input to update the parameters.
The forward propagation algorithm of a DNN applies a series of linear operations, using the weight matrices w and bias vectors b, and activation operations to the input vector x, computing layer by layer from the input layer to obtain the output. Commonly used activation functions include the sigmoid, softmax, tanh, and ReLU functions (Ertuğrul, 2018). The sigmoid function is common in binary classification, but when a neuron saturates (output near 0 or 1) its gradient is almost 0 and convergence is slow. The softmax function is common in multi-class problems: it maps the inputs to probability values, and the node with the largest value is taken as the predicted class. The tanh function converges faster than sigmoid, but the neurons still saturate, and the gradient vanishes, when the input is a large positive or large negative value. The ReLU function is the most commonly used: it is fast to compute, and its convergence is roughly six times faster than that of the tanh and sigmoid functions. Given the data characteristics of this paper, the activation functions are selected as follows: the hidden layers use the ReLU activation function, and the output layer uses the softmax activation function.
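The two activation functions selected above can be sketched in a few lines; the numerically stable max-subtraction trick in the softmax is a standard implementation detail, not something specified in the text:

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied elementwise (used in the hidden layers)."""
    return np.maximum(0.0, z)

def softmax(z):
    """Softmax: maps a vector of scores to a probability distribution
    (used in the output layer for multi-class prediction)."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(relu(z))            # negative entries are clipped to 0
p = softmax(z)
print(p, p.sum())         # probabilities that sum to 1
print(int(np.argmax(p)))  # predicted class: the node with the largest value
```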
The back propagation algorithm of a DNN iteratively minimizes the DNN's loss function (Hecht-Nielsen, 1992). Before back propagation, a loss function must be chosen to measure the discrepancy between the network output and the true data. In this paper we choose the cross-entropy loss, i.e., for each sample we minimize the following:
$$\begin{array}{c}L\left(\widehat{y},y\right)=-{\sum }_{j=1}^{C}{y}_{j}\text{log}{\widehat{y}}_{j}\#\left(4\right)\end{array}$$
Among them, \({y}_{j}\) is the target value, \({\widehat{y}}_{j}\) is the predicted value, and \(C\) is the number of classes.
The expression of total loss function is:
$$\begin{array}{c}J=\frac{1}{m}{\sum }_{i=1}^{m}L\left({\widehat{y}}^{(i)},{y}^{(i)}\right)\#\left(5\right)\end{array}$$
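Eqs. (4) and (5) can be computed directly; the one-hot target and prediction below are illustrative values, not data from the paper:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Eq. (4): L(y_hat, y) = -sum_j y_j * log(y_hat_j) for one sample."""
    return -np.sum(y * np.log(y_hat))

def total_loss(Y_hat, Y):
    """Eq. (5): the average of the per-sample losses over m samples."""
    return np.mean([cross_entropy(y_hat, y) for y_hat, y in zip(Y_hat, Y)])

# One-hot target (class 1) and a softmax-style prediction.
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.1, 0.8, 0.1])
print(cross_entropy(y_hat, y))  # -log(0.8), about 0.223
```

Because the target is one-hot, only the log-probability assigned to the true class contributes to the loss.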
In this paper, the Adam optimizer is used for the gradient-based optimization of w and b. The second-moment (squared-gradient) part of its update is computed as follows (Misra, 2019):
$$\begin{array}{c}{s}_{dw}=\beta {s}_{dw}+\left(1-\beta \right){\left(dw\right)}^{2}\#\left(6\right)\end{array}$$
$$\begin{array}{c}{s}_{db}=\beta {s}_{db}+\left(1-\beta \right){\left(db\right)}^{2}\#\left(7\right)\end{array}$$
$$\begin{array}{c}w=w-\alpha \frac{dw}{\sqrt{{s}_{dw}+\epsilon }}\#\left(8\right)\end{array}$$
$$\begin{array}{c}b=b-\alpha \frac{db}{\sqrt{{s}_{db}+\epsilon }}\#\left(9\right)\end{array}$$
Among them, \(\beta\) is the exponential decay rate of the moving average, \(\alpha\) is the learning rate, and \(\epsilon\) is a small constant added for numerical stability.
The updated w and b are then fed back into the forward propagation process, and this loop continues until the maximum number of iterations is reached or the error falls below the required minimum.
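The update in Eqs. (6) and (8) can be sketched on a scalar toy problem; the step sizes and the objective \(f(w)=w^{2}\) below are illustrative choices, and the bias b would be updated identically via Eqs. (7) and (9):

```python
import numpy as np

def squared_grad_step(w, dw, s_dw, alpha=0.05, beta=0.999, eps=1e-8):
    """One parameter update following Eqs. (6) and (8): accumulate an
    exponentially weighted average of the squared gradient, then scale
    the step by its square root."""
    s_dw = beta * s_dw + (1.0 - beta) * dw ** 2     # Eq. (6)
    w = w - alpha * dw / np.sqrt(s_dw + eps)        # Eq. (8)
    return w, s_dw

# Toy usage: minimize f(w) = w^2, whose gradient is dw = 2w.
w, s_dw = 5.0, 0.0
for _ in range(2000):
    w, s_dw = squared_grad_step(w, 2.0 * w, s_dw)
print(w)  # approaches the minimum at w = 0
```

Dividing by the running average of squared gradients gives each parameter its own effective step size, which is why this family of optimizers converges with little manual tuning of \(\alpha\).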