3.2. EMD-based decomposition signal processing
The purpose of the EMD algorithm is to decompose the original signal into a series of IMFs, from which the time-frequency relationship of the signal can then be obtained by the Hilbert transform. Taking a terahertz time-domain spectral signal as an example, the basic calculation procedure of the EMD algorithm is as follows.
Step 1: All the maxima and minima points of the terahertz time-domain spectral signal \(x\left(t\right)\) are identified. The upper envelope \(u\left(t\right)\) is formed by connecting all the maxima points with a cubic spline function, and the lower envelope \(v\left(t\right)\) is formed by connecting all the minima points with a cubic spline function.
Step 2: The mean value of the upper and lower envelopes is calculated by the formula:
$$m\left(t\right) = \frac{u\left(t\right)+v\left(t\right)}{2}$$
1
The candidate component of the EMD algorithm is defined as:
$$\text{h}\left(\text{t}\right)=\text{x}\left(\text{t}\right)- \text{m}\left(\text{t}\right)$$
2
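Steps 1 and 2 can be sketched as follows. This is an illustrative implementation only (the function name `envelope_mean` is ours, not from the paper), using SciPy's `argrelextrema` to locate extrema and `CubicSpline` for the envelopes; endpoints are appended so the splines span the full record.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def envelope_mean(t, x):
    """Return m(t) = (u(t) + v(t)) / 2 for a signal x sampled at times t."""
    max_idx = argrelextrema(x, np.greater)[0]  # local maxima
    min_idx = argrelextrema(x, np.less)[0]     # local minima
    # Include the endpoints so the splines cover the whole signal.
    max_idx = np.concatenate(([0], max_idx, [len(x) - 1]))
    min_idx = np.concatenate(([0], min_idx, [len(x) - 1]))
    u = CubicSpline(t[max_idx], x[max_idx])(t)  # upper envelope u(t)
    v = CubicSpline(t[min_idx], x[min_idx])(t)  # lower envelope v(t)
    return (u + v) / 2.0

t = np.linspace(0, 1, 500)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
m = envelope_mean(t, x)
h = x - m  # candidate component h(t) = x(t) - m(t), Eq. (2)
```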
Step 3: Determine whether the component \(h\left(t\right)\) satisfies the definition of an IMF. If it does, \(h\left(t\right)\) is the first IMF component and is denoted \({c}_{1}\left(t\right)\). If it does not, \(h\left(t\right)\) is taken as the new input signal and the two steps above are repeated until the first IMF component is obtained, again denoted \({c}_{1}\left(t\right)\).
Step 4: The first IMF component is subtracted from the original signal to obtain the residual signal, calculated as follows.
$${r}_{1}\left(t\right) = x\left(t\right) - {c}_{1}\left(t\right)$$
3
The three steps above are repeated with \({r}_{1}\left(t\right)\) as the new signal to be analyzed, yielding the second IMF component \({c}_{2}\left(t\right)\). Removing the second IMF component from \({r}_{1}\left(t\right)\) gives the new residual term.
$${r}_{2}\left(\text{t}\right) ={r}_{1}\left(\text{t}\right) - {\text{c}}_{2}\left(\text{t}\right)$$
4
The above steps are repeated until the stopping criterion of the EMD algorithm is satisfied. The stopping criterion uses the Cauchy convergence test, which requires the normalized squared difference between two successive sifting operations to be sufficiently small, as defined by:
$${SD}_{k}= \frac{\sum _{t=0}^{T}{\left|{h}_{k-1}\left(t\right) - {h}_{k}\left(t\right)\right|}^{2}}{\sum _{t=0}^{T}{h}_{k-1}^{2}\left(t\right)}$$
5
In the formula, T denotes the signal length; the sifting ends when \({SD}_{k}\) falls below the set threshold.
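The whole decomposition (Steps 1-4 plus the \({SD}_{k}\) stopping test of Eq. 5) can be sketched as below. This is a minimal illustrative implementation, not the authors' code; in practice a dedicated library such as PyEMD would typically be used. The extra caps `max_imfs` and `max_sifts` are our safety assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def _mean_envelope(x):
    """Mean of the cubic-spline upper and lower envelopes (Steps 1-2)."""
    mx = argrelextrema(x, np.greater)[0]
    mn = argrelextrema(x, np.less)[0]
    if len(mx) < 2 or len(mn) < 2:
        return None  # too few extrema: signal is (close to) monotonic
    idx = np.arange(len(x))
    mx = np.concatenate(([0], mx, [len(x) - 1]))
    mn = np.concatenate(([0], mn, [len(x) - 1]))
    u = CubicSpline(mx, x[mx])(idx)  # upper envelope u(t)
    v = CubicSpline(mn, x[mn])(idx)  # lower envelope v(t)
    return (u + v) / 2.0

def emd(x, sd_thresh=0.2, max_imfs=8, max_sifts=100):
    imfs, r = [], x.astype(float).copy()
    for _ in range(max_imfs):
        h = r.copy()
        for _ in range(max_sifts):
            m = _mean_envelope(h)
            if m is None:
                return imfs, r  # residual too smooth to sift further
            h_new = h - m
            sd = np.sum((h - h_new) ** 2) / np.sum(h ** 2)  # Eq. (5)
            h = h_new
            if sd < sd_thresh:
                break
        imfs.append(h)  # IMF component c_k(t)
        r = r - h       # Eqs. (3)-(4): residual after removing c_k
    return imfs, r

t = np.linspace(0, 1, 1024)
x = np.sin(2 * np.pi * 24 * t) + np.sin(2 * np.pi * 3 * t)
imfs, residual = emd(x)
```

By construction the IMFs and the final residual sum back to the original signal, which is a convenient sanity check for any EMD implementation.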
Root mean square error (RMSE), also known as root-mean-square deviation, is a commonly used measure of the difference between predicted and observed values. Its formula is:
$$RMSE=\sqrt{\frac{\left({\sum }_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}\right)}{(n-1)}}$$
6
The mean absolute error (MAE) is the average of the absolute differences between the predicted and true values. Taking absolute values prevents positive and negative errors from canceling each other out, so MAE accurately reflects the magnitude of the actual prediction error. Its formula is:
$$MAE=\frac{{\sum }_{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|}{n}$$
7
The coefficient of determination (R2) is a statistical measure of how close the regression predictions are to the true data points. Its value ranges from 0 to 1 and represents how strongly the predicted values track the actual values of the target variable. An R2 of 0 means the model does not predict the target variable at all; conversely, an R2 of 1 means the model predicts the target variable exactly. A value between 0 and 1 indicates the fraction of the variation in the target variable that is explained by the input features. The formula is:
$${R}^{2} =\frac{{\sum }_{i=1}^{n}{({\widehat{y}}_{i}-\overline{y})}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-\overline{y})}^{2}}$$
8
In the equations for RMSE, MAE, and R2, \(n\) is the number of samples, \({\widehat{y}}_{i}\) is the predicted expression value of the target gene for the \(i\)th sample, \({y}_{i}\) is the true expression value of the target gene in the \(i\)th sample, and \(\overline{y}\) is the sample mean.
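The three metrics can be computed directly as written in Eqs. 6-8 (note that this text's RMSE uses \(n-1\) in the denominator, and its R2 is the ratio of explained to total sum of squares):

```python
import numpy as np

def rmse(y, y_hat):
    """Eq. (6): note the (n - 1) denominator used in the text."""
    n = len(y)
    return np.sqrt(np.sum((y - y_hat) ** 2) / (n - 1))

def mae(y, y_hat):
    """Eq. (7): mean of the absolute prediction errors."""
    return np.mean(np.abs(y - y_hat))

def r2(y, y_hat):
    """Eq. (8): explained sum of squares over total sum of squares."""
    y_bar = np.mean(y)
    return np.sum((y_hat - y_bar) ** 2) / np.sum((y - y_bar) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
```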
3.3. Construction of core predictors using LSTM
LSTM is a special recurrent neural network (RNN) proposed to overcome the difficulty standard RNNs have with long-range dependencies [21]. LSTM adds a gating mechanism that selects which new information is stored and controls the rate at which information accumulates, compensating for the limitations of the plain RNN structure. Compared with the classical RNN, which contains only a single tanh recurrent unit, LSTM adds three gate structures and one memory cell. The flow chart is shown in Fig. 1b. Let \(W\) be the gate weight and \(b\) be the bias. A gate can be described as:
$$g\left(x\right)= \sigma \left(Wx + b\right)$$
9
The three gates in the LSTM gate structure are the input gate, the output gate, and the forget gate. The input gate controls how much of the current input \({x}_{t}\) is stored in the cell state \({c}_{t}\); the forget gate determines how much of the previous cell state is retained; and the output gate controls how much of the cell state \({c}_{t}\) is passed to the output value at the current moment. Each gate state update of the LSTM is calculated as:
$${i}_{t}={\sigma } \left({W}_{i}{x}_{t} + {U}_{i}{h}_{t-1} + {b}_{i}\right)$$
10
$${f}_{t}={\sigma } \left({W}_{f}{x}_{t} + {U}_{f}{h}_{t-1} + {b}_{f}\right)$$
11
$${o}_{t}={\sigma } \left({W}_{o}{x}_{t} + {U}_{o}{h}_{t-1} + {b}_{o}\right)$$
12
Memory unit update:
$${\overline{c} }_{t}=\text{t}\text{a}\text{n}\text{h} \left({W}_{c}{x}_{t} + {U}_{c}{h}_{t-1} + {b}_{c}\right)$$
13
Status update:
$${c}_{t}= {f}_{t}\bullet {c}_{t-1} + {i}_{t}\bullet {\overline{c} }_{t}$$
14
$${h}_{t}={o}_{t} \bullet \text{t}\text{a}\text{n}\text{h}\left({c}_{t}\right)$$
15
In the formulas, \({\sigma }\) is the sigmoid activation function. \({W}_{i}\), \({W}_{f}\), \({W}_{o}\) and \({b}_{i}\), \({b}_{f}\), \({b}_{o}\) are the weight matrices and biases of the input gate, forget gate, and output gate, respectively. \({W}_{c}\), \({b}_{c}\) are the weight matrix and bias of the memory-cell candidate. \({U}_{i}\), \({U}_{f}\), \({U}_{o}\), \({U}_{c}\) are the recurrent (hidden-state) weight matrices, and \({h}_{t-1}\) is the hidden state at the previous time step.
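A single LSTM cell step following Eqs. 10-15 can be sketched in a few lines of numpy. This is a minimal illustration with randomly initialized parameters (the dictionary layout and dimensions are our assumptions); a real model would use a deep-learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds W_*, U_*, b_* for gates i, f, o and candidate c."""
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate, Eq. (10)
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate, Eq. (11)
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate, Eq. (12)
    c_bar = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate, Eq. (13)
    c = f * c_prev + i * c_bar  # cell state update, Eq. (14)
    h = o * np.tanh(c)          # hidden state output, Eq. (15)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8  # illustrative dimensions
p = {f"{m}_{g}": rng.standard_normal((n_hid, n_in if m == "W" else n_hid)) * 0.1
     for m in ("W", "U") for g in ("i", "f", "o", "c")}
p.update({f"b_{g}": np.zeros(n_hid) for g in ("i", "f", "o", "c")})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```

Since \(h_t = o_t \cdot \tanh(c_t)\) with \(o_t \in (0,1)\), every component of the output is bounded in magnitude by 1, which is easy to verify on the sketch above.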