3.1 CNN
Convolutional Neural Networks (CNNs) include 1D, 2D and 3D variants: 1D CNNs are mainly used for sequence processing (Barzegar et al. 2020; Livieris et al. 2020), 2D CNNs for image and text recognition (El Aswad et al. 2021; Dutta et al. 2021), and 3D CNNs for recognition of medical images and video data (Bellos et al. 2021; Wang et al. 2021). The typical CNN structure comprises a convolutional layer that extracts local features from the data (different convolution kernels act as different feature extractors), batch normalization that allows the data to propagate down the network more effectively, a pooling layer that performs feature selection, and finally a fully connected layer that produces the output. The convolution process of a 1D CNN is shown in Fig. 3.
As Figure 3 shows, the input time series is an n×3 matrix; the red box represents one filter with a 3×3 convolution window, and the 128 filters are convolved from top to bottom with a stride of 1. The features extracted by a single filter form an (n−3+1)×1 vector C1, …, Cn−2, and after feature selection by the pooling process the result is a 128×1 vector. The output feature dimension is determined by the dimension of the input data, the size of the filter and the convolution stride.
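To make the shape arithmetic concrete, the following is a minimal NumPy sketch of the Fig. 3 convolution; the variable names and the random example values are ours, not the paper's.

```python
import numpy as np

# Illustrative values: an n x 3 input series, 128 filters of size 3 x 3,
# stride 1, as described for Fig. 3.
n, n_filters = 24, 128
x = np.random.randn(n, 3)              # input time series, n x 3
W = np.random.randn(n_filters, 3, 3)   # 128 filters, each 3 x 3
b = np.zeros(n_filters)

# Slide each filter from top to bottom with stride 1: each position
# covers a 3 x 3 patch, so there are n - 3 + 1 output positions.
out = np.empty((n - 3 + 1, n_filters))
for p in range(n_filters):
    for i in range(n - 3 + 1):
        out[i, p] = np.sum(W[p] * x[i:i + 3, :]) + b[p]  # C_1 ... C_{n-2}

pooled = out.mean(axis=0)              # average pooling -> one value per filter
print(out.shape, pooled.shape)         # (n - 2, 128) (128,)
```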
3.2 LSTM
Long Short-Term Memory (LSTM) neural networks are a variant of recurrent neural networks that effectively alleviate the gradient explosion and gradient vanishing problems of simple recurrent networks (Bidoki et al. 2019; Rajesh et al. 2022). The memory cell of an LSTM stores information for longer than ordinary short-term memory, but for a shorter time than the network parameters (the long-term memory), hence the name long short-term memory network (Strake et al. 2020; Abduljabbar et al. 2021). In Fig. 4, \({c_t}\) is the internal cell state introduced by the LSTM network for linear cyclic information transfer, which also outputs information nonlinearly to the external state \({h_t}\) of the hidden layer; \({\widetilde {c}_t}\) is the candidate cell state obtained by a nonlinear function.
The forget gate \({f_t}\) controls how much information of the internal cell state \({c_{t - 1}}\) at the previous moment needs to be forgotten: \({f_t}=\sigma ({W_f}{x_t}+{U_f}{h_{t - 1}}+{b_f})\); the input gate \({i_t}\) controls how much information of the candidate cell state \(\widetilde {{{c_t}}}=\tanh ({W_c}{x_t}+{U_c}{h_{t - 1}}+{b_c})\) at the current moment needs to be saved: \({i_t}=\sigma ({W_i}{x_t}+{U_i}{h_{t - 1}}+{b_i})\); the output gate \({o_t}\) controls how much information of the internal cell state \({c_t}={f_t} \odot {c_{t - 1}}+{i_t} \odot \widetilde {{{c_t}}}\) at the current moment needs to be output to the external state \({h_t}={o_t} \odot \tanh ({c_t})\): \({o_t}=\sigma ({W_o}{x_t}+{U_o}{h_{t - 1}}+{b_o})\). Here \({W_*}\) and \({U_*}\) are the weight matrices, \({b_*}\) is the bias weight, and \(\sigma\) is the sigmoid function.
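The gate equations above translate directly into one recurrence step. The following is a minimal NumPy sketch of a single LSTM step; the dict-based weight layout and dimensions are our illustrative choices, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step implementing the gate equations above.
    W, U, b are dicts keyed by 'f', 'i', 'o', 'c'; W[k] maps input to
    hidden, U[k] maps hidden to hidden (shapes assumed for illustration)."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde    # internal cell state c_t
    h_t = o_t * np.tanh(c_t)              # external hidden state h_t
    return h_t, c_t

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d_in, d_h = 3, 8
W = {k: rng.standard_normal((d_h, d_in)) for k in 'fioc'}
U = {k: rng.standard_normal((d_h, d_h)) for k in 'fioc'}
b = {k: np.zeros(d_h) for k in 'fioc'}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```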
3.3 TSD-CNN-LSTM
CNN and LSTM models mainly use the historical information within a time window to recover how the time series changes over time, and thereby forecast the future response variable. The width \(\tau\) of the input time window is therefore an important parameter for network training. If the time window is too large, it fully extracts the relevant information in the time series but also introduces the influence of more random noise; if it is too small, it cannot fully reveal the correlation between the response variable and the historical features. The window width reflects the influence of each subsequence feature over the past \(\tau\) months on the feature of month \(\tau +1\), and the hybrid model can fit the nonlinear relationship between them well. Time series decomposition (TSD) extracts the component features of the original series; a CNN then re-extracts and filters these component features, which are passed to the LSTM network and output by the fully connected layer. Figure 5 shows the structure of TSD-CNN-LSTM, whose calculation process is as follows (a code sketch of the whole pipeline is given after these steps):
(1) The original time series is decomposed into a feature vector group consisting of residual, seasonal and trend components: \(\{ {R_N},{S_N},{T_N}\}\).
(2) The input feature mapping group of the convolutional layer is a two-dimensional tensor \(F \in {{\mathbb{R}}^{\tau \times 3}}\), where \(\tau\) is the width of the CNN input feature mapping window. To calculate the output feature map \(y_{i}^{p}\) of the \(i\)th sample, the input feature maps \(F_{i}^{1}\), \(F_{i}^{2}\) and \(F_{i}^{3}\) are convolutionally filtered with convolution kernels \({W^{p,d}}\):
$$F_{i}^{T}=\left[ {\begin{array}{*{20}{c}} {F_{i}^{1}} \\ {F_{i}^{2}} \\ {F_{i}^{3}} \end{array}} \right]=\left[ {\begin{array}{*{20}{c}} {{T_{i - \tau }},{T_{i - \tau +1}}, \ldots ,{T_{i - 1}}} \\ {{S_{i - \tau }},{S_{i - \tau +1}}, \ldots ,{S_{i - 1}}} \\ {{R_{i - \tau }},{R_{i - \tau +1}}, \ldots ,{R_{i - 1}}} \end{array}} \right]$$
(5)
$$\begin{array}{*{20}{c}} {{Z_i}^{p}={W^p} \otimes {F_i}+{b^p}=\sum\limits_{{d=1}}^{3} {{W^{p,d}} \otimes {F_i}^{d}+{b^p}} } \\ {{y_i}^{p}=f({Z_i}^{p})} \end{array}$$
(6)
where \({T_{i - \tau }}\), \({S_{i - \tau }}\) and \({R_{i - \tau }}\) represent the trend, seasonal and residual components of the past \(\tau\) months at the time corresponding to the \(i\)th sample, respectively; \(\otimes\) denotes the cross-correlation operation (convolution without flipping); \(p=1, \ldots ,P\) indexes the convolution kernels; \({b^p}\) is the bias weight; and \(f(\cdot )\) is the ReLU function, which avoids the vanishing-gradient problem and computes and converges faster than the tanh and sigmoid functions. It is defined as Eq. (7):
$$f(x)=\max (0,x)$$
(7)
(3) The feature map output by the convolution operation is resampled by the pooling layer. This study uses average pooling for feature selection.
(4) The pooled feature passes through a dropout layer, enters the LSTM network to fit the temporal relationship, and the predicted value is output by the fully connected layer. Dropout is an effective method to reduce model overfitting: during training, some neurons are dropped from the network with a certain probability and are reconnected at inference time. The loss function for updating the network parameters is the mean square error:
$$MSE=\frac{1}{n}\sum\limits_{{i=1}}^{n} {{{({x_i} - {x_{pi}})}^2}}$$
(8)
where \({x_i}\) is the measured value corresponding to the \(i\)th training sample, \({x_{pi}}\) is the corresponding predicted value of the model, and \(n\) is the number of training samples.
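The sketch below strings steps (1)-(4) together, assuming a monthly series (period 12) and using statsmodels for the additive decomposition and Keras for the network. The layer sizes (64 filters, 32 LSTM units, dropout rate 0.2), the synthetic series and the training settings are illustrative assumptions, not the paper's tuned configuration.

```python
import numpy as np
import tensorflow as tf
from statsmodels.tsa.seasonal import seasonal_decompose

# Step (1): decompose a synthetic monthly series into {R_N, S_N, T_N}.
series = np.random.randn(240).cumsum() + 10 * np.sin(np.arange(240) * 2 * np.pi / 12)
dec = seasonal_decompose(series, model='additive', period=12,
                         extrapolate_trend='freq')
components = np.stack([dec.trend, dec.seasonal, dec.resid], axis=-1)  # (N, 3)

# Build (tau x 3) input windows (Eq. 5) with next-month targets.
tau = 12
X = np.stack([components[i - tau:i] for i in range(tau, len(series))])
y = series[tau:]

# Steps (2)-(4): Conv1D feature extraction with ReLU (Eqs. 6-7),
# average pooling, dropout, LSTM, and a dense output trained with the
# MSE loss of Eq. (8).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(tau, 3)),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.AveragePooling1D(pool_size=2),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, verbose=0)
```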
3.4 Evaluation indicators of models
The root mean square error (RMSE), Nash-Sutcliffe efficiency coefficient (NSE) and mean absolute error (MAE) are used to evaluate the performance of the TSD-CNN-LSTM, TSD-CNN, TSD-LSTM and single LSTM models. They are defined as:
$$RMSE=\sqrt {\frac{1}{n}\sum\limits_{{i=1}}^{n} {{{({x_i} - {x_{pi}})}^2}} }$$
(9)
$$NSE=1 - \frac{{\sum\limits_{{i=1}}^{n} {{{({x_i} - {x_{pi}})}^2}} }}{{\sum\limits_{{i=1}}^{n} {{{({x_i} - \overline {{{x_i}}} )}^2}} }}$$
(10)
$$MAE=\frac{1}{n}\sum\limits_{{i=1}}^{n} {\left| {{x_i} - {x_{pi}}} \right|}$$
(11)
where \(n\) is the number of test samples; \({x_i}\) is the \(i\)th measured value; \({x_{pi}}\) is the \(i\)th predicted value; and \(\overline {{{x_i}}}\) is the mean of all measured values in the test set.
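Eqs. (9)-(11) map directly onto vectorized NumPy operations; the following is a minimal implementation, assuming x and x_pred are equal-length arrays of measured and predicted test values.

```python
import numpy as np

def rmse(x, x_pred):
    """Root mean square error, Eq. (9)."""
    return np.sqrt(np.mean((x - x_pred) ** 2))

def nse(x, x_pred):
    """Nash-Sutcliffe efficiency coefficient, Eq. (10)."""
    return 1.0 - np.sum((x - x_pred) ** 2) / np.sum((x - x.mean()) ** 2)

def mae(x, x_pred):
    """Mean absolute error, Eq. (11)."""
    return np.mean(np.abs(x - x_pred))
```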