2.2.2 Methods
1. Long short-term memory (LSTM) model
The LSTM is a type of recurrent neural network widely used for processing time-series data. Its forget gate and cell-state structure alleviate the long-term dependency problems of recurrent neural networks, such as vanishing and exploding gradients. The structure is shown in Fig. 3 (4).
\({\text{x}}_{t}\) denotes the value of the input sequence at time t. \({c}_{t}\) denotes the memory cell, or cell state, which is the core of the network and controls the transfer of information. \({i}_{t}\) denotes the input gate, which determines how much of the current input \({x}_{t}\) is retained in \({c}_{t}\). \({f}_{t}\) denotes the forget gate, which determines how much of the previous cell state \({c}_{t-1}\) is kept in the current \({c}_{t}\). \({o}_{t}\) denotes the output gate, which determines how much of \({c}_{t}\) is transmitted to the current output \({h}_{t}\). \({h}_{t-1}\) denotes the state of the hidden layer at time t-1. \({\sigma }\) denotes the sigmoid activation function. The synergy of the forget gate and the cell state filters important features and allows them to be transmitted over longer distances.
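As a minimal sketch of these gate computations (written in PyTorch with illustrative names; not the implementation used in this paper), a single LSTM cell step can be expressed as:

```python
import torch
import torch.nn as nn


class LSTMCellSketch(nn.Module):
    """Minimal LSTM cell illustrating the gate equations described above."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One linear map produces all four gate pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        i_t, f_t, g_t, o_t = z.chunk(4, dim=-1)
        i_t = torch.sigmoid(i_t)          # input gate: how much of the candidate enters c_t
        f_t = torch.sigmoid(f_t)          # forget gate: how much of c_{t-1} is kept
        o_t = torch.sigmoid(o_t)          # output gate: how much of c_t is exposed as h_t
        g_t = torch.tanh(g_t)             # candidate cell state
        c_t = f_t * c_prev + i_t * g_t    # cell state update
        h_t = o_t * torch.tanh(c_t)       # hidden state
        return h_t, c_t
```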
2. Autoregressive recurrent network (DeepAR) model
The DeepAR model (Fig. 3 (3)) uses an autoregressive recurrent network architecture consisting of multilayer LSTM components. The inputs to the model are the time-series values \({\text{x}}_{t}\) up to time t and the covariates \({\text{z}}_{t+w}\) at time t+w. The time-series data and covariates are concatenated and fed into the LSTM layer. The output of the LSTM layer is fed to two dense layers: one acts as an affine function to give the mean \({\mu }\), and the other acts as an affine transformation to generate the standard deviation \({\sigma }\). For the standard deviation, the output of the dense layer is passed through a SoftPlus layer to ensure positive values. Finally, the mean and standard deviation are the inputs to the Gaussian likelihood model used to generate the samples. In this paper, we use the mean and standard deviation to parameterize the Gaussian likelihood \({\theta }_{t+w}=(\mu ; \sigma )\). The likelihood \(l\left({z}_{t+w}\right|{\theta }_{t+w})\) is calculated, and its median is taken as the final output \({\widehat{y}}_{t+w}\).
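A minimal sketch of the output head described above (PyTorch, with illustrative layer names; not the authors' implementation) is:

```python
import torch
import torch.nn as nn


class GaussianOutputSketch(nn.Module):
    """Illustrative DeepAR-style output head mapping an LSTM state to (mu, sigma)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mu_layer = nn.Linear(hidden_size, 1)      # affine map to the mean
        self.sigma_layer = nn.Linear(hidden_size, 1)   # affine map to the pre-scale
        self.softplus = nn.Softplus()                  # keeps sigma strictly positive

    def forward(self, lstm_output):
        mu = self.mu_layer(lstm_output).squeeze(-1)
        sigma = self.softplus(self.sigma_layer(lstm_output)).squeeze(-1)
        # Gaussian likelihood parameterized by (mu, sigma); samples and their
        # median are drawn from this distribution.
        return torch.distributions.Normal(mu, sigma)
```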
3. Light gradient boosting machine (LightGBM) model
LightGBM is a model based on the gradient boosting decision tree (GBDT), proposed by Microsoft in 2017. Like other boosting algorithms, it combines weak learners into a strong learner. The computational time of traditional GBDT algorithms is largely consumed by the construction of decision trees, which requires finding the optimal split points. The usual approach is to sort the feature values and then enumerate all possible split points, which is time-consuming and memory-intensive. The LightGBM algorithm instead uses a histogram-based algorithm: it divides the continuous feature values into several intervals (bins) and selects the split points among those intervals. As a result, it outperforms the standard GBDT algorithm in terms of training speed and memory efficiency. Moreover, because each decision tree is a weak classifier, the histogram algorithm has a regularization effect that helps prevent overfitting. For tree growth, the LightGBM algorithm uses a leaf-wise strategy; compared with the traditional level (depth)-wise approach, leaf-wise growth achieves a larger loss reduction for the same number of leaf splits.
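In the LightGBM library, the histogram binning and leaf-wise growth described above correspond to the `max_bin` and `num_leaves` parameters. The following sketch uses synthetic data and illustrative hyperparameters, not the settings used in this study:

```python
import numpy as np
import lightgbm as lgb

# Synthetic example only; hyperparameters are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=1000)

params = {
    "objective": "regression",
    "max_bin": 255,        # histogram algorithm: continuous features bucketed into intervals
    "num_leaves": 31,      # leaf-wise growth is bounded by the leaf count, not the depth
    "learning_rate": 0.05,
}
train_set = lgb.Dataset(X, label=y)
model = lgb.train(params, train_set, num_boost_round=100)
pred = model.predict(X[:5])
```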
4. Temporal convolutional network (TCN) model
The TCN builds on 1D convolution by using causal convolution and dilated convolutions. The causal convolution ensures that the output at each time step depends only on current and past inputs, while the dilated convolution expands the receptive field of the model, allowing the network to learn more temporal features with fewer layers. Residual skip connections are also used to address the degradation problem of deep networks.
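A minimal PyTorch sketch of a causal dilated 1D convolution is shown below (illustrative only; causality is obtained by padding on the left so no future values are seen):

```python
import torch
import torch.nn as nn


class CausalDilatedConv1d(nn.Module):
    """1D convolution made causal by left-padding, with a configurable dilation rate."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # Pad only on the left so the output at time t never sees inputs after t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)


# Receptive field of a single dilated layer: (K - 1) * d + 1
kernel_size, dilation = 3, 4
print((kernel_size - 1) * dilation + 1)          # -> 9
```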
The structure of the TCN is shown in Fig. 3 (2) and consists of two main parts: the causal/dilated convolution and the residual block. The left-hand side of Fig. 3 (2) shows the causal and dilated convolution in the TCN architecture, where the value at time t in each layer depends only on the values at times t, t-1, ... of the previous layer, reflecting the property of causal convolution. In addition, each layer extracts information from the previous layer by skipping over positions, and the dilation rate increases exponentially by a factor of 2 with depth, reflecting the characteristics of dilated convolution. Because dilated convolution is used, each layer is padded (usually with zeros). The receptive field size of a dilated convolution is \((\text{K}-1)\text{d}+1\), so increasing either K or d enlarges the receptive field. In Fig. 3 (2), d takes the values 1, 2, and 4 in turn, and K is 3. The right-hand side of Fig. 3 (2) shows the residual block in the TCN architecture, calculated as Eq. (1):
$$H\left(x\right)=F\left(x\right)+x \quad \left(1\right)$$
The input undergoes two rounds of dilated convolution, weight normalization, activation, and dropout, forming F(x) in the residual function, while the input is also passed through a 1×1 convolution to serve as x in the shortcut connection. The two outputs are summed to give the final output.
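A minimal sketch of such a residual block in PyTorch, assuming illustrative channel sizes and a ReLU activation, might look like:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm


class TCNResidualBlockSketch(nn.Module):
    """Residual block as described above: two rounds of (dilated conv, weight norm,
    activation, dropout) form F(x); a 1x1 convolution on the input forms the shortcut x."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int, dropout: float = 0.2):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.branch = nn.Sequential(
            nn.ConstantPad1d((pad, 0), 0.0),   # left-pad to keep the convolution causal
            weight_norm(nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.ConstantPad1d((pad, 0), 0.0),
            weight_norm(nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.shortcut = nn.Conv1d(in_ch, out_ch, kernel_size=1)   # the x of the shortcut

    def forward(self, x):                                         # x: (batch, in_ch, time)
        return self.branch(x) + self.shortcut(x)                  # H(x) = F(x) + x
```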
5. Deep temporal convolutional network (DeepTCN) model
DeepTCN modifies the residual structure of the TCN and adopts an encoder-decoder mechanism. The encoder consists of residual blocks and is responsible for building stacked dilated causal convolutional networks that capture long-term temporal dependencies. The decoder is a variant of the residual block (referred to as ResNet-v) designed to integrate the output of the stochastic process with historical observations and future covariates. Finally, an output dense layer maps the output of ResNet-v to the final prediction.
The left-hand side of Fig. 3 (5) shows the residual block of the encoder, where the input \({x}_{t}\) first passes through a dilated causal convolutional layer and a batch normalization layer, repeated twice (with a ReLU activation layer in between). A residual operation is then performed with the original input \({x}_{t}\), and the result is finally passed through a ReLU activation layer and fed to the decoder residual block. The batch normalization layer is designed to provide a stable distribution of activation values during training (Ioffe and Szegedy, 2015), which speeds up convergence and shortens the training process of the model. The right-hand side of Fig. 3 (5) shows the decoder module ResNet-v, where \({\text{z}}_{t+w}\) is the future covariate. A dense layer and batch normalization are first applied to project the future covariates, followed by a ReLU activation; the result is then processed by another dense layer and batch normalization. The processed result is combined with the output of the encoder in a residual operation. Finally, after a ReLU activation, the output of the decoder residual block is mapped by the output dense layer to produce the probabilistic prediction, and its median \({\widehat{y}}_{t+w}\) is taken as the final output.
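As an illustrative sketch (not the authors' code) of how the decoder projects the future covariates and combines them with the encoder output in a residual operation:

```python
import torch
import torch.nn as nn


class ResNetVSketch(nn.Module):
    """Illustrative decoder variant: project future covariates through
    dense + batch norm (+ ReLU) + dense + batch norm, then add the encoder
    output as a residual before the final activation."""

    def __init__(self, covariate_size: int, hidden_size: int):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(covariate_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
        )

    def forward(self, z_future, encoder_out):    # shapes: (batch, covariate_size), (batch, hidden_size)
        return torch.relu(self.project(z_future) + encoder_out)
```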
6. Transformer model
The Transformer (Fig. 3 (1)) discards the traditional CNN and RNN structures; the entire network is built from the attention mechanism. More precisely, the Transformer consists solely of self-attention and feed-forward neural network layers. By using the attention mechanism, the model reduces the distance between any two positions in the sequence to a constant, which effectively addresses the long-term dependency problem in time-series tasks. Furthermore, since it is not an RNN-like sequential structure, it offers better parallelism and fits well with existing GPU frameworks.
The encoder consists of a position encoding layer and a stack of self-attention and fully connected feed-forward sublayers. Each sublayer is followed by a residual connection and layer normalization. Specifically, because the model contains no recurrent or convolutional structure, positional encoding vectors computed with sine and cosine functions are added element-wise to the input vectors so that the model can exploit the order of the sequence. The encoded vector is fed into the self-attention sublayer and then the fully connected feed-forward sublayer, and the final output is fed into the decoder module. The decoder consists of a position encoding layer and a stack of self-attention, multi-head attention, and fully connected feed-forward sublayers. Again, each sublayer is followed by a residual connection and layer normalization. In the prediction task, the input to the decoder consists of two parts: the output \({\widehat{y}}_{t+w-1}\) of the decoder at the previous moment and the output of the encoder. After position encoding, \({\widehat{y}}_{t+w-1}\) is input to the self-attention sublayer, then to the multi-head attention sublayer together with the output of the encoder, and finally to the output layer via the fully connected feed-forward sublayer. The output layer is a dense layer with a softmax activation function, which maps the output of the last decoder layer to the target time series. In addition, we use a look-ahead mask and a single-position offset between the decoder input and the target output to ensure that the prediction of each time-series data point depends only on the previous data points.
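The sine/cosine positional encoding and the look-ahead mask can be sketched as follows (illustrative PyTorch code, assuming a (batch, time, d_model) input layout; not the configuration used in this paper):

```python
import math
import torch


def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine positional encodings added to the inputs to convey sequence order."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


def look_ahead_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular boolean mask so the prediction at step t attends only to steps <= t."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)


x = torch.randn(1, 24, 64) + positional_encoding(24, 64)   # add encodings to a (batch, time, d_model) input
mask = look_ahead_mask(24)                                  # supplied as the attention mask in the decoder
```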