## 2.2 Long short-term memory network (LSTM)

A recurrent neural network (RNN) can be viewed as a neural network unrolled through time, with a depth equal to the length of the sequence. For a given time step t, the gradient it generates vanishes after propagating back a few steps along the time axis and therefore cannot influence the distant past. To address this vanishing-gradient problem, the long short-term memory (LSTM) unit was developed, which implements a temporal memory function and mitigates gradient vanishing by means of gates. The LSTM network is a kind of recurrent neural network; the algorithm was first published by Sepp Hochreiter and Jürgen Schmidhuber in Neural Computation, and its internal structure has since been gradually improved. LSTM networks generally perform better than ordinary RNNs in processing and predicting time-series data. At present, they are widely used in robot control, text recognition and prediction, speech recognition, protein homology detection, and other fields. Given this strong performance, this paper investigates whether LSTM can be applied to the prediction of electric energy consumption time series.

The most basic LSTM cell consists of three gates (input, forget, and output) and a memory cell. The gates use a sigmoid activation function, denoted g, while the input and the cell state are usually transformed using tanh. Products between gate vectors and state vectors are element-wise. The LSTM cell can be defined by the following equations.

Gates:

$$i_{t}=g\left(W_{xi}x_{t}+W_{hi}h_{t-1}+b_{i}\right) \tag{1}$$

$$f_{t}=g\left(W_{xf}x_{t}+W_{hf}h_{t-1}+b_{f}\right) \tag{2}$$

$$o_{t}=g\left(W_{xo}x_{t}+W_{ho}h_{t-1}+b_{o}\right) \tag{3}$$

Input transformation:

$$\tilde{c}_{t}=\tanh\left(W_{xc}x_{t}+W_{hc}h_{t-1}+b_{c}\right) \tag{4}$$

State update:

$$c_{t}=f_{t}\odot c_{t-1}+i_{t}\odot\tilde{c}_{t} \tag{5}$$

$$h_{t}=o_{t}\odot\tanh\left(c_{t}\right) \tag{6}$$
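The six equations above can be traced in a few lines of code. The following is a minimal sketch of a single LSTM time step using only the Python standard library; the function names (`sigmoid`, `affine`, `lstm_step`), the `params` layout, and the toy values are illustrative assumptions, not the implementation used in this paper.

```python
import math


def sigmoid(z):
    """Gate activation g in Eqs. (1)-(3)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, x, W_h, h, b):
    """Return W_x x + W_h h + b for weight matrices stored as nested lists."""
    return [
        sum(w * v for w, v in zip(W_x[k], x))
        + sum(w * v for w, v in zip(W_h[k], h))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (1)-(6).

    `params` (a hypothetical layout) maps each of "i", "f", "o", "c"
    to a tuple (W_x, W_h, b).
    """
    pre = {
        name: affine(W_x, x_t, W_h, h_prev, b)
        for name, (W_x, W_h, b) in params.items()
    }
    i_t = [sigmoid(z) for z in pre["i"]]        # Eq. (1), input gate
    f_t = [sigmoid(z) for z in pre["f"]]        # Eq. (2), forget gate
    o_t = [sigmoid(z) for z in pre["o"]]        # Eq. (3), output gate
    c_tilde = [math.tanh(z) for z in pre["c"]]  # Eq. (4), candidate state
    # Eq. (5): element-wise blend of the old state and the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Eq. (6): gated, squashed cell state becomes the hidden state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy example (assumed values): one input feature, one hidden unit,
# all weights and biases set to zero.
zero = ([[0.0]], [[0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "c": zero}
h_t, c_t = lstm_step([0.3], [0.0], [1.0], params)
# Every gate equals sigmoid(0) = 0.5 and the candidate is tanh(0) = 0,
# so the new cell state is half of the previous one.
```

In this toy setting the forget gate alone determines how much of the previous cell state survives, which makes the role of Eq. (5) easy to see.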

Thanks to the gating mechanism, the cell can retain information for a period of time during operation and keep the internal gradient from being disturbed by adverse changes during training. The original LSTM has no forget gate and carries the cell state forward unchanged during updates (this can be seen as a recurrent connection with a constant weight of 1), a mechanism often referred to as the Constant Error Carousel (CEC). It is so named because the error flows back through this connection without being rescaled, which alleviates the severe vanishing and exploding gradient problems in RNN training and makes it possible to learn long-term dependencies. The LSTM cell structure is shown schematically in Fig. 2.
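A short derivation illustrates why a recurrent connection with constant weight 1 preserves the error signal. It rests on a simplifying assumption: the gate activations are treated as constants with respect to the previous cell state, that is, the indirect path through \(h_{t-1}\) is ignored. Differentiating the state update in Eq. (5) then gives

$$\frac{\partial c_{t}}{\partial c_{t-1}}\approx\operatorname{diag}\left(f_{t}\right),\qquad \frac{\partial c_{t}}{\partial c_{t-k}}\approx\prod_{j=0}^{k-1}\operatorname{diag}\left(f_{t-j}\right).$$

Without a forget gate (equivalently, \(f_{t}=1\) at every step), the product is the identity matrix, so the error reaching \(c_{t-k}\) along this path is neither shrunk nor amplified. With a forget gate, every entry of each factor lies in (0, 1), so the cell can still attenuate old information when \(f_{t}\) is small.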

## 2.3 Multi-layer CNN-LSTM combined prediction model

In load prediction, the load time series are complex and not smooth, so it is difficult for a single model to capture all the features of the signal and make accurate predictions. For this reason, we selected the two individual models with the most accurate prediction results from among LSTM, CNN, RNN, and XGBoost, and combined them into a model that predicts more accurately than either individual model. The power load forecasting process proposed in this paper is shown in Fig. 3.

Step 1: Data acquisition and processing. Raw loads are used as input and pre-processed: mean values are used to fill in missing data and to replace abnormal data, and the data are then normalized.

Step 2: Feature construction. To allow effective simulation and validation of the model, the feature datasets and labels are constructed first. The data are then split on the basis of these feature and label sets to obtain the training and test sets. Finally, batches are created from the training and test sets, with the batch size defined according to the dataset type, so that the training and test batches can be constructed.

Step 3: Build the models. The LSTM, CNN, RNN, and XGBoost models are constructed in code, and their parameters and layer hierarchy are defined, which completes the initial construction of the models.

Step 4: Compilation, training, and validation of the models. After the parameters of the LSTM, CNN, RNN, and XGBoost prediction models have been initialized in the steps above, the models are compiled and run, and the two best models are identified by comparing their R^2 values.

Step 5: Model fusion. For the two best models selected above, the parameters are tuned to determine their optimal values and the models are recompiled. The best individual model configurations are found by varying the number of hidden layers and other parameters and comparing the resulting R^2 values. Finally, the optimized individual models are combined and the prediction performance of the combined model is verified.
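Steps 1 and 2 above can be sketched as follows, using only the Python standard library. Several details are assumptions made for illustration because the text does not specify them: missing readings are represented as `None`, normalization is min-max scaling, each sample uses a sliding window of past loads to predict the next load, and the split is chronological. The function names, the window length, and the toy series are hypothetical, and the rule for detecting abnormal data is not stated in the text, so it is omitted here.

```python
def fill_missing_with_mean(series):
    """Step 1: replace missing readings (None) with the mean of observed ones."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def min_max_normalize(series):
    """Step 1: scale values to [0, 1] (min-max scaling is an assumption)."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # guard against a constant series
    return [(v - lo) / span for v in series]


def make_windows(series, window):
    """Step 2: features are `window` consecutive loads, the label is the next load."""
    count = len(series) - window
    features = [series[k:k + window] for k in range(count)]
    labels = [series[k + window] for k in range(count)]
    return features, labels


def chronological_split(features, labels, train_fraction):
    """Step 2: earlier samples form the training set, later ones the test set."""
    cut = int(len(features) * train_fraction)
    return (features[:cut], labels[:cut]), (features[cut:], labels[cut:])


def batches(features, labels, batch_size):
    """Step 2: yield consecutive (features, labels) batches."""
    for start in range(0, len(features), batch_size):
        stop = start + batch_size
        yield features[start:stop], labels[start:stop]


# Toy load series (made-up values) with one missing reading.
raw = [10.0, None, 20.0, 30.0, 20.0, 20.0]
filled = fill_missing_with_mean(raw)
scaled = min_max_normalize(filled)
X, y = make_windows(scaled, window=2)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y, train_fraction=0.75)
train_batches = list(batches(train_X, train_y, batch_size=2))
```

A chronological split is used in this sketch so that the test set contains only loads that come after the training period, which avoids leaking future values into training.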

The number of neural network layers mainly depends on the complexity of the learning target. In theory, increasing the number of layers improves the nonlinear fitting ability of the model, but it also increases model complexity and training time. When there are too many hidden layers, update iterations slow down, convergence and efficiency degrade, and accuracy does not improve, so the configuration that gives better results in less time should be chosen. Our experiments show that two CNN layers followed by two LSTM hidden layers balance prediction accuracy and learning efficiency well. Among the many gradient-based optimization algorithms, this paper adopts adaptive moment estimation (Adam), which dynamically adjusts the learning rate of each parameter so that the step taken in each iteration stays within a certain range and the parameter changes remain relatively stable. To evaluate the accuracy of the prediction model, the R^2 value, root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are selected as evaluation indexes.
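To make the behaviour of adaptive moment estimation concrete, the following minimal sketch applies the standard Adam update to a single scalar parameter. The hyperparameter defaults are the commonly cited ones and are assumptions here, since the text does not report the values used; the function name `adam_step` and the quadratic toy objective are likewise illustrative.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for m
    v_hat = v / (1.0 - beta2 ** t)                # bias correction for v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective (assumed): minimize (theta - 3)^2 starting from theta = 0.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

Because the update divides the averaged gradient by the root of the averaged squared gradient, the size of each step is roughly bounded by the learning rate regardless of the raw gradient scale, which is the stability property described above.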

The evaluation indexes are mathematically expressed as:

$$R^{2}=\frac{\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \tag{7}$$

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}} \tag{8}$$

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_{i}-y_{i}\right| \tag{9}$$

$$MAPE=\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_{i}-y_{i}}{y_{i}}\right| \tag{10}$$

where n is the number of samples; \(y_{i}\) and \(\hat{y}_{i}\) are the actual load and the predicted load at the i-th sampling point, respectively; and \(\bar{y}\) is the mean of the actual load. R^2 evaluates the goodness of fit of the regression model and gives a direct indication of how well the prediction model fits the data. MAE reflects the actual magnitude of the prediction error. RMSE, as a comprehensive index of error analysis, reflects the accuracy of the prediction. MAPE measures the prediction error relative to the actual load and reflects the robustness and stability of the model.
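The four indexes can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: Eq. (7) is read as the ratio of the explained sum of squares to the total sum of squares, and the residual form (one minus the residual sum of squares over the total sum of squares), which many libraries report instead, is returned alongside it for comparison; the two coincide only for least-squares fits with an intercept. The function name `regression_metrics` and the toy load values are illustrative, not results from this paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2 (two common forms), RMSE, MAE and MAPE for paired samples."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [p - a for a, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)      # total sum of squares
    ss_reg = sum((p - mean_true) ** 2 for p in y_pred)      # explained sum of squares
    return {
        "r2_ratio": ss_reg / ss_tot,            # assumed reading of Eq. (7)
        "r2_residual": 1.0 - ss_res / ss_tot,   # form reported by many libraries
        "rmse": math.sqrt(ss_res / n),          # Eq. (8)
        "mae": sum(abs(e) for e in errors) / n, # Eq. (9)
        # Eq. (10), in percent; requires every actual value to be non-zero.
        "mape": 100.0 / n * sum(abs(e / a) for e, a in zip(errors, y_true)),
    }


# Toy example with made-up loads.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(actual, predicted)
```

In this toy case every prediction is off by 10 units, so RMSE and MAE are both 10, while MAPE is larger for the small loads than for the large ones because it divides each error by the actual value.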