The prediction accuracy of LSTM for multiperiod hybrid water quality time series is not high. To improve the accuracy of LSTM in predicting water quality, a surface water quality prediction model based on EEMD-LSTM is proposed. The flowchart for the EEMD-LSTM prediction model is shown in Fig. 2.
The EEMD-LSTM consists of the following steps.
Step 1: Data preprocessing. The min-max normalization (MMN) method is used to normalize the original water quality series (Singh and Singh 2020). MMN can accelerate the speed of the gradient descent method to find the optimal solution and improve the accuracy of the prediction model. Then, the isolation forest algorithm is used to identify abnormal fluctuations, and the input and output samples are determined according to the selected sliding time window width.
Step 2: Series decomposition. After the preprocessing of the original water quality time series, EEMD is used to decompose the series into multiple components that contain high-frequency and low-frequency components. The high-frequency components mainly reflect the influence of sudden pollution and discontinuous nonpoint source pollution, and the low-frequency components mainly reflect the physicochemical properties and long-term trend of surface water quality.
Step 3: Period calculation. The fast Fourier transform (FFT) is a commonly used signal analysis method that reveals, in the frequency domain, periodic characteristics that cannot be extracted in the time domain (Guia et al. 2015). The components that have the greatest impact on water quality changes are identified according to their significant periods.
Step 4: Submodel training and aggregation. Independent LSTM submodels are developed for each decomposed component. When training the LSTM submodels, the mean squared error (MSE) on the training dataset is used as the loss function, and the Adam algorithm is used as the optimizer. Finally, the prediction results of the submodels are aggregated to obtain the final water quality prediction (a minimal end-to-end sketch of these steps is given below).
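To make the workflow concrete, the sketch below walks through Steps 1-4 for a single one-dimensional water quality series. It is a minimal illustration rather than the exact implementation used in this study: it assumes the third-party PyEMD and TensorFlow/Keras packages, uses illustrative hyperparameters (window width, layer size, trial count), and predicts in-sample for brevity.

```python
import numpy as np
from PyEMD import EEMD                      # assumed third-party EEMD implementation
from tensorflow import keras

def make_windows(x, window):
    """Step 1: build sliding-window input/output samples from a 1-D series."""
    X = np.array([x[i:i + window] for i in range(len(x) - window)])
    y = x[window:]
    return X[..., np.newaxis], y            # LSTM expects (samples, timesteps, features)

def eemd_lstm_forecast(series, window=12, trials=100, epochs=50):
    # Step 1: min-max normalization (Eq. 11)
    s_min, s_max = series.min(), series.max()
    norm = (series - s_min) / (s_max - s_min)

    # Step 2: EEMD decomposition into high- and low-frequency components
    components = EEMD(trials=trials).eemd(norm)

    # Step 3: dominant period of each component from the FFT amplitude spectrum
    for k, comp in enumerate(components):
        spectrum = np.abs(np.fft.rfft(comp - comp.mean()))
        freq = np.fft.rfftfreq(len(comp))
        period = 1.0 / freq[spectrum[1:].argmax() + 1]   # skip the zero-frequency bin
        print(f"component {k}: dominant period ~ {period:.1f} steps")

    # Step 4: one LSTM submodel per component, trained with MSE loss and Adam
    predictions = []
    for comp in components:
        X, y = make_windows(comp, window)
        model = keras.Sequential([
            keras.layers.LSTM(32, input_shape=(window, 1)),
            keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        model.fit(X, y, epochs=epochs, verbose=0)
        predictions.append(model.predict(X, verbose=0).ravel())

    # Aggregate the submodel outputs and invert the normalization
    return np.sum(predictions, axis=0) * (s_max - s_min) + s_min
```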
3.1 Ensemble empirical mode decomposition (EEMD)
Huang et al. (1998) proposed empirical mode decomposition (EMD), a new analysis and preprocessing method for nonlinear signals that is well suited to nonlinear and nonstationary time series. Each intrinsic mode function (IMF) extracted by EMD must satisfy the following two conditions at the same time: (1) the number of extrema and the number of zero crossings must be equal or differ by at most one; and (2) the mean of the upper and lower envelopes must be zero at every point, i.e., the envelopes must be locally symmetric along the time axis.
To solve the problem of mode mixing (i.e., decomposed IMFs that contain multiple frequencies), Zhaohua and Norden (2009) proposed the ensemble empirical mode decomposition (EEMD) method. EEMD exploits the uniform frequency distribution of white noise: Gaussian white noise is first added to the original signal so that signal components of different frequencies are projected onto the corresponding time scales, and the EMD process is then applied.
Given an original signal \(x\left(t\right)\), the specific process of EEMD is as follows:
(1) Add Gaussian white noise to the original signal,
$${x}^{i}\left(t\right)=x\left(t\right)+{n}^{i}\left(t\right)$$
1
where \({n}^{i}\left(t\right)\) is the Gaussian white noise added in the \(i\)th trial and \(i\) indexes the number of times Gaussian white noise is added.
(2) Decompose the mixed signal \({x}^{i}\left(t\right)\) by EMD into IMFs \({C}_{j}^{i}\left(t\right)\), (\(j\)= 1, 2, …, n) and residual \({r}^{i}\left(t\right)\).
$${x}^{i}\left(t\right)=\sum _{j=1}^{n}{C}_{j}^{i}\left(t\right)+{r}^{i}\left(t\right)$$
2
where \({C}_{j}^{i}\left(t\right)\) represents the \(j\)th IMF component obtained by decomposing the \(i\)th mixed signal.
(3) Repeat the above steps \(N\) times with different Gaussian white noise each time and find the corresponding IMFs.
(4) Average the corresponding IMFs over the \(N\) trials to eliminate the influence of the added white noise on the original signal.
$$\stackrel{-}{{C}_{j}\left(t\right)}=\frac{1}{N}\sum _{i=1}^{N}{C}_{j}^{i}\left(t\right)$$
3
where \(\stackrel{-}{{C}_{j}\left(t\right)}\) is the ensemble mean of the \(j\)th IMF over the \(N\) trials.
Finally, after being decomposed by EEMD, the original signal \(x\left(t\right)\) can be expressed as:
$$x\left(t\right)=\sum _{j=1}^{n}\stackrel{-}{{C}_{j}\left(t\right)}+r\left(t\right)$$
4
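As an illustration of Eqs. (1)-(4), the following minimal sketch carries out the ensemble loop explicitly around a single EMD routine (here the one provided by the third-party PyEMD package); the noise amplitude and number of trials are illustrative assumptions, and different trials may yield different numbers of IMFs, which the sketch handles by truncating to the common minimum.

```python
import numpy as np
from PyEMD import EMD  # assumed third-party package providing plain EMD

def eemd_manual(x, n_trials=100, noise_width=0.2, seed=0):
    """Minimal EEMD following Eqs. (1)-(4)."""
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_trials):
        noise = noise_width * np.std(x) * rng.standard_normal(len(x))
        imfs = EMD().emd(x + noise)           # Eqs. (1)-(2): EMD of the noise-added signal
        trials.append(imfs)
    k = min(t.shape[0] for t in trials)       # keep the IMF count common to all trials
    mean_imfs = np.stack([t[:k] for t in trials]).mean(axis=0)   # Eq. (3): ensemble mean
    residual = x - mean_imfs.sum(axis=0)      # Eq. (4): remainder after summing the means
    return mean_imfs, residual
```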
3.2 Long short-term memory (LSTM)
A long short-term memory (LSTM) network is an improved network structure proposed on the basis of RNNs that effectively overcomes the long-term dependence problem and gradient vanishing problem of RNNs (Hochreiter and Schmidhuber 1997). LSTM is suitable for processing and predicting events with long time intervals and delays in time series (An et al. 2020). LSTM introduces gates, which can selectively remove or add information. The LSTM cell mainly includes three gate structures: the forget gate, the input gate and the output gate (Hochreiter and Schmidhuber 1997). The forget gate discards the irrelevant state information of the previous time step. The input gate determines what information can enter the memory cell at the current time step. The output gate determines the output of the cell. The memory unit of LSTM uses these three gate structures to screen long-term and short-term memory information. The general architecture of the LSTM cell is shown in Fig. 3.
The key to LSTM is the transmission of the cell state, which controls the information passed into the network through the combination of three gates and determines the cell state. In Fig. 3, \({X}_{t}\) represents the input of the network at time \(t\), \({h}_{t}\) represents the output of the network at time \(t\), and \({C}_{t}\) represents the cell state at time \(t\).
$${f}_{t}=\sigma ({W}_{f}*\left[{h}_{t-1}, {X}_{t}\right]+{b}_{f})$$
5
The operator ‘*’ denotes the multiplication of a weight matrix with the concatenated vector \(\left[{h}_{t-1}, {X}_{t}\right]\) in Eqs. (5)-(8) and the elementwise multiplication of vectors in Eqs. (9) and (10).
$${i}_{t}=\sigma ({W}_{i}*\left[{h}_{t-1}, {X}_{t}\right]+{b}_{i})$$
6
$${o}_{t}=\sigma ({W}_{o}*\left[{h}_{t-1}, {X}_{t}\right]+{b}_{o})$$
7
$$\tilde{{C}_{t}}=\text{tanh}({W}_{c}*\left[{h}_{t-1}, {X}_{t}\right]+{b}_{c})$$
8
$${C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*\tilde{{C}_{t}}$$
9
where \(\sigma\) is the logistic sigmoid function (\(\sigma \left(x\right)=\frac{1}{1+{e}^{-x}}\)); \({W}_{f}\), \({W}_{i}\), \({W}_{o}\) and \({W}_{c}\) represent the weight matrices of the forget gate, the input gate, the output gate and the tanh layer (\(\text{tanh}\left(x\right)=\frac{1-{e}^{-2x}}{1+{e}^{-2x}}\)), respectively; \({b}_{f}\), \({b}_{i}\), \({b}_{o}\) and \({b}_{c}\) represent the corresponding bias vectors; \({f}_{t}\), \({i}_{t}\) and \({o}_{t}\) represent the outputs of the forget gate, the input gate and the output gate at time \(t\), respectively; and \(\tilde{{C}_{t}}\) is the candidate (update) vector for the cell state.
Finally, the output \({h}_{t}\) of the memory cell is obtained by passing the cell state through the hyperbolic tangent activation function tanh and scaling the result by the output gate:
$${h}_{t}={o}_{t}*\text{tanh}\left({C}_{t}\right)$$
10
These gating mechanisms give LSTM a good ability to capture long-term dependence in time series data while alleviating the vanishing gradient problem, which makes it well suited to processing and predicting time series.
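For clarity, the computation of Eqs. (5)-(10) for a single time step can be written out directly; the NumPy sketch below is only illustrative (in practice the cell is supplied by a deep learning framework), and the weight shapes are assumptions chosen to match the concatenated input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM time step following Eqs. (5)-(10).

    Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])      # concatenated vector [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)           # Eq. (5): forget gate
    i_t = sigmoid(W_i @ z + b_i)           # Eq. (6): input gate
    o_t = sigmoid(W_o @ z + b_o)           # Eq. (7): output gate
    c_tilde = np.tanh(W_c @ z + b_c)       # Eq. (8): candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # Eq. (9): elementwise cell state update
    h_t = o_t * np.tanh(c_t)               # Eq. (10): cell output
    return h_t, c_t
```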
3.3 Data preprocessing
3.3.1 Data normalization
Data normalization is an important data preprocessing step that can accelerate the convergence of the gradient descent method toward the optimal solution and improve the accuracy of the forecasting model. A large amount of unscaled data will slow the learning speed of the artificial neural network and the convergence speed of the model. Because LSTM is very sensitive to the scale and fluctuations of time series data when capturing their trends, the data need to be normalized before being fed to the neural network (ArunKumar et al. 2021). The original data are normalized using the min-max normalization (MMN) method, which linearly scales unnormalized data to predefined lower and upper bounds (Singh and Singh 2020). The equation is given as follows:
$${x}_{n}=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$
11
where \(x\) represents the original time series data, \({x}_{n}\) represents the normalized time series data, \({x}_{min}\) represents the minimum value of the time series data, and \({x}_{max}\) represents the maximum value of the time series data. The min-max normalization method scales the data between 0 and 1.
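Eq. (11) amounts to a single vectorized operation; the short NumPy sketch below (with an illustrative sample array) also shows the inverse transform needed to map predictions back to the original units.

```python
import numpy as np

x = np.array([3.2, 4.8, 2.1, 5.6, 4.0])            # illustrative raw series
x_norm = (x - x.min()) / (x.max() - x.min())        # Eq. (11): scaled to [0, 1]

# After prediction, invert the transform to recover the original units
x_back = x_norm * (x.max() - x.min()) + x.min()
```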
3.3.2 Outlier detection
The real-time monitoring data of water quality are usually unprocessed raw data. Weather factors such as strong wind and heavy rainfall may affect the results of real-time water quality monitoring, and problems such as abnormal monitoring equipment or manual input errors will lead to missing values, abnormal values or noise in the original data. Abnormal values will affect the accuracy of the model prediction. Certain methods are used to identify these outliers and deal with them. The characteristics of abnormal data are as follows: (1) they represent a small proportion of the sample data; and (2) they have significantly different properties compared with normal sample data.
Liu et al. (2008) proposed the isolation forest algorithm and applied it to data outlier detection. The isolation forest algorithm has linear time complexity and high accuracy, and as a tree-based ensemble method it meets the requirements of big data processing. Like any outlier detection method, it requires an anomaly score; the average search path length used to normalize that score is calculated as follows:
$$c\left(n\right)=2H\left(n-1\right)-\left(\frac{2\left(n-1\right)}{n}\right)$$
12
where \(n\) is the number of samples, \(H\left(i\right)\) is the harmonic number, which can be estimated by \(\text{ln}\left(i\right)+\xi\) (with \(\xi \approx 0.5772\), the Euler–Mascheroni constant), and \(c\left(n\right)\) is the average path length of an unsuccessful search in a binary search tree.
By normalizing the path length with \(c\left(n\right)\), a number between 0 and 1 can be obtained as the anomaly score of the detected sample. The anomaly score \(s\) of an instance \(x\) is defined as:
$$s(x, n)={2}^{-\frac{E\left(h\left(x\right)\right)}{c\left(n\right)}}$$
13
where \(h\left(x\right)\) represents the path length from the root node to the external node containing sample \(x\), and \(E\left(h\left(x\right)\right)\) is the average of the path lengths of sample \(x\) over all the isolation trees in the forest. The larger the anomaly score is, the more likely the sample point is to be an outlier. Based on the anomaly score \(s\), we can make the following assessments (Liu et al. 2008):
(1) If the anomaly score is very close to 1, then the data are definitely anomalies.
(2) If the anomaly score is much smaller than 0.5, then it is safe to regard the data as normal instances.
(3) If all the anomaly scores are approximately 0.5, then there are no distinct outliers in the sample.
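In practice the anomaly score of Eqs. (12)-(13) does not need to be coded by hand; the sketch below pairs a direct transcription of the two equations with an off-the-shelf detector from scikit-learn (the sample values and contamination level are illustrative assumptions; note that scikit-learn reports scores with the opposite sign convention, lower meaning more anomalous, and flags outliers with a predicted label of -1).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def c(n):
    """Eq. (12): average unsuccessful-search path length, with H(i) ~ ln(i) + 0.5772."""
    return 2.0 * (np.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Eq. (13): scores close to 1 indicate likely outliers."""
    return 2.0 ** (-mean_path_length / c(n))

# Illustrative 1-D water quality series, reshaped to (n_samples, n_features)
values = np.array([7.1, 7.0, 7.2, 6.9, 7.1, 12.5, 7.0, 7.2, 0.3, 7.1]).reshape(-1, 1)
iso = IsolationForest(n_estimators=100, contamination=0.2, random_state=0)
labels = iso.fit_predict(values)                 # -1 marks outliers, 1 marks normal samples
print("detected outliers:", values[labels == -1].ravel())
```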
3.4 Performance evaluation
To objectively and comprehensively evaluate the prediction performance of each model, four evaluation indicators are selected: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and the coefficient of determination (R2). RMSE is sensitive to large errors in the experimental data. MAE is the average of the absolute errors and truly reflects the magnitude of the model's prediction error. MAPE expresses the mean absolute error as a percentage of the true value. The smaller the RMSE, MAE and MAPE are, the more accurate the prediction result and the better the model. The coefficient of determination R2 lies between 0 and 1; the closer it is to 1, the better the model fits the data. Generally, if the coefficient of determination exceeds 0.8, the model is considered to have a high goodness of fit. The indicators are calculated as follows:
$$RMSE=\sqrt{\frac{1}{N}\sum _{i=1}^{N}{({y}_{i}-{y}_{i}^{*})}^{2}}$$
14
$$MAE=\frac{1}{N}\sum _{i=1}^{N}\left|{y}_{i}-{y}_{i}^{*}\right|$$
15
$$MAPE=\frac{1}{N}\sum _{i=1}^{N}\left|\frac{{y}_{i}-{y}_{i}^{*}}{{y}_{i}}\right|\times 100\%$$
16
$${R}^{2}=1-\frac{\sum _{i=1}^{N}{({y}_{i}-{y}_{i}^{*})}^{2}}{\sum _{i=1}^{N}{({y}_{i}-\stackrel{-}{{y}_{i}})}^{2}}$$
17
where \(N\) is the number of samples, \({y}_{i}\) is the measured value, \({y}_{i}^{*}\) is the predicted value, and \(\stackrel{-}{{y}_{i}}\) is the average value of the measured data.
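The four indicators of Eqs. (14)-(17) can be computed in a few lines; the NumPy sketch below uses illustrative sample arrays and assumes no measured value is zero (which would make MAPE undefined).

```python
import numpy as np

def evaluate(y_true, y_pred):
    """RMSE, MAE, MAPE and R2 per Eqs. (14)-(17)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0             # assumes y_true has no zeros
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, mae, mape, r2

print(evaluate([7.1, 6.8, 7.4, 7.0], [7.0, 6.9, 7.3, 7.2]))
```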